public class OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateRequestProcessorFactory implements SolrCoreAware
Extracts named entities using an OpenNLP NER modelFile from the values found in any matching source field into a configured dest field, after first tokenizing the source text using the index analyzer on the configured analyzerFieldType, which must include solr.OpenNLPTokenizerFactory as the tokenizer. E.g.:
    <fieldType name="opennlp-en-tokenization" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
                   sentenceModel="en-sent.bin"
                   tokenizerModel="en-tokenizer.bin"/>
      </analyzer>
    </fieldType>
See the OpenNLP website for information on downloading pre-trained models.
Note that in order to use model files larger than 1MB on SolrCloud, ZooKeeper server and client configuration is required.
The source field(s) can be configured as a <str>, an <arr> of <str>, or a <lst> containing FieldMutatingUpdateProcessorFactory style selector arguments.
The dest field can be a single <str> containing the literal name of a destination field, or it may be a <lst> specifying a regex pattern and a replacement string. If the pattern + replacement option is used, the pattern will be matched against all fields matched by the source selector, and the replacement string (including any capture groups specified from the pattern) will be evaluated using Matcher.replaceAll(String) to generate the literal name of the destination field. Additionally, an occurrence of the string "{EntityType}" in the dest field specification, or in the replacement string, will be replaced with the entity type(s) returned for each entity by the OpenNLP NER model; as a result, if the model extracts more than one entity type, then more than one dest field will be populated.
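The name resolution described above can be sketched in plain Java. This is an illustration only, not Solr's implementation: the class `DestFieldNameSketch` and its methods are hypothetical, and the order of operations (regex replacement first, then a literal substitution of the placeholder) is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; it is not part of Solr. */
final class DestFieldNameSketch {

    /** The placeholder text described above. */
    static final String ENTITY_TYPE = "{EntityType}";

    /**
     * Resolves a destination field name from a pattern + replacement pair.
     * Returns null when the pattern does not match the source field name.
     */
    static String resolve(Pattern pattern, String replacement,
                          String sourceField, String entityType) {
        Matcher matcher = pattern.matcher(sourceField);
        if (!matcher.find()) {
            return null; // this source field yields no destination field
        }
        // replaceAll resets the matcher, then expands $1, $2, ... from the capture groups.
        String resolved = matcher.replaceAll(replacement);
        // Assumption: the placeholder is substituted literally, after the regex replacement.
        return resolved.replace(ENTITY_TYPE, entityType);
    }

    /** Resolves a literal dest specification that may contain the placeholder. */
    static String resolveLiteral(String dest, String entityType) {
        return dest.replace(ENTITY_TYPE, entityType);
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("^desc(.*)s$");
        System.out.println(resolve(pattern, "key_desc$1_people", "descriptions", "person"));
        System.out.println(resolveLiteral("summary_{EntityType}_s", "person"));
    }
}
```

With a model that returned both "person" and "location" entities, calling `resolveLiteral` once per entity type would yield two different field names, which is why more than one dest field can end up populated.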
If the resolved dest field already exists in the document, then the named entities extracted from the source fields will be added to it.
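The append behaviour can be pictured with a plain map standing in for the document. This is a minimal sketch under that assumption; `AppendToDestSketch` is a hypothetical name, and the processor itself works on Solr input documents rather than on a `Map`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a document: field name -> list of values. */
final class AppendToDestSketch {

    /** Adds the extracted entities to destField, keeping any values already present. */
    static void addEntities(Map<String, List<String>> doc,
                            String destField, List<String> entities) {
        // Create the field on first use; otherwise append to the existing values.
        doc.computeIfAbsent(destField, name -> new ArrayList<>()).addAll(entities);
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("people_s", new ArrayList<>(Arrays.asList("Ada Lovelace")));
        addEntities(doc, "people_s", Arrays.asList("Alan Turing"));
        System.out.println(doc);
    }
}
```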
In the example below:

- Named entities will be extracted from the text field and added to the people_s field
- Named entities will be extracted from both the title and subtitle fields and added into the titular_people field
- Named entities will be extracted from any field with a name ending in _txt -- except for notes_txt -- and added into the people_s field
- Named entities will be extracted from any field with a name beginning with "desc" and ending in "s" (e.g. "descs" and "descriptions") and added to a field named by prepending "key_", dropping the trailing "s", and appending "_people" (e.g. "key_desc_people" or "key_description_people")
- Named entities will be extracted from the summary field and added to the summary_person_s field, assuming that the modelFile only extracts entities of type "person".

    <updateRequestProcessorChain name="multiple-extract">
      <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">en-test-ner-person.bin</str>
        <str name="analyzerFieldType">opennlp-en-tokenization</str>
        <str name="source">text</str>
        <str name="dest">people_s</str>
      </processor>
      <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">en-test-ner-person.bin</str>
        <str name="analyzerFieldType">opennlp-en-tokenization</str>
        <arr name="source">
          <str>title</str>
          <str>subtitle</str>
        </arr>
        <str name="dest">titular_people</str>
      </processor>
      <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">en-test-ner-person.bin</str>
        <str name="analyzerFieldType">opennlp-en-tokenization</str>
        <lst name="source">
          <str name="fieldRegex">.*_txt$</str>
          <lst name="exclude">
            <str name="fieldName">notes_txt</str>
          </lst>
        </lst>
        <str name="dest">people_s</str>
      </processor>
      <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">en-test-ner-person.bin</str>
        <str name="analyzerFieldType">opennlp-en-tokenization</str>
        <lst name="source">
          <str name="fieldRegex">^desc(.*)s$</str>
        </lst>
        <lst name="dest">
          <str name="pattern">^desc(.*)s$</str>
          <str name="replacement">key_desc$1_people</str>
        </lst>
      </processor>
      <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">en-test-ner-person.bin</str>
        <str name="analyzerFieldType">opennlp-en-tokenization</str>
        <str name="source">summary</str>
        <str name="dest">summary_{EntityType}_s</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
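To make the selector-style source configuration in the example concrete (a fieldRegex of `.*_txt$` with notes_txt excluded), the following sketch filters a list of field names in plain Java. It is not the FieldMutatingUpdateProcessorFactory selector itself: `SourceSelectorSketch` is a hypothetical class, and the use of `Matcher.find()` for the regex test is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical illustration of a fieldRegex + exclude selector; not Solr code. */
final class SourceSelectorSketch {

    /** Builds a predicate that accepts names matching fieldRegex unless explicitly excluded. */
    static Predicate<String> selector(String fieldRegex, List<String> excludedNames) {
        Pattern pattern = Pattern.compile(fieldRegex);
        // Assumption: a field is selected when the regex is found anywhere in its name.
        return name -> pattern.matcher(name).find() && !excludedNames.contains(name);
    }

    /** Returns the field names accepted by the selector, in their original order. */
    static List<String> select(List<String> fieldNames, Predicate<String> selector) {
        return fieldNames.stream().filter(selector).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("notes_txt", "body_txt", "title", "abstract_txt");
        Predicate<String> sel = selector(".*_txt$", Arrays.asList("notes_txt"));
        System.out.println(select(fields, sel));
    }
}
```

Here body_txt and abstract_txt are selected, title fails the regex, and notes_txt matches the regex but is removed by the exclusion.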
Nested classes/interfaces inherited from class UpdateRequestProcessorFactory: UpdateRequestProcessorFactory.RunAlways
Modifier and Type | Field and Description |
---|---|
static String | ANALYZER_FIELD_TYPE_PARAM |
static String | DEST_PARAM |
static String | ENTITY_TYPE |
static String | MODEL_PARAM |
static String | PATTERN_PARAM |
static String | REPLACEMENT_PARAM |
static String | SOURCE_PARAM |
Constructor and Description |
---|
OpenNLPExtractNamedEntitiesUpdateProcessorFactory() |
Modifier and Type | Method and Description |
---|---|
UpdateRequestProcessor | getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) |
protected FieldMutatingUpdateProcessor.FieldNameSelector | getSourceSelector() |
void | inform(SolrCore core) |
void | init(NamedList args) |
public static final String SOURCE_PARAM
public static final String DEST_PARAM
public static final String PATTERN_PARAM
public static final String REPLACEMENT_PARAM
public static final String MODEL_PARAM
public static final String ANALYZER_FIELD_TYPE_PARAM
public static final String ENTITY_TYPE
public OpenNLPExtractNamedEntitiesUpdateProcessorFactory()
protected final FieldMutatingUpdateProcessor.FieldNameSelector getSourceSelector()
public void init(NamedList args)
Specified by: init in interface NamedListInitializedPlugin
Overrides: init in class UpdateRequestProcessorFactory
public void inform(SolrCore core)
Specified by: inform in interface SolrCoreAware
public final UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
Specified by: getInstance in class UpdateRequestProcessorFactory
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.