public class OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateRequestProcessorFactory implements SolrCoreAware
Extracts named entities using an OpenNLP NER modelFile from the values found in
any matching source field into a configured dest field, after
first tokenizing the source text using the index analyzer on the configured
analyzerFieldType, which must include solr.OpenNLPTokenizerFactory
as the tokenizer. E.g.:
<fieldType name="opennlp-en-tokenization" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="en-sent.bin"
               tokenizerModel="en-tokenizer.bin"/>
  </analyzer>
</fieldType>
See the OpenNLP website for information on downloading pre-trained models.
Note that in order to use model files larger than 1MB on SolrCloud, ZooKeeper server and client configuration is required.
The source field(s) can be configured as either:
- One or more <str>
- An <arr> of <str>
- A <lst> containing FieldMutatingUpdateProcessorFactory style selector arguments

The dest field can be a single <str>
containing the literal name of a destination field, or it may be a <lst> specifying a
regex pattern and a replacement string. If the pattern + replacement option
is used the pattern will be matched against all fields matched by the source selector, and the replacement
string (including any capture groups specified from the pattern) will be evaluated using
Matcher.replaceAll(String) to generate the literal name of the destination field. Additionally,
an occurrence of the string "{EntityType}" in the dest field specification, or in the
replacement string, will be replaced with the entity type(s) returned for each entity by
the OpenNLP NER model; as a result, if the model extracts more than one entity type, then more than one
dest field will be populated.
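The dest field resolution described above can be sketched with plain java.util.regex. This is a minimal illustration, not the processor's actual code: the class and method names (DestFieldNameSketch, resolveFromPattern, applyEntityType) are hypothetical, the "location" entity type is made up for the example, and the order in which the pattern replacement and the "{EntityType}" substitution are applied is an assumption.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Solr.
public class DestFieldNameSketch {

    // The placeholder string named in the description above.
    static final String ENTITY_TYPE_PLACEHOLDER = "{EntityType}";

    // Pattern + replacement option: returns the dest field name for a source
    // field, or empty if the pattern does not match that field name.
    public static Optional<String> resolveFromPattern(String sourceField,
                                                      String pattern,
                                                      String replacement) {
        Matcher matcher = Pattern.compile(pattern).matcher(sourceField);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // replaceAll resets the matcher, then expands capture groups such as $1.
        return Optional.of(matcher.replaceAll(replacement));
    }

    // Substitutes the entity type reported for one extracted entity.
    public static String applyEntityType(String destField, String entityType) {
        return destField.replace(ENTITY_TYPE_PLACEHOLDER, entityType);
    }

    public static void main(String[] args) {
        String pattern = "^desc(.*)s$";
        String replacement = "key_desc$1_people";

        // prints "key_description_people"
        System.out.println(resolveFromPattern("descriptions", pattern, replacement).orElse("(no match)"));
        // prints "key_desc_people"
        System.out.println(resolveFromPattern("descs", pattern, replacement).orElse("(no match)"));
        // prints "(no match)"
        System.out.println(resolveFromPattern("title", pattern, replacement).orElse("(no match)"));

        // One dest field per entity type: prints "summary_person_s", then "summary_location_s"
        for (String type : new String[] {"person", "location"}) {
            System.out.println(applyEntityType("summary_{EntityType}_s", type));
        }
    }
}
```

The last loop shows why a model that returns several entity types populates several dest fields: each distinct type yields a distinct resolved field name.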
If the resolved dest field already exists in the document, then the
named entities extracted from the source fields will be added to it.
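The append behavior can be pictured with a simple stand-in for a document. The sketch below models a document as a map from field name to a list of values; that model, the class name AppendToDestSketch, and the sample names are assumptions made for illustration, not Solr's document API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document; not Solr's API.
public class AppendToDestSketch {

    // Adds extracted entities to the dest field, keeping any values already there.
    public static void addEntities(Map<String, List<String>> doc,
                                   String destField,
                                   List<String> entities) {
        doc.computeIfAbsent(destField, name -> new ArrayList<>()).addAll(entities);
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        // The dest field already holds a value before extraction runs.
        doc.put("people_s", new ArrayList<>(Arrays.asList("Ada Lovelace")));

        addEntities(doc, "people_s", Arrays.asList("Alan Turing"));

        // prints "[Ada Lovelace, Alan Turing]"
        System.out.println(doc.get("people_s"));
    }
}
```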
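In the example below:
- Named entities will be extracted from the text field and added to the names_ss field
- Named entities will be extracted from both the title and subtitle fields and added into the titular_people field
- Named entities will be extracted from any field with a name ending in _txt -- except for notes_txt -- and added into the people_ss field
- Named entities will be extracted from any field with a name beginning with "desc" and ending in "s" (e.g. "descs" and "descriptions") and added to a field prefixed with "key_", not ending in "s", and suffixed with "_people" (e.g. "key_desc_people" or "key_description_people")
- Named entities will be extracted from the summary field and added to the summary_person_ss field, assuming that the modelFile only extracts entities of type "person".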
<updateRequestProcessorChain name="multiple-extract">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <str name="source">text</str>
    <str name="dest">people_s</str>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <arr name="source">
      <str>title</str>
      <str>subtitle</str>
    </arr>
    <str name="dest">titular_people</str>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <lst name="source">
      <str name="fieldRegex">.*_txt$</str>
      <lst name="exclude">
        <str name="fieldName">notes_txt</str>
      </lst>
    </lst>
    <str name="dest">people_s</str>
  </processor>
  <processor class="solr.processor.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <lst name="source">
      <str name="fieldRegex">^desc(.*)s$</str>
    </lst>
    <lst name="dest">
      <str name="pattern">^desc(.*)s$</str>
      <str name="replacement">key_desc$1_people</str>
    </lst>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <str name="source">summary</str>
    <str name="dest">summary_{EntityType}_s</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
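To make the third processor's source selector concrete (fieldRegex ".*_txt$" with notes_txt excluded), the sketch below filters a list of field names the same way. It is an approximation under stated assumptions: the class SourceSelectorSketch and its select method are hypothetical, the sample field names are invented, and applying the regex with Matcher.find() is an assumption about how the selector matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical approximation of a source field selector; not Solr's implementation.
public class SourceSelectorSketch {

    // Keeps field names that match fieldRegex and are not listed in excludedNames.
    public static List<String> select(List<String> fieldNames,
                                      String fieldRegex,
                                      Set<String> excludedNames) {
        Pattern pattern = Pattern.compile(fieldRegex);
        List<String> selected = new ArrayList<>();
        for (String name : fieldNames) {
            boolean included = pattern.matcher(name).find(); // assumed matching mode
            boolean excluded = excludedNames.contains(name);
            if (included && !excluded) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Invented field names for one incoming document.
        List<String> fields = Arrays.asList("id", "body_txt", "notes_txt", "abstract_txt", "title");
        Set<String> exclude = new HashSet<>(Arrays.asList("notes_txt"));

        // prints "[body_txt, abstract_txt]"
        System.out.println(select(fields, ".*_txt$", exclude));
    }
}
```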
Nested classes/interfaces inherited from class UpdateRequestProcessorFactory: UpdateRequestProcessorFactory.RunAlways

| Modifier and Type | Field and Description |
|---|---|
| static String | ANALYZER_FIELD_TYPE_PARAM |
| static String | DEST_PARAM |
| static String | ENTITY_TYPE |
| static String | MODEL_PARAM |
| static String | PATTERN_PARAM |
| static String | REPLACEMENT_PARAM |
| static String | SOURCE_PARAM |

| Constructor and Description |
|---|
| OpenNLPExtractNamedEntitiesUpdateProcessorFactory() |

| Modifier and Type | Method and Description |
|---|---|
| UpdateRequestProcessor | getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) |
| protected FieldMutatingUpdateProcessor.FieldNameSelector | getSourceSelector() |
| void | inform(SolrCore core) |
| void | init(NamedList args) |

public static final String SOURCE_PARAM
public static final String DEST_PARAM
public static final String PATTERN_PARAM
public static final String REPLACEMENT_PARAM
public static final String MODEL_PARAM
public static final String ANALYZER_FIELD_TYPE_PARAM
public static final String ENTITY_TYPE
public OpenNLPExtractNamedEntitiesUpdateProcessorFactory()
protected final FieldMutatingUpdateProcessor.FieldNameSelector getSourceSelector()
public void init(NamedList args)
Specified by: init in interface NamedListInitializedPlugin
Overrides: init in class UpdateRequestProcessorFactory
public void inform(SolrCore core)
Specified by: inform in interface SolrCoreAware
public final UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
Specified by: getInstance in class UpdateRequestProcessorFactory

Copyright © 2000-2020 Apache Software Foundation. All Rights Reserved.