Class OpenNLPExtractNamedEntitiesUpdateProcessorFactory
- All Implemented Interfaces:
NamedListInitializedPlugin, SolrCoreAware
Extracts named entities using an OpenNLP NER modelFile from the values found in any
matching source field into a configured dest field, after first
tokenizing the source text using the index analyzer on the configured analyzerFieldType,
which must include solr.OpenNLPTokenizerFactory as the tokenizer. E.g.:
<fieldType name="opennlp-en-tokenization" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="en-sent.bin"
               tokenizerModel="en-tokenizer.bin"/>
  </analyzer>
</fieldType>
See the OpenNLP website for information on downloading pre-trained models. Note that in order to use model files larger than 1 MB on SolrCloud, ZooKeeper server and client configuration is required.
The source field(s) can be configured as either:
- One or more <str>
- An <arr> of <str>
- A <lst> containing FieldMutatingUpdateProcessorFactory style selector arguments
The dest field can be a single <str> containing the literal
name of a destination field, or it may be a <lst> specifying a regex
pattern and a replacement string. If the pattern + replacement option is used
the pattern will be matched against all fields matched by the source selector, and the
replacement string (including any capture groups specified from the pattern) will be evaluated
using Matcher.replaceAll(String) to generate the literal name of the destination field.
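The pattern + replacement option can be illustrated with plain java.util.regex calls. The sketch below is not the factory's actual code: the class and method names (DestFieldNameSketch, resolveDest) are hypothetical, and the pattern and replacement values are the ones used in the configuration example on this page. It only shows how Matcher.replaceAll(String) turns a source field name into a destination field name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Solr.
final class DestFieldNameSketch {

    // Applies the dest pattern to a source field name and evaluates the
    // replacement string (capture groups included) with Matcher.replaceAll.
    static String resolveDest(String sourceField, String pattern, String replacement) {
        Matcher matcher = Pattern.compile(pattern).matcher(sourceField);
        return matcher.replaceAll(replacement);
    }

    public static void main(String[] args) {
        String pattern = "^desc(.*)s$";
        String replacement = "key_desc$1_people";
        // Group 1 captures "ription" here.
        System.out.println(resolveDest("descriptions", pattern, replacement)); // key_description_people
        // Group 1 is empty here.
        System.out.println(resolveDest("descs", pattern, replacement));        // key_desc_people
    }
}
```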
Additionally, an occurrence of the string "{EntityType}" in the dest field
specification, or in the replacement string, will be replaced with the entity
type(s) returned for each entity by the OpenNLP NER model; as a result, if the model extracts
more than one entity type, then more than one dest field will be populated.
If the resolved dest field already exists in the document, then the named
entities extracted from the source fields will be added to it.
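The "{EntityType}" substitution and the append-to-existing-field behavior can be sketched with a plain map standing in for a Solr document. This is an illustration under assumptions, not the processor's implementation: the class EntityTypeDestSketch and its methods are hypothetical, the entity types "person" and "location" are example values, and a Map of lists is used instead of SolrInputDocument to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the dest field resolution; not part of Solr.
final class EntityTypeDestSketch {

    // Placeholder string described in the text above.
    static final String ENTITY_TYPE = "{EntityType}";

    // Substitutes one entity type into a dest field specification.
    static String resolveDest(String destTemplate, String entityType) {
        return destTemplate.replace(ENTITY_TYPE, entityType);
    }

    // Adds an entity to the resolved dest field, creating the field if it
    // is absent and appending to it if it already exists.
    static void addEntity(Map<String, List<String>> doc, String destTemplate,
                          String entityType, String entityText) {
        doc.computeIfAbsent(resolveDest(destTemplate, entityType), k -> new ArrayList<>())
           .add(entityText);
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        // The dest field already exists in the document.
        doc.put("summary_person_s", new ArrayList<>(List.of("Ada Lovelace")));

        addEntity(doc, "summary_{EntityType}_s", "person", "Alan Turing");
        addEntity(doc, "summary_{EntityType}_s", "location", "London");

        // Two entity types populate two dest fields.
        System.out.println(doc);
    }
}
```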
In the example below:
- Named entities will be extracted from the text field and added to the people_s field
- Named entities will be extracted from both the title and subtitle fields and added into the titular_people field
- Named entities will be extracted from any field with a name ending in _txt (except for notes_txt) and added into the people_s field
- Named entities will be extracted from any field with a name beginning with "desc" and ending in "s" (e.g. "descs" and "descriptions") and added to a field prefixed with "key_", not ending in "s", and suffixed with "_people" (e.g. "key_desc_people" or "key_description_people")
- Named entities will be extracted from the summary field and added to the summary_person_s field, assuming that the modelFile only extracts entities of type "person".
<updateRequestProcessorChain name="multiple-extract">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <str name="source">text</str>
    <str name="dest">people_s</str>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <arr name="source">
      <str>title</str>
      <str>subtitle</str>
    </arr>
    <str name="dest">titular_people</str>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <lst name="source">
      <str name="fieldRegex">.*_txt$</str>
      <lst name="exclude">
        <str name="fieldName">notes_txt</str>
      </lst>
    </lst>
    <str name="dest">people_s</str>
  </processor>
  <processor class="solr.processor.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <lst name="source">
      <str name="fieldRegex">^desc(.*)s$</str>
    </lst>
    <lst name="dest">
      <str name="pattern">^desc(.*)s$</str>
      <str name="replacement">key_desc$1_people</str>
    </lst>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <str name="source">summary</str>
    <str name="dest">summary_{EntityType}_s</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
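The include/exclude behavior of the third processor's source selector can be approximated in a few lines of Java. This is a sketch under assumptions, not Solr's selector code: SourceSelectorSketch and its methods are hypothetical names, the field names in the usage are made up, and the regex is applied with Matcher.find(), which is an assumption about how fieldRegex is evaluated (for ".*_txt$" it gives the same result as a full match).

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical approximation of a fieldRegex selector with one excluded field name.
final class SourceSelectorSketch {

    // Selects field names that match fieldRegex, except the excluded name.
    static Predicate<String> selector(String fieldRegex, String excludedFieldName) {
        Pattern include = Pattern.compile(fieldRegex);
        return name -> include.matcher(name).find() && !name.equals(excludedFieldName);
    }

    // Returns the subset of field names accepted by the selector, in input order.
    static List<String> select(List<String> fieldNames, Predicate<String> selector) {
        return fieldNames.stream().filter(selector).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Predicate<String> source = selector(".*_txt$", "notes_txt");
        List<String> fields = List.of("body_txt", "notes_txt", "title", "abstract_txt");
        System.out.println(select(fields, source)); // [body_txt, abstract_txt]
    }
}
```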
- Since:
- 7.3.0
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.solr.update.processor.UpdateRequestProcessorFactory:
UpdateRequestProcessorFactory.RunAlways
Method Summary
Modifier and Type, Method:
- final UpdateRequestProcessor: getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
- protected final FieldMutatingUpdateProcessor.FieldNameSelector: getSourceSelector()
- void: inform(SolrCore core)
- void: init(org.apache.solr.common.util.NamedList<?> args)
Field Details
- SOURCE_PARAM
- DEST_PARAM
- PATTERN_PARAM
- REPLACEMENT_PARAM
- MODEL_PARAM
- ANALYZER_FIELD_TYPE_PARAM
- ENTITY_TYPE
Constructor Details
OpenNLPExtractNamedEntitiesUpdateProcessorFactory
public OpenNLPExtractNamedEntitiesUpdateProcessorFactory()
Method Details
getSourceSelector
protected final FieldMutatingUpdateProcessor.FieldNameSelector getSourceSelector()
init
public void init(org.apache.solr.common.util.NamedList<?> args)
- Specified by:
init in interface NamedListInitializedPlugin
inform
public void inform(SolrCore core)
- Specified by:
inform in interface SolrCoreAware
getInstance
public final UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
- Specified by:
getInstance in class UpdateRequestProcessorFactory