Class OpenNLPExtractNamedEntitiesUpdateProcessorFactory
java.lang.Object
  org.apache.solr.update.processor.UpdateRequestProcessorFactory
    org.apache.solr.update.processor.OpenNLPExtractNamedEntitiesUpdateProcessorFactory

All Implemented Interfaces:
NamedListInitializedPlugin, SolrCoreAware
public class OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateRequestProcessorFactory implements SolrCoreAware
Extracts named entities using an OpenNLP NER modelFile from the values found in any matching source field into a configured dest field, after first tokenizing the source text using the index analyzer on the configured analyzerFieldType, which must include solr.OpenNLPTokenizerFactory as the tokenizer. E.g.:

  <fieldType name="opennlp-en-tokenization" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.OpenNLPTokenizerFactory"
                 sentenceModel="en-sent.bin"
                 tokenizerModel="en-tokenizer.bin"/>
    </analyzer>
  </fieldType>

See the OpenNLP website for information on downloading pre-trained models. Note that in order to use model files larger than 1MB on SolrCloud, ZooKeeper server and client configuration is required.
The source field(s) can be configured as either:
- One or more <str>
- An <arr> of <str>
- A <lst> containing FieldMutatingUpdateProcessorFactory style selector arguments
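To make the <lst> selector form more concrete, the following is a minimal sketch of what a selector with a fieldRegex and an exclude list of field names expresses. It is not Solr's implementation: the class SourceSelectorSketch and the method select are hypothetical names introduced here, and the use of Matcher.find() for regex matching is an assumption about how such selectors treat the pattern.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SourceSelectorSketch {

    // Hypothetical helper: keep the field names that the regex matches,
    // then drop any name listed in the exclusions.
    static List<String> select(List<String> fieldNames, String fieldRegex, Set<String> excludedNames) {
        Pattern pattern = Pattern.compile(fieldRegex);
        return fieldNames.stream()
                .filter(name -> pattern.matcher(name).find())   // assumption: find(), not matches()
                .filter(name -> !excludedNames.contains(name))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = List.of("title", "body_txt", "notes_txt", "abstract_txt");
        // Select every field ending in _txt, except notes_txt.
        System.out.println(select(names, ".*_txt$", Set.of("notes_txt")));
        // prints [body_txt, abstract_txt]
    }
}
```

The sketch only covers name-based selection; the real selector arguments may also match on field type or class, which is outside what this example models.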
The dest field can be a single <str> containing the literal name of a destination field, or it may be a <lst> specifying a regex pattern and a replacement string. If the pattern + replacement option is used, the pattern will be matched against all fields matched by the source selector, and the replacement string (including any capture groups specified from the pattern) will be evaluated using Matcher.replaceAll(String) to generate the literal name of the destination field. Additionally, an occurrence of the string "{EntityType}" in the dest field specification, or in the replacement string, will be replaced with the entity type(s) returned for each entity by the OpenNLP NER model; as a result, if the model extracts more than one entity type, then more than one dest field will be populated.

If the resolved dest field already exists in the document, then the named entities extracted from the source fields will be added to it.

In the example below:
- Named entities will be extracted from the text field and added to the people_s field
- Named entities will be extracted from both the title and subtitle fields and added into the titular_people field
- Named entities will be extracted from any field with a name ending in _txt (except for notes_txt) and added into the people_s field
- Named entities will be extracted from any field with a name beginning with "desc" and ending in "s" (e.g. "descs" and "descriptions") and added to a field whose name is "key_", followed by the source field name without its trailing "s", followed by "_people" (e.g. "key_desc_people" or "key_description_people")
- Named entities will be extracted from the summary field and added to the summary_person_s field, assuming that the modelFile only extracts entities of type "person".

  <updateRequestProcessorChain name="multiple-extract">
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
      <str name="modelFile">en-test-ner-person.bin</str>
      <str name="analyzerFieldType">opennlp-en-tokenization</str>
      <str name="source">text</str>
      <str name="dest">people_s</str>
    </processor>
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
      <str name="modelFile">en-test-ner-person.bin</str>
      <str name="analyzerFieldType">opennlp-en-tokenization</str>
      <arr name="source">
        <str>title</str>
        <str>subtitle</str>
      </arr>
      <str name="dest">titular_people</str>
    </processor>
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
      <str name="modelFile">en-test-ner-person.bin</str>
      <str name="analyzerFieldType">opennlp-en-tokenization</str>
      <lst name="source">
        <str name="fieldRegex">.*_txt$</str>
        <lst name="exclude">
          <str name="fieldName">notes_txt</str>
        </lst>
      </lst>
      <str name="dest">people_s</str>
    </processor>
    <processor class="solr.processor.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
      <str name="modelFile">en-test-ner-person.bin</str>
      <str name="analyzerFieldType">opennlp-en-tokenization</str>
      <lst name="source">
        <str name="fieldRegex">^desc(.*)s$</str>
      </lst>
      <lst name="dest">
        <str name="pattern">^desc(.*)s$</str>
        <str name="replacement">key_desc$1_people</str>
      </lst>
    </processor>
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
      <str name="modelFile">en-test-ner-person.bin</str>
      <str name="analyzerFieldType">opennlp-en-tokenization</str>
      <str name="source">summary</str>
      <str name="dest">summary_{EntityType}_s</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Since:
7.3.0
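
The destination name resolution described in the class documentation (pattern plus replacement evaluated with Matcher.replaceAll(String), then "{EntityType}" substitution) can be illustrated with plain java.util.regex calls. This is a sketch under stated assumptions, not the processor's actual code: the class DestFieldNameSketch and the method resolveDest are hypothetical, and the order in which the two substitutions are applied is assumed for illustration.

```java
import java.util.regex.Pattern;

public class DestFieldNameSketch {

    // The placeholder string named in the class documentation.
    static final String ENTITY_TYPE_PLACEHOLDER = "{EntityType}";

    // Hypothetical helper. When pattern is null, dest is treated as a literal
    // field name; otherwise dest is the replacement string applied to the
    // source field name. The entity type is substituted last (an assumption).
    static String resolveDest(String sourceField, Pattern pattern, String dest, String entityType) {
        String name = dest;
        if (pattern != null) {
            name = pattern.matcher(sourceField).replaceAll(dest);
        }
        return name.replace(ENTITY_TYPE_PLACEHOLDER, entityType);
    }

    public static void main(String[] args) {
        Pattern descPattern = Pattern.compile("^desc(.*)s$");

        // Pattern + replacement: capture group $1 carries the middle of the source name.
        System.out.println(resolveDest("descs", descPattern, "key_desc$1_people", "person"));
        // prints key_desc_people
        System.out.println(resolveDest("descriptions", descPattern, "key_desc$1_people", "person"));
        // prints key_description_people

        // Literal dest containing the entity type placeholder.
        System.out.println(resolveDest("summary", null, "summary_{EntityType}_s", "person"));
        // prints summary_person_s
    }
}
```

With a model that returns several entity types, calling the helper once per type would yield one destination name per type, which matches the documented behavior of populating more than one dest field.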
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.solr.update.processor.UpdateRequestProcessorFactory
UpdateRequestProcessorFactory.RunAlways
Field Summary
Fields
Modifier and Type    Field                        Description
static String        ANALYZER_FIELD_TYPE_PARAM
static String        DEST_PARAM
static String        ENTITY_TYPE
static String        MODEL_PARAM
static String        PATTERN_PARAM
static String        REPLACEMENT_PARAM
static String        SOURCE_PARAM
Constructor Summary
Constructors
Constructor                                            Description
OpenNLPExtractNamedEntitiesUpdateProcessorFactory()
Method Summary
Methods
Modifier and Type                                          Method
UpdateRequestProcessor                                     getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
protected FieldMutatingUpdateProcessor.FieldNameSelector   getSourceSelector()
void                                                       inform(SolrCore core)
void                                                       init(org.apache.solr.common.util.NamedList<?> args)
Field Detail
SOURCE_PARAM
public static final String SOURCE_PARAM
See Also: Constant Field Values

DEST_PARAM
public static final String DEST_PARAM
See Also: Constant Field Values

PATTERN_PARAM
public static final String PATTERN_PARAM
See Also: Constant Field Values

REPLACEMENT_PARAM
public static final String REPLACEMENT_PARAM
See Also: Constant Field Values

MODEL_PARAM
public static final String MODEL_PARAM
See Also: Constant Field Values

ANALYZER_FIELD_TYPE_PARAM
public static final String ANALYZER_FIELD_TYPE_PARAM
See Also: Constant Field Values

ENTITY_TYPE
public static final String ENTITY_TYPE
See Also: Constant Field Values

Method Detail
getSourceSelector
protected final FieldMutatingUpdateProcessor.FieldNameSelector getSourceSelector()

init
public void init(org.apache.solr.common.util.NamedList<?> args)
Specified by: init in interface NamedListInitializedPlugin

inform
public void inform(SolrCore core)
Specified by: inform in interface SolrCoreAware

getInstance
public final UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
Specified by: getInstance in class UpdateRequestProcessorFactory