Class OpenNLPExtractNamedEntitiesUpdateProcessorFactory
- java.lang.Object
  - org.apache.solr.update.processor.UpdateRequestProcessorFactory
    - org.apache.solr.update.processor.OpenNLPExtractNamedEntitiesUpdateProcessorFactory
- All Implemented Interfaces:
NamedListInitializedPlugin, SolrCoreAware
public class OpenNLPExtractNamedEntitiesUpdateProcessorFactory extends UpdateRequestProcessorFactory implements SolrCoreAware
Extracts named entities using an OpenNLP NER modelFile from the values found in any matching source field into a configured dest field, after first tokenizing the source text using the index analyzer on the configured analyzerFieldType, which must include solr.OpenNLPTokenizerFactory as the tokenizer. E.g.:
<fieldType name="opennlp-en-tokenization" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="en-sent.bin"
               tokenizerModel="en-tokenizer.bin"/>
  </analyzer>
</fieldType>
See the OpenNLP website for information on downloading pre-trained models. Note that in order to use model files larger than 1MB on SolrCloud, ZooKeeper server and client configuration is required.
The source field(s) can be configured as either:
- One or more <str>
- An <arr> of <str>
- A <lst> containing FieldMutatingUpdateProcessorFactory style selector arguments
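As a rough illustration of the <lst> selector form, the sketch below filters field names with an inclusion regex and an exclusion set, mirroring a fieldRegex / exclude pair. The class SourceSelectorSketch and its methods are hypothetical names introduced here. This is not Solr's FieldNameSelector implementation, and testing the regex with Matcher.find() is an assumption of the sketch.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical illustration only; not Solr's FieldNameSelector implementation. */
final class SourceSelectorSketch {

    /** Builds a predicate that accepts names matching fieldRegex unless they are excluded. */
    static Predicate<String> selector(String fieldRegex, Set<String> excludedNames) {
        Pattern include = Pattern.compile(fieldRegex);
        // Assumption: the regex is tested with find(), so unanchored patterns match substrings.
        return name -> include.matcher(name).find() && !excludedNames.contains(name);
    }

    /** Returns the field names accepted by the selector, in their original order. */
    static List<String> select(List<String> fieldNames, Predicate<String> selector) {
        return fieldNames.stream().filter(selector).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Names ending in _txt are selected, except notes_txt.
        Predicate<String> sel = selector(".*_txt$", Set.of("notes_txt"));
        System.out.println(select(List.of("body_txt", "notes_txt", "title"), sel)); // prints [body_txt]
    }
}
```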
The dest field can be a single <str> containing the literal name of a destination field, or it may be a <lst> specifying a regex pattern and a replacement string. If the pattern + replacement option is used, the pattern will be matched against all fields matched by the source selector, and the replacement string (including any capture groups specified from the pattern) will be evaluated using Matcher.replaceAll(String) to generate the literal name of the destination field. Additionally, an occurrence of the string "{EntityType}" in the dest field specification, or in the replacement string, will be replaced with the entity type(s) returned for each entity by the OpenNLP NER model; as a result, if the model extracts more than one entity type, then more than one dest field will be populated.
If the resolved dest field already exists in the document, then the named entities extracted from the source fields will be added to it.
In the example below:
- Named entities will be extracted from the text field and added to the people_s field
- Named entities will be extracted from both the title and subtitle fields and added into the titular_people field
- Named entities will be extracted from any field with a name ending in _txt -- except for notes_txt -- and added into the people_s field
- Named entities will be extracted from any field with a name beginning with "desc" and ending in "s" (e.g. "descs" and "descriptions") and added to a field prefixed with "key_", not ending in "s", and suffixed with "_people" (e.g. "key_desc_people" or "key_description_people")
- Named entities will be extracted from the summary field and added to the summary_person_s field, assuming that the modelFile only extracts entities of type "person".
<updateRequestProcessorChain name="multiple-extract">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <str name="source">text</str>
    <str name="dest">people_s</str>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <arr name="source">
      <str>title</str>
      <str>subtitle</str>
    </arr>
    <str name="dest">titular_people</str>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <lst name="source">
      <str name="fieldRegex">.*_txt$</str>
      <lst name="exclude">
        <str name="fieldName">notes_txt</str>
      </lst>
    </lst>
    <str name="dest">people_s</str>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <lst name="source">
      <str name="fieldRegex">^desc(.*)s$</str>
    </lst>
    <lst name="dest">
      <str name="pattern">^desc(.*)s$</str>
      <str name="replacement">key_desc$1_people</str>
    </lst>
  </processor>
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-test-ner-person.bin</str>
    <str name="analyzerFieldType">opennlp-en-tokenization</str>
    <str name="source">summary</str>
    <str name="dest">summary_{EntityType}_s</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
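The destination names described above for the pattern + replacement and "{EntityType}" cases can be reproduced with plain java.util.regex calls. The following is a minimal sketch, not the factory's actual code: the class DestFieldNameSketch and its method names are hypothetical, and how Solr orders the two substitutions internally is not stated in this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of dest field name resolution; not the factory's code. */
final class DestFieldNameSketch {

    /** Placeholder string that the documentation says is replaced by the entity type. */
    static final String ENTITY_TYPE = "{EntityType}";

    /** Applies pattern + replacement to a source field name via Matcher.replaceAll(String). */
    static String resolveByPattern(String sourceField, String pattern, String replacement) {
        Matcher matcher = Pattern.compile(pattern).matcher(sourceField);
        return matcher.replaceAll(replacement);
    }

    /** Substitutes the literal "{EntityType}" placeholder with a concrete entity type. */
    static String substituteEntityType(String dest, String entityType) {
        return dest.replace(ENTITY_TYPE, entityType);
    }

    public static void main(String[] args) {
        String pattern = "^desc(.*)s$";
        String replacement = "key_desc$1_people";
        // $1 captures the text between "desc" and the trailing "s".
        System.out.println(resolveByPattern("descs", pattern, replacement));        // prints key_desc_people
        System.out.println(resolveByPattern("descriptions", pattern, replacement)); // prints key_description_people
        // A model that only emits "person" entities yields a single destination field.
        System.out.println(substituteEntityType("summary_{EntityType}_s", "person")); // prints summary_person_s
    }
}
```

If a model returned several entity types, calling substituteEntityType once per type would give one destination name per type, which matches the statement that more than one dest field may be populated.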
- Since:
- 7.3.0
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.solr.update.processor.UpdateRequestProcessorFactory
UpdateRequestProcessorFactory.RunAlways
Field Summary
Fields
Modifier and Type    Field                        Description
static String        ANALYZER_FIELD_TYPE_PARAM
static String        DEST_PARAM
static String        ENTITY_TYPE
static String        MODEL_PARAM
static String        PATTERN_PARAM
static String        REPLACEMENT_PARAM
static String        SOURCE_PARAM
Constructor Summary
Constructors
Constructor                                            Description
OpenNLPExtractNamedEntitiesUpdateProcessorFactory()
Method Summary
Modifier and Type                                           Method
UpdateRequestProcessor                                      getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
protected FieldMutatingUpdateProcessor.FieldNameSelector    getSourceSelector()
void                                                        inform(SolrCore core)
void                                                        init(org.apache.solr.common.util.NamedList<?> args)
Field Detail
-
SOURCE_PARAM
public static final String SOURCE_PARAM
- See Also:
- Constant Field Values
-
DEST_PARAM
public static final String DEST_PARAM
- See Also:
- Constant Field Values
-
PATTERN_PARAM
public static final String PATTERN_PARAM
- See Also:
- Constant Field Values
-
REPLACEMENT_PARAM
public static final String REPLACEMENT_PARAM
- See Also:
- Constant Field Values
-
MODEL_PARAM
public static final String MODEL_PARAM
- See Also:
- Constant Field Values
-
ANALYZER_FIELD_TYPE_PARAM
public static final String ANALYZER_FIELD_TYPE_PARAM
- See Also:
- Constant Field Values
-
ENTITY_TYPE
public static final String ENTITY_TYPE
- See Also:
- Constant Field Values
-
-
Method Detail
-
getSourceSelector
protected final FieldMutatingUpdateProcessor.FieldNameSelector getSourceSelector()
-
init
public void init(org.apache.solr.common.util.NamedList<?> args)
- Specified by:
init in interface NamedListInitializedPlugin
-
inform
public void inform(SolrCore core)
- Specified by:
inform in interface SolrCoreAware
-
getInstance
public final UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next)
- Specified by:
getInstance in class UpdateRequestProcessorFactory