Class PhrasesIdentificationComponent

  • All Implemented Interfaces:
    AutoCloseable, SolrInfoBean, SolrMetricProducer, NamedListInitializedPlugin

    public class PhrasesIdentificationComponent
    extends SearchComponent
    A component that can be used in isolation, or in conjunction with QueryComponent to identify & score "phrases" found in the input string, based on shingles in indexed fields.

    The most common way to use this component is in conjunction with fields that use ShingleFilterFactory on both the index and query analyzers. An example field type configuration would be something like this...

     <fieldType name="phrases" class="solr.TextField" positionIncrementGap="100">
       <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3" outputUnigrams="true"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="7" outputUnigramsIfNoShingles="true" outputUnigrams="true"/>
       </analyzer>
     </fieldType>
     

    ...where the query analyzer's maxShingleSize="7" determines the maximum possible phrase length that can be heuristically deduced, and the index analyzer's maxShingleSize="3" determines the accuracy of the phrases identified. The larger the indexed maxShingleSize, the higher the accuracy. Both analyzers must include minShingleSize="2" outputUnigrams="true".
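
To make the effect of the two maxShingleSize values concrete, here is a minimal plain-Java sketch of shingling. It is an illustration only, not Lucene's ShingleFilter implementation: the class name ShingleSketch and the method shingles are hypothetical, and it assumes whitespace-free, already lowercased tokens and a minSize of at least 2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of shingling; this is NOT Lucene's ShingleFilter.
final class ShingleSketch {

    // For each token position, emit the unigram (as with outputUnigrams="true"),
    // then every run of minSize..maxSize adjacent tokens joined by a single space.
    // Assumes minSize >= 2 so that unigrams are not emitted twice.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            out.add(tokens.get(start));
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("brown", "fox", "jumps", "over");
        // Index-style settings: shingles of at most 3 words.
        System.out.println(shingles(tokens, 2, 3));
        // Query-style settings: shingles of up to 7 words, so the 4-word run appears too.
        System.out.println(shingles(tokens, 2, 7));
    }
}
```

With the index-style limit of 3, the four-word run "brown fox jumps over" is never produced as a single term, while the query-style limit of 7 does produce it. This is the sense in which longer candidate phrases have to be deduced from shorter indexed shingles.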

    With a field type like this, one or more fields can be specified (with weights) via a phrases.fields param to request that this component identify possible phrases in the input q param, or an alternative phrases.q override param. The identified phrases will include their scores relative to each field specified, as well as an overall weighted score based on the field weights provided by the client. Higher score values indicate a greater confidence in the phrase.
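
As a rough sketch of what such a request might look like, the plain-Java snippet below assembles a URL query string carrying the q and phrases.fields params. Several details are assumptions rather than facts taken from this page: the field names title_phrases and body_phrases are hypothetical, the field^weight syntax is assumed to follow the convention used by other weighted-field params in Solr, and the phrases=true toggle is assumed to be how the component is enabled for a request. The class name PhrasesRequestSketch is also hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that builds request params for the phrases component.
final class PhrasesRequestSketch {

    // Builds an URL-encoded query string; insertion order of params is preserved.
    static String queryString(String input, String weightedFields) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", input);                       // the input string to analyze
        params.put("phrases", "true");                // ASSUMPTION: per-request enable flag
        params.put("phrases.fields", weightedFields); // ASSUMPTION: "field^weight" syntax

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical fields, both assumed to use a shingled field type like the one above.
        System.out.println(queryString("brown fox jumps", "title_phrases^2 body_phrases"));
    }
}
```

To score a string other than the main query, the same map could carry a phrases.q entry, which this page describes as an override for q.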

    NOTE: In a distributed request, this component uses a single phase (piggybacking on the ShardRequest.PURPOSE_GET_TOP_IDS generated by QueryComponent if it is in use) to collect all field & shingle stats. No "refinement" requests are used.

    WARNING: This API is experimental and might change in incompatible ways in the next release.