Working with External Files and Processes

The ExternalFileField Type

The ExternalFileField type makes it possible to specify the values for a field in a file outside the Solr index. For such a field, the file contains mappings from a key field to the field value. Another way to think of this is that, instead of specifying the field in documents as they are indexed, Solr finds values for this field in the external file.

External fields are not searchable. They can be used only for function queries or display. For more information on function queries, see the section on Function Queries.

The ExternalFileField type is handy for cases where you want to update a particular field in many documents more often than you want to update the rest of the documents. For example, suppose you have implemented a document rank based on the number of views. You might want to update the rank of all the documents daily or hourly, while the rest of the contents of the documents might be updated much less frequently. Without ExternalFileField, you would need to update each document just to change the rank. Using ExternalFileField is much more efficient because all document values for a particular field are stored in an external file that can be updated as frequently as you wish.

In schema.xml, the definition of this field type might look like this:

<fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField"/>

The keyField attribute names the key that is used in the external file. It is usually the unique key for the index, but it doesn’t need to be, as long as the keyField can be used to identify documents in the index. The defVal attribute defines a default value that will be used if there is no entry in the external file for a particular document.

Format of the External File

The file itself is located in Solr’s index directory, which by default is $SOLR_HOME/data. The name of the file should be external_fieldname or external_fieldname.* (fieldname is a placeholder for the name of the field). For the example above, then, the file could be named external_entryRankFile or external_entryRankFile.txt.

If any files using the name pattern .* (such as .txt) appear, the last (after being sorted by name) will be used and previous versions will be deleted. This behavior supports implementations on systems where one may not be able to overwrite a file (for example, on Windows, if the file is in use).
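
The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not Solr's code: the helper name pick_active_file and the date-style suffixes are hypothetical, and plain lexicographic ordering is an assumption about what "sorted by name" means.

```python
def pick_active_file(file_names, field_name):
    """Return the file name that would be used under the rule described above.

    Assumption: "sorted by name" means plain lexicographic order.
    """
    prefix = "external_" + field_name
    # Keep the bare name and any name of the form external_fieldname.<suffix>.
    candidates = [
        name for name in file_names
        if name == prefix or name.startswith(prefix + ".")
    ]
    if not candidates:
        return None
    # The last name after sorting is the one that wins.
    return sorted(candidates)[-1]


names = [
    "external_entryRankFile.20240101",
    "external_entryRankFile.20240102",
    "external_otherField.txt",
]
print(pick_active_file(names, "entryRankFile"))
```

Zero-padded suffixes such as the hypothetical dates above keep newer files sorting last, so a new version can be written under a new name instead of overwriting a file that may be in use.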

The file contains entries that map a key field, on the left of the equals sign, to a value, on the right. Here are a few example entries:

doc33=1.414
doc34=3.14159
doc40=42

The keys listed in this file do not need to be unique. The file does not need to be sorted, but Solr will be able to perform the lookup faster if it is.
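
As a minimal sketch, an external file in this format could be generated with a short Python script. The function name format_external_file and the output path in the comment are hypothetical, and sorting keys in plain string order is an assumption about the ordering that helps lookups.

```python
def format_external_file(values):
    """Render a mapping of key -> value as key=value lines.

    Lines are written in sorted key order, since the text above notes
    that lookups are faster when the file is sorted.
    """
    lines = ["{}={}".format(key, values[key]) for key in sorted(values)]
    return "\n".join(lines) + "\n"


ranks = {"doc40": 42, "doc33": 1.414, "doc34": 3.14159}
content = format_external_file(ranks)
print(content, end="")

# Writing the file is left to the caller, for example (hypothetical path):
# with open("/path/to/data/external_entryRankFile", "w", encoding="utf-8") as f:
#     f.write(content)
```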

Reloading an External File

It’s possible to define an event listener that reloads an external file either when a searcher is reloaded or when a new searcher is started. See the section Query-Related Listeners for more information, but a sample definition in solrconfig.xml might look like this:

<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>

The PreAnalyzedField Type

The PreAnalyzedField type provides a way to send serialized token streams to Solr, optionally with independent stored values of a field, and have this information stored and indexed without any additional text processing applied in Solr. This is useful if a user wants to submit field content that was already processed by an existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed, had synonyms inserted, etc.), while using all the rich per-token attributes that Lucene’s TokenStream provides.

The serialization format is pluggable using implementations of the PreAnalyzedParser interface. There are two out-of-the-box implementations:

  • JsonPreAnalyzedParser: as the name suggests, it parses content that uses JSON to represent a field’s content. This is the default parser, used if the field type is not configured otherwise.
  • SimplePreAnalyzedParser: uses a simple strict plain text format, which in some situations may be easier to create than JSON.

There is only one configuration parameter, parserImpl. Its value should be the fully qualified name of a class that implements the PreAnalyzedParser interface. The default value is org.apache.solr.schema.JsonPreAnalyzedParser.

By default, the query-time analyzer for fields of this type will be the same as the index-time analyzer, which expects serialized pre-analyzed text. To analyze queries that are not pre-analyzed, you must add an analyzer of type "query" to your fieldType. In the example below, the index-time analyzer expects the default JSON serialization format, and the query-time analyzer will employ StandardTokenizer/LowerCaseFilter:

<fieldType name="pre_with_query_analyzer" class="solr.PreAnalyzedField">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

JsonPreAnalyzedParser

This is the default serialization format used by the PreAnalyzedField type. It uses a top-level JSON map with the following keys:

Key | Description | Required
v | Version key. Currently the supported version is 1. | required
str | Stored string value of a field. You can use at most one of str or bin. | optional
bin | Stored binary value of a field. The binary value has to be Base64 encoded. | optional
tokens | Serialized token stream. This is a JSON list. | optional

Any other top-level key is silently ignored.

Token Stream Serialization

The token stream is expressed as a JSON list of JSON maps. The map for each token consists of the following keys and values:

Key | Description | Lucene Attribute | Value | Required?
t | token | CharTermAttribute | UTF-8 string representing the current token | required
s | start offset | OffsetAttribute | Non-negative integer | optional
e | end offset | OffsetAttribute | Non-negative integer | optional
i | position increment | PositionIncrementAttribute | Non-negative integer - default is 1 | optional
p | payload | PayloadAttribute | Base64 encoded payload | optional
y | lexical type | TypeAttribute | UTF-8 string | optional
f | flags | FlagsAttribute | String representing an integer value in hexadecimal format | optional

Any other key is silently ignored.

JsonPreAnalyzedParser Example

{
  "v":"1",
  "str":"test ąćęłńóśźż",
  "tokens": [
    {"t":"two","s":5,"e":8,"i":1,"y":"word"},
    {"t":"three","s":20,"e":22,"i":1,"y":"foobar"},
    {"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"}
  ]
}
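
A value in this format could be produced by an external pipeline along the following lines. This is a sketch, not part of Solr: the helper names make_token and make_pre_analyzed and their parameter names are hypothetical, and writing flags as hexadecimal without a 0x prefix is an assumption.

```python
import base64
import json


def make_token(term, start=None, end=None, pos_inc=None,
               payload=None, lexical_type=None, flags=None):
    """Build one token map using the short keys described above."""
    token = {"t": term}
    if start is not None:
        token["s"] = start
    if end is not None:
        token["e"] = end
    if pos_inc is not None:
        token["i"] = pos_inc
    if payload is not None:
        # Payload bytes are Base64 encoded.
        token["p"] = base64.b64encode(payload).decode("ascii")
    if lexical_type is not None:
        token["y"] = lexical_type
    if flags is not None:
        # Flags are a string holding a hexadecimal integer
        # (assumption: no "0x" prefix).
        token["f"] = format(flags, "x")
    return token


def make_pre_analyzed(tokens, stored=None):
    """Serialize an optional stored string and a token list as version 1 JSON."""
    doc = {"v": "1"}
    if stored is not None:
        doc["str"] = stored
    doc["tokens"] = tokens
    # ensure_ascii=False keeps non-ASCII stored text readable in the output.
    return json.dumps(doc, ensure_ascii=False)


value = make_pre_analyzed(
    [
        make_token("two", 5, 8, 1, lexical_type="word"),
        make_token("one", 123, 128, 22,
                   payload=bytes([0x0D, 0x0E, 0x0A, 0x0D, 0x0B, 0x0E, 0x0E, 0x0F]),
                   lexical_type="word"),
    ],
    stored="test ąćęłńóśźż",
)
print(value)
```

The payload bytes used here are the decoded form of the "p" value in the example above, so the second token serializes to the same Base64 string.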

SimplePreAnalyzedParser

The fully qualified class name to use when specifying this format via the parserImpl configuration parameter is org.apache.solr.schema.SimplePreAnalyzedParser.

SimplePreAnalyzedParser Syntax

The serialization format supported by this parser is as follows:

Serialization format
content ::= version (stored)? tokens
version ::= digit+ " "
; stored field value - any "=" inside must be escaped!
stored ::= "=" text "="
tokens ::= (token ((" ") + token)*)*
token ::= text ("," attrib)*
attrib ::= name '=' value
name ::= text
value ::= text

Special characters in "text" values can be escaped using the escape character \. The following escape sequences are recognized:

Escape Sequence | Description
\ (backslash followed by a space) | literal space character
\, | literal , character
\= | literal = character
\\ | literal \ character
\n | newline
\r | carriage return
\t | horizontal tab

Please note that Unicode sequences (e.g., \u0001) are not supported.

Supported Attributes

The following token attributes are supported, and identified with short symbolic names:

Name | Description | Lucene attribute | Value format
i | position increment | PositionIncrementAttribute | integer
s | start offset | OffsetAttribute | integer
e | end offset | OffsetAttribute | integer
y | lexical type | TypeAttribute | string
f | flags | FlagsAttribute | hexadecimal integer
p | payload | PayloadAttribute | bytes in hexadecimal format; whitespace is ignored

Token positions are tracked and implicitly added to the token stream - the start and end offsets consider only the term text and whitespace, and exclude the space taken by token attributes.
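
The grammar, escape sequences, and attribute names above can be combined into a small serializer. The following Python sketch is illustrative only: the helper names escape_text, format_token, and serialize are hypothetical, and escaping only "\" and "=" inside the stored part is an assumption based on the grammar comment.

```python
_ESCAPES = {
    "\\": "\\\\",
    " ": "\\ ",
    ",": "\\,",
    "=": "\\=",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def escape_text(text):
    """Escape special characters using the sequences listed above."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def format_token(term, **attribs):
    """Render one token as term followed by ,name=value pairs.

    Values are written as given, so flags (f) and payload (p) must
    already be formatted as hexadecimal by the caller.
    """
    parts = [escape_text(term)]
    for name, value in attribs.items():
        parts.append("{}={}".format(name, escape_text(str(value))))
    return ",".join(parts)


def serialize(tokens, stored=None, version=1):
    """Render the version, the optional stored value, and the tokens."""
    out = "{} ".format(version)
    if stored is not None:
        # Assumption: only "\" and "=" need escaping in the stored part.
        out += "=" + stored.replace("\\", "\\\\").replace("=", "\\=") + "="
    return out + " ".join(tokens)


print(serialize([format_token(t) for t in ("one", "two", "three")]))
print(serialize([
    format_token("one", s=123, e=128, i=22),
    format_token("two"),
    format_token("three", s=20, e=22),
]))
print(serialize([], stored="this is a test."))
```

The three printed strings have the same form as the corresponding inputs in the example token streams that follow.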

Example Token Streams

1 one two three
  • version: 1
  • stored: null
  • token: (term=one,startOffset=0,endOffset=3)
  • token: (term=two,startOffset=4,endOffset=7)
  • token: (term=three,startOffset=8,endOffset=13)

1 one  two    three
  • version: 1
  • stored: null
  • token: (term=one,startOffset=0,endOffset=3)
  • token: (term=two,startOffset=5,endOffset=8)
  • token: (term=three,startOffset=11,endOffset=16)

1 one,s=123,e=128,i=22 two three,s=20,e=22
  • version: 1
  • stored: null
  • token: (term=one,positionIncrement=22,startOffset=123,endOffset=128)
  • token: (term=two,positionIncrement=1,startOffset=5,endOffset=8)
  • token: (term=three,positionIncrement=1,startOffset=20,endOffset=22)

1 \ one\ \,,i=22,a=\, two\=

\n,\ =\ \
  • version: 1
  • stored: null
  • token: (term=one ,,positionIncrement=22,startOffset=0,endOffset=6)
  • token: (term=two= ,positionIncrement=1,startOffset=7,endOffset=15)
  • token: (term=\,positionIncrement=1,startOffset=17,endOffset=18)

Note that unknown attributes and their values are ignored, so in this example, the “a” attribute on the first token and the " " (escaped space) attribute on the second token are ignored, along with their values, because they are not among the supported attribute names.

1 ,i=22 ,i=33,s=2,e=20 ,
  • version: 1
  • stored: null
  • token: (term=,positionIncrement=22,startOffset=0,endOffset=0)
  • token: (term=,positionIncrement=33,startOffset=2,endOffset=20)
  • token: (term=,positionIncrement=1,startOffset=2,endOffset=2)

1 =This is the stored part with \=
\n \t escapes.=one two three
  • version: 1
  • stored: This is the stored part with = \t escapes.
  • token: (term=one,startOffset=0,endOffset=3)
  • token: (term=two,startOffset=4,endOffset=7)
  • token: (term=three,startOffset=8,endOffset=13)

Note that the \t in the above stored value is not literal; it’s shown that way to visually indicate the actual tab character that is in the stored value.

1 ==
  • version: 1
  • stored: ""
  • (no tokens)

1 =this is a test.=
  • version: 1
  • stored: this is a test.
  • (no tokens)