Working with External Files and Processes
The ExternalFileField Type
The ExternalFileField type makes it possible to specify the values for a field in a file outside the Solr index. For such a field, the file contains mappings from a key field to the field value. Another way to think of this is that, instead of specifying the field in documents as they are indexed, Solr finds values for this field in the external file.
External fields are not searchable. They can be used only for function queries or display. For more information on function queries, see the section on Function Queries.
The ExternalFileField type is handy for cases where you want to update a particular field in many documents more often than you want to update the rest of the documents. For example, suppose you have implemented a document rank based on the number of views. You might want to update the rank of all the documents daily or hourly, while the rest of the contents of the documents might be updated much less frequently. Without ExternalFileField, you would need to update each document just to change the rank. Using ExternalFileField is much more efficient because all document values for a particular field are stored in an external file that can be updated as frequently as you wish.
In schema.xml, the definition of this field type might look like this:
<fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField"/>
The keyField attribute names the field whose values are used as keys in the external file. It is usually the unique key for the index, but it doesn’t need to be, as long as the keyField can be used to identify documents in the index. The defVal attribute defines a default value that will be used if there is no entry in the external file for a particular document.
Format of the External File
The file itself is located in Solr’s index directory, which by default is $SOLR_HOME/data. The name of the file should be external_fieldname or external_fieldname.*. For the example above, then, the file could be named external_entryRankFile or external_entryRankFile.txt.
If any files using the name pattern |
The file contains entries that map a key field, on the left of the equals sign, to a value, on the right. Here are a few example entries:
doc33=1.414
doc34=3.14159
doc40=42
The keys listed in this file do not need to be unique. The file does not need to be sorted, but Solr will be able to perform the lookup faster if it is.
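As a rough illustration of the format described above, the following Python sketch writes a set of key/value pairs as key=value lines, sorted by key. The function name write_external_file and its parameters are hypothetical, and sorting by Python string order is an assumption about which ordering helps the lookup. The file is written to a temporary name and then renamed, so a reader should not see a half-written file.

```python
import os
import tempfile


def write_external_file(data_dir, field_name, values):
    """Write key=value lines to external_<field_name>, sorted by key.

    data_dir, field_name and values are illustrative parameters, not
    part of any Solr API. values maps document keys to numeric values.
    """
    target = os.path.join(data_dir, f"external_{field_name}")
    # Write to a temporary file in the same directory, then rename it
    # over the target in one step.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".external_tmp_")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for key in sorted(values):  # assumption: plain string order
            out.write(f"{key}={values[key]}\n")
    os.replace(tmp_path, target)
    return target
```

Called with {"doc40": 42, "doc33": 1.414, "doc34": 3.14159} and the field name entryRankFile, this produces a file with the same three entries as the example above, in key order.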
Reloading an External File
It’s possible to define an event listener that reloads an external file either when a searcher is reloaded or when a new searcher is started. See the section Query-Related Listeners for more information, but a sample definition in solrconfig.xml might look like this:
<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
The PreAnalyzedField Type
The PreAnalyzedField type provides a way to send serialized token streams to Solr, optionally with independent stored values of a field, and have this information stored and indexed without any additional text processing applied in Solr. This is useful if a user wants to submit field content that was already processed by some existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed, synonyms inserted, etc.), while using all the rich attributes that Lucene’s TokenStream provides (per-token attributes).
The serialization format is pluggable using implementations of the PreAnalyzedParser interface. There are two out-of-the-box implementations:
- JsonPreAnalyzedParser: as the name suggests, it parses content that uses JSON to represent the field’s content. This is the default parser to use if the field type is not configured otherwise.
- SimplePreAnalyzedParser: uses a simple strict plain text format, which in some situations may be easier to create than JSON.
There is only one configuration parameter, parserImpl. The value of this parameter should be the fully qualified class name of a class that implements the PreAnalyzedParser interface. The default value of this parameter is org.apache.solr.schema.JsonPreAnalyzedParser.
By default, the query-time analyzer for fields of this type will be the same as the index-time analyzer, which expects serialized pre-analyzed text. You must add a query type analyzer to your fieldType in order to perform analysis on non-pre-analyzed queries. In the example below, the index-time analyzer expects the default JSON serialization format, and the query-time analyzer will employ StandardTokenizer/LowerCaseFilter:
<fieldType name="pre_with_query_analyzer" class="solr.PreAnalyzedField">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
JsonPreAnalyzedParser
This is the default serialization format used by the PreAnalyzedField type. It uses a top-level JSON map with the following keys:
Key | Description | Required |
---|---|---|
v | Version key. Currently the supported version is 1. | required |
str | Stored string value of a field. You can use at most one of str or bin. | optional |
bin | Stored binary value of a field. The binary value has to be Base64 encoded. | optional |
tokens | Serialized token stream. This is a JSON list. | optional |
Any other top-level key is silently ignored.
Token Stream Serialization
The token stream is expressed as a JSON list of JSON maps. The map for each token consists of the following keys and values:
Key | Description | Lucene Attribute | Value | Required? |
---|---|---|---|---|
t | token | CharTermAttribute | UTF-8 string representing the current token | required |
s | start offset | OffsetAttribute | Non-negative integer | optional |
e | end offset | OffsetAttribute | Non-negative integer | optional |
i | position increment | PositionIncrementAttribute | Non-negative integer - default is 1 | optional |
p | payload | PayloadAttribute | Base64 encoded payload | optional |
y | lexical type | TypeAttribute | UTF-8 string | optional |
f | flags | FlagsAttribute | String representing an integer value in hexadecimal format | optional |
Any other key is silently ignored.
JsonPreAnalyzedParser Example
{
  "v":"1",
  "str":"test ąćęłńóśźż",
  "tokens": [
    {"t":"two","s":5,"e":8,"i":1,"y":"word"},
    {"t":"three","s":20,"e":22,"i":1,"y":"foobar"},
    {"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"}
  ]
}
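A client that already has tokens from an external pipeline could assemble such a value with a few lines of code. The Python sketch below is only an illustration of the key layout in the tables above: the helper names token and pre_analyzed_json are hypothetical, and the flags key (f) is left out for brevity.

```python
import base64
import json


def token(term, start=None, end=None, increment=None,
          payload=None, lexical_type=None):
    """Build the JSON map for one token (hypothetical helper)."""
    tok = {"t": term}  # "t" is the only required key
    if start is not None:
        tok["s"] = start
    if end is not None:
        tok["e"] = end
    if increment is not None:
        tok["i"] = increment
    if payload is not None:
        # Payloads are raw bytes and must be Base64 encoded.
        tok["p"] = base64.b64encode(payload).decode("ascii")
    if lexical_type is not None:
        tok["y"] = lexical_type
    return tok


def pre_analyzed_json(tokens, stored=None):
    """Serialize a field value with version key "1" (hypothetical helper)."""
    doc = {"v": "1"}
    if stored is not None:
        doc["str"] = stored
    doc["tokens"] = tokens
    # ensure_ascii=False keeps non-ASCII stored text readable.
    return json.dumps(doc, ensure_ascii=False)
```

For instance, token("one", start=123, end=128, increment=22, payload=bytes([13, 14, 10, 13, 11, 14, 14, 15]), lexical_type="word") yields the same map as the last token in the example above, including the Base64 payload.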
SimplePreAnalyzedParser
The fully qualified class name to use when specifying this format via the parserImpl configuration parameter is org.apache.solr.schema.SimplePreAnalyzedParser.
SimplePreAnalyzedParser Syntax
The serialization format supported by this parser is as follows:
Special characters in "text" values can be escaped using the escape character \. The following escape sequences are recognized:
Escape Sequence | Description |
---|---|
\ followed by a space | literal space character |
\, | literal , character |
\= | literal = character |
\\ | literal \ character |
\n | newline |
\r | carriage return |
\t | horizontal tab |
Please note that Unicode sequences (e.g., \u0001) are not supported.
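When generating this format from code, the escape table above translates directly into a character mapping. The Python sketch below is a minimal client-side helper, not part of Solr; the name escape_text is hypothetical.

```python
# Mapping from literal characters to the escape sequences listed above.
_ESCAPES = {
    "\\": "\\\\",
    " ": "\\ ",
    ",": "\\,",
    "=": "\\=",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def escape_text(text):
    """Escape a term so its special characters are treated literally."""
    # Each character is looked up on its own, so a backslash in the
    # input is escaped exactly once.
    return "".join(_ESCAPES.get(ch, ch) for ch in text)
```

For example, a term consisting of a space, the word one, another space, and a comma is written as \ one\ \, in the serialized form.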
Supported Attributes
The following token attributes are supported, and identified with short symbolic names:
Name | Description | Lucene attribute | Value format |
---|---|---|---|
i | position increment | PositionIncrementAttribute | integer |
s | start offset | OffsetAttribute | integer |
e | end offset | OffsetAttribute | integer |
y | lexical type | TypeAttribute | string |
f | flags | FlagsAttribute | hexadecimal integer |
p | payload | PayloadAttribute | bytes in hexadecimal format; whitespace is ignored |
Token positions are tracked and implicitly added to the token stream - the start and end offsets consider only the term text and whitespace, and exclude the space taken by token attributes.
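To make the implicit offset rule concrete, the Python sketch below models it for input without escape sequences: a cursor advances over term text and spaces only, and explicit s and e attributes override the tracked values. This is an illustrative model inferred from the description above, not the parser's actual code, and the function name implicit_offsets is hypothetical. It takes the token part of the value, after the version prefix.

```python
def implicit_offsets(body):
    """Return (term, start, end) tuples for an unescaped token string."""
    tokens = []
    cursor = 0  # counts term characters and spaces, never attributes
    i, n = 0, len(body)
    while i < n:
        if body[i] == " ":
            cursor += 1
            i += 1
            continue
        j = i
        while j < n and body[j] != " ":
            j += 1
        # The first comma-separated piece is the term, the rest are
        # name=value attributes.
        term, *attr_parts = body[i:j].split(",")
        attrs = dict(p.split("=", 1) for p in attr_parts if "=" in p)
        start = int(attrs.get("s", cursor))
        end = int(attrs.get("e", cursor + len(term)))
        tokens.append((term, start, end))
        cursor += len(term)
        i = j
    return tokens
```

With this model, the attributes attached to a token do not shift the offsets of the tokens that follow it, while extra spaces between tokens do.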
Example Token Streams
1 one two three
- version: 1
- stored: null
- token: (term=one,startOffset=0,endOffset=3)
- token: (term=two,startOffset=4,endOffset=7)
- token: (term=three,startOffset=8,endOffset=13)
1 one  two   three
- version: 1
- stored: null
- token: (term=one,startOffset=0,endOffset=3)
- token: (term=two,startOffset=5,endOffset=8)
- token: (term=three,startOffset=11,endOffset=16)
1 one,s=123,e=128,i=22  two three,s=20,e=22
- version: 1
- stored: null
- token: (term=one,positionIncrement=22,startOffset=123,endOffset=128)
- token: (term=two,positionIncrement=1,startOffset=5,endOffset=8)
- token: (term=three,positionIncrement=1,startOffset=20,endOffset=22)
1 \ one\ \,,i=22,a=\, two\=
\n,\ =\ \
- version: 1
- stored: null
- token: (term= one ,,positionIncrement=22,startOffset=0,endOffset=6)
- token: (term=two=,positionIncrement=1,startOffset=7,endOffset=15)
- token: (term=\,positionIncrement=1,startOffset=17,endOffset=18)
Note that unknown attributes and their values are ignored, so in this example, the "a" attribute on the first token and the " " (escaped space) attribute on the second token are ignored, along with their values, because they are not among the supported attribute names.
1 ,i=22 ,i=33,s=2,e=20 ,
- version: 1
- stored: null
- token: (term=,positionIncrement=22,startOffset=0,endOffset=0)
- token: (term=,positionIncrement=33,startOffset=2,endOffset=20)
- token: (term=,positionIncrement=1,startOffset=2,endOffset=2)
1 =This is the stored part with \=
\n \t escapes.=one two three
- version: 1
- stored: This is the stored part with = \t escapes.
- token: (term=one,startOffset=0,endOffset=3)
- token: (term=two,startOffset=4,endOffset=7)
- token: (term=three,startOffset=8,endOffset=13)
Note that the \t in the above stored value is not literal; it’s shown that way to visually indicate the actual tab char that is in the stored value.
1 ==
- version: 1
- stored: ""
- (no tokens)
1 =this is a test.=
- version: 1
- stored: this is a test.
- (no tokens)