Class SimplePreAnalyzedParser

  • All Implemented Interfaces:
    PreAnalyzedField.PreAnalyzedParser

    public final class SimplePreAnalyzedParser
    extends Object
    implements PreAnalyzedField.PreAnalyzedParser

    Simple plain text format parser for PreAnalyzedField.

    Serialization format

    The format of the serialization is as follows:

     content ::= version (stored)? tokens
     version ::= digit+ " "
     ; stored field value - any "=" inside must be escaped!
     stored ::= "=" text "="
     tokens ::= (token ((" ") + token)*)*
     token ::= text ("," attrib)*
     attrib ::= name '=' value
     name ::= text
     value ::= text
     
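    For concreteness, the sketch below assembles a field value that follows this grammar with plain string concatenation; the stored text, tokens, and attribute values are invented for illustration and have no special meaning:

     public class SimpleFormatValueBuilder {
       public static void main(String[] args) {
         StringBuilder sb = new StringBuilder();
         sb.append("1 ");               // version ::= digit+ " "
         sb.append("=a stored value="); // stored ::= "=" text "=" (optional)
         sb.append("one,i=1,s=0,e=3");  // token ::= text ("," attrib)*
         sb.append(' ');                // tokens are separated by space(s)
         sb.append("two,i=1,s=4,e=7");
         // Prints: 1 =a stored value=one,i=1,s=0,e=3 two,i=1,s=4,e=7
         System.out.println(sb);
       }
     }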

    Special characters in "text" values can be escaped using the escape character \ . The following escape sequences are recognized:

     "\ " - literal space character
     "\," - literal , character
     "\=" - literal = character
     "\\" - literal \ character
     "\n" - newline
     "\r" - carriage return
     "\t" - horizontal tab
     
    Please note that Unicode sequences (e.g. \u0001) are not supported.
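
    As a sketch of how these rules could be applied when producing values in this format (an illustrative helper, not the parser's own escaping code):

     public class SimpleFormatEscaper {
       // Escape a raw string so it can be embedded as "text" in the simple format.
       public static String escape(String text) {
         StringBuilder out = new StringBuilder(text.length());
         for (char c : text.toCharArray()) {
           switch (c) {
             case ' ':  out.append("\\ ");  break;
             case ',':  out.append("\\,");  break;
             case '=':  out.append("\\=");  break;
             case '\\': out.append("\\\\"); break;
             case '\n': out.append("\\n");  break;
             case '\r': out.append("\\r");  break;
             case '\t': out.append("\\t");  break;
             default:   out.append(c);
           }
         }
         return out.toString();
       }

       public static void main(String[] args) {
         System.out.println(escape("a b,c=d")); // prints: a\ b\,c\=d
       }
     }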

    Supported attribute names

    The following token attributes are supported, and identified with short symbolic names:
     i - position increment (integer)
     s - token offset, start position (integer)
     e - token offset, end position (integer)
     t - token type (string)
     f - token flags (hexadecimal integer)
     p - payload (bytes in hexadecimal format; whitespace is ignored)
     
    Token offsets are tracked and implicitly added to the token stream when they are not set explicitly: the implicit start and end offsets are computed from the term text and the whitespace between tokens only, and they exclude the space taken by token attributes.
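
    For example, a single token carrying every supported attribute could look like the value below; the attribute values are arbitrary and only meant to show the syntax (a minimal sketch, not taken from the parser itself):

     public class SimpleFormatTokenDemo {
       public static void main(String[] args) {
         // One token using all supported attributes (arbitrary example values):
         //   i=1    position increment
         //   s=0    start offset
         //   e=4    end offset
         //   t=word token type
         //   f=2    token flags (hexadecimal)
         //   p=0a1b payload bytes (hexadecimal)
         String token = "word,i=1,s=0,e=4,t=word,f=2,p=0a1b";
         System.out.println(token);
       }
     }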

    Example token streams

     1 one two three
     - version 1
     - stored: 'null'
     - tok: '(term=one,startOffset=0,endOffset=3)'
     - tok: '(term=two,startOffset=4,endOffset=7)'
     - tok: '(term=three,startOffset=8,endOffset=13)'
     1 one  two   three
     - version 1
     - stored: 'null'
     - tok: '(term=one,startOffset=0,endOffset=3)'
     - tok: '(term=two,startOffset=5,endOffset=8)'
     - tok: '(term=three,startOffset=11,endOffset=16)'
     1 one,s=123,e=128,i=22  two three,s=20,e=22
     - version 1
     - stored: 'null'
     - tok: '(term=one,positionIncrement=22,startOffset=123,endOffset=128)'
     - tok: '(term=two,positionIncrement=1,startOffset=5,endOffset=8)'
     - tok: '(term=three,positionIncrement=1,startOffset=20,endOffset=22)'
     1 \ one\ \,,i=22,a=\, two\=
    
     \n,\ =\   \
     - version 1
     - stored: 'null'
     - tok: '(term= one ,,positionIncrement=22,startOffset=0,endOffset=6)'
     - tok: '(term=two=
    
    
     ,positionIncrement=1,startOffset=7,endOffset=15)'
     - tok: '(term=\,positionIncrement=1,startOffset=17,endOffset=18)'
     1 ,i=22 ,i=33,s=2,e=20 ,
     - version 1
     - stored: 'null'
     - tok: '(term=,positionIncrement=22,startOffset=0,endOffset=0)'
     - tok: '(term=,positionIncrement=33,startOffset=2,endOffset=20)'
     - tok: '(term=,positionIncrement=1,startOffset=2,endOffset=2)'
     1 =This is the stored part with \=
     \n    \t escapes.=one two three
     - version 1
     - stored: 'This is the stored part with =
     \n    \t escapes.'
     - tok: '(term=one,startOffset=0,endOffset=3)'
     - tok: '(term=two,startOffset=4,endOffset=7)'
     - tok: '(term=three,startOffset=8,endOffset=13)'
     1 ==
     - version 1
     - stored: ''
     - (no tokens)
     1 =this is a test.=
     - version 1
     - stored: 'this is a test.'
     - (no tokens)