Class DocTermOrds

  • All Implemented Interfaces:
    org.apache.lucene.util.Accountable
    Direct Known Subclasses:
    UnInvertedField

    public class DocTermOrds
    extends Object
    implements org.apache.lucene.util.Accountable
    This class enables fast access to multiple term ords for a specified field across all docIDs.

    Like FieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike FieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the getOrdTermsEnum(org.apache.lucene.index.LeafReader) method and then seek-by-ord to get the term's bytes.

    While term ords are normally of type long, in this API they are int because the internal representation cannot address more than Integer.MAX_VALUE unique terms. This class is typically used on fields with relatively few unique terms compared to the number of documents. A previous internal limit (16 MB) on how many bytes each chunk of documents may consume has been increased to 2 GB.

    Deleted documents are skipped during uninversion; looking up a deleted document returns zero ords.

    The returned per-document ords do not retain their original order in the document. Instead, they are returned sorted by ord (i.e. by the term's BytesRef comparator order). They are also de-duplicated: if a document contains the same term more than once in this field, that ord is returned only once.
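
    The behavior described above can be illustrated with a minimal sketch in plain Java (no Lucene classes). The class UninvertSketch, its uninvert method, and the String/array types are hypothetical stand-ins chosen for illustration; the real class uses a packed encoding that this sketch does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Conceptual sketch only (hypothetical names); not the packed structure used by DocTermOrds. */
class UninvertSketch {

    /**
     * Turns a term -> docIDs mapping into a docID -> ords mapping.
     *
     * @param postings sorted term dictionary; a term's position in iteration order is its ord.
     *                 (String order stands in for BytesRef order here; they agree for ASCII.)
     * @param live     live[doc] is false for deleted documents
     */
    static int[][] uninvert(TreeMap<String, int[]> postings, boolean[] live) {
        List<List<Integer>> perDoc = new ArrayList<>();
        for (int i = 0; i < live.length; i++) {
            perDoc.add(new ArrayList<>());
        }

        int ord = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (!live[doc]) {
                    continue; // deleted documents are skipped, so they end up with zero ords
                }
                List<Integer> ords = perDoc.get(doc);
                // Terms are visited in ord order, so each list stays sorted;
                // comparing with the last entry is enough to drop duplicates.
                if (ords.isEmpty() || ords.get(ords.size() - 1) != ord) {
                    ords.add(ord);
                }
            }
            ord++;
        }

        int[][] result = new int[live.length][];
        for (int d = 0; d < live.length; d++) {
            List<Integer> ords = perDoc.get(d);
            result[d] = new int[ords.size()];
            for (int i = 0; i < ords.size(); i++) {
                result[d][i] = ords.get(i);
            }
        }
        return result;
    }
}
```

    Because terms are visited in sorted order, no per-document sort is needed in this sketch; the original in-document order of the terms is simply never recorded.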

    This class creates its own term index internally, which makes it possible to build a wrapped TermsEnum that supports ord. The getOrdTermsEnum(org.apache.lucene.index.LeafReader) method returns this wrapped enum.
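
    As a rough illustration of how a sparse in-RAM term index can support lookup by ord, the following plain-Java sketch keeps only every (1 << intervalBits)-th term and steps forward from the nearest kept term. SampledTermIndexSketch and its members are hypothetical names, and the sorted list stands in for the on-disk term dictionary; this is an assumption-laden sketch of the general technique, not the actual internals of this class.

```java
import java.util.Collections;
import java.util.List;

/** Conceptual sketch only (hypothetical names); a sorted List stands in for the term dictionary. */
class SampledTermIndexSketch {
    private final List<String> allTerms;  // full sorted dictionary (would live on disk)
    private final int intervalBits;       // 7 means every 128th term is kept in RAM
    private final String[] indexedTerms;  // the sampled terms held in RAM

    SampledTermIndexSketch(List<String> sortedTerms, int intervalBits) {
        this.allTerms = sortedTerms;
        this.intervalBits = intervalBits;
        int interval = 1 << intervalBits;
        // Round up so the last partial interval also gets an indexed term.
        int count = (sortedTerms.size() + interval - 1) >>> intervalBits;
        this.indexedTerms = new String[count];
        for (int i = 0; i < count; i++) {
            indexedTerms[i] = sortedTerms.get(i << intervalBits);
        }
    }

    int numIndexedTerms() {
        return indexedTerms.length;
    }

    /** Seek-by-ord: jump to the nearest indexed term at or before ord, then step forward. */
    String lookupTerm(int ord) {
        if (ord < 0 || ord >= allTerms.size()) {
            throw new IndexOutOfBoundsException("ord out of range: " + ord);
        }
        int slot = ord >>> intervalBits;
        // Binary search stands in for seeking the dictionary to a known term.
        int pos = Collections.binarySearch(allTerms, indexedTerms[slot]);
        int steps = ord - (slot << intervalBits); // fewer than (1 << intervalBits) forward steps
        return allTerms.get(pos + steps);
    }
}
```

    The trade-off shown here is the usual one for sampled indexes: a smaller interval costs more RAM for the indexed terms, while a larger interval means more forward steps per lookup.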

    The RAM consumption of this class can be high!

    WARNING: This API is experimental and might change in incompatible ways in the next release.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected boolean checkForDocValues
      If true, check and throw an exception if the field has docValues enabled.
      static int DEFAULT_INDEX_INTERVAL_BITS
      Every 128th term is indexed, by default.
      protected String field
      Field we are uninverting.
      protected int[] index
      Holds the per-document ords or a pointer to the ords.
      protected org.apache.lucene.util.BytesRef[] indexedTermsArray
      Holds the indexed (by default every 128th) terms.
      protected int maxTermDocFreq
      Don't uninvert terms that exceed this count.
      protected int numTermsInField
      Number of terms in the field.
      protected int ordBase
      Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
      protected int phase1_time
      Time for phase1 of the uninvert process.
      protected org.apache.lucene.index.PostingsEnum postingsEnum
      Used while uninverting.
      protected org.apache.lucene.util.BytesRef prefix
      If non-null, only terms matching this prefix were indexed.
      protected long sizeOfIndexedStrings
      Total bytes (sum of term lengths) for all indexed terms.
      protected long termInstances
      Total number of references to term numbers.
      protected byte[][] tnums
      Holds term ords for documents.
      protected int total_time
      Total time to uninvert the field.
      • Fields inherited from interface org.apache.lucene.util.Accountable

        NULL_ACCOUNTABLE
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected DocTermOrds​(String field, int maxTermDocFreq, int indexIntervalBits)
      Subclasses initialize with this constructor, but must then call uninvert exactly once.
        DocTermOrds​(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field)
      Inverts all terms.
        DocTermOrds​(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix)
      Inverts only terms starting with the prefix.
        DocTermOrds​(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq)
      Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
        DocTermOrds​(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits)
      Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      org.apache.lucene.index.TermsEnum getOrdTermsEnum​(org.apache.lucene.index.LeafReader reader)
      Returns a TermsEnum that implements ord, or null if no terms in field.
      boolean isEmpty()
      Returns true if no terms were indexed.
      org.apache.lucene.index.SortedSetDocValues iterator​(org.apache.lucene.index.LeafReader reader)
      Returns a SortedSetDocValues view of this instance
      org.apache.lucene.util.BytesRef lookupTerm​(org.apache.lucene.index.TermsEnum termsEnum, int ord)
      Returns the term (BytesRef) corresponding to the provided ordinal.
      int numTerms()
      Returns the number of terms in this field
      long ramBytesUsed()
      Returns total bytes used.
      protected void setActualDocFreq​(int termNum, int df)
      Invoked during uninvert(org.apache.lucene.index.LeafReader,Bits,BytesRef) to record the document frequency for each uninverted term.
      protected void uninvert​(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, org.apache.lucene.util.BytesRef termPrefix)
      Call this only once (relevant only if you subclass).
      protected void visitTerm​(org.apache.lucene.index.TermsEnum te, int termNum)
      Subclasses can override this.
      • Methods inherited from interface org.apache.lucene.util.Accountable

        getChildResources
    • Field Detail

      • DEFAULT_INDEX_INTERVAL_BITS

        public static final int DEFAULT_INDEX_INTERVAL_BITS
        Every 128th term is indexed, by default.
        See Also:
        Constant Field Values
      • maxTermDocFreq

        protected final int maxTermDocFreq
        Don't uninvert terms that exceed this count.
      • field

        protected final String field
        Field we are uninverting.
      • numTermsInField

        protected int numTermsInField
        Number of terms in the field.
      • termInstances

        protected long termInstances
        Total number of references to term numbers.
      • total_time

        protected int total_time
        Total time to uninvert the field.
      • phase1_time

        protected int phase1_time
        Time for phase1 of the uninvert process.
      • index

        protected int[] index
        Holds the per-document ords or a pointer to the ords.
      • tnums

        protected byte[][] tnums
        Holds term ords for documents.
      • sizeOfIndexedStrings

        protected long sizeOfIndexedStrings
        Total bytes (sum of term lengths) for all indexed terms.
      • indexedTermsArray

        protected org.apache.lucene.util.BytesRef[] indexedTermsArray
        Holds the indexed (by default every 128th) terms.
      • prefix

        protected org.apache.lucene.util.BytesRef prefix
        If non-null, only terms matching this prefix were indexed.
      • ordBase

        protected int ordBase
        Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
      • postingsEnum

        protected org.apache.lucene.index.PostingsEnum postingsEnum
        Used while uninverting.
      • checkForDocValues

        protected boolean checkForDocValues
        If true, check and throw an exception if the field has docValues enabled. Normally, docValues should be used in preference to DocTermOrds.
    • Constructor Detail

      • DocTermOrds

        public DocTermOrds​(org.apache.lucene.index.LeafReader reader,
                           org.apache.lucene.util.Bits liveDocs,
                           String field)
                    throws IOException
        Inverts all terms.
        Throws:
        IOException
      • DocTermOrds

        public DocTermOrds​(org.apache.lucene.index.LeafReader reader,
                           org.apache.lucene.util.Bits liveDocs,
                           String field,
                           org.apache.lucene.util.BytesRef termPrefix)
                    throws IOException
        Inverts only terms starting with the prefix.
        Throws:
        IOException
      • DocTermOrds

        public DocTermOrds​(org.apache.lucene.index.LeafReader reader,
                           org.apache.lucene.util.Bits liveDocs,
                           String field,
                           org.apache.lucene.util.BytesRef termPrefix,
                           int maxTermDocFreq)
                    throws IOException
        Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
        Throws:
        IOException
      • DocTermOrds

        public DocTermOrds​(org.apache.lucene.index.LeafReader reader,
                           org.apache.lucene.util.Bits liveDocs,
                           String field,
                           org.apache.lucene.util.BytesRef termPrefix,
                           int maxTermDocFreq,
                           int indexIntervalBits)
                    throws IOException
        Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
        Throws:
        IOException
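
        The selection rule shared by the prefix and maxTermDocFreq constructors above can be sketched in plain Java as follows. TermFilterSketch and selectTerms are hypothetical names, Strings stand in for BytesRef terms, and the map of document frequencies is an assumed input; the sketch only shows which terms would qualify and makes no claim about how the class treats the terms it leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Conceptual sketch only (hypothetical names); Strings stand in for BytesRef terms. */
class TermFilterSketch {

    /**
     * Returns the terms that would be uninverted under the documented rule.
     *
     * @param docFreqs        sorted term -> docFreq (deletions are not taken into account)
     * @param prefix          only terms starting with this prefix qualify; null means all terms
     * @param maxTermDocFreq  terms whose docFreq exceeds this value are not uninverted
     */
    static List<String> selectTerms(TreeMap<String, Integer> docFreqs,
                                    String prefix,
                                    int maxTermDocFreq) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            boolean prefixMatches = prefix == null || entry.getKey().startsWith(prefix);
            boolean rareEnough = entry.getValue() <= maxTermDocFreq;
            if (prefixMatches && rareEnough) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

        For example, with document frequencies cat:a=5, cat:b=50 and tag:x=1, a prefix of "cat:" and a maxTermDocFreq of 10 select only cat:a.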
      • DocTermOrds

        protected DocTermOrds​(String field,
                              int maxTermDocFreq,
                              int indexIntervalBits)
        Subclasses initialize with this constructor, but must then call uninvert exactly once.
    • Method Detail

      • ramBytesUsed

        public long ramBytesUsed()
        Returns total bytes used.
        Specified by:
        ramBytesUsed in interface org.apache.lucene.util.Accountable
      • getOrdTermsEnum

        public org.apache.lucene.index.TermsEnum getOrdTermsEnum​(org.apache.lucene.index.LeafReader reader)
                                                          throws IOException
        Returns a TermsEnum that implements ord, or null if no terms in field.

        A "private" terms index is built internally (WARNING: this consumes RAM) and used to implement ord. This also enables ord on top of a composite reader. The returned TermsEnum is unpositioned.

        NOTE: you must pass the same reader that was used when creating this instance.

        Throws:
        IOException
      • numTerms

        public int numTerms()
        Returns the number of terms in this field
      • isEmpty

        public boolean isEmpty()
        Returns true if no terms were indexed.
      • visitTerm

        protected void visitTerm​(org.apache.lucene.index.TermsEnum te,
                                 int termNum)
                          throws IOException
        Subclasses can override this.
        Throws:
        IOException
      • uninvert

        protected void uninvert​(org.apache.lucene.index.LeafReader reader,
                                org.apache.lucene.util.Bits liveDocs,
                                org.apache.lucene.util.BytesRef termPrefix)
                         throws IOException
        Call this only once (relevant only if you subclass).
        Throws:
        IOException
      • lookupTerm

        public org.apache.lucene.util.BytesRef lookupTerm​(org.apache.lucene.index.TermsEnum termsEnum,
                                                          int ord)
                                                   throws IOException
        Returns the term (BytesRef) corresponding to the provided ordinal.
        Throws:
        IOException
      • iterator

        public org.apache.lucene.index.SortedSetDocValues iterator​(org.apache.lucene.index.LeafReader reader)
                                                            throws IOException
        Returns a SortedSetDocValues view of this instance
        Throws:
        IOException