Class DocTermOrds
- All Implemented Interfaces:
org.apache.lucene.util.Accountable
- Direct Known Subclasses:
UnInvertedField
Like FieldCache, it uninverts the index and holds a packed data structure in RAM to enable
fast access. Unlike FieldCache, it can handle multi-valued fields, and it does not hold the term
bytes in RAM. Instead, you must obtain a TermsEnum from the getOrdTermsEnum(org.apache.lucene.index.LeafReader) method and
then seek by ord to get the term's bytes.
Term ords are normally of type long, but in this API they are int because the internal representation cannot address more than Integer.MAX_VALUE unique terms. This class is typically used on fields with relatively few unique terms compared to the number of documents. A previous internal limit (16 MB) on how many bytes each chunk of documents may consume has been increased to 2 GB.
Deleted documents are skipped during uninversion; looking one up returns zero ords.
The returned per-document ords do not retain their original order in the document. Instead, they are returned sorted by ord (i.e. by the term's BytesRef comparator). They are also de-duplicated: if a document has the same term more than once in this field, that ord is returned only once.
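The ordering, de-duplication, and deleted-document behavior described above can be illustrated with a toy model. This is only a sketch using plain Java collections, not the packed encoding that DocTermOrds actually uses; the class name UninvertSketch, the method uninvert, and the map-based "postings" are hypothetical stand-ins introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of uninversion (hypothetical names; not the DocTermOrds implementation).
class UninvertSketch {

    /**
     * postings maps each term to the doc IDs that contain it (duplicates allowed).
     * live[doc] == false marks a deleted document.
     * Returns, per document, the sorted and de-duplicated term ords.
     */
    static int[][] uninvert(TreeMap<String, int[]> postings, boolean[] live) {
        List<TreeSet<Integer>> perDoc = new ArrayList<>();
        for (int doc = 0; doc < live.length; doc++) {
            perDoc.add(new TreeSet<>());
        }
        // Terms are visited in sorted order, so the ord is just a running counter.
        // Note: String order matches BytesRef (UTF-8 byte) order only for simple
        // cases such as ASCII terms; that is an assumption of this sketch.
        int ord = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (live[doc]) {                 // deleted documents are skipped
                    perDoc.get(doc).add(ord);    // the set drops repeated ords
                }
            }
            ord++;
        }
        int[][] result = new int[live.length][];
        for (int doc = 0; doc < live.length; doc++) {
            result[doc] = perDoc.get(doc).stream().mapToInt(Integer::intValue).toArray();
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("b", new int[] {0, 2});
        postings.put("a", new int[] {0, 0, 1}); // doc 0 contains "a" twice
        boolean[] live = {true, false, true};   // doc 1 is deleted
        int[][] ords = uninvert(postings, live);
        for (int doc = 0; doc < ords.length; doc++) {
            System.out.println("doc " + doc + " -> " + java.util.Arrays.toString(ords[doc]));
        }
    }
}
```

In this model a deleted document simply ends up with an empty ord array, which mirrors the "zero ords" behavior described for lookups of deleted documents.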
This class creates its own term index internally, which makes it possible to build a wrapped TermsEnum
that supports ord. The getOrdTermsEnum(org.apache.lucene.index.LeafReader) method returns this wrapped enum.
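One way to picture how a sparse term index can support seek-by-ord is sketched below: keep every (1 << bits)-th term in memory, jump to the nearest indexed term at or before the requested ord, then step forward. This is a simplified illustration under stated assumptions, not the class's actual code; SparseTermIndexSketch, termForOrd, and the in-memory array standing in for the terms dictionary are all hypothetical, and the sketch assumes the interval equals 1 << indexIntervalBits (which is consistent with "every 128th term" for 7 bits).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy sparse term index (hypothetical names; not the DocTermOrds implementation).
class SparseTermIndexSketch {
    private final String[] dictionary;   // stand-in for the full, sorted terms dictionary
    private final String[] indexedTerms; // every (1 << bits)-th term, held in RAM
    private final int bits;

    SparseTermIndexSketch(String[] sortedTerms, int bits) {
        this.dictionary = sortedTerms;
        this.bits = bits;
        int interval = 1 << bits;
        List<String> kept = new ArrayList<>();
        for (int ord = 0; ord < sortedTerms.length; ord += interval) {
            kept.add(sortedTerms[ord]);
        }
        this.indexedTerms = kept.toArray(new String[0]);
    }

    int indexedTermCount() {
        return indexedTerms.length;
    }

    /** Resolves an ord by seeking to the nearest indexed term, then stepping forward. */
    String termForOrd(int ord) {
        if (ord < 0 || ord >= dictionary.length) {
            throw new IllegalArgumentException("ord out of range: " + ord);
        }
        String anchor = indexedTerms[ord >>> bits];        // nearest indexed term at or before ord
        int pos = Arrays.binarySearch(dictionary, anchor); // stand-in for a seek by term
        int steps = ord & ((1 << bits) - 1);               // remaining terms to walk
        for (int i = 0; i < steps; i++) {
            pos++;                                         // stand-in for advancing the enum
        }
        return dictionary[pos];
    }

    public static void main(String[] args) {
        String[] terms = new String[1000];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = String.format(Locale.ROOT, "term%04d", i);
        }
        SparseTermIndexSketch index = new SparseTermIndexSketch(terms, 7);
        System.out.println("indexed terms held in RAM: " + index.indexedTermCount());
        System.out.println("term for ord 300: " + index.termForOrd(300));
    }
}
```

The trade-off is the usual one for sparse indexes: a smaller interval keeps more terms in RAM but needs fewer forward steps per lookup, while a larger interval saves memory at the cost of longer scans.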
The RAM consumption of this class can be high!
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Field Summary
Fields
Modifier and Type | Field | Description
protected boolean | checkForDocValues | If true, check and throw an exception if the field has docValues enabled.
static final int | DEFAULT_INDEX_INTERVAL_BITS | Every 128th term is indexed, by default.
protected final String | field | Field we are uninverting.
protected int[] | index | Holds the per-document ords or a pointer to the ords.
protected org.apache.lucene.util.BytesRef[] | indexedTermsArray | Holds the indexed (by default every 128th) terms.
protected final int | maxTermDocFreq | Don't uninvert terms that exceed this count.
protected int | numTermsInField | Number of terms in the field.
protected int | ordBase | Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
protected int | phase1_time | Time for phase1 of the uninvert process.
protected org.apache.lucene.index.PostingsEnum | postingsEnum | Used while uninverting.
protected org.apache.lucene.util.BytesRef | prefix | If non-null, only terms matching this prefix were indexed.
protected long | sizeOfIndexedStrings | Total bytes (sum of term lengths) for all indexed terms.
protected long | termInstances | Total number of references to term numbers.
protected byte[][] | tnums | Holds term ords for documents.
protected int | total_time | Total time to uninvert the field.
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits) | Subclasses initialize with this constructor and must then call uninvert exactly once.
public | DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field) | Inverts all terms.
public | DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix) | Inverts only terms starting with the prefix.
public | DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq) | Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
public | DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits) | Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
-
Method Summary
Modifier and Type | Method | Description
org.apache.lucene.index.TermsEnum | getOrdTermsEnum(org.apache.lucene.index.LeafReader reader) | Returns a TermsEnum that implements ord, or null if no terms in field.
boolean | isEmpty() | Returns true if no terms were indexed.
org.apache.lucene.index.SortedSetDocValues | iterator(org.apache.lucene.index.LeafReader reader) | Returns a SortedSetDocValues view of this instance.
org.apache.lucene.util.BytesRef | lookupTerm(org.apache.lucene.index.TermsEnum termsEnum, int ord) | Returns the term (BytesRef) corresponding to the provided ordinal.
int | numTerms() | Returns the number of terms in this field.
long | ramBytesUsed() | Returns total bytes used.
protected void | setActualDocFreq(int termNum, int df) | Invoked during uninvert(org.apache.lucene.index.LeafReader,Bits,BytesRef) to record the document frequency for each uninverted term.
protected void | uninvert(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, org.apache.lucene.util.BytesRef termPrefix) | Call this only once (if you subclass!)
protected void | visitTerm(org.apache.lucene.index.TermsEnum te, int termNum) | Subclass can override this.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.lucene.util.Accountable
getChildResources
-
Field Details
-
DEFAULT_INDEX_INTERVAL_BITS
public static final int DEFAULT_INDEX_INTERVAL_BITS
Every 128th term is indexed, by default.
-
maxTermDocFreq
protected final int maxTermDocFreq
Don't uninvert terms that exceed this count.
-
field
protected final String field
Field we are uninverting.
-
numTermsInField
protected int numTermsInField
Number of terms in the field.
-
termInstances
protected long termInstances
Total number of references to term numbers.
-
total_time
protected int total_time
Total time to uninvert the field.
-
phase1_time
protected int phase1_time
Time for phase1 of the uninvert process.
-
index
protected int[] index
Holds the per-document ords or a pointer to the ords.
-
tnums
protected byte[][] tnums
Holds term ords for documents.
-
sizeOfIndexedStrings
protected long sizeOfIndexedStrings
Total bytes (sum of term lengths) for all indexed terms.
-
indexedTermsArray
protected org.apache.lucene.util.BytesRef[] indexedTermsArray
Holds the indexed (by default every 128th) terms.
-
prefix
protected org.apache.lucene.util.BytesRef prefix
If non-null, only terms matching this prefix were indexed.
-
ordBase
protected int ordBase
Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
-
postingsEnum
protected org.apache.lucene.index.PostingsEnum postingsEnum
Used while uninverting.
-
checkForDocValues
protected boolean checkForDocValues
If true, check and throw an exception if the field has docValues enabled. Normally, docValues should be used in preference to DocTermOrds.
-
-
Constructor Details
-
DocTermOrds
public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field) throws IOException
Inverts all terms.
- Throws:
IOException
-
DocTermOrds
public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix) throws IOException
Inverts only terms starting with the prefix.
- Throws:
IOException
-
DocTermOrds
public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq) throws IOException
Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
- Throws:
IOException
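The effect of the termPrefix and maxTermDocFreq arguments can be pictured as a filter over the field's terms. The sketch below is a hedged illustration only: TermFilterSketch and termsToUninvert are hypothetical names, the map of document frequencies is invented sample input, and the sketch says nothing about how terms that fail the filter are numbered or otherwise handled internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy term filter (hypothetical names; not the DocTermOrds implementation).
class TermFilterSketch {

    /**
     * Returns the terms that would be uninverted: those starting with termPrefix
     * (a null prefix accepts every term) whose docFreq is <= maxTermDocFreq.
     * The docFreq values here are raw counts, i.e. deletions are not subtracted.
     */
    static List<String> termsToUninvert(
            TreeMap<String, Integer> docFreqByTerm, String termPrefix, int maxTermDocFreq) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            boolean prefixOk = termPrefix == null || entry.getKey().startsWith(termPrefix);
            boolean freqOk = entry.getValue() <= maxTermDocFreq;
            if (prefixOk && freqOk) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> docFreqs = new TreeMap<>();
        docFreqs.put("color:blue", 3);
        docFreqs.put("color:red", 50);
        docFreqs.put("size:l", 2);
        System.out.println(termsToUninvert(docFreqs, "color:", 10));
        System.out.println(termsToUninvert(docFreqs, null, 10));
    }
}
```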
-
DocTermOrds
public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits) throws IOException
Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
- Throws:
IOException
-
DocTermOrds
protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits)
Subclasses initialize with this constructor and must then call uninvert exactly once.
-
-
Method Details
-
ramBytesUsed
public long ramBytesUsed()
Returns total bytes used.
- Specified by:
ramBytesUsed in interface org.apache.lucene.util.Accountable
-
getOrdTermsEnum
public org.apache.lucene.index.TermsEnum getOrdTermsEnum(org.apache.lucene.index.LeafReader reader) throws IOException
Returns a TermsEnum that implements ord, or null if there are no terms in the field.
A "private" terms index is built internally (WARNING: consumes RAM) and used to implement ord. This also enables ord on top of a composite reader. The returned TermsEnum is unpositioned.
NOTE: you must pass the same reader that was used when creating this class.
- Throws:
IOException
-
numTerms
public int numTerms()
Returns the number of terms in this field.
-
isEmpty
public boolean isEmpty()
Returns true if no terms were indexed.
-
visitTerm
protected void visitTerm(org.apache.lucene.index.TermsEnum te, int termNum) throws IOException
Subclass can override this.
- Throws:
IOException
-
setActualDocFreq
protected void setActualDocFreq(int termNum, int df) throws IOException
Invoked during uninvert(org.apache.lucene.index.LeafReader,Bits,BytesRef) to record the document frequency for each uninverted term.
- Throws:
IOException
-
uninvert
protected void uninvert(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, org.apache.lucene.util.BytesRef termPrefix) throws IOException
Call this only once (if you subclass!)
- Throws:
IOException
-
lookupTerm
public org.apache.lucene.util.BytesRef lookupTerm(org.apache.lucene.index.TermsEnum termsEnum, int ord) throws IOException
Returns the term (BytesRef) corresponding to the provided ordinal.
- Throws:
IOException
-
iterator
public org.apache.lucene.index.SortedSetDocValues iterator(org.apache.lucene.index.LeafReader reader) throws IOException
Returns a SortedSetDocValues view of this instance.
- Throws:
IOException
-