Package org.apache.solr.uninverting
Class DocTermOrds
- java.lang.Object
-
- org.apache.solr.uninverting.DocTermOrds
-
- All Implemented Interfaces:
org.apache.lucene.util.Accountable
- Direct Known Subclasses:
UnInvertedField
public class DocTermOrds extends Object implements org.apache.lucene.util.Accountable
This class enables fast access to multiple term ords for a specified field across all docIDs. Like FieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike FieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the getOrdTermsEnum(org.apache.lucene.index.LeafReader) method and then seek by ord to get the term's bytes.
While term ords are normally of type long, in this API they are int, because the internal representation cannot address more than MAX_INT (Integer.MAX_VALUE) unique terms. This class is also typically used on fields with relatively few unique terms compared to the number of documents. A previous internal limit (16 MB) on how many bytes each chunk of documents may consume has been increased to 2 GB.
Deleted documents are skipped during uninversion, and looking one up returns 0 ords.
The returned per-document ords do not retain their original order in the document. Instead, they are returned in sorted order (by ord, i.e. by the term's BytesRef comparator). They are also de-duplicated: if a document has the same term more than once in this field, that ord is returned only once.
This class creates its own term index internally, which allows it to create a wrapped TermsEnum that supports ord. The getOrdTermsEnum(org.apache.lucene.index.LeafReader) method provides this wrapped enum.
The RAM consumption of this class can be high!
- WARNING: This API is experimental and might change in incompatible ways in the next release.
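The uninversion described above can be illustrated with a small, self-contained model. This is only a sketch in plain Java, not the Solr implementation: the class `UninvertSketch`, its `uninvert` method, and the use of a `TreeMap` as a stand-in for the inverted index are all hypothetical. It shows why walking terms in sorted order yields per-document ords that are sorted and de-duplicated, and why a deleted document ends up with 0 ords.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of uninversion; not the actual DocTermOrds code.
final class UninvertSketch {

    // postings: term -> docIDs containing it (a doc may be listed more than once).
    // live[doc] == false marks a deleted document.
    static List<List<Integer>> uninvert(TreeMap<String, int[]> postings, boolean[] live) {
        List<List<Integer>> ordsByDoc = new ArrayList<>();
        for (int i = 0; i < live.length; i++) {
            ordsByDoc.add(new ArrayList<>());
        }
        int ord = 0; // ords are assigned in sorted term order
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (!live[doc]) {
                    continue; // deleted documents are skipped
                }
                List<Integer> ords = ordsByDoc.get(doc);
                // Terms are visited in increasing ord, so a duplicate can only be the last element.
                if (ords.isEmpty() || ords.get(ords.size() - 1) != ord) {
                    ords.add(ord);
                }
            }
            ord++;
        }
        return ordsByDoc;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {0, 2});
        postings.put("banana", new int[] {0, 0, 1}); // doc 0 has the term twice
        postings.put("cherry", new int[] {2});
        boolean[] live = {true, false, true};        // doc 1 is deleted
        System.out.println(uninvert(postings, live)); // prints [[0, 1], [], [0, 2]]
    }
}
```

In this model the term bytes live only in the map, mirroring the point that the per-document structure stores ords, not terms.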
-
-
Field Summary
Fields (Modifier and Type, Field: Description)
protected boolean checkForDocValues: If true, check and throw an exception if the field has docValues enabled.
static int DEFAULT_INDEX_INTERVAL_BITS: Every 128th term is indexed, by default.
protected String field: Field we are uninverting.
protected int[] index: Holds the per-document ords or a pointer to the ords.
protected org.apache.lucene.util.BytesRef[] indexedTermsArray: Holds the indexed (by default every 128th) terms.
protected int maxTermDocFreq: Don't uninvert terms that exceed this count.
protected int numTermsInField: Number of terms in the field.
protected int ordBase: Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
protected int phase1_time: Time for phase1 of the uninvert process.
protected org.apache.lucene.index.PostingsEnum postingsEnum: Used while uninverting.
protected org.apache.lucene.util.BytesRef prefix: If non-null, only terms matching this prefix were indexed.
protected long sizeOfIndexedStrings: Total bytes (sum of term lengths) for all indexed terms.
protected long termInstances: Total number of references to term numbers.
protected byte[][] tnums: Holds term ords for documents.
protected int total_time: Total time to uninvert the field.
-
Constructor Summary
Constructors (Modifier, Constructor: Description)
protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits): Subclasses initialize with this constructor; be sure to then call uninvert exactly once.
DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field): Inverts all terms.
DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix): Inverts only terms starting with the prefix.
DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq): Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits): Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
-
Method Summary
Methods (Modifier and Type, Method: Description)
org.apache.lucene.index.TermsEnum getOrdTermsEnum(org.apache.lucene.index.LeafReader reader): Returns a TermsEnum that implements ord, or null if no terms in field.
boolean isEmpty(): Returns true if no terms were indexed.
org.apache.lucene.index.SortedSetDocValues iterator(org.apache.lucene.index.LeafReader reader): Returns a SortedSetDocValues view of this instance.
org.apache.lucene.util.BytesRef lookupTerm(org.apache.lucene.index.TermsEnum termsEnum, int ord): Returns the term (BytesRef) corresponding to the provided ordinal.
int numTerms(): Returns the number of terms in this field.
long ramBytesUsed(): Returns total bytes used.
protected void setActualDocFreq(int termNum, int df): Invoked during uninvert(org.apache.lucene.index.LeafReader,Bits,BytesRef) to record the document frequency for each uninverted term.
protected void uninvert(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, org.apache.lucene.util.BytesRef termPrefix): Call this only once (if you subclass!)
protected void visitTerm(org.apache.lucene.index.TermsEnum te, int termNum): Subclass can override this.
-
-
-
Field Detail
-
DEFAULT_INDEX_INTERVAL_BITS
public static final int DEFAULT_INDEX_INTERVAL_BITS
Every 128th term is indexed, by default.
- See Also:
Constant Field Values
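An interval of 128 terms corresponds to 2^7, so the arithmetic behind an "interval bits" setting can be sketched as below. The value 7 is an assumption inferred from "every 128th term"; this page does not state the constant's value. The class `IndexIntervalSketch` and its helpers are hypothetical and are not part of DocTermOrds.

```java
// Hypothetical helper illustrating index-interval arithmetic.
// ASSUMED_INTERVAL_BITS = 7 is inferred from "every 128th term", not taken from this page.
final class IndexIntervalSketch {
    static final int ASSUMED_INTERVAL_BITS = 7;

    // Number of terms between two indexed terms: 1 << 7 == 128.
    static int interval(int bits) {
        return 1 << bits;
    }

    // True if the term with this ord would be kept in the sparse in-memory index.
    static boolean isIndexed(int ord, int bits) {
        int mask = (1 << bits) - 1;
        return (ord & mask) == 0;
    }

    // Slot of the nearest indexed term at or before the given ord.
    static int indexSlot(int ord, int bits) {
        return ord >>> bits;
    }

    // How many terms the sparse index holds for a field with numTerms terms.
    static int indexedTermCount(int numTerms, int bits) {
        return (numTerms + (1 << bits) - 1) >>> bits;
    }
}
```

Under these assumptions, a field with 1000 terms keeps 8 indexed terms in RAM (ords 0, 128, ..., 896). A smaller bit count means more indexed terms and fewer steps per lookup, at the cost of more memory.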
-
maxTermDocFreq
protected final int maxTermDocFreq
Don't uninvert terms that exceed this count.
-
field
protected final String field
Field we are uninverting.
-
numTermsInField
protected int numTermsInField
Number of terms in the field.
-
termInstances
protected long termInstances
Total number of references to term numbers.
-
total_time
protected int total_time
Total time to uninvert the field.
-
phase1_time
protected int phase1_time
Time for phase1 of the uninvert process.
-
index
protected int[] index
Holds the per-document ords or a pointer to the ords.
-
tnums
protected byte[][] tnums
Holds term ords for documents.
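This page does not describe how the ords are packed into index and tnums. As a purely illustrative sketch, one common way to store a sorted list of ords compactly is to delta-encode it and write each delta as a variable-length integer. The class `DeltaVIntSketch` and its byte layout are assumptions for illustration and should not be read as the actual DocTermOrds format.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical packing scheme (delta + variable-length ints); not the real DocTermOrds layout.
final class DeltaVIntSketch {

    // Encodes ascending ords as deltas, 7 bits per byte, low-order bits first.
    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int delta = ord - prev;
            prev = ord;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // high bit set: more bytes follow
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    // Reverses encode(): reads each variable-length delta and accumulates it.
    static int[] decode(byte[] bytes) {
        List<Integer> ords = new ArrayList<>();
        int pos = 0;
        int prev = 0;
        while (pos < bytes.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ords.add(prev);
        }
        int[] result = new int[ords.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ords.get(i);
        }
        return result;
    }
}
```

With this scheme the four ords {3, 130, 131, 20000} occupy 6 bytes instead of 16, which hints at why sorted, de-duplicated ords suit a packed representation.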
-
sizeOfIndexedStrings
protected long sizeOfIndexedStrings
Total bytes (sum of term lengths) for all indexed terms.
-
indexedTermsArray
protected org.apache.lucene.util.BytesRef[] indexedTermsArray
Holds the indexed (by default every 128th) terms.
-
prefix
protected org.apache.lucene.util.BytesRef prefix
If non-null, only terms matching this prefix were indexed.
-
ordBase
protected int ordBase
Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
-
postingsEnum
protected org.apache.lucene.index.PostingsEnum postingsEnum
Used while uninverting.
-
checkForDocValues
protected boolean checkForDocValues
If true, check and throw an exception if the field has docValues enabled. Normally, docValues should be used in preference to DocTermOrds.
-
-
Constructor Detail
-
DocTermOrds
public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field) throws IOException
Inverts all terms.
- Throws:
IOException
-
DocTermOrds
public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix) throws IOException
Inverts only terms starting with the prefix.
- Throws:
IOException
-
DocTermOrds
public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq) throws IOException
Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
- Throws:
IOException
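The combined effect of termPrefix and maxTermDocFreq can be pictured as a filter over the field's terms. The sketch below is a plain-Java model under stated assumptions: `TermFilterSketch`, `termsToUninvert`, and the map of term to docFreq are hypothetical stand-ins, not the code DocTermOrds runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of term selection by prefix and docFreq cutoff.
final class TermFilterSketch {

    // docFreqByTerm: term -> docFreq (deletions not taken into account).
    // A null prefix means "all terms", mirroring the constructors without termPrefix.
    static List<String> termsToUninvert(TreeMap<String, Integer> docFreqByTerm,
                                        String prefix, int maxTermDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (prefix != null && !entry.getKey().startsWith(prefix)) {
                continue; // outside the requested prefix
            }
            if (entry.getValue() > maxTermDocFreq) {
                continue; // too frequent: not uninverted
            }
            kept.add(entry.getKey());
        }
        return kept;
    }
}
```

For example, with terms cat:a (docFreq 3), cat:b (docFreq 50), and tag:x (docFreq 1), a prefix of "cat:" and a cutoff of 10 keeps only cat:a.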
-
DocTermOrds
public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits) throws IOException
Inverts only terms starting with the prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
- Throws:
IOException
-
DocTermOrds
protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits)
Subclasses initialize with this constructor; be sure to then call uninvert exactly once.
-
-
Method Detail
-
ramBytesUsed
public long ramBytesUsed()
Returns total bytes used.
- Specified by:
ramBytesUsed in interface org.apache.lucene.util.Accountable
-
getOrdTermsEnum
public org.apache.lucene.index.TermsEnum getOrdTermsEnum(org.apache.lucene.index.LeafReader reader) throws IOException
Returns a TermsEnum that implements ord, or null if no terms in field. We build a "private" terms index internally (WARNING: consumes RAM) and use that index to implement ord. This also enables ord on top of a composite reader. The returned TermsEnum is unpositioned.
NOTE: you must pass the same reader that was used when creating this class.
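How a sparse "private" terms index can support lookup by ord is sketched below as a toy model. Everything here is hypothetical: `SparseOrdIndexSketch`, its sorted list standing in for the term dictionary, and the binary search standing in for a seek by term. The idea is to jump to the nearest indexed term at or before the requested ord and then step forward the remaining distance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical toy model of ord lookup through a sparse in-memory term index.
final class SparseOrdIndexSketch {
    private final List<String> allTerms;     // stand-in for the sorted term dictionary
    private final List<String> indexedTerms; // every (1 << bits)-th term, held in RAM
    private final int bits;

    SparseOrdIndexSketch(List<String> sortedTerms, int bits) {
        this.allTerms = sortedTerms;
        this.bits = bits;
        this.indexedTerms = new ArrayList<>();
        for (int ord = 0; ord < sortedTerms.size(); ord += (1 << bits)) {
            indexedTerms.add(sortedTerms.get(ord));
        }
    }

    int indexedTermCount() {
        return indexedTerms.size();
    }

    // Resolves ord -> term: seek to the nearest indexed term, then step forward.
    String lookupTerm(int ord) {
        if (ord < 0 || ord >= allTerms.size()) {
            throw new IndexOutOfBoundsException("ord out of range: " + ord);
        }
        String start = indexedTerms.get(ord >>> bits);
        int pos = Collections.binarySearch(allTerms, start); // models a seek by term
        int steps = ord & ((1 << bits) - 1);                 // models repeated next() calls
        return allTerms.get(pos + steps);
    }
}
```

In this model only the indexed terms need to stay in memory, which is the trade-off the RAM warning above refers to: a denser index costs more memory but needs fewer forward steps per lookup.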
- Throws:
IOException
-
numTerms
public int numTerms()
Returns the number of terms in this field
-
isEmpty
public boolean isEmpty()
Returns true if no terms were indexed.
-
visitTerm
protected void visitTerm(org.apache.lucene.index.TermsEnum te, int termNum) throws IOException
Subclass can override this.
- Throws:
IOException
-
setActualDocFreq
protected void setActualDocFreq(int termNum, int df) throws IOException
Invoked during uninvert(org.apache.lucene.index.LeafReader,Bits,BytesRef) to record the document frequency for each uninverted term.
- Throws:
IOException
-
uninvert
protected void uninvert(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, org.apache.lucene.util.BytesRef termPrefix) throws IOException
Call this only once (if you subclass!)
- Throws:
IOException
-
lookupTerm
public org.apache.lucene.util.BytesRef lookupTerm(org.apache.lucene.index.TermsEnum termsEnum, int ord) throws IOException
Returns the term (BytesRef) corresponding to the provided ordinal.
- Throws:
IOException
-
iterator
public org.apache.lucene.index.SortedSetDocValues iterator(org.apache.lucene.index.LeafReader reader) throws IOException
Returns a SortedSetDocValues view of this instance.
- Throws:
IOException
-
-