org.apache.solr.uninverting.DocTermOrds

All Implemented Interfaces: org.apache.lucene.util.Accountable

Direct Known Subclasses: UnInvertedField

public class DocTermOrds extends Object implements org.apache.lucene.util.Accountable

This class enables fast access to multiple term ords for a specified field across all docIDs.

Like FieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike FieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the getOrdTermsEnum(org.apache.lucene.index.LeafReader) method and then seek by ord to get the term's bytes.

Term ords are normally of type long, but in this API they are int, because the internal representation cannot address more than Integer.MAX_VALUE unique terms. This class is typically used on fields with relatively few unique terms compared to the number of documents. A previous internal limit (16 MB) on how many bytes each chunk of documents may consume has been raised to 2 GB.

Deleted documents are skipped during uninversion; looking one up returns zero ords.

The returned per-document ords do not retain their original order in the document. Instead, they are returned in sorted order (by ord, i.e. by the term's BytesRef comparator). They are also de-duplicated: if a document contains the same term more than once in this field, that ord is returned only once.
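The behavior described above (deleted documents yield no ords; ords come back sorted and without duplicates) can be illustrated with a toy model. This is a minimal sketch and not the class's actual implementation: `UninvertSketch`, the `TreeMap` of postings, and the `BitSet` of live documents are all hypothetical stand-ins, and `String` ordering stands in for the BytesRef comparator.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UninvertSketch {
    /**
     * Toy uninversion. The postings map each term to the docIDs that contain it.
     * Terms are visited in sorted order, so a term's position is its ord.
     */
    static List<List<Integer>> uninvert(TreeMap<String, int[]> postings, BitSet liveDocs, int maxDoc) {
        List<List<Integer>> ordsPerDoc = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            ordsPerDoc.add(new ArrayList<>());
        }
        int ord = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                // Deleted documents are skipped, so they end up with zero ords.
                if (liveDocs.get(doc)) {
                    ordsPerDoc.get(doc).add(ord);
                }
            }
            ord++;
        }
        return ordsPerDoc;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("red", new int[] {0, 2});
        postings.put("blue", new int[] {0, 1});
        postings.put("green", new int[] {2});

        BitSet liveDocs = new BitSet();
        liveDocs.set(0);
        liveDocs.set(2); // doc 1 is treated as deleted

        // Sorted terms: blue=0, green=1, red=2
        System.out.println(uninvert(postings, liveDocs, 3)); // prints [[0, 2], [], [1, 2]]
    }
}
```

In this model the sorted, duplicate-free result falls out of the traversal: terms are walked in ord order, and a postings list names each document at most once per term no matter how often the term occurs in it.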

This class creates its own term index internally, which allows it to build a wrapped TermsEnum that supports ord. The getOrdTermsEnum(org.apache.lucene.index.LeafReader) method returns this wrapped enum.
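To show why indexing only every 128th term (an interval of `1 << 7`) is enough to support lookup by ord, here is a minimal sketch of a sparse term index. It is an illustration under stated assumptions, not the internal code: `SparseTermIndexSketch` and its members are hypothetical names, and a sorted `List<String>` stands in for the terms dictionary that a real TermsEnum would iterate.

```java
import java.util.ArrayList;
import java.util.List;

final class SparseTermIndexSketch {
    private final List<String> sortedTerms;   // stand-in for the full terms dictionary
    private final int intervalBits;           // interval between indexed terms is 1 << intervalBits
    private final List<String> indexedTerms = new ArrayList<>(); // the sparse index held in RAM
    int scanned;                              // sequential steps taken by the last lookup

    SparseTermIndexSketch(List<String> sortedTerms, int intervalBits) {
        this.sortedTerms = sortedTerms;
        this.intervalBits = intervalBits;
        int interval = 1 << intervalBits;
        for (int i = 0; i < sortedTerms.size(); i += interval) {
            indexedTerms.add(sortedTerms.get(i));
        }
    }

    int indexedTermCount() {
        return indexedTerms.size();
    }

    /** Finds the term for an ord: jump to the nearest indexed term, then scan forward. */
    String lookupTerm(int ord) {
        if (ord < 0 || ord >= sortedTerms.size()) {
            throw new IndexOutOfBoundsException("ord out of range: " + ord);
        }
        int slot = ord >>> intervalBits;      // indexed term at or before ord
        int pos = slot << intervalBits;       // ord of that indexed term
        String term = indexedTerms.get(slot); // a real enum would seek to this term
        scanned = 0;
        while (pos < ord) {                   // stand-in for repeated next() calls
            pos++;
            term = sortedTerms.get(pos);
            scanned++;
        }
        return term;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            terms.add(String.format("t%03d", i));
        }
        SparseTermIndexSketch index = new SparseTermIndexSketch(terms, 7);
        // prints "t130 after 2 scan steps (3 terms indexed)"
        System.out.println(index.lookupTerm(130) + " after " + index.scanned
            + " scan steps (" + index.indexedTermCount() + " terms indexed)");
    }
}
```

The trade-off in this model is direct: fewer interval bits mean more indexed terms in RAM and shorter scans, while more bits mean the reverse.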

The RAM consumption of this class can be high!

WARNING: This API is experimental and might change in incompatible ways in the next release.

Field Summary

Fields

Modifier and Type

Field

Description

protected boolean

checkForDocValues

If true, check and throw an exception if the field has docValues enabled.

static final int

DEFAULT_INDEX_INTERVAL_BITS

Every 128th term is indexed, by default.

protected final String

field

Field we are uninverting.

protected int[]

index

Holds the per-document ords or a pointer to the ords.

protected org.apache.lucene.util.BytesRef[]

indexedTermsArray

Holds the indexed (by default every 128th) terms.

protected final int

maxTermDocFreq

Don't uninvert terms that exceed this count.

protected int

numTermsInField

Number of terms in the field.

protected int

ordBase

Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().

protected int

phase1_time

Time for phase1 of the uninvert process.

protected org.apache.lucene.index.PostingsEnum

postingsEnum

Used while uninverting.

protected org.apache.lucene.util.BytesRef

prefix

If non-null, only terms matching this prefix were indexed.

protected long

sizeOfIndexedStrings

Total bytes (sum of term lengths) for all indexed terms.

protected long

termInstances

Total number of references to term numbers.

protected byte[][]

tnums

Holds term ords for documents.

protected int

total_time

Total time to uninvert the field.

Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE

Constructor Summary

Constructors

Modifier

Constructor

Description

protected

DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits)

Subclasses initialize with this constructor and must then call uninvert exactly once.

DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field)

Inverts all terms.

DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix)

Inverts only terms starting with the given prefix.

DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq)

Inverts only terms starting with the given prefix and whose docFreq (not taking deletions into account) is <= maxTermDocFreq.

DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits)

Inverts only terms starting with the given prefix and whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (the default is every 128th term).
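
The prefix and maxTermDocFreq parameters of the constructors above act as filters on which terms are uninverted. The following is a minimal sketch of that selection rule only, assuming a hypothetical `TermFilterSketch` class and a `TreeMap` of raw document frequencies (deletions not subtracted). How the real class numbers or otherwise handles the terms it skips is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TermFilterSketch {
    /** Returns the terms that pass both the prefix filter and the docFreq cap. */
    static List<String> termsToUninvert(TreeMap<String, Integer> docFreqs,
                                        String prefix, int maxTermDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            // A null prefix means "no prefix restriction" in this sketch.
            boolean prefixOk = prefix == null || entry.getKey().startsWith(prefix);
            if (prefixOk && entry.getValue() <= maxTermDocFreq) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> docFreqs = new TreeMap<>();
        docFreqs.put("cat:a", 3);
        docFreqs.put("cat:b", 50);  // exceeds the cap below
        docFreqs.put("tag:x", 2);   // wrong prefix
        System.out.println(termsToUninvert(docFreqs, "cat:", 10)); // prints [cat:a]
    }
}
```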
Method Summary

Modifier and Type

Method

Description

org.apache.lucene.index.TermsEnum

getOrdTermsEnum(org.apache.lucene.index.LeafReader reader)

Returns a TermsEnum that implements ord, or null if no terms in field.

boolean

isEmpty()

Returns true if no terms were indexed.

org.apache.lucene.index.SortedSetDocValues

iterator(org.apache.lucene.index.LeafReader reader)

Returns a SortedSetDocValues view of this instance.

org.apache.lucene.util.BytesRef

lookupTerm(org.apache.lucene.index.TermsEnum termsEnum, int ord)

Returns the term (BytesRef) corresponding to the provided ordinal.

int

numTerms()

Returns the number of terms in this field.

long

ramBytesUsed()

Returns total bytes used.

protected void

setActualDocFreq(int termNum, int df)

Invoked during uninvert(org.apache.lucene.index.LeafReader,Bits,BytesRef) to record the document frequency for each uninverted term.

protected void

uninvert(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, org.apache.lucene.util.BytesRef termPrefix)

If you subclass, call this only once.

protected void

visitTerm(org.apache.lucene.index.TermsEnum te, int termNum)

Subclasses can override this.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.lucene.util.Accountable
getChildResources

Field Details
- DEFAULT_INDEX_INTERVAL_BITS
  
  public static final int DEFAULT_INDEX_INTERVAL_BITS
  
  Every 128th term is indexed, by default.
  See Also:
  
  Constant Field Values
- maxTermDocFreq
  
  protected final int maxTermDocFreq
  
  Don't uninvert terms that exceed this count.
- field
  
  protected final String field
  
  Field we are uninverting.
- numTermsInField
  
  protected int numTermsInField
  
  Number of terms in the field.
- termInstances
  
  protected long termInstances
  
  Total number of references to term numbers.
- total_time
  
  protected int total_time
  
  Total time to uninvert the field.
- phase1_time
  
  protected int phase1_time
  
  Time for phase1 of the uninvert process.
- index
  
  protected int[] index
  
  Holds the per-document ords or a pointer to the ords.
- tnums
  
  protected byte[][] tnums
  
  Holds term ords for documents.
- sizeOfIndexedStrings
  
  protected long sizeOfIndexedStrings
  
  Total bytes (sum of term lengths) for all indexed terms.
- indexedTermsArray
  
  protected org.apache.lucene.util.BytesRef[] indexedTermsArray
  
  Holds the indexed (by default every 128th) terms.
- prefix
  
  protected org.apache.lucene.util.BytesRef prefix
  
  If non-null, only terms matching this prefix were indexed.
- ordBase
  
  protected int ordBase
  
  Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement TermsEnum.ord().
- postingsEnum
  
  protected org.apache.lucene.index.PostingsEnum postingsEnum
  
  Used while uninverting.
- checkForDocValues
  
  protected boolean checkForDocValues
  
  If true, check and throw an exception if the field has docValues enabled. Normally, docValues should be used in preference to DocTermOrds.
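
The index and tnums fields listed above hold per-document ords in a packed form. One common way to pack a sorted list of ints is delta encoding followed by variable-length bytes. The sketch below shows that general technique; it is an assumption for illustration and does not claim to reproduce the exact byte layout used by this class. `DeltaVIntSketch` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DeltaVIntSketch {
    /** Encodes sorted, non-negative ords as deltas, 7 bits per byte, low-order bits first. */
    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int delta = ord - prev;
            prev = ord;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // high bit set: more bytes follow
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads each variable-length delta and accumulates it. */
    static int[] decode(byte[] bytes) {
        List<Integer> ords = new ArrayList<>();
        int pos = 0;
        int prev = 0;
        while (pos < bytes.length) {
            int delta = 0;
            int shift = 0;
            int b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ords.add(prev);
        }
        return ords.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] ords = {5, 130, 131, 70000};
        byte[] packed = encode(ords);
        // Deltas are 5, 125, 1, 69869: three 1-byte values and one 3-byte value.
        System.out.println(packed.length + " bytes: " + Arrays.toString(decode(packed)));
        // prints "6 bytes: [5, 130, 131, 70000]"
    }
}
```

In this example four ords occupy 6 bytes instead of the 16 bytes that four plain ints would need, which is why sorted, de-duplicated ords pack well when neighboring values are close together.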
Constructor Details
- DocTermOrds
  
  public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field) throws IOException
  
  Inverts all terms.
  
  Throws:
  
  IOException
- DocTermOrds
  
  public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix) throws IOException
  
  Inverts only terms starting with the given prefix.
  
  Throws:
  
  IOException
- DocTermOrds
  
  public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq) throws IOException
  
  Inverts only terms starting with the given prefix and whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
  
  Throws:
  
  IOException
- DocTermOrds
  
  public DocTermOrds(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, String field, org.apache.lucene.util.BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits) throws IOException
  
  Inverts only terms starting with the given prefix and whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (the default is every 128th term).
  
  Throws:
  
  IOException
- DocTermOrds
  
  protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits)
  
  Subclasses initialize with this constructor and must then call uninvert exactly once.

Method Details
- ramBytesUsed
  
  public long ramBytesUsed()
  
  Returns total bytes used.
  
  Specified by:
  
  ramBytesUsed in interface org.apache.lucene.util.Accountable
- getOrdTermsEnum
  
  public org.apache.lucene.index.TermsEnum getOrdTermsEnum(org.apache.lucene.index.LeafReader reader) throws IOException
  
  Returns a TermsEnum that implements ord, or null if no terms in field.
  This method builds a "private" terms index internally (WARNING: this consumes RAM) and uses that index to implement ord. This also enables ord on top of a composite reader. The returned TermsEnum is unpositioned.
  NOTE: You must pass the same reader that was used when creating this instance.
  
  Throws:
  
  IOException
- numTerms
  
  public int numTerms()
  
  Returns the number of terms in this field.
- isEmpty
  
  public boolean isEmpty()
  
  Returns true if no terms were indexed.
- visitTerm
  
  protected void visitTerm(org.apache.lucene.index.TermsEnum te, int termNum) throws IOException
  
  Subclasses can override this.
  
  Throws:
  
  IOException
- setActualDocFreq
  
  protected void setActualDocFreq(int termNum, int df) throws IOException
  
  Invoked during uninvert(org.apache.lucene.index.LeafReader,Bits,BytesRef) to record the document frequency for each uninverted term.
  
  Throws:
  
  IOException
- uninvert
  
  protected void uninvert(org.apache.lucene.index.LeafReader reader, org.apache.lucene.util.Bits liveDocs, org.apache.lucene.util.BytesRef termPrefix) throws IOException
  
  If you subclass, call this only once.
  
  Throws:
  
  IOException
- lookupTerm
  
  public org.apache.lucene.util.BytesRef lookupTerm(org.apache.lucene.index.TermsEnum termsEnum, int ord) throws IOException
  
  Returns the term (BytesRef) corresponding to the provided ordinal.
  
  Throws:
  
  IOException
- iterator
  
  public org.apache.lucene.index.SortedSetDocValues iterator(org.apache.lucene.index.LeafReader reader) throws IOException
  
  Returns a SortedSetDocValues view of this instance.
  
  Throws:
  
  IOException