Class OffsetCorrector

java.lang.Object
org.apache.solr.handler.tagger.OffsetCorrector
Direct Known Subclasses:
XmlOffsetCorrector

public abstract class OffsetCorrector extends Object
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected final String
    docText
    Document text.
    protected final com.carrotsearch.hppc.IntArrayList
    nonTaggableOffsets
    Disjoint start and end span offsets (inclusive) of non-taggable sections.
    protected final int[]
    offsetPair
     
    protected final com.carrotsearch.hppc.IntArrayList
    parentChangeIds
    Tag IDs; parallel array to parentChangeOffsets.
    protected final com.carrotsearch.hppc.IntArrayList
    parentChangeOffsets
    Offsets at which the parent tag ID changes, in ascending order.
    protected final com.carrotsearch.hppc.IntArrayList
    tagInfo
    Array of tag info comprised of 5 int fields: [int parentTag, int openStartOff, int openEndOff, int closeStartOff, int closeEndOff].
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    protected
    OffsetCorrector(String docText, boolean hasNonTaggable)
    Initialize based on the document text.
  • Method Summary

    Modifier and Type
    Method
    Description
    protected int
    correctEndOffsetForCloseElement(int endOffset)
    Correct endOffset for adjacent element at the right side.
    int[]
    correctPair(int leftOffset, int rightOffset)
    Corrects the start and end offset pair.
    protected int
    getCloseEndOff(int tag)
     
    protected int
    getCloseStartOff(int tag)
     
    protected int
    getOpenEndOff(int tag)
     
    protected int
    getOpenStartOff(int tag)
     
    protected int
    getParentTag(int tag)
     
    protected boolean
    hasNonWhitespace(int start, int end)
     
    protected int
    lookupTag(int off)
     
    protected boolean
    spansNonTaggable(int startOff, int endOff)
     
    protected boolean
    tagEnclosesOffset(int tag, int off)
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • docText

      protected final String docText
      Document text.
    • tagInfo

      protected final com.carrotsearch.hppc.IntArrayList tagInfo
      Array of tag info comprised of 5 int fields: [int parentTag, int openStartOff, int openEndOff, int closeStartOff, int closeEndOff]. Its size indicates how many tags there are. Tags are ID'ed sequentially from 0.
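
      The flat 5-int record layout described above can be pictured with a small stand-in. This is only a sketch: TagInfoSketch and its addTag method are hypothetical names, a plain int[] replaces the hppc IntArrayList, and the example offsets assume exclusive end offsets, which the description does not state.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the flat tagInfo layout; not the Solr class. */
final class TagInfoSketch {
  // Assumed field order, taken from the field description.
  static final int STRIDE = 5;
  static final int PARENT = 0, OPEN_START = 1, OPEN_END = 2, CLOSE_START = 3, CLOSE_END = 4;

  private int[] data = new int[STRIDE * 4]; // plain int[] instead of hppc IntArrayList
  private int used = 0;                     // number of ints currently in use

  /** Appends one 5-int record and returns the new tag's sequential ID. */
  int addTag(int parentTag, int openStartOff, int openEndOff, int closeStartOff, int closeEndOff) {
    if (used + STRIDE > data.length) {
      data = Arrays.copyOf(data, data.length * 2);
    }
    int id = used / STRIDE;
    data[used++] = parentTag;
    data[used++] = openStartOff;
    data[used++] = openEndOff;
    data[used++] = closeStartOff;
    data[used++] = closeEndOff;
    return id;
  }

  /** Number of tags stored: one tag per 5 ints. */
  int tagCount() { return used / STRIDE; }

  int getParentTag(int tag)     { return data[tag * STRIDE + PARENT]; }
  int getOpenStartOff(int tag)  { return data[tag * STRIDE + OPEN_START]; }
  int getOpenEndOff(int tag)    { return data[tag * STRIDE + OPEN_END]; }
  int getCloseStartOff(int tag) { return data[tag * STRIDE + CLOSE_START]; }
  int getCloseEndOff(int tag)   { return data[tag * STRIDE + CLOSE_END]; }

  public static void main(String[] args) {
    // Tags of "<a>x<b>y</b></a>", with -1 as an assumed "no parent" marker.
    TagInfoSketch info = new TagInfoSketch();
    int a = info.addTag(-1, 0, 3, 12, 16);
    int b = info.addTag(a, 4, 7, 8, 12);
    System.out.println(info.tagCount() + " tags; parent of tag " + b + " is " + info.getParentTag(b));
  }
}
```

      Storing all tags in one flat int list avoids allocating an object per tag; the accessor methods listed in the method summary presumably index into it in a similar way.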
    • parentChangeOffsets

      protected final com.carrotsearch.hppc.IntArrayList parentChangeOffsets
      Offsets at which the parent tag ID changes, in ascending order.
    • parentChangeIds

      protected final com.carrotsearch.hppc.IntArrayList parentChangeIds
      Tag IDs; parallel array to parentChangeOffsets.
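
      One way to read the two parallel arrays is as a sorted change log: to find the tag in effect at an offset, locate the last change at or before that offset and take the ID at the same index. The sketch below illustrates that idea under stated assumptions; ParentLookupSketch, the NO_TAG sentinel, and the exact positions at which changes are recorded are hypothetical, and plain int[] arrays stand in for the hppc lists.

```java
import java.util.Arrays;

/** Hypothetical lookup over parallel "change offset" / "tag id" arrays. */
final class ParentLookupSketch {
  static final int NO_TAG = -1; // assumed sentinel for "no enclosing tag"

  /**
   * Returns the ID recorded at the last change offset that is <= off,
   * or NO_TAG if off precedes the first change.
   * Assumes parentChangeOffsets is strictly ascending.
   */
  static int lookupTag(int[] parentChangeOffsets, int[] parentChangeIds, int off) {
    int idx = Arrays.binarySearch(parentChangeOffsets, off);
    if (idx < 0) {
      // Not found: binarySearch returned -(insertionPoint) - 1.
      // insertionPoint - 1 is the last entry below off.
      idx = -idx - 2;
    }
    return idx < 0 ? NO_TAG : parentChangeIds[idx];
  }

  public static void main(String[] args) {
    // Assumed change log for "<a>x<b>y</b></a>".
    int[] offsets = {3, 7, 8, 12};
    int[] ids = {0, 1, 0, NO_TAG};
    System.out.println(lookupTag(offsets, ids, 5));
  }
}
```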
    • offsetPair

      protected final int[] offsetPair
    • nonTaggableOffsets

      protected final com.carrotsearch.hppc.IntArrayList nonTaggableOffsets
      Disjoint start and end span offsets (inclusive) of non-taggable sections. Null if none.
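
      As an illustration of how such a list might be consulted, the sketch below checks a candidate span against flattened [start, end] pairs. It rests on assumptions not stated here: NonTaggableSketch is a hypothetical name, the pairs are stored flat as start0, end0, start1, end1, ..., the candidate's endOff is treated as exclusive, and a linear scan is used for clarity where a sorted list would also allow a binary search.

```java
/** Hypothetical overlap check against flattened, inclusive [start, end] pairs. */
final class NonTaggableSketch {

  /**
   * Returns true if [startOff, endOff) overlaps any non-taggable span.
   * A null array means there are no non-taggable sections.
   */
  static boolean spansNonTaggable(int[] nonTaggableOffsets, int startOff, int endOff) {
    if (nonTaggableOffsets == null) {
      return false;
    }
    for (int i = 0; i + 1 < nonTaggableOffsets.length; i += 2) {
      int spanStart = nonTaggableOffsets[i];   // inclusive
      int spanEnd = nonTaggableOffsets[i + 1]; // inclusive
      if (startOff <= spanEnd && spanStart < endOff) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    int[] spans = {10, 19, 30, 39}; // two disjoint non-taggable sections
    System.out.println(spansNonTaggable(spans, 0, 11));
    System.out.println(spansNonTaggable(spans, 20, 30));
  }
}
```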
  • Constructor Details

    • OffsetCorrector

      protected OffsetCorrector(String docText, boolean hasNonTaggable)
      Initialize based on the document text.
      Parameters:
      docText - non-null structured content.
      hasNonTaggable - if there may be "non-taggable" tags to track
  • Method Details

    • correctPair

      public int[] correctPair(int leftOffset, int rightOffset)
      Corrects the start and end offset pair. Returns null if the pair cannot be corrected, either because the offsets cannot be kept balanced or because the span crosses "non-taggable" tags. The start (left) offset is pulled left as needed over whitespace and opening tags. The end (right) offset is pulled right as needed over whitespace and closing tags. The corrected pair is returned as a 2-element array.

      Note that the returned array is reused internally; read the values from it right away rather than holding on to the array.
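
      The consequence of the reused return array is easiest to see with a stand-in. The sketch below does not perform any real offset correction; ReusedPairSketch is a hypothetical class that only mimics the "same array on every call" behavior and the null-on-failure contract, to show why a caller that needs to keep a result should copy it.

```java
/** Hypothetical stand-in that mimics a reused 2-element result array. */
final class ReusedPairSketch {
  private final int[] offsetPair = new int[2]; // shared across all calls

  /**
   * Stand-in for correctPair: stores the pair in the shared array and returns it.
   * Returns null for an invalid pair, as a placeholder for "cannot be corrected".
   */
  int[] correctPair(int leftOffset, int rightOffset) {
    if (leftOffset > rightOffset) {
      return null;
    }
    offsetPair[0] = leftOffset;
    offsetPair[1] = rightOffset;
    return offsetPair;
  }

  public static void main(String[] args) {
    ReusedPairSketch corrector = new ReusedPairSketch();

    int[] first = corrector.correctPair(3, 8);
    // Copy before the next call if the values must outlive it.
    int[] kept = (first == null) ? null : first.clone();

    corrector.correctPair(10, 20); // overwrites the array that 'first' points to

    System.out.println("first now: " + first[0] + "," + first[1]);
    System.out.println("kept copy: " + kept[0] + "," + kept[1]);
  }
}
```

      Callers that process one match at a time can read the two values immediately and skip the copy.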

    • correctEndOffsetForCloseElement

      protected int correctEndOffsetForCloseElement(int endOffset)
      Correct endOffset for adjacent element at the right side. E.g. offsetPair might point to:
         foo</tag>
       
      and this method pulls the end offset left to the '<'. This is necessary for use with HTMLStripCharFilter.

      See https://issues.apache.org/jira/browse/LUCENE-5734
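
      The adjustment described above might look roughly like the following. This is a sketch under assumptions, not the actual implementation: CloseElementSketch is a hypothetical class, the method is written as a static helper that takes the document text and start offset as explicit parameters, and the end offset is assumed to be exclusive, pointing just past the closing '>'.

```java
/** Hypothetical sketch of pulling an end offset back to the '<' of a close tag. */
final class CloseElementSketch {

  /**
   * If the span ends right after a '>', moves the end offset left to the
   * nearest preceding '<', provided that stays to the right of startOffset.
   * Otherwise returns endOffset unchanged.
   */
  static int correctEndOffsetForCloseElement(String docText, int startOffset, int endOffset) {
    if (endOffset > 0 && docText.charAt(endOffset - 1) == '>') {
      // Search backwards, starting just before the '>'.
      int lt = docText.lastIndexOf('<', endOffset - 2);
      if (lt > startOffset) {
        return lt;
      }
    }
    return endOffset;
  }

  public static void main(String[] args) {
    String doc = "foo</tag>";
    int end = correctEndOffsetForCloseElement(doc, 0, doc.length());
    System.out.println(doc.substring(0, end)); // prints "foo"
  }
}
```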

    • hasNonWhitespace

      protected boolean hasNonWhitespace(int start, int end)
    • tagEnclosesOffset

      protected boolean tagEnclosesOffset(int tag, int off)
    • getParentTag

      protected int getParentTag(int tag)
    • getOpenStartOff

      protected int getOpenStartOff(int tag)
    • getOpenEndOff

      protected int getOpenEndOff(int tag)
    • getCloseStartOff

      protected int getCloseStartOff(int tag)
    • getCloseEndOff

      protected int getCloseEndOff(int tag)
    • lookupTag

      protected int lookupTag(int off)
    • spansNonTaggable

      protected boolean spansNonTaggable(int startOff, int endOff)