Class OffsetCorrector

    • Field Detail

      • docText

        protected final String docText
        Document text.
      • tagInfo

        protected final com.carrotsearch.hppc.IntArrayList tagInfo
        Array of tag info comprised of 5 int fields per tag: [int parentTag, int openStartOff, int openEndOff, int closeStartOff, int closeEndOff]. Its size indicates how many tags there are. Tags are ID'ed sequentially from 0.
      • parentChangeOffsets

        protected final com.carrotsearch.hppc.IntArrayList parentChangeOffsets
        Offsets at which the parent tag id changes, in ascending order.
      • parentChangeIds

        protected final com.carrotsearch.hppc.IntArrayList parentChangeIds
        Tag ids; parallel array to parentChangeOffsets.
      • offsetPair

        protected final int[] offsetPair
      • nonTaggableOffsets

        protected final com.carrotsearch.hppc.IntArrayList nonTaggableOffsets
        Disjoint start and end span offsets (inclusive) of non-taggable sections. Null if none.
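
        The flat record layout of tagInfo and the parallel parentChangeOffsets/parentChangeIds arrays can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain int[] arrays instead of IntArrayList, the class and helper names (TagInfoSketch, field, lookupEnclosingTag) are hypothetical, the sample offsets assume exclusive end offsets, and the binary-search lookup is one plausible way to use the parallel arrays, not necessarily what the class does.

```java
import java.util.Arrays;

public class TagInfoSketch {
    // Assumed: each tag occupies 5 consecutive ints, in the documented order.
    static final int FIELDS_PER_TAG = 5;
    static final int PARENT = 0, OPEN_START = 1, OPEN_END = 2, CLOSE_START = 3, CLOSE_END = 4;

    // Reads one field of the record belonging to the given tag id.
    static int field(int[] tagInfo, int tag, int which) {
        return tagInfo[tag * FIELDS_PER_TAG + which];
    }

    // Hypothetical lookup: finds the last change offset <= off and returns
    // the tag id recorded there, or -1 if off precedes every change offset.
    static int lookupEnclosingTag(int[] changeOffsets, int[] changeIds, int off) {
        int idx = Arrays.binarySearch(changeOffsets, off);
        if (idx < 0) {
            idx = (-idx - 1) - 1; // insertion point, rounded down
        }
        return idx < 0 ? -1 : changeIds[idx];
    }

    public static void main(String[] args) {
        // Sample text: <a>hi <b>x</b></a>
        int[] tagInfo = {
            -1, 0, 3, 14, 18, // tag 0: <a> ... </a>, no parent
             0, 6, 9, 10, 14  // tag 1: <b> ... </b>, parent is tag 0
        };
        int[] changeOffsets = {3, 9, 14, 18};
        int[] changeIds     = {0, 1, 0, -1};

        System.out.println(field(tagInfo, 1, PARENT));                        // 0
        System.out.println(field(tagInfo, 1, CLOSE_START));                   // 10
        System.out.println(lookupEnclosingTag(changeOffsets, changeIds, 4));  // 0
        System.out.println(lookupEnclosingTag(changeOffsets, changeIds, 9));  // 1
    }
}
```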
    • Constructor Detail

      • OffsetCorrector

        protected OffsetCorrector​(String docText,
                                  boolean hasNonTaggable)
        Initialize based on the document text.
        Parameters:
        docText - non-null structured content.
        hasNonTaggable - if there may be "non-taggable" tags to track
    • Method Detail

      • correctPair

        public int[] correctPair​(int leftOffset,
                                 int rightOffset)
        Corrects the start and end offset pair. Returns null if the pair cannot be corrected, either because the offsets cannot be kept balance-able or because the span covers "non-taggable" tags. The start (left) offset is pulled left as needed over whitespace and opening tags. The end (right) offset is pulled right as needed over whitespace and closing tags. The result is returned as a 2-element array.

        Note that the returned array is reused internally; only read from it to examine the response, and do not hold on to it across calls.
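
        Because the returned array is reused, a caller that needs the values after a later call should copy them first. The sketch below illustrates only that aliasing behavior with a hypothetical stand-in class (ReusedPairSketch); its correctPair does no real correction and simply echoes its arguments into a shared buffer.

```java
import java.util.Arrays;

public class ReusedPairSketch {
    // Shared buffer, standing in for the offsetPair field.
    private final int[] offsetPair = new int[2];

    // Stand-in only: returns null for an invalid pair, else the shared buffer.
    int[] correctPair(int leftOffset, int rightOffset) {
        if (leftOffset > rightOffset) {
            return null;
        }
        offsetPair[0] = leftOffset;
        offsetPair[1] = rightOffset;
        return offsetPair;
    }

    public static void main(String[] args) {
        ReusedPairSketch corrector = new ReusedPairSketch();

        int[] first = corrector.correctPair(2, 5);
        int[] saved = first.clone();          // defensive copy of the values
        int[] second = corrector.correctPair(7, 9);

        System.out.println(first == second);          // true: same array instance
        System.out.println(Arrays.toString(first));   // [7, 9], overwritten
        System.out.println(Arrays.toString(saved));   // [2, 5], preserved
    }
}
```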

      • correctEndOffsetForCloseElement

        protected int correctEndOffsetForCloseElement​(int endOffset)
        Corrects endOffset for an adjacent closing element on the right side. E.g. offsetPair might point to:
           foo</tag>
         
        and this method pulls the end offset left to the '<'. This is necessary for use with HTMLStripCharFilter.

        See https://issues.apache.org/jira/browse/LUCENE-5734
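
        The adjustment described above can be sketched on a plain string as follows. This is a minimal illustration, not the class's actual code: the class name, the static method shape, and the explicit startOffset parameter (standing in for the left offset of offsetPair) are assumptions.

```java
public class CloseElementSketch {
    // If the character just before endOffset is '>', move the end offset
    // left to the nearest preceding '<', provided that stays after startOffset.
    static int correctEndOffsetForCloseElement(String docText, int startOffset, int endOffset) {
        if (endOffset > 0 && docText.charAt(endOffset - 1) == '>') {
            int newEndOffset = docText.lastIndexOf('<', endOffset - 2);
            if (newEndOffset > startOffset) {
                return newEndOffset;
            }
        }
        return endOffset; // nothing to correct
    }

    public static void main(String[] args) {
        String docText = "foo</tag>";
        int end = correctEndOffsetForCloseElement(docText, 0, docText.length());
        System.out.println(end);                       // 3
        System.out.println(docText.substring(0, end)); // foo
    }
}
```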

      • hasNonWhitespace

        protected boolean hasNonWhitespace​(int start,
                                           int end)
      • tagEnclosesOffset

        protected boolean tagEnclosesOffset​(int tag,
                                            int off)
      • getParentTag

        protected int getParentTag​(int tag)
      • getOpenStartOff

        protected int getOpenStartOff​(int tag)
      • getOpenEndOff

        protected int getOpenEndOff​(int tag)
      • getCloseStartOff

        protected int getCloseStartOff​(int tag)
      • getCloseEndOff

        protected int getCloseEndOff​(int tag)
      • lookupTag

        protected int lookupTag​(int off)
      • spansNonTaggable

        protected boolean spansNonTaggable​(int startOff,
                                           int endOff)
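
        As a rough illustration of how a span could be tested against the nonTaggableOffsets layout described under Field Detail (start/end pairs, inclusive, or null if none), here is a linear-scan sketch. The class name, the plain int[] input, and the treatment of endOff as exclusive are assumptions made for this example; the real method may use a different search and different boundary rules.

```java
public class NonTaggableSketch {
    // nonTaggable holds [start0, end0, start1, end1, ...], each pair inclusive.
    // Returns true if [startOff, endOff) overlaps any of those pairs.
    static boolean spansNonTaggable(int[] nonTaggable, int startOff, int endOff) {
        if (nonTaggable == null) {
            return false; // no non-taggable sections tracked
        }
        for (int i = 0; i + 1 < nonTaggable.length; i += 2) {
            int spanStart = nonTaggable[i];
            int spanEnd = nonTaggable[i + 1];
            if (startOff <= spanEnd && endOff > spanStart) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] nonTaggable = {5, 10, 20, 25};
        System.out.println(spansNonTaggable(nonTaggable, 0, 5));   // false
        System.out.println(spansNonTaggable(nonTaggable, 0, 6));   // true
        System.out.println(spansNonTaggable(nonTaggable, 22, 30)); // true
        System.out.println(spansNonTaggable(null, 0, 100));        // false
    }
}
```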