Package org.apache.solr.handler.tagger
Class OffsetCorrector
java.lang.Object
org.apache.solr.handler.tagger.OffsetCorrector
Direct Known Subclasses:
XmlOffsetCorrector
Field Summary
Fields (modifier and type, field, description):
protected final String docText: Document text.
protected final com.carrotsearch.hppc.IntArrayList nonTaggableOffsets: Disjoint start and end span offsets (inclusive) of non-taggable sections.
protected final int[] offsetPair
protected final com.carrotsearch.hppc.IntArrayList parentChangeIds: tag id; parallel array to parentChangeOffsets
protected final com.carrotsearch.hppc.IntArrayList parentChangeOffsets: offsets of parent tag id change (ascending order)
protected final com.carrotsearch.hppc.IntArrayList tagInfo: Array of tag info comprised of 5 int fields: [int parentTag, int openStartOff, int openEndOff, int closeStartOff, int closeEndOff].
Constructor Summary
Constructors (modifier, constructor, description):
protected OffsetCorrector(String docText, boolean hasNonTaggable): Initialize based on the document text.
Method Summary
Methods (modifier and type, method, description):
protected int correctEndOffsetForCloseElement(int endOffset): Correct endOffset for adjacent element at the right side.
int[] correctPair(int leftOffset, int rightOffset): Corrects the start and end offset pair.
protected int getCloseEndOff(int tag)
protected int getCloseStartOff(int tag)
protected int getOpenEndOff(int tag)
protected int getOpenStartOff(int tag)
protected int getParentTag(int tag)
protected boolean hasNonWhitespace(int start, int end)
protected int lookupTag(int off)
protected boolean spansNonTaggable(int startOff, int endOff)
protected boolean tagEnclosesOffset(int tag, int off)
Field Details
docText
protected final String docText
Document text.
tagInfo
protected final com.carrotsearch.hppc.IntArrayList tagInfo
Array of tag info comprised of 5 int fields: [int parentTag, int openStartOff, int openEndOff, int closeStartOff, int closeEndOff]. Its size indicates how many tags there are. Tags are ID'ed sequentially from 0.
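The flat five-field layout can be pictured with a minimal sketch. It assumes the record for tag N occupies indices N*5 through N*5+4 in the order listed above, that -1 marks a tag with no parent, and that end offsets are exclusive; none of this is confirmed by this page. The class name TagInfoSketch and the sample offsets are hypothetical, and a plain int[] stands in for the HPPC IntArrayList.

```java
// Sketch only: a flat int[] holding 5 ints per tag, read with index arithmetic.
final class TagInfoSketch {
    static final int FIELDS_PER_TAG = 5; // parentTag, openStart, openEnd, closeStart, closeEnd

    static int tagCount(int[] tagInfo)                 { return tagInfo.length / FIELDS_PER_TAG; }
    static int getParentTag(int[] tagInfo, int tag)    { return tagInfo[tag * FIELDS_PER_TAG]; }
    static int getOpenStartOff(int[] tagInfo, int tag) { return tagInfo[tag * FIELDS_PER_TAG + 1]; }
    static int getOpenEndOff(int[] tagInfo, int tag)   { return tagInfo[tag * FIELDS_PER_TAG + 2]; }
    static int getCloseStartOff(int[] tagInfo, int tag){ return tagInfo[tag * FIELDS_PER_TAG + 3]; }
    static int getCloseEndOff(int[] tagInfo, int tag)  { return tagInfo[tag * FIELDS_PER_TAG + 4]; }

    public static void main(String[] args) {
        // Hypothetical records for the text "<a>x<b>y</b></a>":
        // tag 0 is <a> (no parent, assumed -1), tag 1 is <b> (parent 0).
        int[] tagInfo = {
            -1, 0, 3, 12, 16,
             0, 4, 7,  8, 12
        };
        for (int tag = 0; tag < tagCount(tagInfo); tag++) {
            System.out.println("tag " + tag
                + " parent=" + getParentTag(tagInfo, tag)
                + " open=[" + getOpenStartOff(tagInfo, tag) + "," + getOpenEndOff(tagInfo, tag) + ")"
                + " close=[" + getCloseStartOff(tagInfo, tag) + "," + getCloseEndOff(tagInfo, tag) + ")");
        }
    }
}
```

Storing fixed-width records in one primitive array avoids allocating an object per tag, which is a plausible reason for this layout.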
parentChangeOffsets
protected final com.carrotsearch.hppc.IntArrayList parentChangeOffsets
Offsets of parent tag id change (ascending order).
parentChangeIds
protected final com.carrotsearch.hppc.IntArrayList parentChangeIds
Tag id; parallel array to parentChangeOffsets.
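Because parentChangeOffsets is ascending and parentChangeIds runs parallel to it, finding the tag in effect at a given offset could be done with a binary search that rounds down to the nearest change point. The following is a sketch under that assumption, not the class's confirmed implementation; ParentLookupSketch, NO_TAG, and the sample data are hypothetical, and plain int[] arrays stand in for the HPPC lists.

```java
import java.util.Arrays;

// Sketch only: find the tag id in effect at an offset using two parallel arrays.
final class ParentLookupSketch {
    static final int NO_TAG = -1; // assumed marker for "no enclosing tag"

    static int lookupTag(int[] changeOffsets, int[] changeIds, int off) {
        int idx = Arrays.binarySearch(changeOffsets, 0, changeOffsets.length, off);
        if (idx < 0) {
            // Not an exact change point: binarySearch returns -(insertionPoint) - 1,
            // so step back to the last change point at or before 'off'.
            idx = (-idx - 1) - 1;
        }
        if (idx < 0) {
            return NO_TAG; // 'off' precedes the first recorded change
        }
        return changeIds[idx];
    }

    public static void main(String[] args) {
        // Hypothetical change points: the id at index i applies from changeOffsets[i] onward.
        int[] changeOffsets = {0, 3, 7, 12, 16};
        int[] changeIds     = {-1, 0, 1, 0, -1};
        for (int off : new int[] {0, 5, 7, 11, 12, 100}) {
            System.out.println("offset " + off + " -> tag " + lookupTag(changeOffsets, changeIds, off));
        }
    }
}
```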
offsetPair
protected final int[] offsetPair
nonTaggableOffsets
protected final com.carrotsearch.hppc.IntArrayList nonTaggableOffsets
Disjoint start and end span offsets (inclusive) of non-taggable sections. Null if none.
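One way to read this field is as a flat list of [start, end] pairs, both inclusive, with a null list meaning there is nothing to check. The sketch below tests a candidate span against such a list using a simple linear scan. It assumes the candidate span is half-open, [startOff, endOff), which this page does not state; NonTaggableSketch and the sample spans are hypothetical, and the real class may use a faster search.

```java
// Sketch only: does a candidate span touch any non-taggable section?
final class NonTaggableSketch {
    /**
     * @param spans flat pairs {start0, end0, start1, end1, ...}, inclusive, or null if none
     * @param startOff candidate start (inclusive)
     * @param endOff candidate end (assumed exclusive)
     */
    static boolean spansNonTaggable(int[] spans, int startOff, int endOff) {
        if (spans == null) {
            return false; // no non-taggable sections were tracked
        }
        for (int i = 0; i + 1 < spans.length; i += 2) {
            int spanStart = spans[i];
            int spanEnd = spans[i + 1]; // inclusive
            if (startOff <= spanEnd && spanStart < endOff) {
                return true; // the two ranges share at least one character
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] spans = {10, 19, 40, 49}; // hypothetical non-taggable sections
        System.out.println(spansNonTaggable(spans, 0, 10));
        System.out.println(spansNonTaggable(spans, 0, 11));
        System.out.println(spansNonTaggable(null, 0, 100));
    }
}
```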
Constructor Details
OffsetCorrector
protected OffsetCorrector(String docText, boolean hasNonTaggable)
Initialize based on the document text.
Parameters:
docText - non-null structured content.
hasNonTaggable - if there may be "non-taggable" tags to track
Method Details
correctPair
public int[] correctPair(int leftOffset, int rightOffset)
Corrects the start and end offset pair. It returns null if it cannot do so, either because the offsets cannot be kept balance-able or because the pair spans "non-taggable" tags. The start (left) offset is pulled left as needed over whitespace and opening tags. The end (right) offset is pulled right as needed over whitespace and closing tags. The result is returned as a 2-element array.
Note that the returned array is internally reused; read its values right away rather than holding on to the reference.
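The reuse of the returned array matters for callers that collect results across several calls. The sketch below shows one defensive pattern: check for null, then copy the pair before storing it. FakeCorrector is a hypothetical stand-in that only mimics the reuse and null-return behavior described above; it performs no real offset correction and is not the Solr class.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: safely collecting results from a method that reuses its return array.
final class ReusedPairSketch {
    /** Hypothetical stand-in: returns the same internal array on every call. */
    static final class FakeCorrector {
        private final int[] offsetPair = new int[2];

        int[] correctPair(int leftOffset, int rightOffset) {
            if (leftOffset > rightOffset) {
                return null; // stands in for "could not correct"
            }
            offsetPair[0] = leftOffset;
            offsetPair[1] = rightOffset;
            return offsetPair;
        }
    }

    static List<int[]> collect(FakeCorrector corrector, int[][] inputs) {
        List<int[]> results = new ArrayList<>();
        for (int[] in : inputs) {
            int[] pair = corrector.correctPair(in[0], in[1]);
            if (pair == null) {
                continue; // skip pairs that could not be corrected
            }
            results.add(pair.clone()); // copy: the next call overwrites 'pair'
        }
        return results;
    }

    public static void main(String[] args) {
        List<int[]> results = collect(new FakeCorrector(), new int[][] {{1, 4}, {9, 2}, {5, 8}});
        for (int[] r : results) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

Without the clone() call, every stored element would refer to the same array and show only the values from the last successful call.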
correctEndOffsetForCloseElement
protected int correctEndOffsetForCloseElement(int endOffset)
Correct endOffset for adjacent element at the right side. E.g. offsetPair might point to:
foo</tag>
and this method pulls the end offset left to the '<'. This is necessary for use with HTMLStripCharFilter.
See https://issues.apache.org/jira/browse/LUCENE-5734
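The adjustment for the foo</tag> case can be sketched as follows. This is an illustration built only from the description above, assuming endOffset is exclusive and that the start offset is available to guard against moving the end before the start; CloseElementSketch and its explicit startOffset parameter are hypothetical and do not reflect the real method's signature.

```java
// Sketch only: pull an end offset back from just after '>' to the preceding '<'.
final class CloseElementSketch {
    static int correctEnd(String docText, int startOffset, int endOffset) {
        if (endOffset >= 1 && docText.charAt(endOffset - 1) == '>') {
            // Search leftward for the '<' that opens this adjacent element.
            int newEnd = docText.lastIndexOf('<', endOffset - 2);
            if (newEnd > startOffset) {
                return newEnd; // only accept it if the span stays non-empty
            }
        }
        return endOffset; // nothing to correct
    }

    public static void main(String[] args) {
        String docText = "foo</tag>";
        int corrected = correctEnd(docText, 0, docText.length());
        System.out.println(docText.substring(0, corrected));
    }
}
```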
hasNonWhitespace
protected boolean hasNonWhitespace(int start, int end)
tagEnclosesOffset
protected boolean tagEnclosesOffset(int tag, int off)
getParentTag
protected int getParentTag(int tag)
getOpenStartOff
protected int getOpenStartOff(int tag)
getOpenEndOff
protected int getOpenEndOff(int tag)
getCloseStartOff
protected int getCloseStartOff(int tag)
getCloseEndOff
protected int getCloseEndOff(int tag)
lookupTag
protected int lookupTag(int off)
spansNonTaggable
protected boolean spansNonTaggable(int startOff, int endOff)