Package org.apache.solr.handler.tagger
Class OffsetCorrector
java.lang.Object
  org.apache.solr.handler.tagger.OffsetCorrector
Direct Known Subclasses:
XmlOffsetCorrector
public abstract class OffsetCorrector extends Object
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected String | docText | Document text.
protected com.carrotsearch.hppc.IntArrayList | nonTaggableOffsets | Disjoint start and end span offsets (inclusive) of non-taggable sections.
protected int[] | offsetPair |
protected com.carrotsearch.hppc.IntArrayList | parentChangeIds | Tag id; parallel array to parentChangeOffsets.
protected com.carrotsearch.hppc.IntArrayList | parentChangeOffsets | Offsets of parent tag id change (ascending order).
protected com.carrotsearch.hppc.IntArrayList | tagInfo | Array of tag info comprised of 5 int fields: [int parentTag, int openStartOff, int openEndOff, int closeStartOff, int closeEndOff].
-
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | OffsetCorrector(String docText, boolean hasNonTaggable) | Initialize based on the document text.
-
Method Summary
Modifier and Type | Method | Description
protected int | correctEndOffsetForCloseElement(int endOffset) | Correct endOffset for adjacent element at the right side.
int[] | correctPair(int leftOffset, int rightOffset) | Corrects the start and end offset pair.
protected int | getCloseEndOff(int tag) |
protected int | getCloseStartOff(int tag) |
protected int | getOpenEndOff(int tag) |
protected int | getOpenStartOff(int tag) |
protected int | getParentTag(int tag) |
protected boolean | hasNonWhitespace(int start, int end) |
protected int | lookupTag(int off) |
protected boolean | spansNonTaggable(int startOff, int endOff) |
protected boolean | tagEnclosesOffset(int tag, int off) |
-
-
-
Field Detail
-
docText
protected final String docText
Document text.
-
tagInfo
protected final com.carrotsearch.hppc.IntArrayList tagInfo
Array of tag info comprised of 5 int fields: [int parentTag, int openStartOff, int openEndOff, int closeStartOff, int closeEndOff]. Its size indicates how many tags there are. Tags are identified sequentially from 0.
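The five-field layout can be illustrated with a small sketch. This is not the class's actual implementation: it assumes that the fields of tag N sit at positions N*5 through N*5+4 of a flat array, uses a plain int[] in place of the hppc IntArrayList, and treats end offsets as exclusive and -1 as the parent of a top-level tag. The class name TagInfoSketch and its helpers are hypothetical.

```java
// Hypothetical sketch of a flat, 5-ints-per-tag layout (assumed, not taken from Solr).
final class TagInfoSketch {
    static final int FIELDS_PER_TAG = 5;

    // Number of tags, assuming every tag occupies exactly five slots.
    static int tagCount(int[] tagInfo) { return tagInfo.length / FIELDS_PER_TAG; }

    static int parentTag(int[] tagInfo, int tag)     { return tagInfo[tag * FIELDS_PER_TAG]; }
    static int openStartOff(int[] tagInfo, int tag)  { return tagInfo[tag * FIELDS_PER_TAG + 1]; }
    static int openEndOff(int[] tagInfo, int tag)    { return tagInfo[tag * FIELDS_PER_TAG + 2]; }
    static int closeStartOff(int[] tagInfo, int tag) { return tagInfo[tag * FIELDS_PER_TAG + 3]; }
    static int closeEndOff(int[] tagInfo, int tag)   { return tagInfo[tag * FIELDS_PER_TAG + 4]; }

    public static void main(String[] args) {
        // Offsets for the text "<a><b>x</b></a>", with exclusive end offsets (an assumption).
        int[] tagInfo = {
            -1, 0, 3, 11, 15,  // tag 0: <a> ... </a>, no parent
             0, 3, 6,  7, 11   // tag 1: <b> ... </b>, parent is tag 0
        };
        System.out.println(tagCount(tagInfo));          // 2
        System.out.println(parentTag(tagInfo, 1));      // 0
        System.out.println(closeStartOff(tagInfo, 1));  // 7
    }
}
```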
-
parentChangeOffsets
protected final com.carrotsearch.hppc.IntArrayList parentChangeOffsets
Offsets of parent tag id change (ascending order).
-
parentChangeIds
protected final com.carrotsearch.hppc.IntArrayList parentChangeIds
Tag id; parallel array to parentChangeOffsets.
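One way a pair of parallel arrays like these could answer "which tag encloses this offset?" is a binary search for the last change at or before the offset. The sketch below is an assumption about how such a lookup might work, not the documented behavior of lookupTag. It uses plain int[] arrays, assumes the offsets are strictly ascending, and returns -1 when the offset precedes every change; the class name and sample values are made up.

```java
import java.util.Arrays;

// Hypothetical sketch: look up the tag id in effect at an offset using parallel arrays.
final class ParentChangeLookupSketch {

    // changeOffsets must be strictly ascending; changeIds[i] applies from changeOffsets[i] onward.
    static int lookupTag(int[] changeOffsets, int[] changeIds, int off) {
        int idx = Arrays.binarySearch(changeOffsets, off);
        if (idx < 0) {
            // Not found: binarySearch returned -(insertionPoint) - 1.
            // The entry just before the insertion point is the last change at or before off.
            idx = -idx - 2;
        }
        return idx < 0 ? -1 : changeIds[idx];
    }

    public static void main(String[] args) {
        int[] changeOffsets = { 3, 6, 11, 15 };  // made-up values
        int[] changeIds     = { 0, 1,  0, -1 };
        System.out.println(lookupTag(changeOffsets, changeIds, 0));   // -1
        System.out.println(lookupTag(changeOffsets, changeIds, 8));   // 1
        System.out.println(lookupTag(changeOffsets, changeIds, 12));  // 0
    }
}
```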
-
offsetPair
protected final int[] offsetPair
-
nonTaggableOffsets
protected final com.carrotsearch.hppc.IntArrayList nonTaggableOffsets
Disjoint start and end span offsets (inclusive) of non-taggable sections. Null if none.
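As an illustration of how disjoint inclusive spans might be checked against a candidate range, here is a minimal sketch. It assumes the list alternates start and end offsets (start0, end0, start1, end1, ...), that the candidate range has an exclusive end, and that a null list means no non-taggable sections. It uses a linear scan over a plain int[] for clarity; the class and method shown are hypothetical and not the class's spansNonTaggable implementation.

```java
// Hypothetical sketch: does [startOff, endOff) overlap any inclusive non-taggable span?
final class NonTaggableSketch {

    static boolean spansNonTaggable(int[] nonTaggable, int startOff, int endOff) {
        if (nonTaggable == null) {
            return false; // no non-taggable sections tracked
        }
        for (int i = 0; i + 1 < nonTaggable.length; i += 2) {
            int spanStart = nonTaggable[i];
            int spanEnd = nonTaggable[i + 1]; // inclusive
            // Overlap test between an exclusive-end range and an inclusive span.
            if (startOff <= spanEnd && spanStart < endOff) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] spans = { 10, 19, 40, 49 }; // two made-up non-taggable spans
        System.out.println(spansNonTaggable(spans, 0, 10));   // false
        System.out.println(spansNonTaggable(spans, 5, 11));   // true
        System.out.println(spansNonTaggable(null, 5, 11));    // false
    }
}
```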
-
-
Constructor Detail
-
OffsetCorrector
protected OffsetCorrector(String docText, boolean hasNonTaggable)
Initialize based on the document text.
Parameters:
docText - non-null structured content.
hasNonTaggable - if there may be "non-taggable" tags to track
-
-
Method Detail
-
correctPair
public int[] correctPair(int leftOffset, int rightOffset)
Corrects the start and end offset pair. It returns null if it cannot, either because the offsets cannot be kept balanced with respect to the surrounding tags or because the pair spans "non-taggable" tags. The start (left) offset is pulled left as needed over whitespace and opening tags. The end (right) offset is pulled right as needed over whitespace and closing tags. The result is returned as a 2-element array.
Note that the returned array is internally reused; use it only to examine the response, and copy the values if they must outlive the next call.
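Because the returned array is reused, callers that keep results across calls need to copy them. The sketch below shows that calling pattern with a hypothetical stand-in, ToyCorrector, which performs no real correction: it only mimics the contract described above (null on failure, a shared 2-element array otherwise). None of these names come from Solr.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of handling a null result and an internally reused return array.
final class ReusedPairSketch {

    // Stand-in for a corrector; it only mimics the null/reused-array contract.
    static final class ToyCorrector {
        private final int[] offsetPair = new int[2];

        int[] correctPair(int leftOffset, int rightOffset) {
            if (leftOffset > rightOffset) {
                return null; // stands in for "cannot be corrected"
            }
            offsetPair[0] = leftOffset;
            offsetPair[1] = rightOffset;
            return offsetPair; // same array instance on every call
        }
    }

    static List<int[]> collect(ToyCorrector corrector, int[][] inputs) {
        List<int[]> results = new ArrayList<>();
        for (int[] in : inputs) {
            int[] pair = corrector.correctPair(in[0], in[1]);
            if (pair == null) {
                continue; // skip pairs that could not be corrected
            }
            results.add(pair.clone()); // copy, since the next call overwrites the array
        }
        return results;
    }

    public static void main(String[] args) {
        List<int[]> kept = collect(new ToyCorrector(), new int[][] { {1, 4}, {9, 7}, {10, 12} });
        for (int[] p : kept) {
            System.out.println(p[0] + ".." + p[1]);
        }
    }
}
```

Without the clone() call, every element of the list would refer to the same array and show only the values from the last successful call.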
-
correctEndOffsetForCloseElement
protected int correctEndOffsetForCloseElement(int endOffset)
Correct endOffset for adjacent element at the right side. E.g. offsetPair might point to:
foo</tag>
and this method pulls the end offset left to the '<'. This is necessary for use with HTMLStripCharFilter.
See https://issues.apache.org/jira/browse/LUCENE-5734
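The adjustment described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method's actual code: the end offset is treated as exclusive, the start offset is passed in explicitly rather than read from offsetPair, and the class and method names are hypothetical.

```java
// Hypothetical sketch: pull an end offset that covers a trailing close tag back to its '<'.
final class CloseElementSketch {

    static int correctEnd(String docText, int startOffset, int endOffset) {
        // Only adjust when the span ends right after a '>' character.
        if (endOffset > 0 && docText.charAt(endOffset - 1) == '>') {
            int lt = docText.lastIndexOf('<', endOffset - 2);
            // Guard: never move the end to or before the start of the span.
            if (lt > startOffset) {
                return lt;
            }
        }
        return endOffset;
    }

    public static void main(String[] args) {
        String docText = "<tag>foo</tag>";
        int end = correctEnd(docText, 5, 14);            // span initially covers "foo</tag>"
        System.out.println(end);                         // 8
        System.out.println(docText.substring(5, end));   // foo
    }
}
```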
-
hasNonWhitespace
protected boolean hasNonWhitespace(int start, int end)
-
tagEnclosesOffset
protected boolean tagEnclosesOffset(int tag, int off)
-
getParentTag
protected int getParentTag(int tag)
-
getOpenStartOff
protected int getOpenStartOff(int tag)
-
getOpenEndOff
protected int getOpenEndOff(int tag)
-
getCloseStartOff
protected int getCloseStartOff(int tag)
-
getCloseEndOff
protected int getCloseEndOff(int tag)
-
lookupTag
protected int lookupTag(int off)
-
spansNonTaggable
protected boolean spansNonTaggable(int startOff, int endOff)
-
-