Text Analysis and Term Vectors
Term frequency-inverse document frequency (TF-IDF) term vectors are often used to represent text documents when performing text mining and machine learning operations. The math expressions library can be used to perform text analysis and create TF-IDF term vectors.
Text Analysis
The analyze function applies a Solr analyzer to a text field and returns the tokens emitted by the analyzer in an array. Any analyzer chain that is attached to a field in Solr’s schema can be used with the analyze function.
In the example below, the text "hello world" is analyzed using the analyzer chain attached to the subject field in the schema. The subject field has the field type text_general, so the text is analyzed using the analysis chain configured for that field type.
analyze("hello world", subject)
When this expression is sent to the /stream handler it responds with:
{
  "result-set": {
    "docs": [
      {
        "return-value": [
          "hello",
          "world"
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}
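As a rough illustration of how an expression might be sent to the /stream handler from a client, the sketch below builds an HTTP POST request that carries the expression in the `expr` form parameter. The base URL `http://localhost:8983/solr`, the collection name, and the helper name `build_stream_request` are assumptions made for this example and are not taken from the text above. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_stream_request(expr, base_url="http://localhost:8983/solr",
                         collection="collection1"):
    """Build (but do not send) a POST request carrying a streaming expression.

    base_url and collection are placeholder values for this sketch.
    """
    # The expression is form-encoded in the `expr` parameter.
    body = urlencode({"expr": expr}).encode("utf-8")
    return Request(
        f"{base_url}/{collection}/stream",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_stream_request('analyze("hello world", subject)')
print(request.full_url)

# To send it to a running Solr node, something like the following could be used:
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(json.load(response))
```

Separating request construction from sending keeps the sketch runnable without a Solr node and makes the encoded expression easy to inspect.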
Annotating Documents
The analyze function can be used inside of a select function to annotate documents with the tokens generated by the analysis.
The example below performs a search in "collection1". Each tuple returned by the search function contains an id and subject. For each tuple, the select function selects the id field and calls the analyze function on the subject field.
The analyzer chain specified by the subject_bigram field is configured to perform a bigram analysis. The tokens generated by the analyze function are added to each tuple in a field called terms.
select(search(collection1, q="*:*", fl="id, subject", sort="id asc"),
       id,
       analyze(subject, subject_bigram) as terms)
Notice in the output that an array of bigram terms has been added to each tuple:
{
  "result-set": {
    "docs": [
      {
        "terms": [
          "text analysis",
          "analysis example"
        ],
        "id": "1"
      },
      {
        "terms": [
          "example number",
          "number two"
        ],
        "id": "2"
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 4
      }
    ]
  }
}
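To make the bigram annotation concrete, here is a minimal Python sketch of what a bigram analysis produces. It assumes lowercasing and whitespace tokenization, which may differ from the analyzer chain actually configured for subject_bigram. The subject values are hypothetical, chosen only because they reproduce the terms shown above.

```python
def bigrams(text):
    """Return adjacent word pairs from lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


# Hypothetical documents; the real subject values are not given in the text.
docs = [
    {"id": "1", "subject": "text analysis example"},
    {"id": "2", "subject": "example number two"},
]

# Mirror the select(...) expression: keep id and add the terms field.
annotated = [{"id": d["id"], "terms": bigrams(d["subject"])} for d in docs]
print(annotated)
```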
TF-IDF Term Vectors
The termVectors function can be used to build TF-IDF term vectors from the terms generated by the analyze function.
The termVectors function operates over a list of tuples that contain a field called id and a field called terms. Notice that this is the exact output structure of the document annotation example above.
The termVectors function builds a matrix from the list of tuples. There is a row in the matrix for each tuple in the list. There is a column in the matrix for each term in the terms field.
let(echo="c, d",
    a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
             id,
             analyze(subject, subject_bigram) as terms),
    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1),
    c=getRowLabels(b),
    d=getColumnLabels(b))
The example above builds on the document annotation example. Its main steps are:
1. The echo parameter echoes variables c and d, so the output includes the row and column labels, which are defined later in the expression.
2. The list of tuples is stored in variable a. The termVectors function operates over variable a and builds a matrix with 2 rows and 4 columns.
3. The term vectors matrix is stored in variable b, with its row and column labels set by the termVectors function. The row labels are the document ids and the column labels are the terms.
4. The getRowLabels and getColumnLabels functions return the row and column labels, which are then stored in variables c and d.
When this expression is sent to the /stream handler it responds with:
{
  "result-set": {
    "docs": [
      {
        "c": [
          "1",
          "2"
        ],
        "d": [
          "analysis example",
          "example number",
          "number two",
          "text analysis"
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 5
      }
    ]
  }
}
TF-IDF Values
The values within the term vectors matrix are the TF-IDF values for each term in each document. The example below shows the values of the matrix.
let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
             id,
             analyze(subject, subject_bigram) as terms),
    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1))
When this expression is sent to the /stream handler it responds with:
{
  "result-set": {
    "docs": [
      {
        "b": [
          [
            1.4054651081081644,
            0,
            0,
            1.4054651081081644
          ],
          [
            0,
            1.4054651081081644,
            1.4054651081081644,
            0
          ]
        ]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 5
      }
    ]
  }
}
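The sketch below shows one way such a matrix could be computed in plain Python. The weighting used here, sqrt(tf) * (ln((N + 1) / (df + 1)) + 1), is an assumption: it is a Lucene classic-similarity style formula that happens to reproduce the values shown above, not a confirmed description of how termVectors is implemented. Sorting the column labels alphabetically is likewise inferred from the output rather than documented behavior.

```python
import math


def term_vectors(tuples):
    """Return (row_labels, column_labels, matrix) for tuples with id and terms."""
    doc_count = len(tuples)

    # Document frequency: the number of tuples each term appears in.
    doc_freq = {}
    for t in tuples:
        for term in set(t["terms"]):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    columns = sorted(doc_freq)          # assumed ordering: alphabetical
    rows = [t["id"] for t in tuples]

    matrix = []
    for t in tuples:
        row = []
        for term in columns:
            tf = t["terms"].count(term)
            if tf == 0:
                row.append(0.0)
            else:
                # Assumed weighting (classic-similarity style), see lead-in.
                idf = math.log((doc_count + 1) / (doc_freq[term] + 1)) + 1.0
                row.append(math.sqrt(tf) * idf)
        matrix.append(row)
    return rows, columns, matrix


tuples = [
    {"id": "1", "terms": ["text analysis", "analysis example"]},
    {"id": "2", "terms": ["example number", "number two"]},
]
rows, columns, matrix = term_vectors(tuples)
print(rows)
print(columns)
for label, row in zip(rows, matrix):
    print(label, row)
```

With two documents and every term appearing in exactly one of them, each non-zero cell works out to ln(3/2) + 1, roughly 1.405.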
Limiting the Noise
One of the key challenges when working with term vectors is that text often has a significant amount of noise, which can obscure the important terms in the data. The termVectors function has several parameters designed to filter out the less meaningful terms. Filtering also matters because eliminating the noisy terms helps keep the term vector matrix small enough to fit comfortably in memory.
There are four parameters designed to filter noisy terms from the term vector matrix:
- minTermLength: The minimum term length required to include the term in the matrix.
- minDocFreq: The minimum fraction of documents, expressed as a number between 0 and 1, that the term must appear in to be included in the matrix.
- maxDocFreq: The maximum fraction of documents, expressed as a number between 0 and 1, that the term can appear in to be included in the matrix.
- exclude: A comma-delimited list of strings used to exclude terms. If a term contains any of the exclude strings, that term is excluded from the term vector matrix.
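The four filters can be pictured as a single predicate applied to each candidate term. The Python sketch below is illustrative only: the function name `keep_term`, its default values, and the choice of inclusive bounds are assumptions for this example, not the behavior or defaults of termVectors itself.

```python
def keep_term(term, doc_freq, doc_count, min_term_length=0,
              min_doc_freq=0.0, max_doc_freq=1.0, exclude=""):
    """Decide whether a term survives the four noise filters.

    doc_freq is the number of documents containing the term and
    doc_count is the total number of documents. Bounds are treated
    as inclusive, which is an assumption of this sketch.
    """
    # Filter 1: drop terms shorter than the minimum length.
    if len(term) < min_term_length:
        return False

    # Filters 2 and 3: drop terms that are too rare or too common.
    fraction = doc_freq / doc_count
    if fraction < min_doc_freq or fraction > max_doc_freq:
        return False

    # Filter 4: drop terms containing any of the exclude strings.
    exclude_strings = [s.strip() for s in exclude.split(",") if s.strip()]
    return not any(s in term for s in exclude_strings)


print(keep_term("text analysis", doc_freq=1, doc_count=2, min_term_length=4))
print(keep_term("the", doc_freq=1, doc_count=2, min_term_length=4))
print(keep_term("number two", doc_freq=2, doc_count=2, max_doc_freq=0.5))
print(keep_term("example number", doc_freq=1, doc_count=2, exclude="number, foo"))
```

Under these assumptions, the settings used in the earlier examples (minTermLength=4, minDocFreq=0, maxDocFreq=1) keep all four bigram terms, which is consistent with the four columns in the matrix.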