Exercise 7: Using OpenNLP and ONNX Models for Sentiment Analysis in Solr
This tutorial demonstrates how to enhance Solr with advanced Natural Language Processing (NLP) capabilities through Apache OpenNLP and ONNX. You’ll learn how to set up a sentiment analysis pipeline that automatically classifies documents during indexing.
We will use the bert-base-multilingual-uncased-sentiment model in this tutorial, though there are many others you can use. It is a bert-base-multilingual-uncased model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish, and Italian. It predicts the sentiment of a review as a number of stars (between 1 and 5).
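The processor used later in this tutorial surfaces these star classes as text labels such as "very good" and "very bad" (visible in the query results in Step 8). A minimal sketch of such a star-to-label mapping, where the middle labels are assumptions for illustration:

```python
# Illustrative mapping from the model's five star classes to text labels.
# "very good" and "very bad" match the outputs shown in this tutorial; the
# middle labels (bad/neutral/good) are assumptions for illustration.
STAR_LABELS = {1: "very bad", 2: "bad", 3: "neutral", 4: "good", 5: "very good"}

def label_for_stars(stars: int) -> str:
    """Map a predicted star rating (1-5) to its text label."""
    return STAR_LABELS[stars]

print(label_for_stars(5))  # very good
```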
Step 1: Start Solr with Required Modules
To enable NLP processing in Solr, start Solr with the analysis-extras module and package support:
$ export SOLR_SECURITY_MANAGER_ENABLED=false
$ bin/solr start -m 4g -Dsolr.modules=analysis-extras -Dsolr.packages.enabled=true
We disable the security manager to allow the ONNX runtime to load. On older JVMs Solr runs with a security manager, which you need to disable; on newer JVMs it is disabled already.
Step 2: Download the Required Model Files
For sentiment analysis, we need two essential files:
- An ONNX model file that contains the neural network
- A vocabulary file that maps tokens to IDs for the model
Let’s create a directory for our models and download them:
$ mkdir -p ./downloads/sentiment/
$ wget -O ./downloads/sentiment/model.onnx https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/resolve/main/onnx/model_quantized.onnx
$ wget -O ./downloads/sentiment/vocab.txt https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/raw/main/vocab.txt
If you do not have wget installed, you will need to adjust the above commands or download the files manually.
Step 3: Create a Collection for Sentiment Analysis
Create a new collection for our sentiment analysis experiments:
$ bin/solr create -c sentiment
Step 4: Configure the Schema
We need to add fields to our schema to store both the input text and the sentiment classification results:
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"name",
"type":"string",
"stored":true }
}' "http://localhost:8983/solr/sentiment/schema"
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"name_sentiment",
"type":"string",
"stored":true }
}' "http://localhost:8983/solr/sentiment/schema"
Step 5: Upload the Model Files to Solr’s FileStore
Solr’s FileStore provides a distributed file storage mechanism for SolrCloud. Upload our model files there:
$ curl --data-binary @./downloads/sentiment/vocab.txt -X PUT "http://localhost:8983/api/cluster/filestore/files/models/sentiment/vocab.txt"
$ curl --data-binary @./downloads/sentiment/model.onnx -X PUT "http://localhost:8983/api/cluster/filestore/files/models/sentiment/model.onnx"
Step 6: Configure the Document Categorizer Update Processor
Now we’ll configure the update processor that will analyze sentiment during document indexing:
$ curl -X POST -H 'Content-type:application/json' -d '{
"add-updateprocessor": {
"name": "sentimentClassifier",
"class": "solr.processor.DocumentCategorizerUpdateProcessorFactory",
"modelFile": "models/sentiment/model.onnx",
"vocabFile": "models/sentiment/vocab.txt",
"source": "name",
"dest": "name_sentiment"
}
}' "http://localhost:8983/solr/sentiment/config"
This configuration creates an update processor that:
- Takes text from the name field
- Processes it through the sentiment model
- Stores the sentiment classification in the name_sentiment field
Parameter | Description
---|---
modelFile | Path to the ONNX model file in the FileStore (required)
vocabFile | Path to the vocabulary file in the FileStore (required)
source | Field(s) containing text to analyze (required)
dest | Field where sentiment results will be stored (required)
Step 7: Index Documents with Sentiment Analysis
Let’s index some sample documents to see the sentiment analysis in action:
$ curl -X POST -H 'Content-type:application/json' -d '[
{
"id":"good",
"name": "that was an awesome movie!"
},
{
"id":"bad",
"name": "that movie was bad and terrible"
}
]' "http://localhost:8983/solr/sentiment/update/json?processor=sentimentClassifier&commit=true"
Notice that we specify the processor name with processor=sentimentClassifier in the URL.
Step 8: Query and Verify the Results
Query the documents to see the sentiment classifications:
$ curl -X GET "http://localhost:8983/solr/sentiment/select?q=id:good"
You should see the positive review classified as "very good":
{
"response":{"numFound":1,"start":0,"docs":[
{
"id":"good",
"name":"that was an awesome movie!",
"name_sentiment":"very good",
"_version_":1687591998864932864}]
}
}
Check the negative review:
$ curl -X GET "http://localhost:8983/solr/sentiment/select?q=id:bad"
The result should show "very bad" sentiment:
{
"response":{"numFound":1,"start":0,"docs":[
{
"id":"bad",
"name":"that movie was bad and terrible",
"name_sentiment":"very bad",
"_version_":1687591998897568768}]
}
}
Advanced Configuration Options
The DocumentCategorizerUpdateProcessorFactory supports several advanced configuration options. Here are some examples from real-world use cases:
Processing Multiple Source Fields
You can specify multiple source fields either as separate source parameters or as an array:
<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
<str name="modelFile">models/sentiment/model.onnx</str>
<str name="vocabFile">models/sentiment/vocab.txt</str>
<str name="source">title</str>
<str name="source">content</str>
<str name="dest">document_sentiment</str>
</processor>
Or using JSON configuration:
{
"add-updateprocessor": {
"name": "multiFieldSentiment",
"class": "solr.processor.DocumentCategorizerUpdateProcessorFactory",
"modelFile": "models/sentiment/model.onnx",
"vocabFile": "models/sentiment/vocab.txt",
"source": ["title", "content", "comments"],
"dest": "document_sentiment"
}
}
Using Field Pattern Matching (Regex)
You can use regular expressions to select fields to process:
<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
<str name="modelFile">models/sentiment/model.onnx</str>
<str name="vocabFile">models/sentiment/vocab.txt</str>
<lst name="source">
<str name="fieldRegex">.*_text$|comments_.*</str>
</lst>
<str name="dest">sentiment</str>
</processor>
This will process any field ending with _text or starting with comments_.
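To see what that pattern selects, here is a quick illustration in Python with hypothetical field names (Solr's actual matching happens in Java inside the update processor, but the semantics are the same for this pattern):

```python
import re

# The tutorial's pattern, matched against hypothetical field names.
pattern = re.compile(r".*_text$|comments_.*")
fields = ["body_text", "comments_user", "title", "summary_text", "author"]

selected = [f for f in fields if pattern.fullmatch(f)]
print(selected)  # ['body_text', 'comments_user', 'summary_text']
```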
Dynamic Destination Field Names
You can dynamically generate destination field names based on source field patterns:
<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
<str name="modelFile">models/sentiment/model.onnx</str>
<str name="vocabFile">models/sentiment/vocab.txt</str>
<lst name="source">
<str name="fieldRegex">review_\d+_text</str>
</lst>
<lst name="dest">
<str name="pattern">review_(\d+)_text</str>
<str name="replacement">review_$1_sentiment</str>
</lst>
</processor>
This would process fields like review_1_text and store results in corresponding fields like review_1_sentiment.
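The rewrite can be illustrated in Python, where `\1` plays the role of the `$1` back-reference in the Java-style replacement string above (a sketch of the semantics, not Solr's own code):

```python
import re

# Same pattern as the config above; Python's re.sub uses \1 where the
# Java-style replacement string uses $1.
pattern = re.compile(r"review_(\d+)_text")

def dest_field(source_field: str) -> str:
    """Derive the destination field name from a matching source field."""
    return pattern.sub(r"review_\1_sentiment", source_field)

print(dest_field("review_1_text"))   # review_1_sentiment
print(dest_field("review_42_text"))  # review_42_sentiment
```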
Field Selection with Exclusions
You can include certain fields and exclude others:
<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
<str name="modelFile">models/sentiment/model.onnx</str>
<str name="vocabFile">models/sentiment/vocab.txt</str>
<lst name="source">
<str name="fieldRegex">text.*</str>
<lst name="exclude">
<str name="fieldRegex">text_private_.*</str>
</lst>
</lst>
<str name="dest">sentiment</str>
</processor>
This selects all fields starting with text except those starting with text_private_.
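The include-then-exclude semantics can be sketched in Python with hypothetical field names (an illustration of the selection logic, not Solr's own matching code):

```python
import re

include = re.compile(r"text.*")
exclude = re.compile(r"text_private_.*")

fields = ["text_body", "text_private_notes", "text_title", "headline"]

# A field is processed if it matches the include pattern and does not
# match the exclude pattern.
selected = [f for f in fields
            if include.fullmatch(f) and not exclude.fullmatch(f)]
print(selected)  # ['text_body', 'text_title']
```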
Creating a Custom Update Processor Chain
For a permanent configuration, define an update processor chain in solrconfig.xml:
<updateRequestProcessorChain name="sentiment-analysis-chain">
<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
<str name="modelFile">models/sentiment/model.onnx</str>
<str name="vocabFile">models/sentiment/vocab.txt</str>
<str name="source">name</str>
<str name="dest">name_sentiment</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
You can then use this chain by default or explicitly reference it when indexing:
$ curl "http://localhost:8983/solr/sentiment/update/json?update.chain=sentiment-analysis-chain" -d '...'
Practical Applications of Sentiment Analysis in Solr
Faceting by Sentiment
Create facets based on sentiment to understand opinion distribution:
$ curl "http://localhost:8983/solr/sentiment/select?q=*:*&facet=true&facet.field=name_sentiment"
Filtering by Sentiment
Filter search results to show only documents with specific sentiment:
$ curl "http://localhost:8983/solr/sentiment/select?q=product_type:electronics&fq=name_sentiment:very%20good"
Boosting by Sentiment
Boost documents with positive sentiment in search results:
$ curl "http://localhost:8983/solr/sentiment/select?q=*:*&defType=edismax&bq=name_sentiment:very%20good^5.0"
Time-Based Sentiment Analysis
Analyze sentiment trends over time using time-based queries and facets:
$ curl "http://localhost:8983/solr/sentiment/select?q=*:*&facet=true&facet.range=timestamp&facet.range.start=NOW/DAY-30DAY&facet.range.end=NOW&facet.range.gap=%2B1DAY&facet.pivot=timestamp,name_sentiment"
Performance Considerations
When using ONNX models in Solr, consider these performance aspects:
- Memory Usage: ONNX models can be memory-intensive. Ensure sufficient heap space.
- Batch Processing: For large document sets, consider batching updates.
- Model Size: Quantized models (like the one in our example) offer better performance.
- CPU Utilization: NLP processing is CPU-intensive. Consider CPU resources when planning deployments. We anticipate leveraging ONNX on the GPU in the future.
- Response Time Impact: The additional processing increases indexing time but not query time.
A pattern that has been demonstrated is to index each document twice. The first pass indexes the document without any sentiment analysis, so the basic data gets into the index quickly and is made available to users. The second pass enables the update.chain, which performs the sentiment analysis.
Going Beyond Sentiment Analysis
The same approach can be extended to other NLP tasks using different models:
- Named Entity Recognition: Use OpenNLPExtractNamedEntitiesUpdateProcessorFactory to identify entities
- Language Detection: Use OpenNLPLangDetectUpdateProcessorFactory for automatic language identification
- Document Classification: Use custom models for topic or category classification
- Summarization: Extract key sentences or generate summaries during indexing
Troubleshooting
Common Issues and Solutions
- Model Loading Errors:
  - Ensure paths to model files are correct
  - Verify models are properly uploaded to the FileStore
  - Check that the security manager is configured to allow ONNX
- Out of Memory Errors:
  - Increase JVM heap space with the -m parameter
  - Use quantized models to reduce memory usage
  - Process documents in smaller batches
- Unexpected Classifications:
  - Check that text preprocessing matches model expectations
  - Ensure the vocabulary file corresponds to the model
  - Consider text normalization in your schema definition
Conclusion
In this tutorial, you learned how to:
- Configure Solr with OpenNLP and the ONNX runtime
- Load and use a pre-trained sentiment analysis model
- Set up a document categorizer update processor
- Process documents with automatic sentiment classification
- Use advanced configuration options for complex scenarios
- Apply sentiment analysis in practical search applications
This integration demonstrates how Solr can leverage modern NLP capabilities to enhance search and analytics functionality. By automatically enriching documents with sentiment information during indexing, you can provide more nuanced search experiences and gain deeper insights into your text data.