Exercise 7: Using OpenNLP and ONNX Models for Sentiment Analysis in Solr

This tutorial demonstrates how to enhance Solr with advanced Natural Language Processing (NLP) capabilities through Apache OpenNLP and ONNX. You’ll learn how to set up a sentiment analysis pipeline that automatically classifies documents during indexing.

We are going to use the bert-base-multilingual-uncased-sentiment model in this tutorial; however, there are many others you can use.

It is a bert-base-multilingual-uncased model fine-tuned for sentiment analysis on product reviews in
six languages: English, Dutch, German, French, Spanish, and Italian.
It predicts the sentiment of a review as a number of stars (between 1 and 5).

Step 1: Start Solr with Required Modules

To enable NLP processing in Solr, start Solr with the analysis-extras module and package support:

$ export SOLR_SECURITY_MANAGER_ENABLED=false
$ bin/solr start -m 4g -Dsolr.modules=analysis-extras -Dsolr.packages.enabled=true

We disable the security manager to allow loading of the ONNX runtime. On older JVMs, Solr runs with a security manager enabled and you need to disable it; on newer JVMs it is disabled by default.

Step 2: Download the Required Model Files

For sentiment analysis, we need two essential files:

  1. An ONNX model file that contains the neural network

  2. A vocabulary file that maps tokens to IDs for the model

Let’s create a directory for our models and download them:

$ mkdir -p ./downloads/sentiment/
$ wget -O ./downloads/sentiment/model.onnx https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/resolve/main/onnx/model_quantized.onnx
$ wget -O ./downloads/sentiment/vocab.txt https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/raw/main/vocab.txt

If you do not have wget installed, adjust the commands above or download the files manually.
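If curl is available instead of wget, the same downloads can be done with it (the -L flag follows Hugging Face's redirects, -o sets the output path):

```shell
# Alternative download using curl instead of wget
$ curl -L -o ./downloads/sentiment/model.onnx "https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/resolve/main/onnx/model_quantized.onnx"
$ curl -L -o ./downloads/sentiment/vocab.txt "https://huggingface.co/onnx-community/bert-base-multilingual-uncased-sentiment-ONNX/raw/main/vocab.txt"
```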

About ONNX Models

ONNX (Open Neural Network Exchange) is an open format for representing machine learning models. It allows models trained in different frameworks (like PyTorch, TensorFlow, or Hugging Face) to be exported to a standard format that can be used by various runtime environments. Solr gains access to ONNX models via OpenNLP.

The model we’re using is a multilingual BERT model fine-tuned for sentiment classification and quantized for better performance. It produces classifications on a 5-point scale from "very bad" to "very good".

Learn more about ONNX at onnx.ai.

Step 3: Create a Collection for Sentiment Analysis

Create a new collection for our sentiment analysis experiments:

$ bin/solr create -c sentiment

Step 4: Configure the Schema

We need to add fields to our schema to store both the input text and the sentiment classification results:

$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
    "name":"name",
    "type":"string",
    "stored":true }
}' "http://localhost:8983/solr/sentiment/schema"
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
    "name":"name_sentiment",
    "type":"string",
    "stored":true }
}' "http://localhost:8983/solr/sentiment/schema"
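As a sanity check, the Schema API can return each field definition we just added:

```shell
# Each request returns the stored definition of the named field
$ curl "http://localhost:8983/solr/sentiment/schema/fields/name"
$ curl "http://localhost:8983/solr/sentiment/schema/fields/name_sentiment"
```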

Step 5: Upload the Model Files to Solr’s FileStore

Solr’s FileStore provides a distributed file storage mechanism for SolrCloud. Upload our model files there:

$ curl --data-binary @./downloads/sentiment/vocab.txt -X PUT "http://localhost:8983/api/cluster/filestore/files/models/sentiment/vocab.txt"
$ curl --data-binary @./downloads/sentiment/model.onnx -X PUT "http://localhost:8983/api/cluster/filestore/files/models/sentiment/model.onnx"

Understanding Solr’s FileStore

Solr’s FileStore is a distributed file storage system that replicates files across the SolrCloud cluster. Files uploaded to the FileStore are accessible by all Solr nodes, making it ideal for storing resources like models and vocabularies.

When you reference these files in configuration, you use paths relative to the FileStore root.
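You can verify an upload by fetching the file back from the same FileStore path used above (a sketch; the exact retrieval behavior may vary between Solr versions):

```shell
# A GET on the FileStore path should return the uploaded vocabulary contents
$ curl "http://localhost:8983/api/cluster/filestore/files/models/sentiment/vocab.txt"
```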

Step 6: Configure the Document Categorizer Update Processor

Now we’ll configure the update processor that will analyze sentiment during document indexing:

$ curl -X POST -H 'Content-type:application/json' -d '{
  "add-updateprocessor": {
    "name": "sentimentClassifier",
    "class": "solr.processor.DocumentCategorizerUpdateProcessorFactory",
    "modelFile": "models/sentiment/model.onnx",
    "vocabFile": "models/sentiment/vocab.txt",
    "source": "name",
    "dest": "name_sentiment"
  }
}' "http://localhost:8983/solr/sentiment/config"

This configuration creates an update processor that:

  • Takes text from the name field

  • Processes it through the sentiment model

  • Stores the sentiment classification in the name_sentiment field
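To confirm the processor was registered, the Config API's overlay view shows components added at runtime via the /config endpoint:

```shell
# Look for "sentimentClassifier" under "updateProcessor" in the response
$ curl "http://localhost:8983/solr/sentiment/config/overlay"
```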

Table 1. Required Parameters for DocumentCategorizerUpdateProcessorFactory

  Parameter   Description
  ---------   -----------
  modelFile   Path to the ONNX model file in the FileStore (required)
  vocabFile   Path to the vocabulary file in the FileStore (required)
  source      Field(s) containing text to analyze (required)
  dest        Field where sentiment results will be stored (required)

Step 7: Index Documents with Sentiment Analysis

Let’s index some sample documents to see the sentiment analysis in action:

$ curl -X POST -H 'Content-type:application/json' -d '[
  {
    "id":"good",
    "name": "that was an awesome movie!"
  },
  {
    "id":"bad",
    "name": "that movie was bad and terrible"
  }
]' "http://localhost:8983/solr/sentiment/update/json?processor=sentimentClassifier&commit=true"

Notice that we specify the processor name with processor=sentimentClassifier in the URL.

Step 8: Query and Verify the Results

Query the documents to see the sentiment classifications:

$ curl -X GET "http://localhost:8983/solr/sentiment/select?q=id:good"

You should see the positive review classified as "very good":

{
  "response":{"numFound":1,"start":0,"docs":[
    {
      "id":"good",
      "name":"that was an awesome movie!",
      "name_sentiment":"very good",
      "_version_":1687591998864932864}]
  }
}

Check the negative review:

$ curl -X GET "http://localhost:8983/solr/sentiment/select?q=id:bad"

The result should show "very bad" sentiment:

{
  "response":{"numFound":1,"start":0,"docs":[
    {
      "id":"bad",
      "name":"that movie was bad and terrible",
      "name_sentiment":"very bad",
      "_version_":1687591998897568768}]
  }
}

Advanced Configuration Options

The DocumentCategorizerUpdateProcessorFactory supports several advanced configuration options. Here are some examples from real-world use cases:

Processing Multiple Source Fields

You can specify multiple source fields either as separate source parameters or as an array:

<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
  <str name="modelFile">models/sentiment/model.onnx</str>
  <str name="vocabFile">models/sentiment/vocab.txt</str>
  <str name="source">title</str>
  <str name="source">content</str>
  <str name="dest">document_sentiment</str>
</processor>

Or using JSON configuration:

{
  "add-updateprocessor": {
    "name": "multiFieldSentiment",
    "class": "solr.processor.DocumentCategorizerUpdateProcessorFactory",
    "modelFile": "models/sentiment/model.onnx",
    "vocabFile": "models/sentiment/vocab.txt",
    "source": ["title", "content", "comments"],
    "dest": "document_sentiment"
  }
}

Using Field Pattern Matching (Regex)

You can use regular expressions to select fields to process:

<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
  <str name="modelFile">models/sentiment/model.onnx</str>
  <str name="vocabFile">models/sentiment/vocab.txt</str>
  <lst name="source">
    <str name="fieldRegex">.*_text$|comments_.*</str>
  </lst>
  <str name="dest">sentiment</str>
</processor>

This will process any field ending with _text or starting with comments_.

Dynamic Destination Field Names

You can dynamically generate destination field names based on source field patterns:

<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
  <str name="modelFile">models/sentiment/model.onnx</str>
  <str name="vocabFile">models/sentiment/vocab.txt</str>
  <lst name="source">
    <str name="fieldRegex">review_\d+_text</str>
  </lst>
  <lst name="dest">
    <str name="pattern">review_(\d+)_text</str>
    <str name="replacement">review_$1_sentiment</str>
  </lst>
</processor>

This would process fields like review_1_text and store results in corresponding fields like review_1_sentiment.

Field Selection with Exclusions

You can include certain fields and exclude others:

<processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
  <str name="modelFile">models/sentiment/model.onnx</str>
  <str name="vocabFile">models/sentiment/vocab.txt</str>
  <lst name="source">
    <str name="fieldRegex">text.*</str>
    <lst name="exclude">
      <str name="fieldRegex">text_private_.*</str>
    </lst>
  </lst>
  <str name="dest">sentiment</str>
</processor>

This selects all fields starting with text except those starting with text_private_.

Creating a Custom Update Processor Chain

For a permanent configuration, define an update processor chain in solrconfig.xml:

<updateRequestProcessorChain name="sentiment-analysis-chain">
  <processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
    <str name="modelFile">models/sentiment/model.onnx</str>
    <str name="vocabFile">models/sentiment/vocab.txt</str>
    <str name="source">name</str>
    <str name="dest">name_sentiment</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

You can then use this chain by default or explicitly reference it when indexing:

$ curl "http://localhost:8983/solr/sentiment/update/json?update.chain=sentiment-analysis-chain" -d '...'
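To have the chain applied to every update without passing update.chain, it can be marked as the default chain in solrconfig.xml using the standard default attribute:

```xml
<updateRequestProcessorChain name="sentiment-analysis-chain" default="true">
  <processor class="solr.processor.DocumentCategorizerUpdateProcessorFactory">
    <str name="modelFile">models/sentiment/model.onnx</str>
    <str name="vocabFile">models/sentiment/vocab.txt</str>
    <str name="source">name</str>
    <str name="dest">name_sentiment</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```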

Practical Applications of Sentiment Analysis in Solr

Faceting by Sentiment

Create facets based on sentiment to understand opinion distribution:

$ curl "http://localhost:8983/solr/sentiment/select?q=*:*&facet=true&facet.field=name_sentiment"
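The same distribution can be computed with the JSON Facet API, which is often easier to extend with nested facets:

```shell
# Terms facet over the sentiment field via the json.facet parameter
$ curl "http://localhost:8983/solr/sentiment/select" -d 'q=*:*&rows=0&json.facet={sentiments:{type:terms,field:name_sentiment}}'
```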

Filtering by Sentiment

Filter search results to show only documents with specific sentiment:

$ curl "http://localhost:8983/solr/sentiment/select?q=product_type:electronics&fq=name_sentiment:very%20good"

Boosting by Sentiment

Boost documents with positive sentiment in search results:

$ curl "http://localhost:8983/solr/sentiment/select?q=*:*&defType=edismax&bq=name_sentiment:very%20good^5.0"

Time-Based Sentiment Analysis

Analyze sentiment trends over time using time-based queries and facets:

$ curl "http://localhost:8983/solr/sentiment/select?q=*:*&facet=true&facet.range=timestamp&facet.range.start=NOW/DAY-30DAY&facet.range.end=NOW&facet.range.gap=%2B1DAY&facet.pivot=timestamp,name_sentiment"

Performance Considerations

When using ONNX models in Solr, consider these performance aspects:

  • Memory Usage: ONNX models can be memory-intensive. Ensure sufficient heap space.

  • Batch Processing: For large document sets, consider batching updates.

  • Model Size: Quantized models (like the one in our example) offer better performance.

  • CPU Utilization: NLP processing is CPU-intensive. Consider CPU resources when planning deployments; running ONNX on the GPU is anticipated in the future.

  • Response Time Impact: The additional processing increases indexing time but not query time.

One demonstrated pattern is to index each document twice. The first pass indexes the document without any sentiment analysis, so the basic data gets into the index quickly and is available to users. The second pass enables the update.chain, which performs the sentiment analysis.
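That two-pass pattern might look like the following sketch, which re-sends the same documents with the chain enabled on the second pass (docs.json here is a hypothetical file holding your documents):

```shell
# Pass 1: fast indexing, no NLP processing
$ curl -X POST -H 'Content-type:application/json' -d @docs.json "http://localhost:8983/solr/sentiment/update/json?commit=true"

# Pass 2: re-index the same documents through the sentiment analysis chain
$ curl -X POST -H 'Content-type:application/json' -d @docs.json "http://localhost:8983/solr/sentiment/update/json?update.chain=sentiment-analysis-chain&commit=true"
```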

Going Beyond Sentiment Analysis

The same approach can be extended to other NLP tasks using different models:

  • Named Entity Recognition: Use OpenNLPExtractNamedEntitiesUpdateProcessorFactory to identify entities

  • Language Detection: Use OpenNLPLangDetectUpdateProcessorFactory for automatic language identification

  • Document Classification: Use custom models for topic or category classification

  • Summarization: Extract key sentences or generate summaries during indexing
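As one example, language detection can be configured much like the sentiment processor. This sketch uses the langid.fl, langid.langField, and langid.model parameters of Solr's OpenNLP language-detect processor; the field names and model filename are illustrative:

```xml
<updateRequestProcessorChain name="langid-chain">
  <processor class="solr.OpenNLPLangDetectUpdateProcessorFactory">
    <str name="langid.fl">name</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.model">langdetect-183.bin</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```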

Troubleshooting

Common Issues and Solutions

  1. Model Loading Errors:

    • Ensure paths to model files are correct

    • Verify models are properly uploaded to the FileStore

    • Check that the security manager is configured to allow ONNX

  2. Out of Memory Errors:

    • Increase JVM heap space with the -m start parameter

    • Use quantized models to reduce memory usage

    • Process documents in smaller batches

  3. Unexpected Classifications:

    • Check that text preprocessing matches model expectations

    • Ensure vocabulary file corresponds to the model

    • Consider text normalization in your schema definition

Conclusion

In this tutorial, you learned how to:

  1. Configure Solr with OpenNLP and ONNX runtime

  2. Load and use a pre-trained sentiment analysis model

  3. Set up a document categorizer update processor

  4. Process documents with automatic sentiment classification

  5. Use advanced configuration options for complex scenarios

  6. Apply sentiment analysis in practical search applications

This integration demonstrates how Solr can leverage modern NLP capabilities to enhance search and analytics functionality. By automatically enriching documents with sentiment information during indexing, you can provide more nuanced search experiences and gain deeper insights into your text data.

Cleaning Up

When you’re done with this tutorial, stop Solr:

$ bin/solr stop --all