Exercise 4: Using ParamSets

This exercise will teach you to use ParamSets to group a number of different query parameters into a labelled grouping that you can refer to in your queries.

Getting Ready

Make sure Solr is running, following the restart steps in the films tutorial (tutorial-films.adoc#restart-solr), then continue to the next section.

Create a New Collection

$ bin/solr create -c films

Because we didn’t specify a ConfigSet, the collection uses the _default ConfigSet. We’ll explicitly define two fields whose types Solr would otherwise guess differently than we’d like:

$ curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : [
    {
      "name":"name",
      "type":"text_general",
      "multiValued":false,
      "stored":true
    },
    {
      "name":"initial_release_date",
      "type":"pdate",
      "stored":true
    }
  ]
}'

Without these explicit definitions, Solr would have guessed name to be a multi-valued string field and initial_release_date to be a multi-valued pdate field. For this data set it makes more sense for the movie name to be a single-valued, full-text searchable field, and for the release date to be single-valued as well.
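To confirm that the two definitions took effect, the Schema API can be read back. The following is a minimal sketch, assuming Solr is at localhost:8983 and the films collection was created as above; the variable names are only illustrative.

```shell
# Assumption: Solr listens on localhost:8983 and the films collection exists.
SOLR_BASE='http://localhost:8983/solr/films'

# The Schema API exposes each field definition at /schema/fields/<name>.
NAME_FIELD_URL="$SOLR_BASE/schema/fields/name"
DATE_FIELD_URL="$SOLR_BASE/schema/fields/initial_release_date"

for url in "$NAME_FIELD_URL" "$DATE_FIELD_URL"; do
  # --silent hides the progress meter; --fail turns HTTP errors into a non-zero exit.
  curl --silent --fail "$url" || echo "could not read $url" >&2
done
```

Each response should describe the field with the type given in the add-field request.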

Index the Data

Now that we have updated the schema, index the sample film data. If you have already indexed it, re-index it so that the new field definitions take effect.

Linux/Mac

$ bin/solr post -c films example/films/films.json

Windows

$ bin/solr post -c films example\films\films.json

Let’s get Searching!

Search for 'Batman':
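A minimal sketch of this search, assuming the query targets the name field defined earlier and that Solr is at localhost:8983 (the BATMAN_URL variable name is only illustrative):

```shell
# Assumption: Solr listens on localhost:8983 and the films data has been indexed.
SOLR_BASE='http://localhost:8983/solr/films'

# Field-qualified query: match the term "batman" against the name field.
BATMAN_URL="$SOLR_BASE/query?q=name:batman"

# --silent hides the progress meter; --fail turns HTTP errors into a non-zero exit.
curl --silent --fail "$BATMAN_URL" || echo "query failed: is Solr running and the data indexed?" >&2
```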

  • If you get an error about the name field not existing, you haven’t yet indexed the data.

  • If you get zero results but no error, the name field override was probably not in place when the data was first indexed, so name ended up as a "string" type, which requires an exact, case-sensitive match. The easiest fix is to reset your environment and repeat the steps, making sure each one completes successfully.

Show all 'Superhero' movies (the quotes and the space in the filter value are URL-encoded as %22 and %20):

$ curl 'http://localhost:8983/solr/films/query?q=*:*&fq=genre:%22Superhero%20movie%22'

Let’s see the distribution of genres across all the movies. See the facet section of the response for the counts:

$ curl 'http://localhost:8983/solr/films/query?q=*:*&facet=true&facet.field=genre'
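When only the genre counts matter, the matching documents can be left out of the response. The sketch below assumes the same local Solr and adds two standard request parameters that are not part of the original command: rows=0 to suppress the document list, and facet.limit=5 to keep only the five most frequent genres.

```shell
# Assumption: Solr listens on localhost:8983 and the films data has been indexed.
SOLR_BASE='http://localhost:8983/solr/films'

# rows=0 returns no documents; facet.limit=5 caps the number of facet values returned.
FACET_URL="$SOLR_BASE/query?q=*:*&rows=0&facet=true&facet.field=genre&facet.limit=5"

# --silent hides the progress meter; --fail turns HTTP errors into a non-zero exit.
curl --silent --fail "$FACET_URL" || echo "facet query failed" >&2
```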

Time for relevancy tuning with ParamSets

Now that we can query our data, let’s use ParamSets to organize our parameters into two experiments.

Search for 'harry potter':
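One plausible form of this search, sketched under the assumption that it is a plain query against the name field with no ParamSet applied (the space between the terms is URL-encoded as %20):

```shell
# Assumption: Solr listens on localhost:8983 and the films data has been indexed.
SOLR_BASE='http://localhost:8983/solr/films'

# Baseline query with the default query parser and no ParamSet.
HARRY_URL="$SOLR_BASE/query?q=name:harry%20potter"

# --silent hides the progress meter; --fail turns HTTP errors into a non-zero exit.
curl --silent --fail "$HARRY_URL" || echo "query failed" >&2
```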

Notice that the very first result is the movie Dumb & Dumberer: When Harry Met Lloyd, which clearly has nothing to do with Harry Potter.

Let’s set up two relevancy algorithms using the Config API, and then compare the quality of their results. Algorithm A uses the dismax query parser with a qf (query fields) parameter, while Algorithm B uses dismax, qf, and a minimum-should-match (mm) parameter set to 100%, so that every query term has to match.

$ curl http://localhost:8983/solr/films/config/params -X POST -H 'Content-type:application/json' --data-binary '{
  "set": {
    "algo_a":{
      "defType":"dismax",
      "qf":"name"
    }
  },
  "set": {
    "algo_b":{
      "defType":"dismax",
      "qf":"name",
      "mm":"100%"
    }
  }
}'
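To check that both ParamSets were stored, the same /config/params endpoint can be read with a GET request. This is a minimal sketch, assuming Solr is at localhost:8983; the variable name is illustrative.

```shell
# Assumption: Solr listens on localhost:8983 and the films collection exists.
SOLR_BASE='http://localhost:8983/solr/films'

# A GET on /config/params lists the ParamSets defined for the collection.
PARAMS_URL="$SOLR_BASE/config/params"

# --silent hides the progress meter; --fail turns HTTP errors into a non-zero exit.
curl --silent --fail "$PARAMS_URL" || echo "could not read ParamSets" >&2
```

The response should list algo_a and algo_b with the parameters set above.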

Search for 'harry potter' with Algorithm A:
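A ParamSet is applied to a request by naming it in the useParams parameter. A minimal sketch, assuming the algo_a ParamSet defined above and Solr at localhost:8983:

```shell
# Assumption: Solr listens on localhost:8983 and the algo_a ParamSet has been created.
SOLR_BASE='http://localhost:8983/solr/films'

# useParams=algo_a pulls in defType=dismax and qf=name, so q needs no field prefix.
ALGO_A_URL="$SOLR_BASE/query?q=harry%20potter&useParams=algo_a"

# --silent hides the progress meter; --fail turns HTTP errors into a non-zero exit.
curl --silent --fail "$ALGO_A_URL" || echo "query failed" >&2
```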

This returns five results, including the Harry Potter movies, but notice that Dumb & Dumberer: When Harry Met Lloyd still comes back.

Search for 'harry potter' with Algorithm B:
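The same query can be switched to the second experiment by changing only the useParams value. A minimal sketch, assuming the algo_b ParamSet defined above and Solr at localhost:8983:

```shell
# Assumption: Solr listens on localhost:8983 and the algo_b ParamSet has been created.
SOLR_BASE='http://localhost:8983/solr/films'

# useParams=algo_b adds mm=100% on top of dismax and qf=name,
# so both "harry" and "potter" must match in the name field.
ALGO_B_URL="$SOLR_BASE/query?q=harry%20potter&useParams=algo_b"

# --silent hides the progress meter; --fail turns HTTP errors into a non-zero exit.
curl --silent --fail "$ALGO_B_URL" || echo "query failed" >&2
```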

This returns only the four Harry Potter movies, a more precise result. Algorithm B looks better than Algorithm A, at least for this one query. You can validate that hypothesis with online A/B testing to confirm with real users that Algorithm B is better overall.

Exercise 4 Wrap Up

In this exercise, we used the Schema API to create the fields that we needed, and then learned how to organize our query parameters into named groups of parameters called ParamSets that we created using the Config API and subsequently referenced in queries.