Solr Indexing

This section describes the process of indexing: adding content to a Solr index and, if necessary, modifying that content or deleting it.

By adding content to an index, we make it searchable by Solr.

A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.

Here are the three most common ways of loading data into a Solr index:

Indexing with Solr Cell and Apache Tika, built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated.
Writing a custom Java application to ingest data through Solr’s Java Client API (which is described in more detail in Client APIs). Using the Java API may be the best choice if you’re working with an application, such as a Content Management System (CMS), that offers a Java API.

Regardless of the method used to ingest data, there is a common basic data structure for data being fed into a Solr index: a document containing multiple fields, each with a name and containing content, which may be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a database), although the use of a unique ID field is not strictly required by Solr.

If the field name is defined in the Schema that is associated with the index, then the analysis steps associated with that field will be applied to its content when the content is tokenized. Fields that are not explicitly defined in the Schema will either be ignored or mapped to a dynamic field definition, if one matching the field name exists.

The Solr Example Directory

When starting Solr with the "-e" option, the example/ directory will be used as base directory for the example Solr instances that are created. This directory also includes an example/exampledocs/ subdirectory containing sample documents in a variety of formats that you can use to experiment with indexing into the various examples.

The curl Utility for Transferring Files

Many of the instructions and examples in this section make use of the curl utility for transferring content through a URL. curl posts and retrieves data over HTTP, FTP, and many other protocols. Most Linux distributions include a copy of curl. You’ll find curl downloads for Linux, Windows, and many other operating systems at http://curl.haxx.se/download.html. Documentation for curl is available here: http://curl.haxx.se/docs/manpage.html.

Using curl or other command line tools for posting data is just fine for examples or tests, but it’s not the recommended method for achieving the best performance for updates in production environments. You will achieve better performance with Solr Cell or the other methods described in this section.

Instead of curl, you can use utilities such as GNU wget (http://www.gnu.org/software/wget/) or manage GETs and POSTS with Perl, although the command line options will differ.