The Data Import Handler is deprecated and will be removed in 9.0. This functionality is being migrated to a new 3rd party plugin available at https://github.com/rohitbemax/dataimporthandler. See the section Package Manager for information about Solr’s plugin framework.

Many search applications store the content to be indexed in a structured data store, such as a relational database. The Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it.

In addition to relational databases, DIH can index content from HTTP-based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate fields.

## DIH Concepts and Terminology

Descriptions of the Data Import Handler use several familiar terms, such as entity and processor, in specific ways, as explained in the table below.

Datasource
As its name suggests, a datasource defines the location of the data of interest. For a database, it’s a DSN. For an HTTP datasource, it’s the base URL.
Entity
Conceptually, an entity is processed to generate a set of documents, containing multiple fields, which (after optionally being transformed in various ways) are sent to Solr for indexing. For an RDBMS data source, an entity is a view or table, which would be processed by one or more SQL statements to generate a set of rows (documents) with one or more columns (fields).
Processor
An entity processor does the work of extracting content from a data source, transforming it, and adding it to the index. Custom entity processors can be written to extend or replace the ones supplied.
Transformer
Each set of fields fetched by the entity may optionally be transformed. This process can modify the fields, create new fields, or generate multiple rows/documents from a single row. There are several built-in transformers in the DIH, which perform functions such as modifying dates and stripping HTML. It is possible to write custom transformers using the publicly available interface.

## Solr’s DIH Examples

The example/example-DIH directory contains several collections to demonstrate many of the features of the data import handler. These are available with the dih example from the Solr Control Script:

bin/solr -e dih

This launches a standalone Solr instance with several collections that correspond to detailed examples. The available examples are atom, db, mail, solr, and tika.

All examples in this section assume you are running the DIH example server.

## Configuring DIH

### Configuring solrconfig.xml for DIH

The Data Import Handler has to be registered in solrconfig.xml. For example:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/path/to/my/DIHconfigfile.xml</str>
  </lst>
</requestHandler>

The only required parameter is the config parameter, which specifies the location of the DIH configuration file that contains specifications for the data source, how to fetch data, what data to fetch, and how to process it to generate the Solr documents to be posted to the index.

You can have multiple DIH configuration files. Each file would require a separate definition in the solrconfig.xml file, specifying a path to the file.
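For instance, two handlers pointing at separate configuration files might be registered like this (a sketch; the handler names and file paths are hypothetical):

```xml
<!-- Sketch: registering two DIH instances, each with its own
     configuration file. The names and paths are hypothetical. -->
<requestHandler name="/dataimport-products" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">product-data-config.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport-vendors" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">vendor-data-config.xml</str>
  </lst>
</requestHandler>
```

Each handler is then invoked at its own path, e.g., /dataimport-products.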

### Configuring the DIH Configuration File

An annotated configuration file, based on the db collection in the dih example server, is shown below (this file is located in example/example-DIH/solr/db/conf/db-data-config.xml).

This example shows how to extract fields from four tables defining a simple product database. More information about the parameters and options shown here will be described in the sections following.

<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:./example-DIH/hsqldb/ex"
              user="sa" password="secret"/>
  <document>
    <entity name="item" query="select * from item"
            deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
      <field column="NAME" name="name" />

      <entity name="feature"
              query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
        <field name="features" column="DESCRIPTION" />
      </entity>

      <entity name="item_category"
              query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">

        <entity name="category"
                query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'"
                deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}">
          <field column="description" name="cat" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
1. The first element is the dataSource, in this case an HSQLDB database. The path to the JDBC driver and the JDBC URL and login credentials are all specified here. Other permissible attributes include whether or not to autocommit to Solr, the batchSize used in the JDBC connection, and a readOnly flag.
2. The password attribute is optional if there is no password set for the DB. Alternately, the password can be encrypted; the section Encrypting a Database Password below describes how to do this.
3. A document element follows, containing multiple entity elements. Note that entity elements can be nested, which allows the entity relationships in the sample database to be mirrored here, so that we can generate a denormalized Solr record which may include multiple features for one item, for instance.
4. The possible attributes for the entity element are described in later sections. Entity elements may contain one or more field elements, which map the data source field names to Solr fields, and optionally specify per-field transformations. This entity is the root entity.
5. This entity is nested and reflects the one-to-many relationship between an item and its multiple features. Note the use of variables; ${item.ID} is the value of the column 'ID' for the current item (item referring to the entity name).

Datasources can still be specified in solrconfig.xml. These must be specified in the defaults section of the handler in solrconfig.xml. However, these are not parsed until the main configuration is loaded.

The entire configuration itself can be passed as a request parameter using the dataConfig parameter rather than using a file. When configuration errors are encountered, the error message is returned in XML format. Due to security concerns, this only works if you start Solr with -Denable.dih.dataConfigParam=true.
A reload-config command is also supported, which is useful for validating a new configuration file, or if you want to specify a file, load it, and not have it reloaded again on import. If there is an XML mistake in the configuration, a user-friendly message is returned in XML format. You can then fix the problem and run reload-config again.

You can also view the DIH configuration in the Solr Admin UI from the Dataimport Screen. It includes an interface to import content.

#### DIH Request Parameters

Request parameters can be substituted in the configuration with the placeholder ${dataimporter.request.paramname}, as in this example:

<dataSource driver="org.hsqldb.jdbcDriver"
            url="${dataimporter.request.jdbcurl}"
            user="${dataimporter.request.jdbcuser}"
            password="${dataimporter.request.jdbcpassword}"/>
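These parameters can then be supplied on the import request itself, or in the defaults section of the handler. A sketch, reusing the placeholder names above (jdbcpassword is an assumed companion to the other two):

```xml
<!-- Sketch: supplying substituted request parameters as handler defaults
     in solrconfig.xml. The parameter names mirror the placeholders above;
     the values shown are hypothetical. -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/path/to/my/DIHconfigfile.xml</str>
    <str name="jdbcurl">jdbc:hsqldb:./example-DIH/hsqldb/ex</str>
    <str name="jdbcuser">sa</str>
  </lst>
</requestHandler>
```

Alternatively, the same names can be appended as query parameters to the import request, e.g., /dataimport?command=full-import&jdbcuser=sa.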

### URLDataSource

This data source is often used with XPathEntityProcessor to fetch content from an underlying file:// or http:// location. Here’s an example:

<dataSource name="a"
            type="URLDataSource"
            baseUrl="http://host:port/"
            encoding="UTF-8"
            connectionTimeout="5000"
            readTimeout="10000"/>
The URLDataSource type accepts these optional parameters:

baseUrl
Specifies a new baseUrl for pathnames. You can use this to specify host/port changes between Dev/QA/Prod environments. Using this attribute isolates the changes to be made to the solrconfig.xml.
connectionTimeout
Specifies the length of time in milliseconds after which the connection should time out. The default value is 5000ms.
encoding
By default the encoding in the response header is used. You can use this property to override the default encoding.
readTimeout
Specifies the length of time in milliseconds after which a read operation should time out. The default value is 10000ms.

## Entity Processors

Entity processors extract data, transform it, and add it to a Solr index. Examples of entities include views or tables in a data store.

Each processor has its own set of attributes, described in its own section below. In addition, there are several attributes common to all entities which may be specified:

dataSource
The name of a data source. If there are multiple data sources defined, use this attribute with the name of the data source for this entity.
name
Required. The unique name used to identify an entity.
pk

The primary key for the entity. It is optional, and required only when using delta-imports. It has no relation to the uniqueKey defined in schema.xml but they can both be the same.

This attribute is mandatory if you do delta-imports and then refer to the column name in ${dataimporter.delta.<column-name>} which is used as the primary key.

processor
Default is SqlEntityProcessor. Required only if the datasource is not RDBMS.
onError
Defines what to do if an error is encountered. Permissible values are:

abort
Stops the import.
skip
Skips the current document.
continue
Ignores the error and processing continues.

preImportDeleteQuery
Before a full-import command, use this query to clean up the index instead of using *:*. This is honored only on an entity that is an immediate sub-child of <document>.
postImportDeleteQuery
Similar to preImportDeleteQuery, but it executes after the import has completed.
rootEntity
By default the entities immediately under <document> are root entities. If this attribute is set to false, the entity directly falling under that entity will be treated as the root entity (and so on). For every row returned by the root entity, a document is created in Solr.
transformer
Optional. One or more transformers to be applied on this entity.
cacheImpl
Optional. A class (which must implement DIHCache) to use for caching this entity when doing lookups from an entity which wraps it. The provided implementation is SortedMapBackedCache.
cacheKey
The name of a property of this entity to use as a cache key if cacheImpl is specified.
cacheLookup
An entity + property name that will be used to look up cached instances of this entity if cacheImpl is specified.
where
An alternative way to specify cacheKey and cacheLookup concatenated with '='. For example, where="CODE=People.COUNTRY_CODE" is equivalent to cacheKey="CODE" cacheLookup="People.COUNTRY_CODE".
child="true"
Enables indexing document blocks, aka Nested Child Documents, for searching with Block Join Query Parsers. It can only be specified on an <entity> element under another root entity. It switches from the default behavior (merging field values) to nesting documents as child documents. Note: the parent <entity> should add a field which is used as a parent filter at query time.
join="zipper"
Enables merge join, aka the "zipper" algorithm, for joining parent and child entities without a cache. It should be specified on the child (nested) <entity>. It implies that parent and child queries return results ordered by keys; otherwise it throws an exception. Keys should be specified either with the where attribute or with cacheKey and cacheLookup.

### Entity Caching

Caching of entities in DIH is provided to avoid repeated lookups for the same entities again and again. The default SortedMapBackedCache is a HashMap where a key is a field in the row and the value is a bunch of rows for that same key.

In the example below, each manufacturer entity is cached using the id property as a cache key. Cache lookups will be performed for each product entity based on the product's manu property. When the cache has no data for a particular key, the query is run and the cache is populated.

<entity name="product" query="select description,sku, manu from product" >
  <entity name="manufacturer" query="select id, name from manufacturer"
          cacheKey="id" cacheLookup="product.manu" cacheImpl="SortedMapBackedCache"/>
</entity>

### The SQL Entity Processor

The SqlEntityProcessor is the default processor. The associated JdbcDataSource should be a JDBC URL.

The entity attributes specific to this processor are shown in the table below. These are in addition to the attributes common to all entity processors described above.

query
Required. The SQL query used to select rows.
deltaQuery
SQL query used if the operation is delta-import. This query selects the primary keys of the rows which will be parts of the delta-update. The pks will be available to the deltaImportQuery through the variable ${dataimporter.delta.<column-name>}.
parentDeltaQuery
SQL query used if the operation is delta-import.
deletedPkQuery
SQL query used if the operation is delta-import.
deltaImportQuery

SQL query used if the operation is delta-import. If this is not present, DIH tries to construct the import query by (after identifying the delta) modifying the 'query' (this is error prone).

There is a namespace ${dataimporter.delta.<column-name>} which can be used in this query. For example, select * from tbl where id=${dataimporter.delta.id}.
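Putting these attributes together, a minimal delta-capable entity might look like the following sketch (the table and column names are hypothetical, patterned after the db example earlier):

```xml
<!-- Sketch: an entity supporting both full-import and delta-import.
     "item", "ID", and "NAME" are hypothetical. Run it with
     command=delta-import to pick up only changed rows. -->
<entity name="item" pk="ID"
        query="select * from item"
        deltaQuery="select ID from item where last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where ID='${dataimporter.delta.ID}'">
  <field column="NAME" name="name"/>
</entity>
```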

### The XPathEntityProcessor

This processor is used when indexing XML formatted data. The data source is typically URLDataSource or FileDataSource. XPath can also be used with the FileListEntityProcessor described below, to generate a document from each file.

The entity attributes unique to this processor are shown below. These are in addition to the attributes common to all entity processors described above.

processor
Required. Must be set to XPathEntityProcessor.
url
Required. The HTTP URL or file location.
stream
Optional. Set this to true for a large file or download.
forEach
Required unless you define useSolrAddSchema. The XPath expression which demarcates each record. This will be used to set up the processing loop.
xsl
Optional. Its value (a URL or filesystem path) is the name of a resource used as a preprocessor for applying the XSL transformation.
useSolrAddSchema
Set this to true if the content is in the form of the standard Solr update XML schema.

Each <field> element in the entity can have the following attributes as well as the default ones.

xpath
Required. The XPath expression which will extract the content from the record for this field. Only a subset of XPath syntax is supported.
commonField
Optional. If true, then when this field is encountered in a record it will be copied to future records when creating a Solr document.
flatten

Optional. If set to true, then any children text nodes are collected to form the value of a field.

 The default value is false, meaning that if there are any sub-elements of the node pointed to by the XPath expression, they will be quietly omitted.

Here is an example from the atom collection in the dih example (data-config file found at example/example-DIH/solr/atom/conf/atom-data-config.xml):

<dataConfig>
<dataSource type="URLDataSource"/>
<document>

<entity name="stackoverflow"
url="https://stackoverflow.com/feeds/tag/solr"
processor="XPathEntityProcessor"
forEach="/feed|/feed/entry"
transformer="HTMLStripTransformer,RegexTransformer">

<!-- Pick this value up from the feed level and apply to all documents -->
<field column="lastchecked_dt" xpath="/feed/updated" commonField="true"/>

<!-- Keep only the final numeric part of the URL -->
<field column="id" xpath="/feed/entry/id" regex=".*/" replaceWith=""/>

<field column="title"    xpath="/feed/entry/title"/>
<field column="author"   xpath="/feed/entry/author/name"/>
<field column="category" xpath="/feed/entry/category/@term"/>

<!-- Use transformers to convert HTML into plain text.
There is also an UpdateRequestProcessor to trim remaining spaces.
-->
<field column="summary" xpath="/feed/entry/summary" stripHTML="true" regex="( |\n)+" replaceWith=" "/>

<!-- Ignore namespaces when matching XPath -->
<field column="rank" xpath="/feed/entry/rank"/>

<field column="published_dt" xpath="/feed/entry/published"/>
<field column="updated_dt" xpath="/feed/entry/updated"/>
</entity>

</document>
</dataConfig>

### The MailEntityProcessor

The MailEntityProcessor uses the Java Mail API to index email messages using the IMAP protocol.

The MailEntityProcessor works by connecting to a specified mailbox using a username and password, fetching the email headers for each message, and then fetching the full email contents to construct a document (one document for each mail message).

The entity attributes unique to the MailEntityProcessor are shown below. These are in addition to the attributes common to all entity processors described above.

processor
Required. Must be set to MailEntityProcessor.
user
Required. Username for authenticating to the IMAP server; this is typically the email address of the mailbox owner.
password
Required. Password for authenticating to the IMAP server.
host
Required. The IMAP server to connect to.
protocol
Required. The IMAP protocol to use, valid values are: imap, imaps, gimap, and gimaps.
fetchMailsSince
Optional. Date/time used to set a filter to import messages that occur after the specified date; expected format is: yyyy-MM-dd HH:mm:ss.
folders
Required. Comma-delimited list of folder names to pull messages from, such as "inbox".
recurse
Optional. Default is true. Flag to indicate if the processor should recurse all child folders when looking for messages to import.
include
Optional. Comma-delimited list of folder patterns to include when processing folders (can be a literal value or regular expression).
exclude
Optional. Comma-delimited list of folder patterns to exclude when processing folders (can be a literal value or regular expression). Excluded folder patterns take precedence over include folder patterns.
processAttachement or processAttachments
Optional. Default is true. Use Tika to process message attachments.
includeContent
Optional. Default is true. Include the message body when constructing Solr documents for indexing.

Here is an example from the mail collection of the dih example (data-config file found at example/example-DIH/solr/mail/conf/mail-data-config.xml):

<dataConfig>
<document>
<entity processor="MailEntityProcessor"
user="email@gmail.com"
host="imap.gmail.com"
protocol="imaps"
fetchMailsSince="2014-06-30 00:00:00"
batchSize="20"
folders="inbox"
processAttachement="false"
name="mail_entity"/>
</document>
</dataConfig>

#### Importing New Emails Only

After running a full import, the MailEntityProcessor keeps track of the timestamp of the previous import so that subsequent imports can use the fetchMailsSince filter to only pull new messages from the mail server. This occurs automatically using the DataImportHandler dataimport.properties file (stored in conf).

For instance, if you set fetchMailsSince="2014-08-22 00:00:00" in your mail-data-config.xml, then all mail messages that occur after this date will be imported on the first run of the importer. Subsequent imports will use the date of the previous import as the fetchMailsSince filter, so that only new emails since the last import are indexed each time.

#### GMail Extensions

When connecting to a GMail account, you can improve the efficiency of the MailEntityProcessor by setting the protocol to gimap or gimaps.

This allows the processor to send the fetchMailsSince filter to the GMail server to have the date filter applied on the server, which means the processor only receives new messages from the server. However, GMail only supports date granularity, so the server-side filter may return previously seen messages if run more than once a day.

### The TikaEntityProcessor

The TikaEntityProcessor uses Apache Tika to process incoming documents. This is similar to Uploading Data with Solr Cell using Apache Tika, but using DataImportHandler options instead.

The parameters for this processor are described in the table below. These are in addition to the attributes common to all entity processors described above.

dataSource

This parameter defines the data source and an optional name which can be referred to in later parts of the configuration if needed. This is the same dataSource explained in the description of general entity processor attributes above.

The available data source types for this processor are:

• BinURLDataSource: used for HTTP resources, but can also be used for files.

• BinFileDataSource: used for content on the local filesystem.

url
Required. The path to the source file(s), as a file path or a traditional internet URL.
htmlMapper

Optional. Allows control of how Tika parses HTML. If this parameter is defined, it must be either default or identity; if it is absent, "default" is assumed.

The "default" mapper strips much of the HTML from documents while the "identity" mapper passes all HTML as-is with no modifications.

format
The output format. The options are text, xml, html or none. The default is "text" if not defined. The format "none" can be used if metadata only should be indexed and not the body of the documents.
parser
Optional. The default parser is org.apache.tika.parser.AutoDetectParser. If a custom or other parser should be used, it should be entered as a fully-qualified name of the class and path.
fields
The list of fields from the input documents and how they should be mapped to Solr fields. If the attribute meta is defined as "true", the field will be obtained from the metadata of the document and not parsed from the body of the main text.
extractEmbedded
Instructs the TikaEntityProcessor to extract embedded documents or attachments when true. If false, embedded documents and attachments will be ignored.
onError
By default, the TikaEntityProcessor will stop processing documents if it finds one that generates an error. If you define onError to "skip", the TikaEntityProcessor will instead skip documents that fail processing and log a message that the document was skipped.

Here is an example from the tika collection of the dih example (data-config file found in example/example-DIH/solr/tika/conf/tika-data-config.xml):

<dataConfig>
<dataSource type="BinFileDataSource"/>
<document>
<entity name="file" processor="FileListEntityProcessor" dataSource="null"
        baseDir="${solr.install.dir}/example/exampledocs" fileName=".*pdf"
        rootEntity="false">

  <field column="file" name="id"/>

  <entity name="pdf" processor="TikaEntityProcessor"
          url="${file.fileAbsolutePath}" format="text">

<field column="Author" name="author" meta="true"/>
<!-- in the original PDF, the Author meta-field name is upper-cased,
but in Solr schema it is lower-cased
-->

<field column="title" name="title" meta="true"/>
<field column="dc:format" name="format" meta="true"/>

<field column="text" name="text"/>

</entity>
</entity>
</document>
</dataConfig>

### The FileListEntityProcessor

This processor is basically a wrapper, and is designed to generate a set of files satisfying conditions specified in the attributes which can then be passed to another processor, such as the XPathEntityProcessor.

The entity information for this processor would be nested within the FileListEntity entry. It generates five implicit fields: fileAbsolutePath, fileDir, fileSize, fileLastModified, and file, which can be used in the nested processor. This processor does not use a data source.

The attributes specific to this processor are described in the table below:

fileName
Required. A regular expression pattern to identify files to be included.
baseDir
Required. The base directory (absolute path).
recursive
Whether to search directories recursively. Default is 'false'.
excludes
A regular expression pattern to identify files which will be excluded.
newerThan
A date in the format yyyy-MM-dd HH:mm:ss or a date math expression (NOW - 2YEARS).
olderThan
A date, using the same formats as newerThan.
rootEntity
This should be set to false. This ensures that each row (filepath) emitted by this processor is considered to be a document.
dataSource
Must be set to null.

The example below shows the combination of the FileListEntityProcessor with another processor which will generate a set of fields from each file found.

<dataConfig>
<dataSource type="FileDataSource"/>
<document>
<!-- this outer processor generates a list of files satisfying the conditions
specified in the attributes -->
<entity name="f" processor="FileListEntityProcessor"
fileName=".*xml"
recursive="true"
rootEntity="false"
dataSource="null"
baseDir="/my/document/directory">

<!-- this processor extracts content using XPath from each file found -->

<entity name="nested" processor="XPathEntityProcessor"
        forEach="/rootelement" url="${f.fileAbsolutePath}">
  <field column="name" xpath="/rootelement/name"/>
  <field column="number" xpath="/rootelement/number"/>
</entity>
</entity>
</document>
</dataConfig>

### LineEntityProcessor

This EntityProcessor reads all content from the data source on a line by line basis and returns a field called rawLine for each line read. The content is not parsed in any way; however, you may add transformers to manipulate the data within the rawLine field, or to create other additional fields.

The lines read can be filtered by two regular expressions specified with the acceptLineRegex and omitLineRegex attributes.

The LineEntityProcessor has the following attributes:

url
A required attribute that specifies the location of the input file in a way that is compatible with the configured data source. If this value is relative and you are using FileDataSource or URLDataSource, it is assumed to be relative to baseLoc.
acceptLineRegex
An optional attribute that, if present, discards any line which does not match the regular expression.
omitLineRegex
An optional attribute that is applied after any acceptLineRegex and that discards any line which matches this regular expression.

For example:

<entity name="jc" processor="LineEntityProcessor" acceptLineRegex="^.*\.xml$"
omitLineRegex="/obsolete"
url="file:///Volumes/ts/files.lis"
rootEntity="false"
transformer="RegexTransformer,DateFormatTransformer">
</entity>

While there are use cases where you might need to create a Solr document for each line read from a file, it is expected that in most cases that the lines read by this processor will consist of a pathname, which in turn will be consumed by another entity processor, such as the XPathEntityProcessor.
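A sketch of that pattern, assuming each line of a (hypothetical) list file is itself the location of an XML document; the outer entity's rawLine field feeds the url of the nested one:

```xml
<!-- Sketch: each line of files.lis (hypothetical) is a path consumed by a
     nested XPathEntityProcessor via ${jc.rawLine}. The data source names
     and the /record structure are assumptions for illustration. -->
<entity name="jc" processor="LineEntityProcessor"
        url="file:///data/files.lis" rootEntity="false"
        dataSource="fileReader">
  <entity name="rec" processor="XPathEntityProcessor"
          url="${jc.rawLine}" forEach="/record" dataSource="urlReader">
    <field column="title" xpath="/record/title"/>
  </entity>
</entity>
```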

### PlainTextEntityProcessor

This EntityProcessor reads all content from the data source into a single implicit field called plainText. The content is not parsed in any way; however, you may add transformers to manipulate the data within the plainText field as needed, or to create other additional fields.

For example:

<entity processor="PlainTextEntityProcessor" name="x" url="http://abc.com/a.txt" dataSource="data-source-name">
<!-- copies the text to a field called 'text' in Solr-->
<field column="plainText" name="text"/>
</entity>

Ensure that the dataSource is of type DataSource<Reader> (FileDataSource, URLDataSource).

### SolrEntityProcessor

This EntityProcessor imports data from different Solr instances and cores. The data is retrieved based on a specified filter query. This EntityProcessor is useful in cases where you want to copy your Solr index and modify the data in the target index.

The SolrEntityProcessor can only copy fields that are stored in the source index.

The SolrEntityProcessor supports the following parameters:

url
Required. The URL of the source Solr instance and/or core.
query
Required. The main query to execute on the source index.
fq
Any filter queries to execute on the source index. If more than one filter query is defined, they must be separated by a comma.
rows
The number of rows to return for each iteration. The default is 50 rows.
fl
A comma-separated list of fields to fetch from the source index. Note, these fields must be stored in the source Solr instance.
qt
The search handler to use, if not the default.
wt
The response format to use, either javabin or xml.
timeout
The query timeout in seconds. The default is 5 minutes (300 seconds).
cursorMark="true"
Use this to enable cursor for efficient result set scrolling.
sort="id asc"
This should be used to specify a sort parameter referencing the uniqueKey field of the source Solr instance. See Pagination of Results for details.

Here is a simple example of a SolrEntityProcessor:

<dataConfig>
<document>
<entity name="sep" processor="SolrEntityProcessor"
url="http://127.0.0.1:8983/solr/db"
query="*:*"
fl="*,orig_version_l:_version_,ignored_price_c:price_c"/>
</document>
</dataConfig>
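For large source indexes, the cursorMark and sort parameters described above can be combined with a larger rows setting so the copy scrolls efficiently. A sketch, using the same hypothetical source core as the example above:

```xml
<!-- Sketch: cursor-based scrolling through the source index. Assumes
     the source core's uniqueKey field is "id". -->
<entity name="sep" processor="SolrEntityProcessor"
        url="http://127.0.0.1:8983/solr/db"
        query="*:*"
        rows="1000"
        cursorMark="true"
        sort="id asc"/>
```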

## Transformers

Transformers manipulate the fields in a document returned by an entity. A transformer can create new fields or modify existing ones. You must tell the entity which transformers your import operation will be using, by adding an attribute containing a comma-separated list to the <entity> element.

<entity name="abcde" transformer="org.apache.solr....,my.own.transformer,..." />

Specific transformation rules are then added to the attributes of a <field> element, as shown in the examples below. The transformers are applied in the order in which they are specified in the transformer attribute.

The DataImportHandler contains several built-in transformers. You can also write your own custom transformers if necessary. The ScriptTransformer described below offers an alternative method for writing your own transformers.
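A custom transformer is, at its simplest, a class exposing a transformRow method that DIH invokes reflectively on each row; the sketch below assumes that convention and uses hypothetical class and field names (no Solr classes are required to compile it):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a custom DIH transformer. The class name TrimTransformer and
// the NAME column are hypothetical; reference it from the config with
// transformer="com.example.TrimTransformer" (fully-qualified name).
public class TrimTransformer {
    // Called once per row; returns the (possibly modified) row.
    public Object transformRow(Map<String, Object> row) {
        Object name = row.get("NAME");
        if (name instanceof String) {
            row.put("NAME", ((String) name).trim());
        }
        return row;
    }

    // Small demonstration outside of DIH.
    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("NAME", "  widget  ");
        new TrimTransformer().transformRow(row);
        System.out.println(row.get("NAME")); // prints "widget"
    }
}
```

The compiled class must be placed on Solr's classpath (for example, in the core's lib directory) so DIH can load it by name.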

### ClobTransformer

You can use the ClobTransformer to create a string out of a CLOB in a database. A CLOB is a character large object: a collection of character data typically stored in a separate location that is referenced in the database.

The ClobTransformer accepts these attributes:

clob
Boolean value to signal if ClobTransformer should process this field or not. If this attribute is omitted, then the corresponding field is not transformed.
sourceColName
The source column to be used as input. If this is absent, the source and target are the same.

Here’s an example of invoking the ClobTransformer.

<entity name="example" transformer="ClobTransformer" ...>
<field column="hugeTextField" clob="true" />
...
</entity>

### The DateFormatTransformer

This transformer converts dates from one format to another. This would be useful, for example, in a situation where you wanted to convert a field with a fully specified date/time into a less precise date format, for use in faceting.

DateFormatTransformer applies only to fields with an attribute dateTimeFormat. Other fields are not modified.

This transformer recognizes the following attributes:

dateTimeFormat
The format used for parsing this field. This must comply with the syntax of the Java SimpleDateFormat class.
sourceColName
The column on which the dateFormat is to be applied. If this is absent, the source and target are the same.
locale
The locale to use for date transformations. If not defined, the ROOT locale is used. It must be specified as language-country (BCP 47 language tag). For example, en-US.

Here is example code that returns the date rounded up to the month "2007-JUL":

<entity name="en" pk="id" transformer="DateFormatTransformer" ... >
...
<field column="date" sourceColName="fulldate" dateTimeFormat="yyyy-MMM"/>
</entity>

### The HTMLStripTransformer

You can use this transformer to strip HTML out of a field.

There is one attribute for this transformer, stripHTML, which is a boolean value (true or false) to signal if the HTMLStripTransformer should process the field or not.

For example:

<entity name="e" transformer="HTMLStripTransformer" ... >
<field column="htmlText" stripHTML="true" />
...
</entity>

### The LogTransformer

You can use this transformer to log data to the console or log files. For example:

<entity ...
        transformer="LogTransformer"
        logTemplate="The name is ${e.name}" logLevel="info">
...
</entity>

## Special Commands for DIH

You can pass special commands to the DIH by adding any of the variables listed below to any row returned by any component:

$skipDoc
Skip the current document; that is, do not add it to Solr. The value can be the string true or false.
$skipRow
Skip the current row. The document will be added with rows from other entities. The value can be the string true or false.
$deleteDocById
Delete a document from Solr with this ID. The value has to be the uniqueKey value of the document.
$deleteDocByQuery
Delete documents from Solr using this query. The value must be a Solr Query.