
 = Uploading Data with Index Handlers
 :page-children: transforming-and-indexing-custom-json
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 Index Handlers are Request Handlers designed to add, delete and update documents in the index. In addition to having plugins for importing rich documents <<uploading-data-with-solr-cell-using-apache-tika.adoc#,using Tika>> or from structured data sources using the <<uploading-structured-data-store-data-with-the-data-import-handler.adoc#,Data Import Handler>>, Solr natively supports indexing structured documents in XML, CSV and JSON.

 The recommended way to configure and use request handlers is with path-based names that map to paths in the request URL. However, request handlers can also be specified with the `qt` (query type) parameter if the <<requestdispatcher-in-solrconfig.adoc#,`requestDispatcher`>> is appropriately configured. It is possible to access the same handler using more than one name, which can be useful if you wish to specify different sets of default options.

 A single unified update request handler supports XML, CSV, JSON, and javabin update requests, delegating to the appropriate `ContentStreamLoader` based on the `Content-Type` of the <<content-streams.adoc#,ContentStream>>.

 == UpdateRequestHandler Configuration

 The default configuration file already has the update request handler configured:

 [source,xml]
 ----
 <requestHandler name="/update" class="solr.UpdateRequestHandler" />
 ----

 == XML Formatted Index Updates

 Index update commands can be sent as an XML message to the update handler using `Content-type: application/xml` or `Content-type: text/xml`.

 === Adding Documents

 The XML schema recognized by the update handler for adding documents is very straightforward:

 * The `<add>` element introduces one or more documents to be added.
 * The `<doc>` element introduces the fields making up a document.
 * The `<field>` element presents the content for a specific field.

 For example:

 [source,xml]
 ----
 <add>
   <doc>
     <field name="authors">Patrick Eagar</field>
     <field name="subject">Sports</field>
     <field name="dd">796.35</field>
     <field name="numpages">128</field>
     <field name="desc"></field>
     <field name="price">12.40</field>
     <field name="title">Summer of the all-rounder: Test and championship cricket in England 1982</field>
     <field name="isbn">0002166313</field>
     <field name="yearpub">1982</field>
     <field name="publisher">Collins</field>
   </doc>
   <doc>
   ...
   </doc>
 </add>
 ----

 The add command supports the following optional attributes:

 `commitWithin`::
 Add the document within the specified number of milliseconds.

 `overwrite`::
 Default is `true`. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below).

 If the document schema defines a unique key, then by default an `/update` operation to add a document will overwrite (i.e., replace) any document in the index with the same unique key. If no unique key has been defined, indexing performance is somewhat faster, as no check has to be made for existing documents to replace.

 If you have a unique key field, but you feel confident that you can safely bypass the uniqueness check (e.g., you build your indexes in batch, and your indexing code guarantees it never adds the same document more than once) you can specify the `overwrite="false"` option when adding your documents.
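The `<add>` message and its optional attributes can also be generated programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_add_message` and its arguments are hypothetical and not part of Solr. It assumes a multi-valued field is expressed by repeating the `<field>` element.

[source,python]
----
import xml.etree.ElementTree as ET


def build_add_message(docs, commit_within=None, overwrite=None):
    """Return an <add> XML message (as a string) for a list of documents.

    Each document is a dict mapping a field name to a value, or to a list
    of values (assumed here to become one <field> element per value).
    """
    add = ET.Element("add")
    # Optional attributes are only written when the caller sets them.
    if commit_within is not None:
        add.set("commitWithin", str(commit_within))
    if overwrite is not None:
        add.set("overwrite", "true" if overwrite else "false")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    [{"isbn": "0002166313", "authors": "Patrick Eagar", "subject": "Sports"}],
    commit_within=5000,
    overwrite=False,
)
print(message)
----

Building the message with an XML library rather than string concatenation means that characters such as `&` and `<` in field values are escaped automatically.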

 === XML Update Commands

 ==== Commit and Optimize During Updates

 The `<commit>` operation writes all documents loaded since the last commit to one or more segment files on the disk. Before a commit has been issued, newly indexed content is not visible to searches. The commit operation opens a new searcher, and triggers any event listeners that have been configured.

 Commits may be issued explicitly with a `<commit/>` message, and can also be triggered from `<autoCommit>` parameters in `solrconfig.xml`.

 The `<optimize>` operation requests Solr to merge internal data structures. For a large index, optimization will take some time to complete, but by merging many small segment files into larger segments, search performance may improve. If you are using Solr's replication mechanism to distribute searches across many systems, be aware that after an optimize, a complete index will need to be transferred.

 WARNING: You should only consider using optimize on static indexes, i.e., indexes that can be optimized as part of the regular update process (say once-a-day updates). Applications requiring NRT functionality should not use optimize.

 The `<commit>` and `<optimize>` elements accept these optional attributes:

 `waitSearcher`::
 Default is `true`. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.

 `expungeDeletes`:: (commit only) Default is `false`. Merges segments that have more than 10% deleted docs, expunging the deleted documents in the process. Resulting segments will respect `maxMergedSegmentMB`.
 +
 WARNING: expungeDeletes is "less expensive" than optimize, but the same warnings apply.

 `maxSegments`:: (optimize only) Default is unlimited; resulting segments respect the `maxMergedSegmentMB` setting. Makes a best-effort attempt to merge the segments down to no more than this number of segments but does not guarantee that the goal will be achieved. Unless there is tangible evidence that optimizing to a small number of segments is beneficial, this parameter should be omitted and the default behavior accepted.

 Here are examples of `<commit>` and `<optimize>` using optional attributes:

 [source,xml]
 ----
 <commit waitSearcher="false"/>
 <commit waitSearcher="false" expungeDeletes="true"/>
 <optimize waitSearcher="false"/>
 ----
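As a rough illustration, the same messages can be produced from code. This sketch uses the Python standard library; `build_command` is a hypothetical helper, and it does not check which attributes are valid for which command.

[source,python]
----
import xml.etree.ElementTree as ET


def build_command(name, **attrs):
    """Return a <commit/> or <optimize/> element as a string.

    Boolean values are written as "true"/"false"; anything else via str().
    """
    if name not in ("commit", "optimize"):
        raise ValueError("expected 'commit' or 'optimize'")
    element = ET.Element(name)
    for key, value in attrs.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        element.set(key, str(value))
    return ET.tostring(element, encoding="unicode")


commit_xml = build_command("commit", waitSearcher=False, expungeDeletes=True)
optimize_xml = build_command("optimize", waitSearcher=False, maxSegments=2)
print(commit_xml)
print(optimize_xml)
----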

 ==== Delete Operations

 Documents can be deleted from the index in two ways.
 "Delete by ID" deletes the document with the specified ID, and can be used only if a uniqueKey field has been defined in the schema.
 It doesn't work for child/nested docs.
 "Delete by Query" deletes all documents matching a specified query, although `commitWithin` is ignored for a Delete by Query. A single delete message can contain multiple delete operations.

 [source,xml]
 ----
 <delete>
   <id>0002166313</id>
   <id>0031745983</id>
   <query>subject:sport</query>
   <query>publisher:penguin</query>
 </delete>
 ----
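A delete message mixing IDs and queries can be assembled in the same way. This is a sketch only; `build_delete_message` is a hypothetical helper built on the Python standard library.

[source,python]
----
import xml.etree.ElementTree as ET


def build_delete_message(ids=(), queries=()):
    """Return a <delete> message with one <id> or <query> child per entry."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    for query in queries:
        # ElementTree escapes &, < and > in the query text when serializing.
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


message = build_delete_message(
    ids=["0002166313", "0031745983"],
    queries=["subject:sport", 'title:"cats & dogs"'],
)
print(message)
----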

 [IMPORTANT]
 ====

 When using the Join query parser in a Delete By Query, you should use the `score` parameter with a value of "none" to avoid a `ClassCastException`. See the section on the <<other-parsers.adoc#,Join Query Parser>> for more details on the `score` parameter.

 ====

 ==== Rollback Operations

 The rollback command rolls back all adds and deletes made to the index since the last commit. It neither calls any event listeners nor creates a new searcher. Its syntax is simple: `<rollback/>`.

 ==== Grouping Operations

 You can post several commands in a single XML file by grouping them with the surrounding `<update>` element.

 [source,xml]
 ----
 <update>
   <add>
     <doc><!-- doc 1 content --></doc>
   </add>
   <add>
     <doc><!-- doc 2 content --></doc>
   </add>
   <delete>
     <id>0002166313</id>
   </delete>
 </update>
 ----


 === Using curl to Perform Updates

 You can use the `curl` utility to perform any of the above commands, using its `--data-binary` option to append the XML message to the `curl` command and generate an HTTP POST request. For example:

 [source,bash]
 ----
 curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '
 <add>
   <doc>
     <field name="authors">Patrick Eagar</field>
     <field name="subject">Sports</field>
     <field name="dd">796.35</field>
     <field name="isbn">0002166313</field>
     <field name="yearpub">1982</field>
     <field name="publisher">Collins</field>
   </doc>
 </add>'
 ----

 For posting XML messages contained in a file, you can use the alternative form:

 [source,bash]
 ----
 curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary @myfile.xml
 ----

 The approach above works well, but using the `--data-binary` option causes `curl` to load the whole `myfile.xml` into memory before posting it to the server. This may be problematic when dealing with multi-gigabyte files. This alternative `curl` command performs equivalent operations but with minimal `curl` memory usage:

 [source,bash]
 ----
 curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" -T "myfile.xml" -X POST
 ----
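The same idea, reading the file in pieces rather than loading it whole, can be sketched in Python. The generator below is a hypothetical helper; it only demonstrates the chunked reading, using an in-memory stream in place of a large file. Python's `http.client` accepts an iterable of bytes as a request body, so a generator like this could feed an upload, but whether that suits a particular deployment is an assumption to verify.

[source,python]
----
import io


def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield successive chunks of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# An in-memory stream stands in for open("myfile.xml", "rb").
demo = io.BytesIO(b"<add>" + b"x" * 10 + b"</add>")
chunks = list(iter_chunks(demo, chunk_size=8))
print([len(chunk) for chunk in chunks])
----

Only one chunk is held in memory at a time while iterating; the `list()` call here exists purely to make the demonstration easy to inspect.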

 Short requests can also be sent using an HTTP GET command, if enabled in the <<requestdispatcher-in-solrconfig.adoc#requestparsers-element,RequestDispatcher in SolrConfig>> element, by URL-encoding the request, as in the following. Note the escaping of "<" and ">":

 [source,bash]
 ----
 curl 'http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E&wt=xml'
 ----
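The URL-encoding step can be done with Python's `urllib.parse.quote`, which by default leaves `/` unescaped and so yields the same encoding as the request above. `build_stream_body_url` is a hypothetical helper name.

[source,python]
----
from urllib.parse import quote


def build_stream_body_url(base_url, command_xml, wt="xml"):
    """Return an update URL carrying a short XML command in stream.body."""
    # quote() percent-encodes "<" and ">" but keeps "/" as-is by default.
    return f"{base_url}?stream.body={quote(command_xml)}&wt={quote(wt)}"


url = build_stream_body_url(
    "http://localhost:8983/solr/my_collection/update", "<commit/>"
)
print(url)
----

When the resulting URL is used on a command line, it should be quoted so that the shell does not interpret the `&`.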

 Responses from Solr take the form shown here:

 [source,xml]
 ----
 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">127</int>
   </lst>
 </response>
 ----

 The status field will be non-zero in case of failure.
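A client can check the status by parsing the response. The sketch below uses the Python standard library and the sample response shown above; `parse_response_header` is a hypothetical helper.

[source,python]
----
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">127</int>
  </lst>
</response>"""


def parse_response_header(xml_text):
    """Return the integer entries of the responseHeader as a dict."""
    root = ET.fromstring(xml_text)
    header = root.find("lst[@name='responseHeader']")
    if header is None:
        raise ValueError("response has no responseHeader")
    return {entry.get("name"): int(entry.text) for entry in header.findall("int")}


header = parse_response_header(SAMPLE_RESPONSE)
if header["status"] != 0:
    raise RuntimeError(f"update failed with status {header['status']}")
print(header)
----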

 === Using XSLT to Transform XML Index Updates

 The UpdateRequestHandler allows you to index any arbitrary XML using the `tr` parameter to apply an https://en.wikipedia.org/wiki/XSLT[XSL transformation]. You must have an XSLT stylesheet in the `conf/xslt` directory of your <<config-sets.adoc#,configset>> that can transform the incoming data to the expected `<add><doc/></add>` format, and use the `tr` parameter to specify the name of that stylesheet.

 Here is an example XSLT stylesheet:

 [source,xml]
 ----
 <xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
   <xsl:output media-type="text/xml" method="xml" indent="yes"/>
   <xsl:template match='/'>
     <add>
       <xsl:apply-templates select="response/result/doc"/>
     </add>
   </xsl:template>
   <!-- Ignore score (makes no sense to index) -->
   <xsl:template match="doc/*[@name='score']" priority="100"></xsl:template>
   <xsl:template match="doc">
     <xsl:variable name="pos" select="position()"/>
     <doc>
       <xsl:apply-templates>
         <xsl:with-param name="pos"><xsl:value-of select="$pos"/></xsl:with-param>
       </xsl:apply-templates>
     </doc>
   </xsl:template>
   <!-- Flatten arrays to duplicate field lines -->
   <xsl:template match="doc/arr" priority="100">
     <xsl:variable name="fn" select="@name"/>
     <xsl:for-each select="*">
       <xsl:element name="field">
         <xsl:attribute name="name"><xsl:value-of select="$fn"/></xsl:attribute>
         <xsl:value-of select="."/>
       </xsl:element>
     </xsl:for-each>
   </xsl:template>
   <xsl:template match="doc/*">
     <xsl:variable name="fn" select="@name"/>
       <xsl:element name="field">
         <xsl:attribute name="name"><xsl:value-of select="$fn"/></xsl:attribute>
       <xsl:value-of select="."/>
     </xsl:element>
   </xsl:template>
   <xsl:template match="*"/>
 </xsl:stylesheet>
 ----

 This stylesheet transforms Solr's XML search result format into Solr's Update XML syntax. One example usage would be to copy a Solr 1.3 index (which does not have a CSV response writer) into a format which can be indexed into another Solr instance (provided that all fields are stored):

 [source,plain]
 ----
 http://localhost:8983/solr/my_collection/select?q=*:*&wt=xslt&tr=updateXml.xsl&rows=1000
 ----

 You can also use the stylesheet in `XsltUpdateRequestHandler` to transform an index when updating:

 [source,bash]
 ----
 curl "http://localhost:8983/solr/my_collection/update?commit=true&tr=updateXml.xsl" -H "Content-Type: text/xml" --data-binary @myexporteddata.xml
 ----
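To make the mapping performed by the example stylesheet easier to follow, here is a rough Python analogue of it. This is not XSLT and not something Solr runs; it is an illustrative sketch using the standard library, with the hypothetical name `results_to_add`, that applies the same rules: drop `score`, flatten `<arr>` values into repeated fields, and turn every other child of `<doc>` into a `<field>`.

[source,python]
----
import xml.etree.ElementTree as ET


def results_to_add(response_xml):
    """Convert Solr XML search results into an <add> update message."""
    response = ET.fromstring(response_xml)
    add = ET.Element("add")
    for doc in response.findall("result/doc"):
        out_doc = ET.SubElement(add, "doc")
        for child in doc:
            name = child.get("name")
            if name == "score":
                continue  # the score is not indexable content
            # An <arr> becomes one <field> per contained value.
            values = list(child) if child.tag == "arr" else [child]
            for value in values:
                field = ET.SubElement(out_doc, "field", name=name)
                field.text = value.text
    return ET.tostring(add, encoding="unicode")


sample = (
    "<response><result>"
    "<doc>"
    "<str name='id'>1</str>"
    "<arr name='cat'><str>book</str><str>sport</str></arr>"
    "<float name='score'>1.0</float>"
    "</doc>"
    "</result></response>"
)
converted = results_to_add(sample)
print(converted)
----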

 == JSON Formatted Index Updates

 Solr can accept JSON that conforms to a defined structure, or can accept arbitrary JSON-formatted documents. If sending arbitrarily formatted JSON, there are some additional parameters that need to be sent with the update request, described below in the section <<transforming-and-indexing-custom-json.adoc#,Transforming and Indexing Custom JSON>>.

 === Solr-Style JSON

 JSON formatted update requests may be sent to Solr's `/update` handler using `Content-Type: application/json` or `Content-Type: text/json`.

 JSON formatted updates can take 3 basic forms, described in depth below:

 * <<Adding a Single JSON Document,A single document to add>>, expressed as a top level JSON Object. To differentiate this from a set of commands, the `json.command=false` request parameter is required.
 * <<Adding Multiple JSON Documents,A list of documents to add>>, expressed as a top level JSON Array containing a JSON Object per document.
 * <<Sending JSON Update Commands,A sequence of update commands>>, expressed as a top level JSON Object (aka: Map).

 ==== Adding a Single JSON Document

 The simplest way to add Documents via JSON is to send each document individually as a JSON Object, using the `/update/json/docs` path:

 [source,bash]
 ----
 curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
 {
   "id": "1",
   "title": "Doc 1"
 }'
 ----

 ==== Adding Multiple JSON Documents

 Adding multiple documents at one time via JSON can be done via a JSON Array of JSON Objects, where each object represents a document:

 [source,bash]
 ----
 curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
 [
   {
     "id": "1",
     "title": "Doc 1"
   },
   {
     "id": "2",
     "title": "Doc 2"
   }
 ]'
 ----
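The equivalent request can be prepared from Python with the standard library. This sketch only constructs the request object; `build_json_update_request` is a hypothetical helper, and actually sending the request requires a running Solr at the assumed URL.

[source,python]
----
import json
import urllib.request


def build_json_update_request(update_url, docs):
    """Return a POST request whose body is a JSON array of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [{"id": "1", "title": "Doc 1"}, {"id": "2", "title": "Doc 2"}]
request = build_json_update_request(
    "http://localhost:8983/solr/my_collection/update", docs
)
print(request.get_method(), request.full_url)

# To send it (needs a reachable Solr instance):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
----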

 A sample JSON file is provided at `example/exampledocs/books.json` and contains an array of objects that you can add to the Solr `techproducts` example:

 [source,bash]
 ----
 curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary @example/exampledocs/books.json -H 'Content-type:application/json'
 ----

 ==== Sending JSON Update Commands

 In general, the JSON update syntax supports all of the update commands that the XML update handler supports, through a straightforward mapping. Multiple commands, adding and deleting documents, may be contained in one message:

 [source,bash,subs="verbatim,callouts"]
 ----
 curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
 {
   "add": {
     "doc": {
       "id": "DOC1",
       "my_field": 2.3,
       "my_multivalued_field": [ "aaa", "bbb" ]   --<1>
     }
   },
   "add": {
     "commitWithin": 5000, --<2>
     "overwrite": false,  --<3>
     "doc": {
       "f1": "v1", --<4>
       "f1": "v2"
     }
   },

   "commit": {},
   "optimize": { "waitSearcher":false },

   "delete": { "id":"ID" },  --<5>
   "delete": { "query":"QUERY" } --<6>
 }'
 ----

 <1> Can use an array for a multi-valued field
 <2> Commit this document within 5 seconds
 <3> Don't check for existing documents with the same uniqueKey
 <4> Can use repeated keys for a multi-valued field
 <5> Delete by ID (uniqueKey field)
 <6> Delete by Query
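A practical wrinkle when generating such a message from code: the example repeats the `add` and `delete` keys, which most JSON libraries cannot produce from a dictionary because dictionary keys are unique. One possible workaround, sketched here in Python with a hypothetical helper `encode_commands`, is to serialize each command separately and join the members by hand.

[source,python]
----
import json


def encode_commands(commands):
    """Serialize (name, payload) pairs as one JSON object, keeping repeated keys."""
    members = [
        f"{json.dumps(name)}: {json.dumps(payload)}" for name, payload in commands
    ]
    return "{" + ", ".join(members) + "}"


commands = [
    ("add", {"doc": {"id": "DOC1", "my_field": 2.3}}),
    ("add", {"commitWithin": 5000, "overwrite": False, "doc": {"id": "DOC2"}}),
    ("commit", {}),
    ("delete", {"id": "ID"}),
    ("delete", {"query": "QUERY"}),
]
body = encode_commands(commands)
print(body)

# json.loads() alone keeps only the last value of a repeated key;
# object_pairs_hook exposes every member that is present in the text.
pairs = json.loads(body, object_pairs_hook=list)
print([name for name, _ in pairs])
----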

 As with other update handlers, parameters such as `commit`, `commitWithin`, `optimize`, and `overwrite` may be specified in the URL instead of in the body of the message.

 The JSON update format allows for a simple delete-by-id. The value of a `delete` can be a single document ID, or an array containing a list of zero or more specific document IDs (not a range) to be deleted. For example, a single document:

 [source,json]
 ----
 { "delete":"myid" }
 ----

 Or a list of document IDs:

 [source,json]
 ----
 { "delete":["id1","id2"] }
 ----

 Note: Delete-by-id doesn't work for child/nested docs.

 You can also specify `\_version_` with each "delete":

 [source,json]
 ----
 {
   "delete": {
     "id":50,
     "_version_":12345
   }
 }
 ----

 You can specify the version of deletes in the body of the update request as well.
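The delete-by-id forms can be generated with a small helper. This is a sketch using the Python standard library; `delete_by_id` is a hypothetical name, and it assumes the versioned form is written as a JSON object holding `id` and `\_version_`.

[source,python]
----
import json


def delete_by_id(ids, version=None):
    """Return a JSON delete command for one ID, a list of IDs, or one ID with a version."""
    if version is None:
        return json.dumps({"delete": ids})
    if isinstance(ids, (list, tuple)):
        raise ValueError("a version applies to a single document ID")
    # Assumed shape of the versioned form: an object with id and _version_.
    return json.dumps({"delete": {"id": ids, "_version_": version}})


single = delete_by_id("myid")
several = delete_by_id(["id1", "id2"])
versioned = delete_by_id(50, version=12345)
print(single)
print(several)
print(versioned)
----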

 === JSON Update Convenience Paths

 In addition to the `/update` handler, there are a few additional JSON specific request handler paths available by default in Solr, that implicitly override the behavior of some request parameters:

 [width="100%",options="header",]
 |===
 |Path |Default Parameters
 |`/update/json` |`stream.contentType=application/json`
 |`/update/json/docs` a|
 `stream.contentType=application/json`

 `json.command=false`

 |===

 The `/update/json` path may be useful for clients sending in JSON formatted update commands from applications where setting the Content-Type proves difficult, while the `/update/json/docs` path can be particularly convenient for clients that always want to send in documents – either individually or as a list – without needing to worry about the full JSON command syntax.

 === Custom JSON Documents

 Solr can support custom JSON. This is covered in the section <<transforming-and-indexing-custom-json.adoc#,Transforming and Indexing Custom JSON>>.


 == CSV Formatted Index Updates

 CSV formatted update requests may be sent to Solr's `/update` handler using `Content-Type: application/csv` or `Content-Type: text/csv`.

 A sample CSV file is provided at `example/exampledocs/books.csv` that you can use to add some documents to the Solr `techproducts` example:

 [source,bash]
 ----
 curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'
 ----

 === CSV Update Parameters

 The CSV handler accepts many parameters in the URL. A global parameter takes the form `_parameter_=_value_`; a per-field parameter takes the form `f._fieldname_._parameter_=_value_`, as in `f.isbn.trim=true`.

 The parameters for the update handler are described below.

 `separator`::
 Character used as field separator; default is ",". This parameter is global; for per-field usage, see the `split` parameter.
 +
 Example:  `separator=%09`

 `trim`::
 If `true`, remove leading and trailing whitespace from values. The default is `false`. This parameter can be either global or per-field.
 +
 Examples: `f.isbn.trim=true` or `trim=false`

 `header`::
 Set to `true` if first line of input contains field names. These will be used if the `fieldnames` parameter is absent. This parameter is global.

 `fieldnames`::
 Comma-separated list of field names to use when adding documents. This parameter is global.
 +
 Example: `fieldnames=isbn,price,title`

 `literal._field_name_`::
 A literal value for a specified field name. This parameter is global.
 +
 Example: `literal.color=red`

 `skip`::
 Comma-separated list of field names to skip. This parameter is global.
 +
 Example: `skip=uninteresting,shoesize`

 `skipLines`::
 Number of lines to discard in the input stream before the CSV data starts, including the header, if present. Default=`0`. This parameter is global.
 +
 Example: `skipLines=5`

 `encapsulator`:: The character optionally used to surround values to preserve characters such as the CSV separator or whitespace. This standard CSV format handles the encapsulator itself appearing in an encapsulated value by doubling the encapsulator.
 +
 This parameter is global; for per-field usage, see `split`.
 +
 Example: `encapsulator="`

 `escape`:: The character used for escaping CSV separators or other reserved characters. If an escape is specified, the encapsulator is not used unless also explicitly specified since most formats use either encapsulation or escaping, not both. This parameter is global; for per-field usage, see `split`.
 +
 Example: `escape=\`

 `keepEmpty`::
 Keep and index zero length (empty) fields. The default is `false`. This parameter can be global or per-field.
 +
 Example: `f.price.keepEmpty=true`

 `map`:: Map one value to another. Format is value:replacement (which can be empty). This parameter can be global or per-field.
 +
 Example: `map=left:right` or `f.subject.map=history:bunk`

 `split`::
 If `true`, split a field into multiple values by a separate parser. This parameter is used on a per-field basis.

 `overwrite`::
 If `true` (the default), check for and overwrite duplicate documents, based on the uniqueKey field declared in the Solr schema. If you know the documents you are indexing do not contain any duplicates then you may see a considerable speed up setting this to `false`.
 +
 This parameter is global.

 `commit`::
 Issues a commit after the data has been ingested. This parameter is global.

 `commitWithin`::
 Add the document within the specified number of milliseconds. This parameter is global.
 +
 Example: `commitWithin=10000`

 `rowid`::
 Map the `rowid` (line number) to a field specified by the value of the parameter, for instance if your CSV doesn't have a unique key and you want to use the row id as such. This parameter is global.
 +
 Example: `rowid=id`

 `rowidOffset`::
 Add the given offset (as an integer) to the `rowid` before adding it to the document. Default is `0`. This parameter is global.
 +
 Example: `rowidOffset=10`
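Because several of these values need URL-encoding (a tab separator, a backslash escape), it can help to build the query string in code. The following sketch uses Python's `urllib.parse.urlencode`; `build_csv_update_url` is a hypothetical helper, and the per-field naming follows the `f.isbn.trim=true` pattern used in the examples above.

[source,python]
----
from urllib.parse import urlencode


def _as_text(value):
    """Write booleans in lowercase; everything else via str()."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_csv_update_url(update_url, global_params=None, field_params=None):
    """Return the update URL with global and per-field CSV parameters appended."""
    params = {k: _as_text(v) for k, v in (global_params or {}).items()}
    for field, options in (field_params or {}).items():
        for option, value in options.items():
            params[f"f.{field}.{option}"] = _as_text(value)
    return f"{update_url}?{urlencode(params)}"


url = build_csv_update_url(
    "http://localhost:8983/solr/my_collection/update",
    global_params={"commit": True, "separator": "\t", "escape": "\\"},
    field_params={"isbn": {"trim": True}},
)
print(url)
----

`urlencode` writes hexadecimal escapes in uppercase (`%5C`), which is equivalent to the lowercase `%5c` form.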

 === Indexing Tab-Delimited files

 The same feature used to index CSV documents can also be easily used to index tab-delimited files (TSV files) and even handle backslash escaping rather than CSV encapsulation.

 For example, one can dump a MySQL table to a tab-delimited file with:

 [source,sql]
 ----
 SELECT * INTO OUTFILE '/tmp/result.txt' FROM mytable;
 ----

 This file could then be imported into Solr by setting the `separator` to tab (%09) and the `escape` to backslash (%5c).

 [source,bash]
 ----
 curl 'http://localhost:8983/solr/my_collection/update/csv?commit=true&separator=%09&escape=%5c' --data-binary @/tmp/result.txt
 ----

 === CSV Update Convenience Paths

 In addition to the `/update` handler, there is an additional CSV-specific request handler path available by default in Solr that implicitly overrides the behavior of some request parameters:

 [cols=",",options="header",]
 |===
 |Path |Default Parameters
 |`/update/csv` |`stream.contentType=application/csv`
 |===

 The `/update/csv` path may be useful for clients sending in CSV formatted update commands from applications where setting the Content-Type proves difficult.