 = Exercise 3: Index Your Own Data
 :experimental:
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 [[exercise-3]]
 == Exercise 3: Index Your Own Data

 For this exercise, work with a dataset of your choice.
 This can be files on your local hard drive, a set of data you have worked with before, or maybe a sample of the data you intend to index to Solr for your production application.

 This exercise is intended to get you thinking about what you will need to do for your application:

 * What sorts of data do you need to index?
 * What will you need to do to prepare Solr for your data (for example, create specific fields, set up copy fields, or determine analysis rules)?
 * What kinds of search options do you want to provide to users?
 * How much testing will you need to do to ensure everything works the way you expect?

 === Create Your Own Collection

 Before you get started, create a new collection, named whatever you'd like.
 In this example, the collection will be named "localDocs"; substitute your own name wherever it appears in the commands that follow.

 [,console]
 ----
 $ bin/solr create -c localDocs -s 2 -rf 2
 ----

 As we saw in Exercise 2, this will use the `_default` configset and all the schemaless features it provides.
 As we noted previously, this may cause problems when we index our data, because field guessing can choose a type that does not fit later documents.
 You may need to iterate on indexing a few times before you get the schema right.
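
 One way to reduce the iteration described above is to define a field explicitly before indexing, so that field guessing never has to choose its type.
 The following is a minimal sketch using the Schema API with `curl`; it assumes Solr is running at `http://localhost:8983`, and the field name `title` and the `text_general` type are illustrative choices, not something this exercise prescribes.

 [,shell]
 ----
 # Sketch: define a field through the Schema API before indexing.
 # Assumptions (not from the tutorial): the field name "title" and the
 # "text_general" type are examples only; adjust them for your data.

 SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
 COLLECTION="${COLLECTION:-localDocs}"

 # Build the JSON body for an add-field command.
 # $1 = field name, $2 = field type
 add_field_body() {
   printf '{"add-field":{"name":"%s","type":"%s","stored":true}}' "$1" "$2"
 }

 # Send the command to the collection's schema endpoint.
 add_field() {
   curl -sS -X POST -H 'Content-Type: application/json' \
     --data-binary "$(add_field_body "$1" "$2")" \
     "$SOLR_URL/$COLLECTION/schema"
 }

 # Example (requires a running Solr):
 #   add_field title text_general
 ----

 Keeping the body builder separate from the request makes it easy to inspect the JSON before sending anything to Solr.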

 === Indexing Ideas

 Solr has lots of ways to index data.
 Choose one of the approaches below and try it out with your system:

 Local Files with `bin/solr post`::
 The Post Tool (`bin/solr post`) can index an entire local directory of files.
 We saw this in action in our first exercise.
 +
 We used only JSON, XML and CSV in our exercises, but the Post Tool can also handle HTML, PDF, Microsoft Office formats (such as MS Word), plain text, and more.
 +
 In this example, assume there is a directory named "Documents" locally.
 To index it, we would issue a command like this (changing the collection name after the `-c` parameter as needed):
 +
 [,console]
 ----
 $ bin/solr post -c localDocs ~/Documents
 ----
 +
 You may get errors as the Post Tool works through your documents.
 These might be caused by field guessing, or by a file type that is not supported.
 Indexing content like this shows why you need to plan your Solr configuration around your data, which requires understanding that data and perhaps some trial and error.

 SolrJ::
 SolrJ is a Java-based client for interacting with Solr.
 Use xref:deployment-guide:solrj.adoc[] for JVM-based languages or other xref:deployment-guide:client-apis.adoc[] to programmatically create documents to send to Solr.

 Documents Screen::
 Use the Admin UI xref:indexing-guide:documents-screen.adoc[] (at http://localhost:8983/solr/#/localDocs/documents) to paste in a document to be indexed, or select `Document Builder` from the `Document Type` dropdown to build a document one field at a time.
 Click on the btn:[Submit Document] button below the form to index your document.
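
 Besides the approaches listed above, documents can also be sent straight to the collection's update handler over HTTP.
 The sketch below posts a JSON file with `curl`; it assumes Solr is running at `http://localhost:8983`, and the file name `docs.json` and its fields are hypothetical.

 [,shell]
 ----
 # Sketch: index a JSON file by posting it to the update handler.
 # Assumptions (not from the tutorial): docs.json and its field names
 # are made up for illustration.

 SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
 COLLECTION="${COLLECTION:-localDocs}"

 # Build the update URL; commit=true makes the change searchable right away.
 update_url() {
   printf '%s/%s/update?commit=true' "$SOLR_URL" "$COLLECTION"
 }

 # Post a file that contains a JSON array of documents.
 # $1 = path to the JSON file
 index_json_file() {
   curl -sS -X POST -H 'Content-Type: application/json' \
     --data-binary "@$1" "$(update_url)"
 }

 # Example (requires a running Solr):
 #   echo '[{"id":"doc1","title":"Hello Solr"}]' > docs.json
 #   index_json_file docs.json
 ----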

 === Updating Data

 You may notice that even if you index content in this tutorial more than once, the search results do not contain duplicates.
 This is because the example Solr schema (a file named either `managed-schema.xml` or `schema.xml`) specifies a `uniqueKey` field called `id`.
 Whenever you POST commands to Solr to add a document with the same value for the `uniqueKey` as an existing document, Solr automatically replaces the existing document for you.

 You can see that this has happened by looking at the values for `numDocs` and `maxDoc` in the core-specific Overview section of the Solr Admin UI.

 `numDocs` represents the number of searchable documents in the index (and will be larger than the number of XML, JSON, or CSV files since some files contained more than one document).
 The `maxDoc` value may be larger as the `maxDoc` count includes logically deleted documents that have not yet been physically removed from the index.
 You can re-post the sample files over and over again as much as you want and `numDocs` will never increase, because the new documents will constantly be replacing the old.

 Go ahead and edit any of the existing example data files, change some of the data, and re-run the Post Tool (`bin/solr post`).
 You'll see your changes reflected in subsequent searches.
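
 To check the replacement behavior from the command line, you can count the searchable documents before and after re-posting the same files.
 The sketch below runs a match-all query with `rows=0` and reads `numFound` from the response; it assumes Solr is running at `http://localhost:8983`.
 Note that `numFound` covers the whole collection, whereas the `numDocs` value in the Admin UI Overview is reported per core.

 [,shell]
 ----
 # Sketch: count searchable documents with a match-all query.
 # The sed-based JSON parsing is a simplification for illustration only.

 SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
 COLLECTION="${COLLECTION:-localDocs}"

 # Read a Solr JSON response on stdin and print the numFound value.
 extract_num_found() {
   sed -n 's/.*"numFound":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
 }

 # Query the collection for all documents but return none of them.
 count_docs() {
   curl -sS -G "$SOLR_URL/$COLLECTION/select" \
     --data-urlencode 'q=*:*' \
     --data-urlencode 'rows=0' | extract_num_found
 }

 # Example (requires a running Solr):
 #   count_docs    # run before and after re-posting the same files
 ----

 If the two counts match after re-posting, the new documents replaced the old ones rather than being added alongside them.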

 === Deleting Data

 If you need to iterate a few times to get your schema right, you may want to delete documents to clear out the collection and try again.
 Clearing out the documents lets you reindex your data after you change fields to suit your needs.
 Note, however, that merely removing documents doesn't change the underlying field definitions; those must be updated separately.

 You can delete data by POSTing a delete command to the update URL and specifying the value of the document's unique key field, or a query that matches multiple documents (be careful with that one!).
 We can also use `bin/solr post` to delete documents if we structure the request properly.

 Execute the following command to delete a specific document:

 [,console]
 ----
 $ bin/solr post -c localDocs -d "<delete><id>SP2514N</id></delete>"
 ----

 To delete all documents, you can use a delete-by-query command like this:

 [,console]
 ----
 $ bin/solr post -c localDocs -d "<delete><query>*:*</query></delete>"
 ----

 You can also modify the above to only delete documents that match a specific query.
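
 The same delete commands can be expressed in JSON and sent to the update URL with `curl`.
 The sketch below builds a delete-by-ID body and a delete-by-query body; it assumes Solr is running at `http://localhost:8983`, and the query `cat:book` is a hypothetical example of a narrower query.

 [,shell]
 ----
 # Sketch: delete documents by POSTing JSON to the update handler.
 # Assumptions (not from the tutorial): the query "cat:book" is made up.
 # Values containing double quotes would need JSON escaping first.

 SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
 COLLECTION="${COLLECTION:-localDocs}"

 # Build the body that deletes one document by its unique key.
 delete_by_id_body() {
   printf '{"delete":{"id":"%s"}}' "$1"
 }

 # Build the body that deletes every document matching a query.
 delete_by_query_body() {
   printf '{"delete":{"query":"%s"}}' "$1"
 }

 # Send a delete body and commit so the change is visible immediately.
 send_delete() {
   curl -sS -X POST -H 'Content-Type: application/json' \
     --data-binary "$1" "$SOLR_URL/$COLLECTION/update?commit=true"
 }

 # Examples (require a running Solr):
 #   send_delete "$(delete_by_id_body SP2514N)"
 #   send_delete "$(delete_by_query_body 'cat:book')"
 ----

 Printing the body before sending it is a cheap safeguard against an overly broad query.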

 === Exercise 3 Wrap Up

 At this point, you're ready to start working on your own.

 Jump ahead to the overall xref:solr-tutorial.adoc#wrapping-up[wrap up] when you're ready to stop Solr and remove all the examples you worked with so you can start fresh.

 Or if you'd like, you could work your way through the remaining exercises.