We have a movie data set in JSON, Solr XML, and CSV formats. All three formats contain the same data, so you can use any one of them to index documents into Solr.
The data was fetched from Freebase, and the data license is in the films-LICENSE.txt file.
This data includes the following fields, among others: name, directed_by, initial_release_date, and genre.
Steps:
Start Solr:
bin/solr start
Create a “films” core:
bin/solr create -c films
Define the schema for two fields whose types Solr would otherwise guess differently than we'd like:
curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{ "add-field" : { "name":"name", "type":"text_general", "multiValued":false, "stored":true }, "add-field" : { "name":"initial_release_date", "type":"pdate", "stored":true } }'
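To confirm that the two field definitions were accepted, the Schema API can be asked for each field by name. The following is a minimal sketch, assuming the same host, port, and core as the command above; the field_url helper is hypothetical and exists only for this illustration.

```shell
schema_url='http://localhost:8983/solr/films/schema'

# Hypothetical helper: URL of the Schema API endpoint describing one field.
field_url() {
  printf '%s/fields/%s\n' "$schema_url" "$1"
}

for field in name initial_release_date; do
  # -sS: quiet but still report errors; -f: treat HTTP errors as failures.
  curl -sSf --max-time 5 "$(field_url "$field")" \
    || echo "could not read field '$field' (is Solr running?)" >&2
done
```

If the fields exist, each response should describe the field, including its type.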
Now let's index the data, using one of these three commands:
bin/post -c films example/films/films.json
bin/post -c films example/films/films.xml
bin/post -c films example/films/films.csv \
    -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
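The split and separator parameters in the CSV command ask Solr to break a single CSV cell into several values on the "|" character. The sketch below imitates that splitting locally in plain shell so the effect is easy to see. The sample cell is hypothetical, not copied from films.csv, and this is only an illustration of the idea, not Solr's own implementation.

```shell
# Hypothetical raw CSV cell for the genre column.
genre='Superhero movie|Action Film|Crime Fiction'

# Split on "|" in the spirit of f.genre.split=true with f.genre.separator=|,
# producing one value per genre.
old_ifs=$IFS
IFS='|'
set -f            # disable globbing while the unquoted expansion is split
set -- $genre     # positional parameters now hold the individual values
set +f
IFS=$old_ifs

count=$#
first=$1
echo "$count values; first is: $first"
```

Without the split parameters, the whole cell would be indexed as one value, so a filter on a single genre would not match it.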
Let's get searching!
Search for ‘Batman’:
http://localhost:8983/solr/films/query?q=name:batman
Show me all ‘Superhero’ movies:
http://localhost:8983/solr/films/query?q=*:*&fq=genre:%22Superhero%20movie%22
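The %22 and %20 in the filter query are the URL-encoded forms of a double quote and a space. Here is a minimal sketch of building that URL from the readable form genre:"Superhero movie", assuming the match-all query *:* for q. The encode_fq helper is hypothetical and handles only those two characters; it is not a general URL encoder.

```shell
# Hypothetical helper: percent-encodes only spaces and double quotes,
# which is all this particular filter query needs.
encode_fq() {
  printf '%s' "$1" | sed -e 's/ /%20/g' -e 's/"/%22/g'
}

fq=$(encode_fq 'genre:"Superhero movie"')
url="http://localhost:8983/solr/films/query?q=*:*&fq=$fq"
echo "$url"
```

When calling Solr from the command line, curl can do the encoding itself: curl -G with one --data-urlencode option per parameter sends the same request without hand-encoding.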
Let's see the distribution of genres across all the movies. See the facet section of the response for the counts:
http://localhost:8983/solr/films/query?q=*:*&facet=true&facet.field=genre
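The facet request can be narrowed with a few standard Solr parameters. The sketch below is one possible variation and not part of the original walkthrough: rows=0 suppresses the document list so only the counts come back, facet.limit=5 keeps the five most common genres, and facet.mincount=1 drops genres with no matching movies. The specific values are assumptions chosen for illustration.

```shell
base='http://localhost:8983/solr/films/query'

# Parameters beyond facet=true and facet.field=genre are illustrative additions.
params='q=*:*&rows=0&facet=true&facet.field=genre&facet.limit=5&facet.mincount=1'

url="$base?$params"
echo "$url"
```

Quote the URL when passing it to curl so the shell does not interpret the & and * characters.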
Exploring the data further -
FAQ:
Why override the schema of the name and initial_release_date fields?
Without overriding those field types, the _name_ field would have been guessed as a multi-valued string field type and _initial_release_date_ would have been guessed as a multi-valued pdate type. For this data set, it makes more sense for the movie name to be a single-valued, full-text searchable field (text_general), and for the release date to be single-valued as well.
How do I clear and reset my environment?
See the script below.
Is there an easy to copy/paste script to do all of the above?
Here ya go << END_OF_SCRIPT

bin/solr stop
rm server/logs/*.log
rm -Rf server/solr/films/
bin/solr start
bin/solr create -c films
curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{ "add-field" : { "name":"name", "type":"text_general", "multiValued":false, "stored":true }, "add-field" : { "name":"initial_release_date", "type":"pdate", "stored":true } }'
bin/post -c films example/films/films.json

# END_OF_SCRIPT
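Because the script uses relative paths and rm -Rf, it is only safe to run from the Solr install directory. A small guard like the following could be placed in front of it. This is a sketch, not part of the original script: the in_solr_root helper is hypothetical, and it assumes that an executable bin/solr and a server/solr directory are enough to recognize the install directory.

```shell
# Hypothetical guard: succeeds only if the current directory looks like
# a Solr install directory (executable bin/solr plus server/solr).
in_solr_root() {
  [ -x bin/solr ] && [ -d server/solr ]
}

if in_solr_root; then
  echo "Solr install directory detected; safe to run the reset script."
else
  echo "Not in a Solr install directory; skipping reset." >&2
fi
```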