blob: c7fd8e94e353bd840367182ffeabec1a375fa5e4 [file] [log] [blame]
= Indexing Nested Child Documents
:solr-root-path: ../../
:example-source-dir: {solr-root-path}solrj/src/test/org/apache/solr/client/ref_guide_examples/
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Solr supports indexing nested documents, described here, and ways to <<searching-nested-documents.adoc#,search and retrieve>> them very efficiently.
By way of examples: nested documents in Solr can be used to bind a blog post (parent document)
with comments (child documents) -- or as a way to model major product lines as parent documents,
with multiple types of child documents representing individual SKUs (with unique sizes / colors) and supporting documentation (either directly nested under the products, or under individual SKUs.
The "top most" parent with all children is referred to as a "root" document or formerly "block
document" and it explains some of the nomenclature of related features.
At query time, the <<other-parsers.adoc#block-join-query-parsers,Block Join Query Parsers>> can search these relationships,
and the `<<transforming-result-documents.adoc#child-childdoctransformerfactory,[child]>>` Document Transformer can attach child (or other "descendent") documents to the result documents.
In terms of performance, indexing the relationships between documents usually yields much faster queries than an equivalent "<<other-parsers#join-query-parser,query time join>>",
since the relationships are already stored in the index and do not need to be computed.
However, nested documents are less flexible than query time joins as it imposes rules that some applications may not be able to accept.
Nested documents may be indexed via either the XML or JSON data syntax, and is also supported by <<using-solrj.adoc#,SolrJ>> with javabin.
[CAUTION]
====
.Reindexing Considerations
With the exception of in-place updates, Solr must internally reindex an entire nested document tree
if there are updates to it. For some applications this may
result in a lot of extra indexing overhead that may not be worth the performance gains at query
time versus other modeling approaches.
====
In the examples on this page, the IDs of child documents are always provided. However, you need not
generate such IDs; you can let Solr populate them automatically. It will concatenate the ID of its
parent with a separator and path information that should be unique. Try it out for yourself!
[#example-indexing-syntax]
== Example Indexing Syntax: Pseudo-Fields
This example shows what it looks like to index two root "product" documents, each containing two
different types of child documents specified in "pseudo-fields": "skus" and "manuals". Two of the "sku" type documents have their own nested child "manuals" documents...
[NOTE]
====
Even though the child documents in these examples are provided syntactically as field values, this is simply a matter of syntax and as such `skus` and `manuals` are not actual fields in the documents. Consequently, these field names need not be defined in the schema and probably shouldn't be as it would be confusing. There is no "child document" field type.
====
//
// DO NOT MODIFY THESE EXAMPLE DOCS WITH OUT REVIEWING ALL PAGES THAT INCLUDE/REFER BACK TO THESE EXAMPLES
// INCLUDING THE SEMI-EQUIVALENT ANONYMOUS CHILDREN EXAMPLE AT THE BOTTOM OF THIS PAGE
//
[.dynamic-tabs]
--
[example.tab-pane#json]
====
[.tab-label]*JSON*
// tag::sample-indexing-deeply-nested-documents[]
[source,json]
----
[{ "id": "P11!prod",
"name_s": "Swingline Stapler",
"description_t": "The Cadillac of office staplers ...",
"skus": [ { "id": "P11!S21",
"color_s": "RED",
"price_i": 42,
"manuals": [ { "id": "P11!D41",
"name_s": "Red Swingline Brochure",
"pages_i":1,
"content_t": "..."
} ]
},
{ "id": "P11!S31",
"color_s": "BLACK",
"price_i": 3
} ],
"manuals": [ { "id": "P11!D51",
"name_s": "Quick Reference Guide",
"pages_i":1,
"content_t": "How to use your stapler ..."
},
{ "id": "P11!D61",
"name_s": "Warranty Details",
"pages_i":42,
"content_t": "... lifetime guarantee ..."
} ]
},
{ "id": "P22!prod",
"name_s": "Mont Blanc Fountain Pen",
"description_t": "A Premium Writing Instrument ...",
"skus": [ { "id": "P22!S22",
"color_s": "RED",
"price_i": 89,
"manuals": [ { "id": "P22!D42",
"name_s": "Red Mont Blanc Brochure",
"pages_i":1,
"content_t": "..."
} ]
},
{ "id": "P22!S32",
"color_s": "BLACK",
"price_i": 67
} ],
"manuals": [ { "id": "P22!D52",
"name_s": "How To Use A Pen",
"pages_i":42,
"content_t": "Start by removing the cap ..."
} ]
} ]
----
// end::sample-indexing-deeply-nested-documents[]
[CAUTION]
=====
The <<uploading-data-with-index-handlers#json-update-convenience-paths,`/update/json/docs` convenience path>> will automatically flatten complex JSON documents by default -- so to index nested JSON documents make sure to use `/update`.
=====
====
[example.tab-pane#xml]
====
[.tab-label]*XML*
[source,xml]
----
<add>
<doc>
<field name="id">P11!prod</field>
<field name="name_s">Swingline Stapler</field>
<field name="description_t">The Cadillac of office staplers ...</field>
<field name="skus">
<doc>
<field name="id">P11!S21</field>
<field name="color_s">RED</field>
<field name="price_i">42</field>
<field name="manuals">
<doc>
<field name="id">P11!D41</field>
<field name="name_s">Red Swingline Brochure</field>
<field name="pages_i">1</field>
<field name="content_t">...</field>
</doc>
</field>
</doc>
<doc>
<field name="id">P11!S31</field>
<field name="color_s">BLACK</field>
<field name="price_i">3</field>
</doc>
</field>
<field name="manuals">
<doc>
<field name="id">P11!D51</field>
<field name="name_s">Quick Reference Guide</field>
<field name="pages_i">1</field>
<field name="content_t">How to use your stapler ...</field>
</doc>
<doc>
<field name="id">P11!D61</field>
<field name="name_s">Warranty Details</field>
<field name="pages_i">42</field>
<field name="content_t">... lifetime guarantee ...</field>
</doc>
</field>
</doc>
<doc>
<field name="id">P22!prod</field>
<field name="name_s">Mont Blanc Fountain Pen</field>
<field name="description_t">A Premium Writing Instrument ...</field>
<field name="skus">
<doc>
<field name="id">P22!S22</field>
<field name="color_s">RED</field>
<field name="price_i">89</field>
<field name="manuals">
<doc>
<field name="id">P22!D42</field>
<field name="name_s">Red Mont Blanc Brochure</field>
<field name="pages_i">1</field>
<field name="content_t">...</field>
</doc>
</field>
</doc>
<doc>
<field name="id">P22!S32</field>
<field name="color_s">BLACK</field>
<field name="price_i">67</field>
</doc>
</field>
<field name="manuals">
<doc>
<field name="id">P22!D52</field>
<field name="name_s">How To Use A Pen</field>
<field name="pages_i">42</field>
<field name="content_t">Start by removing the cap ...</field>
</doc>
</field>
</doc>
</add>
----
====
[example.tab-pane#solrj]
====
[.tab-label]*SolrJ*
[source,java,indent=0]
----
include::{example-source-dir}IndexingNestedDocuments.java[tag=nest-path]
----
====
--
== Schema Configuration
Indexing nested documents _requires_ an indexed field named `\_root_`:
[source,xml]
----
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
----
* Solr automatically populates this field in _all_ documents with the `id` value of it's root document
-- it's highest ancestor, possibly itself.
* This field must be indexed (`indexed="true"`) but doesn't need to
be either stored (`stored="true"`) or use doc values (`docValues="true"`), however you are free
to do so if you find it useful. If you want to use `uniqueBlock(\_root_)`
<<json-facet-api#stat-facet-functions,field type limitation>>, then you should enable docValues.
Preferably, you will also define `\_nest_path_` which adds features and ease-of-use:
[source,xml]
----
<fieldType name="_nest_path_" class="solr.NestPathField" />
<field name="_nest_path_" type="_nest_path_" />`
----
* Solr automatically populates this field for any child document but not root documents.
* This field enables Solr to properly record & reconstruct the named and nested relationship of documents
when using the `<<searching-nested-documents.adoc#child-doc-transformer,[child]>>` doc transformer.
** If this field does not exist, the `[child]` transformer will return all descendent child documents as a flattened list -- just as if they had been <<#indexing-anonymous-children,indexed as anonymous children>>.
* If you do not use `\_nest_path_` it is strongly recommended that every document have some
field that differentiates root documents from their nested children -- and differentiates different "types" of child documents. This is not strictly necessary, so long as it's possible to write a "filter" query that can be used to isolate and select only parent documents for use in the <<other-parsers.adoc#block-join-query-parsers,block join query parsers>> and <<searching-nested-documents.adoc#child-doc-transformer,[child]>> doc transformer
* It's possible to query on this field, although at present it's only documented how to in the
context of `[child]`'s `childFilter` parameter.
You might optionally want to define `\_nest_parent_` to store parent IDs:
[source,xml]
----
<field name="_nest_parent_" type="string" indexed="true" stored="true" />
----
* Solr automatically populates this field in child documents but not root documents.
Finally, understand that nested child documents are very much documents in their own right even if certain nested
documents hold different information from the parent or other child documents, therefore:
* All field names in the schema can only be configured in one -- different types of child documents can not have the same field name configured in different ways.
* It may be infeasible to use `required` for any field names that aren't required for all types of
documents.
* Even child documents need a _globally_ unique `id`.
[TIP]
====
When using SolrCloud it is a _VERY_ good idea to use
<<shards-and-indexing-data-in-solrcloud#document-routing,prefix based compositeIds>> with a
common prefix for all documents in the nested document tree. This makes it much easier to apply
<<updating-parts-of-documents#updating-child-documents,atomic updates to individual child documents>>
====
== Maintaining Integrity with Updates and Deletes
Nested document trees can be modified with Solr's
<<updating-parts-of-documents#updating-child-documents,atomic/partial update>> feature to
manipulate any document in a nested tree, and even to add new child documents.
This aspect isn't different than updating any normal document -- Solr internally deletes the old
nested document tree and it adds the newly modified one.
Just be mindful to add a `_root_` field if the partial update is to a child doc so that Solr
knows which Root doc it's related to.
Solr demands that the `id` of _all_ documents in a collection be unique. Solr enforces this for
root documents within a shard but it doesn't for child documents to avoid the expense of checking.
Clients should be very careful to *never* violate this.
To delete an entire nested document tree, you can simply delete-by-ID using the `id` of the root
document. Delete-by-ID will not work with the `id` of a child document, since only root document
IDs are considered. Instead, use delete-by-query (most efficient) or
<<updating-parts-of-documents#updating-child-documents,atomic updates>> to remove the child document from it's parent.
If you use Solr's delete-by-query APIs, you *MUST* be careful to ensure that any deletion query
is structured to ensure no descendent children remain of any documents that are being deleted. *_Doing otherwise will violate integrity assumptions that Solr expects._*
== Indexing Anonymous Children
Although not recommended, it is also possible to index child documents "anonymously":
[.dynamic-tabs]
--
[example.tab-pane#anon_json]
====
[.tab-label]*JSON*
[source,json]
----
[{ "id": "P11!prod",
"name_s": "Swingline Stapler",
"type_s": "PRODUCT",
"description_t": "The Cadillac of office staplers ...",
"_childDocuments_": [
{ "id": "P11!S21",
"type_s": "SKU",
"color_s": "RED",
"price_i": 42,
"_childDocuments_": [
{ "id": "P11!D41",
"type_s": "MANUAL",
"name_s": "Red Swingline Brochure",
"pages_i":1,
"content_t": "..."
} ]
},
{ "id": "P11!S31",
"type_s": "SKU",
"color_s": "BLACK",
"price_i": 3
},
{ "id": "P11!D51",
"type_s": "MANUAL",
"name_s": "Quick Reference Guide",
"pages_i":1,
"content_t": "How to use your stapler ..."
},
{ "id": "P11!D61",
"type_s": "MANUAL",
"name_s": "Warranty Details",
"pages_i":42,
"content_t": "... lifetime guarantee ..."
}
]
} ]
----
====
[example.tab-pane#anon_xml]
====
[.tab-label]*XML*
[source,xml]
----
<add>
<doc>
<field name="id">P11!prod</field>
<field name="type_s">PRODUCT</field>
<field name="name_s">Swingline Stapler</field>
<field name="description_t">The Cadillac of office staplers ...</field>
<doc>
<field name="id">P11!S21</field>
<field name="type_s">SKU</field>
<field name="color_s">RED</field>
<field name="price_i">42</field>
<doc>
<field name="id">P11!D41</field>
<field name="type_s">MANUAL</field>
<field name="name_s">Red Swingline Brochure</field>
<field name="pages_i">1</field>
<field name="content_t">...</field>
</doc>
</doc>
<doc>
<field name="id">P11!S31</field>
<field name="type_s">SKU</field>
<field name="color_s">BLACK</field>
<field name="price_i">3</field>
</doc>
<doc>
<field name="id">P11!D51</field>
<field name="type_s">MANUAL</field>
<field name="name_s">Quick Reference Guide</field>
<field name="pages_i">1</field>
<field name="content_t">How to use your stapler ...</field>
</doc>
<doc>
<field name="id">P11!D61</field>
<field name="type_s">MANUAL</field>
<field name="name_s">Warranty Details</field>
<field name="pages_i">42</field>
<field name="content_t">... lifetime guarantee ...</field>
</doc>
</doc>
</add>
----
====
[example.tab-pane#anon_solrj]
====
[.tab-label]*SolrJ*
[source,java,indent=0]
----
include::{example-source-dir}IndexingNestedDocuments.java[tag=anon-kids]
----
====
--
This simplified approach was common in older versions of Solr, and can still be used with "Root-Only" schemas that do not contain any other nested related fields apart from `\_root_`. (Many schemas in existence are this way simply because default configsets are this way, even if the application isn't using nested documents.)
This approach should *NOT* be used when schemas include a `\_nest_path_` field, as the existence of that field triggers assumptions and changes in behavior in various query time functionality, such as the <<searching-nested-documents.adoc#child-doc-transformer,[child]>>, that will not work when nested documents do not have any intrinsic "nested path" information.
The results of indexing anonymous nested children with a "Root-Only" schema are similar to what
happens if you attempt to index "pseudo field" nested documents using a "Root-Only" schema.
Notably: since there is no nested path information for the
<<searching-nested-documents.adoc#child-doc-transformer,[child]>> transformer to use to reconstruct the structure of a nest
of documents, it returns all matching children as a flat list, similar in structure to how they were originally indexed:
[.dynamic-tabs]
--
[example.tab-pane#anon_json_out]
====
[.tab-label]*JSON*
[source,bash]
----
$ curl --globoff 'http://localhost:8983/solr/gettingstarted/select?omitHeader=true&q=id:P11!prod&fl=*,[child%20parentFilter=%22type_s:PRODUCT%22]'
{
"response":{"numFound":1,"start":0,"maxScore":0.7002023,"numFoundExact":true,"docs":[
{
"id":"P11!prod",
"name_s":"Swingline Stapler",
"type_s":"PRODUCT",
"description_t":"The Cadillac of office staplers ...",
"_version_":1673055562829398016,
"_childDocuments_":[
{
"id":"P11!D41",
"type_s":"MANUAL",
"name_s":"Red Swingline Brochure",
"pages_i":1,
"content_t":"...",
"_version_":1673055562829398016},
{
"id":"P11!S21",
"type_s":"SKU",
"color_s":"RED",
"price_i":42,
"_version_":1673055562829398016},
{
"id":"P11!S31",
"type_s":"SKU",
"color_s":"BLACK",
"price_i":3,
"_version_":1673055562829398016},
{
"id":"P11!D51",
"type_s":"MANUAL",
"name_s":"Quick Reference Guide",
"pages_i":1,
"content_t":"How to use your stapler ...",
"_version_":1673055562829398016},
{
"id":"P11!D61",
"type_s":"MANUAL",
"name_s":"Warranty Details",
"pages_i":42,
"content_t":"... lifetime guarantee ...",
"_version_":1673055562829398016}]}]
}}
----
====
[example.tab-pane#anon_xml_out]
====
[.tab-label]*XML*
[source,bash]
----
$ curl --globoff 'http://localhost:8983/solr/gettingstarted/select?omitHeader=true&q=id:P11!prod&fl=*,[child%20parentFilter=%22type_s:PRODUCT%22]&wt=xml'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<result name="response" numFound="1" start="0" maxScore="0.7002023" numFoundExact="true">
<doc>
<str name="id">P11!prod</str>
<str name="name_s">Swingline Stapler</str>
<str name="type_s">PRODUCT</str>
<str name="description_t">The Cadillac of office staplers ...</str>
<long name="_version_">1673055562829398016</long>
<doc>
<str name="id">P11!D41</str>
<str name="type_s">MANUAL</str>
<str name="name_s">Red Swingline Brochure</str>
<int name="pages_i">1</int>
<str name="content_t">...</str>
<long name="_version_">1673055562829398016</long></doc>
<doc>
<str name="id">P11!S21</str>
<str name="type_s">SKU</str>
<str name="color_s">RED</str>
<int name="price_i">42</int>
<long name="_version_">1673055562829398016</long></doc>
<doc>
<str name="id">P11!S31</str>
<str name="type_s">SKU</str>
<str name="color_s">BLACK</str>
<int name="price_i">3</int>
<long name="_version_">1673055562829398016</long></doc>
<doc>
<str name="id">P11!D51</str>
<str name="type_s">MANUAL</str>
<str name="name_s">Quick Reference Guide</str>
<int name="pages_i">1</int>
<str name="content_t">How to use your stapler ...</str>
<long name="_version_">1673055562829398016</long></doc>
<doc>
<str name="id">P11!D61</str>
<str name="type_s">MANUAL</str>
<str name="name_s">Warranty Details</str>
<int name="pages_i">42</int>
<str name="content_t">... lifetime guarantee ...</str>
<long name="_version_">1673055562829398016</long></doc></doc>
</result>
</response>
----
====
--