title: corpora
parent: Workload Reference
grand_parent: Reference
nav_order: 70

corpora

The `corpora` array defines the datasets that a workload indexes. Each corpus has a name and a `documents` list that describes one or more data files.

Syntax

{
  "corpora": [
    {
      "name": "<corpus-name>",
      "documents": [
        {
          "source-file": "<file>",
          "document-count": <n>,
          "compressed-bytes": <bytes>,
          "uncompressed-bytes": <bytes>,
          "target-collection": "<collection-name>"
        }
      ]
    }
  ]
}

Fields

Field	Type	Required	Description
`name`	string	Yes	Corpus name, referenced from `bulk-index` operations.
`documents`	array	Yes	List of document file descriptors. The remaining fields in this table belong to each descriptor.
`source-file`	string	Yes	Path to the compressed NDJSON data file, relative to the workload directory.
`document-count`	integer	Yes	Number of documents in the file (used for progress display and `--test-mode` limits).
`compressed-bytes`	integer	No	Compressed file size in bytes (for download progress display).
`uncompressed-bytes`	integer	No	Uncompressed file size in bytes.
`target-collection`	string	No	Target collection name. Defaults to the workload's primary collection.
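The required fields above can be checked before running a workload. The following is a minimal sketch, not part of the tool itself: `find_missing_fields` is a hypothetical helper, and the required keys are taken directly from the table.

```python
import json

# Required keys as listed in the Fields table.
# find_missing_fields is a hypothetical helper, not part of the benchmark tool.
REQUIRED_CORPUS_KEYS = ("name", "documents")
REQUIRED_DOCUMENT_KEYS = ("source-file", "document-count")


def find_missing_fields(workload):
    """Return a list of human-readable problems found in workload["corpora"]."""
    problems = []
    for i, corpus in enumerate(workload.get("corpora", [])):
        # Fall back to a positional label when the corpus has no name.
        label = corpus.get("name", f"corpora[{i}]")
        for key in REQUIRED_CORPUS_KEYS:
            if key not in corpus:
                problems.append(f"{label}: missing '{key}'")
        for j, doc in enumerate(corpus.get("documents", [])):
            for key in REQUIRED_DOCUMENT_KEYS:
                if key not in doc:
                    problems.append(f"{label}.documents[{j}]: missing '{key}'")
    return problems


# A definition that omits the required document-count field.
workload = json.loads("""
{
  "corpora": [
    {
      "name": "nyc_taxis",
      "documents": [
        {"source-file": "files/data.json.gz"}
      ]
    }
  ]
}
""")

print(find_missing_fields(workload))
```

Optional fields such as `compressed-bytes` are deliberately not checked, since the table marks them as not required.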

Data file format

Data files must be gzip-compressed NDJSON (newline-delimited JSON), with one JSON document per line. Each document should include an `id` field matching the unique key field in your Solr schema:

{"id": "1", "title": "My document", "timestamp": "2024-01-01T00:00:00Z"}
{"id": "2", "title": "Another document", "timestamp": "2024-01-02T00:00:00Z"}
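One way to produce a data file in this format, and to collect the numbers that the descriptor fields expect, is sketched below using only the Python standard library. The `write_corpus` function and its parameters are hypothetical names chosen for illustration; the sketch assumes that `uncompressed-bytes` is the byte size of the NDJSON text before compression.

```python
import gzip
import json
import os
import tempfile


def write_corpus(docs, workload_dir, source_file):
    """Write docs as gzip-compressed NDJSON and return a document descriptor.

    Hypothetical helper: source_file is a path relative to workload_dir.
    """
    path = os.path.join(workload_dir, source_file)
    os.makedirs(os.path.dirname(path), exist_ok=True)

    count = 0
    uncompressed = 0
    with gzip.open(path, "wb") as f:
        for doc in docs:
            # One JSON document per line, terminated by a newline.
            line = (json.dumps(doc) + "\n").encode("utf-8")
            f.write(line)
            uncompressed += len(line)
            count += 1

    return {
        "source-file": source_file,
        "document-count": count,
        "compressed-bytes": os.path.getsize(path),
        "uncompressed-bytes": uncompressed,
    }


docs = [
    {"id": "1", "title": "My document", "timestamp": "2024-01-01T00:00:00Z"},
    {"id": "2", "title": "Another document", "timestamp": "2024-01-02T00:00:00Z"},
]

with tempfile.TemporaryDirectory() as workload_dir:
    descriptor = write_corpus(docs, workload_dir, "files/data.json.gz")
    print(json.dumps(descriptor, indent=2))
```

The returned dictionary can be pasted into the `documents` list of a corpus. The sizes are measured from the file that was actually written, so they stay consistent with the data.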

Example

{
  "corpora": [
    {
      "name": "nyc_taxis",
      "documents": [
        {
          "source-file": "files/data.json.gz",
          "document-count": 165346692,
          "compressed-bytes": 4917851637,
          "uncompressed-bytes": 74818096036
        }
      ]
    }
  ]
}
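As a quick plausibility check on descriptor values like the ones above, the sizes and counts can be combined into an average document size and a compression ratio. This is a sketch only: `summarize` is a hypothetical helper, and the derived figures are not fields that the workload format defines.

```python
# The corpus from the example above, as a Python dictionary.
corpus = {
    "name": "nyc_taxis",
    "documents": [
        {
            "source-file": "files/data.json.gz",
            "document-count": 165346692,
            "compressed-bytes": 4917851637,
            "uncompressed-bytes": 74818096036,
        }
    ],
}


def summarize(corpus):
    """Aggregate the descriptors of one corpus (hypothetical helper)."""
    docs = corpus["documents"]
    total_docs = sum(d["document-count"] for d in docs)
    # The byte sizes are optional, so treat missing values as zero.
    compressed = sum(d.get("compressed-bytes", 0) for d in docs)
    uncompressed = sum(d.get("uncompressed-bytes", 0) for d in docs)
    return {
        "documents": total_docs,
        "avg_bytes_per_document": uncompressed / total_docs if total_docs else 0.0,
        "compression_ratio": uncompressed / compressed if compressed else None,
    }


summary = summarize(corpus)
print(summary)
```

An average document size or compression ratio that looks implausible for the data usually points to a mistyped byte count or document count in the descriptor.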