| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. default-domain:: cpp |
| .. highlight:: cpp |
| |
| .. cpp:namespace:: arrow::csv |
| |
| ================= |
| Reading CSV files |
| ================= |
| |
| Arrow provides a fast CSV reader allowing ingestion of external data |
| as Arrow tables. |
| |
| .. seealso:: |
| :ref:`CSV reader API reference <cpp-api-csv>`. |
| |
| Basic usage |
| =========== |
| |
| A CSV file is read from a :class:`~arrow::io::InputStream`. |
| |
| .. code-block:: cpp |
| |
| #include "arrow/csv/api.h" |
| |
| { |
| // ... |
| arrow::io::IOContext io_context = arrow::io::default_io_context(); |
| std::shared_ptr<arrow::io::InputStream> input = ...; |
| |
| auto read_options = arrow::csv::ReadOptions::Defaults(); |
| auto parse_options = arrow::csv::ParseOptions::Defaults(); |
| auto convert_options = arrow::csv::ConvertOptions::Defaults(); |
| |
| // Instantiate TableReader from input stream and options |
| auto maybe_reader = |
| arrow::csv::TableReader::Make(io_context, |
| input, |
| read_options, |
| parse_options, |
| convert_options); |
| if (!maybe_reader.ok()) { |
| // Handle TableReader instantiation error... |
| } |
| std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader; |
| |
| // Read table from CSV file |
| auto maybe_table = reader->Read(); |
| if (!maybe_table.ok()) { |
| // Handle CSV read error |
| // (for example a CSV syntax error or failed type conversion) |
| } |
| std::shared_ptr<arrow::Table> table = *maybe_table; |
| } |
| |
| Column names |
| ============ |
| |
| There are three ways to determine the column names of the resulting table: |
| |
| * By default, the column names are read from the first row in the CSV file |
| * If :member:`ReadOptions::column_names` is set, it forces the column |
| names in the table to these values (the first row in the CSV file is |
| read as data) |
| * If :member:`ReadOptions::autogenerate_column_names` is true, column names |
| will be autogenerated with the pattern "f0", "f1"... (the first row in the |
| CSV file is read as data) |
| |
| Column selection |
| ================ |
| |
| By default, Arrow reads all columns in the CSV file. You can narrow the |
| selection of columns with the :member:`ConvertOptions::include_columns` |
| option. If some columns in :member:`ConvertOptions::include_columns` |
| are missing from the CSV file, an error will be emitted unless |
| :member:`ConvertOptions::include_missing_columns` is true, in which case |
| the missing columns are assumed to contain all-null values. |
| |
| Interaction with column names |
| ----------------------------- |
| |
| If both :member:`ReadOptions::column_names` and |
| :member:`ConvertOptions::include_columns` are specified, |
| the :member:`ReadOptions::column_names` are assumed to map to CSV columns, |
| and :member:`ConvertOptions::include_columns` is a subset of those column |
| names that will be part of the Arrow Table. |
| |
| Data types |
| ========== |
| |
| By default, the CSV reader infers the most appropriate data type for each |
| column. Type inference considers the following data types, in order: |
| |
| * Null |
| * Int64 |
| * Boolean |
| * Date32 |
| * Timestamp (with seconds unit) |
| * Float64 |
| * Dictionary<String> (if :member:`ConvertOptions::auto_dict_encode` is true) |
| * Dictionary<Binary> (if :member:`ConvertOptions::auto_dict_encode` is true) |
| * String |
| * Binary |
| |
| It is possible to override type inference for select columns by setting |
| the :member:`ConvertOptions::column_types` option. Explicit data types |
| can be chosen from the following list: |
| |
| * Null |
| * All Integer types |
| * Float32 and Float64 |
| * Decimal128 |
| * Boolean |
| * Date32 and Date64 |
| * Timestamp |
| * Binary and Large Binary |
| * String and Large String (with optional UTF8 input validation) |
| * Fixed-Size Binary |
| * Dictionary with index type Int32 and value type one of the following: |
| Binary, String, LargeBinary, LargeString, Int32, UInt32, Int64, UInt64, |
| Float32, Float64, Decimal128 |
| |
| Other data types do not support conversion from CSV values and will error out. |
| |
| Dictionary inference |
| -------------------- |
| |
| If type inference is enabled and :member:`ConvertOptions::auto_dict_encode` |
| is true, the CSV reader first tries to convert string-like columns to a |
| dictionary-encoded string-like array. It switches to a plain string-like |
| array when the threshold in :member:`ConvertOptions::auto_dict_max_cardinality` |
| is reached. |
| |
| Nulls |
| ----- |
| |
| Null values are recognized from the spellings stored in |
| :member:`ConvertOptions::null_values`. The :func:`ConvertOptions::Defaults` |
| factory method will initialize a number of conventional null spellings such |
| as ``N/A``. |
| |
| Character encoding |
| ------------------ |
| |
| CSV files are expected to be encoded in UTF-8. However, non-UTF-8 data |
| is accepted for Binary columns. |
| |
| Performance |
| =========== |
| |
| By default, the CSV reader will parallelize reads in order to exploit all |
| CPU cores on your machine. You can change this setting in |
| :member:`ReadOptions::use_threads`. A reasonable expectation is at least |
| 100 MB/s per core on a performant desktop or laptop computer (measured in |
| source CSV bytes, not target Arrow data bytes). |