| <?xml version="1.0"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <document> |
| <properties> |
| <title>User Guide</title> |
| <author email="dev@commons.apache.org">Apache Commons Documentation Team</author> |
| </properties> |
| <body> |
| <!-- ================================================== --> |
| |
| <h1>Apache Commons CSV User Guide</h1> |
| |
| <macro name="toc"> |
| </macro> |
| |
| <section name="Parsing files"> |
| |
| Parsing files with Apache Commons CSV is relatively straightforward. |
| The CSVFormat class provides some commonly used CSV variants: |
| |
| <dl> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#DEFAULT">DEFAULT</a></dt><dd>Standard Comma Separated Value format, as per RFC 4180 but allowing empty lines.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#EXCEL">EXCEL</a></dt><dd>The Microsoft Excel CSV format.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#INFORMIX_UNLOAD">INFORMIX_UNLOAD<sup>1.3</sup></a></dt><dd>Informix <a href="http://www.ibm.com/support/knowledgecenter/SSBJG3_2.5.0/com.ibm.gen_busug.doc/c_fgl_InOutSql_UNLOAD.htm">UNLOAD</a> format used by the <code>UNLOAD TO file_name</code> operation.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#INFORMIX_UNLOAD_CSV">INFORMIX_UNLOAD_CSV<sup>1.3</sup></a></dt><dd>Informix <a href="http://www.ibm.com/support/knowledgecenter/SSBJG3_2.5.0/com.ibm.gen_busug.doc/c_fgl_InOutSql_UNLOAD.htm">CSV UNLOAD</a> format used by the <code>UNLOAD TO file_name</code> operation (escaping is disabled).</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#MONGODB_CSV">MONGODB_CSV<sup>1.7</sup></a></dt><dd>MongoDB CSV format used by the <code>mongoexport</code> operation.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#MONGODB_TSV">MONGODB_TSV<sup>1.7</sup></a></dt><dd>MongoDB TSV format used by the <code>mongoexport</code> operation.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#MYSQL">MYSQL</a></dt><dd>The MySQL CSV format.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#ORACLE">ORACLE<sup>1.6</sup></a></dt><dd>Default Oracle format used by the SQL*Loader utility.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#POSTGRESQL_CSV">POSTGRESQL_CSV<sup>1.5</sup></a></dt><dd>Default PostgreSQL CSV format used by the COPY operation.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#POSTGRESQL_TEXT">POSTGRESQL_TEXT<sup>1.5</sup></a></dt><dd>Default PostgreSQL text format used by the COPY operation.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#RFC4180">RFC4180</a></dt><dd>The format defined by <a href="https://datatracker.ietf.org/doc/html/rfc4180">RFC 4180</a>.</dd> |
| <dt><a href="https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#TDF">TDF</a></dt><dd>A tab delimited format.</dd> |
| </dl> |
| |
| <subsection name="Example: Parsing an Excel CSV File"> |
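| <p>To parse an Excel CSV file whose first record contains the column names, write:</p> |
| <source>Reader in = new FileReader("path/to/file.csv"); |
| // setHeader() with no arguments reads the header names from the first record; |
| // without a header, record.get(String) throws an IllegalStateException. |
| Iterable<CSVRecord> records = CSVFormat.EXCEL.builder() |
| .setHeader() |
| .build() |
| .parse(in); |
| for (CSVRecord record : records) { |
| String lastName = record.get("Last Name"); |
| String firstName = record.get("First Name"); |
| } |
| </source> |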
| </subsection> |
| <subsection name="Handling Byte Order Marks"> |
| <p> |
| Some files, such as certain Excel CSV exports, start with a Byte Order Mark (BOM). These optional bytes need |
| an extra step; otherwise they become part of the first value read. |
| One option is the |
| <a href="https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html"> |
| BOMInputStream |
| </a> |
| class from |
| <a href="https://commons.apache.org/proper/commons-io/">Apache Commons IO</a>, |
| for example: |
| </p> |
| <source>final URL url = ...; |
| try (final Reader reader = new InputStreamReader(new BOMInputStream(url.openStream()), StandardCharsets.UTF_8); |
| final CSVParser parser = CSVFormat.EXCEL.builder() |
| .setHeader() |
| .build() |
| .parse(reader)) { |
| for (final CSVRecord record : parser) { |
| final String string = record.get("SomeColumn"); |
| ... |
| } |
| } |
| </source> |
| <p> |
| You might find it handy to create something like this: |
| </p> |
| <source>/** |
| * Creates a reader capable of handling BOMs. |
| */ |
| public InputStreamReader newReader(final InputStream inputStream) { |
| return new InputStreamReader(new BOMInputStream(inputStream), StandardCharsets.UTF_8); |
| } |
| </source> |
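| <p> |
| For readers curious about what skipping a BOM involves, the following is a minimal sketch that uses only the JDK (Java 9 or later). |
| The class name <code>Utf8BomSkipper</code> and the method <code>skipUtf8Bom</code> are hypothetical and not part of |
| Commons CSV or Commons IO. The sketch handles only the UTF-8 BOM and is not meant as a replacement for BOMInputStream: |
| </p> |

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

/** Hypothetical helper, for illustration only. */
public final class Utf8BomSkipper {

    // The three bytes of a UTF-8 encoded BOM (U+FEFF).
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    private Utf8BomSkipper() {
    }

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    public static InputStream skipUtf8Bom(final InputStream in) throws IOException {
        final PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        final byte[] head = new byte[UTF8_BOM.length];
        // readNBytes returns fewer than three bytes only at end of stream.
        final int n = pushback.readNBytes(head, 0, head.length);
        // No BOM (or a stream shorter than a BOM): push the bytes back
        // so the caller reads them as ordinary data.
        if (!Arrays.equals(head, UTF8_BOM)) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

| <p> |
| The returned stream can then be wrapped in an <code>InputStreamReader</code> with <code>StandardCharsets.UTF_8</code> and passed to the parser as in the examples above. |
| </p> |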
| </subsection> |
| </section> |
| |
| <section name="Working with headers"> |
| |
| Apache Commons CSV provides several ways to access record values. |
| The simplest way is to access values by their index in the record. |
| However, columns in CSV files often have a name, for example: ID, CustomerNo, Birthday, etc. |
| The CSVFormat class provides an API for specifying these <i>header</i> names, and CSVRecord has |
| methods to access values by their corresponding header name. |
| |
| <subsection name="Accessing column values by index"> |
| To access a record value by index, no special configuration of the CSVFormat is necessary: |
| <source>Reader in = new FileReader("path/to/file.csv"); |
| Iterable<CSVRecord> records = CSVFormat.RFC4180.parse(in); |
| for (CSVRecord record : records) { |
| String columnOne = record.get(0); |
| String columnTwo = record.get(1); |
| } |
| </source> |
| </subsection> |
| <subsection name="Defining a header manually"> |
| Indices may not be the most intuitive way to access record values. For this reason it is possible to |
| assign names to each column in the file: |
| <source>Reader in = new FileReader("path/to/file.csv"); |
| Iterable<CSVRecord> records = CSVFormat.RFC4180.builder() |
| .setHeader("ID", "CustomerNo", "Name") |
| .build() |
| .parse(in); |
| for (CSVRecord record : records) { |
| String id = record.get("ID"); |
| String customerNo = record.get("CustomerNo"); |
| String name = record.get("Name"); |
| } |
| </source> |
| Note that column values can still be accessed using their index. |
| </subsection> |
| <subsection name="Using an enum to define a header"> |
| Using String values all over the code to reference columns can be error-prone. For this reason, |
| it is possible to define an enum to specify header names. Note that the enum constant names are |
| used to access column values. This may lead to enum constant names that do not follow the Java |
| coding convention of defining constants in upper case with underscores: |
| <source>public enum Headers { |
| ID, CustomerNo, Name |
| } |
| Reader in = new FileReader("path/to/file.csv"); |
| Iterable<CSVRecord> records = CSVFormat.RFC4180.builder() |
| .setHeader(Headers.class) |
| .build() |
| .parse(in); |
| for (CSVRecord record : records) { |
| String id = record.get(Headers.ID); |
| String customerNo = record.get(Headers.CustomerNo); |
| String name = record.get(Headers.Name); |
| } |
| </source> |
| Again, values can still be accessed by their index or by a String (for example "CustomerNo"). |
| </subsection> |
| <subsection name="Header auto detection"> |
| Some CSV files define header names in their first record. If configured, Apache Commons CSV can parse |
| the header names from the first record: |
| <source>Reader in = new FileReader("path/to/file.csv"); |
| Iterable<CSVRecord> records = CSVFormat.RFC4180.builder() |
| .setHeader() |
| .setSkipHeaderRecord(true) |
| .build() |
| .parse(in); |
| for (CSVRecord record : records) { |
| String id = record.get("ID"); |
| String customerNo = record.get("CustomerNo"); |
| String name = record.get("Name"); |
| } |
| </source> |
| This will use the values from the first record as header names and skip the first record when iterating. |
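| <p> |
| As a self-contained sketch of header auto detection, the following parses CSV text from a String and reports the detected |
| header names through <code>CSVParser.getHeaderNames()</code> (available since 1.7). The class name <code>HeaderAutoDetection</code>, |
| the method <code>describe</code> and the sample data are assumptions made for illustration: |
| </p> |

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

/** Hypothetical example class, for illustration only. */
public final class HeaderAutoDetection {

    private HeaderAutoDetection() {
    }

    /** Returns the detected header names followed by each CustomerNo value. */
    public static String describe(final String csv) throws IOException {
        final CSVFormat format = CSVFormat.RFC4180.builder()
            .setHeader()                // no names given: read them from the first record
            .setSkipHeaderRecord(true)
            .build();
        final StringBuilder result = new StringBuilder();
        try (CSVParser parser = format.parse(new StringReader(csv))) {
            // The header names are available before iterating over the records.
            result.append(parser.getHeaderNames());
            for (final CSVRecord record : parser) {
                result.append(' ').append(record.get("CustomerNo"));
            }
        }
        return result.toString();
    }
}
```

| <p> |
| Calling <code>describe("ID,CustomerNo,Name\r\n1,C-100,Alice\r\n")</code> returns <code>[ID, CustomerNo, Name] C-100</code>: the first record supplies the names and is not returned as data. |
| </p> |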
| </subsection> |
| <subsection name="Printing with headers"> |
| <p> |
| To print a CSV file with headers, you specify the headers in the format: |
| </p> |
| <source>final Appendable out = ...; |
| final CSVPrinter printer = CSVFormat.DEFAULT.builder() |
| .setHeader("H1", "H2") |
| .build() |
| .print(out); |
| </source> |
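| <p> |
| A complete variant of the snippet above might look like the following sketch, which prints into a <code>StringBuilder</code> |
| so the result can be inspected. The class name <code>PrintWithHeaders</code>, the method <code>printSample</code> and the record |
| values are assumptions made for illustration: |
| </p> |

```java
import java.io.IOException;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

/** Hypothetical example class, for illustration only. */
public final class PrintWithHeaders {

    private PrintWithHeaders() {
    }

    /** Prints a header row and two records, and returns the CSV text. */
    public static String printSample() throws IOException {
        final StringBuilder out = new StringBuilder();
        final CSVFormat format = CSVFormat.DEFAULT.builder()
            .setHeader("H1", "H2")
            .build();
        // The header row is written as soon as the printer is created.
        try (CSVPrinter printer = format.print(out)) {
            printer.printRecord("a", "b");
            printer.printRecord("c", "d");
        }
        return out.toString();
    }
}
```

| <p> |
| With <code>CSVFormat.DEFAULT</code>, records are separated by CRLF, so the result is <code>H1,H2</code>, <code>a,b</code> and <code>c,d</code>, each terminated by <code>\r\n</code>. |
| </p> |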
| <p> |
| To print a CSV file with JDBC column labels, you specify the ResultSet in the format: |
| </p> |
| <source>try (final ResultSet resultSet = ...) { |
| final CSVPrinter printer = CSVFormat.DEFAULT.builder() |
| .setHeader(resultSet) |
| .build() |
| .print(out); |
| } |
| </source> |
| </subsection> |
| </section> |
| </body> |
| </document> |