parquet-cli - parquet-mr

Building

You can build this project using Maven:

mvn clean install -DskipTests

Running

The build produces a shaded JAR that can be run using the hadoop command:

hadoop jar parquet-cli-1.12.3-runtime.jar org.apache.parquet.cli.Main

For a shorter command-line invocation, add an alias to your shell like this:

alias parquet="hadoop jar /path/to/parquet-cli-1.12.3-runtime.jar org.apache.parquet.cli.Main --dollar-zero parquet"
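
Aliases are generally not expanded in non-interactive shells, so a shell function may be more convenient when the tool is called from scripts. The following is a minimal sketch to use in place of the alias, not part of parquet-cli itself; the variable name PARQUET_CLI_JAR is an assumption made for illustration and should point at the runtime jar produced by the build.

```shell
# PARQUET_CLI_JAR is a hypothetical variable name chosen for this sketch.
# It defaults to the placeholder path used in the alias above.
PARQUET_CLI_JAR="${PARQUET_CLI_JAR:-/path/to/parquet-cli-1.12.3-runtime.jar}"

# Wrapper function: forwards all arguments to the CLI main class.
# Define this instead of the alias, not in addition to it.
parquet() {
  hadoop jar "$PARQUET_CLI_JAR" org.apache.parquet.cli.Main --dollar-zero parquet "$@"
}
```

With the function defined, `parquet help` behaves the same as with the alias, and arguments containing spaces are passed through unchanged because of the quoted "$@".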

Running without Hadoop

To run from the target directory instead of using the hadoop command, first copy the dependencies to a folder:

mvn dependency:copy-dependencies

Then run the command-line tool with target/dependency/* (the default output directory of copy-dependencies) added to the classpath:

java -cp 'target/parquet-cli-1.12.3.jar:target/dependency/*' org.apache.parquet.cli.Main

Note that you should not put the runtime jar used above on the classpath in this case. In that jar, the org.apache.avro package is relocated to avoid a conflict with Hadoop's copy of Avro. The relocation changes method signatures, so depending on the class loading order it can cause a NoSuchMethodError. See PARQUET-2142 for details.
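
The classpath rule in the note above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name parquet_cli_classpath is hypothetical, and it assumes the jars sit in the build's target directory with the dependencies copied to target/dependency.

```shell
# Hypothetical helper: print a classpath for running parquet-cli without Hadoop.
# It collects parquet-cli jars from the target directory, leaves out the shaded
# "-runtime" jar (which relocates org.apache.avro), and appends dependency/*.
parquet_cli_classpath() {
  target_dir="${1:-target}"
  cp=""
  for jar in "$target_dir"/parquet-cli-*.jar; do
    [ -e "$jar" ] || continue            # glob matched nothing
    case "$jar" in
      *-runtime.jar) continue ;;         # skip the shaded runtime jar
    esac
    cp="${cp:+$cp:}$jar"
  done
  if [ -z "$cp" ]; then
    echo "no parquet-cli jar found in $target_dir" >&2
    return 1
  fi
  # The trailing /* is left unexpanded here; java expands it itself.
  printf '%s:%s/dependency/*\n' "$cp" "$target_dir"
}
```

A possible invocation is `java -cp "$(parquet_cli_classpath target)" org.apache.parquet.cli.Main`; the double quotes keep the shell from expanding the wildcard before java sees it.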

Help

The parquet tool includes help for the included commands:

parquet help

Usage: parquet [options] [command] [command options]

  Options:

    -v, --verbose, --debug
        Print extra debugging information

  Commands:

    help
        Retrieves details on the functions of other commands
    meta
        Print a Parquet file's metadata
    pages
        Print page summaries for a Parquet file
    dictionary
        Print dictionaries for a Parquet column
    check-stats
        Check Parquet files for corrupt page and column stats (PARQUET-251)
    schema
        Print the Avro schema for a file
    csv-schema
        Build a schema from a CSV data sample
    convert-csv
        Create a file from CSV data
    convert
        Create a Parquet file from a data file
    to-avro
        Create an Avro file from a data file
    cat
        Print the first N records from a file
    head
        Print the first N records from a file
    column-index
        Prints the column and offset indexes of a Parquet file
    column-size
        Print the column sizes of a parquet file
    prune
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.
    trans-compression
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Translate the compression from one to another (It doesn't support bloom filter feature yet).
    masking
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Replace columns with masked values and write to a new Parquet file
    footer
        Print the Parquet file footer in json format
    bloom-filter
        Check bloom filters for a Parquet column
    scan
        Scan all records from a file
    rewrite
        Rewrite one or more Parquet files to a new Parquet file

  Examples:

    # print information for meta
    parquet help meta

  See 'parquet help <command>' for more information on a specific command.