| //// |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| //// |
| |
| +sqoop-import-mainframe+ |
| ------------------------ |
| |
| Purpose |
| ~~~~~~~ |
| |
| include::import-mainframe-purpose.txt[] |
| |
| Syntax |
| ~~~~~~ |
| |
| ---- |
| $ sqoop import-mainframe (generic-args) (import-args) |
| $ sqoop-import-mainframe (generic-args) (import-args) |
| ---- |
| |
| While the Hadoop generic arguments must precede any import arguments, |
| you can type the import arguments in any order with respect to one |
| another. |
| |
| include::mainframe-common-args.txt[] |
| |
| include::connecting-to-mainframe.txt[] |
| |
| .Import control arguments: |
| [grid="all"] |
| `---------------------------------`-------------------------------------- |
| Argument Description |
| ------------------------------------------------------------------------- |
| +\--as-avrodatafile+ Imports data to Avro Data Files |
| +\--as-sequencefile+ Imports data to SequenceFiles |
| +\--as-textfile+ Imports data as plain text (default) |
| +\--as-parquetfile+ Imports data to Parquet Files |
| +\--as-binaryfile+ Imports data as binary files |
| +\--delete-target-dir+ Delete the import target directory\ |
| if it exists |
| +-m,\--num-mappers <n>+ Use 'n' map tasks to import in parallel |
| +\--target-dir <dir>+ HDFS destination dir |
| +\--warehouse-dir <dir>+ HDFS parent for table destination |
| +-z,\--compress+ Enable compression |
| +\--compression-codec <c>+ Use Hadoop codec (default gzip) |
| ------------------------------------------------------------------------- |
| |
| Selecting the Files to Import |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| You can use the +\--dataset+ argument to specify a partitioned dataset name. |
| All sequential datasets in the partitioned dataset will be imported. |
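
For example, a minimal sketch (the host and dataset names are placeholders)
that imports all sequential datasets in the partitioned dataset +MYPDS+:

----
$ sqoop import-mainframe --connect <host> --username someuser -P \
    --dataset MYPDS
----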
| |
| Controlling Parallelism |
| ^^^^^^^^^^^^^^^^^^^^^^^ |
| |
Sqoop imports data in parallel by making multiple FTP connections to the
| mainframe to transfer multiple files simultaneously. You can specify the |
| number of map tasks (parallel processes) to use to perform the import by |
| using the +-m+ or +\--num-mappers+ argument. Each of these arguments |
| takes an integer value which corresponds to the degree of parallelism |
| to employ. By default, four tasks are used. You can adjust this value to |
| maximize the data transfer rate from the mainframe. |
| |
| include::distributed-cache.txt[] |
| |
| Controlling the Import Process |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| By default, Sqoop will import all sequential files in a partitioned dataset |
| +pds+ to a directory named +pds+ inside your home directory in HDFS. For |
| example, if your username is +someuser+, then the import tool will write to |
| +/user/someuser/pds/(files)+. You can adjust the parent directory of |
| the import with the +\--warehouse-dir+ argument. For example: |
| |
| ---- |
$ sqoop import-mainframe --connect <host> --dataset foo --warehouse-dir /shared \
| ... |
| ---- |
| |
This command would write to a set of files in the +/shared/foo/+ directory.
| |
| You can also explicitly choose the target directory, like so: |
| |
| ---- |
$ sqoop import-mainframe --connect <host> --dataset foo --target-dir /dest \
| ... |
| ---- |
| |
| This will import the files into the +/dest+ directory. +\--target-dir+ is |
| incompatible with +\--warehouse-dir+. |
| |
| By default, imports go to a new target location. If the destination directory |
| already exists in HDFS, Sqoop will refuse to import and overwrite that |
| directory's contents. |
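
If you want to replace the contents of an existing directory instead, you can
pass the +\--delete-target-dir+ argument to remove the directory before the
import runs. For example, a sketch (the host and dataset names are
placeholders):

----
$ sqoop import-mainframe --connect <host> --dataset foo \
    --target-dir /dest --delete-target-dir
----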
| |
| File Formats |
| ^^^^^^^^^^^^ |
| |
| By default, each record in a dataset is stored |
| as a text record with a newline at the end. Each record is assumed to contain |
a single text field with the name +DEFAULT_COLUMN+.
| When Sqoop imports data to HDFS, it generates a Java class which can |
| reinterpret the text files that it creates. |
| |
| You can also import mainframe records to Sequence, Avro, or Parquet files. |
| |
| By default, data is not compressed. You can compress your data by |
| using the deflate (gzip) algorithm with the +-z+ or +\--compress+ |
| argument, or specify any Hadoop compression codec using the |
| +\--compression-codec+ argument. |
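
For example, a sketch (assuming the Snappy codec is available in your Hadoop
installation) that enables compression with an explicit codec:

----
$ sqoop import-mainframe --connect <host> --dataset foo -z \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
----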
| |
| include::output-args.txt[] |
| |
Since a mainframe record contains only one field, importing to delimited files
will not produce any field delimiters. The field may, however, be enclosed
with an enclosing character or escaped by an escaping character.
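
For example, a sketch (the host and dataset names are placeholders) that
encloses the single field in double quotes and escapes embedded quote
characters with a backslash:

----
$ sqoop import-mainframe --connect <host> --dataset foo \
    --enclosed-by '\"' --escaped-by \\
----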
| |
| include::input-args.txt[] |
| |
| When Sqoop imports data to HDFS, it generates a Java class which can |
| reinterpret the text files that it creates when doing a |
| delimited-format import. The delimiters are chosen with arguments such |
| as +\--fields-terminated-by+; this controls both how the data is |
| written to disk, and how the generated +parse()+ method reinterprets |
| this data. The delimiters used by the +parse()+ method can be chosen |
| independently of the output arguments, by using |
| +\--input-fields-terminated-by+, and so on. This is useful, for example, to |
| generate classes which can parse records created with one set of |
| delimiters, and emit the records to a different set of files using a |
| separate set of delimiters. |
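
For example, a sketch (the host and dataset names are placeholders) that
writes comma-terminated output while generating a class whose +parse()+ method
expects tab-terminated input:

----
$ sqoop import-mainframe --connect <host> --dataset foo \
    --fields-terminated-by ',' --input-fields-terminated-by '\t'
----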
| |
| include::hive-args.txt[] |
| |
| include::hive.txt[] |
| |
| include::hbase-args.txt[] |
| include::hbase.txt[] |
| |
| include::accumulo-args.txt[] |
| include::accumulo.txt[] |
| |
| include::codegen-args.txt[] |
| |
As mentioned earlier, a byproduct of importing a dataset to HDFS is a
class which can manipulate the imported data.
| You should use this class in your subsequent |
| MapReduce processing of the data. |
| |
| The class is typically named after the partitioned dataset name; a |
| partitioned dataset named +foo+ will |
| generate a class named +foo+. You may want to override this class |
| name. For example, if your partitioned dataset |
| is named +EMPLOYEES+, you may want to |
| specify +\--class-name Employee+ instead. Similarly, you can specify |
| just the package name with +\--package-name+. The following import |
| generates a class named +com.foocorp.SomePDS+: |
| |
| ---- |
| $ sqoop import-mainframe --connect <host> --dataset SomePDS --package-name com.foocorp |
| ---- |
| |
| The +.java+ source file for your class will be written to the current |
| working directory when you run +sqoop+. You can control the output |
| directory with +\--outdir+. For example, +\--outdir src/generated/+. |
| |
| The import process compiles the source into +.class+ and +.jar+ files; |
| these are ordinarily stored under +/tmp+. You can select an alternate |
| target directory with +\--bindir+. For example, +\--bindir /scratch+. |
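
Combining the two, a sketch (the host and dataset names are placeholders) that
redirects both the generated source and the compiled artifacts:

----
$ sqoop import-mainframe --connect <host> --dataset SomePDS \
    --outdir src/generated/ --bindir /scratch
----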
| |
| If you already have a compiled class that can be used to perform the |
| import and want to suppress the code-generation aspect of the import |
| process, you can use an existing jar and class by |
| providing the +\--jar-file+ and +\--class-name+ options. For example: |
| |
| ---- |
| $ sqoop import-mainframe --dataset SomePDS --jar-file mydatatypes.jar \ |
| --class-name SomePDSType |
| ---- |
| |
| This command will load the +SomePDSType+ class out of +mydatatypes.jar+. |
| |
Additional Dataset Types and Options
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sqoop also supports generation data group (GDG) and sequential datasets. The
dataset type can be specified with the +\--datasettype+ option followed by one
of:

- +p+ for a partitioned dataset (the default)
- +g+ for a generation data group dataset
- +s+ for a sequential dataset

In the case of generation data group datasets, Sqoop will retrieve only the
latest generation (that is, the most recent file) of the dataset.

In the case of sequential datasets, Sqoop will retrieve only the specified
file.
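
For example, a sketch (the host and dataset name are placeholders) that
imports a single sequential dataset:

----
$ sqoop import-mainframe --connect <host> --dataset SOME.SEQ.DATASET \
    --datasettype s
----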
| |
Datasets stored on tape volumes are supported by specifying +\--tape true+.
| |
By default, mainframe datasets are assumed to be plain text; attempting to
transfer a binary dataset as text will result in data corruption. Binary
datasets are supported by specifying +\--as-binaryfile+ and, optionally,
+\--buffersize+ followed by a buffer size in bytes. By default, +\--buffersize+
is set to 32760 bytes. Altering the buffer size alters the number of records
Sqoop reports as imported, because Sqoop reads the binary dataset in chunks of
+\--buffersize+ bytes; a larger buffer size means fewer records.
| |
Use the +\--ftp-commands+ option with a comma-separated list of commands to
send custom FTP commands prior to file retrieval. This is useful for
instructing the mainframe to embed metadata in the binary files, such as
Record Descriptor Words for variable-length records, so that downstream
processes can separate each record. The mainframe would otherwise discard this
metadata during file transmission.
| |
NOTE: The mainframe's responses to these commands are only logged. It is up to
the user to check the logs for error responses from the mainframe.
| |
| ---- |
| $ sqoop import-mainframe -D hadoop.security.credential.provider.path=jceks://file/my/folder/mainframe.jceks \ |
| --connect <host> --username user1 --password-alias alias1 --dataset SomeDS --tape true \ |
| --as-binaryfile --datasettype g --ftp-commands "SITE RDW,SITE RDW READTAPEFORMAT=V" |
| ---- |
| |
| Additional Import Configuration Properties |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
There are some additional properties which can be configured by modifying
+conf/sqoop-site.xml+. Properties can be specified in the same way as in
Hadoop configuration files, for example:
| |
| ---- |
| <property> |
| <name>property.name</name> |
| <value>property.value</value> |
| </property> |
| ---- |
| |
| They can also be specified on the command line in the generic arguments, for |
| example: |
| |
| ---- |
| sqoop import -D property.name=property.value ... |
| ---- |
| |
| Example Invocations |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| The following examples illustrate how to use the import tool in a variety |
| of situations. |
| |
A basic import of all sequential files in a partitioned dataset named
+EMPLOYEES+ on the mainframe host +z390+:
| |
| ---- |
| $ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \ |
| --username SomeUser -P |
| Enter password: (hidden) |
| ---- |
| |
Import of a tape-based generation data group dataset using a password alias,
writing the generated code to an intermediate directory (+\--outdir+) and the
imported data to the directory given by +\--target-dir+:

| ---- |
| $ sqoop import-mainframe --dataset SomeGdg --connect <host> --username myuser --password-alias \ |
| mypasswordalias --datasettype g --tape true --outdir /tmp/imported/sqoop \ |
| --target-dir /data/imported/mainframe/SomeGdg |
| ---- |
| |
Import of a tape-based binary generation data group dataset with a buffer size
of 64000 bytes, using a password alias, writing the generated code to an
intermediate directory (+\--outdir+) and the imported data to the directory
given by +\--target-dir+:

| ---- |
| $ sqoop import-mainframe --dataset SomeGdg --connect <host> --username myuser --password-alias \ |
| mypasswordalias --datasettype g --tape true --as-binaryfile --buffersize 64000 --outdir /tmp/imported/sqoop \ |
| --target-dir /data/imported/mainframe/SomeGdg |
| ---- |
| |
| Controlling the import parallelism (using 8 parallel tasks): |
| |
| ---- |
| $ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \ |
| --username SomeUser --password-file mypassword -m 8 |
| ---- |
| |
| Importing the data to Hive: |
| |
| ---- |
| $ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \ |
| --hive-import |
| ---- |