////
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
////

Automatic Full-database Import
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to import all the tables in a database, you can use the
+--all-tables+ argument to do so:

----
$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables
----

This will query the database for the available tables, generate an ORM
class for each table, and run a MapReduce job to import each one.
Hadoop uses the DBInputFormat to read from a database into a Mapper
instance. Reading a table in a MapReduce program requires a class that
holds the fields of one row of the table. One of the benefits of Sqoop
is that it generates this class definition for you, based on the table
definition in the database.

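The generated class is essentially a container with one field per
column, plus methods to serialize and deserialize a row. The following
is a minimal sketch only, assuming a hypothetical two-column table
+employee_names(id INT, name VARCHAR)+; the real generated code
implements Hadoop's Writable and DBWritable interfaces, for which
plain +java.io+ streams stand in here:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch of the kind of class Sqoop generates for a
// table employee_names(id INT, name VARCHAR). The real class
// implements Hadoop's Writable and DBWritable interfaces and also
// handles SQL NULL; this sketch assumes non-null fields.
public class EmployeeNames {
    private Integer id;
    private String name;

    public Integer getId() { return id; }
    public String getName() { return name; }
    public void setId(Integer id) { this.id = id; }
    public void setName(String name) { this.name = name; }

    // Serialize one row, field by field (analogue of Writable.write).
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }

    // Deserialize one row in the same field order
    // (analogue of Writable.readFields).
    public void readFields(DataInput in) throws IOException {
        this.id = in.readInt();
        this.name = in.readUTF();
    }
}
```

The field order in +write()+ and +readFields()+ must match exactly,
which is why the same generated class is needed on both the writing
and the reading side.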
The generated +.java+ files are, by default, placed in the current
working directory. You can supply a different directory with the
+--outdir+ parameter. These files are then compiled into +.class+ and
+.jar+ files for use by the MapReduce job that Sqoop launches. The
compiled artifacts are created in a temporary directory; you can
redirect this target with +--bindir+.

Each table will be imported into a separate directory in HDFS, with
the same name as the table. For instance, if your Hadoop username is
aaron, the above command would generate the following directories in
HDFS:

----
/user/aaron/employee_names
/user/aaron/payroll_checks
/user/aaron/job_descriptions
/user/aaron/office_supplies
----

You can change the base directory under which the tables are loaded
with the +--warehouse-dir+ parameter. For example:

----
$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables \
    --warehouse-dir /common/warehouse
----

This would create the following directories instead:

----
/common/warehouse/employee_names
/common/warehouse/payroll_checks
/common/warehouse/job_descriptions
/common/warehouse/office_supplies
----

By default, the data will be read into text files in HDFS. Each of the
columns will be represented as comma-delimited text, and each row is
terminated by a newline. See the section on "Controlling the Output
Format" below for information on how to change these delimiters.

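As a sketch of what consuming that default text format looks like, the
snippet below splits one imported line into its fields. It assumes the
default comma delimiter; note that this naive split breaks if a field
itself contains a comma, which is one reason to change the delimiters
or to use SequenceFiles instead:

```java
// Sketch of parsing one line of Sqoop's default text output:
// comma-delimited fields, newline-terminated rows. Assumes no field
// contains the delimiter; real consumers need a proper parser or
// different delimiters once that assumption fails.
public class TextRowParser {
    public static String[] parseLine(String line) {
        // limit -1 keeps trailing empty fields, so a row ending in an
        // empty column still yields the full column count.
        return line.split(",", -1);
    }
}
```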
If you want to leverage compression and binary file formats, the
+--as-sequencefile+ argument to Sqoop will import the table to a set
of SequenceFiles instead. This stores each database record as a single
object in a SequenceFile, using the class generated for that table.
This representation is also likely to be higher performance when used
as an input to subsequent MapReduce programs, as it does not require
parsing. For completeness, Sqoop provides an +--as-textfile+ option,
which is the default. An +--as-textfile+ on the command line will
override a previous +--as-sequencefile+ argument.

The SequenceFile format will embed the records from the database as
objects, using the code generated by Sqoop. It is important that you
retain the +.java+ file for this class, as you will need to
instantiate the same type to read the objects back later in other
user-defined applications.

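To illustrate why the generated type must remain available, here is a
hedged sketch of the read-back pattern: the reader looks up the record
class by name (much as Hadoop's SequenceFile reader instantiates the
value class recorded in the file header) and asks each instance to
deserialize itself. The +Row+ interface and +PayrollCheck+ class below
are hypothetical stand-ins, not Sqoop or Hadoop APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch of why the generated class must stay on the classpath: the
// reader instantiates the value type reflectively by name and lets
// each instance deserialize itself. The Row interface is a stand-in
// for Hadoop's Writable.
public class RecordReaderSketch {
    public interface Row {
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical generated class for a table payroll_checks(amount INT).
    public static class PayrollCheck implements Row {
        public int amount;
        public void readFields(DataInput in) throws IOException {
            amount = in.readInt();
        }
    }

    // Instantiate the named record class and deserialize one record.
    public static Row readOne(String className, byte[] bytes) throws Exception {
        Row row = (Row) Class.forName(className)
                             .getDeclaredConstructor().newInstance();
        row.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return row;
    }
}
```

If the generated class is missing, the lookup by name fails and the
stored objects cannot be decoded, which is exactly the failure mode
the paragraph above warns about.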