////
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
////

Automatic Full-database Import
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to import all the tables in a database, you can use the
+--all-tables+ argument to do so:

----
$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables
----

This will query the database for the available tables, generate an ORM
class for each table, and run a MapReduce job to import each one.
Hadoop uses the DBInputFormat to read from a database into a Mapper
instance. Reading a table in a MapReduce program requires a class that
holds the fields of one row of the table. One of the benefits of Sqoop
is that it generates this class definition for you, based on the table
definition in the database.

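The generated class is essentially a container with one field per
column, plus methods to serialize and deserialize a row. The following
is a minimal sketch only, assuming a hypothetical two-column table
+employee_names(id INT, name VARCHAR)+; the real generated code
implements Hadoop's Writable and DBWritable interfaces, for which
plain +java.io+ streams stand in here:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch of the kind of class Sqoop generates for a
// table employee_names(id INT, name VARCHAR). The real class
// implements Hadoop's Writable and DBWritable interfaces and also
// handles SQL NULL; this sketch assumes non-null fields.
public class EmployeeNames {
    private Integer id;
    private String name;

    public Integer getId() { return id; }
    public String getName() { return name; }
    public void setId(Integer id) { this.id = id; }
    public void setName(String name) { this.name = name; }

    // Serialize one row, field by field (analogue of Writable.write).
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }

    // Deserialize one row in the same field order
    // (analogue of Writable.readFields).
    public void readFields(DataInput in) throws IOException {
        this.id = in.readInt();
        this.name = in.readUTF();
    }
}
```

The field order in +write()+ and +readFields()+ must match exactly,
which is why the same generated class is needed on both the writing
and the reading side.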
The generated +.java+ files are, by default, placed in the current
working directory. You can supply a different directory with the
+--outdir+ parameter. These files are then compiled into +.class+ and
+.jar+ files for use by the MapReduce job that Sqoop launches. The
compiled artifacts are created in a temporary directory; you can
redirect this target with +--bindir+.

Each table will be imported into a separate directory in HDFS, with
the same name as the table. For instance, if your Hadoop username is
aaron, the above command would generate the following directories in
HDFS:

----
/user/aaron/employee_names
/user/aaron/payroll_checks
/user/aaron/job_descriptions
/user/aaron/office_supplies
----

You can change the base directory under which the tables are loaded
with the +--warehouse-dir+ parameter. For example:

----
$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables \
    --warehouse-dir /common/warehouse
----

This would create the following directories instead:

----
/common/warehouse/employee_names
/common/warehouse/payroll_checks
/common/warehouse/job_descriptions
/common/warehouse/office_supplies
----

By default, the data will be read into text files in HDFS. Each of the
columns will be represented as comma-delimited text, and each row is
terminated by a newline. See the section on "Controlling the Output
Format" below for information on how to change these delimiters.

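As a sketch of what consuming that default text format looks like, the
snippet below splits one imported line into its fields. It assumes the
default comma delimiter; note that this naive split breaks if a field
itself contains a comma, which is one reason to change the delimiters
or to use SequenceFiles instead:

```java
// Sketch of parsing one line of Sqoop's default text output:
// comma-delimited fields, newline-terminated rows. Assumes no field
// contains the delimiter; real consumers need a proper parser or
// different delimiters once that assumption fails.
public class TextRowParser {
    public static String[] parseLine(String line) {
        // limit -1 keeps trailing empty fields, so a row ending in an
        // empty column still yields the full column count.
        return line.split(",", -1);
    }
}
```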
If you want to leverage compression and binary file formats, the
+--as-sequencefile+ argument to Sqoop will import the table to a set
of SequenceFiles instead. This stores each database record as a single
object in a SequenceFile, using the class generated for that table.
This representation is also likely to be higher performance when used
as an input to subsequent MapReduce programs, as it does not require
parsing. For completeness, Sqoop provides an +--as-textfile+ option,
which is the default. An +--as-textfile+ on the command line will
override a previous +--as-sequencefile+ argument.

The SequenceFile format will embed the records from the database as
objects, using the code generated by Sqoop. It is important that you
retain the +.java+ file for this class, as you will need to
instantiate the same type to read the objects back later in other
user-defined applications.

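To illustrate why the generated type must remain available, here is a
hedged sketch of the read-back pattern: the reader looks up the record
class by name (much as Hadoop's SequenceFile reader instantiates the
value class recorded in the file header) and asks each instance to
deserialize itself. The +Row+ interface and +PayrollCheck+ class below
are hypothetical stand-ins, not Sqoop or Hadoop APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch of why the generated class must stay on the classpath: the
// reader instantiates the value type reflectively by name and lets
// each instance deserialize itself. The Row interface is a stand-in
// for Hadoop's Writable.
public class RecordReaderSketch {
    public interface Row {
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical generated class for a table payroll_checks(amount INT).
    public static class PayrollCheck implements Row {
        public int amount;
        public void readFields(DataInput in) throws IOException {
            amount = in.readInt();
        }
    }

    // Instantiate the named record class and deserialize one record.
    public static Row readOne(String className, byte[] bytes) throws Exception {
        Row row = (Row) Class.forName(className)
                             .getDeclaredConstructor().newInstance();
        row.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return row;
    }
}
```

If the generated class is missing, the lookup by name fails and the
stored objects cannot be decoded, which is exactly the failure mode
the paragraph above warns about.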