 <?xml version="1.0" encoding="UTF-8"?><!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept rev="5.4.3" id="impala_isilon">

   <title>Using Impala with Isilon Storage</title>
   <titlealts audience="PDF"><navtitle>Isilon Storage</navtitle></titlealts>

   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Isilon"/>
       <data name="Category" value="Disk Storage"/>
       <data name="Category" value="Administrators"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       <indexterm audience="hidden">Isilon</indexterm>
       You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS.
       This capability allows convenient query access to a storage system where you might already be
       managing large volumes of data. The combination of the Impala query engine and Isilon storage is
       certified on <keyword keyref="impala224"/> or higher.
     </p>

     <p conref="../shared/impala_common.xml#common/isilon_block_size_caveat"/>

     <p>
       The typical use case for Impala and Isilon together is to use Isilon for the
       default filesystem, replacing HDFS entirely. In this configuration,
       when you create a database, table, or partition, the data always resides on
       Isilon storage and you do not need to specify any special <codeph>LOCATION</codeph>
       attribute. If you do specify a <codeph>LOCATION</codeph> attribute, its value refers
       to a path within the Isilon filesystem.
       For example:
     </p>
 <codeblock>-- If the default filesystem is Isilon, all Impala data resides there
 -- and all Impala databases and tables are located there.
 CREATE TABLE t1 (x INT, s STRING);

 -- You can specify LOCATION for database, table, or partition,
 -- using values from the Isilon filesystem.
 CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db';
 CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
 </codeblock>

     <p>
       Impala can write to, delete, and rename data files and database, table,
       and partition directories on Isilon storage. Therefore, Impala statements such
       as
       <codeph>CREATE TABLE</codeph>, <codeph>DROP TABLE</codeph>,
       <codeph>CREATE DATABASE</codeph>, <codeph>DROP DATABASE</codeph>,
       <codeph>ALTER TABLE</codeph>,
       and
       <codeph>INSERT</codeph> work the same with Isilon storage as with HDFS.
     </p>
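
     <p>
       As a minimal sketch of that equivalence, the following statements assume that Isilon is
       the default filesystem. The table names <codeph>sales_staging</codeph> and
       <codeph>sales</codeph>, the columns, and the values are hypothetical, chosen only for
       illustration. None of the statements needs an Isilon-specific clause.
     </p>
 <codeblock>-- Hypothetical table; with Isilon as the default filesystem,
 -- its data directory is created on Isilon storage.
 CREATE TABLE sales_staging (id BIGINT, region STRING);

 -- INSERT writes new data files into the table directory.
 INSERT INTO sales_staging VALUES (1, 'east'), (2, 'west');

 -- ALTER TABLE ... RENAME TO is written the same way as on HDFS.
 ALTER TABLE sales_staging RENAME TO sales;

 -- DROP TABLE removes the table definition and, for a managed
 -- (internal) table, its data files.
 DROP TABLE sales;
 </codeblock>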

     <p>
       When the Impala spill-to-disk feature is activated by a query that approaches
       the memory limit, Impala writes all the temporary data to a local (not Isilon)
       storage device. Because the I/O bandwidth for the temporary data depends on
       the number of local disks, and clusters using Isilon storage might not have
       as many local disks attached, pay special attention on Isilon-enabled clusters
       to any queries that use the spill-to-disk feature. Where practical, tune the
       queries or allocate extra memory for Impala to avoid spilling.
       Although you can specify an Isilon storage device as the destination for
       the spill-to-disk temporary data, that configuration is not recommended
       because the data must then be both written and read back through remote I/O.
     </p>
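
     <p>
       One way to allocate extra memory for the queries in a session is the
       <codeph>MEM_LIMIT</codeph> query option. The following is a sketch only: the
       <codeph>8g</codeph> value and the tables <codeph>big_fact</codeph> and
       <codeph>big_dim</codeph> are hypothetical, and an appropriate limit depends on the
       memory actually available on each host.
     </p>
 <codeblock>-- Raise the per-host memory limit for queries in this session
 -- (hypothetical value; size it to your hosts).
 SET MEM_LIMIT=8g;

 -- A join plus aggregation of the kind that can spill to disk
 -- when memory is tight (hypothetical tables).
 SELECT d.region, COUNT(*) AS cnt
   FROM big_fact f JOIN big_dim d ON f.dim_id = d.id
  GROUP BY d.region;
 </codeblock>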

     <p>
       When tuning Impala queries on HDFS, you typically try to avoid any remote reads.
       When the data resides on Isilon storage, all the I/O consists of remote reads.
       Do not be alarmed when you see non-zero numbers for remote read measurements
       in query profile output. The benefit of the Impala and Isilon integration is
       primarily the convenience of not having to move or copy large volumes of data to HDFS,
       rather than raw query performance. You can increase the performance of Impala
       I/O for Isilon systems by increasing the value for the
       <codeph>--num_remote_hdfs_io_threads</codeph> startup option for the
       <cmdname>impalad</cmdname> daemon.
     </p>

     <!-- <p outputclass="toc inpage"/> -->
   </conbody>

 </concept>