docs/load_transform/src/asciidoc/_chapters/monitor.adoc - trafodion - Git at Google

 ////
 /**
 * @@@ START COPYRIGHT @@@
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * @@@ END COPYRIGHT @@@
 */
 ////

 [[monitor-progress]]
 = Monitor Progress

 == INSERT and UPSERT

 For an INSERT statement, rows are written to the HBase table that represents the {project-name} table when the transaction commits.
 It is more difficult to see query progress here.

 == UPSERT USING LOAD

 For an UPSERT USING LOAD statement, rows added are visible in the {project-name} table after each `ListOfPut` call succeeds.
 You can use a `SELECT COUNT(*)` statement to monitor progress. That way, you know how many rows are already in the table when the
 statement starts executing.

 ```
 SELECT COUNT(*) FROM trafodion.sch.demo ;
 ```

 == LOAD
 For LOAD, query progress goes through a few phases, which sometimes overlap:

 1. Hive scan.
 2. Sort.
 3. Create prep HFiles in HDFS bulkload staging directory (`/bulkload` by default).
 4. Move HFiles into HBase.

 You can monitor progress in step 2, sort, with this shell command:

 ```
 lsof +L1 | grep SCR | wc -l
 ```

 This command returns a count of the number of overflow files for sort. Each file is 2GB in size.
 You need to have an approximate idea of  the volume of data being loaded to know how much more
 data needs to be sorted. On a cluster, sort is done on all nodes with a pdsh-like utility.
 {project-name} data volume can also be larger than Hive data volume by a factor of 2 or 3.

 In step 3, create prep HFiles, use the following command to monitor the volume of data written
 out to the staging directory:

 ```
 hadoop fs -dus /bulkload
 ```

 The `hadoop fs` command must be run from one node and does not have to be repeated across the cluster.

 If compression and encoding are used, then the size should be similar to the Hive source data volume.
 There may be some remnant data in the staging directory from previous commands, so we have to
 take that into account. This step will start only when sort has completed.

 Step 4 is usually the shortest and typically does not exceed a few minutes.
	////
	/**
	* @@@ START COPYRIGHT @@@
	*
	* Licensed to the Apache Software Foundation (ASF) under one
	* or more contributor license agreements. See the NOTICE file
	* distributed with this work for additional information
	* regarding copyright ownership. The ASF licenses this file
	* to you under the Apache License, Version 2.0 (the
	* "License"); you may not use this file except in compliance
	* with the License. You may obtain a copy of the License at
	*
	* http://www.apache.org/licenses/LICENSE-2.0
	*
	* Unless required by applicable law or agreed to in writing,
	* software distributed under the License is distributed on an
	* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	* KIND, either express or implied. See the License for the
	* specific language governing permissions and limitations
	* under the License.
	*
	* @@@ END COPYRIGHT @@@
	*/
	////

	[[monitor-progress]]
	= Monitor Progress

	== INSERT and UPSERT

	For an INSERT statement, rows are written to the HBase table that represents the {project-name} table when the transaction commits.
	It is more difficult to see query progress here.

	== UPSERT USING LOAD

	For an UPSERT USING LOAD statement, rows added are visible in the {project-name} table after each `ListOfPut` call succeeds.
	You can use a `SELECT COUNT(*)` statement to monitor progress. That way, you know how many rows are already in the table when the
	statement starts executing.

	```
	SELECT COUNT(*) FROM trafodion.sch.demo ;
	```

	== LOAD
	For LOAD, query progress goes through a few phases, which sometimes overlap:

	1. Hive scan.
	2. Sort.
	3. Create prep HFiles in HDFS bulkload staging directory (`/bulkload` by default).
	4. Move HFiles into HBase.

	You can monitor progress in step 2, sort, with this shell command:

	```
	lsof +L1 \| grep SCR \| wc -l
	```

	This command returns a count of the number of overflow files for sort. Each file is 2GB in size.
	You need to have an approximate idea of the volume of data being loaded to know how much more
	data needs to be sorted. On a cluster, sort is done on all nodes with a pdsh-like utility.
	{project-name} data volume can also be larger than Hive data volume by a factor of 2 or 3.

	In step 3, create prep HFiles, use the following command to monitor the volume of data written
	out to the staging directory:

	```
	hadoop fs -dus /bulkload
	```

	The `hadoop fs` command must be run from one node and does not have to be repeated across the cluster.

	If compression and encoding are used, then the size should be similar to the Hive source data volume.
	There may be some remnant data in the staging directory from previous commands, so we have to
	take that into account. This step will start only when sort has completed.

	Step 4 is usually the shortest and typically does not exceed a few minutes.