src/main/asciidoc/_chapters/slow_log_responses_from_systable.adoc - hbase - Git at Google

 ////
 /**
  *
  * Licensed to the Apache Software Foundation (ASF) under one
  * or more contributor license agreements.  See the NOTICE file
  * distributed with this work for additional information
  * regarding copyright ownership.  The ASF licenses this file
  * to you under the Apache License, Version 2.0 (the
  * "License"); you may not use this file except in compliance
  * with the License.  You may obtain a copy of the License at
  *
  *     http://www.apache.org/licenses/LICENSE-2.0
  *
  * Unless required by applicable law or agreed to in writing, software
  * distributed under the License is distributed on an "AS IS" BASIS,
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
 ////

 [[slow_log_responses_from_systable]]
 ==== Get Slow/Large Response Logs from System table hbase:slowlog

 The above section provides details about Admin APIs:

 * get_slowlog_responses
 * get_largelog_responses
 * clear_slowlog_responses

 All of the above APIs access online in-memory ring buffers from
 individual RegionServers and accumulate logs from ring buffers to display
 to end user. However, since the logs are stored in memory, after RegionServer is
 restarted, all the objects held in memory of that RegionServer will be cleaned up
 and previous logs are lost. What if we want to persist all these logs forever?
 What if we want to store them in such a manner that operator can get all historical
 records with some filters? e.g get me all large/slow RPC logs that are triggered by
 user1 and are related to region:
 cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf. ?

 If we have a system table that stores such logs in increasing (not so strictly though)
 order of time, it can definitely help operators debug some historical events
 (scan, get, put, compaction, flush etc) with detailed inputs.

 Config which enabled system table to be created and store all log events is
 `hbase.regionserver.slowlog.systable.enabled`.

 The default value for this config is `false`. If provided `true`
 (Note: `hbase.regionserver.slowlog.buffer.enabled` should also be `true`),
 a cron job running in every RegionServer will persist the slow/large logs into
 table hbase:slowlog. By default cron job runs every 10 min. Duration can be configured
 with key: `hbase.slowlog.systable.chore.duration`. By default, RegionServer will
 store upto 1000(config key: `hbase.regionserver.slowlog.systable.queue.size`)
 slow/large logs in an internal queue and the chore will retrieve these logs
 from the queue and perform batch insertion in hbase:slowlog.

 hbase:slowlog has single ColumnFamily: `info`
 `info` contains multiple qualifiers which are the same attributes present as
 part of `get_slowlog_responses` API response.

 * info:call_details
 * info:client_address
 * info:method_name
 * info:param
 * info:processing_time
 * info:queue_time
 * info:region_name
 * info:response_size
 * info:server_class
 * info:start_time
 * info:type
 * info:username

 And example of 2 rows from hbase:slowlog scan result:
 [source]
 ----

  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:call_details, timestamp=2020-05-16T14:58:14.211Z, value=Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:client_address, timestamp=2020-05-16T14:58:14.211Z, value=172.20.10.2:57347
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:method_name, timestamp=2020-05-16T14:58:14.211Z, value=Scan
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:param, timestamp=2020-05-16T14:58:14.211Z, value=region { type: REGION_NAME value: "hbase:meta,,1" } scan { column { family: "info" } attribute { name: "_isolationle
                                                              vel_" value: "\x5C000" } start_row: "cluster_test,33333333,99999999999999" stop_row: "cluster_test,," time_range { from: 0 to: 9223372036854775807 } max_versions: 1 cache_blocks
                                                              : true max_result_size: 2097152 reversed: true caching: 10 include_stop_row: true readType: PREAD } number_of_rows: 10 close_scanner: false client_handles_partials: true client_
                                                              handles_heartbeats: true track_scan_metrics: false
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:processing_time, timestamp=2020-05-16T14:58:14.211Z, value=18
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:queue_time, timestamp=2020-05-16T14:58:14.211Z, value=0
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:region_name, timestamp=2020-05-16T14:58:14.211Z, value=hbase:meta,,1
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:response_size, timestamp=2020-05-16T14:58:14.211Z, value=1575
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:server_class, timestamp=2020-05-16T14:58:14.211Z, value=HRegionServer
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:start_time, timestamp=2020-05-16T14:58:14.211Z, value=1589640743732
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:type, timestamp=2020-05-16T14:58:14.211Z, value=ALL
  \x024\xC1\x03\xE9\x04\xF5@                                  column=info:username, timestamp=2020-05-16T14:58:14.211Z, value=user2
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:call_details, timestamp=2020-05-16T14:59:58.764Z, value=Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:client_address, timestamp=2020-05-16T14:59:58.764Z, value=172.20.10.2:57348
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:method_name, timestamp=2020-05-16T14:59:58.764Z, value=Scan
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:param, timestamp=2020-05-16T14:59:58.764Z, value=region { type: REGION_NAME value: "cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf." } scan { a
                                                              ttribute { name: "_isolationlevel_" value: "\x5C000" } start_row: "cccccccc" time_range { from: 0 to: 9223372036854775807 } max_versions: 1 cache_blocks: true max_result_size: 2
                                                              097152 caching: 2147483647 include_stop_row: false } number_of_rows: 2147483647 close_scanner: false client_handles_partials: true client_handles_heartbeats: true track_scan_met
                                                              rics: false
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:processing_time, timestamp=2020-05-16T14:59:58.764Z, value=24
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:queue_time, timestamp=2020-05-16T14:59:58.764Z, value=0
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:region_name, timestamp=2020-05-16T14:59:58.764Z, value=cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf.
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:response_size, timestamp=2020-05-16T14:59:58.764Z, value=211227
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:server_class, timestamp=2020-05-16T14:59:58.764Z, value=HRegionServer
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:start_time, timestamp=2020-05-16T14:59:58.764Z, value=1589640743932
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:type, timestamp=2020-05-16T14:59:58.764Z, value=ALL
  \x024\xC1\x06X\x81\xF6\xEC                                  column=info:username, timestamp=2020-05-16T14:59:58.764Z, value=user1
 ----

 Operator can use ColumnValueFilter to filter records based on region_name, username,
 client_address etc.

 Time range based queries will also be very useful.
 Example:
 [source]
 ----
 scan 'hbase:slowlog', { TIMERANGE => [1589621394000, 1589637999999] }
 ----
	////
	/**
	*
	* Licensed to the Apache Software Foundation (ASF) under one
	* or more contributor license agreements. See the NOTICE file
	* distributed with this work for additional information
	* regarding copyright ownership. The ASF licenses this file
	* to you under the Apache License, Version 2.0 (the
	* "License"); you may not use this file except in compliance
	* with the License. You may obtain a copy of the License at
	*
	* http://www.apache.org/licenses/LICENSE-2.0
	*
	* Unless required by applicable law or agreed to in writing, software
	* distributed under the License is distributed on an "AS IS" BASIS,
	* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	* See the License for the specific language governing permissions and
	* limitations under the License.
	*/
	////

	[[slow_log_responses_from_systable]]
	==== Get Slow/Large Response Logs from System table hbase:slowlog

	The above section provides details about Admin APIs:

	* get_slowlog_responses
	* get_largelog_responses
	* clear_slowlog_responses

	All of the above APIs access online in-memory ring buffers from
	individual RegionServers and accumulate logs from ring buffers to display
	to end user. However, since the logs are stored in memory, after RegionServer is
	restarted, all the objects held in memory of that RegionServer will be cleaned up
	and previous logs are lost. What if we want to persist all these logs forever?
	What if we want to store them in such a manner that operator can get all historical
	records with some filters? e.g get me all large/slow RPC logs that are triggered by
	user1 and are related to region:
	cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf. ?

	If we have a system table that stores such logs in increasing (not so strictly though)
	order of time, it can definitely help operators debug some historical events
	(scan, get, put, compaction, flush etc) with detailed inputs.

	Config which enabled system table to be created and store all log events is
	`hbase.regionserver.slowlog.systable.enabled`.

	The default value for this config is `false`. If provided `true`
	(Note: `hbase.regionserver.slowlog.buffer.enabled` should also be `true`),
	a cron job running in every RegionServer will persist the slow/large logs into
	table hbase:slowlog. By default cron job runs every 10 min. Duration can be configured
	with key: `hbase.slowlog.systable.chore.duration`. By default, RegionServer will
	store upto 1000(config key: `hbase.regionserver.slowlog.systable.queue.size`)
	slow/large logs in an internal queue and the chore will retrieve these logs
	from the queue and perform batch insertion in hbase:slowlog.

	hbase:slowlog has single ColumnFamily: `info`
	`info` contains multiple qualifiers which are the same attributes present as
	part of `get_slowlog_responses` API response.

	* info:call_details
	* info:client_address
	* info:method_name
	* info:param
	* info:processing_time
	* info:queue_time
	* info:region_name
	* info:response_size
	* info:server_class
	* info:start_time
	* info:type
	* info:username

	And example of 2 rows from hbase:slowlog scan result:
	[source]
	----

	\x024\xC1\x03\xE9\x04\xF5@ column=info:call_details, timestamp=2020-05-16T14:58:14.211Z, value=Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)
	\x024\xC1\x03\xE9\x04\xF5@ column=info:client_address, timestamp=2020-05-16T14:58:14.211Z, value=172.20.10.2:57347
	\x024\xC1\x03\xE9\x04\xF5@ column=info:method_name, timestamp=2020-05-16T14:58:14.211Z, value=Scan
	\x024\xC1\x03\xE9\x04\xF5@ column=info:param, timestamp=2020-05-16T14:58:14.211Z, value=region { type: REGION_NAME value: "hbase:meta,,1" } scan { column { family: "info" } attribute { name: "_isolationle
	vel_" value: "\x5C000" } start_row: "cluster_test,33333333,99999999999999" stop_row: "cluster_test,," time_range { from: 0 to: 9223372036854775807 } max_versions: 1 cache_blocks
	: true max_result_size: 2097152 reversed: true caching: 10 include_stop_row: true readType: PREAD } number_of_rows: 10 close_scanner: false client_handles_partials: true client_
	handles_heartbeats: true track_scan_metrics: false
	\x024\xC1\x03\xE9\x04\xF5@ column=info:processing_time, timestamp=2020-05-16T14:58:14.211Z, value=18
	\x024\xC1\x03\xE9\x04\xF5@ column=info:queue_time, timestamp=2020-05-16T14:58:14.211Z, value=0
	\x024\xC1\x03\xE9\x04\xF5@ column=info:region_name, timestamp=2020-05-16T14:58:14.211Z, value=hbase:meta,,1
	\x024\xC1\x03\xE9\x04\xF5@ column=info:response_size, timestamp=2020-05-16T14:58:14.211Z, value=1575
	\x024\xC1\x03\xE9\x04\xF5@ column=info:server_class, timestamp=2020-05-16T14:58:14.211Z, value=HRegionServer
	\x024\xC1\x03\xE9\x04\xF5@ column=info:start_time, timestamp=2020-05-16T14:58:14.211Z, value=1589640743732
	\x024\xC1\x03\xE9\x04\xF5@ column=info:type, timestamp=2020-05-16T14:58:14.211Z, value=ALL
	\x024\xC1\x03\xE9\x04\xF5@ column=info:username, timestamp=2020-05-16T14:58:14.211Z, value=user2
	\x024\xC1\x06X\x81\xF6\xEC column=info:call_details, timestamp=2020-05-16T14:59:58.764Z, value=Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)
	\x024\xC1\x06X\x81\xF6\xEC column=info:client_address, timestamp=2020-05-16T14:59:58.764Z, value=172.20.10.2:57348
	\x024\xC1\x06X\x81\xF6\xEC column=info:method_name, timestamp=2020-05-16T14:59:58.764Z, value=Scan
	\x024\xC1\x06X\x81\xF6\xEC column=info:param, timestamp=2020-05-16T14:59:58.764Z, value=region { type: REGION_NAME value: "cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf." } scan { a
	ttribute { name: "_isolationlevel_" value: "\x5C000" } start_row: "cccccccc" time_range { from: 0 to: 9223372036854775807 } max_versions: 1 cache_blocks: true max_result_size: 2
	097152 caching: 2147483647 include_stop_row: false } number_of_rows: 2147483647 close_scanner: false client_handles_partials: true client_handles_heartbeats: true track_scan_met
	rics: false
	\x024\xC1\x06X\x81\xF6\xEC column=info:processing_time, timestamp=2020-05-16T14:59:58.764Z, value=24
	\x024\xC1\x06X\x81\xF6\xEC column=info:queue_time, timestamp=2020-05-16T14:59:58.764Z, value=0
	\x024\xC1\x06X\x81\xF6\xEC column=info:region_name, timestamp=2020-05-16T14:59:58.764Z, value=cluster_test,cccccccc,1589635796466.aa45e1571d533f5ed0bb31cdccaaf9cf.
	\x024\xC1\x06X\x81\xF6\xEC column=info:response_size, timestamp=2020-05-16T14:59:58.764Z, value=211227
	\x024\xC1\x06X\x81\xF6\xEC column=info:server_class, timestamp=2020-05-16T14:59:58.764Z, value=HRegionServer
	\x024\xC1\x06X\x81\xF6\xEC column=info:start_time, timestamp=2020-05-16T14:59:58.764Z, value=1589640743932
	\x024\xC1\x06X\x81\xF6\xEC column=info:type, timestamp=2020-05-16T14:59:58.764Z, value=ALL
	\x024\xC1\x06X\x81\xF6\xEC column=info:username, timestamp=2020-05-16T14:59:58.764Z, value=user1
	----

	Operator can use ColumnValueFilter to filter records based on region_name, username,
	client_address etc.

	Time range based queries will also be very useful.
	Example:
	[source]
	----
	scan 'hbase:slowlog', { TIMERANGE => [1589621394000, 1589637999999] }
	----