| //// |
| /** |
| * |
| * Licensed to the Apache Software Foundation (ASF) under one |
| * or more contributor license agreements. See the NOTICE file |
| * distributed with this work for additional information |
| * regarding copyright ownership. The ASF licenses this file |
| * to you under the Apache License, Version 2.0 (the |
| * "License"); you may not use this file except in compliance |
| * with the License. You may obtain a copy of the License at |
| * |
| * http://www.apache.org/licenses/LICENSE-2.0 |
| * |
| * Unless required by applicable law or agreed to in writing, software |
| * distributed under the License is distributed on an "AS IS" BASIS, |
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| * See the License for the specific language governing permissions and |
| * limitations under the License. |
| */ |
| //// |
| |
| [[snapshot_scanner]] |
| == Scan over snapshot |
| :doctype: book |
| :numbered: |
| :toc: left |
| :icons: font |
| :experimental: |
| :toc: left |
| :source-language: java |
| |
| In HBase, a scan of a table costs server-side HBase resources reading, formating, and returning data back to the client. |
| Luckily, HBase provides a TableSnapshotScanner and TableSnapshotInputFormat (introduced by link:https://issues.apache.org/jira/browse/HBASE-8369[HBASE-8369]), |
| which can scan HBase-written HFiles directly in the HDFS filesystem completely by-passing hbase. This access mode |
| performs better than going via HBase and can be used with an offline HBase with in-place or exported |
| snapshot HFiles. |
| |
| To read HFiles directly, the user must have sufficient permissions to access snapshots or in-place hbase HFiles. |
| |
| === TableSnapshotScanner |
| |
| TableSnapshotScanner provides a means for running a single client-side scan over snapshot files. |
| When using TableSnapshotScanner, we must specify a temporary directory to copy the snapshot files into. |
| The client user should have write permissions to this directory, and the dir should not be a subdirectory of |
| the hbase.rootdir. The scanner deletes the contents of the directory once the scanner is closed. |
| |
| .Use TableSnapshotScanner |
| ==== |
| [source,java] |
| ---- |
| Path restoreDir = new Path("XX"); // restore dir should not be a subdirectory of hbase.rootdir |
| Scan scan = new Scan(); |
| try (TableSnapshotScanner scanner = new TableSnapshotScanner(conf, restoreDir, snapshotName, scan)) { |
| Result result = scanner.next(); |
| while (result != null) { |
| ... |
| result = scanner.next(); |
| } |
| } |
| ---- |
| ==== |
| |
| === TableSnapshotInputFormat |
| TableSnapshotInputFormat provides a way to scan over snapshot HFiles in a MapReduce job. |
| |
| .Use TableSnapshotInputFormat |
| ==== |
| [source,java] |
| ---- |
| Job job = new Job(conf); |
| Path restoreDir = new Path("XX"); // restore dir should not be a subdirectory of hbase.rootdir |
| Scan scan = new Scan(); |
| TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, scan, MyTableMapper.class, MyMapKeyOutput.class, MyMapOutputValueWritable.class, job, true, restoreDir); |
| ---- |
| ==== |
| |
| === Permission to access snapshot and data files |
| Generally, only the HBase owner or the HDFS admin have the permission to access HFiles. |
| |
| link:https://issues.apache.org/jira/browse/HBASE-18659[HBASE-18659] uses HDFS ACLs to make HBase granted user have permission to access snapshot files. |
| |
| ==== link:https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#ACLs_Access_Control_Lists[HDFS ACLs] |
| |
| HDFS ACLs supports an "access ACL", which defines the rules to enforce during permission checks, and a "default ACL", |
| which defines the ACL entries that new child files or sub-directories receive automatically during creation. |
| Via HDFS ACLs, HBase syncs granted users with read permission to HFiles. |
| |
| ==== Basic idea |
| |
| The HBase files are organized in the following ways: |
| |
| * {hbase-rootdir}/.tmp/data/{namespace}/{table} |
| * {hbase-rootdir}/data/{namespace}/{table} |
| * {hbase-rootdir}/archive/data/{namespace}/{table} |
| * {hbase-rootdir}/.hbase-snapshot/{snapshotName} |
| |
| So the basic idea is to add or remove HDFS ACLs to files of the global/namespace/table directory |
| when grant or revoke permission to global/namespace/table. |
| |
| See the design doc in link:https://issues.apache.org/jira/browse/HBASE-18659[HBASE-18659] for more details. |
| |
| ==== Configuration to use this feature |
| |
| * Firstly, make sure that HDFS ACLs are enabled and umask is set to 027 |
| ---- |
| dfs.namenode.acls.enabled = true |
| fs.permissions.umask-mode = 027 |
| ---- |
| |
| * Add master coprocessor, please make sure the SnapshotScannerHDFSAclController is configured after AccessController |
| ---- |
| hbase.coprocessor.master.classes = "org.apache.hadoop.hbase.security.access.AccessController |
| ,org.apache.hadoop.hbase.security.access.SnapshotScannerHDFSAclController" |
| ---- |
| |
| * Enable this feature |
| ---- |
| hbase.acl.sync.to.hdfs.enable=true |
| ---- |
| |
| * Modify table scheme to enable this feature for a specified table, this config is |
| false by default for every table, this means the HBase granted ACLs will not be synced to HDFS |
| ---- |
| alter 't1', CONFIGURATION => {'hbase.acl.sync.to.hdfs.enable' => 'true'} |
| ---- |
| |
| ==== Limitation |
| There are some limitations for this feature: |
| |
| ===== |
| If we enable this feature, some master operations such as grant, revoke, snapshot... |
| (See the design doc for more details) will be slower as we need to sync HDFS ACLs to related hfiles. |
| ===== |
| |
| ===== |
| HDFS has a config which limits the max ACL entries num for one directory or file: |
| ---- |
| dfs.namenode.acls.max.entries = 32(default value) |
| ---- |
| The 32 entries include four fixed users for each directory or file: owner, group, other, and mask. |
| For a directory, the four users contain 8 ACL entries(access and default) and for a file, the four |
| users contain 4 ACL entries(access). This means there are 24 ACL entries left for named users or groups. |
| |
| Based on this limitation, we can only sync up to 12 HBase granted users' ACLs. This means, if a table |
| enables this feature, then the total users with table, namespace of this table, global READ permission |
| should not be greater than 12. |
| ===== |
| |
| ===== |
| There are some cases that this coprocessor has not handled or could not handle, so the user HDFS ACLs |
| are not synced normally. It will not make a reference link to another hfile of other tables. |
| ===== |