| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| # Apache Accumulo File System Archive Example (Data Only) |
| |
| This example archives file data into an Accumulo table. Files with duplicate data are only stored once. |
| The example has the following classes: |
| |
| * CharacterHistogram - A MapReduce that computes a histogram of byte frequency for each file and stores the histogram alongside the file data. An example use of the ChunkInputFormat. |
| * ChunkCombiner - An Iterator that dedupes file data and sets their visibilities to a combined visibility based on current references to the file data. |
| * ChunkInputFormat - An Accumulo InputFormat that provides keys containing file info (List<Entry<Key,Value>>) and values with an InputStream over the file (ChunkInputStream). |
| * ChunkInputStream - An input stream over file data stored in Accumulo. |
| * FileDataIngest - Takes a list of files and archives them into Accumulo keyed on hashes of the files. |
| * FileDataQuery - Retrieves file data based on the hash of the file. (Used by the dirlist.Viewer.) |
| * KeyUtil - A utility for creating and parsing null-byte separated strings into/from Text objects. |
| * VisibilityCombiner - A utility for merging visibilities into the form (VIS1)|(VIS2)|... |
| |
| This example is coupled with the [dirlist example][dirlist]. |
| |
| If you haven't already run the [dirlist example][dirlist], ingest a file with FileDataIngest. |
| |
| $ ./bin/runex filedata.FileDataIngest -t examples.dataTable --auths exampleVis --chunk 1000 /path/to/accumulo/README.md |
| |
| Open the accumulo shell and look at the data. The row is the MD5 hash of the file, which you can |
| verify by running a command such as 'md5sum' on the file. Note that in order to scan the |
| examples.dataTable the class, org.apache.accumulo.examples.filedata.ChunkCombiner, must be in |
| your classpath, or the accumulo-examples-shaded.jar should be moved to the accumulo lib directory. |
| |
| > scan -t examples.dataTable |
| |
| Run the CharacterHistogram MapReduce to add some information about the file. |
| |
| $ ./bin/runmr filedata.CharacterHistogram -t examples.dataTable --auths exampleVis --vis exampleVis |
| |
| Scan again to see the histogram stored in the 'info' column family. |
| |
| > scan -t examples.dataTable |
| |
| [dirlist]: dirlist.md |