| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" |
| "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> |
| <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> |
| <head> |
| <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" /> |
| <meta name="generator" content="AsciiDoc 8.6.8" /> |
| <title>Apache Accumulo User Manual Version 1.5</title> |
| <style type="text/css"> |
| /* Shared CSS for AsciiDoc xhtml11 and html5 backends */ |
| |
| /* Default font. */ |
| body { |
| font-family: Georgia,serif; |
| } |
| |
| /* Title font. */ |
| h1, h2, h3, h4, h5, h6, |
| div.title, caption.title, |
| thead, p.table.header, |
| #toctitle, |
| #author, #revnumber, #revdate, #revremark, |
| #footer { |
| font-family: Arial,Helvetica,sans-serif; |
| } |
| |
| body { |
| margin: 1em 5% 1em 5%; |
| } |
| |
| a { |
| color: blue; |
| text-decoration: underline; |
| } |
| a:visited { |
| color: fuchsia; |
| } |
| |
| em { |
| font-style: italic; |
| color: navy; |
| } |
| |
| strong { |
| font-weight: bold; |
| color: #083194; |
| } |
| |
| h1, h2, h3, h4, h5, h6 { |
| color: #527bbd; |
| margin-top: 1.2em; |
| margin-bottom: 0.5em; |
| line-height: 1.3; |
| } |
| |
| h1, h2, h3 { |
| border-bottom: 2px solid silver; |
| } |
| h2 { |
| padding-top: 0.5em; |
| } |
| h3 { |
| float: left; |
| } |
| h3 + * { |
| clear: left; |
| } |
| h5 { |
| font-size: 1.0em; |
| } |
| |
| div.sectionbody { |
| margin-left: 0; |
| } |
| |
| hr { |
| border: 1px solid silver; |
| } |
| |
| p { |
| margin-top: 0.5em; |
| margin-bottom: 0.5em; |
| } |
| |
| ul, ol, li > p { |
| margin-top: 0; |
| } |
| ul > li { color: #aaa; } |
| ul > li > * { color: black; } |
| |
| pre { |
| padding: 0; |
| margin: 0; |
| } |
| |
| #author { |
| color: #527bbd; |
| font-weight: bold; |
| font-size: 1.1em; |
| } |
| #email { |
| } |
| #revnumber, #revdate, #revremark { |
| } |
| |
| #footer { |
| font-size: small; |
| border-top: 2px solid silver; |
| padding-top: 0.5em; |
| margin-top: 4.0em; |
| } |
| #footer-text { |
| float: left; |
| padding-bottom: 0.5em; |
| } |
| #footer-badges { |
| float: right; |
| padding-bottom: 0.5em; |
| } |
| |
| #preamble { |
| margin-top: 1.5em; |
| margin-bottom: 1.5em; |
| } |
| div.imageblock, div.exampleblock, div.verseblock, |
| div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock, |
| div.admonitionblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| div.admonitionblock { |
| margin-top: 2.0em; |
| margin-bottom: 2.0em; |
| margin-right: 10%; |
| color: #606060; |
| } |
| |
| div.content { /* Block element content. */ |
| padding: 0; |
| } |
| |
| /* Block element titles. */ |
| div.title, caption.title { |
| color: #527bbd; |
| font-weight: bold; |
| text-align: left; |
| margin-top: 1.0em; |
| margin-bottom: 0.5em; |
| } |
| div.title + * { |
| margin-top: 0; |
| } |
| |
| td div.title:first-child { |
| margin-top: 0.0em; |
| } |
| div.content div.title:first-child { |
| margin-top: 0.0em; |
| } |
| div.content + div.title { |
| margin-top: 0.0em; |
| } |
| |
| div.sidebarblock > div.content { |
| background: #ffffee; |
| border: 1px solid #dddddd; |
| border-left: 4px solid #f0f0f0; |
| padding: 0.5em; |
| } |
| |
| div.listingblock > div.content { |
| border: 1px solid #dddddd; |
| border-left: 5px solid #f0f0f0; |
| background: #f8f8f8; |
| padding: 0.5em; |
| } |
| |
| div.quoteblock, div.verseblock { |
| padding-left: 1.0em; |
| margin-left: 1.0em; |
| margin-right: 10%; |
| border-left: 5px solid #f0f0f0; |
| color: #777777; |
| } |
| |
| div.quoteblock > div.attribution { |
| padding-top: 0.5em; |
| text-align: right; |
| } |
| |
| div.verseblock > pre.content { |
| font-family: inherit; |
| font-size: inherit; |
| } |
| div.verseblock > div.attribution { |
| padding-top: 0.75em; |
| text-align: left; |
| } |
| /* DEPRECATED: Pre version 8.2.7 verse style literal block. */ |
| div.verseblock + div.attribution { |
| text-align: left; |
| } |
| |
| div.admonitionblock .icon { |
| vertical-align: top; |
| font-size: 1.1em; |
| font-weight: bold; |
| text-decoration: underline; |
| color: #527bbd; |
| padding-right: 0.5em; |
| } |
| div.admonitionblock td.content { |
| padding-left: 0.5em; |
| border-left: 3px solid #dddddd; |
| } |
| |
| div.exampleblock > div.content { |
| border-left: 3px solid #dddddd; |
| padding-left: 0.5em; |
| } |
| |
| div.imageblock div.content { padding-left: 0; } |
| span.image img { border-style: none; } |
| a.image:visited { color: white; } |
| |
| dl { |
| margin-top: 0.8em; |
| margin-bottom: 0.8em; |
| } |
| dt { |
| margin-top: 0.5em; |
| margin-bottom: 0; |
| font-style: normal; |
| color: navy; |
| } |
| dd > *:first-child { |
| margin-top: 0.1em; |
| } |
| |
| ul, ol { |
| list-style-position: outside; |
| } |
| ol.arabic { |
| list-style-type: decimal; |
| } |
| ol.loweralpha { |
| list-style-type: lower-alpha; |
| } |
| ol.upperalpha { |
| list-style-type: upper-alpha; |
| } |
| ol.lowerroman { |
| list-style-type: lower-roman; |
| } |
| ol.upperroman { |
| list-style-type: upper-roman; |
| } |
| |
| div.compact ul, div.compact ol, |
| div.compact p, div.compact p, |
| div.compact div, div.compact div { |
| margin-top: 0.1em; |
| margin-bottom: 0.1em; |
| } |
| |
| tfoot { |
| font-weight: bold; |
| } |
| td > div.verse { |
| white-space: pre; |
| } |
| |
| div.hdlist { |
| margin-top: 0.8em; |
| margin-bottom: 0.8em; |
| } |
| div.hdlist tr { |
| padding-bottom: 15px; |
| } |
| dt.hdlist1.strong, td.hdlist1.strong { |
| font-weight: bold; |
| } |
| td.hdlist1 { |
| vertical-align: top; |
| font-style: normal; |
| padding-right: 0.8em; |
| color: navy; |
| } |
| td.hdlist2 { |
| vertical-align: top; |
| } |
| div.hdlist.compact tr { |
| margin: 0; |
| padding-bottom: 0; |
| } |
| |
| .comment { |
| background: yellow; |
| } |
| |
| .footnote, .footnoteref { |
| font-size: 0.8em; |
| } |
| |
| span.footnote, span.footnoteref { |
| vertical-align: super; |
| } |
| |
| #footnotes { |
| margin: 20px 0 20px 0; |
| padding: 7px 0 0 0; |
| } |
| |
| #footnotes div.footnote { |
| margin: 0 0 5px 0; |
| } |
| |
| #footnotes hr { |
| border: none; |
| border-top: 1px solid silver; |
| height: 1px; |
| text-align: left; |
| margin-left: 0; |
| width: 20%; |
| min-width: 100px; |
| } |
| |
| div.colist td { |
| padding-right: 0.5em; |
| padding-bottom: 0.3em; |
| vertical-align: top; |
| } |
| div.colist td img { |
| margin-top: 0.3em; |
| } |
| |
| @media print { |
| #footer-badges { display: none; } |
| } |
| |
| #toc { |
| margin-bottom: 2.5em; |
| } |
| |
| #toctitle { |
| color: #527bbd; |
| font-size: 1.1em; |
| font-weight: bold; |
| margin-top: 1.0em; |
| margin-bottom: 0.1em; |
| } |
| |
| div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 { |
| margin-top: 0; |
| margin-bottom: 0; |
| } |
| div.toclevel2 { |
| margin-left: 2em; |
| font-size: 0.9em; |
| } |
| div.toclevel3 { |
| margin-left: 4em; |
| font-size: 0.9em; |
| } |
| div.toclevel4 { |
| margin-left: 6em; |
| font-size: 0.9em; |
| } |
| |
| span.aqua { color: aqua; } |
| span.black { color: black; } |
| span.blue { color: blue; } |
| span.fuchsia { color: fuchsia; } |
| span.gray { color: gray; } |
| span.green { color: green; } |
| span.lime { color: lime; } |
| span.maroon { color: maroon; } |
| span.navy { color: navy; } |
| span.olive { color: olive; } |
| span.purple { color: purple; } |
| span.red { color: red; } |
| span.silver { color: silver; } |
| span.teal { color: teal; } |
| span.white { color: white; } |
| span.yellow { color: yellow; } |
| |
| span.aqua-background { background: aqua; } |
| span.black-background { background: black; } |
| span.blue-background { background: blue; } |
| span.fuchsia-background { background: fuchsia; } |
| span.gray-background { background: gray; } |
| span.green-background { background: green; } |
| span.lime-background { background: lime; } |
| span.maroon-background { background: maroon; } |
| span.navy-background { background: navy; } |
| span.olive-background { background: olive; } |
| span.purple-background { background: purple; } |
| span.red-background { background: red; } |
| span.silver-background { background: silver; } |
| span.teal-background { background: teal; } |
| span.white-background { background: white; } |
| span.yellow-background { background: yellow; } |
| |
| span.big { font-size: 2em; } |
| span.small { font-size: 0.6em; } |
| |
| span.underline { text-decoration: underline; } |
| span.overline { text-decoration: overline; } |
| span.line-through { text-decoration: line-through; } |
| |
| |
| /* |
| * xhtml11 specific |
| * |
| * */ |
| |
| tt { |
| font-family: monospace; |
| font-size: inherit; |
| color: navy; |
| } |
| |
| div.tableblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| div.tableblock > table { |
| border: 3px solid #527bbd; |
| } |
| thead, p.table.header { |
| font-weight: bold; |
| color: #527bbd; |
| } |
| p.table { |
| margin-top: 0; |
| } |
| /* Because the table frame attribute is overriden by CSS in most browsers. */ |
| div.tableblock > table[frame="void"] { |
| border-style: none; |
| } |
| div.tableblock > table[frame="hsides"] { |
| border-left-style: none; |
| border-right-style: none; |
| } |
| div.tableblock > table[frame="vsides"] { |
| border-top-style: none; |
| border-bottom-style: none; |
| } |
| |
| |
| /* |
| * html5 specific |
| * |
| * */ |
| |
| .monospaced { |
| font-family: monospace; |
| font-size: inherit; |
| color: navy; |
| } |
| |
| table.tableblock { |
| margin-top: 1.0em; |
| margin-bottom: 1.5em; |
| } |
| thead, p.tableblock.header { |
| font-weight: bold; |
| color: #527bbd; |
| } |
| p.tableblock { |
| margin-top: 0; |
| } |
| table.tableblock { |
| border-width: 3px; |
| border-spacing: 0px; |
| border-style: solid; |
| border-color: #527bbd; |
| border-collapse: collapse; |
| } |
| th.tableblock, td.tableblock { |
| border-width: 1px; |
| padding: 4px; |
| border-style: solid; |
| border-color: #527bbd; |
| } |
| |
| table.tableblock.frame-topbot { |
| border-left-style: hidden; |
| border-right-style: hidden; |
| } |
| table.tableblock.frame-sides { |
| border-top-style: hidden; |
| border-bottom-style: hidden; |
| } |
| table.tableblock.frame-none { |
| border-style: hidden; |
| } |
| |
| th.tableblock.halign-left, td.tableblock.halign-left { |
| text-align: left; |
| } |
| th.tableblock.halign-center, td.tableblock.halign-center { |
| text-align: center; |
| } |
| th.tableblock.halign-right, td.tableblock.halign-right { |
| text-align: right; |
| } |
| |
| th.tableblock.valign-top, td.tableblock.valign-top { |
| vertical-align: top; |
| } |
| th.tableblock.valign-middle, td.tableblock.valign-middle { |
| vertical-align: middle; |
| } |
| th.tableblock.valign-bottom, td.tableblock.valign-bottom { |
| vertical-align: bottom; |
| } |
| |
| |
| /* |
| * manpage specific |
| * |
| * */ |
| |
| body.manpage h1 { |
| padding-top: 0.5em; |
| padding-bottom: 0.5em; |
| border-top: 2px solid silver; |
| border-bottom: 2px solid silver; |
| } |
| body.manpage h2 { |
| border-style: none; |
| } |
| body.manpage div.sectionbody { |
| margin-left: 3em; |
| } |
| |
| @media print { |
| body.manpage div#toc { display: none; } |
| } |
| |
| |
| /* |
| * Theme specific overrides of the preceding (asciidoc.css) CSS. |
| * |
| */ |
| body { |
| font-family: Garamond, Georgia, serif; |
| font-size: 17px; |
| color: #3E4349; |
| line-height: 1.3em; |
| } |
| h1, h2, h3, h4, h5, h6, |
| div.title, caption.title, |
| thead, p.table.header, |
| #toctitle, |
| #author, #revnumber, #revdate, #revremark, |
| #footer { |
| font-family: Garmond, Georgia, serif; |
| font-weight: normal; |
| border-bottom-width: 0; |
| color: #3E4349; |
| } |
| div.title, caption.title { color: #596673; font-weight: bold; } |
| h1 { font-size: 240%; } |
| h2 { font-size: 180%; } |
| h3 { font-size: 150%; } |
| h4 { font-size: 130%; } |
| h5 { font-size: 115%; } |
| h6 { font-size: 100%; } |
| #header h1 { margin-top: 0; } |
| #toc { |
| color: #444444; |
| line-height: 1.5; |
| padding-top: 1.5em; |
| } |
| #toctitle { |
| font-size: 20px; |
| } |
| #toc a { |
| border-bottom: 1px dotted #999999; |
| color: #444444 !important; |
| text-decoration: none !important; |
| } |
| #toc a:hover { |
| border-bottom: 1px solid #6D4100; |
| color: #6D4100 !important; |
| text-decoration: none !important; |
| } |
| div.toclevel1 { margin-top: 0.2em; font-size: 16px; } |
| div.toclevel2 { margin-top: 0.15em; font-size: 14px; } |
| em, dt, td.hdlist1 { color: black; } |
| strong { color: #3E4349; } |
| a { color: #004B6B; text-decoration: none; border-bottom: 1px dotted #004B6B; } |
| a:visited { color: #615FA0; border-bottom: 1px dotted #615FA0; } |
| a:hover { color: #6D4100; border-bottom: 1px solid #6D4100; } |
| div.tableblock > table, table.tableblock { border: 3px solid #E8E8E8; } |
| th.tableblock, td.tableblock { border: 1px solid #E8E8E8; } |
| ul > li > * { color: #3E4349; } |
| pre, tt, .monospaced { font-family: Consolas,Menlo,'Deja Vu Sans Mono','Bitstream Vera Sans Mono',monospace; } |
| tt, .monospaced { font-size: 0.9em; color: black; |
| } |
| div.exampleblock > div.content, div.sidebarblock > div.content, div.listingblock > div.content { border-width: 0 0 0 3px; border-color: #E8E8E8; } |
| div.verseblock { border-left-width: 0; margin-left: 3em; } |
| div.quoteblock { border-left-width: 3px; margin-left: 0; margin-right: 0;} |
| div.admonitionblock td.content { border-left: 3px solid #E8E8E8; } |
| |
| |
| @media screen { |
| body { |
| max-width: 50em; /* approximately 80 characters wide */ |
| margin-left: 16em; |
| } |
| |
| #toc { |
| position: fixed; |
| top: 0; |
| left: 0; |
| bottom: 0; |
| width: 13em; |
| padding: 0.5em; |
| padding-bottom: 1.5em; |
| margin: 0; |
| overflow: auto; |
| border-right: 3px solid #f8f8f8; |
| background-color: white; |
| } |
| |
| #toc .toclevel1 { |
| margin-top: 0.5em; |
| } |
| |
| #toc .toclevel2 { |
| margin-top: 0.25em; |
| display: list-item; |
| color: #aaaaaa; |
| } |
| |
| #toctitle { |
| margin-top: 0.5em; |
| } |
| } |
| </style> |
| <script type="text/javascript"> |
| /*<![CDATA[*/ |
| var asciidoc = { // Namespace. |
| |
| ///////////////////////////////////////////////////////////////////// |
| // Table Of Contents generator |
| ///////////////////////////////////////////////////////////////////// |
| |
| /* Author: Mihai Bazon, September 2002 |
| * http://students.infoiasi.ro/~mishoo |
| * |
| * Table Of Content generator |
| * Version: 0.4 |
| * |
| * Feel free to use this script under the terms of the GNU General Public |
| * License, as long as you do not remove or alter this notice. |
| */ |
| |
| /* modified by Troy D. Hanson, September 2006. License: GPL */ |
| /* modified by Stuart Rackham, 2006, 2009. License: GPL */ |
| |
| // toclevels = 1..4. |
| toc: function (toclevels) { |
| |
| function getText(el) { |
| var text = ""; |
| for (var i = el.firstChild; i != null; i = i.nextSibling) { |
| if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants. |
| text += i.data; |
| else if (i.firstChild != null) |
| text += getText(i); |
| } |
| return text; |
| } |
| |
| function TocEntry(el, text, toclevel) { |
| this.element = el; |
| this.text = text; |
| this.toclevel = toclevel; |
| } |
| |
| function tocEntries(el, toclevels) { |
| var result = new Array; |
| var re = new RegExp('[hH]([1-'+(toclevels+1)+'])'); |
| // Function that scans the DOM tree for header elements (the DOM2 |
| // nodeIterator API would be a better technique but not supported by all |
| // browsers). |
| var iterate = function (el) { |
| for (var i = el.firstChild; i != null; i = i.nextSibling) { |
| if (i.nodeType == 1 /* Node.ELEMENT_NODE */) { |
| var mo = re.exec(i.tagName); |
| if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") { |
| result[result.length] = new TocEntry(i, getText(i), mo[1]-1); |
| } |
| iterate(i); |
| } |
| } |
| } |
| iterate(el); |
| return result; |
| } |
| |
| var toc = document.getElementById("toc"); |
| if (!toc) { |
| return; |
| } |
| |
| // Delete existing TOC entries in case we're reloading the TOC. |
| var tocEntriesToRemove = []; |
| var i; |
| for (i = 0; i < toc.childNodes.length; i++) { |
| var entry = toc.childNodes[i]; |
| if (entry.nodeName.toLowerCase() == 'div' |
| && entry.getAttribute("class") |
| && entry.getAttribute("class").match(/^toclevel/)) |
| tocEntriesToRemove.push(entry); |
| } |
| for (i = 0; i < tocEntriesToRemove.length; i++) { |
| toc.removeChild(tocEntriesToRemove[i]); |
| } |
| |
| // Rebuild TOC entries. |
| var entries = tocEntries(document.getElementById("content"), toclevels); |
| for (var i = 0; i < entries.length; ++i) { |
| var entry = entries[i]; |
| if (entry.element.id == "") |
| entry.element.id = "_toc_" + i; |
| var a = document.createElement("a"); |
| a.href = "#" + entry.element.id; |
| a.appendChild(document.createTextNode(entry.text)); |
| var div = document.createElement("div"); |
| div.appendChild(a); |
| div.className = "toclevel" + entry.toclevel; |
| toc.appendChild(div); |
| } |
| if (entries.length == 0) |
| toc.parentNode.removeChild(toc); |
| }, |
| |
| |
| ///////////////////////////////////////////////////////////////////// |
| // Footnotes generator |
| ///////////////////////////////////////////////////////////////////// |
| |
| /* Based on footnote generation code from: |
| * http://www.brandspankingnew.net/archive/2005/07/format_footnote.html |
| */ |
| |
| footnotes: function () { |
| // Delete existing footnote entries in case we're reloading the footnodes. |
| var i; |
| var noteholder = document.getElementById("footnotes"); |
| if (!noteholder) { |
| return; |
| } |
| var entriesToRemove = []; |
| for (i = 0; i < noteholder.childNodes.length; i++) { |
| var entry = noteholder.childNodes[i]; |
| if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote") |
| entriesToRemove.push(entry); |
| } |
| for (i = 0; i < entriesToRemove.length; i++) { |
| noteholder.removeChild(entriesToRemove[i]); |
| } |
| |
| // Rebuild footnote entries. |
| var cont = document.getElementById("content"); |
| var spans = cont.getElementsByTagName("span"); |
| var refs = {}; |
| var n = 0; |
| for (i=0; i<spans.length; i++) { |
| if (spans[i].className == "footnote") { |
| n++; |
| var note = spans[i].getAttribute("data-note"); |
| if (!note) { |
| // Use [\s\S] in place of . so multi-line matches work. |
| // Because JavaScript has no s (dotall) regex flag. |
| note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1]; |
| spans[i].innerHTML = |
| "[<a id='_footnoteref_" + n + "' href='#_footnote_" + n + |
| "' title='View footnote' class='footnote'>" + n + "</a>]"; |
| spans[i].setAttribute("data-note", note); |
| } |
| noteholder.innerHTML += |
| "<div class='footnote' id='_footnote_" + n + "'>" + |
| "<a href='#_footnoteref_" + n + "' title='Return to text'>" + |
| n + "</a>. " + note + "</div>"; |
| var id =spans[i].getAttribute("id"); |
| if (id != null) refs["#"+id] = n; |
| } |
| } |
| if (n == 0) |
| noteholder.parentNode.removeChild(noteholder); |
| else { |
| // Process footnoterefs. |
| for (i=0; i<spans.length; i++) { |
| if (spans[i].className == "footnoteref") { |
| var href = spans[i].getElementsByTagName("a")[0].getAttribute("href"); |
| href = href.match(/#.*/)[0]; // Because IE return full URL. |
| n = refs[href]; |
| spans[i].innerHTML = |
| "[<a href='#_footnote_" + n + |
| "' title='View footnote' class='footnote'>" + n + "</a>]"; |
| } |
| } |
| } |
| }, |
| |
| install: function(toclevels) { |
| var timerId; |
| |
| function reinstall() { |
| asciidoc.footnotes(); |
| if (toclevels) { |
| asciidoc.toc(toclevels); |
| } |
| } |
| |
| function reinstallAndRemoveTimer() { |
| clearInterval(timerId); |
| reinstall(); |
| } |
| |
| timerId = setInterval(reinstall, 500); |
| if (document.addEventListener) |
| document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false); |
| else |
| window.onload = reinstallAndRemoveTimer; |
| } |
| |
| } |
| asciidoc.install(4); |
| /*]]>*/ |
| </script> |
| </head> |
| <body class="book" style="max-width:50em"> |
| <div id="header"> |
| <h1>Apache Accumulo User Manual Version 1.5</h1> |
| <span id="author">Apache Accumulo Project</span><br /> |
| <span id="email"><code><<a href="mailto:dev@accumulo.apache.org">dev@accumulo.apache.org</a>></code></span><br /> |
| <div id="toc"> |
| <div id="toctitle">Apache Accumulo 1.5</div> |
| <noscript><p><b>JavaScript must be enabled in your browser to display the table of contents.</b></p></noscript> |
| </div> |
| </div> |
| <div id="content"> |
| <div id="preamble"> |
| <div class="sectionbody"> |
| <div class="imageblock"> |
| <div class="content"> |
| <img src="images/accumulo-logo.png" alt="images/accumulo-logo.png" /> |
| </div> |
| </div> |
| <div class="paragraph"><p>Copyright © 2011-2015 The Apache Software Foundation, Licensed under the Apache |
| License, Version 2.0. Apache Accumulo, Accumulo, Apache, and the Apache |
| Accumulo project logo are trademarks of the Apache Software Foundation.</p></div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_introduction">1. Introduction</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Apache Accumulo is a highly scalable structured store based on Google’s BigTable. |
| Accumulo is written in Java and operates over the Hadoop Distributed File System |
| (HDFS), which is part of the popular Apache Hadoop project. Accumulo supports |
| efficient storage and retrieval of structured data, including queries for ranges, and |
| provides support for using Accumulo tables as input and output for MapReduce |
| jobs.</p></div> |
| <div class="paragraph"><p>Accumulo features automatic load-balancing and partitioning, data compression |
| and fine-grained security labels.</p></div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_accumulo_design">2. Accumulo Design</h2> |
| <div class="sectionbody"> |
| <div class="sect2"> |
| <h3 id="_data_model">2.1. Data Model</h3> |
| <div class="paragraph"><p>Accumulo provides a richer data model than simple key-value stores, but is not a |
| fully relational database. Data is represented as key-value pairs, where the key and |
| value are comprised of the following elements:</p></div> |
| <div class="tableblock"> |
| <table rules="all" |
| width="75%" |
| frame="border" |
| cellspacing="0" cellpadding="4"> |
| <col width="16%" /> |
| <col width="16%" /> |
| <col width="16%" /> |
| <col width="16%" /> |
| <col width="16%" /> |
| <col width="16%" /> |
| <tbody> |
| <tr> |
| <td colspan="5" align="center" valign="top"><p class="table">Key</p></td> |
| <td rowspan="3" align="center" valign="middle"><p class="table">Value</p></td> |
| </tr> |
| <tr> |
| <td rowspan="2" align="center" valign="middle"><p class="table">Row ID</p></td> |
| <td colspan="3" align="center" valign="top"><p class="table">Column</p></td> |
| <td rowspan="2" align="center" valign="middle"><p class="table">Timestamp</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">Family</p></td> |
| <td align="center" valign="top"><p class="table">Qualifier</p></td> |
| <td align="center" valign="top"><p class="table">Visibility</p></td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <div class="paragraph"><p>All elements of the Key and the Value are represented as byte arrays except for |
| Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically |
| in ascending order. Timestamps are sorted in descending order so that later |
| versions of the same Key appear first in a sequential scan. Tables consist of a set of |
| sorted key-value pairs.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_architecture">2.2. Architecture</h3> |
| <div class="paragraph"><p>Accumulo is a distributed data storage and retrieval system and as such consists of |
| several architectural components, some of which run on many individual servers. |
| Much of the work Accumulo does involves maintaining certain properties of the |
| data, such as organization, availability, and integrity, across many commodity-class |
| machines.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_components">2.3. Components</h3> |
| <div class="paragraph"><p>An instance of Accumulo includes many TabletServers, one Garbage Collector process, |
| one Master server and many Clients.</p></div> |
| <div class="sect3"> |
| <h4 id="_tablet_server">2.3.1. Tablet Server</h4> |
| <div class="paragraph"><p>The TabletServer manages some subset of all the tablets (partitions of tables). This includes receiving writes from clients, persisting writes to a |
| write-ahead log, sorting new key-value pairs in memory, periodically |
| flushing sorted key-value pairs to new files in HDFS, and responding |
| to reads from clients, forming a merge-sorted view of all keys and |
| values from all the files it has created and the sorted in-memory |
| store.</p></div> |
| <div class="paragraph"><p>TabletServers also perform recovery of a tablet |
| that was previously on a server that failed, reapplying any writes |
| found in the write-ahead log to the tablet.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_garbage_collector">2.3.2. Garbage Collector</h4> |
| <div class="paragraph"><p>Accumulo processes will share files stored in HDFS. Periodically, the Garbage |
| Collector will identify files that are no longer needed by any process, and |
| delete them.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_master">2.3.3. Master</h4> |
| <div class="paragraph"><p>The Accumulo Master is responsible for detecting and responding to TabletServer |
| failure. It tries to balance the load across TabletServer by assigning tablets carefully |
| and instructing TabletServers to unload tablets when necessary. The Master ensures all |
| tablets are assigned to one TabletServer each, and handles table creation, alteration, |
| and deletion requests from clients. The Master also coordinates startup, graceful |
| shutdown and recovery of changes in write-ahead logs when Tablet servers fail.</p></div> |
| <div class="paragraph"><p>Multiple masters may be run. The masters will choose among themselves a single master, |
| and the others will become backups if the master should fail.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_client">2.3.4. Client</h4> |
| <div class="paragraph"><p>Accumulo includes a client library that is linked to every application. The client |
| library contains logic for finding servers managing a particular tablet, and |
| communicating with TabletServers to write and retrieve key-value pairs.</p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_data_management">2.4. Data Management</h3> |
| <div class="paragraph"><p>Accumulo stores data in tables, which are partitioned into tablets. Tablets are |
| partitioned on row boundaries so that all of the columns and values for a particular |
| row are found together within the same tablet. The Master assigns Tablets to one |
| TabletServer at a time. This enables row-level transactions to take place without |
| using distributed locking or some other complicated synchronization mechanism. As |
| clients insert and query data, and as machines are added and removed from the |
| cluster, the Master migrates tablets to ensure they remain available and that the |
| ingest and query load is balanced across the cluster.</p></div> |
| <div class="imageblock"> |
| <div class="content"> |
| <img src="images/data_distribution.png" alt="images/data_distribution.png" width="500" /> |
| </div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_tablet_service">2.5. Tablet Service</h3> |
| <div class="paragraph"><p>When a write arrives at a TabletServer it is written to a Write-Ahead Log and |
| then inserted into a sorted data structure in memory called a MemTable. When the |
| MemTable reaches a certain size the TabletServer writes out the sorted key-value |
| pairs to a file in HDFS called Indexed Sequential Access Method (ISAM) |
| file. This process is called a minor compaction. A new MemTable is then created |
| and the fact of the compaction is recorded in the Write-Ahead Log.</p></div> |
| <div class="paragraph"><p>When a request to read data arrives at a TabletServer, the TabletServer does a |
| binary search across the MemTable as well as the in-memory indexes associated |
| with each ISAM file to find the relevant values. If clients are performing a |
| scan, several key-value pairs are returned to the client in order from the |
| MemTable and the set of ISAM files by performing a merge-sort as they are read.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_compactions">2.6. Compactions</h3> |
| <div class="paragraph"><p>In order to manage the number of files per tablet, periodically the TabletServer |
| performs Major Compactions of files within a tablet, in which some set of ISAM |
| files are combined into one file. The previous files will eventually be removed |
| by the Garbage Collector. This also provides an opportunity to permanently |
| remove deleted key-value pairs by omitting key-value pairs suppressed by a |
| delete entry when the new file is created.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_splitting">2.7. Splitting</h3> |
| <div class="paragraph"><p>When a table is created it has one tablet. As the table grows its initial |
| tablet eventually splits into two tablets. Its likely that one of these |
| tablets will migrate to another tablet server. As the table continues to grow, |
| its tablets will continue to split and be migrated. The decision to |
| automatically split a tablet is based on the size of a tablets files. The |
| size threshold at which a tablet splits is configurable per table. In addition |
| to automatic splitting, a user can manually add split points to a table to |
| create new tablets. Manually splitting a new table can parallelize reads and |
| writes giving better initial performance without waiting for automatic |
| splitting.</p></div> |
| <div class="paragraph"><p>As data is deleted from a table, tablets may shrink. Over time this can lead |
| to small or empty tablets. To deal with this, merging of tablets was |
| introduced in Accumulo 1.4. This is discussed in more detail later.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_fault_tolerance">2.8. Fault-Tolerance</h3> |
| <div class="paragraph"><p>If a TabletServer fails, the Master detects it and automatically reassigns the tablets |
| assigned from the failed server to other servers. Any key-value pairs that were in |
| memory at the time the TabletServer fails are automatically reapplied from the Write-Ahead |
| Log (WAL) to prevent any loss of data.</p></div> |
| <div class="paragraph"><p>Tablet servers write their WALs directly to HDFS so the logs are available to all tablet |
| servers for recovery. To make the recovery process efficient, the updates within a log are |
| grouped by tablet. TabletServers can quickly apply the mutations from the sorted logs |
| that are destined for the tablets they have now been assigned.</p></div> |
| <div class="paragraph"><p>TabletServer failures are noted on the Master’s monitor page, accessible via |
| <code>http://master-address:50095/monitor</code>.</p></div> |
| <div class="imageblock"> |
| <div class="content"> |
| <img src="images/failure_handling.png" alt="images/failure_handling.png" width="500" /> |
| </div> |
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_accumulo_shell">3. Accumulo Shell</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Accumulo provides a simple shell that can be used to examine the contents and |
| configuration settings of tables, insert/update/delete values, and change |
| configuration settings.</p></div> |
| <div class="paragraph"><p>The shell can be started by the following command:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ACCUMULO_HOME/bin/accumulo shell -u [username]</code></pre> |
| </div></div> |
| <div class="paragraph"><p>The shell will prompt for the corresponding password to the username specified |
| and then display the following prompt:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>Shell - Apache Accumulo Interactive Shell |
| - |
| - version 1.5 |
| - instance name: myinstance |
| - instance id: 00000000-0000-0000-0000-000000000000 |
| - |
| - type 'help' for a list of available commands |
| -</code></pre> |
| </div></div> |
| <div class="sect2"> |
| <h3 id="_basic_administration">3.1. Basic Administration</h3> |
| <div class="paragraph"><p>The Accumulo shell can be used to create and delete tables, as well as to configure |
| table and instance specific options.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> tables |
| !METADATA</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> createtable mytable</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable></code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> tables |
| !METADATA |
| mytable</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> createtable testtable</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance testtable></code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance testtable> deletetable testtable |
| deletetable { testtable } (yes|no)? yes |
| Table: [testtable] has been deleted.</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance></code></pre> |
| </div></div> |
| <div class="paragraph"><p>The Shell can also be used to insert updates and scan tables. This is useful for |
| inspecting tables.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> scan</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> insert row1 colf colq value1 |
| insert successful</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> scan |
| row1 colf:colq [] value1</code></pre> |
| </div></div> |
| <div class="paragraph"><p>The value in brackets “[]” would be the visibility labels. Since none were used, this is empty for this row. |
| You can use the “-st” option to scan to see the timestamp for the cell, too.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_table_maintenance">3.2. Table Maintenance</h3> |
| <div class="paragraph"><p>The <strong>compact</strong> command instructs Accumulo to schedule a compaction of the table during which |
| files are consolidated and deleted entries are removed.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> compact -t mytable |
| 07 16:13:53,201 [shell.Shell] INFO : Compaction of table mytable started for given range</code></pre> |
| </div></div> |
| <div class="paragraph"><p>The <strong>flush</strong> command instructs Accumulo to write all entries currently in memory for a given table |
| to disk.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> flush -t mytable |
| 07 16:14:19,351 [shell.Shell] INFO : Flush of table mytable |
| initiated...</code></pre> |
| </div></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_user_administration">3.3. User Administration</h3> |
| <div class="paragraph"><p>The Shell can be used to add, remove, and grant privileges to users.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> createuser bob |
| Enter new password for 'bob': ********* |
| Please confirm new password for 'bob': *********</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> authenticate bob |
| Enter current password for 'bob': ********* |
| Valid</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> grant System.CREATE_TABLE -s -u bob</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance mytable> user bob |
| Enter current password for 'bob': *********</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>bob@myinstance mytable> userpermissions |
| System permissions: System.CREATE_TABLE |
| Table permissions (!METADATA): Table.READ |
| Table permissions (mytable): NONE</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>bob@myinstance mytable> createtable bobstable |
| bob@myinstance bobstable></code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>bob@myinstance bobstable> user root |
| Enter current password for 'root': *********</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance bobstable> revoke System.CREATE_TABLE -s -u bob</code></pre> |
| </div></div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_writing_accumulo_clients">4. Writing Accumulo Clients</h2> |
| <div class="sectionbody"> |
| <div class="sect2"> |
| <h3 id="_running_client_code">4.1. Running Client Code</h3> |
| <div class="paragraph"><p>There are multiple ways to run Java code that uses Accumulo. Below is a list |
| of the different ways to execute client code.</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| using java executable |
| </p> |
| </li> |
| <li> |
| <p> |
| using the accumulo script |
| </p> |
| </li> |
| <li> |
| <p> |
| using the tool script |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>In order to run client code written to run against Accumulo, you will need to |
| include the jars that Accumulo depends on in your classpath. Accumulo client |
| code depends on Hadoop and Zookeeper. For Hadoop add the hadoop client jar, all |
| of the jars in the Hadoop lib directory, and the conf directory to the |
| classpath. For Zookeeper 3.3 you only need to add the Zookeeper jar, and not |
| what is in the Zookeeper lib directory. You can run the following command on a |
| configured Accumulo system to see what its using for its classpath.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ACCUMULO_HOME/bin/accumulo classpath</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Another option for running your code is to put a jar file in |
| <code>$ACCUMULO_HOME/lib/ext</code>. After doing this you can use the accumulo |
| script to execute your code. For example if you create a jar containing the |
| class com.foo.Client and placed that in <code>lib/ext</code>, then you could use the command |
| <code>$ACCUMULO_HOME/bin/accumulo com.foo.Client</code> to execute your code.</p></div> |
| <div class="paragraph"><p>If you are writing map reduce job that access Accumulo, then you can use the |
| bin/tool.sh script to run those jobs. See the map reduce example.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_connecting">4.2. Connecting</h3> |
| <div class="paragraph"><p>All clients must first identify the Accumulo instance to which they will be |
| communicating. Code to do this is as follows:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">String</span> instanceName = <span style="color: #4c73a6">"myinstance"</span>; |
| <span style="color: #000000">String</span> zooServers = <span style="color: #4c73a6">"zooserver-one,zooserver-two"</span> |
| <span style="color: #000000">Instance</span> inst = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">ZooKeeperInstance</span></span>(instanceName, zooServers); |
| <span style="color: #000000">Connector</span> conn = inst.<span style="font-weight: bold"><span style="color: #000000">getConnector</span></span>(<span style="color: #4c73a6">"user"</span>, <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">PasswordToken</span></span>(<span style="color: #4c73a6">"passwd"</span>));</tt></pre></div></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_writing_data">4.3. Writing Data</h3> |
| <div class="paragraph"><p>Data are written to Accumulo by creating Mutation objects that represent all the |
| changes to the columns of a single row. The changes are made atomically in the |
| TabletServer. Clients then add Mutations to a BatchWriter which submits them to |
| the appropriate TabletServers.</p></div> |
| <div class="paragraph"><p>Mutations can be created thus:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Text</span> rowID = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"row1"</span>); |
| <span style="color: #000000">Text</span> colFam = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"myColFam"</span>); |
| <span style="color: #000000">Text</span> colQual = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"myColQual"</span>); |
| <span style="color: #000000">ColumnVisibility</span> colVis = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">ColumnVisibility</span></span>(<span style="color: #4c73a6">"public"</span>); |
| <span style="color: #000000">long</span> timestamp = System.<span style="font-weight: bold"><span style="color: #000000">currentTimeMillis</span></span>(); |
| |
| <span style="color: #000000">Value</span> value = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Value</span></span>(<span style="color: #4c73a6">"myValue"</span>.<span style="font-weight: bold"><span style="color: #000000">getBytes</span></span>()); |
| |
| <span style="color: #000000">Mutation</span> mutation = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Mutation</span></span>(rowID); |
| mutation.<span style="font-weight: bold"><span style="color: #000000">put</span></span>(colFam, colQual, colVis, timestamp, value);</tt></pre></div></div> |
| <div class="sect3"> |
| <h4 id="_batchwriter">4.3.1. BatchWriter</h4> |
| <div class="paragraph"><p>The BatchWriter is highly optimized to send Mutations to multiple TabletServers |
| and automatically batches Mutations destined for the same TabletServer to |
| amortize network overhead. Care must be taken to avoid changing the contents of |
| any Object passed to the BatchWriter since it keeps objects in memory while |
| batching.</p></div> |
| <div class="paragraph"><p>Mutations are added to a BatchWriter thus:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="font-style: italic"><span style="color: #b30000">//BatchWriterConfig has reasonable defaults</span></span> |
| <span style="color: #000000">BatchWriterConfig</span> config = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">BatchWriterConfig</span></span>(); |
| config.<span style="font-weight: bold"><span style="color: #000000">setMaxMemory</span></span>(<span style="color: #000000">10000000L</span>); <span style="font-style: italic"><span style="color: #b30000">// bytes available to batchwriter for buffering mutations</span></span> |
| |
| <span style="color: #000000">BatchWriter</span> writer = conn.<span style="font-weight: bold"><span style="color: #000000">createBatchWriter</span></span>(<span style="color: #4c73a6">"table"</span>, config) |
| |
| writer.<span style="font-weight: bold"><span style="color: #000000">add</span></span>(mutation); |
| |
| writer.<span style="font-weight: bold"><span style="color: #000000">close</span></span>();</tt></pre></div></div> |
| <div class="paragraph"><p>An example of using the batch writer can be found at |
| <code>accumulo/docs/examples/README.batch</code></p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_reading_data">4.4. Reading Data</h3> |
| <div class="paragraph"><p>Accumulo is optimized to quickly retrieve the value associated with a given key, and |
| to efficiently return ranges of consecutive keys and their associated values.</p></div> |
| <div class="sect3"> |
| <h4 id="_scanner">4.4.1. Scanner</h4> |
| <div class="paragraph"><p>To retrieve data, Clients use a Scanner, which acts like an Iterator over |
| keys and values. Scanners can be configured to start and stop at particular keys, and |
| to return a subset of the columns available.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="font-style: italic"><span style="color: #b30000">// specify which visibilities we are allowed to see</span></span> |
| <span style="color: #000000">Authorizations</span> auths = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Authorizations</span></span>(<span style="color: #4c73a6">"public"</span>); |
| |
| <span style="color: #000000">Scanner</span> scan = |
| conn.<span style="font-weight: bold"><span style="color: #000000">createScanner</span></span>(<span style="color: #4c73a6">"table"</span>, auths); |
| |
| scan.<span style="font-weight: bold"><span style="color: #000000">setRange</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Range</span></span>(<span style="color: #4c73a6">"harry"</span>,<span style="color: #4c73a6">"john"</span>)); |
| scan.<span style="font-weight: bold"><span style="color: #000000">fetchFamily</span></span>(<span style="color: #4c73a6">"attributes"</span>); |
| |
| <span style="color: #0000b3">for</span>(<span style="color: #000000">Entry<Key,Value></span> entry : scan) { |
| <span style="color: #000000">String</span> row = entry.<span style="font-weight: bold"><span style="color: #000000">getKey</span></span>().<span style="font-weight: bold"><span style="color: #000000">getRow</span></span>(); |
| <span style="color: #000000">Value</span> value = entry.<span style="font-weight: bold"><span style="color: #000000">getValue</span></span>(); |
| }</tt></pre></div></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_isolated_scanner">4.4.2. Isolated Scanner</h4> |
| <div class="paragraph"><p>Accumulo supports the ability to present an isolated view of rows when |
| scanning. There are three possible ways that a row could change in Accumulo:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| a mutation applied to a table |
| </p> |
| </li> |
| <li> |
| <p> |
| iterators executed as part of a minor or major compaction |
| </p> |
| </li> |
| <li> |
| <p> |
| bulk import of new files |
| </p> |
| </li> |
| </ul></div> |
| <div class="paragraph"><p>Isolation guarantees that either all or none of the changes made by these |
| operations on a row are seen. Use the IsolatedScanner to obtain an isolated |
| view of an Accumulo table. When using the regular scanner it is possible to see |
| a non isolated view of a row. For example if a mutation modifies three |
| columns, it is possible that you will only see two of those modifications. |
| With the isolated scanner either all three of the changes are seen or none.</p></div> |
| <div class="paragraph"><p>The IsolatedScanner buffers rows on the client side so a large row will not |
| crash a tablet server. By default rows are buffered in memory, but the user |
| can easily supply their own buffer if they wish to buffer to disk when rows are |
| large.</p></div> |
| <div class="paragraph"><p>For an example, look at the following |
| <code>examples/simple/src/main/java/org/apache/accumulo/examples/simple/isolation/InterferenceTest.java</code></p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_batchscanner">4.4.3. BatchScanner</h4> |
| <div class="paragraph"><p>For some types of access, it is more efficient to retrieve several ranges |
| simultaneously. This arises when accessing a set of rows that are not consecutive |
| whose IDs have been retrieved from a secondary index, for example.</p></div> |
| <div class="paragraph"><p>The BatchScanner is configured similarly to the Scanner; it can be configured to |
| retrieve a subset of the columns available, but rather than passing a single Range, |
| BatchScanners accept a set of Ranges. It is important to note that the keys returned |
| by a BatchScanner are not in sorted order since the keys streamed are from multiple |
| TabletServers in parallel.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">ArrayList<Range></span> ranges = <span style="color: #0000b3">new</span> ArrayList<Range>(); |
| <span style="font-style: italic"><span style="color: #b30000">// populate list of ranges ...</span></span> |
| |
| <span style="color: #000000">BatchScanner</span> bscan = |
| conn.<span style="font-weight: bold"><span style="color: #000000">createBatchScanner</span></span>(<span style="color: #4c73a6">"table"</span>, auths, <span style="color: #000000">10</span>); |
| |
| bscan.<span style="font-weight: bold"><span style="color: #000000">setRanges</span></span>(ranges); |
| bscan.<span style="font-weight: bold"><span style="color: #000000">fetchFamily</span></span>(<span style="color: #4c73a6">"attributes"</span>); |
| |
| <span style="color: #0000b3">for</span>(<span style="color: #000000">Entry<Key,Value></span> entry : scan) |
| System.out.<span style="font-weight: bold"><span style="color: #000000">println</span></span>(entry.<span style="font-weight: bold"><span style="color: #000000">getValue</span></span>());</tt></pre></div></div> |
| <div class="paragraph"><p>An example of the BatchScanner can be found at |
| <code>accumulo/docs/examples/README.batch</code></p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_proxy">4.5. Proxy</h3> |
| <div class="paragraph"><p>The proxy API allows the interaction with Accumulo with languages other than Java. |
| A proxy server is provided in the codebase and a client can further be generated.</p></div> |
| <div class="sect3"> |
| <h4 id="_prequisites">4.5.1. Prequisites</h4> |
| <div class="paragraph"><p>The proxy server can live on any node in which the basic client API would work. That |
| means it must be able to communicate with the Master, ZooKeepers, NameNode, and the |
| DataNodes. A proxy client only needs the ability to communicate with the proxy server.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_configuration">4.5.2. Configuration</h4> |
| <div class="paragraph"><p>The configuration options for the proxy server live inside of a properties file. At |
| the very least, you need to supply the following properties:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory |
| tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken |
| port=42424 |
| instance=test |
| zookeepers=localhost:2181</code></pre> |
| </div></div> |
| <div class="paragraph"><p>You can find a sample configuration file in your distribution:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ACCUMULO_HOME/proxy/proxy.properties</code></pre> |
| </div></div> |
| <div class="paragraph"><p>This sample configuration file further demonstrates an ability to back the proxy server |
| by MockAccumulo or the MiniAccumuloCluster.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_running_the_proxy_server">4.5.3. Running the Proxy Server</h4> |
| <div class="paragraph"><p>After the properties file holding the configuration is created, the proxy server |
| can be started using the following command in the Accumulo distribution (assuming |
| your properties file is named <code>config.properties</code>):</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ACCUMULO_HOME/bin/accumulo proxy -p config.properties</code></pre> |
| </div></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_creating_a_proxy_client">4.5.4. Creating a Proxy Client</h4> |
| <div class="paragraph"><p>Aside from installing the Thrift compiler, you will also need the language-specific library |
| for Thrift installed to generate client code in that language. Typically, your operating |
| system’s package manager will be able to automatically install these for you in an expected |
| location such as <code>/usr/lib/python/site-packages/thrift</code>.</p></div> |
| <div class="paragraph"><p>You can find the thrift file for generating the client:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ACCUMULO_HOME/proxy/proxy.thrift</code></pre> |
| </div></div> |
| <div class="paragraph"><p>After a client is generated, the port specified in the configuration properties above will be |
| used to connect to the server.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_using_a_proxy_client">4.5.5. Using a Proxy Client</h4> |
| <div class="paragraph"><p>The following examples have been written in Java and the method signatures may be |
| slightly different depending on the language specified when generating client with |
| the Thrift compiler. After initiating a connection to the Proxy (see Apache Thrift’s |
| documentation for examples of connecting to a Thrift service), the methods on the |
| proxy client will be available. The first thing to do is log in:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Map</span> password = <span style="color: #0000b3">new</span> HashMap<String,String>(); |
| password.<span style="font-weight: bold"><span style="color: #000000">put</span></span>(<span style="color: #4c73a6">"password"</span>, <span style="color: #4c73a6">"secret"</span>); |
| <span style="color: #000000">ByteBuffer</span> token = client.<span style="font-weight: bold"><span style="color: #000000">login</span></span>(<span style="color: #4c73a6">"root"</span>, password);</tt></pre></div></div> |
| <div class="paragraph"><p>Once logged in, the token returned will be used for most subsequent calls to the client. |
| Let’s create a table, add some data, scan the table, and delete it.</p></div> |
| <div class="paragraph"><p>First, create a table.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>client.<span style="font-weight: bold"><span style="color: #000000">createTable</span></span>(token, <span style="color: #4c73a6">"myTable"</span>, <span style="color: #0000b3">true</span>, TimeType.MILLIS);</tt></pre></div></div> |
| <div class="paragraph"><p>Next, add some data:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="font-style: italic"><span style="color: #b30000">// first, create a writer on the server</span></span> |
| <span style="color: #000000">String</span> writer = client.<span style="font-weight: bold"><span style="color: #000000">createWriter</span></span>(token, <span style="color: #4c73a6">"myTable"</span>, <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">WriterOptions</span></span>()); |
| |
| <span style="font-style: italic"><span style="color: #b30000">// build column updates</span></span> |
| <span style="color: #000000">Map<ByteBuffer, List<ColumnUpdate> cells></span> cellsToUpdate = <span style="font-style: italic"><span style="color: #b30000">//...</span></span> |
| |
| <span style="font-style: italic"><span style="color: #b30000">// send updates to the server</span></span> |
| client.<span style="font-weight: bold"><span style="color: #000000">updateAndFlush</span></span>(writer, <span style="color: #4c73a6">"myTable"</span>, cellsToUpdate); |
| |
| client.<span style="font-weight: bold"><span style="color: #000000">closeWriter</span></span>(writer);</tt></pre></div></div> |
| <div class="paragraph"><p>Scan for the data and batch the return of the results on the server:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">String</span> scanner = client.<span style="font-weight: bold"><span style="color: #000000">createScanner</span></span>(token, <span style="color: #4c73a6">"myTable"</span>, <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">ScanOptions</span></span>()); |
| <span style="color: #000000">ScanResult</span> results = client.<span style="font-weight: bold"><span style="color: #000000">nextK</span></span>(scanner, <span style="color: #000000">100</span>); |
| |
| <span style="color: #0000b3">for</span>(<span style="color: #000000">KeyValue</span> keyValue : results.<span style="font-weight: bold"><span style="color: #000000">getResultsIterator</span></span>()) { |
| <span style="font-style: italic"><span style="color: #b30000">// do something with results</span></span> |
| } |
| |
| client.<span style="font-weight: bold"><span style="color: #000000">closeScanner</span></span>(scanner);</tt></pre></div></div> |
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_development_clients">5. Development Clients</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Normally, Accumulo consists of lots of moving parts. Even a stand-alone version of |
| Accumulo requires Hadoop, Zookeeper, the Accumulo master, a tablet server, etc. If |
| you want to write a unit test that uses Accumulo, you need a lot of infrastructure |
| in place before your test can run.</p></div> |
| <div class="sect2"> |
| <h3 id="_mock_accumulo">5.1. Mock Accumulo</h3> |
| <div class="paragraph"><p>Mock Accumulo supplies mock implementations for much of the client API. It presently |
| does not enforce users, logins, permissions, etc. It does support Iterators and Combiners. |
| Note that MockAccumulo holds all data in memory, and will not retain any data or |
| settings between runs.</p></div> |
| <div class="paragraph"><p>While normal interaction with the Accumulo client looks like this:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Instance</span> instance = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">ZooKeeperInstance</span></span>(...); |
| <span style="color: #000000">Connector</span> conn = instance.<span style="font-weight: bold"><span style="color: #000000">getConnector</span></span>(user, passwordToken);</tt></pre></div></div> |
| <div class="paragraph"><p>To interact with the MockAccumulo, just replace the ZooKeeperInstance with MockInstance:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Instance</span> instance = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">MockInstance</span></span>();</tt></pre></div></div> |
| <div class="paragraph"><p>In fact, you can use the <code>--fake</code> option to the Accumulo shell and interact with |
| MockAccumulo:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ ./bin/accumulo shell --fake -u root -p '' |
| Shell - Apache Accumulo Interactive Shell |
| - |
| - version: 1.5 |
| - instance name: fake |
| - instance id: mock-instance-id |
| - |
| - type 'help' for a list of available commands |
| - |
| root@fake> createtable test |
| root@fake test> insert row1 cf cq value |
| root@fake test> insert row2 cf cq value2 |
| root@fake test> insert row3 cf cq value3 |
| root@fake test> scan |
| row1 cf:cq [] value |
| row2 cf:cq [] value2 |
| row3 cf:cq [] value3 |
| root@fake test> scan -b row2 -e row2 |
| row2 cf:cq [] value2 |
| root@fake test></code></pre> |
| </div></div> |
| <div class="paragraph"><p>When testing Map Reduce jobs, you can also set the Mock Accumulo on the AccumuloInputFormat |
| and AccumuloOutputFormat classes:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>AccumuloInputFormat.<span style="font-weight: bold"><span style="color: #000000">setMockInstance</span></span>(job, <span style="color: #4c73a6">"mockInstance"</span>); |
| AccumuloOutputFormat.<span style="font-weight: bold"><span style="color: #000000">setMockInstance</span></span>(job, <span style="color: #4c73a6">"mockInstance"</span>);</tt></pre></div></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_mini_accumulo_cluster">5.2. Mini Accumulo Cluster</h3> |
| <div class="paragraph"><p>While the Mock Accumulo provides a lightweight implementation of the client API for unit |
| testing, it is often necessary to write more realistic end-to-end integration tests that |
| take advantage of the entire ecosystem. The Mini Accumulo Cluster makes this possible by |
| configuring and starting Zookeeper, initializing Accumulo, and starting the Master as well |
| as some Tablet Servers. It runs against the local filesystem instead of having to start |
| up HDFS.</p></div> |
| <div class="paragraph"><p>To start it up, you will need to supply an empty directory and a root password as arguments:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">File</span> tempDirectory = <span style="font-style: italic"><span style="color: #b30000">// JUnit and Guava supply mechanisms for creating temp directories</span></span> |
| <span style="color: #000000">MiniAccumuloCluster</span> accumulo = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">MiniAccumuloCluster</span></span>(tempDirectory, <span style="color: #4c73a6">"password"</span>); |
| accumulo.<span style="font-weight: bold"><span style="color: #000000">start</span></span>();</tt></pre></div></div> |
| <div class="paragraph"><p>Once we have our mini cluster running, we will want to interact with the Accumulo client API:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Instance</span> instance = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">ZooKeeperInstance</span></span>(accumulo.<span style="font-weight: bold"><span style="color: #000000">getInstanceName</span></span>(), accumulo.<span style="font-weight: bold"><span style="color: #000000">getZooKeepers</span></span>()); |
| <span style="color: #000000">Connector</span> conn = instance.<span style="font-weight: bold"><span style="color: #000000">getConnector</span></span>(<span style="color: #4c73a6">"root"</span>, <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">PasswordToken</span></span>(<span style="color: #4c73a6">"password"</span>));</tt></pre></div></div> |
| <div class="paragraph"><p>Upon completion of our development code, we will want to shutdown our MiniAccumuloCluster:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>accumulo.<span style="font-weight: bold"><span style="color: #000000">stop</span></span>() |
| <span style="font-style: italic"><span style="color: #b30000">// delete your temporary folder</span></span></tt></pre></div></div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_table_configuration">6. Table Configuration</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Accumulo tables have a few options that can be configured to alter the default |
| behavior of Accumulo as well as improve performance based on the data stored. |
| These include locality groups, constraints, bloom filters, iterators, and block cache.</p></div> |
| <div class="sect2"> |
| <h3 id="_locality_groups">6.1. Locality Groups</h3> |
| <div class="paragraph"><p>Accumulo supports storing sets of column families separately on disk to allow |
| clients to efficiently scan over columns that are frequently used together and to avoid |
| scanning over column families that are not requested. After a locality group is set, |
| Scanner and BatchScanner operations will automatically take advantage of them |
| whenever the fetchColumnFamilies() method is used.</p></div> |
| <div class="paragraph"><p>By default, tables place all column families into the same “default” locality group. |
| Additional locality groups can be configured at any time via the shell or |
| programmatically as follows:</p></div> |
| <div class="sect3"> |
| <h4 id="_managing_locality_groups_via_the_shell">6.1.1. Managing Locality Groups via the Shell</h4> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>usage: setgroups <group>=<col fam>{,<col fam>}{ <group>=<col fam>{,<col fam>}} |
| [-?] -t <table></code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> setgroups group_one=colf1,colf2 -t mytable</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> getgroups -t mytable</code></pre> |
| </div></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_managing_locality_groups_via_the_client_api">6.1.2. Managing Locality Groups via the Client API</h4> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Connector</span> conn; |
| |
| <span style="color: #000000">HashMap<String,Set<Text>></span> localityGroups = <span style="color: #0000b3">new</span> HashMap<String, Set<Text>>(); |
| |
| <span style="color: #000000">HashSet<Text></span> metadataColumns = <span style="color: #0000b3">new</span> HashSet<Text>(); |
| metadataColumns.<span style="font-weight: bold"><span style="color: #000000">add</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"domain"</span>)); |
| metadataColumns.<span style="font-weight: bold"><span style="color: #000000">add</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"link"</span>)); |
| |
| <span style="color: #000000">HashSet<Text></span> contentColumns = <span style="color: #0000b3">new</span> HashSet<Text>(); |
| contentColumns.<span style="font-weight: bold"><span style="color: #000000">add</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"body"</span>)); |
| contentColumns.<span style="font-weight: bold"><span style="color: #000000">add</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"images"</span>)); |
| |
| localityGroups.<span style="font-weight: bold"><span style="color: #000000">put</span></span>(<span style="color: #4c73a6">"metadata"</span>, metadataColumns); |
| localityGroups.<span style="font-weight: bold"><span style="color: #000000">put</span></span>(<span style="color: #4c73a6">"content"</span>, contentColumns); |
| |
| conn.<span style="font-weight: bold"><span style="color: #000000">tableOperations</span></span>().<span style="font-weight: bold"><span style="color: #000000">setLocalityGroups</span></span>(<span style="color: #4c73a6">"mytable"</span>, localityGroups); |
| |
| <span style="font-style: italic"><span style="color: #b30000">// existing locality groups can be obtained as follows</span></span> |
| <span style="color: #000000">Map<String, Set<Text>></span> groups = |
| conn.<span style="font-weight: bold"><span style="color: #000000">tableOperations</span></span>().<span style="font-weight: bold"><span style="color: #000000">getLocalityGroups</span></span>(<span style="color: #4c73a6">"mytable"</span>);</tt></pre></div></div> |
| <div class="paragraph"><p>The assignment of Column Families to Locality Groups can be changed at any time. The |
| physical movement of column families into their new locality groups takes place via |
| the periodic Major Compaction process that takes place continuously in the |
| background. Major Compaction can also be scheduled to take place immediately |
| through the shell:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> compact -t mytable</code></pre> |
| </div></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_constraints">6.2. Constraints</h3> |
| <div class="paragraph"><p>Accumulo supports constraints applied on mutations at insert time. This can be |
| used to disallow certain inserts according to a user defined policy. Any mutation |
| that fails to meet the requirements of the constraint is rejected and sent back to the |
| client.</p></div> |
| <div class="paragraph"><p>Constraints can be enabled by setting a table property as follows:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> constraint -t mytable -a com.test.ExampleConstraint com.test.AnotherConstraint |
| user@myinstance mytable> constraint -l |
| com.test.ExampleConstraint=1 |
| com.test.AnotherConstraint=2</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Currently there are no general-purpose constraints provided with the Accumulo |
| distribution. New constraints can be created by writing a Java class that implements |
| the org.apache.accumulo.core.constraints.Constraint interface.</p></div> |
| <div class="paragraph"><p>To deploy a new constraint, create a jar file containing the class implementing the |
| new constraint and place it in the lib directory of the Accumulo installation. New |
| constraint jars can be added to Accumulo and enabled without restarting but any |
| change to an existing constraint class requires Accumulo to be restarted.</p></div> |
| <div class="paragraph"><p>An example of constraints can be found in |
| <code>accumulo/docs/examples/README.constraints</code> with corresponding code under |
| <code>accumulo/examples/simple/main/java/accumulo/examples/simple/constraints</code>.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_bloom_filters">6.3. Bloom Filters</h3> |
| <div class="paragraph"><p>As mutations are applied to an Accumulo table, several files are created per tablet. If |
| bloom filters are enabled, Accumulo will create and load a small data structure into |
| memory to determine whether a file contains a given key before opening the file. |
| This can speed up lookups considerably.</p></div> |
| <div class="paragraph"><p>To enable bloom filters, enter the following command in the Shell:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance> config -t mytable -s table.bloom.enabled=true</code></pre> |
| </div></div> |
| <div class="paragraph"><p>An extensive example of using Bloom Filters can be found at |
| <code>accumulo/docs/examples/README.bloom</code>.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_iterators">6.4. Iterators</h3> |
| <div class="paragraph"><p>Iterators provide a modular mechanism for adding functionality to be executed by |
| TabletServers when scanning or compacting data. This allows users to efficiently |
| summarize, filter, and aggregate data. In fact, the built-in features of cell-level |
| security and column fetching are implemented using Iterators. |
| Some useful Iterators are provided with Accumulo and can be found in the |
| <strong><code>org.apache.accumulo.core.iterators.user</code></strong> package. |
| In each case, any custom Iterators must be included in Accumulo’s classpath, |
| typically by including a jar in <code>$ACCUMULO_HOME/lib</code> or |
| <code>$ACCUMULO_HOME/lib/ext</code>, although the VFS classloader allows for |
| classpath manipulation using a variety of schemes including URLs and HDFS URIs.</p></div> |
| <div class="sect3"> |
| <h4 id="_setting_iterators_via_the_shell">6.4.1. Setting Iterators via the Shell</h4> |
| <div class="paragraph"><p>Iterators can be configured on a table at scan, minor compaction and/or major |
| compaction scopes. If the Iterator implements the OptionDescriber interface, the |
| setiter command can be used which will interactively prompt the user to provide |
| values for the given necessary options.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>usage: setiter [-?] -ageoff | -agg | -class <name> | -regex | |
| -reqvis | -vers [-majc] [-minc] [-n <itername>] -p <pri> |
| [-scan] [-t <table>]</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> setiter -t mytable -scan -p 15 -n myiter -class com.company.MyIterator</code></pre> |
| </div></div> |
| <div class="paragraph"><p>The config command can always be used to manually configure iterators which is useful |
| in cases where the Iterator does not implement the OptionDescriber interface.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>config -t mytable -s table.iterator.{scan|minc|majc}.myiter=15,com.company.MyIterator |
| config -t mytable -s table.iteartor.{scan|minc|majc}.myiter.opt.myoptionname=myoptionvalue</code></pre> |
| </div></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_setting_iterators_programmatically">6.4.2. Setting Iterators Programmatically</h4> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>scanner.<span style="font-weight: bold"><span style="color: #000000">addIterator</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">IteratorSetting</span></span>( |
| <span style="color: #000000">15</span>, <span style="font-style: italic"><span style="color: #b30000">// priority</span></span> |
| <span style="color: #4c73a6">"myiter"</span>, <span style="font-style: italic"><span style="color: #b30000">// name this iterator</span></span> |
| <span style="color: #4c73a6">"com.company.MyIterator"</span> <span style="font-style: italic"><span style="color: #b30000">// class name</span></span> |
| ));</tt></pre></div></div> |
| <div class="paragraph"><p>Some iterators take additional parameters from client code, as in the following |
| example:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">IteratorSetting</span> iter = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">IteratorSetting</span></span>(...); |
| iter.<span style="font-weight: bold"><span style="color: #000000">addOption</span></span>(<span style="color: #4c73a6">"myoptionname"</span>, <span style="color: #4c73a6">"myoptionvalue"</span>); |
| scanner.<span style="font-weight: bold"><span style="color: #000000">addIterator</span></span>(iter)</tt></pre></div></div> |
| <div class="paragraph"><p>Tables support separate Iterator settings to be applied at scan time, upon minor |
| compaction and upon major compaction. For most uses, tables will have identical |
| iterator settings for all three to avoid inconsistent results.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_versioning_iterators_and_timestamps">6.4.3. Versioning Iterators and Timestamps</h4> |
| <div class="paragraph"><p>Accumulo provides the capability to manage versioned data through the use of |
| timestamps within the Key. If a timestamp is not specified in the key created by the |
| client then the system will set the timestamp to the current time. Two keys with |
| identical rowIDs and columns but different timestamps are considered two versions |
| of the same key. If two inserts are made into Accumulo with the same rowID, |
| column, and timestamp, then the behavior is non-deterministic.</p></div> |
| <div class="paragraph"><p>Timestamps are sorted in descending order, so the most recent data comes first. |
| Accumulo can be configured to return the top k versions, or versions later than a |
| given date. The default is to return the one most recent version.</p></div> |
| <div class="paragraph"><p>The version policy can be changed by changing the VersioningIterator options for a |
| table as follows:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> config -t mytable -s table.iterator.scan.vers.opt.maxVersions=3 |
| user@myinstance mytable> config -t mytable -s table.iterator.minc.vers.opt.maxVersions=3 |
| user@myinstance mytable> config -t mytable -s table.iterator.majc.vers.opt.maxVersions=3</code></pre> |
| </div></div> |
| <div class="paragraph"><p>When a table is created, by default its configured to use the |
| VersioningIterator and keep one version. A table can be created without the |
| VersioningIterator with the -ndi option in the shell. Also the Java API |
| has the following method</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>connector.tableOperations.<span style="font-weight: bold"><span style="color: #000000">create</span></span>(<span style="color: #000000">String</span> tableName, <span style="color: #000000">boolean</span> limitVersion)</tt></pre></div></div> |
| <div class="sect4"> |
| <h5 id="_logical_time">Logical Time</h5> |
| <div class="paragraph"><p>Accumulo 1.2 introduces the concept of logical time. This ensures that timestamps |
| set by Accumulo always move forward. This helps avoid problems caused by |
| TabletServers that have different time settings. The per tablet counter gives unique |
| one up time stamps on a per mutation basis. When using time in milliseconds, if |
| two things arrive within the same millisecond then both receive the same |
| timestamp. When using time in milliseconds, Accumulo set times will still |
| always move forward and never backwards.</p></div> |
| <div class="paragraph"><p>A table can be configured to use logical timestamps at creation time as follows:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance> createtable -tl logical</code></pre> |
| </div></div> |
| </div> |
| <div class="sect4"> |
| <h5 id="_deletes">Deletes</h5> |
| <div class="paragraph"><p>Deletes are special keys in Accumulo that get sorted along will all the other data. |
| When a delete key is inserted, Accumulo will not show anything that has a |
| timestamp less than or equal to the delete key. During major compaction, any keys |
| older than a delete key are omitted from the new file created, and the omitted keys |
| are removed from disk as part of the regular garbage collection process.</p></div> |
| </div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_filters">6.4.4. Filters</h4> |
| <div class="paragraph"><p>When scanning over a set of key-value pairs it is possible to apply an arbitrary |
| filtering policy through the use of a Filter. Filters are types of iterators that return |
| only key-value pairs that satisfy the filter logic. Accumulo has a few built-in filters |
| that can be configured on any table: AgeOff, ColumnAgeOff, Timestamp, NoVis, and RegEx. More can be added |
| by writing a Java class that extends the |
| <code>org.apache.accumulo.core.iterators.Filter</code> class.</p></div> |
| <div class="paragraph"><p>The AgeOff filter can be configured to remove data older than a certain date or a fixed |
| amount of time from the present. The following example sets a table to delete |
| everything inserted over 30 seconds ago:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance> createtable filtertest |
| user@myinstance filtertest> setiter -t filtertest -scan -minc -majc -p 10 -n myfilter -ageoff |
| AgeOffFilter removes entries with timestamps more than <ttl> milliseconds old |
| ----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter |
| negate, default false keeps k/v that pass accept method, true rejects k/v |
| that pass accept method: |
| ----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter |
| ttl, time to live (milliseconds): 3000 |
| ----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter |
| currentTime, if set, use the given value as the absolute time in milliseconds |
| as the current time of day: |
| user@myinstance filtertest> |
| user@myinstance filtertest> scan |
| user@myinstance filtertest> insert foo a b c |
| user@myinstance filtertest> scan |
| foo a:b [] c |
| user@myinstance filtertest> sleep 4 |
| user@myinstance filtertest> scan |
| user@myinstance filtertest></code></pre> |
| </div></div> |
| <div class="paragraph"><p>To see the iterator settings for a table, use:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@example filtertest> config -t filtertest -f iterator |
| ---------+---------------------------------------------+------------------ |
| SCOPE | NAME | VALUE |
| ---------+---------------------------------------------+------------------ |
| table | table.iterator.majc.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter |
| table | table.iterator.majc.myfilter.opt.ttl ...... | 3000 |
| table | table.iterator.majc.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator |
| table | table.iterator.majc.vers.opt.maxVersions .. | 1 |
| table | table.iterator.minc.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter |
| table | table.iterator.minc.myfilter.opt.ttl ...... | 3000 |
| table | table.iterator.minc.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator |
| table | table.iterator.minc.vers.opt.maxVersions .. | 1 |
| table | table.iterator.scan.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter |
| table | table.iterator.scan.myfilter.opt.ttl ...... | 3000 |
| table | table.iterator.scan.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator |
| table | table.iterator.scan.vers.opt.maxVersions .. | 1 |
| ---------+---------------------------------------------+------------------</code></pre> |
| </div></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_combiners">6.4.5. Combiners</h4> |
| <div class="paragraph"><p>Accumulo allows Combiners to be configured on tables and column |
| families. When a Combiner is set it is applied across the values |
| associated with any keys that share rowID, column family, and column qualifier. |
| This is similar to the reduce step in MapReduce, which applied some function to all |
| the values associated with a particular key.</p></div> |
| <div class="paragraph"><p>For example, if a summing combiner were configured on a table and the following |
| mutations were inserted:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>Row Family Qualifier Timestamp Value |
| rowID1 colfA colqA 20100101 1 |
| rowID1 colfA colqA 20100102 1</code></pre> |
| </div></div> |
| <div class="paragraph"><p>The table would reflect only one aggregate value:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>rowID1 colfA colqA - 2</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Combiners can be enabled for a table using the setiter command in the shell. Below is an example.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@a14 perDayCounts> setiter -t perDayCounts -p 10 -scan -minc -majc -n daycount -class org.apache.accumulo.core.iterators.user.SummingCombiner |
| TypedValueCombiner can interpret Values as a variety of number encodings |
| (VLong, Long, or String) before combining |
| ----------> set SummingCombiner parameter columns, |
| <col fam>[:<col qual>]{,<col fam>[:<col qual>]} : day |
| ----------> set SummingCombiner parameter type, <VARNUM|LONG|STRING>: STRING</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@a14 perDayCounts> insert foo day 20080101 1 |
| root@a14 perDayCounts> insert foo day 20080101 1 |
| root@a14 perDayCounts> insert foo day 20080103 1 |
| root@a14 perDayCounts> insert bar day 20080101 1 |
| root@a14 perDayCounts> insert bar day 20080101 1</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@a14 perDayCounts> scan |
| bar day:20080101 [] 2 |
| foo day:20080101 [] 2 |
| foo day:20080103 [] 1</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Accumulo includes some useful Combiners out of the box. To find these look in |
| the <strong><code>org.apache.accumulo.core.iterators.user</code></strong> package.</p></div> |
| <div class="paragraph"><p>Additional Combiners can be added by creating a Java class that extends |
| <code>org.apache.accumulo.core.iterators.Combiner</code> and adding a jar containing that |
| class to Accumulo’s lib/ext directory.</p></div> |
| <div class="paragraph"><p>An example of a Combiner can be found under |
| <code>accumulo/examples/simple/main/java/org/apache/accumulo/examples/simple/combiner/StatsCombiner.java</code></p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_block_cache">6.5. Block Cache</h3> |
| <div class="paragraph"><p>In order to increase throughput of commonly accessed entries, Accumulo employs a block cache. |
| This block cache buffers data in memory so that it doesn’t have to be read off of disk. |
| The RFile format that Accumulo prefers is a mix of index blocks and data blocks, where the index blocks are used to find the appropriate data blocks. |
| Typical queries to Accumulo result in a binary search over several index blocks followed by a linear scan of one or more data blocks.</p></div> |
| <div class="paragraph"><p>The block cache can be configured on a per-table basis, and all tablets hosted on a tablet server share a single resource pool. |
| To configure the size of the tablet server’s block cache, set the following properties:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>tserver.cache.data.size: Specifies the size of the cache for file data blocks. |
| tserver.cache.index.size: Specifies the size of the cache for file indices.</code></pre> |
| </div></div> |
| <div class="paragraph"><p>To enable the block cache for your table, set the following properties:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>table.cache.block.enable: Determines whether file (data) block cache is enabled. |
| table.cache.index.enable: Determines whether index cache is enabled.</code></pre> |
| </div></div> |
| <div class="paragraph"><p>The block cache can have a significant effect on alleviating hot spots, as well as reducing query latency. |
| It is enabled by default for the !METADATA table.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_compaction">6.6. Compaction</h3> |
| <div class="paragraph"><p>As data is written to Accumulo it is buffered in memory. The data buffered in |
| memory is eventually written to HDFS on a per tablet basis. Files can also be |
| added to tablets directly by bulk import. In the background tablet servers run |
| major compactions to merge multiple files into one. The tablet server has to |
| decide which tablets to compact and which files within a tablet to compact. |
| This decision is made using the compaction ratio, which is configurable on a |
| per table basis. To configure this ratio modify the following property:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>table.compaction.major.ratio</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Increasing this ratio will result in more files per tablet and less compaction |
| work. More files per tablet means more higher query latency. So adjusting |
| this ratio is a trade off between ingest and query performance. The ratio |
| defaults to 3.</p></div> |
| <div class="paragraph"><p>The way the ratio works is that a set of files is compacted into one file if the |
| sum of the sizes of the files in the set is larger than the ratio multiplied by |
| the size of the largest file in the set. If this is not true for the set of all |
| files in a tablet, the largest file is removed from consideration, and the |
| remaining files are considered for compaction. This is repeated until a |
| compaction is triggered or there are no files left to consider.</p></div> |
| <div class="paragraph"><p>The number of background threads tablet servers use to run major compactions is |
| configurable. To configure this modify the following property:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>tserver.compaction.major.concurrent.max</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Also, the number of threads tablet servers use for minor compactions is |
| configurable. To configure this modify the following property:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>tserver.compaction.minor.concurrent.max</code></pre> |
| </div></div> |
| <div class="paragraph"><p>The numbers of minor and major compactions running and queued is visible on the |
| Accumulo monitor page. This allows you to see if compactions are backing up |
| and adjustments to the above settings are needed. When adjusting the number of |
| threads available for compactions, consider the number of cores and other tasks |
| running on the nodes such as maps and reduces.</p></div> |
| <div class="paragraph"><p>If major compactions are not keeping up, then the number of files per tablet |
| will grow to a point such that query performance starts to suffer. One way to |
| handle this situation is to increase the compaction ratio. For example, if the |
| compaction ratio were set to 1, then every new file added to a tablet by minor |
| compaction would immediately queue the tablet for major compaction. So if a |
| tablet has a 200M file and minor compaction writes a 1M file, then the major |
| compaction will attempt to merge the 200M and 1M file. If the tablet server |
| has lots of tablets trying to do this sort of thing, then major compactions |
| will back up and the number of files per tablet will start to grow, assuming |
| data is being continuously written. Increasing the compaction ratio will |
| alleviate backups by lowering the amount of major compaction work that needs to |
| be done.</p></div> |
| <div class="paragraph"><p>Another option to deal with the files per tablet growing too large is to adjust |
| the following property:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>table.file.max</code></pre> |
| </div></div> |
| <div class="paragraph"><p>When a tablet reaches this number of files and needs to flush its in-memory |
| data to disk, it will choose to do a merging minor compaction. A merging minor |
| compaction will merge the tablet’s smallest file with the data in memory at |
| minor compaction time. Therefore the number of files will not grow beyond this |
| limit. This will make minor compactions take longer, which will cause ingest |
| performance to decrease. This can cause ingest to slow down until major |
| compactions have enough time to catch up. When adjusting this property, also |
| consider adjusting the compaction ratio. Ideally, merging minor compactions |
| never need to occur and major compactions will keep up. It is possible to |
| configure the file max and compaction ratio such that only merging minor |
| compactions occur and major compactions never occur. This should be avoided |
| because doing only merging minor compactions causes O(<em>N</em><sup>2</sup>) work to be done. |
| The amount of work done by major compactions is O(<em>N</em>*log<sub><em>R</em></sub>(<em>N</em>)) where |
| <em>R</em> is the compaction ratio.</p></div> |
| <div class="paragraph"><p>Compactions can be initiated manually for a table. To initiate a minor |
| compaction, use the flush command in the shell. To initiate a major compaction, |
| use the compact command in the shell. The compact command will compact all |
| tablets in a table to one file. Even tablets with one file are compacted. This |
| is useful for the case where a major compaction filter is configured for a |
| table. In 1.4 the ability to compact a range of a table was added. To use this |
| feature specify start and stop rows for the compact command. This will only |
| compact tablets that overlap the given row range.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_pre_splitting_tables">6.7. Pre-splitting tables</h3> |
| <div class="paragraph"><p>Accumulo will balance and distribute tables across servers. Before a |
| table gets large, it will be maintained as a single tablet on a single |
| server. This limits the speed at which data can be added or queried |
| to the speed of a single node. To improve performance when the a table |
| is new, or small, you can add split points and generate new tablets.</p></div> |
| <div class="paragraph"><p>In the shell:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> createtable newTable |
| root@myinstance> addsplits -t newTable g n t</code></pre> |
| </div></div> |
| <div class="paragraph"><p>This will create a new table with 4 tablets. The table will be split |
| on the letters “g”, “n”, and “t” which will work nicely if the |
| row data start with lower-case alphabetic characters. If your row |
| data includes binary information or numeric information, or if the |
| distribution of the row information is not flat, then you would pick |
| different split points. Now ingest and query can proceed on 4 nodes |
| which can improve performance.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_merging_tablets">6.8. Merging tablets</h3> |
| <div class="paragraph"><p>Over time, a table can get very large, so large that it has hundreds |
| of thousands of split points. Once there are enough tablets to spread |
| a table across the entire cluster, additional splits may not improve |
| performance, and may create unnecessary bookkeeping. The distribution |
| of data may change over time. For example, if row data contains date |
| information, and data is continually added and removed to maintain a |
| window of current information, tablets for older rows may be empty.</p></div> |
| <div class="paragraph"><p>Accumulo supports tablet merging, which can be used to reduce |
| the number of split points. The following command will merge all rows |
| from “A” to “Z” into a single tablet:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> merge -t myTable -s A -e Z</code></pre> |
| </div></div> |
| <div class="paragraph"><p>If the result of a merge produces a tablet that is larger than the |
| configured split size, the tablet may be split by the tablet server. |
| Be sure to increase your tablet size prior to any merges if the goal |
| is to have larger tablets:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> config -t myTable -s table.split.threshold=2G</code></pre> |
| </div></div> |
| <div class="paragraph"><p>In order to merge small tablets, you can ask Accumulo to merge |
| sections of a table smaller than a given size.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> merge -t myTable -s 100M</code></pre> |
| </div></div> |
| <div class="paragraph"><p>By default, small tablets will not be merged into tablets that are |
| already larger than the given size. This can leave isolated small |
| tablets. To force small tablets to be merged into larger tablets use |
| the “--{}--force” option:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> merge -t myTable -s 100M --force</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Merging away small tablets works on one section at a time. If your |
| table contains many sections of small split points, or you are |
| attempting to change the split size of the entire table, it will be |
| faster to set the split point and merge the entire table:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> config -t myTable -s table.split.threshold=256M |
| root@myinstance> merge -t myTable</code></pre> |
| </div></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_delete_range">6.9. Delete Range</h3> |
| <div class="paragraph"><p>Consider an indexing scheme that uses date information in each row. |
| For example “20110823-15:20:25.013” might be a row that specifies a |
| date and time. In some cases, we might like to delete rows based on |
| this date, say to remove all the data older than the current year. |
| Accumulo supports a delete range operation which efficiently |
| removes data between two rows. For example:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> deleterange -t myTable -s 2010 -e 2011</code></pre> |
| </div></div> |
| <div class="paragraph"><p>This will delete all rows starting with “2010” and it will stop at |
| any row starting “2011”. You can delete any data prior to 2011 |
| with:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@myinstance> deleterange -t myTable -e 2011 --force</code></pre> |
| </div></div> |
| <div class="paragraph"><p>The shell will not allow you to delete an unbounded range (no start) |
| unless you provide the “--{}--force” option.</p></div> |
| <div class="paragraph"><p>Range deletion is implemented using splits at the given start/end |
| positions, and will affect the number of splits in the table.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_cloning_tables">6.10. Cloning Tables</h3> |
| <div class="paragraph"><p>A new table can be created that points to an existing table’s data. This is a |
| very quick metadata operation, no data is actually copied. The cloned table |
| and the source table can change independently after the clone operation. One |
| use case for this feature is testing. For example to test a new filtering |
| iterator, clone the table, add the filter to the clone, and force a major |
| compaction. To perform a test on less data, clone a table and then use delete |
| range to efficiently remove a lot of data from the clone. Another use case is |
| generating a snapshot to guard against human error. To create a snapshot, |
| clone a table and then disable write permissions on the clone.</p></div> |
| <div class="paragraph"><p>The clone operation will point to the source table’s files. This is why the |
| flush option is present and is enabled by default in the shell. If the flush |
| option is not enabled, then any data the source table currently has in memory |
| will not exist in the clone.</p></div> |
| <div class="paragraph"><p>A cloned table copies the configuration of the source table. However the |
| permissions of the source table are not copied to the clone. After a clone is |
| created, only the user that created the clone can read and write to it.</p></div> |
| <div class="paragraph"><p>In the following example we see that data inserted after the clone operation is |
| not visible in the clone.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@a14> createtable people |
| root@a14 people> insert 890435 name last Doe |
| root@a14 people> insert 890435 name first John |
| root@a14 people> clonetable people test |
| root@a14 people> insert 890436 name first Jane |
| root@a14 people> insert 890436 name last Doe |
| root@a14 people> scan |
| 890435 name:first [] John |
| 890435 name:last [] Doe |
| 890436 name:first [] Jane |
| 890436 name:last [] Doe |
| root@a14 people> table test |
| root@a14 test> scan |
| 890435 name:first [] John |
| 890435 name:last [] Doe |
| root@a14 test></code></pre> |
| </div></div> |
| <div class="paragraph"><p>The du command in the shell shows how much space a table is using in HDFS. |
| This command can also show how much overlapping space two cloned tables have in |
| HDFS. In the example below du shows table ci is using 428M. Then ci is cloned |
| to cic and du shows that both tables share 428M. After three entries are |
| inserted into cic and its flushed, du shows the two tables still share 428M but |
| cic has 226 bytes to itself. Finally, table cic is compacted and then du shows |
| that each table uses 428M.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@a14> du ci |
| 428,482,573 [ci] |
| root@a14> clonetable ci cic |
| root@a14> du ci cic |
| 428,482,573 [ci, cic] |
| root@a14> table cic |
| root@a14 cic> insert r1 cf1 cq1 v1 |
| root@a14 cic> insert r1 cf1 cq2 v2 |
| root@a14 cic> insert r1 cf1 cq3 v3 |
| root@a14 cic> flush -t cic -w |
| 27 15:00:13,908 [shell.Shell] INFO : Flush of table cic completed. |
| root@a14 cic> du ci cic |
| 428,482,573 [ci, cic] |
| 226 [cic] |
| root@a14 cic> compact -t cic -w |
| 27 15:00:35,871 [shell.Shell] INFO : Compacting table ... |
| 27 15:03:03,303 [shell.Shell] INFO : Compaction of table cic completed for given range |
| root@a14 cic> du ci cic |
| 428,482,573 [ci] |
| 428,482,612 [cic] |
| root@a14 cic></code></pre> |
| </div></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_exporting_tables">6.11. Exporting Tables</h3> |
| <div class="paragraph"><p>Accumulo supports exporting tables for the purpose of copying tables to another |
| cluster. Exporting and importing tables preserves the tables configuration, |
| splits, and logical time. Tables are exported and then copied via the hadoop |
| distcp command. To export a table, it must be offline and stay offline while |
| discp runs. The reason it needs to stay offline is to prevent files from being |
| deleted. A table can be cloned and the clone taken offline inorder to avoid |
| losing access to the table. See docs/examples/README.export for an example.</p></div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_table_design">7. Table Design</h2> |
| <div class="sectionbody"> |
| <div class="sect2"> |
| <h3 id="_basic_table">7.1. Basic Table</h3> |
| <div class="paragraph"><p>Since Accumulo tables are sorted by row ID, each table can be thought of as being |
| indexed by the row ID. Lookups performed by row ID can be executed quickly, by doing |
| a binary search, first across the tablets, and then within a tablet. Clients should |
| choose a row ID carefully in order to support their desired application. A simple rule |
| is to select a unique identifier as the row ID for each entity to be stored and assign |
| all the other attributes to be tracked to be columns under this row ID. For example, |
| if we have the following data in a comma-separated file:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>userid,age,address,account-balance</code></pre> |
| </div></div> |
| <div class="paragraph"><p>We might choose to store this data using the userid as the rowID and the rest of the |
| data in column families:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Mutation</span> m = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Mutation</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(userid)); |
| m.<span style="font-weight: bold"><span style="color: #000000">put</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"age"</span>), age); |
| m.<span style="font-weight: bold"><span style="color: #000000">put</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"address"</span>), address); |
| m.<span style="font-weight: bold"><span style="color: #000000">put</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"balance"</span>), account_balance); |
| |
| writer.<span style="font-weight: bold"><span style="color: #000000">add</span></span>(m);</tt></pre></div></div> |
| <div class="paragraph"><p>We could then retrieve any of the columns for a specific userid by specifying the |
| userid as the range of a scanner and fetching specific columns:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Range</span> r = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Range</span></span>(userid, userid); <span style="font-style: italic"><span style="color: #b30000">// single row</span></span> |
| <span style="color: #000000">Scanner</span> s = conn.<span style="font-weight: bold"><span style="color: #000000">createScanner</span></span>(<span style="color: #4c73a6">"userdata"</span>, auths); |
| s.<span style="font-weight: bold"><span style="color: #000000">setRange</span></span>(r); |
| s.<span style="font-weight: bold"><span style="color: #000000">fetchColumnFamily</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"age"</span>)); |
| |
| <span style="color: #0000b3">for</span>(<span style="color: #000000">Entry<Key,Value></span> entry : s) |
| System.out.<span style="font-weight: bold"><span style="color: #000000">println</span></span>(entry.<span style="font-weight: bold"><span style="color: #000000">getValue</span></span>().<span style="font-weight: bold"><span style="color: #000000">toString</span></span>());</tt></pre></div></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_rowid_design">7.2. RowID Design</h3> |
| <div class="paragraph"><p>Often it is necessary to transform the rowID in order to have rows ordered in a way |
| that is optimal for anticipated access patterns. A good example of this is reversing |
| the order of components of internet domain names in order to group rows of the |
| same parent domain together:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>com.google.code |
| com.google.labs |
| com.google.mail |
| com.yahoo.mail |
| com.yahoo.research</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Some data may result in the creation of very large rows - rows with many columns. |
| In this case the table designer may wish to split up these rows for better load |
| balancing while keeping them sorted together for scanning purposes. This can be |
| done by appending a random substring at the end of the row:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>com.google.code_00 |
| com.google.code_01 |
| com.google.code_02 |
| com.google.labs_00 |
| com.google.mail_00 |
| com.google.mail_01</code></pre> |
| </div></div> |
| <div class="paragraph"><p>It could also be done by adding a string representation of some period of time such as date to the week |
| or month:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>com.google.code_201003 |
| com.google.code_201004 |
| com.google.code_201005 |
| com.google.labs_201003 |
| com.google.mail_201003 |
| com.google.mail_201004</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Appending dates provides the additional capability of restricting a scan to a given |
| date range.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_indexing">7.3. Indexing</h3> |
| <div class="paragraph"><p>In order to support lookups via more than one attribute of an entity, additional |
| indexes can be built. However, because Accumulo tables can support any number of |
| columns without specifying them beforehand, a single additional index will often |
| suffice for supporting lookups of records in the main table. Here, the index has, as |
| the rowID, the Value or Term from the main table, the column families are the same, |
| and the column qualifier of the index table contains the rowID from the main table.</p></div> |
| <div class="tableblock"> |
| <table rules="rows" |
| width="75%" |
| frame="border" |
| cellspacing="0" cellpadding="4"> |
| <col width="25%" /> |
| <col width="25%" /> |
| <col width="25%" /> |
| <col width="25%" /> |
| <thead> |
| <tr> |
| <th align="center" valign="top">RowID </th> |
| <th align="center" valign="top">Column Family </th> |
| <th align="center" valign="top">Column Qualifier </th> |
| <th align="center" valign="top">Value</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td align="center" valign="top"><p class="table">Term</p></td> |
| <td align="center" valign="top"><p class="table">Field Name</p></td> |
| <td align="center" valign="top"><p class="table">MainRowID</p></td> |
| <td align="center" valign="top"><p class="table"></p></td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <div class="paragraph"><p>Note: We store rowIDs in the column qualifier rather than the Value so that we can |
| have more than one rowID associated with a particular term within the index. If we |
| stored this in the Value we would only see one of the rows in which the value |
| appears since Accumulo is configured by default to return the one most recent |
| value associated with a key.</p></div> |
| <div class="paragraph"><p>Lookups can then be done by scanning the Index Table first for occurrences of the |
| desired values in the columns specified, which returns a list of row ID from the main |
| table. These can then be used to retrieve each matching record, in their entirety, or a |
| subset of their columns, from the Main Table.</p></div> |
| <div class="paragraph"><p>To support efficient lookups of multiple rowIDs from the same table, the Accumulo |
| client library provides a BatchScanner. Users specify a set of Ranges to the |
| BatchScanner, which performs the lookups in multiple threads to multiple servers |
| and returns an Iterator over all the rows retrieved. The rows returned are NOT in |
| sorted order, as is the case with the basic Scanner interface.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="font-style: italic"><span style="color: #b30000">// first we scan the index for IDs of rows matching our query</span></span> |
| |
| <span style="color: #000000">Text</span> term = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"mySearchTerm"</span>); |
| |
| <span style="color: #000000">HashSet<Range></span> matchingRows = <span style="color: #0000b3">new</span> HashSet<Range>(); |
| |
| <span style="color: #000000">Scanner</span> indexScanner = <span style="font-weight: bold"><span style="color: #000000">createScanner</span></span>(<span style="color: #4c73a6">"index"</span>, auths); |
| indexScanner.<span style="font-weight: bold"><span style="color: #000000">setRange</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Range</span></span>(term, term)); |
| |
| <span style="font-style: italic"><span style="color: #b30000">// we retrieve the matching rowIDs and create a set of ranges</span></span> |
| <span style="color: #0000b3">for</span>(<span style="color: #000000">Entry<Key,Value></span> entry : indexScanner) |
| matchingRows.<span style="font-weight: bold"><span style="color: #000000">add</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Range</span></span>(entry.<span style="font-weight: bold"><span style="color: #000000">getKey</span></span>().<span style="font-weight: bold"><span style="color: #000000">getColumnQualifier</span></span>())); |
| |
| <span style="font-style: italic"><span style="color: #b30000">// now we pass the set of rowIDs to the batch scanner to retrieve them</span></span> |
| <span style="color: #000000">BatchScanner</span> bscan = conn.<span style="font-weight: bold"><span style="color: #000000">createBatchScanner</span></span>(<span style="color: #4c73a6">"table"</span>, auths, <span style="color: #000000">10</span>); |
| |
| bscan.<span style="font-weight: bold"><span style="color: #000000">setRanges</span></span>(matchingRows); |
| bscan.<span style="font-weight: bold"><span style="color: #000000">fetchColumnFamily</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"attributes"</span>)); |
| |
| <span style="color: #0000b3">for</span>(<span style="color: #000000">Entry<Key,Value></span> entry : bscan) |
| System.out.<span style="font-weight: bold"><span style="color: #000000">println</span></span>(entry.<span style="font-weight: bold"><span style="color: #000000">getValue</span></span>());</tt></pre></div></div> |
| <div class="paragraph"><p>One advantage of the dynamic schema capabilities of Accumulo is that different |
| fields may be indexed into the same physical table. However, it may be necessary to |
| create different index tables if the terms must be formatted differently in order to |
| maintain proper sort order. For example, real numbers must be formatted |
| differently than their usual notation in order to be sorted correctly. In these cases, |
| usually one index per unique data type will suffice.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_entity_attribute_and_graph_tables">7.4. Entity-Attribute and Graph Tables</h3> |
| <div class="paragraph"><p>Accumulo is ideal for storing entities and their attributes, especially of the |
| attributes are sparse. It is often useful to join several datasets together on common |
| entities within the same table. This can allow for the representation of graphs, |
| including nodes, their attributes, and connections to other nodes.</p></div> |
| <div class="paragraph"><p>Rather than storing individual events, Entity-Attribute or Graph tables store |
| aggregate information about the entities involved in the events and the |
| relationships between entities. This is often preferrable when single events aren’t |
| very useful and when a continuously updated summarization is desired.</p></div> |
| <div class="paragraph"><p>The physical schema for an entity-attribute or graph table is as follows:</p></div> |
| <div class="tableblock"> |
| <table rules="rows" |
| width="75%" |
| frame="border" |
| cellspacing="0" cellpadding="4"> |
| <col width="25%" /> |
| <col width="25%" /> |
| <col width="25%" /> |
| <col width="25%" /> |
| <thead> |
| <tr> |
| <th align="center" valign="top">RowID </th> |
| <th align="center" valign="top">Column Family </th> |
| <th align="center" valign="top">Column Qualifier </th> |
| <th align="center" valign="top">Value</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td align="center" valign="top"><p class="table">EntityID</p></td> |
| <td align="center" valign="top"><p class="table">Attribute Name</p></td> |
| <td align="center" valign="top"><p class="table">Attribute Value</p></td> |
| <td align="center" valign="top"><p class="table">Weight</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">EntityID</p></td> |
| <td align="center" valign="top"><p class="table">Edge Type</p></td> |
| <td align="center" valign="top"><p class="table">Related EntityID</p></td> |
| <td align="center" valign="top"><p class="table">Weight</p></td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <div class="paragraph"><p>For example, to keep track of employees, managers and products the following |
| entity-attribute table could be used. Note that the weights are not always necessary |
| and are set to 0 when not used.</p></div> |
| <div class="tableblock"> |
| <table rules="rows" |
| width="75%" |
| frame="border" |
| cellspacing="0" cellpadding="4"> |
| <col width="25%" /> |
| <col width="25%" /> |
| <col width="25%" /> |
| <col width="25%" /> |
| <thead> |
| <tr> |
| <th align="center" valign="top">RowID </th> |
| <th align="center" valign="top">Column Family </th> |
| <th align="center" valign="top">Column Qualifier </th> |
| <th align="center" valign="top">Value</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td align="center" valign="top"><p class="table">E001</p></td> |
| <td align="center" valign="top"><p class="table">name</p></td> |
| <td align="center" valign="top"><p class="table">bob</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E001</p></td> |
| <td align="center" valign="top"><p class="table">department</p></td> |
| <td align="center" valign="top"><p class="table">sales</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E001</p></td> |
| <td align="center" valign="top"><p class="table">hire_date</p></td> |
| <td align="center" valign="top"><p class="table">20030102</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E001</p></td> |
| <td align="center" valign="top"><p class="table">units_sold</p></td> |
| <td align="center" valign="top"><p class="table">P001</p></td> |
| <td align="center" valign="top"><p class="table">780</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E002</p></td> |
| <td align="center" valign="top"><p class="table">name</p></td> |
| <td align="center" valign="top"><p class="table">george</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E002</p></td> |
| <td align="center" valign="top"><p class="table">department</p></td> |
| <td align="center" valign="top"><p class="table">sales</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E002</p></td> |
| <td align="center" valign="top"><p class="table">manager_of</p></td> |
| <td align="center" valign="top"><p class="table">E001</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E002</p></td> |
| <td align="center" valign="top"><p class="table">manager_of</p></td> |
| <td align="center" valign="top"><p class="table">E003</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E003</p></td> |
| <td align="center" valign="top"><p class="table">name</p></td> |
| <td align="center" valign="top"><p class="table">harry</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E003</p></td> |
| <td align="center" valign="top"><p class="table">department</p></td> |
| <td align="center" valign="top"><p class="table">accounts_recv</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E003</p></td> |
| <td align="center" valign="top"><p class="table">hire_date</p></td> |
| <td align="center" valign="top"><p class="table">20000405</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E003</p></td> |
| <td align="center" valign="top"><p class="table">units_sold</p></td> |
| <td align="center" valign="top"><p class="table">P002</p></td> |
| <td align="center" valign="top"><p class="table">566</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">E003</p></td> |
| <td align="center" valign="top"><p class="table">units_sold</p></td> |
| <td align="center" valign="top"><p class="table">P001</p></td> |
| <td align="center" valign="top"><p class="table">232</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">P001</p></td> |
| <td align="center" valign="top"><p class="table">product_name</p></td> |
| <td align="center" valign="top"><p class="table">nike_airs</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">P001</p></td> |
| <td align="center" valign="top"><p class="table">product_type</p></td> |
| <td align="center" valign="top"><p class="table">shoe</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">P001</p></td> |
| <td align="center" valign="top"><p class="table">in_stock</p></td> |
| <td align="center" valign="top"><p class="table">germany</p></td> |
| <td align="center" valign="top"><p class="table">900</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">P001</p></td> |
| <td align="center" valign="top"><p class="table">in_stock</p></td> |
| <td align="center" valign="top"><p class="table">brazil</p></td> |
| <td align="center" valign="top"><p class="table">200</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">P002</p></td> |
| <td align="center" valign="top"><p class="table">product_name</p></td> |
| <td align="center" valign="top"><p class="table">basic_jacket</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">P002</p></td> |
| <td align="center" valign="top"><p class="table">product_type</p></td> |
| <td align="center" valign="top"><p class="table">clothing</p></td> |
| <td align="center" valign="top"><p class="table">0</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">P002</p></td> |
| <td align="center" valign="top"><p class="table">in_stock</p></td> |
| <td align="center" valign="top"><p class="table">usa</p></td> |
| <td align="center" valign="top"><p class="table">3454</p></td> |
| </tr> |
| <tr> |
| <td align="center" valign="top"><p class="table">P002</p></td> |
| <td align="center" valign="top"><p class="table">in_stock</p></td> |
| <td align="center" valign="top"><p class="table">germany</p></td> |
| <td align="center" valign="top"><p class="table">700</p></td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <div class="paragraph"><p>To allow efficient updating of edge weights, an aggregating iterator can be |
| configured to add the value of all mutations applied with the same key. These types |
| of tables can easily be created from raw events by simply extracting the entities, |
| attributes, and relationships from individual events and inserting the keys into |
| Accumulo each with a count of 1. The aggregating iterator will take care of |
| maintaining the edge weights.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_document_partitioned_indexing">7.5. Document-Partitioned Indexing</h3> |
| <div class="paragraph"><p>Using a simple index as described above works well when looking for records that |
| match one of a set of given criteria. When looking for records that match more than |
| one criterion simultaneously, such as when looking for documents that contain all of |
| the words ‘the’ and ‘white’ and ‘house’, there are several issues.</p></div> |
| <div class="paragraph"><p>First is that the set of all records matching any one of the search terms must be sent |
| to the client, which incurs a lot of network traffic. The second problem is that the |
| client is responsible for performing set intersection on the sets of records returned |
| to eliminate all but the records matching all search terms. The memory of the client |
| may easily be overwhelmed during this operation.</p></div> |
| <div class="paragraph"><p>For these reasons Accumulo includes support for a scheme known as sharded |
| indexing, in which these set operations can be performed at the TabletServers and |
| decisions about which records to include in the result set can be made without |
| incurring network traffic.</p></div> |
| <div class="paragraph"><p>This is accomplished via partitioning records into bins that each reside on at most |
| one TabletServer, and then creating an index of terms per record within each bin as |
| follows:</p></div> |
| <div class="tableblock"> |
| <table rules="rows" |
| width="75%" |
| frame="border" |
| cellspacing="0" cellpadding="4"> |
| <col width="25%" /> |
| <col width="25%" /> |
| <col width="25%" /> |
| <col width="25%" /> |
| <thead> |
| <tr> |
| <th align="center" valign="top">RowID </th> |
| <th align="center" valign="top">Column Family </th> |
| <th align="center" valign="top">Column Qualifier </th> |
| <th align="center" valign="top">Value</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td align="center" valign="top"><p class="table">BinID</p></td> |
| <td align="center" valign="top"><p class="table">Term</p></td> |
| <td align="center" valign="top"><p class="table">DocID</p></td> |
| <td align="center" valign="top"><p class="table">Weight</p></td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <div class="paragraph"><p>Documents or records are mapped into bins by a user-defined ingest application. By |
| storing the BinID as the RowID we ensure that all the information for a particular |
| bin is contained in a single tablet and hosted on a single TabletServer since |
| Accumulo never splits rows across tablets. Storing the Terms as column families |
| serves to enable fast lookups of all the documents within this bin that contain the |
| given term.</p></div> |
| <div class="paragraph"><p>Finally, we perform set intersection operations on the TabletServer via a special |
| iterator called the Intersecting Iterator. Since documents are partitioned into many |
| bins, a search of all documents must search every bin. We can use the BatchScanner |
| to scan all bins in parallel. The Intersecting Iterator should be enabled on a |
| BatchScanner within user query code as follows:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>Text[] terms = {<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"the"</span>), <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"white"</span>), <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"house"</span>)}; |
| |
| <span style="color: #000000">BatchScanner</span> bs = conn.<span style="font-weight: bold"><span style="color: #000000">createBatchScanner</span></span>(table, auths, <span style="color: #000000">20</span>); |
| <span style="color: #000000">IteratorSetting</span> iter = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">IteratorSetting</span></span>(<span style="color: #000000">20</span>, <span style="color: #4c73a6">"ii"</span>, IntersectingIterator.<span style="color: #0000b3">class</span>); |
| IntersectingIterator.<span style="font-weight: bold"><span style="color: #000000">setColumnFamilies</span></span>(iter, terms); |
| bs.<span style="font-weight: bold"><span style="color: #000000">addScanIterator</span></span>(iter); |
| bs.<span style="font-weight: bold"><span style="color: #000000">setRanges</span></span>(Collections.<span style="font-weight: bold"><span style="color: #000000">singleton</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Range</span></span>())); |
| |
| <span style="color: #0000b3">for</span>(<span style="color: #000000">Entry<Key,Value></span> entry : bs) { |
| System.out.<span style="font-weight: bold"><span style="color: #000000">println</span></span>(<span style="color: #4c73a6">" "</span> + entry.<span style="font-weight: bold"><span style="color: #000000">getKey</span></span>().<span style="font-weight: bold"><span style="color: #000000">getColumnQualifier</span></span>()); |
| }</tt></pre></div></div> |
| <div class="paragraph"><p>This code effectively has the BatchScanner scan all tablets of a table, looking for |
| documents that match all the given terms. Because all tablets are being scanned for |
| every query, each query is more expensive than other Accumulo scans, which |
| typically involve a small number of TabletServers. This reduces the number of |
| concurrent queries supported and is subject to what is known as the ‘straggler’ |
| problem in which every query runs as slow as the slowest server participating.</p></div> |
| <div class="paragraph"><p>Of course, fast servers will return their results to the client which can display them |
| to the user immediately while they wait for the rest of the results to arrive. If the |
| results are unordered this is quite effective as the first results to arrive are as good |
| as any others to the user.</p></div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_high_speed_ingest">8. High-Speed Ingest</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Accumulo is often used as part of a larger data processing and storage system. To |
| maximize the performance of a parallel system involving Accumulo, the ingestion |
| and query components should be designed to provide enough parallelism and |
| concurrency to avoid creating bottlenecks for users and other systems writing to |
| and reading from Accumulo. There are several ways to achieve high ingest |
| performance.</p></div> |
| <div class="sect2"> |
| <h3 id="_pre_splitting_new_tables">8.1. Pre-Splitting New Tables</h3> |
| <div class="paragraph"><p>New tables consist of a single tablet by default. As mutations are applied, the table |
| grows and splits into multiple tablets which are balanced by the Master across |
| TabletServers. This implies that the aggregate ingest rate will be limited to fewer |
| servers than are available within the cluster until the table has reached the point |
| where there are tablets on every TabletServer.</p></div> |
| <div class="paragraph"><p>Pre-splitting a table ensures that there are as many tablets as desired available |
| before ingest begins to take advantage of all the parallelism possible with the cluster |
| hardware. Tables can be split at any time by using the shell:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> addsplits -sf /local_splitfile -t mytable</code></pre> |
| </div></div> |
| <div class="paragraph"><p>For the purposes of providing parallelism to ingest it is not necessary to create more |
| tablets than there are physical machines within the cluster as the aggregate ingest |
| rate is a function of the number of physical machines. Note that the aggregate ingest |
| rate is still subject to the number of machines running ingest clients, and the |
| distribution of rowIDs across the table. The aggregation ingest rate will be |
| suboptimal if there are many inserts into a small number of rowIDs.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_multiple_ingester_clients">8.2. Multiple Ingester Clients</h3> |
| <div class="paragraph"><p>Accumulo is capable of scaling to very high rates of ingest, which is dependent upon |
| not just the number of TabletServers in operation but also the number of ingest |
| clients. This is because a single client, while capable of batching mutations and |
| sending them to all TabletServers, is ultimately limited by the amount of data that |
| can be processed on a single machine. The aggregate ingest rate will scale linearly |
| with the number of clients up to the point at which either the aggregate I/O of |
| TabletServers or total network bandwidth capacity is reached.</p></div> |
| <div class="paragraph"><p>In operational settings where high rates of ingest are paramount, clusters are often |
| configured to dedicate some number of machines solely to running Ingester Clients. |
| The exact ratio of clients to TabletServers necessary for optimum ingestion rates |
| will vary according to the distribution of resources per machine and by data type.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_bulk_ingest">8.3. Bulk Ingest</h3> |
| <div class="paragraph"><p>Accumulo supports the ability to import files produced by an external process such |
| as MapReduce into an existing table. In some cases it may be faster to load data this |
| way rather than via ingesting through clients using BatchWriters. This allows a large |
| number of machines to format data the way Accumulo expects. The new files can |
| then simply be introduced to Accumulo via a shell command.</p></div> |
| <div class="paragraph"><p>To configure MapReduce to format data in preparation for bulk loading, the job |
| should be set to use a range partitioner instead of the default hash partitioner. The |
| range partitioner uses the split points of the Accumulo table that will receive the |
| data. The split points can be obtained from the shell and used by the MapReduce |
| RangePartitioner. Note that this is only useful if the existing table is already split |
| into multiple tablets.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> getsplits |
| aa |
| ab |
| ac |
| ... |
| zx |
| zy |
| zz</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Run the MapReduce job, using the AccumuloFileOutputFormat to create the files to |
| be introduced to Accumulo. Once this is complete, the files can be added to |
| Accumulo via the shell:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>user@myinstance mytable> importdirectory /files_dir /failures</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Note that the paths referenced are directories within the same HDFS instance over |
| which Accumulo is running. Accumulo places any files that failed to be added to the |
| second directory specified.</p></div> |
| <div class="paragraph"><p>A complete example of using Bulk Ingest can be found at |
| <code>accumulo/docs/examples/README.bulkIngest</code></p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_logical_time_for_bulk_ingest">8.4. Logical Time for Bulk Ingest</h3> |
| <div class="paragraph"><p>Logical time is important for bulk imported data, for which the client code may |
| be choosing a timestamp. At bulk import time, the user can choose to enable |
| logical time for the set of files being imported. When its enabled, Accumulo |
| uses a specialized system iterator to lazily set times in a bulk imported file. |
| This mechanism guarantees that times set by unsynchronized multi-node |
| applications (such as those running on MapReduce) will maintain some semblance |
| of causal ordering. This mitigates the problem of the time being wrong on the |
| system that created the file for bulk import. These times are not set when the |
| file is imported, but whenever it is read by scans or compactions. At import, a |
| time is obtained and always used by the specialized system iterator to set that |
| time.</p></div> |
| <div class="paragraph"><p>The timestamp assigned by Accumulo will be the same for every key in the file. |
| This could cause problems if the file contains multiple keys that are identical |
| except for the timestamp. In this case, the sort order of the keys will be |
| undefined. This could occur if an insert and an update were in the same bulk |
| import file.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_mapreduce_ingest">8.5. MapReduce Ingest</h3> |
| <div class="paragraph"><p>It is possible to efficiently write many mutations to Accumulo in parallel via a |
| MapReduce job. In this scenario the MapReduce is written to process data that lives |
| in HDFS and write mutations to Accumulo using the AccumuloOutputFormat. See |
| the MapReduce section under Analytics for details.</p></div> |
| <div class="paragraph"><p>An example of using MapReduce can be found under |
| <code>accumulo/docs/examples/README.mapred</code></p></div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_analytics">9. Analytics</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Accumulo supports more advanced data processing than simply keeping keys |
| sorted and performing efficient lookups. Analytics can be developed by using |
| MapReduce and Iterators in conjunction with Accumulo tables.</p></div> |
| <div class="sect2"> |
| <h3 id="_mapreduce">9.1. MapReduce</h3> |
| <div class="paragraph"><p>Accumulo tables can be used as the source and destination of MapReduce jobs. To |
| use an Accumulo table with a MapReduce job (specifically with the new Hadoop API |
| as of version 0.20), configure the job parameters to use the AccumuloInputFormat |
| and AccumuloOutputFormat. Accumulo specific parameters can be set via these |
| two format classes to do the following:</p></div> |
| <div class="ulist"><ul> |
| <li> |
| <p> |
| Authenticate and provide user credentials for the input |
| </p> |
| </li> |
| <li> |
| <p> |
| Restrict the scan to a range of rows |
| </p> |
| </li> |
| <li> |
| <p> |
| Restrict the input to a subset of available columns |
| </p> |
| </li> |
| </ul></div> |
| <div class="sect3"> |
| <h4 id="_mapper_and_reducer_classes">9.1.1. Mapper and Reducer classes</h4> |
| <div class="paragraph"><p>To read from an Accumulo table create a Mapper with the following class |
| parameterization and be sure to configure the AccumuloInputFormat.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #0000b3">class</span> MyMapper <span style="color: #0000b3">extends</span> Mapper<Key,Value,WritableComparable,Writable> { |
| <span style="color: #0000b3">public</span> <span style="color: #000000">void</span> <span style="font-weight: bold"><span style="color: #000000">map</span></span>(<span style="color: #000000">Key</span> k, <span style="color: #000000">Value</span> v, <span style="color: #000000">Context</span> c) { |
| <span style="font-style: italic"><span style="color: #b30000">// transform key and value data here</span></span> |
| } |
| }</tt></pre></div></div> |
| <div class="paragraph"><p>To write to an Accumulo table, create a Reducer with the following class |
| parameterization and be sure to configure the AccumuloOutputFormat. The key |
| emitted from the Reducer identifies the table to which the mutation is sent. This |
| allows a single Reducer to write to more than one table if desired. A default table |
| can be configured using the AccumuloOutputFormat, in which case the output table |
| name does not have to be passed to the Context object within the Reducer.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #0000b3">class</span> MyReducer <span style="color: #0000b3">extends</span> Reducer<WritableComparable, Writable, Text, Mutation> { |
| <span style="color: #0000b3">public</span> <span style="color: #000000">void</span> <span style="font-weight: bold"><span style="color: #000000">reduce</span></span>(<span style="color: #000000">WritableComparable</span> key, <span style="color: #000000">Iterable<Text></span> values, <span style="color: #000000">Context</span> c) { |
| <span style="color: #000000">Mutation</span> m; |
| <span style="font-style: italic"><span style="color: #b30000">// create the mutation based on input key and value</span></span> |
| c.<span style="font-weight: bold"><span style="color: #000000">write</span></span>(<span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"output-table"</span>), m); |
| } |
| }</tt></pre></div></div> |
| <div class="paragraph"><p>The Text object passed as the output should contain the name of the table to which |
| this mutation should be applied. The Text can be null in which case the mutation |
| will be applied to the default table name specified in the AccumuloOutputFormat |
| options.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_accumuloinputformat_options">9.1.2. AccumuloInputFormat options</h4> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Job</span> job = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Job</span></span>(<span style="font-weight: bold"><span style="color: #000000">getConf</span></span>()); |
| |
| AccumuloInputFormat.<span style="font-weight: bold"><span style="color: #000000">setInputInfo</span></span>(job, |
| <span style="color: #4c73a6">"user"</span>, |
| <span style="color: #4c73a6">"passwd"</span>.<span style="font-weight: bold"><span style="color: #000000">getBytes</span></span>(), |
| <span style="color: #4c73a6">"table"</span>, |
| <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Authorizations</span></span>()); |
| |
| AccumuloInputFormat.<span style="font-weight: bold"><span style="color: #000000">setZooKeeperInstance</span></span>(job, <span style="color: #4c73a6">"myinstance"</span>, |
| <span style="color: #4c73a6">"zooserver-one,zooserver-two"</span>);</tt></pre></div></div> |
| <div class="paragraph"><p><strong>Optional settings:</strong></p></div> |
| <div class="paragraph"><p>To restrict Accumulo to a set of row ranges:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">ArrayList<Range></span> ranges = <span style="color: #0000b3">new</span> ArrayList<Range>(); |
| <span style="font-style: italic"><span style="color: #b30000">// populate array list of row ranges ...</span></span> |
| AccumuloInputFormat.<span style="font-weight: bold"><span style="color: #000000">setRanges</span></span>(job, ranges);</tt></pre></div></div> |
| <div class="paragraph"><p>To restrict Accumulo to a list of columns:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">ArrayList<Pair<Text,Text>></span> columns = <span style="color: #0000b3">new</span> ArrayList<Pair<Text,Text>>(); |
| <span style="font-style: italic"><span style="color: #b30000">// populate list of columns</span></span> |
| AccumuloInputFormat.<span style="font-weight: bold"><span style="color: #000000">fetchColumns</span></span>(job, columns);</tt></pre></div></div> |
| <div class="paragraph"><p>To use a regular expression to match row IDs:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>AccumuloInputFormat.<span style="font-weight: bold"><span style="color: #000000">setRegex</span></span>(job, RegexType.ROW, <span style="color: #4c73a6">"^.*"</span>);</tt></pre></div></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_accumulooutputformat_options">9.1.3. AccumuloOutputFormat options</h4> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">boolean</span> createTables = <span style="color: #0000b3">true</span>; |
| <span style="color: #000000">String</span> defaultTable = <span style="color: #4c73a6">"mytable"</span>; |
| |
| AccumuloOutputFormat.<span style="font-weight: bold"><span style="color: #000000">setOutputInfo</span></span>(job, |
| <span style="color: #4c73a6">"user"</span>, |
| <span style="color: #4c73a6">"passwd"</span>.<span style="font-weight: bold"><span style="color: #000000">getBytes</span></span>(), |
| createTables, |
| defaultTable); |
| |
| AccumuloOutputFormat.<span style="font-weight: bold"><span style="color: #000000">setZooKeeperInstance</span></span>(job, <span style="color: #4c73a6">"myinstance"</span>, |
| <span style="color: #4c73a6">"zooserver-one,zooserver-two"</span>);</tt></pre></div></div> |
| <div class="paragraph"><p><strong>Optional Settings:</strong></p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>AccumuloOutputFormat.<span style="font-weight: bold"><span style="color: #000000">setMaxLatency</span></span>(job, <span style="color: #000000">300000</span>); <span style="font-style: italic"><span style="color: #b30000">// milliseconds</span></span> |
| AccumuloOutputFormat.<span style="font-weight: bold"><span style="color: #000000">setMaxMutationBufferSize</span></span>(job, <span style="color: #000000">50000000</span>); <span style="font-style: italic"><span style="color: #b30000">// bytes</span></span></tt></pre></div></div> |
| <div class="paragraph"><p>An example of using MapReduce with Accumulo can be found at |
| <code>accumulo/docs/examples/README.mapred</code></p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_combiners_2">9.2. Combiners</h3> |
| <div class="paragraph"><p>Many applications can benefit from the ability to aggregate values across common |
| keys. This can be done via Combiner iterators and is similar to the Reduce step in |
| MapReduce. This provides the ability to define online, incrementally updated |
| analytics without the overhead or latency associated with batch-oriented |
| MapReduce jobs.</p></div> |
| <div class="paragraph"><p>All that is needed to aggregate values of a table is to identify the fields over which |
| values will be grouped, insert mutations with those fields as the key, and configure |
| the table with a combining iterator that supports the summarizing operation |
| desired.</p></div> |
| <div class="paragraph"><p>The only restriction on an combining iterator is that the combiner developer |
| should not assume that all values for a given key have been seen, since new |
| mutations can be inserted at any time. This precludes using the total number of |
| values in the aggregation such as when calculating an average, for example.</p></div> |
| <div class="sect3"> |
| <h4 id="_feature_vectors">9.2.1. Feature Vectors</h4> |
| <div class="paragraph"><p>An interesting use of combining iterators within an Accumulo table is to store |
| feature vectors for use in machine learning algorithms. For example, many |
| algorithms such as k-means clustering, support vector machines, anomaly detection, |
| etc. use the concept of a feature vector and the calculation of distance metrics to |
| learn a particular model. The columns in an Accumulo table can be used to efficiently |
| store sparse features and their weights to be incrementally updated via the use of an |
| combining iterator.</p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_statistical_modeling">9.3. Statistical Modeling</h3> |
| <div class="paragraph"><p>Statistical models that need to be updated by many machines in parallel could be |
| similarly stored within an Accumulo table. For example, a MapReduce job that is |
| iteratively updating a global statistical model could have each map or reduce worker |
| reference the parts of the model to be read and updated through an embedded |
| Accumulo client.</p></div> |
| <div class="paragraph"><p>Using Accumulo this way enables efficient and fast lookups and updates of small |
| pieces of information in a random access pattern, which is complementary to |
| MapReduce’s sequential access model.</p></div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_security">10. Security</h2> |
| <div class="sectionbody"> |
| <div class="paragraph"><p>Accumulo extends the BigTable data model to implement a security mechanism |
| known as cell-level security. Every key-value pair has its own security label, stored |
| under the column visibility element of the key, which is used to determine whether |
| a given user meets the security requirements to read the value. This enables data of |
| various security levels to be stored within the same row, and users of varying |
| degrees of access to query the same table, while preserving data confidentiality.</p></div> |
| <div class="sect2"> |
| <h3 id="_security_label_expressions">10.1. Security Label Expressions</h3> |
| <div class="paragraph"><p>When mutations are applied, users can specify a security label for each value. This is |
| done as the Mutation is created by passing a ColumnVisibility object to the put() |
| method:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Text</span> rowID = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"row1"</span>); |
| <span style="color: #000000">Text</span> colFam = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"myColFam"</span>); |
| <span style="color: #000000">Text</span> colQual = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Text</span></span>(<span style="color: #4c73a6">"myColQual"</span>); |
| <span style="color: #000000">ColumnVisibility</span> colVis = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">ColumnVisibility</span></span>(<span style="color: #4c73a6">"public"</span>); |
| <span style="color: #000000">long</span> timestamp = System.<span style="font-weight: bold"><span style="color: #000000">currentTimeMillis</span></span>(); |
| |
| <span style="color: #000000">Value</span> value = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Value</span></span>(<span style="color: #4c73a6">"myValue"</span>); |
| |
| <span style="color: #000000">Mutation</span> mutation = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Mutation</span></span>(rowID); |
| mutation.<span style="font-weight: bold"><span style="color: #000000">put</span></span>(colFam, colQual, colVis, timestamp, value);</tt></pre></div></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_security_label_expression_syntax">10.2. Security Label Expression Syntax</h3> |
| <div class="paragraph"><p>Security labels consist of a set of user-defined tokens that are required to read the |
| value the label is associated with. The set of tokens required can be specified using |
| syntax that supports logical AND and OR combinations of tokens, as well as nesting |
| groups of tokens together.</p></div> |
| <div class="paragraph"><p>For example, suppose within our organization we want to label our data values with |
| security labels defined in terms of user roles. We might have tokens such as:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>admin |
| audit |
| system</code></pre> |
| </div></div> |
| <div class="paragraph"><p>These can be specified alone or combined using logical operators:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>// Users must have admin privileges: |
| admin</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>// Users must have admin and audit privileges |
| admin&audit</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>// Users with either admin or audit privileges |
| admin|audit</code></pre> |
| </div></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>// Users must have audit and one or both of admin or system |
| (admin|system)&audit</code></pre> |
| </div></div> |
| <div class="paragraph"><p>When both <code>|</code> and <code>&</code> operators are used, parentheses must be used to specify |
| precedence of the operators.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_authorization">10.3. Authorization</h3> |
| <div class="paragraph"><p>When clients attempt to read data from Accumulo, any security labels present are |
| examined against the set of authorizations passed by the client code when the |
| Scanner or BatchScanner are created. If the authorizations are determined to be |
| insufficient to satisfy the security label, the value is suppressed from the set of |
| results sent back to the client.</p></div> |
| <div class="paragraph"><p>Authorizations are specified as a comma-separated list of tokens the user possesses:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="font-style: italic"><span style="color: #b30000">// user possesses both admin and system level access</span></span> |
| <span style="color: #000000">Authorization</span> auths = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">Authorization</span></span>(<span style="color: #4c73a6">"admin"</span>,<span style="color: #4c73a6">"system"</span>); |
| |
| <span style="color: #000000">Scanner</span> s = connector.<span style="font-weight: bold"><span style="color: #000000">createScanner</span></span>(<span style="color: #4c73a6">"table"</span>, auths);</tt></pre></div></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_user_authorizations">10.4. User Authorizations</h3> |
| <div class="paragraph"><p>Each Accumulo user has a set of associated security labels. To manipulate |
| these in the shell while using the default authorizor, use the setuaths and getauths commands. |
| These may also be modified for the default authorizor using the java security operations API.</p></div> |
| <div class="paragraph"><p>When a user creates a scanner a set of Authorizations is passed. If the |
| authorizations passed to the scanner are not a subset of the users |
| authorizations, then an exception will be thrown.</p></div> |
| <div class="paragraph"><p>To prevent users from writing data they can not read, add the visibility |
| constraint to a table. Use the -evc option in the createtable shell command to |
| enable this constraint. For existing tables use the following shell command to |
| enable the visibility constraint. Ensure the constraint number does not |
| conflict with any existing constraints.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>config -t table -s table.constraint.1=org.apache.accumulo.core.security.VisibilityConstraint</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Any user with the alter table permission can add or remove this constraint. |
| This constraint is not applied to bulk imported data, if this a concern then |
| disable the bulk import permission.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_pluggable_security">10.5. Pluggable Security</h3> |
| <div class="paragraph"><p>New in 1.5 of Accumulo is a pluggable security mechanism. It can be broken into three actions — authentication, authorization, and permission handling. By default all of these are handled in |
| Zookeeper, which is how things were handled in Accumulo 1.4 and before. It is worth noting at this |
| point, that it is a new feature in 1.5 and may be adjusted in future releases without the standard |
| deprecation cycle.</p></div> |
| <div class="paragraph"><p>Authentication simply handles the ability for a user to verify their integrity. A combination of |
| principal and authentication token are used to verify a user is who they say they are. An |
| authentication token should be constructed, either directly through its constructor, but it is |
| advised to use the init(Property) method to populate an authentication token. It is expected that a |
| user knows what the appropriate token to use for their system is. The default token is |
| <code>PasswordToken</code>.</p></div> |
| <div class="paragraph"><p>Once a user is authenticated by the Authenticator, the user has access to the other actions within |
| Accumulo. All actions in Accumulo are ACLed, and this ACL check is handled by the Permission |
| Handler. This is what manages all of the permissions, which are divided in system and per table |
| level. From there, if a user is doing an action which requires authorizations, the Authorizor is |
| queried to determine what authorizations the user has.</p></div> |
| <div class="paragraph"><p>This setup allows a variety of different mechanisms to be used for handling different aspects of |
| Accumulo’s security. A system like Kerberos can be used for authentication, then a system like LDAP |
| could be used to determine if a user has a specific permission, and then it may default back to the |
| default ZookeeperAuthorizor to determine what Authorizations a user is ultimately allowed to use. |
| This is a pluggable system so custom components can be created depending on your need.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_secure_authorizations_handling">10.6. Secure Authorizations Handling</h3> |
| <div class="paragraph"><p>For applications serving many users, it is not expected that an Accumulo user |
| will be created for each application user. In this case an Accumulo user with |
| all authorizations needed by any of the applications users must be created. To |
| service queries, the application should create a scanner with the application |
| user’s authorizations. These authorizations could be obtained from a trusted 3rd |
| party.</p></div> |
| <div class="paragraph"><p>Often production systems will integrate with Public-Key Infrastructure (PKI) and |
| designate client code within the query layer to negotiate with PKI servers in order |
| to authenticate users and retrieve their authorization tokens (credentials). This |
| requires users to specify only the information necessary to authenticate themselves |
| to the system. Once user identity is established, their credentials can be accessed by |
| the client code and passed to Accumulo outside of the reach of the user.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_query_services_layer">10.7. Query Services Layer</h3> |
| <div class="paragraph"><p>Since the primary method of interaction with Accumulo is through the Java API, |
| production environments often call for the implementation of a Query layer. This |
| can be done using web services in containers such as Apache Tomcat, but is not a |
| requirement. The Query Services Layer provides a mechanism for providing a |
| platform on which user facing applications can be built. This allows the application |
| designers to isolate potentially complex query logic, and enables a convenient point |
| at which to perform essential security functions.</p></div> |
| <div class="paragraph"><p>Several production environments choose to implement authentication at this layer, |
| where users identifiers are used to retrieve their access credentials which are then |
| cached within the query layer and presented to Accumulo through the |
| Authorizations mechanism.</p></div> |
| <div class="paragraph"><p>Typically, the query services layer sits between Accumulo and user workstations.</p></div> |
| </div> |
| </div> |
| </div> |
| <div class="sect1"> |
| <h2 id="_administration">11. Administration</h2> |
| <div class="sectionbody"> |
| <div class="sect2"> |
| <h3 id="_hardware">11.1. Hardware</h3> |
| <div class="paragraph"><p>Because we are running essentially two or three systems simultaneously layered |
| across the cluster: HDFS, Accumulo and MapReduce, it is typical for hardware to |
| consist of 4 to 8 cores, and 8 to 32 GB RAM. This is so each running process can have |
| at least one core and 2 - 4 GB each.</p></div> |
| <div class="paragraph"><p>One core running HDFS can typically keep 2 to 4 disks busy, so each machine may |
| typically have as little as 2 x 300GB disks and as much as 4 x 1TB or 2TB disks.</p></div> |
| <div class="paragraph"><p>It is possible to do with less than this, such as with 1u servers with 2 cores and 4GB |
| each, but in this case it is recommended to only run up to two processes per |
| machine — i.e. DataNode and TabletServer or DataNode and MapReduce worker but |
| not all three. The constraint here is having enough available heap space for all the |
| processes on a machine.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_network">11.2. Network</h3> |
| <div class="paragraph"><p>Accumulo communicates via remote procedure calls over TCP/IP for both passing |
| data and control messages. In addition, Accumulo uses HDFS clients to |
| communicate with HDFS. To achieve good ingest and query performance, sufficient |
| network bandwidth must be available between any two machines.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_installation">11.3. Installation</h3> |
| <div class="paragraph"><p>Choose a directory for the Accumulo installation. This directory will be referenced |
| by the environment variable <code>$ACCUMULO_HOME</code>. Run the following:</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ tar xzf accumulo-1.5.0-bin.tar.gz # unpack to subdirectory |
| $ mv accumulo-1.5.0 $ACCUMULO_HOME # move to desired location</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Repeat this step at each machine within the cluster. Usually all machines have the |
| same <code>$ACCUMULO_HOME</code>.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_dependencies">11.4. Dependencies</h3> |
| <div class="paragraph"><p>Accumulo requires HDFS and ZooKeeper to be configured and running |
| before starting. Password-less SSH should be configured between at least the |
| Accumulo master and TabletServer machines. It is also a good idea to run Network |
| Time Protocol (NTP) within the cluster to ensure nodes' clocks don’t get too out of |
| sync, which can cause problems with automatically timestamped data.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_configuration_2">11.5. Configuration</h3> |
| <div class="paragraph"><p>Accumulo is configured by editing several Shell and XML files found in |
| <code>$ACCUMULO_HOME/conf</code>. The structure closely resembles Hadoop’s configuration |
| files.</p></div> |
| <div class="sect3"> |
| <h4 id="_edit_conf_accumulo_env_sh">11.5.1. Edit conf/accumulo-env.sh</h4> |
| <div class="paragraph"><p>Accumulo needs to know where to find the software it depends on. Edit accumulo-env.sh |
| and specify the following:</p></div> |
| <div class="olist arabic"><ol class="arabic"> |
| <li> |
| <p> |
| Enter the location of the installation directory of Accumulo for <code>$ACCUMULO_HOME</code> |
| </p> |
| </li> |
| <li> |
| <p> |
| Enter your system’s Java home for <code>$JAVA_HOME</code> |
| </p> |
| </li> |
| <li> |
| <p> |
| Enter the location of Hadoop for <code>$HADOOP_PREFIX</code> |
| </p> |
| </li> |
| <li> |
| <p> |
| Choose a location for Accumulo logs and enter it for <code>$ACCUMULO_LOG_DIR</code> |
| </p> |
| </li> |
| <li> |
| <p> |
| Enter the location of ZooKeeper for <code>$ZOOKEEPER_HOME</code> |
| </p> |
| </li> |
| </ol></div> |
| <div class="paragraph"><p>By default Accumulo TabletServers are set to use 1GB of memory. You may change |
| this by altering the value of <code>$ACCUMULO_TSERVER_OPTS</code>. Note the syntax is that of |
| the Java JVM command line options. This value should be less than the physical |
| memory of the machines running TabletServers.</p></div> |
| <div class="paragraph"><p>There are similar options for the master’s memory usage and the garbage collector |
| process. Reduce these if they exceed the physical RAM of your hardware and |
| increase them, within the bounds of the physical RAM, if a process fails because of |
| insufficient memory.</p></div> |
| <div class="paragraph"><p>Note that you will be specifying the Java heap space in accumulo-env.sh. You should |
| make sure that the total heap space used for the Accumulo tserver and the Hadoop |
| DataNode and TaskTracker is less than the available memory on each slave node in |
| the cluster. On large clusters, it is recommended that the Accumulo master, Hadoop |
| NameNode, secondary NameNode, and Hadoop JobTracker all be run on separate |
| machines to allow them to use more heap space. If you are running these on the |
| same machine on a small cluster, likewise make sure their heap space settings fit |
| within the available memory.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_native_map">11.5.2. Native Map</h4> |
| <div class="paragraph"><p>The tablet server uses a data structure called a MemTable to store sorted key/value |
| pairs in memory when they are first received from the client. When a minor compaction |
| occurs, this data structure is written to HDFS. The MemTable will default to using |
| memory in the JVM but a JNI version, called the native map, can be used to significantly |
| speed up performance by utilizing the memory space of the native operating system. The |
| native map also avoids the performance implications brought on by garbage collection |
| in the JVM by causing it to pause much less frequently.</p></div> |
| <div class="paragraph"><p>32-bit and 64-bit Linux versions of the native map ship with the Accumulo dist package. |
| For other operating systems, the native map can be built from the codebase in two ways- |
| from maven or from the Makefile.</p></div> |
| <div class="olist arabic"><ol class="arabic"> |
| <li> |
| <p> |
| Build from maven using the following command: <code>mvn clean package -Pnative</code>. |
| </p> |
| </li> |
| <li> |
| <p> |
| Build from the c++ source by running <code>make</code> in the <code>$ACCUMULO_HOME/server/src/main/c++</code> directory. |
| </p> |
| </li> |
| </ol></div> |
| <div class="paragraph"><p>After building the native map from the source, you will find the artifact in |
| <code>$ACCUMULO_HOME/lib/native</code>. Upon starting up, the tablet server will look |
| in this directory for the map library. If the file is renamed or moved from its |
| target directory, the tablet server may not be able to find it.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_cluster_specification">11.5.3. Cluster Specification</h4> |
| <div class="paragraph"><p>On the machine that will serve as the Accumulo master:</p></div> |
| <div class="olist arabic"><ol class="arabic"> |
| <li> |
| <p> |
| Write the IP address or domain name of the Accumulo Master to the <code>$ACCUMULO_HOME/conf/masters</code> file. |
| </p> |
| </li> |
| <li> |
| <p> |
| Write the IP addresses or domain name of the machines that will be TabletServers in <code>$ACCUMULO_HOME/conf/slaves</code>, one per line. |
| </p> |
| </li> |
| </ol></div> |
| <div class="paragraph"><p>Note that if using domain names rather than IP addresses, DNS must be configured |
| properly for all machines participating in the cluster. DNS can be a confusing source |
| of errors.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_accumulo_settings">11.5.4. Accumulo Settings</h4> |
| <div class="paragraph"><p>Specify appropriate values for the following settings in <code>$ACCUMULO_HOME/conf/accumulo-site.xml</code>:</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #0000b3"><property></span> |
| <span style="color: #0000b3"><name></span>instance.zookeeper.host<span style="color: #0000b3"></name></span> |
| <span style="color: #0000b3"><value></span>zooserver-one:2181,zooserver-two:2181<span style="color: #0000b3"></value></span> |
| <span style="color: #0000b3"><description></span>list of zookeeper servers<span style="color: #0000b3"></description></span> |
| <span style="color: #0000b3"></property></span></tt></pre></div></div> |
| <div class="paragraph"><p>This enables Accumulo to find ZooKeeper. Accumulo uses ZooKeeper to coordinate |
| settings between processes and helps finalize TabletServer failure.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #0000b3"><property></span> |
| <span style="color: #0000b3"><name></span>instance.secret<span style="color: #0000b3"></name></span> |
| <span style="color: #0000b3"><value></span>DEFAULT<span style="color: #0000b3"></value></span> |
| <span style="color: #0000b3"></property></span></tt></pre></div></div> |
| <div class="paragraph"><p>The instance needs a secret to enable secure communication between servers. Configure your |
| secret and make sure that the <code>accumulo-site.xml</code> file is not readable to other users.</p></div> |
| <div class="paragraph"><p>Some settings can be modified via the Accumulo shell and take effect immediately, but |
| some settings require a process restart to take effect. See the configuration documentation |
| (available on the monitor web pages) for details.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_deploy_configuration">11.5.5. Deploy Configuration</h4> |
| <div class="paragraph"><p>Copy the masters, slaves, accumulo-env.sh, and if necessary, accumulo-site.xml |
| from the <code>$ACCUMULO_HOME/conf/</code> directory on the master to all the machines |
| specified in the slaves file.</p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_initialization">11.6. Initialization</h3> |
| <div class="paragraph"><p>Accumulo must be initialized to create the structures it uses internally to locate |
| data across the cluster. HDFS is required to be configured and running before |
| Accumulo can be initialized.</p></div> |
| <div class="paragraph"><p>Once HDFS is started, initialization can be performed by executing |
| <code>$ACCUMULO_HOME/bin/accumulo init</code> . This script will prompt for a name |
| for this instance of Accumulo. The instance name is used to identify a set of tables |
| and instance-specific settings. The script will then write some information into |
| HDFS so Accumulo can start properly.</p></div> |
| <div class="paragraph"><p>The initialization script will prompt you to set a root password. Once Accumulo is |
| initialized it can be started.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_running">11.7. Running</h3> |
| <div class="sect3"> |
| <h4 id="_starting_accumulo">11.7.1. Starting Accumulo</h4> |
| <div class="paragraph"><p>Make sure Hadoop is configured on all of the machines in the cluster, including |
| access to a shared HDFS instance. Make sure HDFS and ZooKeeper are running. |
| Make sure ZooKeeper is configured and running on at least one machine in the |
| cluster. |
| Start Accumulo using the <code>bin/start-all.sh</code> script.</p></div> |
| <div class="paragraph"><p>To verify that Accumulo is running, check the Status page as described under |
| <em>Monitoring</em>. In addition, the Shell can provide some information about the status of |
| tables via reading the !METADATA table.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_stopping_accumulo">11.7.2. Stopping Accumulo</h4> |
| <div class="paragraph"><p>To shutdown cleanly, run <code>bin/stop-all.sh</code> and the master will orchestrate the |
| shutdown of all the tablet servers. Shutdown waits for all minor compactions to finish, so it may |
| take some time for particular configurations.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_adding_a_node">11.7.3. Adding a Node</h4> |
| <div class="paragraph"><p>Update your <code>$ACCUMULO_HOME/conf/slaves</code> (or <code>$ACCUMULO_CONF_DIR/slaves</code>) file to account for the addition.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ACCUMULO_HOME/bin/accumulo admin start <host(s)> {<host> ...}</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Alternatively, you can ssh to each of the hosts you want to add and run |
| <code>$ACCUMULO_HOME/bin/start-here.sh</code>.</p></div> |
| <div class="paragraph"><p>Make sure the host in question has the new configuration, or else the tablet |
| server won’t start; at a minimum this needs to be on the host(s) being added, |
| but in practice it’s good to ensure consistent configuration across all nodes.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_decomissioning_a_node">11.7.4. Decomissioning a Node</h4> |
| <div class="paragraph"><p>If you need to take a node out of operation, you can trigger a graceful shutdown of a tablet |
| server. Accumulo will automatically rebalance the tablets across the available tablet servers.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>$ACCUMULO_HOME/bin/accumulo admin stop <host(s)> {<host> ...}</code></pre> |
| </div></div> |
| <div class="paragraph"><p>Alternatively, you can ssh to each of the hosts you want to remove and run |
| <code>$ACCUMULO_HOME/bin/stop-here.sh</code>.</p></div> |
| <div class="paragraph"><p>Be sure to update your <code>$ACCUMULO_HOME/conf/slaves</code> (or <code>$ACCUMULO_CONF_DIR/slaves</code>) file to |
| account for the removal of these hosts. Bear in mind that the monitor will not re-read the |
| slaves file automatically, so it will report the decomissioned servers as down; it’s |
| recommended that you restart the monitor so that the node list is up to date.</p></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_monitoring">11.8. Monitoring</h3> |
| <div class="paragraph"><p>The Accumulo Master provides an interface for monitoring the status and health of |
| Accumulo components. This interface can be accessed by pointing a web browser to |
| <code>http://accumulomaster:50095/status</code></p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_tracing">11.9. Tracing</h3> |
| <div class="paragraph"><p>It can be difficult to determine why some operations are taking longer |
| than expected. For example, you may be looking up items with very low |
| latency, but sometimes the lookups take much longer. Determining the |
| cause of the delay is difficult because the system is distributed, and |
| the typical lookup is fast.</p></div> |
| <div class="paragraph"><p>Accumulo has been instrumented to record the time that various |
| operations take when tracing is turned on. The fact that tracing is |
| enabled follows all the requests made on behalf of the user throughout |
| the distributed infrastructure of accumulo, and across all threads of |
| execution.</p></div> |
| <div class="paragraph"><p>These time spans will be inserted into the <code>trace</code> table in |
| Accumulo. You can browse recent traces from the Accumulo monitor |
| page. You can also read the <code>trace</code> table directly like any |
| other table.</p></div> |
| <div class="paragraph"><p>The design of Accumulo’s distributed tracing follows that of |
| <a href="http://research.google.com/pubs/pub36356.html">Google’s Dapper</a>.</p></div> |
| <div class="sect3"> |
| <h4 id="_tracers">11.9.1. Tracers</h4> |
| <div class="paragraph"><p>To collect traces, Accumulo needs at least one server listed in |
| <code>$ACCUMULO_HOME/conf/tracers</code>. The server collects traces |
| from clients and writes them to the <code>trace</code> table. The Accumulo |
| user that the tracer connects to Accumulo with can be configured with |
| the following properties</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>trace.user |
| trace.token.property.password</code></pre> |
| </div></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_instrumenting_a_client">11.9.2. Instrumenting a Client</h4> |
| <div class="paragraph"><p>Tracing can be used to measure a client operation, such as a scan, as |
| the operation traverses the distributed system. To enable tracing for |
| your application call</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>DistributedTrace.<span style="font-weight: bold"><span style="color: #000000">enable</span></span>(instance, <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">ZooReader</span></span>(instance), hostname, <span style="color: #4c73a6">"myApplication"</span>);</tt></pre></div></div> |
| <div class="paragraph"><p>Once tracing has been enabled, a client can wrap an operation in a trace.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>Trace.<span style="font-weight: bold"><span style="color: #000000">on</span></span>(<span style="color: #4c73a6">"Client Scan"</span>); |
| <span style="color: #000000">BatchScanner</span> scanner = conn.<span style="font-weight: bold"><span style="color: #000000">createBatchScanner</span></span>(...); |
| <span style="font-style: italic"><span style="color: #b30000">// Configure your scanner</span></span> |
| <span style="color: #0000b3">for</span> (<span style="color: #000000">Entry</span> entry : scanner) { |
| } |
| Trace.<span style="font-weight: bold"><span style="color: #000000">off</span></span>();</tt></pre></div></div> |
| <div class="paragraph"><p>Additionally, the user can create additional Spans within a Trace.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>Trace.<span style="font-weight: bold"><span style="color: #000000">on</span></span>(<span style="color: #4c73a6">"Client Update"</span>); |
| ... |
| <span style="color: #000000">Span</span> readSpan = Trace.<span style="font-weight: bold"><span style="color: #000000">start</span></span>(<span style="color: #4c73a6">"Read"</span>); |
| ... |
| readSpan.<span style="font-weight: bold"><span style="color: #000000">stop</span></span>(); |
| ... |
| <span style="color: #000000">Span</span> writeSpan = Trace.<span style="font-weight: bold"><span style="color: #000000">start</span></span>(<span style="color: #4c73a6">"Write"</span>); |
| ... |
| writeSpan.<span style="font-weight: bold"><span style="color: #000000">stop</span></span>(); |
| Trace.<span style="font-weight: bold"><span style="color: #000000">off</span></span>();</tt></pre></div></div> |
| <div class="paragraph"><p>Like Dapper, Accumulo tracing supports user defined annotations to associate additional data with a Trace.</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt>... |
| <span style="color: #000000">int</span> numberOfEntriesRead = <span style="color: #000000">0</span>; |
| <span style="color: #000000">Span</span> readSpan = Trace.<span style="font-weight: bold"><span style="color: #000000">start</span></span>(<span style="color: #4c73a6">"Read"</span>); |
| <span style="font-style: italic"><span style="color: #b30000">// Do the read, update the counter</span></span> |
| ... |
| readSpan.<span style="font-weight: bold"><span style="color: #000000">data</span></span>(<span style="color: #4c73a6">"Number of Entries Read"</span>, String.<span style="font-weight: bold"><span style="color: #000000">valueOf</span></span>(numberOfEntriesRead));</tt></pre></div></div> |
| <div class="paragraph"><p>Some client operations may have a high volume within your |
| application. As such, you may wish to only sample a percentage of |
| operations for tracing. As seen below, the CountSampler can be used to |
| help enable tracing for 1-in-1000 operations</p></div> |
| <div class="listingblock"> |
| <div class="content"><!-- Generator: GNU source-highlight 3.1.7 |
| by Lorenzo Bettini |
| http://www.lorenzobettini.it |
| http://www.gnu.org/software/src-highlite --> |
| <pre><tt><span style="color: #000000">Sampler</span> sampler = <span style="color: #0000b3">new</span> <span style="font-weight: bold"><span style="color: #000000">CountSampler</span></span>(<span style="color: #000000">1000</span>); |
| ... |
| <span style="color: #0000b3">if</span> (sampler.<span style="font-weight: bold"><span style="color: #000000">next</span></span>()) { |
| Trace.<span style="font-weight: bold"><span style="color: #000000">on</span></span>(<span style="color: #4c73a6">"Read"</span>); |
| } |
| ... |
| Trace.<span style="font-weight: bold"><span style="color: #000000">offNoFlush</span></span>();</tt></pre></div></div> |
| <div class="paragraph"><p>It should be noted that it is safe to turn off tracing even if it |
| isn’t currently active. The <code>Trace.offNoFlush()</code> should be used if the |
| user does not wish to have <code>Trace.off()</code> block while flushing trace |
| data.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_viewing_collected_traces">11.9.3. Viewing Collected Traces</h4> |
| <div class="paragraph"><p>To view collected traces, use the "Recent Traces" link on the Monitor |
| UI. You can also programmatically access and print traces using the |
| <code>TraceDump</code> class.</p></div> |
| </div> |
| <div class="sect3"> |
| <h4 id="_tracing_from_the_shell">11.9.4. Tracing from the Shell</h4> |
| <div class="paragraph"><p>You can enable tracing for operations run from the shell by using the |
| <code>trace on</code> and <code>trace off</code> commands.</p></div> |
| <div class="literalblock"> |
| <div class="content"> |
| <pre><code>root@test test> trace on |
| root@test test> scan |
| a b:c [] d |
| root@test test> trace off |
| Waiting for trace information |
| Waiting for trace information |
| Trace started at 2013/08/26 13:24:08.332 |
| Time Start Service@Location Name |
| 3628+0 shell@localhost shell:root |
| 8+1690 shell@localhost scan |
| 7+1691 shell@localhost scan:location |
| 6+1692 tserver@localhost startScan |
| 5+1692 tserver@localhost tablet read ahead 6</code></pre> |
| </div></div> |
| </div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_logging">11.10. Logging</h3> |
| <div class="paragraph"><p>Accumulo processes each write to a set of log files. By default these are found under |
| <code>$ACCUMULO/logs/</code>.</p></div> |
| </div> |
| <div class="sect2"> |
| <h3 id="_recovery">11.11. Recovery</h3> |
| <div class="paragraph"><p>In the event of TabletServer failure or error on shutting Accumulo down, some |
| mutations may not have been minor compacted to HDFS properly. In this case, |
| Accumulo will automatically reapply such mutations from the write-ahead log |
| either when the tablets from the failed server are reassigned by the Master (in the |
| case of a single TabletServer failure) or the next time Accumulo starts (in the event of |
| failure during shutdown).</p></div> |
| <div class="paragraph"><p>Recovery is performed by asking a tablet server to sort the logs so that tablets can easily find their missing |
| updates. The sort status of each file is displayed on |
| Accumulo monitor status page. Once the recovery is complete any |
| tablets involved should return to an “online” state. Until then those tablets will be |
| unavailable to clients.</p></div> |
| <div class="paragraph"><p>The Accumulo client library is configured to retry failed mutations and in many |
| cases clients will be able to continue processing after the recovery process without |
| throwing an exception.</p></div> |
| </div> |
| </div> |
| </div> |
| </div> |
| <div id="footnotes"><hr /></div> |
| <div id="footer"> |
| <div id="footer-text"> |
| Last updated 2014-03-14 07:47:13 PDT |
| </div> |
| </div> |
| </body> |
| </html> |