<!DOCTYPE html>
<!-- Start _layouts/doc_page.html-->
<html lang="en">
<head>
<!-- Start _include/site_head.html -->
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="datasketches">
<title>DataSketches | Tuple Sketch Engagement Example</title>
<link rel="shortcut icon" href="/img/favicon.png">
<!-- original source: https://maxcdn.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css -->
<link rel="stylesheet" href="/css/font-awesome.min.css">
<!-- original source: https://maxcdn.bootstrapcdn.com/bootstrap/3.2.0/css/bootstrap.min.css -->
<link rel="stylesheet" href="/css/bootstrap.min.css">
<link rel="stylesheet" href="/css/fonts.css" type="text/css">
<link rel="stylesheet" href="/css/main.css">
<link rel="stylesheet" href="/css/header.css">
<link rel="stylesheet" href="/css/footer.css">
<link rel="stylesheet" href="/css/syntax.css">
<link rel="stylesheet" href="/css/docs.css">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},showMathMenu:false,showMathMenuMSIE:false,showProcessingMessages:false});
</script>
<!-- original source: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMX_HTML-full -->
<script type="text/javascript" src="/js/MathJax.js?config=TeX-AMS_HTML"></script>
<!-- original source: https://code.jquery.com/jquery.min.js -->
<script src="/js/jquery.min.js"></script>
<!-- original source: https://maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js -->
<script src="/js/bootstrap.min.js"></script> <!-- 3.2.0-->
<!-- End _include/site_head.html -->
</head>
<body>
<!-- Start _include/nav_bar.html -->
<div class="navbar navbar-inverse navbar-static-top ds-nav">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="/" style="padding-top: 0px; padding-bottom: 0px;">
<span class="ds-small-h-logo"></span></a>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav navbar-right">
<li>
<a href="/docs/Background/TheChallenge.html">
<span class="fa fa-info-circle"></span> DOCUMENTATION</a>
</li>
<li>
<a href="/docs/Community/Downloads.html">
<span class="fa fa-download"></span> DOWNLOAD</a>
</li>
<!--
<li>
<a href="/docs/Architecture/Components.html">
<span class="fa fa-github"></span> GITHUB</a>
</li>
-->
<li>
<a href="/docs/Community/Research.html">
<span class="fa fa-paper-plane"></span> RESEARCH</a>
</li>
<li>
<a href="/docs/Community/index.html" style="padding-top: 0; padding-bottom: 0;">
<img class="ds-small-man" src="/img/datasketches-ManWhite.svg"/>COMMUNITY</a>
</li>
<li>
<ul class="nav navbar-nav navbar-right ds-nav">
<li class="dropdown ds-nav" >
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false" style="padding-top: 0; padding-bottom: 0;"><img class="apache-logo" src="/img/feather.svg"/>Apache <span class="caret"></span></a>
<ul class="dropdown-menu ds-nav">
<li><a href="https://www.apache.org/" target="_blank">Foundation</a></li>
<li><a href="https://www.apache.org/events/current-event" target="_blank">Events</a></li>
<li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li>
<li><a href="https://privacy.apache.org/policies/privacy-policy-public.html" target="_blank">Privacy Policy</a></li>
<li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
<li><a href="https://www.apache.org/security/" target="_blank">Security</a></li>
<li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
</div>
</div>
<!-- End _include/nav_bar.html -->
<!-- Start _include/javadocs.html -->
<div class="ds-header">
<div class="container">
<h4>API Snapshots:
<a href="https://apache.github.io/datasketches-java/4.2.0/">Java Core</a>,
<a href="https://apache.github.io/datasketches-cpp/5.0.0/">C++ Core</a>,
<a href="https://apache.github.io/datasketches-python/main/">Python</a>,
<a href="https://apache.github.io/datasketches-memory/master/">Memory</a>,
<a href="/api/pig/snapshot/apidocs/index.html">Pig</a>,
<a href="/api/hive/snapshot/apidocs/index.html">Hive</a>,
</h4>
</div>
</div>
<!-- End _include/javadocs.html -->
<div class="container">
<div class="row">
<!-- Start ToC Block -->
<div class="col-md-3">
<div class="searchbox" style="position:relative">
<gcse:searchbox-only></gcse:searchbox-only>
</div>
<!-- Start _includes/toc.html -->
<!-- Computer Generated File, Do Not Edit! -->
<link rel="stylesheet" href="/css/toc.css">
<div id="toc" class="nav toc hidden-print">
<p id="background">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_background">Background</a>
</p>
<div class="collapse" id="collapse_background">
<li><a href="/docs/Background/TheChallenge.html">•The Challenge</a></li>
<li><a href="/docs/Background/SketchOrigins.html">•Sketch Origins</a></li>
<li><a href="/docs/Background/SketchElements.html">•Sketch Elements</a></li>
<li><a href="/docs/Background/Presentations.html">•Presentations</a></li>
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/DataSketches_deck.pdf">•Overview Slide Deck</a></li>
</div>
<p id="architecture-and-design">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_architecture_and_design">Architecture And Design</a>
</p>
<div class="collapse" id="collapse_architecture_and_design">
<li><a href="/docs/Architecture/MajorSketchFamilies.html">•The Major Sketch Families</a></li>
<li><a href="/docs/Architecture/LargeScale.html">•Large Scale Computing</a></li>
<li><a href="/docs/Architecture/KeyFeatures.html">•Key Features</a></li>
<li><a href="/docs/Architecture/SketchFeaturesMatrix.html">•Sketch Features Matrix</a></li>
<li><a href="/docs/Architecture/Components.html">•Components</a></li>
<li><a href="/docs/Architecture/SketchesByComponent.html">•Sketches by Component</a></li>
<li><a href="/docs/Architecture/SketchCriteria.html">•Sketch Criteria</a></li>
<p id="memory-component">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_memory_component">Memory Component</a>
</p>
<div class="collapse" id="collapse_memory_component">
<li><a href="/docs/Memory/MemoryComponent.html">•Memory Component</a></li>
<li><a href="/docs/Memory/MemoryPerformance.html">•Memory Component Performance</a></li>
</div>
<li><a href="/docs/Architecture/OrderSensitivity.html">•Notes on Order Sensitivity</a></li>
<li><a href="/docs/Architecture/Concurrency.html">•Notes on Concurrency</a></li>
</div>
<p id="sketch-families">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_sketch_families">Sketch Families</a>
</p>
<div class="collapse" id="collapse_sketch_families">
<p id="distinct-counting">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_distinct_counting">Distinct Counting</a>
</p>
<div class="collapse" id="collapse_distinct_counting">
<li><a href="/docs/DistinctCountFeaturesMatrix.html">•Features Matrix</a></li>
<li><a href="/docs/DistinctCountMeritComparisons.html">•Figures-of-Merit Comparison</a></li>
<p id="cpc-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_cpc_sketches">CPC Sketches</a>
</p>
<div class="collapse" id="collapse_cpc_sketches">
<li><a href="/docs/CPC/CPC.html">•CPC Sketch</a></li>
<li><a href="/docs/CPC/CpcPerformance.html">•CPC Sketch Performance</a></li>
<p id="cpc-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_cpc_examples">CPC Examples</a>
</p>
<div class="collapse" id="collapse_cpc_examples">
<li><a href="/docs/CPC/CpcJavaExample.html">•CPC Sketch Java Example</a></li>
<li><a href="/docs/CPC/CpcCppExample.html">•CPC Sketch C++ Example</a></li>
<li><a href="/docs/CPC/CpcPigExample.html">•CPC Sketch Pig UDFs</a></li>
<li><a href="/docs/CPC/CpcHiveExample.html">•CPC Sketch Hive UDFs</a></li>
</div>
</div>
<p id="hyperloglog-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_hyperloglog_sketches">HyperLogLog Sketches</a>
</p>
<div class="collapse" id="collapse_hyperloglog_sketches">
<li><a href="/docs/HLL/HLL.html">•HLL Sketch</a></li>
<li><a href="/docs/HLL/HllMap.html">•HLL Map Sketch</a></li>
<p id="hll-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_hll_examples">HLL Examples</a>
</p>
<div class="collapse" id="collapse_hll_examples">
<li><a href="/docs/HLL/HllJavaExample.html">•HLL Sketch Java Example</a></li>
<li><a href="/docs/HLL/HllCppExample.html">•HLL Sketch C++ Example</a></li>
<li><a href="/docs/HLL/HllPigUDFs.html">•HLL Sketch Pig UDFs</a></li>
<li><a href="/docs/HLL/HllHiveUDFs.html">•HLL Sketch Hive UDFs</a></li>
</div>
<p id="hll-studies">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_hll_studies">HLL Studies</a>
</p>
<div class="collapse" id="collapse_hll_studies">
<li><a href="/docs/HLL/HllPerformance.html">•HLL Sketch Performance</a></li>
<li><a href="/docs/HLL/Hll_vs_CS_Hllpp.html">•HLL vs Clearspring HLL++</a></li>
<li><a href="/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html">•HLL Sketch vs Druid HyperLogLogCollector</a></li>
</div>
</div>
<p id="theta-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_theta_sketches">Theta Sketches</a>
</p>
<div class="collapse" id="collapse_theta_sketches">
<li><a href="/docs/Theta/ThetaSketchFramework.html">•Theta Sketch Framework</a></li>
<p id="theta-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_theta_examples">Theta Examples</a>
</p>
<div class="collapse" id="collapse_theta_examples">
<li><a href="/docs/Theta/ConcurrentThetaSketch.html">•Concurrent Theta Sketch</a></li>
<li><a href="/docs/Theta/ThetaJavaExample.html">•Theta Sketch Java Example</a></li>
<li><a href="/docs/Theta/ThetaSparkExample.html">•Theta Sketch Spark Example</a></li>
<li><a href="/docs/Theta/ThetaPigUDFs.html">•Theta Sketch Pig UDFs</a></li>
<li><a href="/docs/Theta/ThetaHiveUDFs.html">•Theta Sketch Hive UDFs</a></li>
</div>
<p id="kmv-tutorial">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_kmv_tutorial">KMV Tutorial</a>
</p>
<div class="collapse" id="collapse_kmv_tutorial">
<li><a href="/docs/Theta/InverseEstimate.html">•The Inverse Estimate</a></li>
<li><a href="/docs/Theta/KMVempty.html">•Empty Sketch</a></li>
<li><a href="/docs/Theta/KMVfirstEst.html">•First Estimator</a></li>
<li><a href="/docs/Theta/KMVbetterEst.html">•Better Estimator</a></li>
<li><a href="/docs/Theta/KMVrejection.html">•Rejection Rules</a></li>
<li><a href="/docs/Theta/KMVupdateVkth.html">•Update V(kth) Rule</a></li>
</div>
<p id="set-operations-and-p-sampling">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_set_operations_and_p-sampling">Set Operations and P-sampling</a>
</p>
<div class="collapse" id="collapse_set_operations_and_p-sampling">
<li><a href="/docs/Theta/ThetaSketchSetOps.html">•Set Operations</a></li>
<li><a href="/docs/Theta/ThetaSetOpsCornerCases.html">•Model & Test Set Operations</a></li>
<li><a href="/docs/Theta/ThetaPSampling.html"><i>p</i>-Sampling</a></li>
</div>
<p id="accuracy">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_accuracy">Accuracy</a>
</p>
<div class="collapse" id="collapse_accuracy">
<li><a href="/docs/Theta/ThetaAccuracy.html">•Basic Accuracy</a></li>
<li><a href="/docs/Theta/ThetaAccuracyPlots.html">•Accuracy Plots</a></li>
<li><a href="/docs/Theta/ThetaErrorTable.html">•Relative Error Table</a></li>
<li><a href="/docs/Theta/ThetaSketchSetOpsAccuracy.html">•SetOp Accuracy</a></li>
<li><a href="/docs/Theta/AccuracyOfDifferentKUnions.html">•Unions With Different k</a></li>
</div>
<p id="size">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_size">Size</a>
</p>
<div class="collapse" id="collapse_size">
<li><a href="/docs/Theta/ThetaSize.html">•Theta Sketch Size</a></li>
</div>
<p id="speed">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_speed">Speed</a>
</p>
<div class="collapse" id="collapse_speed">
<li><a href="/docs/Theta/ThetaUpdateSpeed.html">•Update Speed</a></li>
<li><a href="/docs/Theta/ThetaMergeSpeed.html">•Merge Speed</a></li>
</div>
<p id="theta-sketch-theory">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_theta_sketch_theory">Theta Sketch Theory</a>
</p>
<div class="collapse" id="collapse_theta_sketch_theory">
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchFramework.pdf">•Theta Sketch Framework (PDF)</a></li>
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchEquations.pdf">•Theta Sketch Equations (PDF)</a></li>
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/DataSketches.pdf">•DataSketches (PDF)</a></li>
<li><a href="/docs/Theta/ThetaConfidenceIntervals.html">•Confidence Intervals Notes</a></li>
<li><a href="/docs/Theta/ThetaMergingAlgorithm.html">•Merging Algorithm Notes</a></li>
<li><a href="/docs/Theta/ThetaReferences.html">•Theta References</a></li>
</div>
</div>
<p id="tuple-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_tuple_sketches">Tuple Sketches</a>
</p>
<div class="collapse" id="collapse_tuple_sketches">
<li><a href="/docs/Tuple/TupleOverview.html">•Tuple Overview</a></li>
<p id="tuple-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_tuple_examples">Tuple Examples</a>
</p>
<div class="collapse" id="collapse_tuple_examples">
<li><a href="/docs/Tuple/TupleJavaExample.html">•Tuple Java Example</a></li>
<li><a href="/docs/Tuple/TupleEngagementExample.html">•Tuple Engagement Example</a></li>
<li><a href="/docs/Tuple/TuplePigUDFs.html">•Tuple Pig UDFs</a></li>
<li><a href="/docs/Tuple/TupleHiveUDFs.html">•Tuple Hive UDFs</a></li>
</div>
</div>
</div>
<p id="most-frequent">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_most_frequent">Most Frequent</a>
</p>
<div class="collapse" id="collapse_most_frequent">
<li><a href="/docs/Frequency/FrequencySketchesOverview.html">•Frequency Sketches Overview</a></li>
<p id="frequent-item-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_frequent_item_sketches">Frequent Item Sketches</a>
</p>
<div class="collapse" id="collapse_frequent_item_sketches">
<li><a href="/docs/Frequency/FrequentItemsOverview.html">•Frequent Items Overview</a></li>
<li><a href="/docs/Frequency/FrequentItemsErrorTable.html">•Frequent Items Error Table</a></li>
<li><a href="/docs/Frequency/FrequentItemsReferences.html">•Frequent Items References</a></li>
<li><a href="/docs/Frequency/FrequentItemsPerformance.html">•Frequent Items Performance</a></li>
<p id="most-frequent-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_most_frequent_examples">Most Frequent Examples</a>
</p>
<div class="collapse" id="collapse_most_frequent_examples">
<li><a href="/docs/Frequency/FrequentItemsJavaExample.html">•Frequent Items Java Example</a></li>
<li><a href="/docs/Frequency/FrequentItemsCppExample.html">•Frequent Items C++ Example</a></li>
<li><a href="/docs/Frequency/FrequentItemsPigUDFs.html">•Frequent Items Pig UDFs</a></li>
<li><a href="/docs/Frequency/FrequentItemsHiveUDFs.html">•Frequent Items Hive UDFs</a></li>
</div>
</div>
<p id="frequent-distinct-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_frequent_distinct_sketches">Frequent Distinct Sketches</a>
</p>
<div class="collapse" id="collapse_frequent_distinct_sketches">
<li><a href="/docs/Frequency/FrequentDistinctTuplesSketch.html">•Frequent Distinct Tuples Sketch</a></li>
</div>
</div>
<p id="quantiles-and-histograms">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_quantiles_and_histograms">Quantiles And Histograms</a>
</p>
<div class="collapse" id="collapse_quantiles_and_histograms">
<li><a href="/docs/Quantiles/SketchingQuantilesAndRanksTutorial.html">•Quantiles and Ranks Tutorial</a></li>
<li><a href="/docs/Quantiles/QuantilesOverview.html">•Quantiles Overview</a></li>
<li><a href="/docs/KLL/KLLSketch.html">•KLL Floats sketch</a></li>
<li><a href="/docs/KLL/KLLAccuracyAndSize.html">•KLL Sketch Accuracy and Size</a></li>
<li><a href="/docs/REQ/ReqSketch.html">•REQ Floats sketch</a></li>
<li><a href="/docs/Quantiles/OrigQuantilesSketch.html">•Original QuantilesSketch</a></li>
<p id="quantiles-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_quantiles_examples">Quantiles Examples</a>
</p>
<div class="collapse" id="collapse_quantiles_examples">
<li><a href="/docs/Quantiles/QuantilesJavaExample.html">•Quantiles Sketch Java Example</a></li>
<li><a href="/docs/KLL/KLLCppExample.html">•KLL Quantiles Sketch C++ Example</a></li>
<li><a href="/docs/Quantiles/QuantilesPigUDFs.html">•Quantiles Sketch Pig UDFs</a></li>
<li><a href="/docs/Quantiles/QuantilesHiveUDFs.html">•Quantiles Sketch Hive UDFs</a></li>
</div>
<p id="quantiles-studies">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_quantiles_studies">Quantiles Studies</a>
</p>
<div class="collapse" id="collapse_quantiles_studies">
<li><a href="/docs/QuantilesStudies/DruidApproxHistogramStudy.html">•Druid Approximate Histogram</a></li>
<li><a href="/docs/QuantilesStudies/MomentsSketchStudy.html">•Moments Sketch Study</a></li>
<li><a href="/docs/QuantilesStudies/QuantilesStreamAStudy.html">•Quantiles StreamA Study</a></li>
<li><a href="/docs/QuantilesStudies/ExactQuantiles.html">•Exact Quantiles for Studies</a></li>
</div>
<p id="quantiles-sketch-theory">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_quantiles_sketch_theory">Quantiles Sketch Theory</a>
</p>
<div class="collapse" id="collapse_quantiles_sketch_theory">
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/Quantiles_KLL.pdf">•Optimal Quantile Approximation in Streams</a></li>
<li><a href="/docs/Quantiles/QuantilesReferences.html">•Quantiles References</a></li>
</div>
</div>
<p id="sampling">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_sampling">Sampling</a>
</p>
<div class="collapse" id="collapse_sampling">
<li><a href="/docs/Sampling/ReservoirSampling.html">•Reservoir Sampling</a></li>
<li><a href="/docs/Sampling/ReservoirSamplingPerformance.html">•Reservoir Sampling Performance</a></li>
<li><a href="/docs/Sampling/VarOptSampling.html">•VarOpt Sampling</a></li>
<p id="sampling-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_sampling_examples">Sampling Examples</a>
</p>
<div class="collapse" id="collapse_sampling_examples">
<li><a href="/docs/Sampling/ReservoirSamplingJava.html">•Reservoir Sampling Java Example</a></li>
<li><a href="/docs/Sampling/ReservoirSamplingPigUDFs.html">•Reservoir Sampling Pig UDFs</a></li>
<li><a href="/docs/Sampling/VarOptSamplingJava.html">•VarOpt Sampling Java Example</a></li>
<li><a href="/docs/Sampling/VarOptPigUDFs.html">•VarOpt Sampling Pig UDFs</a></li>
</div>
</div>
</div>
<p id="system-integrations">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_system_integrations">System Integrations</a>
</p>
<div class="collapse" id="collapse_system_integrations">
<li><a href="/docs/SystemIntegrations/ApacheDruidIntegration.html">•Using Sketches in Apache Druid</a></li>
<li><a href="/docs/SystemIntegrations/ApacheHiveIntegration.html">•Using Sketches in Apache Hive</a></li>
<li><a href="/docs/SystemIntegrations/ApachePigIntegration.html">•Using Sketches in Apache Pig</a></li>
<li><a href="/docs/SystemIntegrations/PostgreSQLIntegration.html">•Using Sketches in PostgreSQL</a></li>
</div>
<p id="community">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_community">Community</a>
</p>
<div class="collapse" id="collapse_community">
<li><a href="/docs/Community/index.html">•Community</a></li>
<li><a href="/docs/Community/Downloads.html">•Downloads</a></li>
<li><a href="/docs/Community/NewCommitterProcess.html">•Committer Process</a></li>
<li><a href="/docs/Community/ReleaseProcessForCppComponents.html">•Release Process For CPP Components</a></li>
<li><a href="/docs/Community/ReleaseProcessForJavaComponents.html">•Release Process For Java Components</a></li>
<li><a href="/docs/Community/Transitioning.html">•Transitioning from prior GitHub Site</a></li>
</div>
<p id="research">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_research">Research</a>
</p>
<div class="collapse" id="collapse_research">
<li><a href="/docs/Community/Research.html">•Research</a></li>
</div>
</div>
<!-- End _includes/toc.html -->
<!-- Start _includes/tocScript.html -->
<script>
(function () {
var findLineItem = function (path) {
return document.querySelector(`#toc [href="${path}"]`);
};
function findNavItem(path) {
return document.querySelector(`.nav [href="${path}"]`);
}
var highlighLineItem = function (element) {
element.classList.add('highlight');
};
var checkHasClass = function (element, className) {
return element.className.split(' ').find(function (item) { return item === className || '' })
}
var findAllCollapseParents = function (element) {
var collapseMenus = [];
var elementPointer = element;
while (elementPointer !== document.body) {
if (checkHasClass(elementPointer, 'collapse')) {
collapseMenus.push(elementPointer);
}
elementPointer = elementPointer.parentElement
}
return collapseMenus
};
var openMenuItem = function (element) {
// $(element).collapse('show') would start a transition, adding `in` class instead.
element.classList.add('in');
};
var openAllFromList = function (elementList) {
elementList.forEach(openMenuItem);
};
var highlightAndOpenMenu = function () {
// Highlight & expand nav item in the TOC
var currentLineItem = findLineItem(document.location.pathname);
highlighLineItem(currentLineItem);
openAllFromList(findAllCollapseParents(currentLineItem));
// Highlight nav item in top navigation
highlighLineItem(findNavItem(document.location.pathname));
};
$(highlightAndOpenMenu);
}());
</script>
<!-- End _includes/tocScript.html -->
</div>
<!-- End ToC Block -->
<div class="col-md-9 doc-content">
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<h1 id="tuple-sketch-engagement-example">Tuple Sketch Engagement Example</h1>
<h2 id="the-challenge--measuring-customer-engagement">The Challenge: Measuring Customer Engagement</h2>
<p>When customers visit our websites, blogs, or stores, it is very useful to understand how engaged they are with us and our products. There are many ways to characterize customer engagement, but one common way is to understand how frequently our customers return to visit.</p>
<p>For example, let’s study the following histogram:</p>
<p><img class="doc-img-full" src="/docs/img/tuple/EngagementHistogram.png" alt="EngagementHistogram.png" /></p>
<p>The X-axis is the number of days on which a specific customer (identified by some unique ID) visits our site in a 30-day period.</p>
<p>The Y-axis is the number of distinct visitors (customers) that visited our site on exactly X days during the 30-day period.</p>
<p>Reading this histogram, we can see that about 100 distinct visitors visited our site on exactly one day of the 30-day period. About 11 visitors visited our site on 5 different days of the 30-day period. And it seems that we have one customer who visited our site every day of the 30-day period! We certainly want to encourage more of these loyal customers.</p>
<p>Different businesses will have different overall time periods of interest and different resolutions for the repeat visit intervals. They can be over years, months, weeks or days, hours or even minutes. It is up to the business to decide what time intervals are of interest to measure. What we show here is clearly a made-up example to convey the concept.</p>
<p>So how do we do this? In particular, how can we do it efficiently and quickly enough to deliver near-real-time results?</p>
<p>Well, we have a sketching app for that!</p>
<h2 id="the-input-stream">The Input Stream</h2>
<p>The input data we need to create the above histogram can be viewed as a stream of tuples, where each tuple has at least two components: a timestamp and a unique identifier (ID) that is a proxy for a customer or visitor. In real systems, the tuples may have many other attributes, but for our purposes here, we only need these two. The stream of tuples might be a live stream flowing in a network, or data being streamed from storage. It doesn’t matter.</p>
<p>In order for a sketch to work properly, it must see all relevant data for the particular day, domain, or dimensional coordinates that the sketch is assigned to represent. Sketches are mergeable, thus parallelizable, which means that the domain can be partitioned into many substreams feeding separate sketches. At the appropriate time the substream sketches can be merged together into a single sketch to provide a snapshot-in-time analysis of the whole domain.</p>
<p>It is critical to emphasize that the input stream must not be pre-sampled (for example, a 10% random sample) as this will seriously impact the accuracy of any estimates derived from the sketch. It is perfectly fine to pre-filter the input stream to remove robot traffic, for example, which will totally remove that traffic from the analysis.</p>
<h2 id="duplicates">Duplicates</h2>
<p>We want our customers to come back and visit us many times, which will create tuples with duplicate IDs in the stream. This is a good thing, but for this analysis we need to handle duplicate IDs in two different ways, one in each of the two stages of the analysis.</p>
<h3 id="stage-1-fine-grain-interval-sketching">Stage 1: Fine-grain interval sketching</h3>
<p>In our example our fine-grain interval is a day and the overall interval is 30 days. In the first stage we want to process all the tuples for one day in a way that ultimately results in a single sketch for that day. This may mean many sketches operating in parallel to process all the records for one day, but they are ultimately merged down to a single sketch representing all the data for one day.</p>
<p>Since we want to analyze data for 30 days, at the end of Stage 1 we will have 30 sketches, one for each of the 30 days.</p>
<p>In this first stage we only want to count visits by any one customer <strong>once</strong> for a single day, even if a customer visits us multiple times during that day. Thus, we want to ignore any duplicate occurrences of the same ID within the same day.</p>
<h3 id="stage-2-merge-and-count-across-days">Stage 2: Merge and count across days</h3>
<p>Once we have our 30 daily sketches, we merge all 30 together into one final sketch. This time, however, we want to count the number of duplicates that occur for any single ID across different days. This will give us the number of days on which any unique ID appeared across all 30 days.</p>
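<p>To make the two kinds of duplicate handling concrete, here is a minimal exact (non-sketch) illustration in plain Java. It is not how the library works internally: it stores every ID, which is exactly what a sketch avoids. The class name <code>ExactEngagement</code>, its method names, and the small zero-based day and ID numbers are all hypothetical and chosen only for this illustration.</p>

```java
import java.util.Arrays;

/** Exact (non-sketch) illustration of the two-stage duplicate handling. */
final class ExactEngagement {

  /**
   * Stage 1: mark each (day, id) pair at most once, so repeat visits
   * by the same ID within one day are ignored.
   * Each visit is {day, id}; both are hypothetical small zero-based ints.
   */
  static boolean[][] stage1(int[][] visits, int numDays, int numIds) {
    boolean[][] seen = new boolean[numDays][numIds];
    for (int[] visit : visits) {
      seen[visit[0]][visit[1]] = true; // idempotent: duplicates change nothing
    }
    return seen;
  }

  /** Stage 2: for each ID, count the number of days on which it appeared. */
  static int[] stage2(boolean[][] seen, int numIds) {
    int[] daysVisited = new int[numIds];
    for (boolean[] day : seen) {
      for (int id = 0; id < numIds; id++) {
        if (day[id]) { daysVisited[id]++; }
      }
    }
    return daysVisited;
  }

  /** Histogram: index d holds the number of IDs seen on exactly d days. */
  static int[] histogram(int[] daysVisited, int numDays) {
    int[] hist = new int[numDays + 1];
    for (int d : daysVisited) {
      if (d > 0) { hist[d]++; }
    }
    return hist;
  }

  public static void main(String[] args) {
    // Made-up sample: ID 0 visits twice on day 0, ID 2 visits twice on day 2.
    int[][] visits = { {0, 0}, {0, 0}, {0, 1}, {1, 0}, {2, 0}, {2, 2}, {2, 2} };
    int[] days = stage2(stage1(visits, 3, 3), 3);
    System.out.println(Arrays.toString(days));
    System.out.println(Arrays.toString(histogram(days, 3)));
  }
}
```

<p>In the made-up sample, ID 0 appears on all three days and IDs 1 and 2 appear on one day each, so same-day repeats are ignored while cross-day repeats are counted. The sketch-based approach described here aims to approximate the same histogram while retaining only a bounded number of rows.</p>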
<h2 id="the-integersketch-and-helper-classes">The IntegerSketch and Helper classes</h2>
<p>To help us code our example we will leverage the <a href="https://github.com/apache/datasketches-java/tree/master/src/main/java/org/apache/datasketches/tuple/aninteger">IntegerSketch package</a> from the library. This package consists of 5 classes, the <em>IntegerSketch</em> and 4 helper classes, all of which extend generic classes of the parent <em>tuple</em> package. Normally, the user/developer would develop these 5 classes to solve a particular analysis problem. These 5 classes can serve as an example of how to create your own Tuple Sketch solutions and we will use them to solve our customer engagement problem.</p>
<p>Please refer to the <a href="/docs/Tuple/TupleOverview.html">Tuple Overview</a> section on this website for a quick review of how the Tuple Sketch works.</p>
<h3 id="integersketch-class">IntegerSketch class</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">IntegerSketch</span> <span class="kd">extends</span> <span class="nc">UpdatableSketch</span><span class="o">&lt;</span><span class="nc">Integer</span><span class="o">,</span> <span class="nc">IntegerSummary</span><span class="o">&gt;</span> <span class="o">{</span>
</code></pre></div></div>
<p>The IntegerSketch class extends the generic UpdatableSketch specifying two type parameters, an Integer and an IntegerSummary.</p>
<p>The Integer type specifies the data type that will update the IntegerSummary. The IntegerSummary specifies the structure of the summary field and what rules to use when updating the field with an Integer type.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kd">public</span> <span class="nf">IntegerSketch</span><span class="o">(</span><span class="kd">final</span> <span class="kt">int</span> <span class="n">lgK</span><span class="o">,</span> <span class="kd">final</span> <span class="nc">IntegerSummary</span><span class="o">.</span><span class="na">Mode</span> <span class="n">mode</span><span class="o">)</span> <span class="o">{</span>
<span class="kd">super</span><span class="o">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">lgK</span><span class="o">,</span> <span class="nc">ResizeFactor</span><span class="o">.</span><span class="na">X8</span><span class="o">.</span><span class="na">ordinal</span><span class="o">(),</span> <span class="mf">1.0</span><span class="no">F</span><span class="o">,</span> <span class="k">new</span> <span class="nc">IntegerSummaryFactory</span><span class="o">(</span><span class="n">mode</span><span class="o">));</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This first constructor takes an integer and a Mode. The integer <em>lgK</em> is a parameter that determines the maximum size of the sketch object, both in memory and when stored, as well as the accuracy of the sketch. The larger the value, the larger the sketch and the more accurate it will be. The “lg” in front of the “K” is shorthand for log base 2. This parameter must be an integer between 4 and 26, with 12 being a typical value. With the value 12, there will be up to 2^12 = 4096 possible rows retained by the sketch, where each row consists of a key and a summary field. In theory, the summary field can be anything, but for our example it is just a single integer.</p>
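<p>The relationship between <em>lgK</em> and the nominal number of rows mirrors the <code>1 &lt;&lt; lgK</code> expression passed to the superclass in the constructor above. The following is only a sketch of that arithmetic: the class <code>LgKExample</code> and the method <code>nominalEntries</code> are hypothetical and not part of the library, and the range check simply restates the bounds of 4 and 26 given above.</p>

```java
/** Hypothetical helper illustrating K = 2^lgK; not part of the library. */
final class LgKExample {
  static final int MIN_LG_K = 4;   // lower bound as stated in the text
  static final int MAX_LG_K = 26;  // upper bound as stated in the text

  /** Returns the nominal number of rows, K = 2^lgK, for a valid lgK. */
  static int nominalEntries(int lgK) {
    if (lgK < MIN_LG_K || lgK > MAX_LG_K) {
      throw new IllegalArgumentException("lgK must be between 4 and 26: " + lgK);
    }
    return 1 << lgK; // same shift used in the IntegerSketch constructor
  }

  public static void main(String[] args) {
    for (int lgK : new int[] {4, 12, 26}) {
      System.out.println("lgK = " + lgK + " gives K = " + nominalEntries(lgK));
    }
  }
}
```

<p>Each increment of <em>lgK</em> doubles the nominal number of rows, so the choice is a trade-off between sketch size and accuracy.</p>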
<p>We will not be using the second constructor.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">update</span><span class="o">(</span><span class="kd">final</span> <span class="nc">String</span> <span class="n">key</span><span class="o">,</span> <span class="kd">final</span> <span class="nc">Integer</span> <span class="n">value</span><span class="o">)</span> <span class="o">{</span>
<span class="kd">super</span><span class="o">.</span><span class="na">update</span><span class="o">(</span><span class="n">key</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">update</span><span class="o">(</span><span class="kd">final</span> <span class="kt">long</span> <span class="n">key</span><span class="o">,</span> <span class="kd">final</span> <span class="nc">Integer</span> <span class="n">value</span><span class="o">)</span> <span class="o">{</span>
<span class="kd">super</span><span class="o">.</span><span class="na">update</span><span class="o">(</span><span class="n">key</span><span class="o">,</span> <span class="n">value</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>The IntegerSketch has two update methods: one takes a <em>String</em> key and an <em>Integer</em> value, and the other takes a <em>long</em> key and an <em>Integer</em> value.
The user system code would call one of these two methods to update the sketch. In our example, we will call the second update method with an integer key representing a user ID and a value of one for the Integer. The key will be hashed and passed to the internal sketching algorithm, which will determine whether the key-value pair should be retained by the sketch. If it is retained, the second parameter (the value) will be passed to the IntegerSummary class for handling.</p>
<h3 id="integersummary-class">IntegerSummary class</h3>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">IntegerSummary</span> <span class="kd">implements</span> <span class="nc">UpdatableSummary</span><span class="o">&lt;</span><span class="nc">Integer</span><span class="o">&gt;</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kt">int</span> <span class="n">value_</span><span class="o">;</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="nc">Mode</span> <span class="n">mode_</span><span class="o">;</span>
<span class="o">...</span>
</code></pre></div></div>
<p>The <em>IntegerSummary</em> class is central to understanding how tuple sketches work in general and how we will configure it for our example.</p>
<p>The IntegerSummary class implements the generic UpdatableSummary interface, specifying one type parameter, Integer, which is the data type that will update this summary. This summary object is very simple. It has one updatable value field of type <em>int</em> and a <em>final</em> Mode field, which tells this summary object the rule to use when updating <em>value</em>.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/**
* The aggregation modes for this Summary
*/</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kd">enum</span> <span class="nc">Mode</span> <span class="o">{</span>
<span class="cm">/**
* The aggregation mode is the summation function.
* &lt;p&gt;New retained value = previous retained value + incoming value&lt;/p&gt;
*/</span>
<span class="nc">Sum</span><span class="o">,</span>
<span class="cm">/**
* The aggregation mode is the minimum function.
* &lt;p&gt;New retained value = min(previous retained value, incoming value)&lt;/p&gt;
*/</span>
<span class="nc">Min</span><span class="o">,</span>
<span class="cm">/**
* The aggregation mode is the maximum function.
* &lt;p&gt;New retained value = max(previous retained value, incoming value)&lt;/p&gt;
*/</span>
<span class="nc">Max</span><span class="o">,</span>
<span class="cm">/**
* The aggregation mode is always one.
* &lt;p&gt;New retained value = 1&lt;/p&gt;
*/</span>
<span class="nc">AlwaysOne</span>
<span class="o">}</span>
</code></pre></div></div>
<p>The <em>Mode</em> enum defines the different rules that can be used when updating the summary. In this case we have four rules: Sum, Min, Max, and AlwaysOne. For our example, we will only use Sum and AlwaysOne. There is only one public constructor which specifies the mode that we wish to use. The <em>getValue()</em> method allows us to extract the value of the summary when the sketching is done.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">update</span><span class="o">(</span><span class="kd">final</span> <span class="nc">Integer</span> <span class="n">value</span><span class="o">)</span> <span class="o">{</span>
<span class="k">switch</span> <span class="o">(</span><span class="n">mode_</span><span class="o">)</span> <span class="o">{</span>
<span class="k">case</span> <span class="nl">Sum:</span>
<span class="n">value_</span> <span class="o">+=</span> <span class="n">value</span><span class="o">;</span>
<span class="k">break</span><span class="o">;</span>
<span class="k">case</span> <span class="nl">Min:</span>
<span class="k">if</span> <span class="o">(</span><span class="n">value</span> <span class="o">&lt;</span> <span class="n">value_</span><span class="o">)</span> <span class="o">{</span> <span class="n">value_</span> <span class="o">=</span> <span class="n">value</span><span class="o">;</span> <span class="o">}</span>
<span class="k">break</span><span class="o">;</span>
<span class="k">case</span> <span class="nl">Max:</span>
<span class="k">if</span> <span class="o">(</span><span class="n">value</span> <span class="o">&gt;</span> <span class="n">value_</span><span class="o">)</span> <span class="o">{</span> <span class="n">value_</span> <span class="o">=</span> <span class="n">value</span><span class="o">;</span> <span class="o">}</span>
<span class="k">break</span><span class="o">;</span>
<span class="k">case</span> <span class="nl">AlwaysOne:</span>
<span class="n">value_</span> <span class="o">=</span> <span class="mi">1</span><span class="o">;</span>
<span class="k">break</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This method is called by the sketch algorithms to update the summary with the value provided by the IntegerSketch update method described above. This is the code that implements the aggregation rules specified by the Mode.</p>
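<p>To make the difference between the modes concrete, here is a small standalone sketch that applies the same rules outside of the library. <em>ModeDemo</em>, <em>apply</em> and <em>fold</em> are hypothetical names, and the starting value of zero is an assumption made for this illustration only.</p>

```java
// Standalone illustration; ModeDemo and its members are hypothetical names.
final class ModeDemo {

  enum Mode { Sum, Min, Max, AlwaysOne }

  /** Applies one incoming value to the retained value under the given mode. */
  static int apply(final Mode mode, final int retained, final int incoming) {
    switch (mode) {
      case Sum: return retained + incoming;
      case Min: return Math.min(retained, incoming);
      case Max: return Math.max(retained, incoming);
      case AlwaysOne: return 1;
      default: throw new IllegalStateException("Unknown mode: " + mode);
    }
  }

  /** Folds a sequence of incoming values into a retained value, starting from initial. */
  static int fold(final Mode mode, final int initial, final int... incoming) {
    int retained = initial;
    for (final int v : incoming) {
      retained = apply(mode, retained, v);
    }
    return retained;
  }

  public static void main(final String[] args) {
    // The same key is seen three times, each time with the value 1.
    // The starting value 0 is an assumption of this sketch.
    System.out.println(fold(Mode.Sum, 0, 1, 1, 1));       // prints 3
    System.out.println(fold(Mode.AlwaysOne, 0, 1, 1, 1)); // prints 1
  }
}
```

<p>With <em>Sum</em> a repeated key accumulates its values, while with <em>AlwaysOne</em> the retained value stays at one no matter how often the key is seen.</p>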
<h3 id="integersummarysetoperations-class">IntegerSummarySetOperations class</h3>
<p>This class allows us to define different updating rules for two different set operations: <em>Union</em> and <em>Intersection</em>. In this context “Union” is synonymous with “merge”. In our example we will only use the Union set operation.</p>
<p>Note that this set operations class also uses the mode updating logic of the IntegerSummary class. These updating modes can be different from the mode used when the IntegerSummary is used with the IntegerSketch class.</p>
<h3 id="integersummaryfactory-class">IntegerSummaryFactory class</h3>
<p>This class is only called by the underlying sketch code when a new key-value pair needs to be retained by the sketch. In that case a new empty Summary must be associated with the new key, and that new summary may then need to be updated by the incoming value.</p>
<h3 id="integersummarydeserializer-class">IntegerSummaryDeserializer class</h3>
<p>This class is only called by the underlying sketch code when deserializing a sketch and its summaries from a stored image. We will not be using this class in our example.</p>
<h2 id="the-engagementtest-class">The <a href="https://github.com/apache/datasketches-java/blob/master/src/test/java/org/apache/datasketches/tuple/aninteger/EngagementTest.java">EngagementTest</a> class</h2>
<p>Note 1: the version in the GitHub master is more up-to-date than the version of this class in the 1.1.0-incubating release. This tutorial references the code in master.</p>
<p>Note 2: You can run the following <em>computeEngagementHistogram()</em> method as a test, but in order to see the output you will need to un-comment the printf(…) statement at the very end of the class.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">computeEngagementHistogram</span><span class="o">()</span> <span class="o">{</span>
<span class="kt">int</span> <span class="n">lgK</span> <span class="o">=</span> <span class="mi">8</span><span class="o">;</span> <span class="c1">//Using a larger sketch &gt;= 9 will produce exact results for this little example</span>
<span class="kt">int</span> <span class="no">K</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="n">lgK</span><span class="o">;</span>
<span class="kt">int</span> <span class="n">days</span> <span class="o">=</span> <span class="mi">30</span><span class="o">;</span>
<span class="kt">int</span> <span class="n">v</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
<span class="nc">IntegerSketch</span><span class="o">[]</span> <span class="n">skArr</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">IntegerSketch</span><span class="o">[</span><span class="n">days</span><span class="o">];</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">days</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
<span class="n">skArr</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">IntegerSketch</span><span class="o">(</span><span class="n">lgK</span><span class="o">,</span> <span class="nc">AlwaysOne</span><span class="o">);</span>
<span class="o">}</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;=</span> <span class="n">days</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span> <span class="c1">//31 generating indices for symmetry</span>
<span class="kt">int</span> <span class="n">numIds</span> <span class="o">=</span> <span class="n">numIDs</span><span class="o">(</span><span class="n">days</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
<span class="kt">int</span> <span class="n">numDays</span> <span class="o">=</span> <span class="n">numDays</span><span class="o">(</span><span class="n">days</span><span class="o">,</span> <span class="n">i</span><span class="o">);</span>
<span class="kt">int</span> <span class="n">myV</span> <span class="o">=</span> <span class="n">v</span><span class="o">++;</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">d</span> <span class="o">&lt;</span> <span class="n">numDays</span><span class="o">;</span> <span class="n">d</span><span class="o">++)</span> <span class="o">{</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">id</span> <span class="o">&lt;</span> <span class="n">numIds</span><span class="o">;</span> <span class="n">id</span><span class="o">++)</span> <span class="o">{</span>
<span class="n">skArr</span><span class="o">[</span><span class="n">d</span><span class="o">].</span><span class="na">update</span><span class="o">(</span><span class="n">myV</span> <span class="o">+</span> <span class="n">id</span><span class="o">,</span> <span class="mi">1</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="n">v</span> <span class="o">+=</span> <span class="n">numIds</span><span class="o">;</span>
<span class="o">}</span>
<span class="n">unionOps</span><span class="o">(</span><span class="no">K</span><span class="o">,</span> <span class="nc">Sum</span><span class="o">,</span> <span class="n">skArr</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>This little engagement test uses a power-law distribution of number of days visited versus the number of visitors in order to model what actual data might look like. It is not essential to understand how the data is generated, but if you are curious it will be discussed at the end.</p>
<p>In lines 7 - 10, we create a simple array of 30 sketches for the 30 days. Note that we set the update mode to <em>AlwaysOne</em>. (Because this little test does not generate any duplicates in the first stage, the mode <em>Sum</em> would also work.)</p>
<p>The triple-nested for-loops update the 30 sketches using a pair of parametric generating functions discussed later. Line 22 passes the array of sketches to the <em>unionOps(…)</em> method, which will output the results.</p>
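<p>The reason for using <em>AlwaysOne</em> in the daily sketches and <em>Sum</em> when they are merged can be seen in an exact, sketch-free analogue. The following is only an illustration with plain arrays; <em>TwoStageDemo</em>, <em>daysVisited</em> and the sample data are hypothetical, and no sampling or hashing is involved.</p>

```java
// Exact, sketch-free analogue of the two stages; all names are hypothetical.
final class TwoStageDemo {

  /**
   * visits[d] lists the visitor IDs seen on day d (duplicates allowed).
   * IDs are assumed to lie in the range [0, numVisitors).
   * Returns an array where element id is the number of distinct days on which id appeared.
   */
  static int[] daysVisited(final int[][] visits, final int numVisitors) {
    final int[] total = new int[numVisitors];
    for (final int[] day : visits) {
      // Stage 1 (AlwaysOne): within one day a visitor's summary is 1,
      // however many times that visitor appears.
      final int[] perDay = new int[numVisitors];
      for (final int id : day) {
        perDay[id] = 1;
      }
      // Stage 2 (Sum): merging the days adds up the per-day summaries.
      for (int id = 0; id < numVisitors; id++) {
        total[id] += perDay[id];
      }
    }
    return total;
  }

  public static void main(final String[] args) {
    final int[][] visits = {
      {0, 1, 1, 2}, // day 0: visitor 1 appears twice but counts once
      {1, 2},       // day 1
      {1}           // day 2
    };
    final int[] result = daysVisited(visits, 3);
    System.out.println(java.util.Arrays.toString(result)); // prints [1, 3, 2]
  }
}
```

<p>The sketches perform the same two steps, but only on the sample of keys that they retain.</p>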
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kd">private</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">unionOps</span><span class="o">(</span><span class="kt">int</span> <span class="no">K</span><span class="o">,</span> <span class="nc">IntegerSummary</span><span class="o">.</span><span class="na">Mode</span> <span class="n">mode</span><span class="o">,</span> <span class="nc">IntegerSketch</span> <span class="o">...</span> <span class="n">sketches</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">IntegerSummarySetOperations</span> <span class="n">setOps</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">IntegerSummarySetOperations</span><span class="o">(</span><span class="n">mode</span><span class="o">,</span> <span class="n">mode</span><span class="o">);</span>
<span class="nc">Union</span><span class="o">&lt;</span><span class="nc">IntegerSummary</span><span class="o">&gt;</span> <span class="n">union</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Union</span><span class="o">&lt;&gt;(</span><span class="no">K</span><span class="o">,</span> <span class="n">setOps</span><span class="o">);</span>
<span class="kt">int</span> <span class="n">len</span> <span class="o">=</span> <span class="n">sketches</span><span class="o">.</span><span class="na">length</span><span class="o">;</span>
<span class="k">for</span> <span class="o">(</span><span class="nc">IntegerSketch</span> <span class="n">isk</span> <span class="o">:</span> <span class="n">sketches</span><span class="o">)</span> <span class="o">{</span>
<span class="n">union</span><span class="o">.</span><span class="na">update</span><span class="o">(</span><span class="n">isk</span><span class="o">);</span>
<span class="o">}</span>
<span class="nc">CompactSketch</span><span class="o">&lt;</span><span class="nc">IntegerSummary</span><span class="o">&gt;</span> <span class="n">result</span> <span class="o">=</span> <span class="n">union</span><span class="o">.</span><span class="na">getResult</span><span class="o">();</span>
<span class="nc">SketchIterator</span><span class="o">&lt;</span><span class="nc">IntegerSummary</span><span class="o">&gt;</span> <span class="n">itr</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="na">iterator</span><span class="o">();</span>
<span class="kt">int</span><span class="o">[]</span> <span class="n">numDaysArr</span> <span class="o">=</span> <span class="k">new</span> <span class="kt">int</span><span class="o">[</span><span class="n">len</span> <span class="o">+</span> <span class="mi">1</span><span class="o">];</span> <span class="c1">//zero index is ignored</span>
<span class="k">while</span> <span class="o">(</span><span class="n">itr</span><span class="o">.</span><span class="na">next</span><span class="o">())</span> <span class="o">{</span>
<span class="c1">//For each unique visitor from the result sketch, get the # days visited</span>
<span class="kt">int</span> <span class="n">numDaysVisited</span> <span class="o">=</span> <span class="n">itr</span><span class="o">.</span><span class="na">getSummary</span><span class="o">().</span><span class="na">getValue</span><span class="o">();</span>
<span class="c1">//increment the number of visitors that visited numDays</span>
<span class="n">numDaysArr</span><span class="o">[</span><span class="n">numDaysVisited</span><span class="o">]++;</span> <span class="c1">//values range from 1 to 30</span>
<span class="o">}</span>
<span class="n">println</span><span class="o">(</span><span class="s">"\nEngagement Histogram:"</span><span class="o">);</span>
<span class="n">println</span><span class="o">(</span><span class="s">"Number of Unique Visitors by Number of Days Visited"</span><span class="o">);</span>
<span class="n">printf</span><span class="o">(</span><span class="s">"%12s%12s%12s%12s\n"</span><span class="o">,</span><span class="s">"Days Visited"</span><span class="o">,</span> <span class="s">"Estimate"</span><span class="o">,</span> <span class="s">"LB"</span><span class="o">,</span> <span class="s">"UB"</span><span class="o">);</span>
<span class="kt">int</span> <span class="n">sumVisits</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
<span class="kt">double</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="na">getTheta</span><span class="o">();</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">numDaysArr</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
<span class="kt">int</span> <span class="n">visitorsAtDaysVisited</span> <span class="o">=</span> <span class="n">numDaysArr</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
<span class="k">if</span> <span class="o">(</span><span class="n">visitorsAtDaysVisited</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span> <span class="k">continue</span><span class="o">;</span> <span class="o">}</span>
<span class="n">sumVisits</span> <span class="o">+=</span> <span class="n">visitorsAtDaysVisited</span> <span class="o">*</span> <span class="n">i</span><span class="o">;</span>
<span class="kt">double</span> <span class="n">estVisitorsAtDaysVisited</span> <span class="o">=</span> <span class="n">visitorsAtDaysVisited</span> <span class="o">/</span> <span class="n">theta</span><span class="o">;</span>
<span class="kt">double</span> <span class="n">lbVisitorsAtDaysVisited</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="na">getLowerBound</span><span class="o">(</span><span class="n">numStdDev</span><span class="o">,</span> <span class="n">visitorsAtDaysVisited</span><span class="o">);</span>
<span class="kt">double</span> <span class="n">ubVisitorsAtDaysVisited</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="na">getUpperBound</span><span class="o">(</span><span class="n">numStdDev</span><span class="o">,</span> <span class="n">visitorsAtDaysVisited</span><span class="o">);</span>
<span class="n">printf</span><span class="o">(</span><span class="s">"%12d%12.0f%12.0f%12.0f\n"</span><span class="o">,</span>
<span class="n">i</span><span class="o">,</span> <span class="n">estVisitorsAtDaysVisited</span><span class="o">,</span> <span class="n">lbVisitorsAtDaysVisited</span><span class="o">,</span> <span class="n">ubVisitorsAtDaysVisited</span><span class="o">);</span>
<span class="o">}</span>
<span class="c1">//The estimate and bounds of the total number of visitors come directly from the sketch.</span>
<span class="kt">double</span> <span class="n">visitors</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="na">getEstimate</span><span class="o">();</span>
<span class="kt">double</span> <span class="n">lbVisitors</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="na">getLowerBound</span><span class="o">(</span><span class="n">numStdDev</span><span class="o">);</span>
<span class="kt">double</span> <span class="n">ubVisitors</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="na">getUpperBound</span><span class="o">(</span><span class="n">numStdDev</span><span class="o">);</span>
<span class="n">printf</span><span class="o">(</span><span class="s">"\n%12s%12s%12s%12s\n"</span><span class="o">,</span><span class="s">"Totals"</span><span class="o">,</span> <span class="s">"Estimate"</span><span class="o">,</span> <span class="s">"LB"</span><span class="o">,</span> <span class="s">"UB"</span><span class="o">);</span>
<span class="n">printf</span><span class="o">(</span><span class="s">"%12s%12.0f%12.0f%12.0f\n"</span><span class="o">,</span> <span class="s">"Visitors"</span><span class="o">,</span> <span class="n">visitors</span><span class="o">,</span> <span class="n">lbVisitors</span><span class="o">,</span> <span class="n">ubVisitors</span><span class="o">);</span>
<span class="c1">//The total number of visits, however, is a scaled metric and takes advantage of the fact that</span>
<span class="c1">//the retained entries in the sketch are a uniform random sample of all unique visitors, and</span>
<span class="c1">//the rest of the unique users will likely behave in the same way.</span>
<span class="kt">double</span> <span class="n">estVisits</span> <span class="o">=</span> <span class="n">sumVisits</span> <span class="o">/</span> <span class="n">theta</span><span class="o">;</span>
<span class="kt">double</span> <span class="n">lbVisits</span> <span class="o">=</span> <span class="o">(</span><span class="n">estVisits</span> <span class="o">*</span> <span class="n">lbVisitors</span><span class="o">)</span> <span class="o">/</span> <span class="n">visitors</span><span class="o">;</span>
<span class="kt">double</span> <span class="n">ubVisits</span> <span class="o">=</span> <span class="o">(</span><span class="n">estVisits</span> <span class="o">*</span> <span class="n">ubVisitors</span><span class="o">)</span> <span class="o">/</span> <span class="n">visitors</span><span class="o">;</span>
<span class="n">printf</span><span class="o">(</span><span class="s">"%12s%12.0f%12.0f%12.0f\n\n"</span><span class="o">,</span> <span class="s">"Visits"</span><span class="o">,</span> <span class="n">estVisits</span><span class="o">,</span> <span class="n">lbVisits</span><span class="o">,</span> <span class="n">ubVisits</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>In the unionOps method, line 2 initializes the <em>IntegerSummarySetOperations</em> class with the given mode, which for stage 2 of our example must be <em>Sum</em>. Line 3 creates a new Union class initialized with the setOps class.</p>
<p>The for-each loop then updates the union with each of the sketches in the array.</p>
<p>Next, the result is obtained from the union as a <em>CompactSketch</em>, and a <em>SketchIterator</em> is obtained from the result so we can process all the retained rows of the sketch.</p>
<p>In the <em>while</em> loop, we accumulate the frequencies of occurrences of rows with the same count value into the <em>numDaysArr</em>.</p>
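<p>The remainder of the method is just the mechanics of computing the error bounds for each row and for the totals, and printing the results to the console. (The <em>numStdDev</em> value passed to the bounds methods is not declared in this method; it is defined elsewhere in the test class.) The output should look something like this:</p>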
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Engagement Histogram:
Number of Unique Visitors by Number of Days Visited
Days Visited Estimate LB UB
1 98 92 104
2 80 75 86
3 32 30 36
4 16 15 19
5 10 9 13
6 5 5 8
7 4 4 7
8 4 4 7
9 3 3 6
10 2 2 4
11 3 3 6
12 2 2 4
14 2 2 4
15 2 2 4
17 2 2 4
19 2 2 4
21 1 1 3
24 1 1 3
27 1 1 3
30 1 1 3
Totals Estimate LB UB
Visitors 272 263 281
Visits 917 886 948
</code></pre></div></div>
<p>This is the data that is plotted as a histogram at the top of this tutorial.</p>
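<p>The scaling step behind the “Estimate” column can be sketched on its own: each count observed among the retained rows is divided by <em>theta</em>, the fraction of unique keys the sketch retained. The code below is a simplified illustration with made-up numbers. <em>HistogramDemo</em> and its methods are hypothetical, and the error bounds, which the test obtains from the result sketch, are left out.</p>

```java
// Illustrative only; HistogramDemo and the sample values are hypothetical.
final class HistogramDemo {

  /** Counts how many retained rows carry each days-visited value (index 0 is unused). */
  static int[] histogram(final int[] daysVisitedPerRow, final int maxDays) {
    final int[] counts = new int[maxDays + 1];
    for (final int d : daysVisitedPerRow) {
      counts[d]++;
    }
    return counts;
  }

  /** Scales a count observed in the sample up to an estimate for the full population. */
  static double estimate(final int sampleCount, final double theta) {
    return sampleCount / theta;
  }

  /** Total visits represented by the sample: sum of (days visited * rows with that value). */
  static int sumVisits(final int[] counts) {
    int sum = 0;
    for (int i = 0; i < counts.length; i++) {
      sum += counts[i] * i;
    }
    return sum;
  }

  public static void main(final String[] args) {
    final int[] rows = {1, 1, 1, 2, 2, 5}; // made-up summary values of six retained rows
    final double theta = 0.5;               // made-up sampling fraction
    final int[] counts = histogram(rows, 30);
    System.out.println(estimate(counts[1], theta));         // prints 6.0
    System.out.println(estimate(sumVisits(counts), theta)); // prints 24.0
  }
}
```

<p>When <em>theta</em> is 1.0 the sketch has retained every unique key and the estimates equal the observed counts.</p>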
<h2 id="generating-the-synthetic-data">Generating the Synthetic Data</h2>
<p>This section is for folks interested in how the data for this example was generated. This is optional reading.</p>
<p>Much of the data we analyze from the Internet has the classical “long-tailed” or power-law distribution. When the frequencies of occurrences of some type are plotted on log-log axes, we tend to get a negatively sloping, mostly straight line. Numerous books and papers have been written about this phenomenon. It is quite real, and any tools used to analyze big data should take it into account.</p>
<p>For this example, it was useful to generate data that at least had some resemblance to what a user might actually experience with their own data.</p>
<p>To plot a straight line on a graph we use the familiar <em>y = mx + b</em> formula from high school, where <em>m</em> is the slope and <em>b</em> is the y-intercept. In our example, we want the line to start at the maximum number of days, <em>d</em>, and have a negative slope so our equation becomes <em>y = -mx + d</em>.</p>
<p>As we noted above, we actually want this to be a straight line on a log-log graph, so the variables <em>y</em>, <em>x</em> and <em>d</em> are actually log variables. Now our equation becomes</p>
<p style="text-align: center;"><i>log y = -m log x + log d.</i></p>
<p>To transform this into a pair of parametric equations we perform the following operations. First we multiply through by <em>d / log(d)</em> so that <em>d</em> stands by itself as the last term on the right-hand side</p>
<p style="text-align: center;"><i>d log y / log d = -d m log x / log d + d.</i></p>
<p>Then we insert our parametric variable <em>i</em>, which will vary from zero to <em>d</em>, in the middle:</p>
<p style="text-align: center;"><i>d log y / log d = i = -d m log x / log d + d.</i></p>
<p>Solving for both <em>x</em> and <em>y</em> separately gives</p>
<p style="text-align: center;"><i>y = exp(i/d log d),</i></p>
<p style="text-align: center;"><i>x = exp((d-i)/(d m) log d).</i></p>
<p>These are continuous functions, and when plotted we can see our negatively sloping plot (here <em>m = 1</em>) starting at <em>y = 30</em> and ending at <em>x = 30</em>. The parametric variable <em>i</em> varies from 0 to 30, inclusive.</p>
<p><img class="doc-img-half" src="/docs/img/tuple/ContinuousLogLog.png" alt="ContinuousLogLog.png" /></p>
<p>This, of course, results in non-integer coordinates, which is not what we want. Discretizing the equations gives</p>
<p style="text-align: center;"><i>y = round(exp(i/d log d)),</i></p>
<p style="text-align: center;"><i>x = round(exp((d-i)/(d m) log d)).</i></p>
<p>This produces</p>
<p><img class="doc-img-half" src="/docs/img/tuple/DiscreteLogLog.png" alt="DiscreteLogLog.png" /></p>
<p>Note that these plots are symmetric about the faint 45 degree line.</p>
<p>The points on this graph represent the parameters for the two inner <em>for</em> loops used to generate the final data fed to the sketches.</p>
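<p>The two discretized equations above translate almost directly into code. The following is a minimal sketch under that reading of the formulas. <em>PowerLawDemo</em>, <em>x</em> and <em>y</em> are hypothetical names chosen for this illustration and are not claimed to match the generating functions used in the test class.</p>

```java
// Sketch of the discretized parametric equations; names are hypothetical.
final class PowerLawDemo {

  /** y = round(exp((i / d) * log d)) */
  static int y(final int d, final int i) {
    return (int) Math.round(Math.exp(((double) i / d) * Math.log(d)));
  }

  /** x = round(exp(((d - i) / (d * m)) * log d)) */
  static int x(final int d, final double m, final int i) {
    return (int) Math.round(Math.exp(((d - i) / (d * m)) * Math.log(d)));
  }

  public static void main(final String[] args) {
    final int d = 30;     // maximum number of days
    final double m = 1.0; // slope magnitude on the log-log plot
    // The parametric variable i runs from 0 to d, inclusive; three sample points:
    System.out.printf("i=%d x=%d y=%d%n", 0, x(d, m, 0), y(d, 0));    // i=0 x=30 y=1
    System.out.printf("i=%d x=%d y=%d%n", 15, x(d, m, 15), y(d, 15)); // i=15 x=5 y=5
    System.out.printf("i=%d x=%d y=%d%n", 30, x(d, m, 30), y(d, 30)); // i=30 x=1 y=30
  }
}
```

<p>With <em>m = 1</em>, the value of <em>x</em> at index <em>i</em> equals the value of <em>y</em> at index <em>d - i</em>, which is the symmetry about the 45 degree line noted above.</p>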
</div> <!-- End content -->
</div> <!-- End row -->
</div> <!-- End Container -->
<!-- Start _include/page_footer.html -->
<footer class="ds-footer">
<div class="container">
<div class="text-center">
<p>
<div>Copyright © 2024 <a href="https://www.apache.org">Apache Software Foundation</a>,
Licensed under the Apache License, Version 2.0. All Rights Reserved.
| <a href="https://privacy.apache.org/policies/privacy-policy-public.html">Privacy Policy</a><br/>
Apache DataSketches, Apache, the Apache feather logo, and the Apache DataSketches project logos are trademarks of The Apache Software Foundation.<br/>
All other marks mentioned may be trademarks or registered trademarks of their respective owners.
</div>
</p>
</div>
</div>
</footer>
<!-- End _include/page_footer.html -->
</body>
</html>
<!-- End _layouts/doc_page.html-->