| --- |
| layout: front_page |
| --- |
| <!-- Start /index.md --> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <link rel="stylesheet" type="text/css" href="css/index.css"> |
| <link rel="stylesheet" type="text/css" href="css/header.css"> |
| |
| <main class="ds-masthead"> |
| <div class="container"> |
| <div class="row"> |
| <div class="col-md-8 col-md-offset-2 text-center"> |
| <span class="ds-bootlogo"></span> |
| <p class="lead" style="font-size: 20px; line-height: 1.0; margin-bottom: 15px; color: #FFFFFF">A software library of |
| <a href="https://en.wikipedia.org/wiki/Stochastic" style="color: #EDE379"><i>stochastic</i></a> |
| <a href="https://en.wikipedia.org/wiki/Streaming_algorithm" style="color: #EDE379"><i>streaming algorithms</i></a></p> |
| <p class="lead" style="font-size: 18px; line-height: 1.0; margin-bottom: 15px; color: #FFFFFF"><i>"A truly excellent example of theoretically-informed algorithm engineering"</i> -- Graham Cormode</p> |
| </div> |
| </div> |
| </div> |
| </main> |
| |
| <div class="container"> |
| <div class="row"> |
| <div class="text-justify" style="font-size: 18px; padding-left: 25px; padding-right: 25px"> |
| <p><b>The Business Challenge:</b> Analyzing Big Data Quickly.</p> |
| <p>In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources and time to generate exact results. Examples include <i>count distinct</i>, quantiles, most-frequent items, joins, matrix computations, and graph analysis.</p> |
| |
| <p>If approximate results are acceptable, there is a class of specialized algorithms, called streaming algorithms, or <a href="/docs/Background/SketchOrigins.html">sketches</a> that can produce results orders-of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution.</p> |
| |
| <p>For any system that needs to extract useful information from big data these sketches are a required toolkit that should be tightly integrated into their analysis capabilities. This technology has helped Yahoo successfully reduce data processing times from days or hours to minutes or seconds on a number of its internal platforms.</p> |
| |
| <p>This project is dedicated to providing a broad selection of sketch algorithms of production quality. Contributions are welcome from those interested in further development of this science and art.</p> |
| </div> |
| </div> |
| <div class="row text-center main-marketing"> |
| <div class="col-md-4"> |
| <a href="/docs/Architecture/LargeScale.html#speed"> |
| <span class="fa fa-fighter-jet fa-4x"></span> |
| <h2>Fast</h2> |
| </a> |
| <p class="text-justify"><a href="/docs/Background/SketchOrigins.html">Sketches</a> are <i>fast</i>. |
| The sketch algorithms in this library process data in a single pass and are suitable for |
| both real-time and batch. |
| Sketches enable streaming computation of set expression cardinalities, quantiles, frequency estimation and more. |
| In addition, designing a system around sketching allows simplification of system's architecture and reduction in overall compute resources required for these heretofore difficult computational tasks.</p> |
| </div> |
| |
| <div class="col-md-4"> |
| <a href="/docs/Architecture/LargeScale.html#specific-sketch-features-for-large-data"> |
| <span class="fa fa-database fa-4x"></span> |
| <h2>Big Data</h2> |
| </a> |
| <p class="text-justify">This library has been specifically designed for production systems that must process massive data. |
| The library includes adaptors for Apache Hive, Apache Pig, and PostgreSQL (C++). These adaptors also stand as examples for adaptors for other systems. |
| The sketches in this library are designed to have compatible binary representations across languages (Java, C++, Python) and platforms. |
| </p> |
| </div> |
| |
| <div class="col-md-4"> |
| <a href="/docs/Architecture/KeyFeatures.html#key-algorithms"> |
| <span class="fa fa-bar-chart-o fa-4x"></span> |
| <h2>Analysis</h2> |
| </a> |
| <p class="text-justify">Built-in Theta Sketch set operators (Union, Intersection, Difference) |
| produce sketches as a result (and not just a number) enabling full set expressions of cardinality, |
| such as ((A ∪ B) ∩ (C ∪ D)) \ (E ∪ F). |
| This capability along with predictable and superior accuracy |
| (compared with <i>Include/Exclude</i> approaches) enable unprecedented analysis capabilities |
| for fast queries. </p> |
| </div> |
| </div> |
| </div> |
| <!-- End /index.md --> |