blob: dc4b7b31c98bf9940e886053779c3b9d8a2a1db2 [file] [log] [blame]
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>BigDansing: A System for Big Data Cleansing</title>
<meta name="generator" content="Hugo 0.46" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/owl.carousel.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/bootstrap.min.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/font-awesome.min.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/airspace-local-fonts.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/airspace.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/style.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/ionicons.min.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/animate.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/responsive.css" />
<link rel="stylesheet" href="https://rheem-ecosystem.github.io/css/syntax.css" />
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.2.0/css/all.css" integrity="sha384-hWVjflwFxL6sNzntih27bfxkr27PmbbK/iSvJ+a4+0owXq79v+lsFkW54bOGbiDQ" crossorigin="anonymous">
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
<script>window.jQuery || document.write('<script src="js/vendor/jquery-1.10.2.min.js"><\/script>')</script>
<script src="https://rheem-ecosystem.github.io/js/bootstrap.min.js"></script>
<script src="https://rheem-ecosystem.github.io/js/owl.carousel.min.js"></script>
<script src="https://rheem-ecosystem.github.io/js/plugins.js"></script>
<script src="https://rheem-ecosystem.github.io/js/min/waypoints.min.js"></script>
<script src="https://rheem-ecosystem.github.io/js/jquery.counterup.js"></script>
<script src="https://rheem-ecosystem.github.io/js/main.js"></script>
<script>
var doNotTrack = false;
if (!doNotTrack) {
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-124750019-1', 'auto');
ga('send', 'pageview');
}
</script>
<script>
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-124750019-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
</head>
<body>
<header>
<a href="https://github.com/rheem-ecosystem/rheem"><img style="position: absolute; top: 0; left: 0; border: 0; z-index: 12; " src="https://camo.githubusercontent.com/121cd7cbdc3e4855075ea8b558508b91ac463ac2/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f6769746875622f726962626f6e732f666f726b6d655f6c6566745f677265656e5f3030373230302e706e67" alt="Fork me on GitHub" data-canonical-src="https://s3.amazonaws.com/github/ribbons/forkme_left_green_007200.png"></a>
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="https://rheem-ecosystem.github.io/">
<img src="https://rheem-ecosystem.github.io/img/logo.png" alt="Rheem X-Platform Logo" class="logo-rheem">
</a>
</div>
<div class="collapse navbar-collapse move" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
<li><a href="https://rheem-ecosystem.github.io/">Home</a></li>
<li><a href="/about/">About</a></li>
<li><a href="/documentation/">Documentation</a></li>
<li><a href="/publication/">Publications</a></li>
<li><a href="/people/">People</a></li>
<li><a href="/download/">Download</a></li>
<li><a href="/contact/">Contact</a></li>
</ul>
</div>
</div>
</nav>
</div>
</div>
</div>
</header>
<div class="post">
<section class="section" style="border: 1px dotted #ddd;">
<div class="container">
<div class="row">
<div>
<div class="block">
<h1>BigDansing: A System for Big Data Cleansing</h1>
<div class="post-info-wrapper">
<p class="italic">By <span class="bold">Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang and Si Yin</span> on <span class="bold">2015</span></p>
</div>
<hr />
<p><p>Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized joins operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.</p>
</p>
<hr />
<p class="center-text" style="width: 100%">
<a href="https://rheem-ecosystem.github.io/pdf/paper/bigdansing.pdf" class="btn btn-main"><i class="fa fa-file-pdf-o"></i> Download</a>
<br>
</p>
</div>
</div>
</div>
</div>
</section>
</div>
<p class="center-text" style="padding: 30px;">
<a class="btn btn-main" href="https://rheem-ecosystem.github.io//publication">Back to Publications</a>
</p>
<section id="call-to-action">
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<div class="block">
<h2>Turning a Zoo into a Circus</h2>
<p><a class="nothing" href="https://rheem-ecosystem.github.io/about">Read more</a> on how <b>RHEEM</b> tame the Zoo of existing data processing platforms to work together.</p>
</div>
</div>
</div>
</div>
</section>
<footer>
<div class="container-fluid">
<div class="row">
<div class="col-md-12">
<div class="footer-manu">
<ul>
<li><a href="https://www.qcri.org/">About QCRI</a></li>
<li><a href="/contact">Contact us</a></li>
<li><a href="https://github.com/rheem-ecosystem">Fork me</a></li>
<li><a href="https://github.com/rheem-ecosystem/blob/master/LICENSE.TXT">License</a></li>
</ul>
</div>
<p>Copyright &copy; Developed by the dRHEEMers @ QCRI. All rights reserved.</p>
</div>
</div>
</div>
</footer>
</body>
</html>