<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Indian Languages Parallel Corpora</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="">
<!-- Le styles -->
<link href="/bootstrap/css/bootstrap.css" rel="stylesheet" />
<link href="/joshua.css" rel="stylesheet" />
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
<link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet">
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="bootstrap/js/html5shiv.js"></script>
<![endif]-->
<!-- Fav and touch icons -->
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png">
<link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png">
<link rel="shortcut icon" href="bootstrap/ico/favicon.png">
</head>
<body>
<div class="navbar navbar-inverse navbar-fixed-top">
<div class="navbar-inner">
<div class="container">
<button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="brand" href="/">Joshua</a>
<div class="nav-collapse collapse">
<ul class="nav">
<li class="active"><a href="index.html">Indian Languages</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</div>
</div>
<div class="container">
<div class="row">
<div class="span8">
<h1>Datasets</h1>
<h2>Indian Languages Parallel Corpora</h2>
<span id="download">
<a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">Download</a>
</span>
</div>
</div>
<hr />
<div class="row">
<div class="span8">
This page describes a set of six parallel corpora, created by translating popular
Wikipedia documents from six languages of the Indian subcontinent into English. The
languages are:
<ul>
<li>Bengali</li>
<li>Hindi</li>
<li>Malayalam</li>
<li>Tamil</li>
<li>Telugu</li>
<li>Urdu</li>
</ul>
<p>
The collection and release of this data is described in the following paper:
</p>
<blockquote>
<i>Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing</i> <br/>
<a href="http://cs.jhu.edu/~post">Matt Post</a>, <a href="http://cs.jhu.edu/~ccb">Chris
Callison-Burch</a>, and <a href="http://homepages.inf.ed.ac.uk/miles/">Miles
Osborne</a> <br/>
<a href="http://statmt.org/wmt12">WMT 2012</a> <br/>
<a class="pdf" href="http://aclweb.org/anthology-new/W/W12/W12-3152.pdf">PDF</a>
<a class="bibtex" href="http://aclweb.org/anthology-new/W/W12/W12-3152.bib">BIB</a>
</blockquote>
<h2>Download &amp; License</h2>
The Indian parallel corpora dataset
is <a href="https://github.com/joshua-decoder/indian-parallel-corpora">hosted on
GitHub</a>. You can clone the repository, or download a zip archive of the master branch
by clicking the Download button above. The corpus is licensed under
the <a href="http://creativecommons.org/">Creative
Commons</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/">Attribution-ShareAlike
3.0 Unported License</a> (CC BY-SA 3.0).
<h2>Scores</h2>
<p>
Below are the best translation scores (case-insensitive BLEU-4) reported so far
on the provided test sets. The Google results were recorded in the fall of 2011 and
are described in Post et al. (2012). Google did not have a Malayalam system at the
time, so no score is listed for it.
</p>
<div>
<table>
<tr>
<th style="width:150px">Citation</th>
<th>BN</th>
<th>HI</th>
<th>ML</th>
<th>TA</th>
<th>TE</th>
<th>UR</th>
</tr>
<tr>
<td class="system">Google</td>
<td>20.01</td>
<td>25.21</td>
<td>&ndash;</td>
<td>13.51</td>
<td>16.03</td>
<td>23.09</td>
</tr>
<tr>
<td class="system"><a href="http://aclweb.org/anthology/W/W12/W12-3152.pdf">Post et al. (2012)</a></td>
<td>13.53</td>
<td>17.29</td>
<td>13.72</td>
<td>9.81</td>
<td>12.46</td>
<td>19.53</td>
</tr>
</table>
</div>
</div>
<div class="span4">
<div>
<img width="250" src="images/map1.png" alt="Map of the Indo-Aryan languages"/>
<p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Indo-Aryan_languages">Indo-Aryan languages</a></p>
<img width="250" src="images/map2.png" alt="Map of the Dravidian languages"/>
<p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Dravidian_languages">Dravidian languages</a></p>
</div>
</div>
</div>
</div> <!-- /container -->
<!-- Le javascript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="bootstrap/js/jquery.js"></script>
<script src="bootstrap/js/bootstrap-transition.js"></script>
<script src="bootstrap/js/bootstrap-alert.js"></script>
<script src="bootstrap/js/bootstrap-modal.js"></script>
<script src="bootstrap/js/bootstrap-dropdown.js"></script>
<script src="bootstrap/js/bootstrap-scrollspy.js"></script>
<script src="bootstrap/js/bootstrap-tab.js"></script>
<script src="bootstrap/js/bootstrap-tooltip.js"></script>
<script src="bootstrap/js/bootstrap-popover.js"></script>
<script src="bootstrap/js/bootstrap-button.js"></script>
<script src="bootstrap/js/bootstrap-collapse.js"></script>
<script src="bootstrap/js/bootstrap-carousel.js"></script>
<script src="bootstrap/js/bootstrap-typeahead.js"></script>
</body>
</html>