<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Indian Languages Parallel Corpora</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="">
<!-- Le styles -->
<link href="/bootstrap/css/bootstrap.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
#download {
background-color: green;
font-size: 14pt;
font-weight: bold;
text-align: center;
color: white;
border-radius: 5px;
padding: 4px;
}
#download a:link {
color: white;
}
#download a:hover {
color: lightgrey;
}
#download a:visited {
color: white;
}
a.pdf {
font-variant: small-caps;
/* font-weight: bold; */
font-size: 10pt;
color: white;
background: brown;
padding: 2px;
}
a.bibtex {
font-variant: small-caps;
/* font-weight: bold; */
font-size: 10pt;
color: white;
background: orange;
padding: 2px;
}
img.sponsor {
height: 120px;
margin: 5px;
}
</style>
<link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet">
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="bootstrap/js/html5shiv.js"></script>
<![endif]-->
<!-- Fav and touch icons -->
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png">
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png">
<link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png">
<link rel="shortcut icon" href="bootstrap/ico/favicon.png">
</head>
<body>
<div class="navbar navbar-inverse navbar-fixed-top">
<div class="navbar-inner">
<div class="container">
<button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="brand" href="#">Joshua</a>
<div class="nav-collapse collapse">
<ul class="nav">
<li><a href="/">Home</a></li>
<li class="active"><a href="index.html">Indian Languages</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</div>
</div>
<div class="container">
<div class="row">
<div class="span8">
<h1>Indian Languages Parallel Corpora</h1>
</div>
<div>
<p>
<br/>
<span id="download">
<a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">Download</a>
</span>
</p>
</div>
</div>
<hr />
<div class="row">
<div class="span8">
<h2>Description</h2>
This page describes a set of parallel corpora pairing English with six languages of the
Indian subcontinent:
<ul>
<li>Bengali</li>
<li>Hindi</li>
<li>Malayalam</li>
<li>Tamil</li>
<li>Telugu</li>
<li>Urdu</li>
</ul>
<p>
They can be used to train (and evaluate) models
for <a href="http://en.wikipedia.org/wiki/Statistical_machine_translation">automatically
translating</a> text into and out of these languages. They were collected by having
workers on Amazon's Mechanical Turk translate Wikipedia articles written in these
languages into English. Their collection and release are described in the following paper:
</p>
<blockquote>
<i>Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing</i> <br/>
<a href="http://cs.jhu.edu/~post">Matt Post</a>, <a href="http://cs.jhu.edu/~ccb">Chris
Callison-Burch</a>, and <a href="http://homepages.inf.ed.ac.uk/miles/">Miles
Osborne</a> <br/>
<a href="http://statmt.org/wmt12">WMT 2012</a> <br/>
<a class="pdf" href="http://aclweb.org/anthology-new/W/W12/W12-3152.pdf">PDF</a>
<a class="bibtex" href="http://aclweb.org/anthology-new/W/W12/W12-3152.bib">BIB</a>
</blockquote>
<h2>Download &amp; License</h2>
The Indian parallel corpora dataset
is <a href="https://github.com/joshua-decoder/indian-parallel-corpora">hosted on
GitHub</a>. You can also
<a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">download
a zip archive directly</a>. The corpus is licensed under the <a href="http://creativecommons.org/">Creative
Commons</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/">Attribution-ShareAlike
3.0 Unported License</a> (CC BY-SA 3.0).
<h2>Citations</h2>
<p>
The following publications have made use of this dataset.
</p>
<ul>
<li><b>Post, Callison-Burch, and Osborne (2012)</b>. This paper introduced the parallel
corpora, describing how the data was collected, reporting the results of preliminary
experiments, and suggesting some potential research directions.
</li>
</ul>
<h2>Scores</h2>
<p>
Below are the best translation scores (case-insensitive BLEU-4) that have been
reported on the provided test sets. The Google results were recorded in the fall of
2011 and are described in Post et al. (2012). Google did not offer a Malayalam
system at that time, so no score is listed for it.
</p>
<div>
<table>
<tr>
<th style="width:150px">Citation</th>
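<th><abbr title="Bengali">BN</abbr></th>
<th><abbr title="Hindi">HI</abbr></th>
<th><abbr title="Malayalam">ML</abbr></th>
<th><abbr title="Tamil">TA</abbr></th>
<th><abbr title="Telugu">TE</abbr></th>
<th><abbr title="Urdu">UR</abbr></th>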
</tr>
<tr>
<td class="system">Google</td>
<td>20.01</td>
<td>25.21</td>
<td>&ndash;</td>
<td>13.51</td>
<td>16.03</td>
<td>23.09</td>
</tr>
<tr>
<td class="system">Post et al. (2012)</td>
<td>13.53</td>
<td>17.29</td>
<td>13.72</td>
<td>9.81</td>
<td>12.46</td>
<td>19.53</td>
</tr>
</table>
</div>
</div>
<div class="span4">
<div>
<img width="250" src="images/map1.png" alt="Map of the Indo-Aryan languages"/>
<p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Indo-Aryan_languages">Indo-Aryan languages</a></p>
<img width="250" src="images/map2.png" alt="Map of the Dravidian languages"/>
<p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Dravidian_languages">Dravidian languages</a></p>
</div>
</div>
</div>
</div> <!-- /container -->
<!-- Le javascript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="bootstrap/js/jquery.js"></script>
<script src="bootstrap/js/bootstrap-transition.js"></script>
<script src="bootstrap/js/bootstrap-alert.js"></script>
<script src="bootstrap/js/bootstrap-modal.js"></script>
<script src="bootstrap/js/bootstrap-dropdown.js"></script>
<script src="bootstrap/js/bootstrap-scrollspy.js"></script>
<script src="bootstrap/js/bootstrap-tab.js"></script>
<script src="bootstrap/js/bootstrap-tooltip.js"></script>
<script src="bootstrap/js/bootstrap-popover.js"></script>
<script src="bootstrap/js/bootstrap-button.js"></script>
<script src="bootstrap/js/bootstrap-collapse.js"></script>
<script src="bootstrap/js/bootstrap-carousel.js"></script>
<script src="bootstrap/js/bootstrap-typeahead.js"></script>
</body>
</html>