---
layout: documentation
title: Indian Languages Parallel Corpora
---
<div class="container">
<div class="row">
<div class="span8">
<h1>Datasets</h1>
<h2>Indian Languages Parallel Corpora</h2>
<span id="download">
<a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">Download</a>
</span>
</div>
</div>
<hr />
<div class="row">
<div class="span8">
This page describes six parallel corpora created by translating popular Wikipedia
documents from six languages of the Indian subcontinent into English. The
languages are:
<ul>
<li>Bengali</li>
<li>Hindi</li>
<li>Malayalam</li>
<li>Tamil</li>
<li>Telugu</li>
<li>Urdu</li>
</ul>
<p>
The collection and release of this data is described in the following paper:
</p>
<blockquote>
<i>Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing</i> <br/>
<a href="http://cs.jhu.edu/~post">Matt Post</a>, <a href="http://cs.jhu.edu/~ccb">Chris
Callison-Burch</a>, and <a href="http://homepages.inf.ed.ac.uk/miles/">Miles
Osborne</a> <br/>
<a href="http://statmt.org/wmt12">WMT 2012</a> <br/>
<a class="pdf" href="http://aclweb.org/anthology-new/W/W12/W12-3152.pdf">PDF</a>
<a class="bibtex" href="http://aclweb.org/anthology-new/W/W12/W12-3152.bib">BIB</a>
</blockquote>
<h2>Download &amp; License</h2>
The Indian parallel corpora dataset
is <a href="https://github.com/joshua-decoder/indian-parallel-corpora">hosted on
GitHub</a>. You can clone the repository, or download a zip archive of it by clicking the
big green Download button above. The corpus is licensed under
the <a href="http://creativecommons.org/">Creative
Commons</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/">Attribution-ShareAlike
3.0 Unported License</a> (CC BY-SA 3.0).
<h2>Scores</h2>
<p>
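The table below lists the best translation scores (case-insensitive BLEU-4) reported
on the provided test sets. Columns are labeled with language codes: BN (Bengali),
HI (Hindi), ML (Malayalam), TA (Tamil), TE (Telugu), and UR (Urdu). The Google results
were recorded in the fall of 2011 and are described in Post et al. (2012). Google did
not have a Malayalam system at that time, so no score is listed for it.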
</p>
<div>
<table>
<tr>
<th style="width:150px">System</th>
<th>BN</th>
<th>HI</th>
<th>ML</th>
<th>TA</th>
<th>TE</th>
<th>UR</th>
</tr>
<tr>
<td class="system">Google</td>
<td>20.01</td>
<td>25.21</td>
<td>&ndash;</td>
<td>13.51</td>
<td>16.03</td>
<td>23.09</td>
</tr>
<tr>
<td class="system"><a href="http://aclweb.org/anthology/W/W12/W12-3152.pdf">Post et al. (2012)</a></td>
<td>13.53</td>
<td>17.29</td>
<td>13.72</td>
<td>9.81</td>
<td>12.46</td>
<td>19.53</td>
</tr>
</table>
</div>
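<p>
As a rough illustration of the metric named above, the following Python sketch computes
corpus-level, case-insensitive BLEU-4 with a single reference per segment, on the same
0 to 100 scale as the table. The function names (<code>ngram_counts</code>,
<code>corpus_bleu4</code>), the whitespace tokenization, and the absence of smoothing are
assumptions made for illustration; the scoring setup behind the reported numbers may differ.
</p>

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(hypotheses, references):
    """Corpus-level, case-insensitive BLEU-4 with one reference per segment.

    Assumptions: whitespace tokenization and no smoothing. This is a sketch,
    not necessarily the scorer used for the published results.
    """
    matched = [0] * 4  # clipped n-gram matches for n = 1..4
    total = [0] * 4    # hypothesis n-gram counts for n = 1..4
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        # Lowercasing both sides makes the comparison case-insensitive.
        hyp_toks = hyp.lower().split()
        ref_toks = ref.lower().split()
        hyp_len += len(hyp_toks)
        ref_len += len(ref_toks)
        for n in range(1, 5):
            hyp_ngrams = ngram_counts(hyp_toks, n)
            ref_ngrams = ngram_counts(ref_toks, n)
            # Clip each hypothesis n-gram count to its count in the reference.
            matched[n - 1] += sum(
                min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items()
            )
            total[n - 1] += sum(hyp_ngrams.values())
    # Without smoothing, any order with no matches gives a score of zero.
    if 0 in matched:
        return 0.0
    # Geometric mean of the four modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / 4
    # Brevity penalty: 1 when the output is at least as long as the reference.
    brevity_penalty = min(1.0, math.exp(1.0 - ref_len / hyp_len))
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

<p>
Calling <code>corpus_bleu4</code> with two equal-length lists of strings (system output and
reference translations, one segment per entry) returns a single score for the whole test set.
Counts are pooled over all segments before the precisions are computed, which is why the
function takes lists and not individual sentence pairs.
</p>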
</div>
<div class="span4">
<div>
<img width="250px" src="images/map1.png"/>
<p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Indo-Aryan_languages">Indo-Aryan languages</a></p>
<img width="250px" src="images/map2.png"/>
<p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Dravidian_languages">Dravidian languages</a></p>
</div>
</div>
</div>
</div> <!-- /container -->