| --- |
| layout: documentation |
| title: Indian Languages Parallel Corpora |
| --- |
| |
| <div class="container"> |
| |
| <div class="row"> |
| <div class="span8"> |
| <h1>Datasets</h1> |
| <h2>Indian Languages Parallel Corpora</h2> |
| <span id="download"> |
| <a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">Download</a> |
| </span> |
| </div> |
| </div> |
| |
| <hr /> |
| |
| <div class="row"> |
| <div class="span8"> |
| |
| This page describes a set of six parallel corpora built by translating popular |
| Wikipedia documents from six languages of the Indian subcontinent into English. The |
| languages are: |
| |
| <ul> |
| <li>Bengali</li> |
| <li>Hindi</li> |
| <li>Malayalam</li> |
| <li>Tamil</li> |
| <li>Telugu</li> |
| <li>Urdu</li> |
| </ul> |
| |
| <p> |
| The collection and release of this data is described in the following paper: |
| </p> |
| |
| <blockquote> |
| <i>Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing</i> <br/> |
| <a href="http://cs.jhu.edu/~post">Matt Post</a>, <a href="http://cs.jhu.edu/~ccb">Chris |
| Callison-Burch</a>, and <a href="http://homepages.inf.ed.ac.uk/miles/">Miles |
| Osborne</a> <br/> |
| <a href="http://statmt.org/wmt12">WMT 2012</a> <br/> |
| <a class="pdf" href="http://aclweb.org/anthology-new/W/W12/W12-3152.pdf">PDF</a> |
| <a class="bibtex" href="http://aclweb.org/anthology-new/W/W12/W12-3152.bib">BIB</a> |
| </blockquote> |
| |
| <h2>Download &amp; License</h2> |
| |
| The Indian parallel corpora dataset |
| is <a href="https://github.com/joshua-decoder/indian-parallel-corpora">hosted on |
| GitHub</a>. You can clone the repository, or download a zip archive of the master |
| branch using the Download link at the top of this page. The corpus is licensed under |
| the <a href="http://creativecommons.org/">Creative |
| Commons</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/">Attribution-ShareAlike |
| 3.0 Unported License</a> (CC BY-SA 3.0). |
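<p>
For scripted access, the download can be sketched in Python as follows. This is a
minimal illustration, not part of the release: the helper names
<code>fetch_archive</code> and <code>list_corpus_files</code> are hypothetical, the
only URL used is the Download link from this page, and nothing is assumed about the
file layout inside the archive.
</p>

```python
import io
import urllib.request
import zipfile

# Archive URL taken from the Download link on this page.
ARCHIVE_URL = "https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master"


def fetch_archive(url=ARCHIVE_URL):
    """Download the archive and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def list_corpus_files(archive_bytes):
    """Return the sorted member names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return sorted(archive.namelist())
```

<p>
Calling <code>list_corpus_files(fetch_archive())</code> lists the archive contents
without writing anything to disk, which is a convenient way to inspect the layout
before extracting.
</p>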
| |
| <h2>Scores</h2> |
| |
| <p> |
| Below are the best translation scores (case-insensitive BLEU-4) that have been reported |
| on the provided test sets. The Google results were recorded in the fall of 2011 and |
| are described in Post et al. (2012). Google did not have a Malayalam system at that time. |
| </p> |
| |
| <div> |
| <table> |
| <tr> |
| <th style="width:150px">Citation</th> |
| <th>BN</th> |
| <th>HI</th> |
| <th>ML</th> |
| <th>TA</th> |
| <th>TE</th> |
| <th>UR</th> |
| </tr> |
| <tr> |
| <td class="system">Google</td> |
| <td>20.01</td> |
| <td>25.21</td> |
| <td>–</td> |
| <td>13.51</td> |
| <td>16.03</td> |
| <td>23.09</td> |
| </tr> |
| <tr> |
| <td class="system"><a href="http://aclweb.org/anthology/W/W12/W12-3152.pdf">Post et al. (2012)</a></td> |
| <td>13.53</td> |
| <td>17.29</td> |
| <td>13.72</td> |
| <td> 9.81</td> |
| <td>12.46</td> |
| <td>19.53</td> |
| </tr> |
| </table> |
| </div> |
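<p>
As a reminder of what the metric measures, here is a rough sketch of corpus-level,
case-insensitive BLEU-4. It is not the scoring script behind the table above: it
assumes whitespace tokenization and exactly one reference per segment, both of which
are simplifications made for illustration, and the function names are hypothetical.
Scores from this sketch should not be compared directly with the published numbers.
</p>

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(hypotheses, references):
    """Corpus-level, case-insensitive BLEU-4 with one reference per segment."""
    matches = [0] * 4  # clipped n-gram matches for n = 1..4
    totals = [0] * 4   # hypothesis n-gram counts for n = 1..4
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        # Lowercasing makes the score case-insensitive; splitting on
        # whitespace is an assumed, simplistic tokenization.
        h = hyp.lower().split()
        r = ref.lower().split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, 5):
            h_counts = ngrams(h, n)
            r_counts = ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

<p>
The function takes two equal-length lists of strings and returns a score between 0
and 100, the same scale used in the table. A hypothesis that differs from its
reference only in capitalization receives a score of 100.
</p>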
| </div> |
| |
| <div class="span4"> |
| <div> |
| <img width="250px" alt="Map of the Indo-Aryan languages" src="images/map1.png"/> |
| <p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Indo-Aryan_languages">Indo-Aryan languages</a></p> |
| |
| <img width="250px" alt="Map of the Dravidian languages" src="images/map2.png"/> |
| <p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Dravidian_languages">Dravidian languages</a></p> |
| </div> |
| </div> |
| </div> |
| </div> <!-- /container --> |