| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8"> |
| <title>Indian Languages Parallel Corpora</title> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| <meta name="description" content=""> |
| <meta name="author" content=""> |
| |
| <!-- Le styles --> |
| <link href="/bootstrap/css/bootstrap.css" rel="stylesheet"> |
| <style> |
| body { |
| padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */ |
| } |
| #download { |
| background-color: green; |
| font-size: 14pt; |
| font-weight: bold; |
| text-align: center; |
| color: white; |
| border-radius: 5px; |
| padding: 4px; |
| } |
| |
| #download a:link { |
| color: white; |
| } |
| |
| #download a:hover { |
| color: lightgrey; |
| } |
| |
| #download a:visited { |
| color: white; |
| } |
| |
| a.pdf { |
| font-variant: small-caps; |
| /* font-weight: bold; */ |
| font-size: 10pt; |
| color: white; |
| background: brown; |
| padding: 2px; |
| } |
| |
| a.bibtex { |
| font-variant: small-caps; |
| /* font-weight: bold; */ |
| font-size: 10pt; |
| color: white; |
| background: orange; |
| padding: 2px; |
| } |
| |
| img.sponsor { |
| height: 120px; |
| margin: 5px; |
| } |
| </style> |
| <link href="bootstrap/css/bootstrap-responsive.css" rel="stylesheet"> |
| |
| <!-- HTML5 shim, for IE6-8 support of HTML5 elements --> |
| <!--[if lt IE 9]> |
| <script src="bootstrap/js/html5shiv.js"></script> |
| <![endif]--> |
| |
| <!-- Fav and touch icons --> |
| <link rel="apple-touch-icon-precomposed" sizes="144x144" href="bootstrap/ico/apple-touch-icon-144-precomposed.png"> |
| <link rel="apple-touch-icon-precomposed" sizes="114x114" href="bootstrap/ico/apple-touch-icon-114-precomposed.png"> |
| <link rel="apple-touch-icon-precomposed" sizes="72x72" href="bootstrap/ico/apple-touch-icon-72-precomposed.png"> |
| <link rel="apple-touch-icon-precomposed" href="bootstrap/ico/apple-touch-icon-57-precomposed.png"> |
| <link rel="shortcut icon" href="bootstrap/ico/favicon.png"> |
| </head> |
| |
| <body> |
| |
| <div class="navbar navbar-inverse navbar-fixed-top"> |
| <div class="navbar-inner"> |
| <div class="container"> |
| <button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| <a class="brand" href="#">Joshua</a> |
| <div class="nav-collapse collapse"> |
| <ul class="nav"> |
| <li class="active"><a href="/">Home</a></li> |
| <li><a href="index.html">Indian Languages</a></li> |
| </ul> |
| </div><!--/.nav-collapse --> |
| </div> |
| </div> |
| </div> |
| |
| <div class="container"> |
| |
| <div class="row"> |
| <div class="span8"> |
| <h1>Indian Languages Parallel Corpora</h1> |
| </div> |
| <div> |
| <p> |
| <br/> |
| <span id="download"> |
| <a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">Download</a> |
| </span> |
| </p> |
| </div> |
| </div> |
| |
| <hr /> |
| |
| <div class="row"> |
| <div class="span8"> |
| |
| <h2>Description</h2> |
| |
| This page describes a set of parallel corpora between English and six languages of the |
| Indian subcontinent: |
| |
| <ul> |
| <li>Bengali</li> |
| <li>Hindi</li> |
| <li>Malayalam</li> |
| <li>Tamil</li> |
| <li>Telugu</li> |
| <li>Urdu</li> |
| </ul> |
| |
| <p> |
| The corpora can be used to train and evaluate models |
| for <a href="http://en.wikipedia.org/wiki/Statistical_machine_translation">automatically |
| translating</a> text into and out of these languages. They were collected by having workers |
| on Amazon's Mechanical Turk translate Indian-language Wikipedia articles into English. |
| Their collection and release are described in the paper: |
| </p> |
| |
| <blockquote> |
| <i>Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing</i> <br/> |
| <a href="http://cs.jhu.edu/~post">Matt Post</a>, <a href="http://cs.jhu.edu/~ccb">Chris |
| Callison-Burch</a>, and <a href="http://homepages.inf.ed.ac.uk/miles/">Miles |
| Osborne</a> <br/> |
| <a href="http://statmt.org/wmt12">WMT 2012</a> <br/> |
| <a class="pdf" href="http://aclweb.org/anthology-new/W/W12/W12-3152.pdf">PDF</a> |
| <a class="bibtex" href="http://aclweb.org/anthology-new/W/W12/W12-3152.bib">BIB</a> |
| </blockquote> |
| |
| <h2>Download &amp; License</h2> |
| |
| The Indian parallel corpora dataset |
| is <a href="https://github.com/joshua-decoder/indian-parallel-corpora">hosted on |
| GitHub</a>. You can download a zip archive directly |
| by <a href="https://github.com/joshua-decoder/indian-parallel-corpora/zipball/master">clicking |
| here</a>. The corpus is licensed under the <a href="http://creativecommons.org/">Creative |
| Commons</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/">Attribution-ShareAlike |
| 3.0 Unported License</a> (CC BY-SA 3.0). |
| |
| <h2>Citations</h2> |
| |
| <p> |
| The following publications have made use of this dataset. |
| </p> |
| |
| <ul> |
| <li><b>Post, Callison-Burch, and Osborne (2012)</b>. This paper introduced the parallel |
| corpora, describing how the data was collected, reporting the results of preliminary |
| experiments, and suggesting some potential research directions. |
| </li> |
| </ul> |
| |
| <h2>Scores</h2> |
| |
| <p> |
| Below are the best translation scores (case-insensitive BLEU-4) that have been |
| reported on the provided test sets. The Google results were recorded in the fall of |
| 2011 and are described in Post et al. (2012). Google did not have a Malayalam |
| system at that time, so no score is listed for it. |
| </p> |
| |
| <div> |
| <table> |
| <tr> |
| <th style="width:150px">Citation</th> |
| <th>BN</th> |
| <th>HI</th> |
| <th>ML</th> |
| <th>TA</th> |
| <th>TE</th> |
| <th>UR</th> |
| </tr> |
| <tr> |
| <td class="system">Google</td> |
| <td>20.01</td> |
| <td>25.21</td> |
| <td>–</td> |
| <td>13.51</td> |
| <td>16.03</td> |
| <td>23.09</td> |
| </tr> |
| <tr> |
| <td class="system">Post et al. (2012)</td> |
| <td>13.53</td> |
| <td>17.29</td> |
| <td>13.72</td> |
| <td>9.81</td> |
| <td>12.46</td> |
| <td>19.53</td> |
| </tr> |
| </table> |
| </div> |
| </div> |
| |
| <div class="span4"> |
| <div> |
| <img width="250" src="images/map1.png" alt="Map of the Indo-Aryan languages"/> |
| <p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Indo-Aryan_languages">Indo-Aryan languages</a></p> |
| |
| <img width="250" src="images/map2.png" alt="Map of the Dravidian languages"/> |
| <p style="text-align: center"><a href="http://en.wikipedia.org/wiki/Dravidian_languages">Dravidian languages</a></p> |
| </div> |
| </div> |
| </div> |
| </div> <!-- /container --> |
| |
| <!-- Le javascript |
| ================================================== --> |
| <!-- Placed at the end of the document so the pages load faster --> |
| <script src="bootstrap/js/jquery.js"></script> |
| <script src="bootstrap/js/bootstrap-transition.js"></script> |
| <script src="bootstrap/js/bootstrap-alert.js"></script> |
| <script src="bootstrap/js/bootstrap-modal.js"></script> |
| <script src="bootstrap/js/bootstrap-dropdown.js"></script> |
| <script src="bootstrap/js/bootstrap-scrollspy.js"></script> |
| <script src="bootstrap/js/bootstrap-tab.js"></script> |
| <script src="bootstrap/js/bootstrap-tooltip.js"></script> |
| <script src="bootstrap/js/bootstrap-popover.js"></script> |
| <script src="bootstrap/js/bootstrap-button.js"></script> |
| <script src="bootstrap/js/bootstrap-collapse.js"></script> |
| <script src="bootstrap/js/bootstrap-carousel.js"></script> |
| <script src="bootstrap/js/bootstrap-typeahead.js"></script> |
| |
| </body> |
| </html> |
| |