| Workshop Program (June 29)
| Invited Talk
|| So many languages, so few resources: How to bridge the gap?
| Mike Maxwell
| Regular Papers
|| Association-Based Bilingual Word Alignment
| Robert C. Moore
|| Cross Language Text Categorization by acquiring Multilingual Domain Models from Comparable Corpora
| Alfio Gliozzo, Carlo Strapparava
|| Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context
| Jonas Kuhn
|| Bilingual Word Spectral Clustering for Statistical Machine Translation
| Bing Zhao, Eric P. Xing, Alex Waibel
|| Revealing Phonological Similarities between Related Languages from Automatically Generated Parallel Corpora
| Karin Mueller
|| Acquiring and Using Parallel Texts and Morpho-syntactic Language Resources for Serbian-English Statistical Machine Translation
| Maja Popovic, David Vilar, Hermann Ney, Slobodan Jovicic, Zoran Saric
|| Induction of Fine-grained Part-of-speech Taggers via Classifier Combination and Crosslingual Projection
| Elliott Franco Drabek, David Yarowsky
|| Comparison, Selection, and Use of Sentence Alignment Algorithms for New Language Pairs
| Anil Kumar Singh, Samar Husain
|Shared Task on Word Alignment
|| Word Alignment for Languages with Scarce Resources
| Joel Martin, Rada Mihalcea, Ted Pedersen
|| A hybrid approach to align sentences and words in English-Hindi parallel corpora
| Niraj Aswani, Robert Gaizauskas
|| NUKTI: English-Inuktitut Word Alignment System Description
| Philippe Langlais, Fabrizio Gotti, Guihong Cao
|| Models for Inuktitut-English Word Alignment
| Charles Schafer, Elliott Franco Drabek
|| Improved HMM Alignment Models for Languages with Scarce Resources
| Adam Lopez, Philip Resnik
|| Symmetric Probabilistic Alignment
| Ralf D. Brown, Jae Dong Kim, Peter J. Jansen, and Jaime G. Carbonell
|| ISI's Participation in the Romanian-English Alignment Task
| Alexander Fraser, Daniel Marcu
|| Experiments Using MAR for Aligning Corpora
| Juan Miguel Vilar
|| Combined word alignments
| Dan Tufis, Radu Ion, Alexandru Ceausu, Dan Stefanescu
|Panel and Discussions
|| "Building and Exploiting Parallel Texts for Languages with Scarce Resources: Lessons Learned and Future Directions".
| Ralf Brown, Joel Martin, Bob Moore, Charles Schafer
So many languages, so few resources: How to bridge the gap?
Linguistic Data Consortium
University of Pennsylvania
It is now common knowledge that many of the world's more
than six thousand languages are in danger of becoming
extinct. While languages have disappeared throughout history
and pre-history, the present rate of extinction is
unprecedented. Attempts are being made both to preserve
languages in the living state, and to document and describe them.
The primary resource for language documentation is
undoubtedly parallel text. Traditionally (and necessarily)
field linguists have created parallel text in the languages
they study by hand, starting out by transcribing previously
unwritten languages. This work continues today, aided by
modern tools, but it is still labor intensive and slow.
Computational linguists have also demonstrated the utility
of parallel text as the fuel for many areas of NLP,
including statistical machine translation. But while these
uses of parallel text have a record of success with
languages like French, Chinese and Arabic, recent efforts in
so-called "Low Density" languages such as Hindi, Cebuano and
others have shown lesser success, in large part because of
the shortage of parallel text.
In sum, apart from a few large languages, parallel text is a
scarce and expensive commodity. I will try to give a feel
for the availability of parallel text in a wide range of
languages, and discuss efforts to create more parallel text,
including better tools for field linguists, web search, paid
translation, and Open-Mind style efforts. I conclude by
suggesting that if the scarcity of parallel text is to be
solved, both for language documentation and for NLP, then it
is time to try new methods, perhaps including wikification.
Short bio: Mike Maxwell is a researcher at the Linguistic Data
Consortium of the University of Pennsylvania. He obtained
his BS in zoology at the University of Illinois in 1972, an
MA in linguistics at the University of Washington in 1977,
and his PhD at the University of Washington in 1984. As a
member of the Summer Institute of Linguistics, he worked
with indigenous languages of Mexico, Ecuador and Colombia,
and developed tools for doing morphological analysis. He has
also worked in syntactic parsing at Boeing Computer Services.
At the Linguistic Data Consortium, his work has included
developing morphological transducers for various languages,
and creating corpora for "low density" languages, that is,
languages without extensive computational resources, ranging
from Hindi to Tigrinya.
His interests include documentation and description of
endangered languages, collecting and building resources for
low density languages, and morphology.
The goal of this shared task is to provide an environment for the evaluation of systems for word alignment, with a focus on languages with scarce resources. This follows on the success of the word alignment shared task that took place as part of the NAACL 2003 workshop on parallel texts.
All researchers who have a word alignment system available are invited to participate in this shared task on word alignment, individually or as part of a team.
Participants in the shared task will be provided with common sets of training data, consisting of English-Inuktitut, Romanian-English, and English-Hindi parallel texts (a participating team can choose to apply their system on one, two, or all three language pairs). Participants will be given approximately one month to train their systems with this data, and then previously held out test data will be released. Participants will run their alignment system on this test data and submit their results, which will be evaluated using a common set of metrics.
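The page does not say which metrics are used. As a rough illustration only, the sketch below assumes the measures commonly reported for word alignment: precision, recall, F-measure, and alignment error rate (AER), computed against a gold standard with "sure" and "possible" links. The function name, the link representation as (sentence id, source position, target position) tuples, and the sample data are all hypothetical. This is not the official scoring code of the shared task.

```python
def alignment_scores(predicted, sure, possible):
    """Score predicted word-alignment links against a gold standard.

    Each argument is a set of (sentence_id, source_pos, target_pos) tuples.
    This tuple layout is an assumption made for illustration, not the
    file format used by the shared task.
    """
    # By convention every sure link also counts as a possible link.
    possible = possible | sure

    matched_sure = len(predicted & sure)
    matched_possible = len(predicted & possible)

    precision = matched_possible / len(predicted) if predicted else 0.0
    recall = matched_sure / len(sure) if sure else 0.0
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    # AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    denominator = len(predicted) + len(sure)
    if denominator:
        aer = 1.0 - (matched_sure + matched_possible) / denominator
    else:
        aer = 0.0

    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "aer": aer}


# Toy data for a single sentence pair (entirely made up).
gold_sure = {(1, 1, 1), (1, 2, 2)}
gold_possible = {(1, 3, 2)}
system_links = {(1, 1, 1), (1, 3, 2), (1, 3, 3)}

scores = alignment_scores(system_links, gold_sure, gold_possible)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the system recovers one of the two sure links and one possible link, and proposes one link that is in neither set. Precision is therefore measured against the possible links and recall against the sure links.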
The registration form is now available. All participants in the word alignment shared task are required to register. During the test period (April 3 - April 10), test data will be released only to registered participants!
Last day to register for participation in the shared task: April 7.
Everybody interested in the shared task is invited to subscribe to the shared task mailing list, which is open to anyone interested in word alignment, whether or not they participate in the shared task. A list of general text alignment resources is also provided.
|Submission of results
|Results back to participants
|Submission of short papers
Guidelines and data sets
- Code for alignment evaluation, and for format validation of alignment files.
- Guidelines for the shared task.
- Training data
- English-Inuktitut training data. A collection of Inuktitut-English parallel texts from the Legislative Assembly of Nunavut, sentence-aligned. An introduction to Inuktitut that participants might find helpful is available here.
- Romanian-English training data. This collection groups together the parallel text of Orwell's 1984, the Romanian Constitution, and a large collection (about 900,000 tokens) of texts gathered from the Web. To get access to this data set, please send an email to Rada Mihalcea (rada at cs unt edu).
- English-Hindi training data. A collection of English-Hindi parallel texts from the Emille project. Data provided by Niraj Aswani and Rob Gaizauskas of the University of Sheffield.
- Development data
- Test data
General text alignment resources