The Align System

Introduction

Align is a C++ package for aligning, at the sentence level, a pair of text files which are translations of one another. Each file may contain "spurious" (extra) sentences that do not appear in the other file, and the translations may be impressionistic rather than literal. Relying on dynamic programming and a user-provided routine for calculating the probability of a word-to-word translation between the two languages, Align will (ideally, anyway) weave an optimal sentence-to-sentence alignment between the two files.

Align takes as input a pair of ASCII files to be aligned. Each file contains one "sentence" per line, the words of which are space-delimited. That is, newlines delimit sentences, and spaces delimit words. I put the word "sentence" in quotes because Align doesn't actually care what syntactic units appear on each line; however, the output of Align will be an alignment between lines of the input files. (If you so desire, you may put paragraphs or just phrases on each line, to align at a coarser or finer level of granularity.)
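To make the input format concrete, here is a minimal sketch of reading such a file into memory. The names Sentence and readSentences are hypothetical and are not taken from Align's source. The sketch also assumes that any run of whitespace separates words, which is slightly more permissive than the single-space convention described above.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One "sentence" is simply the list of words found on one input line.
using Sentence = std::vector<std::string>;

// Hypothetical helper (not part of Align): reads one sentence per line.
// Blank lines become empty sentences, so that sentences[i] always
// corresponds to line i of the input.
std::vector<Sentence> readSentences(std::istream& in) {
    std::vector<Sentence> sentences;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        Sentence sentence;
        std::string word;
        while (words >> word) {  // operator>> splits on whitespace
            sentence.push_back(word);
        }
        sentences.push_back(sentence);
    }
    return sentences;
}
```

Keeping the line index equal to the sentence index matters because the alignment is reported between lines of the input files.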

Code-related notes

There should be nothing you, the user of this code, need to modify in any part of the code except User.[CH] (that is, User.C and User.H), where you *will* be required to fill in some empty functions. The most important component there is a scoring routine which gives the probability that a "French" word is the proper translation of an "English" word. The alignment program relies on this scoring function to guide its alignment: sentences containing words that are likely translations of one another are probably themselves translations.

One can think of this probabilistic word-to-word translation model as an N by M matrix, where N is the number of recognized French words and M is the number of recognized English words. The model doesn't have to be highly accurate, but the better the probabilities are, the better the alignments are likely to be.
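As an illustration of the matrix view, the sketch below stores the word-to-word probabilities sparsely and falls back to a small floor value for unseen pairs. This is only an assumption about how one might back a scoring routine: the class name TranslationTable, its methods, and the floor value are all hypothetical, and the actual function signatures expected in User.[CH] are not reproduced here.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sparse stand-in for the N-by-M matrix described above.
// This is NOT Align's interface; it only illustrates the idea.
class TranslationTable {
public:
    // floor: assumed probability for word pairs that were never recorded.
    explicit TranslationTable(double floor = 1e-6) : floor_(floor) {}

    // Records the probability that `french` translates `english`.
    void set(const std::string& french, const std::string& english, double p) {
        table_[english][french] = p;
    }

    // Returns the recorded probability, or the floor for an unseen pair.
    double prob(const std::string& french, const std::string& english) const {
        auto row = table_.find(english);
        if (row == table_.end()) return floor_;
        auto cell = row->second.find(french);
        return cell == row->second.end() ? floor_ : cell->second;
    }

private:
    double floor_;
    std::unordered_map<std::string,
                       std::unordered_map<std::string, double>> table_;
};
```

A sparse map is used because most French/English word pairs have negligible probability, so a dense N-by-M array would be mostly wasted space for realistic vocabularies.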

This code attempts to align the input text on a sentence-by-sentence level. Of course, some bilingual corpora are actually aligned at a much finer grain: at the phrase level, say. This program guarantees only that the resulting alignment will be the optimal alignment (relative to the user-provided translation probabilities and user-provided thresholds) at the sentence level.
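For readers unfamiliar with alignment by dynamic programming, the following is a generic, simplified sketch of the idea. It is not Align's actual recurrence. It assumes only two kinds of moves (pair one English line with one French line, or leave a line unaligned as "spurious"), a caller-supplied matchCost function (for example, a negative log-probability derived from word translation scores), and a fixed skipCost penalty. The names Step, alignLines, matchCost, and skipCost are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// (English line, French line); -1 marks a spurious line left unaligned.
using Step = std::pair<int, int>;

// Returns a minimum-cost alignment under the simplified model above.
// matchCost(e, f) must be deterministic; lower cost means a better match.
std::vector<Step> alignLines(
    std::size_t numEnglish, std::size_t numFrench,
    const std::function<double(std::size_t, std::size_t)>& matchCost,
    double skipCost) {
    // best[i][j]: cheapest way to align the first i English lines
    // with the first j French lines.
    std::vector<std::vector<double>> best(
        numEnglish + 1, std::vector<double>(numFrench + 1, 0.0));
    for (std::size_t i = 1; i <= numEnglish; ++i)
        best[i][0] = best[i - 1][0] + skipCost;
    for (std::size_t j = 1; j <= numFrench; ++j)
        best[0][j] = best[0][j - 1] + skipCost;
    for (std::size_t i = 1; i <= numEnglish; ++i) {
        for (std::size_t j = 1; j <= numFrench; ++j) {
            best[i][j] = std::min({
                best[i - 1][j - 1] + matchCost(i - 1, j - 1),  // pair lines
                best[i - 1][j] + skipCost,    // English line is spurious
                best[i][j - 1] + skipCost});  // French line is spurious
        }
    }
    // Walk back from the final cell to recover the chosen moves.
    std::vector<Step> path;
    std::size_t i = numEnglish, j = numFrench;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            best[i][j] == best[i - 1][j - 1] + matchCost(i - 1, j - 1)) {
            path.emplace_back(static_cast<int>(i - 1), static_cast<int>(j - 1));
            --i; --j;
        } else if (j == 0 ||
                   (i > 0 && best[i][j] == best[i - 1][j] + skipCost)) {
            path.emplace_back(static_cast<int>(i - 1), -1);
            --i;
        } else {
            path.emplace_back(-1, static_cast<int>(j - 1));
            --j;
        }
    }
    std::reverse(path.begin(), path.end());
    return path;
}
```

The result is optimal only relative to the costs supplied, which mirrors the guarantee stated above: the alignment is the best one given the user-provided probabilities and thresholds, not necessarily the linguistically correct one.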

This code was originally intended for use in aligning the "Hansards": proceedings of the Canadian parliament. That explains the use of "French" and "English" in the code. Despite this notation, the code makes no explicit assumptions about the identity of the underlying languages.

A word about anchors

Align looks for (but does not insist on) special "anchors" in the files. These are lines of the form

=t= [some anchor label] =t=

The program guarantees that in the resulting alignment, anchors with the same label in the two files will be aligned with each other. The actual spelling of the =t= alignment symbol is a run-time parameter. (The intelligent thing to do, of course, is to use a symbol which is not a word in either language.)
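The anchor format can be illustrated with a small recognizer. This is a sketch under assumptions, not Align's own parsing code: the function name anchorLabel is hypothetical, the marker is passed in as a parameter to reflect that its spelling is configurable, and the label is assumed to be whatever lies between the two markers with surrounding spaces trimmed. It uses std::optional, so it needs C++17.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not part of Align): if `line` has the form
// "<marker> label <marker>", returns the label; otherwise returns nullopt.
std::optional<std::string> anchorLabel(const std::string& line,
                                       const std::string& marker) {
    const std::size_t m = marker.size();
    if (m == 0 || line.size() < 2 * m) return std::nullopt;
    // The line must both start and end with the marker.
    if (line.compare(0, m, marker) != 0) return std::nullopt;
    if (line.compare(line.size() - m, m, marker) != 0) return std::nullopt;

    // Everything between the two markers, with surrounding spaces trimmed.
    const std::string inner = line.substr(m, line.size() - 2 * m);
    const std::size_t first = inner.find_first_not_of(' ');
    if (first == std::string::npos) return std::string();  // empty label
    const std::size_t last = inner.find_last_not_of(' ');
    return inner.substr(first, last - first + 1);
}
```

For instance, with the marker "=t=", the line "=t= chapter 1 =t=" yields the label "chapter 1", while an ordinary sentence yields no label at all.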

Running the program

The program is meant to be compiled within a Unix-type environment. A makefile is provided. Run the program with no arguments to print its usage message.

Copyright notice

Copyright (C) 2000, Carnegie Mellon University and Adam Berger. All rights reserved.

This software package, including the documentation and makefile, is made available for research purposes only. It may be redistributed freely for this purpose, in full or in part, provided that this entire copyright notice is included on any copies of this software and applications and derivations thereof.

This software is provided on an "as is" basis, without warranty of any kind, either expressed or implied, as to any matter including, but not limited to warranty of fitness of purpose, or merchantability, or results obtained from use of this software.

You are welcome to send email to me, the developer, at aberger@cs.cmu.edu, with bug reports or feature requests. I can't promise to address your concern or even reply to your email, but I will try. This was not originally intended for public consumption, but I decided to make it available to the research community after I received a request for the code.