AMTA 2006 - Workshop:  MT Evaluation:  the Black Box in the Hall of Mirrors

August 12, 2006

Boston Marriott



Motivation:   Evaluation of Machine Translation (MT) has proven elusive throughout the entire history of computing.  This problem arises in part from the inherent difficulty of assessing human language translation in general and in part from the different and often competing evaluation goals of the MT stakeholders.  In the last five years, the advent of rapidly applied, automatic evaluation measures has shown promise in creating the expectation of a common benchmark for all MT approaches.  But the coverage, consistency, and reliability of those measures remain problematic for much of the MT community.  This workshop will invite its participants to engage in a series of activities intended to widen horizons, challenge deeply held views, and take up questions that may appear wildly outside what has heretofore been the bounds of the discourse.  The goal is to increase awareness of the issues and difficulties, confront the presumptions that may have implicitly hindered us so far, and motivate us to come up with new evaluation approaches that are more actionable, more accurate, and more informative.


Who Should Attend:  All MT users, procurers, researchers, developers, investors, and anyone else who must make decisions about some aspect of machine translation approaches, implementation, or lifecycle.  We welcome persons new to the issues of MT evaluation, as well as those who have experience in designing and conducting evaluations.


Participation and Format:   This workshop will include invited speakers, presentation of brief position papers by participants, and hands-on exercises in evaluation, all intended to reveal and highlight extreme positions.  All participants will be expected to take part in the exercises, which will probe some of the inherent difficulty in evaluation while comparing contemporary methods over the same translation corpora.  We also encourage each participant to prepare a position statement on evaluation issues such as those listed below, in the form of a brief PowerPoint presentation (3-4 slides).  We will allow presentation of these, time permitting, and will make them available to the participants online.  Participants may address any specific problem in evaluation, but we would especially like them to engage one of the following questions:


  • Automatic vs. Human Judgment - which is the lesser of evils?
  • Do current methods tell you anything you really need to know?



John White (Systran Software)

Keith Miller (MITRE Corporation)


Advisory Committee:

Flo Reeder (MITRE)

Michelle Vanni (Army Research Laboratory)

Kathi Taylor (US Government)

Elaine Marsh (Naval Research Center)


Invited Speakers and/or Panelists:

The Workshop is pleased to offer a point-counterpoint by two associates of the MITRE Corporation, Florence Reeder and John Henderson.  Each has a unique and contrasting perspective on the efficacy of linguistic judgment-based evaluation and empirical evaluation.  Their presentations will demonstrate the divergence of opinion on MT evaluation techniques, providing ammunition for the events to follow in the workshop.



08:30 - 08:45

Welcome and Overview

08:45 - 09:45

Point and Counterpoint - Linguistics versus Statistics

09:45 - 10:00

Hands-on Task - Description and Data

10:00 - 10:15


10:15 - 11:30

Presentation of Position Papers

11:30 - 12:00

Hands-on Exercise

12:00 - 13:30


13:30 - 14:30

Hands-on Exercise (continued)

14:30 - 15:00

Team Compilation and Analysis

15:00 - 15:15


15:15 - 16:45

Exercise readouts and synthesis

16:45 - 17:00

Conclusions:  Future work