Names and their transliterations are information-bearing units in finished translation products, and they are important for MT embedded in applications such as entity extraction, information retrieval, data mining, and identity matching. We will discuss the problems that names present in these environments: deciding when translation, transliteration, or some combination of the two is appropriate; dealing with the complexity of name structures and naming practices in different languages; issues in character mapping; selecting from among competing transliteration schemes; and adopting and enforcing standards. We will also discuss the downstream consequences of name processing in search applications that embed MT, automatic transliteration, entity extraction, methods for collecting training data (e.g., gazetteers), and approaches to evaluation.
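To make the "competing transliteration schemes" problem concrete, here is a minimal sketch of character-mapping transliteration. The two scheme tables below are deliberately tiny, illustrative inventions (real schemes such as Buckwalter or ALA-LC are far more complete), but they show how the same written name can yield different romanizations depending on the scheme chosen, which is exactly what complicates downstream identity matching.

```python
# Toy character maps for two simplified, hypothetical Arabic romanization
# schemes. Real schemes (Buckwalter, ALA-LC, etc.) cover the full script.
SCHEME_A = {"م": "m", "ح": "h", "د": "d"}   # plain-ASCII style
SCHEME_B = {"م": "m", "ح": "ḥ", "د": "d"}   # diacritic-rich style

def transliterate(word, scheme):
    # Map each character; pass through anything the scheme does not cover.
    return "".join(scheme.get(ch, ch) for ch in word)

name = "محمد"  # "Muhammad", written without short vowels
print(transliterate(name, SCHEME_A))  # -> "mhmd"
print(transliterate(name, SCHEME_B))  # -> "mḥmd"
```

Even this toy example produces two different strings for one name; an identity-matching system must either normalize both to a canonical form or match across schemes.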
Presenters: Keith J. Miller, The MITRE Corporation, and Sherri Condon, The MITRE Corporation
Keith J. Miller, PhD, has spent several years working on various large-scale name matching systems. His current research centers on multicultural name matching, machine translation, embedded HLT systems, and component- and system-level evaluation of HLT systems.
Sherri Condon received her Ph.D. in Linguistics from the University of Texas at Austin. In addition to several years of work in multilingual name matching and cross-script name matching, she is a researcher in discourse/dialogue, entity extraction, and evaluation of machine translation and dialogue systems.
This nuts-and-bolts, multi-presenter half-day tutorial addresses how to integrate MT. The presenters are application developers and integrators who have worked with multiple MT systems; they will discuss how best to incorporate MT into workflows for different environments and objectives.
Organizer: Jen Doyon, The MITRE Corporation
Presenters: Bill McClellan, Booz Allen Hamilton, Rod Holland, MITRE, Jay Carlson, MITRE, David Day, MITRE, Patrick Crago, Multi-Threaded Inc., and David Palmer, Virage
Bill McClellan has managed projects in automated document exploitation, overseeing the integration of newly available technology into end-to-end, high-volume document exploitation systems for foreign language data. He has accompanied the technology in field trials, and he works with other language technology experts in automated natural language processing (NLP) to deliver data management systems that incorporate new foreign language technology appropriate to the unique problems his customers face.
Rod Holland is a Senior Principal Artificial Intelligence Engineer leading advanced development and applied research projects at The MITRE Corporation. While at MITRE, he has served as Principal Investigator on several DARPA, ONR, LASER ACTD, and MITRE Technology Program efforts, most recently the TrIM and Clipper programs. He is particularly interested in the process of turning the products of research into fielded systems.
Jay Carlson is a Principal Network Systems and Distributed Systems Engineer at MITRE. He works with Rod Holland to develop integrated tool suites for cross-language information retrieval (CLIR) and instant messaging. Rod, Jay, and team have been building and fielding prototype applications that use machine translation (MT) as an embedded component. They will be discussing two such embedded MT prototypes: 1) Translingual Instant Messaging (TrIM) and 2) Clipper (a CLIR system).
David Day (Ph.D., University of Massachusetts) has worked in the area of multilingual information extraction for the past decade. In the past year and a half Dr. Day has led the MITRE "Flex" project, which has integrated a range of NLP tools and resources within a fully instrumented environment called "C/Flex." The goal of this project is to study whether, and how, these tools can enhance linguist productivity in realistic production environments.
Patrick Crago, President of Multi-Threaded Inc., is a recognized Subject Matter Expert (SME) in multilingual data exploitation and a pioneer in data “constructuring,” the process of bringing structure to unstructured data. Patrick leverages this knowledge to provide strategic investment and systems engineering consulting to a number of Federal Government customers. He has participated in the prototyping and development of a number of large-scale, multilingual data exploitation systems for the Federal Government.
David D. Palmer, Ph.D. is the Vice President of Research for Virage, where he leads the corporate research program focused on developing and integrating multilingual speech and language technologies into automated video processing. He received his Ph.D. in Electrical Engineering from the University of Washington. He will be discussing work at Virage integrating several MT engines into a real-time news monitoring system that currently captures live broadcasts in Arabic, Mandarin Chinese, Russian, Spanish, French, and Italian.
Knowledge about typical word usage, plausible word senses, understandable phrase structure, and the most likely sentence meaning is a crucial element of accurate translation. Such translation-relevant knowledge can be captured in models based on large collections of bilingual and monolingual texts, including dictionaries. Combining such models into translation systems is the subject of much current research in statistical machine translation.
"Statistical Machine Translation," or SMT, has grown from the confluence of several lines of research. Seen from this perspective, SMT essentially involves three processes: modeling, parameter estimation, and decoding.
In this tutorial, we will discuss these techniques.
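The combination of models described above (a translation model learned from bilingual text and a language model learned from monolingual text) is classically formalized as the noisy-channel decomposition, sketched here in the standard notation with $f$ the source sentence and $e$ a candidate translation:

```latex
\hat{e} \;=\; \arg\max_{e} \, P(e \mid f)
       \;=\; \arg\max_{e} \, P(f \mid e)\, P(e)
```

Here $P(e)$ is the language model, $P(f \mid e)$ is the translation model, and carrying out the $\arg\max$ search is the decoding step.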
Presenters: David Smith, Johns Hopkins University and Charles Schafer, Johns Hopkins University
David Smith is the author of papers on syntactic parsing, morphological disambiguation, and feature selection for statistical machine translation. His research interests include machine translation, multilingual parsing, and machine learning techniques for natural language tasks. As lead programmer for the Perseus Digital Library Project at Tufts University, he worked on morphological analysis, information extraction, and event detection. David is currently a PhD student in computer science; before his turn to CS, he was trained in the translation practices of classical philology.
David's website: http://www.cs.jhu.edu/~dasmith
Charles Schafer's technical papers touch on many aspects of statistical machine translation, including lexical and syntactic transduction, corpus alignment, and word sense induction. His research interests in machine translation and multilingual natural language processing are primarily aimed at addressing problems in the translation of low-resource languages. As a postdoctoral fellow in Natural Language Processing at JHU, Charles is extending and applying many techniques for building morphological analyzers, basic syntactic analyzers, named-entity recognizers, and cross-language name matching tools to a large set of languages, including many European, North Indian, Dravidian, and Turkic languages.
Charles's website: http://www.cs.jhu.edu/~cschafer
Charles Schafer and David Smith are members of the Johns Hopkins University Center for Language and Speech Processing Natural Language Processing Laboratory.
Website for the Center for Language and Speech Processing: http://www.clsp.jhu.edu/index.shtml
The existence of dialects in any language poses a challenge for MT. The problem is particularly interesting and challenging in Arabic. Realistic and practical approaches to processing Arabic must account for dialectal usage, since it is so pervasive. In this tutorial, designed for both computer scientists and linguists, we will highlight how dialectal phenomena diverge from the standard and why they pose challenges to MT and NLP. We will present background issues for standard Arabic, then provide a high-level view of the dialects, including aspects of interest for MT. Lastly, we will focus on dialectal morphology and dialectal syntactic parsing. Throughout, we will refer to available resources, analyze contrasts with standard Arabic and English, discuss annotation standards, and provide links to recent publications and toolkits. No knowledge of Arabic is required. Nizar Habash’s Arabic NLP tutorial (www.ccls.columbia.edu/cadim/presentations.html) will be reviewed in the first quarter of the tutorial.
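As a taste of why dialectal morphology complicates MT, consider that Egyptian Arabic negates verbs with a circumfix ma-...-sh (unlike the preposed negation of standard Arabic) and marks the indicative with a b- prefix. The sketch below is a deliberately simplified, illustrative segmenter over Latin-transliterated forms; the affix lists are real features of the dialect, but any practical analyzer needs far richer rules and an actual lexicon.

```python
# Illustrative, greatly simplified affix stripping for Egyptian Arabic verbs
# in Latin transliteration. Affixes shown are genuine dialect features;
# everything else about this toy analyzer is a simplification.
PREFIXES = ["ma", "b", "ha"]   # negation ma-, indicative b-, future ha-
SUFFIXES = ["sh"]              # -sh, the other half of the negation circumfix

def segment(word):
    """Strip known dialectal affixes, outermost first, returning morphemes."""
    morphemes, stem = [], word
    for p in PREFIXES:
        if stem.startswith(p):
            morphemes.append(p + "+")
            stem = stem[len(p):]
    tail = []
    for s in SUFFIXES:
        if stem.endswith(s):
            tail.append("+" + s)
            stem = stem[:-len(s)]
    return morphemes + [stem] + tail

# "mabyiktibsh" ~ "he does not write"
print(segment("mabyiktibsh"))  # -> ['ma+', 'b+', 'yiktib', '+sh']
```

The standard Arabic equivalent ("lā yaktubu") shares none of these affixes, which is one reason tools trained on standard Arabic degrade on dialectal text.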
Presenters: Mona Diab, Columbia University and Nizar Habash, Columbia University
Mona Diab, PhD, Associate research scientist at the Center for Computational Learning Systems, Columbia University, pursues research on word sense disambiguation, automatic acquisition of natural language resources such as dictionaries and taxonomies, unsupervised learning methods, lexical semantics, cross-language knowledge induction, Arabic NLP and processing tools, dialect modeling, and Arabic syntactic and semantic parsing. Dr. Diab was a senior member of the 2005 JHU summer workshop on Parsing Arabic Dialects. Mona’s website: http://www.cs.columbia.edu/~mdiab
Nizar Habash, PhD, Associate research scientist at the Center for Computational Learning Systems at Columbia University, pursues research on machine translation, natural language generation, lexical semantics, morphological analysis, parsing, and computational modeling of Arabic dialects. Dr. Habash co-chaired the Workshop on MT for Semitic Languages (MT Summit 2003), is vice-president of the Semitic Language Special Interest Group of the Association for Computational Linguistics, and serves as research program co-chair for AMTA 2006. Nizar's website: http://www.nizarhabash.com
Drs. Diab and Habash served as co-chairs for the Workshop on Computational Approaches to Semitic Languages (ACL 2005), with Kareem Darwish, and co-founded the Columbia Arabic Dialect Modeling (CADIM) group, with Owen Rambow. CADIM website: http://www.ccls.columbia.edu/cadim
Authors of technical content are often not aware that style choices have consequences for their “consumers,” especially translators. Automating translation becomes a distant dream when the source text is itself confusing. This tutorial focuses on human- and machine-translatability in order to define actionable goals and concrete methods for producing translatable content. The strategies discussed yield improvements for both human and machine translation processes. We cover approaches from simplified or “controlled” language to those which emphasize consistent style, reuse, and process awareness. We help writers focus on concrete techniques that can bring immediate improvements. A relatively new crop of software tools, which we will demonstrate and discuss, now provides effective means for monitoring translatability.
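The "software tools for monitoring translatability" mentioned above can be imagined as automated checks against controlled-language rules. The sketch below is a toy of that idea: the word-count threshold and the list of ambiguity-prone trigger words are invented for illustration and are not taken from any published controlled-language standard or commercial tool.

```python
import re

# Toy translatability checks in the spirit of controlled-language rules.
# MAX_WORDS and AMBIGUOUS are illustrative choices, not a real standard.
MAX_WORDS = 25
AMBIGUOUS = {"it", "this", "which"}  # pronouns prone to unclear antecedents

def check_sentence(sentence):
    """Return a list of human-readable translatability warnings."""
    words = re.findall(r"[A-Za-z']+", sentence)
    issues = []
    if len(words) > MAX_WORDS:
        issues.append(f"long sentence ({len(words)} words)")
    for w in words:
        if w.lower() in AMBIGUOUS:
            issues.append(f"possibly ambiguous reference: '{w}'")
    return issues

for issue in check_sentence("This may fail, which is bad."):
    print(issue)
```

Real translatability checkers add many more rules (passive voice, invalid terminology, inconsistent phrasing), but the workflow is the same: flag source-text constructions before they reach human or machine translation.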
Presenter: Mike Dillinger, Consultant
Mike Dillinger, PhD, is a content management consultant who helps language technology vendors design, develop, and promote new functionality for content management tools; his affiliations include industry leaders such as Logos, Global Words Technologies, and Spoken Translation. An experienced technical writer, reviser, translator, and interpreter, he has over 20 years of related research and teaching experience spanning four continents.
After nearly a decade in which "ontology" has been a bad word in the natural language processing (NLP) and Artificial Intelligence research communities, there are encouraging signs that the pendulum is swinging back. Increasingly, MT and NLP researchers are starting to use very shallow semantics to help statistical MT and NLP systems overcome the performance ceilings that many of them seem to have reached.
This tutorial outlines the principal aspects of ontologies as they pertain to MT and NLP. It defines and describes ontologies; illustrates the uses of ontologies in MT in general; outlines the problems encountered in building large-scale ontologies as required for MT and NLP; lists five ontology construction paradigms that each provide a different methodology and result; discusses the MT-oriented ontology construction methodology, with examples, in more detail; and concludes with a review of some automated ontology construction techniques.
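To ground the idea of "using ontologies in MT," here is a minimal sketch of an is-a concept hierarchy paired with a sense-tagged bilingual lexicon. All concept names and translations below are hypothetical examples constructed for illustration, not drawn from any real MT ontology, but they show the basic mechanism: once an ambiguous source word is resolved to a concept, the ontology supports generalization over that concept and the lexicon supplies a sense-appropriate target word.

```python
# A tiny hypothetical is-a ontology fragment: each concept points to its parent.
IS_A = {
    "bank-institution": "organization",
    "bank-riverside": "geographic-feature",
    "organization": "entity",
    "geographic-feature": "entity",
}

def ancestors(concept):
    """Walk is-a links upward to the root, collecting ancestor concepts."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

# A sense-tagged lexicon then selects the target word per concept
# (Spanish glosses shown; the sense split is the point, not the words).
TRANSLATIONS_ES = {"bank-institution": "banco", "bank-riverside": "orilla"}

print(ancestors("bank-institution"))      # -> ['organization', 'entity']
print(TRANSLATIONS_ES["bank-riverside"])  # -> orilla
```

Large-scale versions of this structure (hundreds of thousands of concepts, acquired partly by the automated techniques the tutorial reviews) are what make ontology construction for MT hard.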
Presenter: Eduard Hovy, Information Sciences Institute, University of Southern California
Eduard Hovy leads the Natural Language Research Group at the Information Sciences Institute (ISI) of the University of Southern California (USC), where he serves as Deputy Director of the Intelligent Systems Division; he is also a research associate professor of Computer Science at USC and an Advisory Professor at Beijing University of Posts and Telecommunications. He completed a Ph.D. at Yale University in 1987. His research focuses on information extraction, summarization, question answering, constructing large lexicons and ontologies, machine translation, and digital government. Dr. Hovy serves in an advisory capacity to funders of NLP research in the US and EU and has authored or co-edited five books and over 170 technical articles. He is past president of the Association for Computational Linguistics (ACL) and the International Association for Machine Translation (IAMT).