AMTA 2018 | Tutorial | Corpora Quality Management for MT – Practices and Roles
Tutorial Presenters: Nicola Ueffing (eBay MT Science), Pete Smith (University of Texas Arlington) and Silvio Picinini (eBay Localization)
Target audience: anyone involved in deploying MT who is interested in the quality of the corpora that underlie that deployment. Roles include managers, linguists, engineers, and scientists.
The quality of the corpora used to train MT systems has not been a prominent topic of discussion compared with MT technology itself. Most research uses the same well-established corpora so that results can be reproduced. However, the corpora strongly influence the quality of MT output, and neural MT relies more heavily on data quality than previous technologies did. The time has therefore come to take a deeper look at corpora quality. We will examine it from both a scientific and a linguistic perspective, exploring current and future roles in the evolving MT landscape.
Drawing on eBay's experience, participants will learn about and discuss:
- Best practices on Corpora Management from a Science team perspective
- Automatic ways to locate and clean up issues – with examples
- Best practices on Corpora Management from a Localization team perspective
- Engineering and Linguistic Cleanup – with examples
- Creating high-quality in-domain content via post-editing
- Other industry best practices on Corpora Management – with examples
- New metrics for corpora quality
Because data curation of corpora may become a task for Language Professionals, participants will also learn from the University of Texas Arlington what the profile of a Language Professional could be:
- What a Language Professional's education and skills should look like
- Experience from bringing Corpora Management to L10N students