Coding for Language Communities

The unsupervised free CAT for low resource languages

Building a pipeline for the communities

February 1, 2020
3:00 PM – 4:00 PM

AW1.120

Alberto Massidda

We present: 1) a full pipeline for unsupervised machine translation training (making use of monolingual corpora) for languages with low available resources; 2) a translation server making use of that unsupervised MT with an HTTP API compatible with Moses toolkit, a once prominent MT system; 3) a Docker packaged version of the EU funded free Computer Aided Translation (CAT) tool MateCAT for ease of deployment. This full translation pipeline enables a non technical user, speaking a non-FIGS language for which there is scarcity of parallel corpora, to start translating documents and software following translation industry standards.

Localization within community suffers from the fragmentation of technologies (too wide wedge between commercial Computer Aided Translation tools and free ones), available language resources (making difficult to train a Machine Translation) and lack of clear and robust pipelines to get started. Low resource language communities suffer the most, since MT systems require training corpora of millions of words and industry has settled to expecting the massive corpora available to FIGS (French, Italian, German, Spanish) languages. Moreover, the community suffers from a lack of adoption of established technologies and workflows, leading to reinventing the wheel and suboptimal efforts’ outcomes. Today we would like to present a connector for the implementation of an unsupervised MT (made by Artetxe et al.), that claims a BLEU of 26 on limited language resources (which is enough as a support system) integrated with MateCAT, an industry level, free, web based tool funded by EU, in order to provide a more viable alternative to resorting to Google Translate and commercial LSPs.

Additional information

Type	devroom

More sessions

2/1/20	Lexemes in Wikidata Coding for Language Communities Lydia Pintscher AW1.120 Wikidata, Wikimedia's knowledge base, has been collecting general purpose data about the world for 7 years now. This data powers Wikipedia but also many applications outside Wikimedia, like your digital personal assistant. In recent years Wikidata's community has also started collecting lexicographical data in order to provide a large data set of machine-readable data about words in hundreds of languages. In this talk we will explore how Wikidata enables thousands of volunteers to describe their ...
2/1/20	Nuspell: version 3 of the new spell checker Coding for Language Communities Sander van Geloven AW1.120 Nuspell version 3 is a FOSS checker that is written in pure C 17. It extensively supports character encodings, locales, compounding, affixing and complex morphology. Existing spell checking in web browsers, office suits, IDEs and other text editors can use this as a drop-in replacement. Nuspell supports 90 languages, suggestions and personal dictionaries.
2/1/20	AMENDMENT Weblate! Localize your project the developer way: continously, flawlessly, community driven, and open-source Coding for Language Communities Michal Čihař AW1.120 Please note that this talk will now be given by Michal Čihař instead of Václav Zbránek. The presentation will show you how to localize your project easily with little effort, open-source way. Why we started Weblate? We said no to repetitive work, no to manual work with translation files anymore. Weblate is unique for its tight integration to VCS. Set it up once and start engaging the community of translators. More languages translated means more happy users of your software. Be like ...
2/1/20	Open Edge Hardware and Software for Natural Language Translation and Understanding Coding for Language Communities AW1.120 The last half decade has seen a major increase in the accuracy of deep learning methods for natural language translation and understanding. However many users still interact with these systems through proprietary models served on specialized cloud hardware. In this talk we discuss co-design efforts between researchers in natural language processing and computer architecture to develop an open-source software/hardware system for natural language translation and understanding across languages. ...
2/1/20	Poio Predictive Text Coding for Language Communities Peter Bouda AW1.120 The Poio project develops language technologies to support communication in lesser-used and under-resourced languages on and with electronic devices. Within the Poio project we develop text input services with text prediction and transliteration for mobile devices and desktop users to allow conversation between individuals and in online communities.

FOSDEM 2020

2/1/20 – 2/2/20

Event

FOSS Events

Created by @foss_events 25 Follower

Event Calendar