Leveling the NLP playing field in Africa through indigenous language datasets
Natural Language Processing (NLP) is revolutionising communication in the 21st century, particularly in digital translation and machine-readable text applications. However, indigenous African languages are severely underrepresented in these applications because good-quality, open African language datasets are rare to non-existent. Unless organisations proactively support the development of such datasets, this gap will widen the digital divide in Africa.
This workshop will discuss how NLP researchers and engineers in Africa collected, developed and curated datasets for underrepresented African languages while ensuring the data is representative, useful, and public. We will cover datasets and the tasks they support, such as machine translation with Yoruba from Nigeria, sentiment analysis in Tunizi Arabizi, automatic speech recognition in Wolof from Senegal, and classification in Swahili.
The data collection was facilitated through AI4D and Zindi.