LOT Summer School 2018

Language, computation and technology

Yves Scherrer



Title of the course:

Language technology for low-resource languages



Students should be familiar with the basic concepts, tasks and algorithms used in Natural Language Processing, e.g. what a parser or a tagger is, or what word alignment is about.

Course description

A large part of recent research in language technology (LT) is restricted to a small number of languages. While more and more datasets are created, made available, and used for English and a few other languages, the large majority of the world's languages are hardly ever the object of LT research. In this course, we will introduce and discuss several definitions of so-called 'low-resource languages', and we will examine how LT systems (such as taggers or parsers) can be developed for such languages despite the challenging data situation. In particular, we will discuss how linguistic annotations or models can be transferred from a resource-rich to a resource-poor language. In this setting, we have to distinguish cases where the two languages are etymologically closely related from cases where they are not. We will also see how these methods can be applied to 'special' types of low-resource languages such as historical language varieties, dialects, and sociolects, whose automatic processing faces similar challenges.

Day-to-day program

Monday: Definitions of low-resource languages in linguistics and computational linguistics, overview of the main language technology applications and their resource requirements

Tuesday: Annotation projection using parallel corpora

Wednesday: Delexicalisation and relexicalisation approaches

Thursday: Closely related languages and language varieties - definitions, problems and solutions

Friday: Multilingual modelling and zero-shot learning

Besides these teaching sessions, I intend to arrange informal discussion meetings with interested students, in particular students of the LCT program.

Reading list

Background and preparatory readings:

Yulia Tsvetkov (2017): Opportunities and challenges in working with low-resource languages. (Slides, Part 1)

Course readings:

Lecture 1:

META-NET Strategic Research Agenda for Multilingual Europe 2020. (Sections 1, 2, and 4)

Lecture 2:

Lecture 3:

Lecture 4:

Delphine Bernhard & Anne-Laure Ligozat (2013): Hassle-free POS-Tagging for the Alsatian Dialects. In: Marcos Zampieri & Sascha Diwersy (eds.): Non-Standard Data Sources in Corpus-Based Research, Shaker, ZSM Studien.

Lecture 5: