LOT Winter School 2018

Statistical models of language variation with R

Natalia Levshina




Course information

Level: Advanced

Course description

Sociolinguistics, psycholinguistics, and cognitive and usage-based linguistics have paid a great deal of attention to alternations, variants, near synonyms and other types of functionally related linguistic units, e.g. the English double-object and prepositional dative, Dutch doen and laten, or the use or non-use of post-vocalic /r/, as in ‘fourth floor’. Such variation is normally probabilistic and multifactorial, so modelling it requires advanced multivariate methods. This practical hands-on course teaches how to model this kind of variation with R, a popular open-source environment for statistical computing. The methods covered include frequentist and Bayesian binomial and multinomial logistic regression, mixed-effects models, conditional inference trees, random forests and naïve discriminative learning. Participants will learn how to fit such models, check their assumptions in order to obtain meaningful results, and interpret the results with the help of various visualization techniques. Special attention will be paid to the advantages and disadvantages of each method for particular research tasks.

Day-to-day programme

Monday: Introduction. Binomial/binary logistic regression.

Tuesday: Binomial (continued) and multinomial logistic regression.

Wednesday: Logistic regression with mixed effects.

Thursday: Conditional inference trees and random forests.

Friday: Naïve discriminative learning. Conclusions: pros and cons of each method.

Reading list


