LOT Winter School 2018

Statistical models of language variation with R

Natalia Levshina




Course information

Level: Advanced

Course description

Sociolinguistics, psycholinguistics, and cognitive and usage-based linguistics have paid considerable attention to alternations, variants, near-synonyms and other types of functionally related linguistic units, e.g. the English double-object and prepositional datives, Dutch doen and laten, or the use or non-use of post-vocalic /r/, as in ‘fourth floor’. Such variation is normally probabilistic and multifactorial, so modelling it requires advanced multivariate methods. This practical, hands-on course teaches how to model this kind of variation with R, a popular open-source environment for statistical computing. The methods covered include frequentist and Bayesian binomial and multinomial logistic regression, mixed-effects models, conditional inference trees, random forests and naïve discriminative learning. Participants will learn how to fit such models, check their assumptions in order to obtain meaningful results, and interpret the results with the help of various visualization techniques. Special attention will be paid to the advantages and disadvantages of each method for particular research tasks.

Day-to-day programme

Monday: Introduction. Binomial/binary logistic regression.

Tuesday: Binomial (continued) and multinomial logistic regression.

Wednesday: Logistic regression with mixed effects.

Thursday: Conditional inference trees and random forests.

Friday: Naïve discriminative learning. Conclusions: pros and cons of each method.

Reading list


Levshina, N. 2015. How to Do Linguistics with R: Data exploration and statistical analysis. Amsterdam: John Benjamins. Chapter 12.


Divjak, D. & A. Arppe. 2013. Extracting prototypes from exemplars. What can corpus data tell us about concept representation? Cognitive Linguistics 24(2): 221–274.

Levshina, N. 2016. When variables align: A Bayesian multinomial mixed-effects model of English permissive constructions. Cognitive Linguistics 27(2): 235–268.


Johnson, D. E. 2009. Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass 3(1): 359–383.

Levshina, N., D. Geeraerts & D. Speelman. 2013. Towards a 3D-Grammar: Interaction of linguistic and extralinguistic factors in the use of Dutch causative constructions. Journal of Pragmatics 52: 34–48.

Wolk, Ch., J. Bresnan, A. Rosenbach & B. Szmrecsanyi. 2013. Dative and genitive variability in Late Modern English: Exploring cross-constructional variation and change. Diachronica 30(3): 382–419.


Szmrecsanyi, B., J. Grafmiller, B. Heller & M. Röthlisberger. 2016. Around the world in three alternations: Modeling syntactic variation in varieties of English. English World-Wide 37(2): 109–137.

Tagliamonte, S. & R. H. Baayen. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2): 135–178.


Baayen, R. H. 2011. Corpus linguistics and naive discriminative learning. Brazilian Journal of Applied Linguistics 11: 295–328.

Baayen, R. H., A. Endresen, L. A. Janda, A. Makarova & T. Nesset. 2013. Making choices in Russian: Pros and cons of statistical methods for rival forms. Russian Linguistics 37: 253–291.