Corpus linguistics & vocabulary learning



Recently, one of my students asked for some pointers on corpus linguistics and vocabulary learning. Here's my impromptu top 5 list.

Sinclair, J. (2003). Reading concordances. An introduction. Harlow: Longman.

This is a great resource to fully understand the implications of using concordances to derive (linguistic) meaning.
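For readers who have never seen one, a concordance simply lines up every occurrence of a search word (the "node") with its immediate context, so that patterns of meaning become visible down the page. The following is only a minimal sketch in Python, not Sinclair's method or any particular tool's implementation; the function name `kwic`, the `width` parameter and the sample sentence are all made up for illustration.

```python
import re

def kwic(text, node, width=30):
    """Return keyword-in-context lines for every occurrence of `node`.

    `width` is the number of characters of context kept on each side
    (an arbitrary choice for this sketch).
    """
    lines = []
    # Whole-word, case-insensitive match of the node word or phrase.
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the node lines up vertically.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, for illustration only.
sample = (
    "The committee set in motion a review. Rot had set in long before "
    "the inspectors arrived, and a chill set in by evening."
)

concordance = kwic(sample, "set in", width=25)
for line in concordance:
    print(line)
```

Reading down the aligned node column is what lets you compare the left and right contexts of each hit, which is the starting point for the kind of interpretation the book walks through.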

Leńko-Szymańska, A. (2015). The English Vocabulary Profile as a benchmark for assigning levels to learner corpus data. Learner corpora in language testing and assessment, 115-140.

Interesting research that discusses the use of the English Vocabulary Profile to sort ICCI learners into levels.

Schmitt, N., Cobb, T., Horst, M., & Schmitt, D. (2017). How much vocabulary is needed to use English? Replication of Van Zeeland & Schmitt (2012), Nation (2006), and Cobb (2007). Language Teaching, 50(2), 212–226.

Excellent paper that uses corpus linguistics research methods to assess how much vocabulary learners need in order to use English.
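The basic idea behind this kind of question is lexical coverage: what proportion of the running words in a text a learner already knows. Here is a rough sketch of that arithmetic in Python. It is not the procedure used in the paper: it counts plain word types rather than word families, and the function names (`tokenize`, `coverage`, `words_needed`), the crude tokenizer and the toy sentence are my own assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Very crude tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def coverage(text, known_words):
    """Proportion of running words (tokens) in `text` found in `known_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)

def words_needed(text, target=0.95):
    """Number of most frequent word types needed to reach `target` coverage.

    Uses the text's own frequency list, so this is only a toy version of
    working from a frequency-ranked reference list.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / len(tokens) >= target:
            return rank
    return len(counts)

# Toy example, invented for illustration.
toy_text = "the cat sat on the mat the end"
print(coverage(toy_text, {"the", "cat"}))
print(words_needed(toy_text, target=0.5))
```

With real data you would swap the toy sentence for a corpus and the known-word set for a frequency-banded word list; the calculation itself stays the same.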

Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research shows. Language Learning, 64(4), 913–951.

Great paper that discusses the many sides of vocabulary knowledge, and a good starting point for vocabulary research in language education.

Jones, M. & Durrant, P. (2010). What can a corpus tell us about vocabulary teaching materials? The Routledge handbook of corpus linguistics.

Hopefully, this chapter will help you bridge the gap between corpora as resources and language teaching. Very practical stuff. By the way, the whole Routledge Handbook of Corpus Linguistics is a superb resource.








Graphic Online Language Diagnostic



The Graphic Online Language Diagnostic (“GOLD”) is a corpus tool that allows language educators to submit and analyze language data. GOLD was developed by the Center for Advanced Language Proficiency Education and Research (“CALPER”) at The Pennsylvania State University (“PSU”), University Park, PA, USA under a grant from the U.S. Department of Education (Title VI, P229A060003 and P229A020010).


SACODEYL corpora #corpuslinguistics in The Routledge Handbook of Language Learning and Technology




Corpus types and uses
B Murphy, E Riordan – The Routledge Handbook of Language Learning and …, 2016
… 2008). Another is the SACODEYL corpus, which includes transcribed interviews with
British, German, French, Italian, Spanish, Lithuanian and Romanian adolescents
between 13 and 18 years of age (Hoffstaedter and Kohn 2009). …

The Routledge Handbook of Language Learning and Technology
F Farr, L Murray – 2016
child Oslo Multilingual Corpus Open Parallel Corpus personal computer personal learning …

Spoken language corpora and pedagogical applications
A Caines, M McCarthy, A O’Keeffe – The Routledge Handbook of Language Learning …, 2016
… Focusing on an innovative tool developed to make corpus use easier to access for language
teaching, Farr (2010) details the potential of the SACODEYL (System Aided Compilation and
Open Distribution of European Youth Language, a European Commission–funded project …

Written language corpora and pedagogical applications
A Chambers – The Routledge Handbook of Language Learning and …, 2016
… 241–245), based on Mur Dueñas (2009), while the other focuses on intermediate learners of
EAP (pp. 260–263), based on Boulton (2010). Notes 1 http://www.um.es/sacodeyl (accessed
27 June 2014). 2 http://www.um.es/backbone (accessed 27 June 2014). 3 http://www. …

Non-obvious meaning in CL and CADS #cl2015


Plenary session: Alan Partington
Non-obvious meaning in CL and CADS: from ‘hindsight post-dictability’ to sweet serendipity

Chair: Amanda Potts

Introspection & intuition

Processes of inference from the linguistic trace left by speakers/writers

Shared meaning

Idiom principle

Complexity of common grammatical items

Colligation: every word primed to occur in or avoid certain grammatical positions and functions (Hoey, 2005: 13)

SiBol (Siena-Bologna) corpus of newspapers, judicial inquiries, press briefings.

Rapid language change

Corpus methodology is useful in detecting absence, not only presence

Language looks rather different when you look at a lot of it at once (Sinclair 1991)

Qualitative: anaphoric, historic, past behaviour

Quantitative: anaphoric and cataphoric; enough data with which to infer

If primed >> psychologically fixed >> reproduced

Evaluation as prototypicality: inner circle obvious, outer circle non-obvious

Prosody can depend on grammar (Louw 1993), point of view, literal vs figurative use and on field of register

Embedding is an important factor to interpret prosody

The added value of CL in discourse studies

Looking at language at different levels of abstraction: overview & close reading

Data are not sacred

Much of textual meaning is accretional

Positive cherry-picking: find counter examples

Almost all explanation in DA is informed speculation: in the human sciences this is the closest you get to explanation

Moral panics have evolved over the years (globesity in 2015)