Learner corpus research plenary #cl2015

Learner corpus research: a fast-growing interdisciplinary field

Sylviane Granger



LCR is an interdisciplinary field of research

Design: learner and task variables to control

Not only English language

Method: CIA (Granger, 1996) and computer-aided error analysis

Wider spectrum of linguistic analysis

Interpretation: focus on transfer but this is changing; growing integration of SLA theory

Applications: few up-and-running resources but great potential

Version 3 (2016 or 2017): around 30 L1s, as opposed to 11 L1s in Version 1

Learner corpora are a powerful heuristic resource

Corpus techniques make it possible to uncover new dimensions of learner language and lead to the formulation of new research questions: the L2 phrasicon (word combinations).

Prof. Granger brings up Leech’s preface to Learner English on Computer (1998)

Gradual change from mute corpora to sound-aligned corpora

POS tagging has improved so much

Error-tagging: wide range of error tagging systems: multi-layer annotation systems

Parsing of learner data (90% accuracy; Geertzen et al. 2014)

Static learner corpora vs monitor corpora

CMC learner corpus (Marchand 2015)

Granger (2009) paper on the learner research field:

CIA V2 Granger (2015): a new model

SLA researchers are more interested in corpus data and corpus linguists are more familiar with SLA grounding

Implications are much more numerous than applications

Links with NLP: spell and grammar checking, learner feedback, native language identification, etc.

Multiple perspectives on the same resource: richer insights and more powerful tools


Louvain English for Academic Purposes Dictionary (LEAD)


corpus based

descriptions of cross-disciplinary academic vocabulary

1,200 lexical items organised around 18 functions (contrast, illustrate, quote, refer, etc.)

A really exciting application









Free ngram databases from COW14 web corpora

From the corpora list


We are pleased to announce the release of the first very large ngram databases derived from the giga-token COW14 web corpora. They are completely free (CC-BY) and can be downloaded without registration. We have applied no frequency thresholds whatsoever. In addition to the counted ngram lists, we offer raw versions such that everybody can create their own version. The raw ngrams also contain additional information (crawl year, top-level domain, country geolocation).

There are also English dependency bigrams (based on Malt parses) containing words, their heads, and the dependency relation between them.

For end-users, there are also word and lemma frequency lists with some convenient frequency measures, optionally with a frequency threshold of 10 (smaller files, easier handling).



License Creative Commons Attribution 4.0 International
References http://corporafromtheweb.org/category/cow-citation/

Please tell us whenever you publish work based on COW:




The ngrams are derived from the COW14AX sentence-shuffled corpora.

Information http://corporafromtheweb.org/category/corpora/
Interface https://webcorpora.org/

English 9,578,828,861 tokens (International)
German 11,660,894,000 tokens (AT, CH, DE)
Spanish 3,680,794,644 tokens (International)
Swedish 4,842,753,707 tokens (FI, SV)


Frequency lists
Languages: English, German, Spanish, Swedish
Versions: Lemma, Lemma + POS, Word, Word + POS
Thresholds: no threshold; raw frequency > 9
Measures: raw frequency, absolute rank, frequency per million, log-frequency per million, frequency band
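
Most of the measures listed above can be derived from raw counts. The following is a minimal sketch, assuming a plain list of tokens as input. The function name `frequency_measures`, the base-10 logarithm and the way ties are ranked are assumptions made for illustration, not the procedure used for the COW14 lists. Frequency band is left out because its definition is not given here.

```python
import math
from collections import Counter


def frequency_measures(tokens, threshold=0):
    """Return one row per word type with raw frequency > threshold, most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())  # corpus size in tokens
    rows = []
    # Ranks are assigned before filtering, so they stay absolute.
    # Assumption: tied words get consecutive ranks in first-seen order.
    for rank, (word, raw) in enumerate(counts.most_common(), start=1):
        if raw <= threshold:
            continue
        per_million = raw / total * 1_000_000
        rows.append({
            "word": word,
            "raw": raw,
            "rank": rank,
            "per_million": per_million,
            "log_per_million": math.log10(per_million),  # base 10 is an assumption
        })
    return rows


tokens = "the cat and the dog and the bird".split()
full_list = frequency_measures(tokens)
thresholded = frequency_measures(tokens, threshold=1)  # keeps raw frequency > 1
```

With `threshold=9` the filter corresponds to the "raw frequency > 9" variant of the lists.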


Ngrams
N: 1 .. 5
Languages: English, German, Spanish, Swedish
Versions: Raw, Word, Word + POS, Lemma (except Swedish)
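
To make the idea of counted ngrams for N = 1 to 5 without a frequency threshold concrete, here is a minimal sketch. It assumes sentences that are already tokenised and counts ngrams within sentence boundaries only. The function name `count_ngrams` and the toy sentences are hypothetical and say nothing about how the COW14 databases were actually built.

```python
from collections import Counter


def count_ngrams(sentences, max_n=5):
    """Count every ngram of length 1..max_n inside each sentence; nothing is filtered out."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    return counts


# Toy input, invented for illustration.
sentences = [
    ["corpora", "are", "a", "resource"],
    ["corpora", "are", "a", "reference", "source"],
]
ngram_counts = count_ngrams(sentences)
```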


Dependency bigrams
Languages: English (German soon, maybe Swedish)
Versions: Raw, Word, Word + POS, Lemma, Lemma + POS
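
A dependency bigram pairs a word with its head and the relation between them. The sketch below shows one way to represent and count such triples. It assumes each parsed sentence is a list of (word, head index, relation) tuples with 1-based head indices and 0 for the root. The class `DepBigram`, the input layout and the relation labels are hypothetical; they are not the file format of the COW14 release or the label set of the Malt parser.

```python
from collections import Counter
from typing import NamedTuple


class DepBigram(NamedTuple):
    word: str
    head: str
    relation: str


def dependency_bigrams(sentence):
    """sentence: list of (word, head_index, relation); head_index is 1-based, 0 marks the root."""
    bigrams = []
    for word, head_index, relation in sentence:
        # Look up the head word by its position; the root has no head word.
        head = "ROOT" if head_index == 0 else sentence[head_index - 1][0]
        bigrams.append(DepBigram(word, head, relation))
    return bigrams


# Toy parse, invented for illustration.
parsed = [("learners", 2, "nsubj"), ("write", 0, "root"), ("essays", 2, "dobj")]
bigram_counts = Counter(dependency_bigrams(parsed))
```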

Where’s austerity in everyday speech?

According to Prof. McEnery, people avoid words such as “austerity” in conversation: it occurs only once in a corpus of 5 million words. Listen to the interview below.

Surprising? I don’t think so. “Expected” words often fail to crop up when the data is examined. We have found that UK legislation on immigration from 2007-2011 does not include the lemma “immigrant”. Unexpected?

New book: Corpus Linguistics for ELT



Corpus Linguistics for ELT: Research and Practice
Ivor Timmis

From the introduction:

The challenge of fostering a fruitful relationship between corpus linguistics and ELT was clearly set out by Conrad (2000: 556):
Corpus grammarians must strive to reach more audiences that include teachers and must emphasize concrete pedagogical applications … In fact, the strongest force for change could be a new generation of ESL teachers who were introduced to corpus-based research in their training programs [and] have practiced conducting their own corpus investigations and designing materials based on corpus research.
Indeed, this comment by Conrad encapsulates the main aim of this book: to help move corpus linguistics from what Römer (2012) terms its ‘minority sport’ status in language teaching to a point where the ability to carry out and interpret corpus research is seen as a normal part of an English language teacher’s repertoire.
Familiarity with corpus research and practice should be a standard part of an English language teacher’s toolkit, I would argue, because most people in ELT will at some time have had thoughts like these:
• How many words do my learners need to learn?
• Why is everyone talking about lexical chunks and collocations?
• Do my students really need this grammar point?
• Which words should I use to exemplify this structure?
• Am I teaching my learners language they will need to use when they speak the language?
• Does the grammar explanation in the coursebook really reflect how we use this structure?
• What vocabulary do my English for dentistry students need to get their teeth into?

If you have had questions like these, this book is designed to help you to answer them by consulting corpora and corpus-informed literature. It is also designed to help you to generate and investigate similar questions. It is, however, important to keep corpora in perspective throughout this book.

The argument presented here is that corpora are a resource and a reference source and, as is the case with all resources, pedagogic judgement is vitally important in determining how and when they are deployed to best effect.
The book does not assume prior knowledge or experience of corpus research; nor does it assume any technical expertise. Technophobes can relax: contemporary corpus interfaces and corpus software are user-friendly and often include tutorial packages. The tasks in this book will help to familiarise readers with publicly available user-friendly corpora such as the British National Corpus hosted at

Ideology in corporate language

Ruth Breeze

Ideology in corporate language: discourse analysis using Wmatrix3

2013 annual reports from 16 leading companies in financial services, mining, food and pharmaceuticals

Parts: the first part is non-technical, discursive and visually interesting

Reference corpus: first the BNC Sampler Business and BNC Informative texts, but then only BNC Business

Use of semantic categories
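
Wmatrix identifies key semantic categories by comparing their frequency in the study corpus with a reference corpus using the log-likelihood statistic. The sketch below shows that calculation for a single category. The function name `log_likelihood` and all counts are invented for illustration; they are not figures from the talk.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Log-likelihood for one item observed in a study corpus and a reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the item were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Invented counts: a semantic tag occurring 20 times in 1,000 tokens of annual
# reports and 10 times in 1,000 tokens of the reference corpus.
size_tag_ll = log_likelihood(20, 10, 1000, 1000)
```

The statistic itself has no sign, so whether a category is overused or underused has to be read off the relative frequencies in the two corpora.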

Three case studies: size (big), time (begin) and cause and effect

Size: Focus on growth, large, expanding, substantial. Not only adjectives are interesting here.


Ideology of cause and effect

Dynamic approach to time

Emphasis on size and importance

Salient semantic areas: investigation, tough, strong, attentive, help & give, in power, belonging to a group

Differences: only in domain/topic-focus, probably different stresses on newness and green economy