Workshop: Corpus lingüísticos y aprendizaje y enseñanza de lenguas 

Corpus lingüísticos y aprendizaje y enseñanza de lenguas 

Universidad Nacional de San Agustín, Arequipa, Perú

27 Noviembre 2018

Pascual Pérez-Paredes, University of Cambridge

In this session we’ll look at some corpus linguistics methods that can be used to analyse a text or a group of texts automatically and used them for language teaching or research. In a way, corpus linguistics could be seen as a type of content analysis that places great emphasis on the fact that language variation is highly systematic. We´ll look at ways in which frequency and word combination can reveal different patterns of use and meaning at the lexical, syntactical and semantic levels. We will examine how we can make use of corpus linguistics methods to look at a corpus of texts (from different or the same individuals) and single texts and how these compare to what is frequent in similar or identical registers or communicative situations. This way, we can not only find out what is frequent but also what is truly distinctive or central in a given text or group of texts.

Students are encouraged to download and install Antconc on their laptops:



Quote 1

The word corpus is Latin for body (plural corpora). In linguistics a corpus is a collection of texts (a ‘body’ of language) stored in an electronic database. Corpora are usually large bodies of machine-readable text containing thousands or millions of words. A corpus is different from an archive in that often (but not always) the texts have been selected so that they can be said to be representative of a particular language variety or genre, therefore acting as a standard reference. Corpora are often annotated with additional information such as part-of-speech tags or to denote prosodic features associated with speech. Individual texts within a corpus usually receive some form of meta-encoding in a header, giving information about their genre, the author, date and place of publication etc. Types of corpora include specialised, reference, multilingual, parallel, learner, diachronic and monitor. Corpora can be used for both quantitative and qualitative analyses. Although a corpus does not contain new information about language, by using software packages which process data we can obtain a new perspective on the familiar (Hunston 2002: 2–3).

Baker et al. (2006). A glossary of corpus linguistics. Edinburgh: UEP.

Quote 2

Armchair linguistics does not have a good name in some linguistics circles. A caricature of the armchair linguist is something like this. He sits in a deep soft comfortable armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, “Wow, what a neat fact!”, grabs his pencil, and writes something down. Then he paces around for a few hours in the excitement of having come still closer to knowing what language is really like. (There isn’t anybody exactly like this, but there are some approximations.)

Charles Fillmore. Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82, 1991),

Quote 3

We as linguists should train ourselves specifically to be open to the evidence of long text. This is quite different from using the computer to be our servant in trying out our ideas; it is making good use of some essential differences between computers and people.

[…] I believe that we have to cultivate a new relationship between the ideas we have and the evidence that is in front of us. We are so used to interpreting very scant evidence that we are not in a good mental state to appreciate the opposite situation. With the new evidence the main difficulty is controlling and organizing it rather than getting it.

Sinclair. Trust the Text. (2004:17)

Quote 4

Register, genre, and style differences are fundamentally important for any student with a primary interest in language. For example, any student majoring in English, or in the study of another language like Japanese or Spanish, must understand the text varieties in that language. If you are training to become a teacher (e.g. for secondary education or for TESL), you will shortly be faced with the task of teaching your own students how to use the words and structures that are appropriate to different spoken and written tasks – different registers and genres. Other students of language are more interested in the study of literature or the creative writing of new literature, issues relating to the style perspective, since the literary effects that distinguish one novel (or poem) from the next are realized as linguistic differences.

Biber & Conrad  (2009:4)

Quote 8

Tony McEnery has outlined the reasons why corpus linguistics was largely ignored in the past possibly because of the influence of Noam Chomsky. Prof. McEnery has placed this debate in a wider context where different stakeholders fight a paradigm war: rationalist introspection versus evidence driven analysis.

Quote 9

“Science is a subject that relies on measurement rather than opinion”, Bill Cox wrote in the book version of Human Universe, the BBC Show. And I think he is right. Complementary research methodologies can only bring about better insights and better-informed debates.

Quote 10

Language production is a dynamic process and speakers enter interactions with a plurality of aims; both L1 and L2 speakers also vary in their mastery of the language. Recording and describing variation among speakers is one of the most valuable contributions of CL to the study of SLA, complementing the evidence from more controlled studies that employ sophisticated techniques to decrease natural heterogeneity and variation of language used in the context of meaningful communication. Findings from corpus-based studies can place this evidence into a broader perspective and thus contribute to strengthening of the theoretical base of SLA.

Gablasova, Dana, Brezina, Vaclav & McEnery, Tony. 2017. Exploring learner language through corpora: comparing and interpreting corpus frequency information. Language Learning 67(S1):130-154. DOI: 10.1111/lang.12226

Hands-on workshop. Corpus analysis: the basics.


3a Run a word list

3b Run a keyword list

3c Use concord plot: explore its usefulness

3d Choose a lexical item: explore clusters

3e Choose a lexical item: explore n-grams

3f Run a collocation analysis

Download the Conservative manifesto 2017 here and the Labour 2017 manifesto here



Brown corpus data download.

Brown corpus text categories and the texts themselves identified.



UAM Corpus Tool

Representative corpora (EN)



Representative corpora (Register perspective)


Corpus of research articles

A list of corpora you can download.

Using NVIVO?

NVIVO node export and beyond