Research methods: corpus linguistics

In this session we’ll look at some corpus linguistics methods that can be used to analyse a text or a group of texts automatically.

In a way, corpus linguistics could be seen as a type of content analysis that places great emphasis on the fact that language variation is highly systematic.

We’ll look at ways in which frequency and word combination can reveal different patterns of use and meaning at the lexical, syntactic and semantic levels. We will examine how corpus linguistics methods can be applied to a corpus of texts (from the same or different individuals) or to a single text, and how these compare with what is frequent in similar or identical registers or communicative situations. This way, we can find out not only what is frequent but also what is truly distinctive or central in a given text or group of texts.

Students are encouraged to download and install AntConc on their laptops:


File converter tool URL:

CL research methods

There are several well-established CL methods for researching language use through the examination of naturally occurring data. These methods stress the importance of frequency and repetition across texts and corpora in creating salience. They can be grouped into four categories:

Analysis of keywords. These are words that are unusually frequent in corpus A when compared with corpus B. This is a quantitative method that examines the probability of finding, or not finding, a set of words in a given corpus when measured against a reference corpus. This method is said to reduce both researchers’ bias in content analysis and cherry-picking in grounded theory.

Analysis of collocations. Collocations are words found within a given span (±n words to the left and right) of a node word. This analysis is based on statistical tests that examine the probability of finding a word within a specific lexical context in a given corpus. There are different collocation strength measures and a variety of approaches to collocation analysis (Gries, 2013). A collocational profile of a word, or a string of words, provides a deeper understanding of the meaning of a word and its contexts of use.

Colligation analysis. This involves the analysis of the syntagmatic patterns in which words, and strings of words, tend to co-occur with other words (Hoey, 2005). Patterning stresses the relationship between a lexical item and a grammatical context, a syntactic function (e.g. postmodifiers in noun phrases) and its position in the phrase or in the clause. Potentially, every word has its own distinctive local colligational patterns. Word Sketches have become a widely used way to examine patterns in corpora.

N-grams. N-gram analysis relies on a bottom-up computational approach in which strings of words (other items, such as part-of-speech tags, are also possible) are grouped into sequences of 2, 3, 4, 5 or 6 items and their frequency is examined. Previous research on n-grams shows that different domains (topics, themes) and registers (genres) differ in the n-grams most frequently used by expert users.
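To make three of these methods more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how AntConc or any other tool implements them. The function names, the crude regex tokenizer and the default span of 4 are all hypothetical choices. Log-likelihood is used as one common keyness statistic among several, and the collocation function returns raw co-occurrence counts only, which a strength measure would then have to be applied to.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of one word in corpus A against reference corpus B.

    freq_* are the word's frequencies; size_* are the corpus sizes in tokens.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero frequency contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def collocates(tokens, node, span=4):
    """Count the words found within +/- span tokens of each occurrence of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts
```

A word whose relative frequency is identical in both corpora gets a keyness score of 0; the further the observed frequencies depart from the expected ones, the higher the score. In practice the raw collocate counts would be passed to a strength measure before any interpretation.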

Quote 1: what is a corpus?

The word corpus is Latin for body (plural corpora). In linguistics a corpus is a collection of texts (a ‘body’ of language) stored in an electronic database. Corpora are usually large bodies of machine-readable text containing thousands or millions of words. A corpus is different from an archive in that often (but not always) the texts have been selected so that they can be said to be representative of a particular language variety or genre, therefore acting as a standard reference. Corpora are often annotated with additional information such as part-of-speech tags or to denote prosodic features associated with speech. Individual texts within a corpus usually receive some form of meta-encoding in a header, giving information about their genre, the author, date and place of publication etc. Types of corpora include specialised, reference, multilingual, parallel, learner, diachronic and monitor. Corpora can be used for both quantitative and qualitative analyses. Although a corpus does not contain new information about language, by using software packages which process data we can obtain a new perspective on the familiar (Hunston 2002: 2–3).

Baker et al. (2006). A glossary of corpus linguistics. Edinburgh: Edinburgh University Press.

Quote 2: introspection

Armchair linguistics does not have a good name in some linguistics circles. A caricature of the armchair linguist is something like this. He sits in a deep soft comfortable armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, “Wow, what a neat fact!”, grabs his pencil, and writes something down. Then he paces around for a few hours in the excitement of having come still closer to knowing what language is really like. (There isn’t anybody exactly like this, but there are some approximations.)

Charles Fillmore. Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82, 1991).

Quote 3: evidence in a corpus

We as linguists should train ourselves specifically to be open to the evidence of long text. This is quite different from using the computer to be our servant in trying out our ideas; it is making good use of some essential differences between computers and people.

[…] I believe that we have to cultivate a new relationship between the ideas we have and the evidence that is in front of us. We are so used to interpreting very scant evidence that we are not in a good mental state to appreciate the opposite situation. With the new evidence the main difficulty is controlling and organizing it rather than getting it.

Sinclair. Trust the Text. (2004:17)

Quote 4: why analyse registers?

Register, genre, and style differences are fundamentally important for any student with a primary interest in language. For example, any student majoring in English, or in the study of another language like Japanese or Spanish, must understand the text varieties in that language. If you are training to become a teacher (e.g. for secondary education or for TESL), you will shortly be faced with the task of teaching your own students how to use the words and structures that are appropriate to different spoken and written tasks – different registers and genres. Other students of language are more interested in the study of literature or the creative writing of new literature, issues relating to the style perspective, since the literary effects that distinguish one novel (or poem) from the next are realized as linguistic differences.

Biber & Conrad (2009:4)

Quote 8: sleeping furiously

Tony McEnery has outlined why corpus linguistics was largely ignored in the past, possibly because of the influence of Noam Chomsky. Prof. McEnery places this debate in a wider context in which different stakeholders fight a paradigm war: rationalist introspection versus evidence-driven analysis.

Quote 9: epistemological adherence?

“Science is a subject that relies on measurement rather than opinion”, Brian Cox wrote in the book version of Human Universe, the BBC series. And I think he is right. Complementary research methodologies can only bring about better insights and better-informed debates.

Hands-on workshop. Corpus analysis: the basics.


3a Run a word list

3b Run a keyword list

3c Use concord plot: explore its usefulness

3d Choose a lexical item: explore clusters

3e Choose a lexical item: explore n-grams

3f Run a collocation analysis
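For those who want to see what steps 3a and the concordance view amount to outside AntConc, the following is a minimal Python sketch. It is an assumption-laden illustration, not AntConc's implementation: the function names `word_list` and `kwic`, the regex tokenizer and the 30-character context width are all hypothetical choices.

```python
import re
from collections import Counter


def word_list(text):
    """Return (word, frequency) pairs, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()


def kwic(text, node, width=30):
    """Return keyword-in-context lines for every whole-word match of node.

    width is the number of characters of context kept on each side.
    """
    lines = []
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines
```

Reading a plain-text version of one of the manifestos into a string and calling `word_list` on it gives a rough equivalent of the word list in step 3a, while `kwic` approximates the concordance lines that the other steps build on.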

Download the Conservative manifesto 2017 here and the Labour 2017 manifesto here


Policy paper: DFID Education Policy 2018: Get Children Learning (PDF)


Brown corpus data download.

Brown corpus text categories and the texts themselves identified.


UAM Corpus Tool

Representative corpora (EN)



Representative corpora (Register perspective)


Corpus of research articles

A list of corpora you can download.

Using NVIVO?

NVIVO node export and beyond

U. Oxford Keynote: Education and learning research in the age of complexity and fragmentation. An introspection


Oxford-Cambridge PhD students’ exchange seminar.

Department of Education, University of Oxford. June 1, 2018.

On 1 June 2018 I had the privilege of delivering a keynote at the Oxford-Cambridge PhD in Education exchange. I discussed the impact of the ideas of complexity and fragmentation on my own research and how my PhD students understood complexity.

I came up with a six-point list of desiderata that was used as the basis for the ensuing discussion:

(1) Research is becoming more interdisciplinary and discipline boundaries tend to disappear.

(2) Collaboration with other researchers is essential. 

(3) Constantly re-examine your ontology and epistemology. I’m in favour of a dynamic ontology / epistemology. Think critically about your work through the eyes of differing epistemologies (and ontologies).

(4) Go deeper into the basic foundations of your discipline. But make sure it’s you and not somebody else guiding that reflection.

(5) Explore the limits of your discipline and themes.

(6) Attention is your best asset. Attention needs to be strategic.


Some references used in my talk

Douglas Fir Group (Atkinson, D.; Byrnes, H.; Doran, M.; Duff, P.; Ellis, Nick C.; Hall, J. K.; Johnson, K.; Lantolf, J.; Larsen-Freeman, D.; Negueruela, E.; Norton, B.; Ortega, L.; Schumann, J.; Swain, M.; Tarone, E.) (2016). A transdisciplinary framework for SLA in a multilingual world. Modern Language Journal, 100, 19-47.

Greene, M. T. (1997). What cannot be said in science. Nature, 388(6643), 619-620.

Greene, M. T. (2007). The demise of the lone author. Nature.

Larsen-Freeman, D. (2012). Complex, dynamic systems: A new transdisciplinary theme for applied linguistics? Language Teaching, 45(2), 202-214.

Williams, J. (2018). Stand out of our light. Cambridge: Cambridge University Press.

You can download my presentation here.


Making the Links: from theory to research design – follow-up qs


Making the Links: from theory to research design and back again

The video is a film of the lecture given by Professor Madeleine Arnot for the M.Phil, M.Ed, Ph.D and Ed.D courses on educational research. It offers students a chance to think about some recent debates about the role of theory in research, and the ways in which a theoretically informed study can be designed. The examples given derive from actual research projects.

Created: 2013-02-13 10:50 by Andrew Borkett

Keynote speaker: Madeleine Arnot

Publisher: University of Cambridge

You & theories

Category A – I have found theories (or a conceptual framework) I like which I am going to use.
Category B – I am worried because I don’t have a theory or conceptual framework, or can’t find one.
Category C – This is not relevant to me. I am a practitioner and want to improve practice, not educational theories. I already know what I want to find out.
Category D – I think theory-driven projects are biased and restrictive. I want to start with the data.

Concepts & methodology

Positivism, post-positivism, mixed methods
Surveys, data banks, tests, interviews
Interpretivist methodology
Symbolic Interactionist
Phenomenology/grounded theory
Participatory/action research
Critical interpretivist traditions
Feminist methodologies
Critical policy research
Community studies/family studies
Youth cultural studies


-Have you considered how to “position” yourself? What does “positioning” entail?

-Why is it not enough to describe “the world”?

-What is the link between our RQ and theories? Is it one of those technical issues favoured by existing govt policies?

-What is the role of “grand theories”? Is there a grand theory particularly relevant in your research?

-The “life is messy” message. What do you take from this? What’s wrong with “patterns”?

-Theoretically-driven research vs grounded approach. How does this play out in your research?

-How useful are the models discussed by Prof Arnot for your research (a>b, triangle, circular, deconstruction models)?

-Thinking conceptually and research designs. How does “concept” impact your research methods?





Some follow-up questions: Prof Pauline Rose keynote


Discussion points (a suggestion)

-Your position as a researcher. Do you have one? Do you need one after all?

-How important is “critique” in the process of becoming a researcher or practising research?

-Why do you want to carry out research?

-How do you understand “evidence” in your own research project/practice?

-How do Research Questions come to life? What is a real-life issue in your case?

-Research “methodology”. Are you familiar with different approaches, or are you more committed to a single tradition of research methods, data collection and analysis?

-How do you go about deciding on “your” research method(s) in a research project? What needs to be accounted for?

-Were you familiar with the UK Data Archive? What kind of research can these data inform?

-Education research: (big) challenges. How can we contribute to the debate? Issues around policy, evidence and (different) agendas.

-Ethics and research. Have you considered the many issues involved?

You can read about Prof Rose here.