Some resources to learn corpus-based discourse analysis

One of my students asked me for some online references to learn more about corpus-based/assisted discourse analysis. Here are five online talks.

Obesity in the News: Combining Corpus and Critical Perspectives. Online talk by Gavin Brookes at Universidad de Murcia.

Corpus linguistics and the discursive construction of migrants. Online talk by Charlotte Taylor at Universidad de Murcia.

CorpusCast with Dr Robbie Love: Professor Paul Baker on social justice.

Corpus-based discourse analysis. Online talk by Tony McEnery. LAEL webinar.

Corpus linguistics and the analysis of language ideology. Online talk by Rachelle Vessey at Universidad de Murcia.

New research on Data-driven language learning March 2023

Allan, R. (2023). Reserved for Research? Normalising Corpus Use for School Teachers. Nordic Journal of English Studies, 22(1).

Allan, R., Walker, T., & Langum, V. (2023). Data-driven learning: Tools, approaches, and next steps. Nordic Journal of English Studies, 22(1), 1-12.

Muftah, M. (2023). Data-driven learning (DDL) activities: do they truly promote EFL students’ writing skills development? Education and Information Technologies, 1-27.

O’Keeffe, A. (2023). A Theoretical Rationale for the Importance of Patterning in Language Acquisition and the Implications for Data-driven Learning. Nordic Journal of English Studies, 22(1), 16-41.

Şahin Kızıl, A. Data‐driven learning: English as a foreign language writing and complexity, accuracy and fluency measures. Journal of Computer Assisted Learning.

Exploring Part of Speech (POS)-tag sequences in a large-scale learner corpus of L2 English: A developmental perspective


This research explores the POS-tag sequences that shape the transition from upper intermediate (B2 CEFR) to near-native proficiency (C2 CEFR) in a corpus of essays (n=32,410) from the Cambridge Learner Corpus. Gilquin (2018) and others have shown that POS-tag sequences offer a holistic approach to extracting the most commonly used patterns without starting from an a priori set of words and word sequences. Using corpus linguistics informed by usage-based theories of language learning, this paper examines the frequency and distribution of 4-slot POS-tag sequences in L2 English writing, drawing on the taxonomy of pattern grammar (Francis et al. 1996, 1998; Hunston & Francis, 2000). Findings point to the presence of both core and emergent POS-tag sequences in learner language at the two proficiency levels analysed. These sequences point to dynamic language restructuring processes as learners become more proficient and re-evaluate their understanding of frequency and distribution in English. This paper shows evidence of how language competence increases with proficiency. The research contributes new evidence to our understanding of the development of L2 writing in EFL contexts.
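To make the idea of a 4-slot POS-tag sequence concrete, here is a minimal Python sketch that counts every contiguous run of four tags in already-tagged sentences. It is only an illustration and not the procedure used in the paper: the function name pos_ngrams, the Universal POS tags, and the two toy sentences are all assumptions made up for this example, and a real study would start from the output of a POS tagger run over the learner corpus.

```python
from collections import Counter


def pos_ngrams(tags, n=4):
    """Return every contiguous n-slot sequence of POS tags in one sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Hypothetical, hand-tagged toy sentences (Universal POS tags), not corpus data.
tagged_sentences = [
    ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
]

# Count how often each 4-slot sequence occurs across all sentences.
counts = Counter()
for tags in tagged_sentences:
    counts.update(pos_ngrams(tags))

# List the most frequent sequences with their raw frequencies.
for sequence, frequency in counts.most_common(3):
    print(" ".join(sequence), frequency)
```

Comparing such frequency lists across proficiency levels (for example, one Counter per CEFR level) would be one simple way to look for sequences that are shared and sequences that only emerge later.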

This is a preprint of:

Lim, J., Mark, G., Pérez-Paredes, P. & O’Keeffe, A. (2024). Exploring Part of Speech (POS)-tag sequences in a large-scale learner corpus of L2 English: A developmental perspective. Corpora, 19(1).

5 recent books for language teachers interested in corpus linguistics, DDL & language education

Crosthwaite, P. (Ed.). (2019). Data-driven learning for the next generation: Corpora and DDL for pre-tertiary learners. Routledge. (URL)

Jablonkai, R. R., & Csomay, E. (Eds.). (2022). The Routledge Handbook of Corpora and English Language Teaching and Learning. Routledge. (URL)

Pérez-Paredes, P. (2020). Corpus Linguistics for Education. A Guide for Research. Routledge. (URL)

Timmis, I. (2015). Corpus linguistics for ELT: Research and practice. Routledge. (URL)

Viana, V. (Ed.). (2022). Teaching English with Corpora: A Resource Book. Routledge. (URL)

Excel text functions

  • UPPER(cell_reference)
  • LOWER(cell_reference)
  • PROPER(cell_reference)
  • TRIM(text) 
  • EXACT(cell_reference1, cell_reference2) returns TRUE for an exact match or FALSE for no match.
  • FIND(find, within, start_number) where the first two arguments are required. The optional start_number argument specifies the character position at which to start the search.
  • REPLACE(current_text, start_number, number_characters, new_text) where each argument is required. Current_text: the cell reference(s) for the current text; Start_number: the numeric position of the first character to replace in the current text; Number_characters: the number of characters you want to replace; New_text: the new text to replace the current text.
  • SUBSTITUTE(cell_reference, current_text, new_text, instances) changes the actual text rather than working from a character’s position. All arguments are required except for instances, which you can use to specify which occurrence in the text string to change.
  • Source:
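
For readers who clean their data in Python rather than Excel, the functions above can be approximated as follows. This is a rough sketch, not an exact reimplementation of Excel: the excel_* function names are hypothetical, the argument names simply mirror the list above, and edge cases (error values, non-text cells, overlapping matches) are simplified. UPPER and LOWER correspond directly to Python's built-in str.upper() and str.lower().

```python
import re


def excel_proper(text):
    # str.title() is used here as a close stand-in for PROPER.
    return text.title()


def excel_trim(text):
    # TRIM removes leading/trailing spaces and collapses runs of spaces to one.
    return re.sub(r" +", " ", text.strip(" "))


def excel_exact(text1, text2):
    # EXACT is a case-sensitive comparison.
    return text1 == text2


def excel_find(find, within, start_number=1):
    # FIND is case-sensitive and counts positions from 1, not 0.
    index = within.find(find, start_number - 1)
    if index == -1:
        # Excel shows #VALUE! here; this sketch raises an exception instead.
        raise ValueError(f"{find!r} not found in {within!r}")
    return index + 1


def excel_replace(current_text, start_number, number_characters, new_text):
    # REPLACE works by position: swap number_characters characters
    # starting at the 1-based start_number for new_text.
    start = start_number - 1
    return current_text[:start] + new_text + current_text[start + number_characters:]


def excel_substitute(text, current_text, new_text, instances=None):
    # SUBSTITUTE works by content: change every occurrence of current_text,
    # or only the occurrence numbered by instances (1-based).
    if not current_text:
        return text
    if instances is None:
        return text.replace(current_text, new_text)
    start = 0
    for _ in range(instances):
        position = text.find(current_text, start)
        if position == -1:
            return text  # fewer occurrences than requested: leave unchanged
        start = position + len(current_text)
    return text[:position] + new_text + text[position + len(current_text):]


print(excel_trim("  corpus   linguistics  "))
print(excel_find("pus", "corpus"))
print(excel_replace("corpus", 1, 3, "cam"))
print(excel_substitute("a-b-c", "-", "+", 2))
```

The last two calls show the difference between the two replacement functions: excel_replace needs to know where the text sits, while excel_substitute needs to know what the text is.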