Phil Durrant’s talk available on YouTube

Check out Dr Durrant’s talk “Researching writing development with a corpus” on our research group’s YouTube channel.

More info on the talk here.

More info on the Corpus linguistics and applied linguistics research 2021 site.

From Data to Knowledge – Digital literacy at the service of corpora

  By Paulo Martins. University of Minho, Braga, 11/11/2021

Learning a programming language

Coding literacy

Learning a programming language is easier than learning a natural language (?). It lets you explore new scientific strategies, automate daily tasks, and boost problem-solving skills.

NLP and data science

Data: raw, unstructured vs information: structured, organized…useful.

Some tools

Web crawlers: fetching comments is challenging (JavaScript and so on)

JSON files and JSON syntax
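As a minimal sketch of what JSON syntax looks like in practice, the snippet below parses one small record with Python's standard json module. The record and its field names (author, text, likes, tags, reply_to) are made up for illustration and do not come from the talk.

```python
import json

# Hypothetical record, e.g. one comment saved by a crawler (invented data).
raw = '{"author": "ana", "text": "Great talk!", "likes": 3, "tags": ["corpus", "nlp"], "reply_to": null}'

# JSON text -> Python objects (object -> dict, array -> list, null -> None)
comment = json.loads(raw)
print(comment["author"], comment["likes"])
print(comment["tags"][0])
print(comment["reply_to"] is None)

# Python objects -> JSON text, indented so the structure is easy to read
pretty = json.dumps(comment, indent=2, ensure_ascii=False)
print(pretty)
```

The same two calls, json.loads and json.dumps, cover most day-to-day work with JSON text; json.load and json.dump do the same for files.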

Yago ontology

YAGO is a knowledge base, i.e., a database with knowledge about the real world. YAGO contains both entities (such as movies, people, cities, countries, etc.) and relations between these entities (who played in which movie, which city is located in which country, etc.). All in all, YAGO contains more than 50 million entities and 2 billion facts.

YAGO arranges its entities into classes: Elvis Presley belongs to the class of people, Paris belongs to the class of cities, and so on. These classes are arranged in a taxonomy: The class of cities is a subclass of the class of populated places, this class is a subclass of geographical locations, etc.

YAGO also defines which relations can hold between which entities: birthPlace, e.g., is a relation that can hold between a person and a place. The definition of these relations, together with the taxonomy is called the ontology.
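The ideas above (classes, a taxonomy, and relations that can only hold between certain classes) can be sketched in a few lines of Python. This is a toy model, not YAGO's actual schema or API: the dictionary names, the helper functions is_a and can_hold, and the choice of "geographical location" as the range of birthPlace are all assumptions made for the example.

```python
# Toy taxonomy: each class maps to its direct superclass (None = top level).
SUBCLASS_OF = {
    "city": "populated place",
    "populated place": "geographical location",
    "geographical location": None,
    "person": None,
}

# Entities and the class each one belongs to.
INSTANCE_OF = {"Elvis Presley": "person", "Paris": "city", "Tupelo": "city"}

# Relation -> (class of the subject, class of the object). Assumed signature.
RELATIONS = {"birthPlace": ("person", "geographical location")}


def is_a(cls, ancestor):
    """Walk up the taxonomy and report whether cls is ancestor or a subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def can_hold(relation, subject, obj):
    """Check classes only: may this relation hold between the two entities?"""
    domain, range_ = RELATIONS[relation]
    return is_a(INSTANCE_OF[subject], domain) and is_a(INSTANCE_OF[obj], range_)


print(is_a("city", "geographical location"))
print(can_hold("birthPlace", "Elvis Presley", "Tupelo"))
print(can_hold("birthPlace", "Paris", "Elvis Presley"))
```

Note that can_hold only checks whether a statement is well-typed according to the ontology. Whether the statement is actually true is a separate matter, recorded in the knowledge base as a fact.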


Recent DDL research & events: 5 tips

These are really exciting times for researchers in DDL, corpus linguistics and education. Some interesting new work has just been published, and some interesting conference videos are now available. Here’s my selection.

(1) Boulton, A., & Vyatkina, N. (2021). Thirty years of data-driven learning: Taking stock and charting new directions over time. Language Learning & Technology, 25(3), 66-89.


The tools and techniques of corpus linguistics have many uses in language pedagogy, most directly with language teachers and learners searching and using corpora themselves. This is often associated with work by Tim Johns who used the term Data-Driven Learning (DDL) back in 1990. This paper examines the growing body of empirical research in DDL over three decades (1989-2019), with rigorous trawls uncovering 489 separate publications, including 117 in internationally ranked journals, all divided into five time periods. Following a brief overview of previous syntheses, the study introduces our collection, outlining the coding procedures and conversion into a corpus of over 2.5 million words. The main part of the analysis focuses on the concluding sections of the papers to see what recommendations and future avenues of research are proposed in each time period. We use manual coding and semi-automated corpus keyword analysis to explore whether those points are in fact addressed in later publications as an indication of the evolution of the field.

(2) Dr Peter Crosthwaite, The University of Queensland: Is Data Driven Learning dead? In this talk Dr Crosthwaite ****

Language is never, ever, ever random

“Language is never, ever, ever random” (Kilgarriff, 2005), not in its usage, not in its acquisition, and not in its processing. (Nick C. Ellis, 2017, p. 41)

Nick C. Ellis (2017). Cognition, Corpora, and Computing: Triangulating Research in Usage-Based Language Learning. Language Learning 67(S1), pp. 40–65

Corpus of North American Spoken English (CoNASE)

The Corpus of North American Spoken English (CoNASE), a 1.25-billion-word corpus of geolocated automatic speech-to-text transcripts, is now available in a beta version.

URL for more information.

The corpus was created from 301,847 ASR transcripts from 2,572 YouTube channels, corresponding to 154,041 hours of video. The size of the corpus is 1,252,066,371 word tokens.
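To get a feel for these figures, a few averages can be derived from them. The sketch below uses only the four numbers reported above; the variable names are chosen for this example, and the averages say nothing about how much individual transcripts or channels vary.

```python
# Figures as reported in the corpus announcement.
transcripts = 301_847
channels = 2_572
hours = 154_041
tokens = 1_252_066_371

# Simple averages derived from the totals.
tokens_per_transcript = tokens / transcripts
transcripts_per_channel = transcripts / channels
minutes_per_transcript = hours * 60 / transcripts
tokens_per_minute = tokens / (hours * 60)

print(f"tokens per transcript:   {tokens_per_transcript:,.0f}")
print(f"transcripts per channel: {transcripts_per_channel:,.0f}")
print(f"minutes per transcript:  {minutes_per_transcript:.1f}")
print(f"tokens per minute:       {tokens_per_minute:.1f}")
```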

The channels sampled in the corpus are associated with local government entities such as town, city, or county boards and councils, school or utility districts, regional authorities such as provincial or territorial governments, or other governmental organizations.

The transcripts are primarily of recordings of public meetings, although other genres are also present. Video transcripts have been assigned exact latitude-longitude coordinates using a geocoding script.

This information was distributed through the Corpora-List by Steven Coats, University of Oulu, Finland.

To cite the corpus, please use:

Coats, Steven. 2021. Corpus of North American Spoken English (CoNASE).