Here are some useful resources for carrying out your transcription project.

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

Brian MacWhinney (2019) Tools for Analyzing Talk. Part 1: The CHAT Transcription Format. URL:

Leech (2004): types of annotation

phonetic annotation e.g. adding information about how a word in a spoken corpus was pronounced.

prosodic annotation — again in a spoken corpus — adding information about prosodic features such as stress, intonation and pauses.

syntactic annotation e.g. adding information about how a given sentence is parsed, in terms of syntactic analysis into such units as phrases and clauses.

semantic annotation e.g. adding information about the semantic category of words — the noun cricket as a term for a sport and as a term for an insect belong to different semantic categories, although there is no difference in spelling or pronunciation.

pragmatic annotation e.g. adding information about the kinds of speech act (or dialogue act) that occur in a spoken dialogue — thus the utterance okay on different occasions may be an acknowledgement, a request for feedback, an acceptance, or a pragmatic marker initiating a new phase of discussion.

discourse annotation e.g. adding information about anaphoric links in a text, for example connecting the pronoun them and its antecedent the horses in: I’ll saddle the horses and bring them round. [an example from the Brown corpus]

stylistic annotation e.g. adding information about speech and thought presentation (direct speech, indirect speech, free indirect thought, etc.)

lexical annotation e.g. adding the identity of the lemma of each word form in a text, i.e. the base form of the word, such as would occur as its headword in a dictionary (e.g. lying has the lemma LIE).
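
As a rough illustration of how annotation layers like these might be stored alongside a text, here is a minimal Python sketch. It covers only two of the layers listed above (lexical and discourse). The Token class, the TOY_LEMMAS table, and the annotate and describe_links helpers are hypothetical names made up for this example and do not belong to any corpus tool; the sentence and the lying/LIE lemma are taken from the examples above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str                         # word form as it appears in the text
    lemma: Optional[str] = None       # lexical annotation layer
    antecedent: Optional[int] = None  # discourse layer: index of the antecedent token


# Toy lemma table (an assumption for this example, not a real lexicon).
TOY_LEMMAS: Dict[str, str] = {"horses": "HORSE", "lying": "LIE"}


def annotate(forms: List[str], links: Dict[int, int]) -> List[Token]:
    """Attach a lemma layer and an anaphoric-link layer to raw word forms."""
    tokens = []
    for i, form in enumerate(forms):
        # Fall back to the upper-cased form when the toy table has no entry.
        lemma = TOY_LEMMAS.get(form.lower(), form.upper())
        tokens.append(Token(form=form, lemma=lemma, antecedent=links.get(i)))
    return tokens


def describe_links(tokens: List[Token]) -> List[str]:
    """List each anaphor together with the form of its antecedent."""
    return [
        f"{tok.form} -> {tokens[tok.antecedent].form}"
        for tok in tokens
        if tok.antecedent is not None
    ]


forms = ["I", "'ll", "saddle", "the", "horses", "and", "bring", "them", "round"]
# "them" (index 7) refers back to "horses" (index 4).
tokens = annotate(forms, links={7: 4})
print(tokens[4].lemma)         # HORSE
print(describe_links(tokens))  # ['them -> horses']
```

Keeping each layer as a separate field on the token means a layer can be added or left empty without touching the others, which is one common way of keeping the raw text and its annotations apart.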

Online services:


Backbone Transcriptor. URL



Metadata for corpus work:

Annotation on Sketch Engine:

TEI by example website:

Acquiring text varieties

One of the most important goals of formal schooling is teaching text varieties that might not be acquired outside of school […] Early in school, children learn to read books of many different types, including fictional stories, historical accounts of past events, and descriptions of natural phenomena. These varieties rely on different linguistic structures and patterns, and students must learn how to recognize and interpret those differences. At the same time, students must learn how to produce some of these different varieties, for example writing a narrative essay on what they did during summer vacation versus a persuasive essay on whether the school cafeteria should sell candy. The amount of explicit instruction in different text varieties varies across teachers, schools, and countries, but even at a young age, students must somehow learn to control and interpret the language of different varieties, or they will not succeed at school.

Biber & Conrad (2009: 3)

Biber, D., & Conrad, S. (2009). Register, genre, and style (Cambridge Textbooks in Linguistics). Cambridge: Cambridge University Press.

Improving Writing Through Corpora

The online Data-Driven Learning SPOC “Improving Writing Through Corpora” is now live at the following address:

Improvements in Version 2 include:

A) All course images and functionality have been updated for the ‘new’ Sketch Engine interface.

B) New functions specific to the ‘new’ Sketch Engine interface are now included in the course (e.g. Good Dictionary EXamples (GDEX)).

C) The course is now completely self-contained, with no need for external assessments. Certificates of completion are generated automatically upon completion of the online activities.

D) Improved reflective component and opportunities for peer discussion.

The course is primarily pitched at L2 graduate writing students, but anyone is eligible, whether a student, lecturer, or anyone with an interest in language and technology. 

To enrol, follow the instructions at the link provided. Please contact the course creator, Dr. Peter Crosthwaite, with any questions or technical problems.

Sinclair (2004) vs the theoreticians

Those who during the last decade tried to barricade the profession against the influence of corpora recycled the critical arguments of the theoreticians thirty years before, and we heard again that no corpus can be a totally accurate sample of a language, that occurrence in a corpus is no guarantee of correctness, that frequency is not a sound guide to importance, that there are inexplicable gaps in the coverage of any corpus, however large, etc.

That flurry of resistance is now largely behind us, and it is timely to consider the issue posed as the title of this book, how to use corpora in language teaching, since corpora are now part of the resources that more and more teachers expect to have access to.

Sinclair (2004: 2)

Sinclair, J. (2004). How to use corpora in language teaching. Amsterdam: John Benjamins.

Corpus linguistics and instructional needs

Tyler & Ortega (2018: 317):

Quite simply, corpora are the place to look for patterns of usage. Moreover, we believe that in usage-inspired instruction L2 targets should be taught not just because they can be taught – that is, because we have a good linguistic description or can create good materials – but because corpus linguistic investigations of learner language development show them to be actual areas of instructional need.

Tyler & Ortega (2018: 318):

The diversity of learning goals just acknowledged is salutary. But it also carries the danger of encouraging a certain bifurcation of usage-inspired L2 instruction into two separate streams, one that privileges implicit and incidental learning (i.e., absorbing new patterns of language without trying hard to learn them and without knowing they are being learned) and another that revalorizes explicit knowledge, explicit teaching, and explicit learning, thus going against the grain of suspicion over explicitness in much instructed SLA in the past. However, we do not see the explicit-implicit instructional continuum as a zero-sum game. Usage-based views of language development show that the bulk of language learning happens implicitly. But much of the fine-tuning also happens explicitly with the aid of top-down, conscious processing (Ellis, 2011, 2015). It follows that learning proceeds by dynamic interactions between implicit and explicit processing.

Thus, we argue that the full range of goals for learning needs to be addressed in instructional designs. Ideally, usage-inspired L2 instruction can vary so as to offer learners diverse benefits, including more fluent and more contextually effective language use (e.g., through close attention to meaningful input- and practice-driven implicit learning), greater metacognitive self-regulation for greater autonomy and life-long learning (e.g., through induction and deduction of new understandings of language during explicit, concept-guided, top-down learning), and heightened agency in making connections between language choices and social consequences so the latter can be empowering (e.g., through ethnographic and corpus analyses of one’s and others’ communicative repertoires that make the social consequences and their language reflexes conscious).

Tyler, A. E., & Ortega, L. (2018). Usage-inspired L2 instruction: Some reflections and a heuristic. In A. E. Tyler, L. Ortega, M. Uno, & H. I. Park (Eds.), Usage-inspired L2 instruction: Researched pedagogy (pp. 315-321). Amsterdam: John Benjamins.