A selection of “new” corpora

This is a selection of corpora that I've recently discovered and find of interest. The resources are, therefore, not necessarily new. The idea is to gather here corpora other than the more traditional resources such as the BNC, COCA or ICLE, to name a few.

Learner language

The CLC FCE Dataset is a set of 1,244 exam scripts written by candidates sitting the Cambridge ESOL First Certificate in English (FCE) examination in 2000 and 2001.

The EF-Cambridge Open Language Database

The EF-Cambridge Open Language Database (EFCAMDAT) is a publicly available resource to facilitate second language research and teaching. It contains written samples from thousands of adult learners of English as a second language worldwide.

EFCAMDAT currently contains over 83 million words from 1 million assignments written by 174,000 learners, across a wide range of levels (CEFR stages A1-C2). This text corpus includes information on learner errors, part of speech, and grammatical relationships. Researchers can search for language patterns using a range of criteria, including learner nationality and level.
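To give a rough idea of what searching by nationality and level could look like once you have some learner texts on your own machine, here is a minimal Python sketch. Everything in it is my own assumption for illustration: the `Script` record, its field names, the `select_scripts` helper and the three toy sentences are invented and do not reflect EFCAMDAT's actual export format or search interface.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Script:
    """Hypothetical learner script record (not EFCAMDAT's real schema)."""
    learner_id: str
    nationality: str
    cefr_level: str
    text: str


def select_scripts(
    scripts: Iterable[Script],
    nationality: Optional[str] = None,
    cefr_level: Optional[str] = None,
    pattern: Optional[str] = None,
) -> List[Script]:
    """Keep the scripts that satisfy every criterion that was supplied."""
    regex = re.compile(pattern, re.IGNORECASE) if pattern else None
    selected = []
    for script in scripts:
        if nationality is not None and script.nationality != nationality:
            continue
        if cefr_level is not None and script.cefr_level != cefr_level:
            continue
        if regex is not None and not regex.search(script.text):
            continue
        selected.append(script)
    return selected


# Invented toy data, only there to make the sketch runnable.
SAMPLE = [
    Script("L001", "br", "A2", "I goes to school every day."),
    Script("L002", "br", "B1", "She goes to work by bus."),
    Script("L003", "de", "A2", "He go to the cinema yesterday."),
]

# All A2 scripts containing "go" or "goes" as a whole word.
a2_go = select_scripts(SAMPLE, cefr_level="A2", pattern=r"\bgo(es)?\b")
print([s.learner_id for s in a2_go])
```

Each criterion is optional, so the same helper covers "all Brazilian learners", "all A2 scripts" or any combination of the two with a word pattern.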


American English

-The Santa Barbara Corpus of Spoken American English

URL: http://www.linguistics.ucsb.edu/research/santa-barbara-corpus#access

The Santa Barbara Corpus of Spoken American English is based on a large body of recordings of naturally occurring spoken interaction from all over the United States. The Santa Barbara Corpus represents a wide variety of people of different regional origins, ages, occupations, genders, and ethnic and social backgrounds. The predominant form of language use represented is face-to-face conversation, but the corpus also documents many other ways that people use language in their everyday lives: telephone conversations, card games, food preparation, on-the-job talk, classroom lectures, sermons, story-telling, town hall meetings, tour-guide spiels, and more.

The Santa Barbara Corpus of Spoken American English also forms part of the International Corpus of English (ICE). The Santa Barbara Corpus provides the main source of data for the spontaneous spoken portions of the American component of the International Corpus of English. In order to meet the specific design specifications of the International Corpus of English (allowing comparison between American and other national varieties of English), the Santa Barbara Corpus data have been supplemented by additional materials in certain genres (e.g. read speech), filling out the American component of ICE.


Variety not specified

-GUM corpus

GUM, the Georgetown University Multilayer corpus, is a small but very richly annotated corpus, annotated by students at Georgetown University (https://corpling.uis.georgetown.edu/gum/).

Data in the corpus comes from four sources, available under a Creative Commons license:

– Wikimedia interviews
– Wikinews news articles
– Wikivoyage travel guides
– wikiHow how-to guides

Contact person: Dr. Amir Zeldes