Corpus of North American Spoken English (CoNASE)

The Corpus of North American Spoken English (CoNASE), a 1.25-billion-word corpus of geolocated automatic speech-to-text transcripts, is now available in a beta version.

The corpus was created from 301,847 ASR transcripts from 2,572 YouTube channels, corresponding to 154,041 hours of video. The size of the corpus is 1,252,066,371 word tokens.
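The figures above imply a few rough averages, which the short Python sketch below derives. It uses only the four numbers quoted in the announcement; the variable names are illustrative, and the per-minute rate is per minute of video, not per minute of actual speech.

```python
# Figures quoted in the announcement.
TRANSCRIPTS = 301_847
CHANNELS = 2_572
HOURS = 154_041
TOKENS = 1_252_066_371

# Simple averages derived from those figures.
tokens_per_transcript = TOKENS / TRANSCRIPTS
transcripts_per_channel = TRANSCRIPTS / CHANNELS
minutes_per_transcript = HOURS * 60 / TRANSCRIPTS
tokens_per_minute = TOKENS / (HOURS * 60)

print(f"Tokens per transcript:   {tokens_per_transcript:,.0f}")
print(f"Transcripts per channel: {transcripts_per_channel:,.1f}")
print(f"Minutes per transcript:  {minutes_per_transcript:,.1f}")
print(f"Tokens per video minute: {tokens_per_minute:,.1f}")
```

On these numbers, an average transcript runs to roughly 4,150 tokens and about half an hour of video, at around 135 tokens per minute.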

The channels sampled in the corpus are associated with local government entities such as town, city, or county boards and councils, school or utility districts, regional authorities such as provincial or territorial governments, or other governmental organizations.

The transcripts are primarily of recordings of public meetings, although other genres are also present. Video transcripts have been assigned exact latitude-longitude coordinates using a geocoding script.

This information was distributed through the Corpora-List by Steven Coats, University of Oulu, Finland.

To cite the corpus, please use:

Coats, Steven. 2021. Corpus of North American Spoken English (CoNASE).

TAALES 2.2 is out: automatic analysis of lexical sophistication, Windows and Mac

From the TAALES website:

Kyle, K. & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly 49(4), pp. 757-786. doi: 10.1002/tesq.194

TAALES is a tool that measures over 400 classic and new indices of lexical sophistication, and includes indices related to a wide range of sub-constructs. TAALES indices have been used to inform models of second language (L2) speaking proficiency, first language (L1) and L2 writing proficiency, spoken and written lexical proficiency, genre differences, and satirical language.

Starting with version 2.2, TAALES provides comprehensive index diagnostics, including text-level coverage output (i.e., the percentage of words/bigrams/trigrams in a text covered by the index) and individual word/bigram/trigram index coverage information.

TAALES takes plain text files as input (it will process all plain text files in a particular folder) and produces a comma-separated values (.csv) spreadsheet that can be read by any spreadsheet software.
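To make the idea of text-level coverage concrete, here is a minimal Python sketch of the same workflow: read every plain text file in a folder, compute the percentage of word tokens found in a word list, and write one row per file to a .csv file. This is not TAALES's implementation. The tokenizer, the function names, the column names, and the word list are all assumptions made for illustration.

```python
import csv
import re
from pathlib import Path


def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes (a naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(tokens, index_words):
    """Return the percentage of tokens found in index_words (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    covered = sum(1 for tok in tokens if tok in index_words)
    return 100.0 * covered / len(tokens)


def process_folder(folder, index_words, out_csv):
    """Write one CSV row per .txt file in folder: filename, token count, coverage."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        tokens = tokenize(path.read_text(encoding="utf-8"))
        rows.append([path.name, len(tokens), round(coverage(tokens, index_words), 2)])
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "n_tokens", "percent_covered"])  # hypothetical column names
        writer.writerows(rows)
    return rows
```

A call such as `process_folder("essays", {"the", "of", "and"}, "coverage.csv")` (hypothetical folder, word list, and output name) would produce a spreadsheet with one line per text. A real lexical index would also cover bigrams and trigrams, which this sketch leaves out.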


Windows and Mac versions are available for free from the TAALES website.

The Conference on #NLP KONVENS: new deadline



The Conference on Natural Language Processing (“Konferenz zur Verarbeitung natürlicher Sprache”, KONVENS) aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. It allows researchers from all disciplines relevant to this field of research to present their work. The conference will take place September 19–21, 2016 in Bochum (Germany). We are pleased to announce that John Nerbonne and Barbara Plank will give invited talks at the conference.

Call for Papers

We welcome original, unpublished contributions on research, development, applications and evaluation, covering all areas of natural language processing, ranging from basic questions to practical implementations of natural language resources, components and systems.

The special theme of the 13th KONVENS is: “Processing non-standard data — commonalities and differences”.

A wide range of data can be considered “non-standard” because it deviates in one way or the other from standard written data such as newspaper texts. Examples include:
* data produced by language learners
* historical data
* data from social media
* (transcriptions of) spoken data

We especially encourage the submission of contributions comparing different types of non-standard data and their properties, focussing on their impact on natural language processing. For example, a feature common to many types of non-standard data is the use of non-standard spelling. However, spelling variation in learner data and in historical data arises for very different reasons and, most likely, results in very different types of non-standard spellings.

Topics that we would like to see addressed include:
* Common properties of (many) non-standard data, e.g. non-standard spelling, data sparseness, features of orality
* Impact of the commonalities and differences of non-standard data on the methods and tools that are applied to the data, e.g. normalization vs. tool adaptation, evaluation without gold standard, etc.

Important Dates
* June 7, 2016 (NEW): Paper submissions due
* July 18, 2016 (NEW): Notification of acceptance
* August 15, 2016: Camera-ready copy due
* September 19–21, 2016: Conference


We welcome two types of contributions:
* Full papers for oral presentation (8 pages plus references)
* Short papers for presentation as posters (4 pages plus references)

Short papers/posters can be combined with a system demonstration. Reviews will be anonymous. Accepted full and short papers will be published in the conference proceedings.

Submissions must conform to the formatting guidelines and must be made electronically through the conference website.

The conference languages are English and German. We encourage the submission of contributions in English.

CFP: Language and the new (instant) media

2016 PLIN Day, hosted by the Linguistics Research Unit of UCLouvain in Belgium.

After last year’s successful edition on Lexical complexity, this year’s topic is ‘Language and the new (instant) media’. The PLIN Day will take place on 12 May 2016 in Louvain-la-Neuve.

More information and registration (free for all Belgian participants)

The main objective of the workshop is to bring together specialists from a number of different but related fields to discuss the specificities of language in the new media. The workshop will thus offer a view of different approaches to language in the new media. The event will be structured around five keynote presentations and poster sessions. We are happy to welcome the following keynote speakers:
Patricia Bou-Franch (Universitat de València)
Walter Daelemans (Universiteit Antwerpen)
Elisabeth Stark (University of Zurich)
Caroline Tagg (The Open University)
Olga Volckaert-Legrier (Université Toulouse Jean Jaurès)

The poster sessions, which will include time for a short oral presentation of each poster, offer a forum for numerous other research trends. If you’re a PhD student, you’re eligible for the Best Poster Award!

Posters may deal with any of the following linguistic domains:
Discourse analysis
Language norms and contacts
Corpus Linguistics
Natural Language Processing
Language Statistics
We also invite companies that develop research or research-based applications concerning language and new media to submit a poster proposal.

Important dates:
Deadline for poster proposal submissions: 31 January 2016
Notification of acceptance: 1 March 2016
Submission of PowerPoint presentations for the poster boost session: 1 May 2016

We are also happy to inform you that the Annual Linguistic Day of the Linguistic Society of Belgium will be held at UCL on 13 May 2016, the day after the PLIN Day.

Best regards,

Louise-Amélie Cougnon (Girsef – Cental), Barbara De Cock (Valibel – Discours et variation) and Cédrick Fairon (Cental)

Follow @plindayucl on Twitter for the latest news!

Prof. Cédrick Fairon
Centre de traitement automatique du langage (CENTAL)
Place Blaise Pascal, 1, bte L3.03.12 B-1348-Louvain-la-Neuve
Tél. 32 (0)10 47 37 88 – Fax 32 (0)10 47 26 06

Adam Kilgarriff: a selection of papers and talks

Some readings to remember one of the most influential corpus linguists of the 20th and 21st centuries.

Using corpora for language research

Googleology is bad science

Grammar is to meaning as the law is to good behaviour. Corpus Linguistics and Linguistic Theory 3 (2): 195-198.