Enhancing and extending corpora and corpora tools for learning and teaching



Valoriser et développer les outils autour des corpus dans une perspective didactique / Enhancing and extending corpora and corpora tools for learning and teaching

Mardi 27 mai / Tuesday, May 27th

Salle/Room 205, Site Rabelais, UJF Valence, France


Programme

9h30 – Speed-dating : Présentations/Presentations

10h – Présentation et discussion autour du livre/Presentation and discussion about the book « Des documents authentiques aux corpus. Démarches pour l’apprentissage des langues », Boulton et Tyne (2014). Discussion autour de l’abondance de matières exploitables dans les corpus et de leur sous-exploitation dans l’enseignement des langues/Including the abundance of exploitable corpus material and its general under-use in language teaching.

Conférencier/Speaker: Alex Boulton

11h – Présentation de la plate-forme Chamilo : comment l’utiliser pour les corpus ?/Presentation of the Chamilo platform: how can it be used for corpora? Suivi d’une discussion en français et en anglais/Followed by a discussion in French and English.

Jérémie Grépiloux et Hubert Borderiou (SIMSU)

13h30 – Pedagogical uses of corpora: theories and practices/Utilisations pédagogiques des corpus : théories et pratiques. 20-minute presentation followed by a group discussion.

Conférencier/Speaker: Pascual Pérez-Paredes

14h30 – Speed-dating : Consultation en ligne des corpus/Consulting on-line corpora: showing and viewing corpora in the computer lab.

16h – Bilan de la journée et projets/Summary of the day and projects

Cristelle Cavalla and Laura Hartwell

Inscriptions (gratuites et obligatoires)/Mandatory free registration:

https://docs.google.com/forms/d/118xpaiTACRMW5KA5ja92oEGJqZ5Q6BUmqfVmSPq41U0/viewform

Logistics: Sylvain Perraud, Sylvain.Perraud@gmail.com (Compte rendu/minutes)

Contacts: Cristelle.Cavalla@univ-paris3.fr, Laura.Hartwell@ujf-grenoble.fr

References

SACODEYL : http://www.um.es/sacodeyl/

Chamilo : http://www.chamilo.org/fr

Scientext : http://scientext.msh-alpes.fr/scientext-site-en/spip.php?article9

EmoBase/EmoProf : http://emolex.u-grenoble3.fr/emoBase/

Full-text data for the two largest BYU corpora

I have received this through the CORPORA List:
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

At http://corpus.byu.edu/full-text/ you can now download full-text data for the two largest BYU corpora:

Corpus of Contemporary American English (COCA). 440 million words of downloadable text; the largest, most up-to-date, publicly available corpus of English that is balanced for genre (spoken, fiction, magazine, newspaper, and academic).

The Corpus of Global Web-Based English (GloWbE). 1.8 billion words of downloadable text, divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc.). About 60% of the texts come from blogs, which capture very informal language.

With this full-text data, you will have the actual corpora on your computer, and you can search the data in any way that you’d like. You can generate your own frequency data, collocates, n-grams, or concordance lines; you can search by word, lemma, and part of speech; and you can carry out complex syntactic and semantic searches offline. You can even modify the lexicon and sources tables to search the corpora in ways that are not possible via the standard web interfaces.

The data comes in three different formats (see samples): data for relational databases (info), word/lemma/PoS (vertical), and linear text (horizontal). When you purchase the data, you purchase the rights to any and all of these formats.
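As an illustration of what working with the full-text data offline might look like, here is a minimal Python sketch that builds a lemma frequency list from a file in the word/lemma/PoS (vertical) format. The file name, tab separation, and column order are assumptions made for the example, not the documented layout of the BYU downloads.

    # Minimal sketch: lemma frequency list from a vertical-format corpus file.
    # Assumed layout (illustrative only): one token per line,
    # tab-separated columns in the order word <TAB> lemma <TAB> PoS.
    from collections import Counter

    def lemma_frequencies(path, pos_prefix=None):
        """Count lemmas, optionally restricted to a PoS tag prefix (e.g. "nn")."""
        counts = Counter()
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                parts = line.rstrip("\n").split("\t")
                if len(parts) < 3:
                    continue  # skip blank or malformed lines
                lemma, pos = parts[1], parts[2]
                if pos_prefix and not pos.lower().startswith(pos_prefix):
                    continue
                counts[lemma.lower()] += 1
        return counts

    if __name__ == "__main__":
        # Hypothetical file name; point this at an actual downloaded file.
        frequencies = lemma_frequencies("coca_sample_vertical.txt", pos_prefix="nn")
        for lemma, count in frequencies.most_common(20):
            print(lemma, count)

The same loop extends naturally to collocates or n-grams by keeping a window of preceding tokens instead of counting single lemmas.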


Reading concordances is not a trivial task

The methodological transfer from the CL research area to the applied ring of language learning and teaching underwent no adaptation, and thus learners were presented with the same tools, corpora and analytical tasks as well-trained and professional linguists.

[…]

Reading concordances is, by no means, a trivial task. Sinclair (1991) recommends a complex procedure which involves five distinct stages. Let us review very briefly what they entail. The first stage is that of initiation: learners look to the left and to the right of the nodes and determine the dominant pattern. Then, learners are prompted to interpret and hypothesize about what it is that these words have in common. Thirdly comes the consolidation stage, where students corroborate their hypothesis by looking more closely at its variations. After this, the findings have to be reported and, finally, a new round of observations starts. Although typically reduced in language classrooms, this procedure is common in the possibilities scenario and certainly characterises the so-called bottom-up approach (Mishan, 2004: 223).

A recent analysis (Kreyer, 2008) deconstructs the idea of corpus competence into different skills, namely interpreting corpus data, knowledge about corpus design, knowledge about resources on the Internet, some linguistic background, knowledge about how to use concordances and, finally, some corpus linguistics background. This is a positive effort in the right direction, as the author admits the need to create the conditions for the use of corpora in the language classroom; in other words, Kreyer recognizes that pedagogic mediation is necessary if we want to turn the corpus into a learning tool. Notwithstanding, the challenges are significant.

Pérez-Paredes, P. (2010). Corpus Linguistics and Language Education in Perspective: Appropriation and the Possibilities Scenario. In T. Harris & M. Moreno Jaén (Eds.), Corpus Linguistics in Language Teaching (pp. 53-73). Peter Lang.
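To make the initiation stage concrete, looking to the left and to the right of the node, here is a minimal KWIC concordance sketch in Python. The tokenisation, window size, and alignment are illustrative assumptions rather than anything prescribed by the chapter.

    import re

    def kwic(text, node, window=5):
        """Return simple KWIC lines: left context, [node], right context."""
        tokens = re.findall(r"\w+|[^\w\s]", text.lower())  # naive tokenisation
        lines = []
        for i, token in enumerate(tokens):
            if token == node.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>40}  [{token}]  {right}")
        return lines

    if __name__ == "__main__":
        sample = ("The evidence suggests that learners need training, "
                  "because reading evidence in concordance lines is hard.")
        for line in kwic(sample, "evidence", window=4):
            print(line)

Sorting such lines by the first word to the left or to the right of the node is one simple way to make the dominant pattern easier to spot.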


Extracting n word phrases in large texts

This is a summary of resources posted on the [Corpora-List] in early 2014.

CMU-Cambridge Statistical Language Modeling toolkit

http://mi.eng.cam.ac.uk/~prc14/toolkit.html

Sketch Engine

http://www.sketchengine.co.uk/documentation/wiki/SkE/NGrams

Laurence Anthony’s AntConc

http://www.antlab.sci.waseda.ac.jp/software.html

kfNgram

http://www.kwicfinder.com/kfNgram/kfNgramHelp.html

Colibri

Software for the extraction of n-grams as well as patterns that are not consecutive (skipgrams). The software is written in C++ for speed and memory efficiency, but comes with a Python binding for use from Python scripts, and it also ships a standalone command-line tool. A minimal, tool-independent sketch of the two notions is given at the end of this list.

https://github.com/proycon/colibri-core

http://proycon.github.io/colibri-core/doc/

Maarten van Gompel

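As mentioned above, here is a minimal Python sketch of consecutive n-grams and single-gap skipgrams over a token list. The whitespace tokenisation and the gap marker ("{*}") are illustrative assumptions and are not tied to any of the tools listed.

    from collections import Counter

    def ngrams(tokens, n):
        """All consecutive n-grams in a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def skipgrams(tokens, n=3):
        """Skipgrams of length n with exactly one internal token replaced by a gap."""
        grams = []
        for gram in ngrams(tokens, n):
            for skip in range(1, n - 1):  # never skip the first or last position
                grams.append(gram[:skip] + ("{*}",) + gram[skip + 1:])
        return grams

    if __name__ == "__main__":
        tokens = "to be or not to be that is the question".lower().split()
        print(Counter(ngrams(tokens, 2)).most_common(3))
        print(Counter(skipgrams(tokens, 3)).most_common(3))

Dedicated tools such as those listed above remain the better choice for large texts; the point of the sketch is only to show what is being counted.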