Valencia Research Methods Workshop – Using corpora to study online language learning

Unicollaboration Research Methods Workshop
Universitat de València & Universitat Politècnica de València (Spain)
6-7th October 2017
Corpus Analysis and Online Language Learning

Plenary session + Q&A: What electronic corpora can tell us about online foreign language learning

You can download presentation slides here

It is no longer uncommon to see corpora being used by researchers across different disciplines to gain insight into a wide range of areas that, for the most part, examine communication in various forms. Social scientists, in particular, have found in corpus linguistics (CL) a complementary method that enriches other standard approaches in their research fields. In fact, this complementarity adapts extremely well to research contexts where language is either the vehicle or the outcome of human activity. This session will present an overview of the research methodology behind corpus linguistics: I will briefly discuss the epistemological perspective that underpins CL practice and then outline the main methods currently used by CL researchers in language learning contexts. Special attention will be paid to areas of research where telecollaboration practitioners may wish to use language generated by learners and teachers as a source of data.

However, traditional second language acquisition (SLA) researchers have in the past been reluctant to embrace corpus linguistics research methodology. The nature of the data used in corpora may have been an underlying reason, but many other factors may have contributed to this lack of interest. Myles (2005) contributed enormously to changing the perception of corpora in the SLA community.



Quote 1

The word corpus is Latin for body (plural corpora). In linguistics a corpus is a collection of texts (a ‘body’ of language) stored in an electronic database. Corpora are usually large bodies of machine-readable text containing thousands or millions of words. A corpus is different from an archive in that often (but not always) the texts have been selected so that they can be said to be representative of a particular language variety or genre, therefore acting as a standard reference. Corpora are often annotated with additional information such as part-of-speech tags or to denote prosodic features associated with speech. Individual texts within a corpus usually receive some form of meta-encoding in a header, giving information about their genre, the author, date and place of publication etc. Types of corpora include specialised, reference, multilingual, parallel, learner, diachronic and monitor. Corpora can be used for both quantitative and qualitative analyses. Although a corpus does not contain new information about language, by using software packages which process data we can obtain a new perspective on the familiar (Hunston 2002: 2–3).

Baker et al. (2006). A glossary of corpus linguistics. Edinburgh: Edinburgh University Press.

Quote 2

Armchair linguistics does not have a good name in some linguistics circles. A caricature of the armchair linguist is something like this. He sits in a deep soft comfortable armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, “Wow, what a neat fact!”, grabs his pencil, and writes something down. Then he paces around for a few hours in the excitement of having come still closer to knowing what language is really like. (There isn’t anybody exactly like this, but there are some approximations.)

Charles Fillmore. Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82, 1991).

Quote 3

We as linguists should train ourselves specifically to be open to the evidence of long text. This is quite different from using the computer to be our servant in trying out our ideas; it is making good use of some essential differences between computers and people.

[…] I believe that we have to cultivate a new relationship between the ideas we have and the evidence that is in front of us. We are so used to interpreting very scant evidence that we are not in a good mental state to appreciate the opposite situation. With the new evidence the main difficulty is controlling and organizing it rather than getting it.

Sinclair. Trust the Text. (2004:17)

Quote 4

Register, genre, and style differences are fundamentally important for any student with a primary interest in language. For example, any student majoring in English, or in the study of another language like Japanese or Spanish, must understand the text varieties in that language. If you are training to become a teacher (e.g. for secondary education or for TESL), you will shortly be faced with the task of teaching your own students how to use the words and structures that are appropriate to different spoken and written tasks – different registers and genres. Other students of language are more interested in the study of literature or the creative writing of new literature, issues relating to the style perspective, since the literary effects that distinguish one novel (or poem) from the next are realized as linguistic differences.

Biber & Conrad (2009:4)

Quote 5

However, even though language in Applied Linguistics is seen as a linguistic, social, cultural, political, an aesthetic, and an educational local practice, which the researcher is called upon to illuminate, in my experience, the practice itself is often used by young researchers to uncritically illustrate a theory born elsewhere.
In order to be seen as legitimate scholars, young applied linguists are encouraged to read theory, preferably of the French kind, and ‘apply’ it to their data (see McNamara 2015). But rather than being inspired by the practice to ask new and critical questions about the theory, that would ultimately benefit both the practice and the theory, they too often give their data short shrift by merely translating the practice into the nomenclature of an existing theory (see Shuy 2015). This might increase the symbolic value of applied linguistic research, but not necessarily its value for the practice.

Kramsch (2015:456-7)

Myles (2005) on the future of learner corpora

Not only do we need large datasets in order to be able to generalize our findings, but some of the structures which are crucial for informing current debates are rarely found in learner data.

My concern is that the kind of studies that are being undertaken are too closely dependent on what corpora are at hand, and what software tools are available. (p.388)
The field needs to become much more ambitious in its use of new technologies, and in the kind of corpora it collects in order to address its current research agenda. (p. 388)

For this purpose, we need good quality longitudinal oral corpora, in a range of different L1/L2 combinations, for the reasons outlined in Section II. The possibilities offered by the computerized analysis of corpora are considerable, as I hope to have demonstrated. SLA researchers, however, need to make sure that not only the corpora they collect but also the computerized tools they use are adapted to their research agendas, rather than the other way round, i.e., adapting their research questions to the corpora or the tools readily available. Some sophisticated tools can be used, and it is high time that the pioneering work of L1 acquisitionists in this area is emulated by L2 researchers. (p. 388)

Myles, F. (2005). Interlanguage corpora and second language acquisition research. Second Language Research, 21(4), 373-391.

Quote 6

Several reasons can be given for why elicitation techniques are favoured in SLA research. For instance, Mackey & Gass (2005) provide the following reasons why metalinguistic data may be used in SLA research, as opposed to natural language use data: (i) the particular structure you want to investigate may not occur in natural production: it may be absent or there may not be enough instances, and, conversely, (ii) to answer your research question you may need to know what learners rule out as a possible L2 sentence: (a) presence of a particular structure/feature in the learners’ natural output does not necessarily indicate that the learners know (i.e. have a mental representation of) the structure, and (b) absence of a particular structure/feature in natural language use data does not necessarily indicate that learners do not know the structure. An additional reason is provided by Granger (2002: 6): it is difficult to control the variables that affect learner production in a non-experimental context. Additionally, L2 researchers have been typically trained in (quasi)experimental methods rather than in corpus methods, except for those studies conducted with source data from CHILDES (see Myles 2007b: 386 for a discussion). The consequence of all this is that the empirical base of SLA research tends to be relatively narrow, based on the language produced by a very limited number of subjects, which, as pointed out by Granger (2002: 6), raises questions about whether results can be generalised. But the methodological future of SLA looks promising, since some researchers are currently claiming that combining both naturalistic and experimental data is crucial to gain insight into the relation between the two types of data (e.g. Gilquin & Gries 2009).

Lozano, C., & Mendikoetxea, A. (2013). Learner corpora and Second Language Acquisition: The design and collection of CEDEL2. In A. Díaz-Negrillo, N. Ballier & P. Thompson (Eds.), Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: John Benjamins, pp. 65-100.

Quote 7

All in all, the contribution of learner corpus research so far has been much more substantial in description than interpretation of SLA data, documenting differences between native and non-native English, rather than explaining and addressing the key theoretical issues in SLA research (Granger 2004; Myles 2005). According to Granger (2004: 134–135), this is because learner corpus research has been mainly conducted by corpus linguists, rather than SLA specialists (Hasselgard 1999), and the type of learner language corpus that researchers have been most interested in (intermediate to advanced) was so poorly described in the literature that they felt the need to establish the facts before launching into theoretical generalisations.

As Tono summarises:
Many corpus-based researchers do not know enough about the theoretical background of SLA research to communicate with them [i.e. SLA researchers] effectively, while SLA researchers typically know little about what corpora can do for them. (Tono 2003: 806)

Lozano, C., & Mendikoetxea, A. (2013). Learner corpora and Second Language Acquisition: The design and collection of CEDEL2. In A. Díaz-Negrillo, N. Ballier & P. Thompson (Eds.), Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: John Benjamins, pp. 65-100.

Quote 8

Tony McEnery has outlined the reasons why corpus linguistics was largely ignored in the past, possibly because of the influence of Noam Chomsky. Prof. McEnery has placed this debate in a wider context in which different stakeholders fight a paradigm war: rationalist introspection versus evidence-driven analysis.

Quote 9

“Science is a subject that relies on measurement rather than opinion”, Brian Cox wrote in the book version of Human Universe, the BBC show. And I think he is right. Complementary research methodologies can only bring about better insights and better-informed debates.

Quote 10

Language production is a dynamic process and speakers enter interactions with a plurality of aims; both L1 and L2 speakers also vary in their mastery of the language. Recording and describing variation among speakers is one of the most valuable contributions of CL to the study of SLA, complementing the evidence from more controlled studies that employ sophisticated techniques to decrease natural heterogeneity and variation of language used in the context of meaningful communication. Findings from corpus-based studies can place this evidence into a broader perspective and thus contribute to strengthening of the theoretical base of SLA.

Gablasova, Dana, Brezina, Vaclav & McEnery, Tony. 2017. Exploring learner language through corpora: comparing and interpreting corpus frequency information. Language Learning 67(S1):130-154. DOI: 10.1111/lang.12226
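
As a hedged illustration of what comparing corpus frequency information involves in practice, the sketch below normalises raw counts to a common base so that corpora of different sizes can be compared. The function name, the corpus sizes and the counts are all invented for illustration; they are not taken from Gablasova et al. (2017) or from any real corpus.

```python
def normalised_frequency(raw_count, corpus_size, per=1_000_000):
    """Return the frequency of an item per `per` tokens of running text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of tokens")
    # Multiply before dividing so that round figures stay exact.
    return raw_count * per / corpus_size


# Hypothetical figures: the same word counted in a small learner corpus
# and in a much larger reference corpus.
learner_per_million = normalised_frequency(150, 500_000)
reference_per_million = normalised_frequency(1_200, 10_000_000)

print(f"learner corpus:   {learner_per_million:.1f} per million words")
print(f"reference corpus: {reference_per_million:.1f} per million words")
```

A single normalised figure per corpus hides the variation among individual speakers that the quote above highlights, so in a real study the same calculation would usually be repeated for each speaker or text before any comparison is drawn.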

Hands-on workshop. Corpus analysis: the basics.

You can find the slides here.
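
The slides are not reproduced here. As a minimal sketch of two basic operations that corpus analysis typically starts from, the Python below builds a word frequency list and a simple keyword-in-context (KWIC) concordance. The sample sentences, function names and tokenisation rule are assumptions made for illustration only; they are not taken from the workshop materials, and dedicated corpus tools handle tokenisation and annotation far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"\w+(?:'\w+)?", text.lower())


def frequency_list(tokens, top_n=5):
    """Return the top_n most frequent tokens with their raw counts."""
    return Counter(tokens).most_common(top_n)


def concordance(tokens, node, window=3):
    """Return (left context, node, right context) for every occurrence of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Invented learner-style sentences, used only to exercise the functions.
sample = (
    "I am agree with you. I think the course is very interesting. "
    "I think we can to meet on Friday."
)
tokens = tokenize(sample)

print(frequency_list(tokens, top_n=3))
for left, node, right in concordance(tokens, "think", window=2):
    print(f"{left:>20} [{node}] {right}")
```

The concordance view places every occurrence of the search word in the same column, which makes recurrent patterns to its left and right easier to spot than in running text.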