5 recent papers on language complexity and learner language

Bulté, B., & Roothooft, H. (2020). Investigating the interrelationship between rated L2 proficiency and linguistic complexity in L2 speech. System, 102246.


This study investigates the relationship between nine quantitative measures of L2 speech complexity and subjectively rated L2 proficiency by comparing the oral productions of English L2 learners at five IELTS proficiency levels. We carry out ANOVAs with pairwise comparisons to identify differences between proficiency levels, as well as ordinal logistic regression modelling, allowing us to combine multiple complexity dimensions in a single analysis. The results show that for eight out of nine measures, targeting syntactic, lexical and morphological complexity, a significant overall effect of proficiency level was found, with measures of lexical diversity (i.e. Guiraud’s index and HD-D), overall syntactic complexity (mean length of AS-unit), phrasal elaboration (mean length of noun phrase) and morphological richness (morphological complexity index) showing the strongest association with proficiency level. Three complexity measures emerged as significant predictors in our logistic regression model, each targeting different linguistic dimensions: Guiraud’s index, the subordination ratio and the morphological complexity index.


The present study on the relationship between nine complexity measures and five different levels of oral proficiency, as measured by the IELTS speaking test, confirms previous studies which have found that learners at higher levels of proficiency tend to produce more complex language. Even though we found higher complexity scores in higher proficiency levels for measures of lexical, syntactic and morphological complexity, the observed patterns differ substantially across measures. If we only consider differences between adjacent proficiency levels, we observed a significant increase in morphological richness (as measured by the morphological complexity index) between levels 4 and 5, in lexical diversity (Guiraud’s index) between levels 5 and 6, and in overall syntactic (mean length of AS-unit), clausal (mean length of clause) and phrasal complexity (mean length of noun phrase) as well as lexical diversity (Guiraud’s index and HD-D) between levels 6 and 7. We did not observe significant differences in complexity between the highest two proficiency levels in our dataset (i.e. 7 and 8). In addition, we found that the Guiraud index, the subclause ratio and the morphological complexity index applied to verbs were significant predictors for proficiency level in our ordinal logistic regression model, explaining around two thirds of the variance in proficiency level.
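Guiraud's index, one of the lexical diversity measures reported above, is the number of word types divided by the square root of the number of tokens. The following is a minimal sketch of that calculation, not the authors' procedure: the regex tokenizer, the function name `guiraud_index` and the sample sentence are assumptions made for illustration, and the study's own transcription and tokenization choices are not described here.

```python
import math
import re


def guiraud_index(text: str) -> float:
    """Return Guiraud's index: types / sqrt(tokens).

    Tokenization is a simplifying assumption: lowercase the text and
    keep runs of letters and apostrophes as tokens.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    types = set(tokens)  # distinct word forms
    return len(types) / math.sqrt(len(tokens))


# Hypothetical learner utterance, used only to exercise the function.
sample = "the cat sat on the mat and the dog sat on the rug"
print(round(guiraud_index(sample), 3))
```

Because the denominator grows with the square root of text length rather than with length itself, the index is less sensitive to sample size than a raw type-token ratio, although it is not fully length-independent.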

Crossley, S. (2020). Linguistic features in writing quality and development: An overview. Journal of Writing Research, 11(3).


This paper provides an overview of how analyses of linguistic features in writing samples provide a greater understanding of predictions of both text quality and writer development and links between language features within texts. Specifically, this paper provides an overview of how language features found in text can predict human judgements of writing proficiency and changes in writing levels in both cross-sectional and longitudinal studies. The goal is to provide a better understanding of how language features in text produced by writers may influence writing quality and growth. The overview will focus on three main linguistic constructs (lexical sophistication, syntactic complexity, and text cohesion) and their interactions with quality and growth in general. The paper will also problematize previous research in terms of context, individual differences, and reproducibility.


While there are a number of potential limitations to linguistic analyses of writing, advanced NLP tools and programs have begun to address linguistic complications while better data collection methods and more robust statistical and machine learning approaches can help to control for confounding variables such as first language differences, prompt effects, and variation at the individual level. This means that we are slowly gaining a better understanding of interactions between linguistic production and text quality and writing development across multiple types of writers, tasks, prompts, and disciplines. Newer studies are beginning to also look at interaction between linguistic features in text (product measures) and writing process characteristics such as fluency (bursts), revisions (deletions and insertions) or source use (Leijten & Van Waes, 2013; Ranalli, Feng, Sinharay, & Chukharev-Hudilainen, 2018; Sinharay, Zhang, & Deane, 2019). Future work on the computational side may address concerns related to the accuracy of NLP tools, the classification of important discourse structures such as claims and arguments, and eventually even predictions of argumentation strength, flow, and style.

Importantly, we need not wait for the future because linguistic text analyses have immediate applications in automatic essay scoring (AES) and automatic writing evaluation (AWE), both of which are becoming more common and can have profound effects on the teaching and learning of writing skills. Current issues for both AES and AWE involve both model reliability (Attali & Burstein, 2006; Deane, Williams, Weng, & Trapani, 2013; Perelman, 2014) and construct validity (Condon, 2013; Crusan, 2010; Deane et al., 2013; Elliot et al., 2013; Haswell, 2006; Perelman, 2012), but more principled analyses of linguistic features, especially those that go beyond words and structures, are helping to alleviate those concerns and should only improve over time. That being said, the analysis of linguistic features in writing can help us not only better understand writing quality and development but also improve the teaching and learning of writing skills and strategies.

Díez-Bedmar, M. B., & Pérez-Paredes, P. (2020). Noun phrase complexity in young Spanish EFL learners’ writing: Complementing syntactic complexity indices with corpus-driven analyses. International Journal of Corpus Linguistics, 25(1), 4-35.


The research reported in this article examines Noun Phrase (NP) syntactic complexity in the writing of Spanish EFL secondary school learners in Grades 7, 8, 11 and 12 in the International Corpus of Crosslinguistic Interlanguage. Two methods were combined: a manual parsing of NPs and an automatic analysis of NP indices using the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC). Our results revealed that it is in premodifying slots that syntactic complexity in NPs develops. We argue that two measures, (i) nouns and modifiers (a syntactic complexity index) and (ii) determiner + multiple premodification + head (a NP type obtained as a result of a corpus-driven analysis), can be used as indices of syntactic complexity in young Spanish EFL learner language development. Besides offering a learner-language-driven taxonomy of NP syntactic complexity, the paper underscores the strength of using combined methods in SLA research.


Our research highlights the need for using combined methods of analysis that examine the same data from different perspectives. The use of statistical complexity analysis software (Kyle, 2016) has allowed us to account for every single noun and nominal group in the corpus. The range of indices in Kyle (2016) has allowed us to approach syntactic phenomena from a purely quantitative perspective. As a result, we have found that the use of the “Nouns as modifiers” index yields significant differences between Grades 8 and 12, which confirms our finding that premodification slots are of interest for the study of learner language development. The corpus-driven manual analysis of NPs, in turn, has allowed us to gain an in-depth understanding of the types of complexity patterns used by learners in the different grades. As a result of this approach, our research has produced a learner-generated taxonomy of NP syntactic complexity that can be used in studies that examine learner language in other contexts. By combining these two research methods, we hope to make a case for their integration and to enrich methodological pluralism (McEnery & Hardie, 2012; Römer, 2016). Moreover, the findings obtained with the two methods are consistent and thus show promising avenues for collaboration and complementarity.

Two methodological features of this study are worth considering. The fine-grained classification of NP types, which includes every NP type found in the corpus, may have determined the results of the statistical analysis: the more detailed the classification of NP, the more likely it is to obtain a low number of instances in some of the NP types. Another feature to be considered is that the manual parsing conducted did not include every single noun in the corpus. This may be seen as a limitation of this study. Another limitation lies in the use of automatic analysis software and POS tagging that was not written primarily to navigate learner language. The impact of these systems on learner-language analysis has rarely been explored in corpus linguistics, and we believe that these software solutions should be sensitive to the range of disfluencies of learner language. If the small number of errors found in the use of automatic tools in learner language are considered tolerable, the automatic analysis of complexity and frequency indices in learner language can be beneficial. Finally, this study has not offered a Contrastive Interlanguage Analysis (CIA) (Granger, 1996, 2015) as it is beyond the scope of this paper to look at other L1 learners or English as an L1.

Khushik, G. A., & Huhta, A. (2019). Investigating Syntactic Complexity in EFL Learners’ Writing across Common European Framework of Reference Levels A1, A2, and B1. Applied Linguistics.


The study investigates the linguistic basis of Common European Framework of Reference (CEFR) levels in English as a foreign language (EFL) learners’ writing. Specifically, it examines whether CEFR levels can be distinguished with reference to syntactic complexity (SC) and whether the results differ between two groups of EFL learners with different first languages (Sindhi and Finnish). This sheds light on the linguistic comparability of the CEFR levels across L1 groups. Informants were teenagers from Pakistan (N = 868) and Finland (N = 287) who wrote the same argumentative essay that was rated on a CEFR-based scale. The essays were analysed for 28 SC indices with the L2 Syntactic Complexity Analyzer and Coh-Metrix. Most indices were found to distinguish CEFR levels A1, A2, and B1 in both language groups: the clearest separators were the length of production units, subordination, and phrasal density indices. The learner groups differed most in the length measures and phrasal density when their CEFR level was controlled for. However, some indices remained the same, and the A1 level was more similar than A2 and B1 in terms of SC across the two groups.

Vercellotti, M. L. (2019). Finding variation: Assessing the development of syntactic complexity in ESL speech. International Journal of Applied Linguistics, 29(2), 233-247.


This paper examines the development and variation of syntactic complexity in the speech of 66 L2 learners over three academic semesters in an intensive English program. This investigation tracked development using hierarchical linear modeling with three commonly‐used, recommended measures of productive complexity (i.e., length of AS‐unit, clause length, subordination) and three exploratory measures of structural complexity (i.e., syntactic variety, weighted complexity scores, frequency of nonfinite clauses) to capture different aspects of syntactic complexity. All measures showed growth over time, suggesting that learners are not forced to prioritize certain aspects of the construct at the expense of others (i.e., no trade‐off effects) across development. The unexplained significant variation found in these data differed among the measures reinforcing notions of multidimensionality of linguistic complexity.


The results can inform the measurement choices and methodology for future English L2 research. As would be expected with language learning performance, there was substantial variation. L2 researchers likely want to use practical measures that capture the variation between individuals and across development. The variation in different parts of the measure’s models suggest that the measures capture separate aspects of complexity, and some suggestions can be offered. Subordination may serve as a practical, broad measure of complexity in instructed contexts. The easily calculated phrasal complexity revealed variation early in development, as did the weighted structural complexity measure. Moreover, researchers may want to consider using the weighted complexity measure for research investigating individual differences in language performance. One possibility is to create a measure based on standard deviation (e.g., De Clercq & Housen, 2017) of the weighted complexity measure, if the study’s purpose is to measure the variety of structural complexity in the language sample, rather than the growth of the developmentally‐aligned structural complexity. When investigating differences in language learning outcomes, general complexity and the weighted structural complexity may be useful, given the additional variation found in the models. The unexplained significant remaining variation between individuals is fodder for future longitudinal research. For instance, future research might consider how production may be influenced by the frequency and function of constructions in learners’ L1s, motivation (Verspoor & Behrens, 2011), or individual speaking style (Pallotti, 2009). Overall, this paper offers a unique comparison of syntactic complexity, both productive and structural complexity measures, advancing our understanding of this most complex construct of language performance.

Usage based in a nutshell (Ellis 2012)

UB Approaches some references

Usage-based theories of language hold that learners acquire constructions in a similar fashion, from the statistical abstraction of patterns of form-meaning correspondence in their usage experience, and that the acquisition of linguistic constructions can be understood in terms of the cognitive science of concept formation following the general associative principles of the induction of categories from experience of the features of their exemplars. In natural language, the Zipfian type-token frequency distributions of the occupants of each of these construction islands, their prototypicality and generality of function in these roles, and the reliability of mappings between these, together conspire to make language learnable. Phrasal teddy bears, formulaic phrases with routine functional purposes, play a large part in this experience, and the analysis of their components gives rise to abstract linguistic structure and creativity.

Is the notion of language acquisition being seeded by formulaic phrases and yet learner language being formula-light having your cake and eating it too?
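A Zipfian distribution is one in which an item's frequency falls off roughly in inverse proportion to its frequency rank, so a handful of items account for most tokens. The sketch below ranks the fillers of a single construction slot by token frequency. The verb list is invented for illustration and is not data from Ellis (2012); the helper name `rank_frequency` is likewise hypothetical.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, item, frequency) triples, most frequent item first."""
    counts = Counter(tokens)
    return [
        (rank, item, freq)
        for rank, (item, freq) in enumerate(counts.most_common(), start=1)
    ]


# Invented verb fillers for one construction slot (hypothetical data).
slot_fillers = ["go"] * 8 + ["come"] * 4 + ["walk"] * 2 + ["run"] * 2 + ["drive"]

table = rank_frequency(slot_fillers)
total = sum(freq for _, _, freq in table)

for rank, item, freq in table:
    # Share of all tokens in the slot taken by this filler.
    print(rank, item, freq, round(freq / total, 2))
```

In this toy sample the top-ranked filler takes close to half of the tokens, which is the kind of skew that usage-based accounts point to when they describe one prototypical exemplar leading the learning of a construction.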

Ellis, N. (2012). Formulaic Language and Second Language Acquisition: Zipf and the Phrasal Teddy Bear. Annual Review of Applied Linguistics, 32, 17-44.

Some references on Usage-based language learning approaches

Ellis, N. (2017). Chapter 6: Chunking in Language Usage, Learning and Change: I Don’t Know. From Part III: Chunking. Edited by Marianne Hundt (Universität Zürich), Sandra Mollin (Universität Heidelberg) and Simone E. Pfenninger (Universität Salzburg). Cambridge University Press, pp. 113-147.

Ellis, N. (2017). Cognition, Corpora, and Computing: Triangulating Research in Usage‐Based Language Learning. Language Learning, 67(S1), 40-65.

Ellis, Nick C., & Ferreira-Junior, Fernando. (2009). Construction Learning as a Function of Frequency, Frequency Distribution, and Function. Modern Language Journal, 93(3), 370-385.

Tyler, A. (2010). Usage-Based Approaches to Language and Their Applications to Second Language Learning. Annual Review of Applied Linguistics, 30, 270-291.

Tyler, A., & Ortega, L. (2016). Usage-based approaches to language and language learning: An introduction to the special issue. 8(3), 335-345.

Tyler, A. (2018). Nick C. Ellis, Ute Römer, Matthew Brook O’Donnell: Usage-based approaches to language acquisition and processing: Cognitive and corpus investigations of construction grammar. Cognitive Linguistics, 29(1), 155-161.

Weber, K., Christiansen, M. H., Indefrey, P., & Hagoort, P. (2018). Primed From the Start: Syntactic Priming During the First Days of Language Learning. Language Learning. https://doi.org/10.1111/lang.12327

EGP: investigating patterns of learner grammar development AAAL 2018 Chicago


The English Grammar Profile: investigating patterns of learner grammar development

Anne O’Keeffe, Mary Immaculate College, University of Limerick

Geraldine Mark, Mary Immaculate College, University of Limerick

Pascual Pérez-Paredes, University of Cambridge

Check out our handout here.


The CEFR: http://www.cambridgeenglish.org/exams-and-tests/cefr/

The English Grammar Profile: http://www.englishprofile.org/english-grammar-profile/egp-online

Cambridge Learner Corpus: https://www.sketchengine.co.uk/cambridge-learner-corpus/

Sketch Engine universal POS tags: https://www.sketchengine.co.uk/universal-pos-tags/



Ellis, N. C. (2003). Constructions, chunking, and connectionism: The emergence of second language structure. In C. Doughty & M. H. Long (Eds.), Handbook of Second Language Acquisition (pp. 33-68). Oxford, UK: Blackwell.

Ellis, N. C. (2012). Formulaic language and second language acquisition: Zipf and the phrasal teddy bear. Annual Review of Applied Linguistics, 32, 17-44.

Simpson-Vlach, R., & Ellis, N. C. (2010). An Academic Formulas List (AFL). Applied Linguistics, 31, 487-512.

Ellis, N. C., Römer, U., & O’Donnell, M. B. (2016). Usage-based Approaches to Language Acquisition and Processing: Cognitive and Corpus Investigations of Construction Grammar. Language Learning Monograph Series. Wiley-Blackwell.

Larsen-Freeman, D. (2006). The emergence of complexity, fluency, and accuracy in the oral and written production of five Chinese learners of English. Applied Linguistics, 27(4), 590-619.

Milton, J., & Meara, P. (1995). How periods abroad affect vocabulary growth in a foreign language. ITL Review of Applied Linguistics, 107-108, 17-34.

O’Keeffe, A., & Mark, G. (2017). The English Grammar Profile of learner competence: Methodology and key findings. International Journal of Corpus Linguistics, 22(4), 457-489. https://benjamins.com/#catalog/journals/ijcl.14086.oke/fulltext

Römer, U., O’Donnell, M. B., & Ellis, N. C. (2014). Second language learner knowledge of verb–argument constructions: Effects of language transfer and typology. The Modern Language Journal, 98(4), 952-975.

Thewissen, J. (2013). Capturing L2 accuracy developmental patterns: Insights from an error-tagged learner corpus. The Modern Language Journal, 97(S1), 77-101.

Deadline of the CfP for LCR 2017 extended to 31 Jan 2017 #corpuslinguistics

The deadline of the CfP for LCR 2017 has been extended to Tuesday, 31 January 2017

4th Learner Corpus Research Conference, Bolzano/Bozen, 5-7 October 2017

Call for Papers

Following the successful conferences in Louvain-la-Neuve (Belgium) in 2011, Bergen (Norway) in 2013 and Nijmegen (the Netherlands) in 2015, the 4th Learner Corpus Research Conference will be hosted by the Institute for Specialised Communication and Multilingualism at EURAC Research, Bolzano/Bozen, Italy. The conference, organized under the aegis of the Learner Corpus Association, aims to be a showcase for the latest developments in the field and will feature full paper presentations, work in progress reports, poster presentations, software demos and a book exhibition.

The theme of LCR 2017 is “Widening the Scope of Learner Corpus Research”.

Conference Venue: European Academy Bozen/Bolzano – EURAC Research

Confirmed keynote speakers:

  • Philip Durrant (University of Exeter, United Kingdom)
  • Stefan Th. Gries (University of California, Santa Barbara, U.S.A.)
  • Stefania Spina (Università per Stranieri Perugia, Italy)

The keynote speakers will address the theme of LCR 2017 in their respective lectures on L1 writing development and Learner Corpus Research, quantitative methods in Learner Corpus Research, and Learner Corpus Research and Italian as L2. We welcome papers that address all aspects of Learner Corpus Research, in particular the following ones:

  • Corpora as pedagogical resources
  • Corpus-based transfer studies
  • Data mining and other explorative approaches to learner corpora
  • English as a Lingua Franca
  • Error detection and correction of learner language
  • Extracting language features from learner corpora
  • Innovative annotations in learner corpora
  • Language for academic/specific purposes
  • Learner varieties
  • Learner corpora for less commonly taught languages
  • Learner Corpus Research and the Common European Framework of Reference for Languages (CEFR)
  • Learner Corpus Research and Natural Language Processing
  • Links between Learner Corpus Research and other research methodologies (e.g. experimental methods)
  • Search engines for learner corpora
  • Statistical methods in learner corpus studies
  • Task and learner variables

There will be four different categories of presentation:

  • Full paper (20 minutes + 10 minutes for discussion)
  • Work in Progress (WiP) report (10 minutes + 5 minutes for discussion)
  • Corpus/software demonstration
  • Poster

The Work in Progress reports and posters are intended to present research still at a preliminary stage and on which researchers would like to get feedback.

The language of the conference is English.


Your abstract should be between 600 and 700 words (excluding a list of references). Abstracts should provide the following:

  • clearly articulated research question(s) and its/their relevance;
  • the most important details about research approach, data and methods;
  • the main results and their interpretation.

Abstracts should be submitted through EasyChair (https://easychair.org/conferences/?conf=lcr2017) by Tuesday 31 January 2017 (new deadline, extended from Sunday 15 January 2017). Please follow the instructions provided on the conference website (http://lcr2017.eurac.edu).

Please note: The Learner Corpus Association will award the best paper and the best poster presentation given by a PhD student. Only LCA members can participate in the competition. Members interested in entering the competition must indicate so when submitting their abstracts.

Abstracts will be reviewed anonymously by the scientific committee. Notification of the outcome of the review process will be sent by 31 March 2017.


LCR2017 – Preconference workshop in honour of Professor Sylviane Granger

“LCR at the interfaces”, 4 October 2017, 15.00 to 18.00

This workshop, organized in honour of Sylviane Granger, will feature a series of invited speakers whose work has greatly contributed to the development of LCR. 

Four key interfaces will be discussed during the workshop:

“The interfaces between LCR and contrastive analysis” (Hilde Hasselgård and Signe Oksefjell Ebeling)

“The interfaces between LCR and SLA” (Nina Vyatkina)

“The interfaces between LCR and lexicography” (tbc)

“The interfaces between LCR and NLP” (tbc)

Join us for this event, which promises to be a landmark in LCR history!


The LCR 2017 organising committee

Andrea Abel (EURAC Research)
María Belén Díez-Bedmar (Universidad de Jaén)
Daniela Gasser (EURAC Research)
Aivars Glaznieks (EURAC Research)
Verena Lyding (EURAC Research)
Lionel Nicolas (EURAC Research)

The LCR 2017 scientific committee

Andrea Abel (EURAC Research)
Katherine Ackerley (Università degli Studi di Padova)
Annelie Ädel (Dalarna University)
Nicolas Ballier (Université Paris Diderot – Paris 7)
María Belén Díez-Bedmar (Universidad de Jaén)
Marcus Callies (Universität Bremen)
Erik Castello (Università degli Studi di Padova)
Francesca Coccetta (Università Ca’Foscari Venezia)
Pieter de Haan (Radboud Universiteit Nijmegen)
Hilde Hasselgård (Universitetet i Oslo)
Sandra Deshors (New Mexico State University)
Ana Diaz-Negrillo (Universidad de Granada)
Michael Flor (ETS)
John Flowerdew (City University of Hong Kong)
Lynne Flowerdew (independent researcher)
Fanny Forsberg Lundell (Stockholm University)
Gaëtanelle Gilquin (University of Louvain)
Sandra Götz (Justus Liebig Universität Gießen)
Solveig Granath (Karlstad University)
Sylviane Granger (Université catholique de Louvain)
Nicholas Groom (University of Birmingham)
Jirka Hana (Charles University Prague)
Shin’ichiro Ishikawa (Kobe University)
Jarmo Harri Jantunen (University of Jyväskylä)
Scott Jarvis (Ohio University)
Marie Källkvist (Lund University Sweden)
Agnieszka Lenko-Szymanska (University of Warsaw)
Cristóbal Jesús Lozano Pozo (Universidad de Granada)
Anke Lüdeling (Humboldt-Universität Berlin)
Carla Marello (Università degli Studi di Torino)
Fanny Meunier (Université catholique de Louvain)
Detmar Meurers (Universität Tübingen)
Florence Myles (University of Essex)
Susan Nacey (Hedmark University College)
Lionel Nicolas (EURAC Research)
Michael O’Donnell (Universidad Autónoma de Madrid)
Signe Oksefjell Ebeling (Universitetet i Oslo)
Magali Paquot (Université catholique de Louvain/FNRS)
Pascual Pérez-Paredes (University of Cambridge)
Tom Rankin (Vienna University of Economics and Business)
Paul Rayson (UCREL, Lancaster University)
Ute Römer (University of Michigan)
Anna Siyanova-Chanturia (Victoria University of Wellington)
Jennifer Thewissen (Universiteit Antwerpen)
Yukio Tono (Tokyo University of Foreign Studies)
Nina Vyatkina (University of Kansas)
Heike Zinsmeister (Universität Hamburg)