
#CFP The 10th Web as Corpus Workshop (WAC-10) Submission deadline: 24 April 2015

The 10th Web as Corpus Workshop (WAC-10)

First Call for Papers

10 August 2015, Herstmonceux Castle, UK

Submission deadline: 24 April 2015

Endorsed by the Special Interest Group of the ACL on Web as Corpus

The web has become increasingly popular as a source of linguistic data, not only within the NLP community, but also with lexicographers and linguists. Accordingly, web corpora continue to gain importance, given their size and diversity in terms of genres/text types. However, a number of issues in web corpus construction still need much research, ranging from questions of corpus design to more technical aspects of efficient construction of large corpora. Similarly, the systematic evaluation of web corpora, for example in the form of task-based comparisons to traditional corpora, has only lately shifted into focus.

For a decade now, the ACL SIGWAC, and especially the highly successful Web as Corpus (WAC) workshops, have served as a platform for researchers interested in building and working with web-derived corpora. Past workshops have been co-located with EACL, NAACL, LREC, WWW, and Corpus Linguistics.

This year we are excited to be co-located with Electronic lexicography in the 21st century: linking lexical data in the digital age (eLex 2015). This will be the first time that WAC has been co-located with a lexicography conference.

FIRST CALL FOR PAPERS

As in previous years, the 10th Web as Corpus workshop (WAC-10) invites original contributions pertaining to all aspects of web corpora, including data collection, cleaning, duplicate removal, document filtering, linguistic post-processing and annotation, and the use of web corpora in language technology and linguistics. Because of its co-location with a lexicography conference, WAC-10 particularly encourages submissions related to the use of web corpora in lexicography.

A major challenge in the construction of web corpora is assessing the quality of both the software used to build them and the corpora themselves. WAC-10 encourages submissions related to these evaluation issues.

SUBMISSION FORMAT

All submissions should follow the ACL-IJCNLP 2015 style guidelines and must be in PDF format.

Full paper submissions may consist of up to eight (8) pages of content plus any number of pages consisting of only references. Short papers may consist of up to four (4) pages of content plus any number of pages consisting of only references. Full papers will be distinguished from short papers in the proceedings.

Papers will be presented either orally or as posters at the workshop. There will be no distinction between papers presented orally and those presented as posters in the proceedings.

Reviewing of papers will be double-blind. Therefore, the paper must not include the authors’ names and affiliations. Furthermore, self-references that reveal the author’s identity, e.g., “We previously showed (Smith, 1991) …”, must be avoided. Instead, use citations such as “Smith (1991) previously showed …”. Papers not conforming to these requirements will be rejected without review.

We strongly recommend the use of the ACL-IJCNLP 2015 LaTeX style files or Microsoft Word style files. The style files and example documents will be available from the workshop website. We reserve the right to reject submissions that do not conform to these styles, including font and page size restrictions.

ORGANIZING COMMITTEE

Paul Cook, University of New Brunswick (paul.cook@unb.ca)
Roland Schäfer, Freie Universität Berlin (roland.schaefer@fu-berlin.de)
Egon Stemle, EURAC (egon.stemle@eurac.edu)

PROGRAM COMMITTEE (Confirmed so far)

Andrea Abel, European Academy Bolzano / Bozen
Felix Bildhauer, Freie Universität Berlin
Jesse Egbert, Brigham Young University
Stefan Evert, Friedrich-Alexander-Universität Erlangen-Nürnberg
Simon Krek, Jožef Stefan Institute
Lothar Lemnitzer, Berlin-Brandenburgische Akademie der Wissenschaften
Robert Lew, Adam Mickiewicz University in Poznań
Nikola Ljubešić, University of Zagreb
Carolin Müller-Spitzer, Institut für Deutsche Sprache
Siva Reddy, University of Edinburgh
Steffen Remus, TU Darmstadt
Pavel Rychly, Masaryk University
Serge Sharoff, University of Leeds
Yukio Tono, Tokyo University of Foreign Studies
Andreas Witt, Institut für Deutsche Sprache
Torsten Zesch, University of Duisburg-Essen

IMPORTANT DATES

24 April 2015: Paper submission deadline (23:59 GMT-12)
29 May 2015: Notification
19 June 2015: Camera-ready deadline
10 August 2015: WAC-10 Workshop

Free access to @ReCALL's most popular articles until 31 March

Including

Editorial: Researching uses of corpora for language teaching and learning

Alex Boulton and Pascual Pérez-Paredes

ReCALL / Volume 26 / Special Issue 02 / May 2014, pp 121 – 127

CILC 2015, 7th International Conference on Corpus Linguistics: Book of abstracts available @languagecorpora

English ENCOW14 web corpus now available in its first release version #ENCOW14A #corpuslinguistics

Through the corpora list
::::::::::::::::::::::::::::::::::

The English ENCOW14 web corpus is now available in its first release version ENCOW14A (16.8 GT full corpus, 9.6 GT shuffled). The shuffled version is completely free but available only to people working in academia.

At the same time, we are making available our new Colibri² web application hosted at webcorpora.org. It allows registered users to query the corpora or download the whole data sets. Colibri² also serves DECOW12AX (German, 8.3 GT), NLCOW14AX (Dutch, 4.7 GT), and SVCOW14AX (Swedish, 4.8 GT).

ENCOW14A was crawled in 2012 and 2014 in over 20 top-level domains and has undergone state-of-the-art deduplication, boilerplate removal, hyphenation repair and repair for run-together sentences (texrex). It is annotated with POS (Penn/TreeTagger), lemma (TreeTagger), chunks (TreeTagger), as well as dependency relations (MaltParser, experimental). It contains the following metadata: URL, Last-Modified date, crawl date, country and city geolocation, and document quality score as well as paragraph boilerplate scores.

Download & web access via Colibri² (free registration required):
https://webcorpora.org/

Corpus information:
http://corporafromtheweb.org/encow14/

COW is created at Freie Universität Berlin, German Grammar Group:
http://hpsg.fu-berlin.de/

All processing specific to web documents was done with texrex:
http://texrex.sourceforge.net/

ENCOW14 includes GeoLite data created by MaxMind, available from:
http://www.maxmind.com.

:::::::::::

Roland Schäfer (ENCOW14/COW), Felix Bildhauer (COW)

Most Relevant NLP Journals via NLPeople

This is a question and follow-up initiated by Eduardo César Garrido Merchán in the LinkedIn NLPeople group.

Meeting of the Association for Computational Linguistics (ACL)
Transactions of the Association for Computational Linguistics (ISSN: 2307-387X)
European Chapter of the ACL (EACL)
North American Chapter of the Association for Computational Linguistics (NAACL)
International Conference on Computational Linguistics (COLING)
Conference on Empirical Methods in Natural Language Processing (EMNLP)
Data & Knowledge Engineering
IEEE Transactions on Knowledge and Data Engineering
Computational Linguistics
International Conference on Computational Linguistics and Intelligent Text Processing
Text REtrieval Conference (TREC)
International Joint Conference on Natural Language Processing
SIGIR
ECIR
CICLing.org

NLP conference calendar: http://www.cs.rochester.edu/~tetreaul/conferences.html

Learner Corpora in Language Testing and Assessment

Edited by Marcus Callies and Sandra Götz

University of Bremen / Justus Liebig University, Giessen

ISBN 9789027203786

The aim of this volume is to highlight the benefits and potential of using learner corpora for the testing and assessment of L2 proficiency in both speaking and writing, reflecting the growing importance of learner corpora in applied linguistics and second language acquisition research. Identifying several desiderata for future research and practice, the volume presents a selection of original studies covering a variety of languages. It features studies that present carefully compiled new corpus resources which are tailor-made and ready for analysis in language testing and assessment (LTA), new tools for the automatic assessment of proficiency levels, and new methods of (self-)assessment with the help of learner corpora. Other studies suggest innovative research methodologies for operationalizing proficiency through learner corpus data. The volume is of particular interest to researchers in (applied) corpus linguistics, learner corpus research, and language testing and assessment, as well as to materials developers and language teachers.

Learner corpora in language testing and assessment: Prospects and challenges

Marcus Callies and Sandra Götz

1 – 10

New corpus resources, tools and methods

The Marburg Corpus of Intermediate Learner English (MILE)

Rolf Kreyer

13 – 34

Avalingua: Natural language processing for automatic error detection

Pablo Gamallo Otero, Marcos Garcia, Iria del Río and Isaac González López

35 – 58

Data commentary in science writing: Using a small, specialized corpus for formative self-assessment practices

Lene Nordrum and Andreas Eriksson

59 – 84

First steps in assigning proficiency to texts in a learner corpus of computer-mediated communication

Tim Marchand and Sumie Akutsu

85 – 112

Data-driven approaches to the assessment of proficiency

The English Vocabulary Profile as a benchmark for assigning levels to learner corpus data

Agnieszka Leńko-Szymańska

115 – 140

A multidimensional analysis of learner language during story reconstruction in interviews

Pascual Pérez-Paredes and María Sánchez-Tornel

141 – 162

Article use and criterial features in Spanish EFL writing: A pilot study from CEFR A2 to B2 levels

María Belén Díez-Bedmar

163 – 190

Tense and aspect errors in spoken learner English: Implications for language testing and assessment

Sandra Götz

191 – 216