Lexicoder automated content analysis of text

Lexicoder is Java-based, multi-platform software for automated content analysis of text. Lexicoder was developed by Lori Young and Stuart Soroka and programmed by Mark Daku (initially at McGill University; now at Penn, Michigan, and McGill, respectively).

The current version of the software (2.0) is freely available – for academic use only. Additions and revisions will also be released here as they become available. In addition, the Lexicoder Sentiment Dictionary, a dictionary designed to capture the sentiment of political texts, is available formatted for Lexicoder or WordStat, and is also adaptable to other content-analytic software. Work on Topic Dictionaries, based on the Policy Agendas coding scheme, is also underway.
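
For readers new to dictionary-based content analysis, here is a minimal sketch, in Python, of how a sentiment dictionary of this kind is usually applied: count matches of positive and negative entries (with trailing wildcards) and compute a net tone per text. The category names and entries below are invented for illustration; they are not the Lexicoder Sentiment Dictionary itself, nor its Java implementation.

```python
import re
from collections import Counter

# Toy dictionary in the spirit of the Lexicoder Sentiment Dictionary.
# A trailing "*" marks a wildcard suffix. All entries are invented.
TOY_DICT = {
    "positive": ["good", "support*", "agree*", "benefit*"],
    "negative": ["bad", "crisis", "fail*", "oppos*"],
}

def compile_patterns(entries):
    """Turn dictionary entries (optionally ending in *) into word regexes."""
    pats = []
    for e in entries:
        stem = re.escape(e[:-1]) + r"\w*" if e.endswith("*") else re.escape(e)
        pats.append(re.compile(r"\b" + stem + r"\b"))
    return pats

def score(text, dictionary=TOY_DICT):
    """Count dictionary hits per category and return a net tone score."""
    text = text.lower()
    counts = Counter()
    for cat, entries in dictionary.items():
        for pat in compile_patterns(entries):
            counts[cat] += len(pat.findall(text))
    n_tokens = max(len(text.split()), 1)
    net_tone = 100.0 * (counts["positive"] - counts["negative"]) / n_tokens
    return counts, net_tone  # net tone = (pos - neg) hits per 100 tokens

if __name__ == "__main__":
    counts, tone = score("The government opposed the bill, citing a budget crisis, "
                         "but supporters argued it would benefit voters.")
    print(counts, round(tone, 2))
```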

Distributed through the LinkedIn group The WebGenre R&D Group.

1st Intl. NLP for Informal Text – Deadline 17/4


The 1st International Workshop on Natural Language Processing for Informal Text (NLPIT 2015)
In conjunction with The International Conference on Web Engineering (ICWE 2015)
June 23, 2015, Rotterdam, The Netherlands
http://wwwhome.cs.utwente.nl/~badiehm/nlpit2015/

Overview
The rapid growth of Internet usage over the last two decades has added new challenges to understanding informal user-generated content (UGC) on the Internet. Textual UGC refers to textual posts on social media, blogs, emails, chat conversations, instant messages, forums, reviews, or advertisements that are created by end-users of an online system. A large portion of the language used in textual UGC is informal. Informal text is a style of writing that disregards grammar and uses a mixture of abbreviations and context-dependent terms. The straightforward application of state-of-the-art Natural Language Processing approaches to informal text typically results in significantly degraded performance, for several reasons: the lack of sentence structure; the lack of sufficient context; the rarity of the entities involved; the noisy, sparse content of users’ contributions; and the unreliable facts they contain. The aim of this workshop is to bring researchers’ attention to the opportunities and challenges involved in informal text processing. In particular, we are interested in discussing informal text modeling, normalization, mining, and understanding, in addition to the various application areas in which UGC is involved.

Topics

We invite submissions on topics that include, but are not limited to, the following core NLP approaches for informal UGC: language identification, classification, clustering, filtering, summarization, tokenization, segmentation, morphological analysis, POS tagging, parsing, named entity extraction, named entity disambiguation, relation/fact extraction, semantic annotation, sentiment analysis, language normalization, informality modeling and measuring, language generation, handling uncertainties, machine translation, ontology construction, dictionary construction, etc.

Submission

Authors are invited to submit original work not submitted to another conference or workshop. Workshop submissions may be full papers or short papers. Paper length should not exceed 12 pages for full papers and 6 pages for short papers. All papers should follow Springer’s LNCS format. Papers in PDF can be submitted via the EasyChair conference system: https://easychair.org/conferences/?conf=nlpit2015. Each submission will receive, in addition to a meta-review, at least two double-blind peer reviews. Each full paper will get 25 minutes of presentation time; short papers will get 5 minutes of presentation time in addition to a poster. Besides papers, we also plan to have an invited talk by a renowned scientist on a topic relevant to the workshop. Workshop proceedings will be published as part of the ICWE 2015 workshop proceedings. To contact the NLPIT 2015 organization team, please send an e-mail to nlpit2015@easychair.org.

Deadlines

– Submission deadline: April 17, 2015
– Notification deadline: May 17, 2015
– Camera-ready version: May 24, 2015
– Workshop date: June 23, 2015

Message distributed through the Corpora mailing list.

Ideology in corporate language

Ruth Breeze

Ideology in corporate language: discourse analysis using Wmatrix3

2013 annual reports from 16 leading companies in financial services, mining, food and pharmaceuticals

Parts: the first part is non-technical, discursive, and visually interesting

Reference corpus: first the BNC Sampler Business & BNC informative texts, but then only BNC Business

Use of semantic categories

Three case studies: size (big), time (begin), and cause and effect

Size: Focus on growth, large, expanding, substantial. Not only adjectives are interesting here.
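
For context, Wmatrix flags semantic domains like these by comparing their frequencies in the study corpus against the reference corpus with a log-likelihood (keyness) statistic. The sketch below shows that standard calculation; the counts are invented and not taken from Breeze’s data.

```python
from math import log

def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) keyness statistic.
    a = category frequency in the study corpus
    b = category frequency in the reference corpus
    c = total tokens in the study corpus
    d = total tokens in the reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency, study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency, reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * log(a / e1)
    if b > 0:
        g2 += b * log(b / e2)
    return 2 * g2

# Invented counts for a semantic category such as "Size: big" in the
# annual-report corpus versus a business reference corpus.
print(round(log_likelihood(a=420, b=910, c=250_000, d=1_000_000), 2))
```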

Conclusions:

Ideology of cause and effect

Dynamic approach to time

Emphasis on size and importance

Salient semantic areas: investigation, tough, strong, attentive, help & give, in power, belonging to a group

Differences: only in domain/topic focus, with probably different stresses on newness and the green economy

Big data and corpus linguistics

AELINCO 2015 Conference, U. Valladolid, Spain


Andrew Hardie keynote

What follows are my own notes and my understanding of Hardie’s keynote.

How big is big data?

N= ALL?

No manual curation of the database

Must be mined or statistically summarised (manual analysis not possible)

Pattern finding: trend modelling, data mining & machine learning

Language big data: Google n-gram

A revolutionary change for language and linguistics?

Textual big data studies are done by specialists from outside linguistics

Limitations of Google when used with no language training

Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books”, Science 331 (2011). Culturomics. What is there?

Quantitative findings, but otherwise pretty predictable and very much just frequency counts. In actual fact, the study was not backed by any expert in corpus linguistics. Steven Pinker was involved in the paper, and the whole thing was treated as if they had reinvented the wheel.
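
To make the “frequency counts” point concrete: a culturomics-style trend is essentially a word’s yearly hit count normalised by the yearly token total, as in the toy sketch below. The numbers are invented, not real Google Books n-gram data.

```python
# Toy illustration of a culturomics-style frequency trend: one word's
# yearly hits divided by the yearly token total. All numbers invented.
yearly_word_counts = {1900: 120, 1950: 950, 2000: 4300}
yearly_total_tokens = {1900: 2_000_000, 1950: 9_000_000, 2000: 30_000_000}

relative_freq = {
    year: 1_000_000 * yearly_word_counts[year] / yearly_total_tokens[year]
    for year in yearly_word_counts
}
for year, per_million in sorted(relative_freq.items()):
    print(year, round(per_million, 1), "per million tokens")
```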

Borin et al. have written papers trying to “salvage” the whole culturomics movement from its ignorance.

New “happiness” analyses are trendy, but what do they have to offer? They come with lots of problems and shortcomings. I think corpus analysis is becoming mainstream and more visible in specialized journals. The price of fame?

Linguistically naive (risibly so) research done by non-linguists


Paul Rayson keynote

Larger and larger corpora have been available since the Brown corpus in the 1960s

Mura Nava’s resource. An interesting timeline of corpus analysis tools.

SAMUELS: Semantic Annotation and Mark-up for Enhancing Lexical Searches

Overcoming problems in textual analysis: fused forms, archaic forms, apostrophes, and many, many others. Searching for words is a challenge > frequencies are split across multiple spellings.

VARD
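
A toy sketch of the problem and of what a normaliser like VARD buys you: variant spellings split a word’s frequency until they are mapped onto a standard form. The hand-written lookup below is only an illustration; the real tool uses far more sophisticated matching than this.

```python
from collections import Counter

# Hand-written spelling-variant map, purely for illustration; a real
# normaliser such as VARD derives these mappings automatically.
VARIANTS = {"doe": "do", "haue": "have", "ye": "the", "loue": "love"}

tokens = "ye king doe not haue loue for ye people".split()

raw = Counter(tokens)                                      # counts split across variants
normalised = Counter(VARIANTS.get(t, t) for t in tokens)   # variants merged

print(raw["ye"], raw["the"])   # 2 and 0: the frequency of "the" is split
print(normalised["the"])       # 2: merged after normalisation
```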

USAS semantic tagger

Full text tagging (as opposed to trends in “textual big data” analysis).

Modern & historical taggers

Disambiguation methods are essential
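
A toy sketch of why disambiguation matters for semantic tagging: the same word can carry several candidate semantic tags, and some decision over the context has to pick one. The tag labels and cue words below are invented, and this is not how the USAS tagger actually disambiguates.

```python
# Toy semantic-tag disambiguation by context overlap.
# Tag labels are loosely USAS-flavoured but invented for this sketch.
CANDIDATES = {
    "bank": {
        "I1 (money)": {"loan", "account", "deposit", "interest"},
        "W3 (geography)": {"river", "shore", "water", "muddy"},
    },
}

def disambiguate(word, context_words):
    """Pick the candidate tag whose cue words overlap most with the context."""
    tags = CANDIDATES.get(word)
    if not tags:
        return None
    context = set(context_words)
    return max(tags, key=lambda tag: len(tags[tag] & context))

print(disambiguate("bank", ["the", "river", "bank", "was", "muddy"]))
print(disambiguate("bank", ["a", "loan", "from", "the", "bank"]))
```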

Paul discusses the Historical Thesaurus of English

The whole annotation system: [slide photo]

I guess this is the missing part in big data as practised by non-linguists.