A linguistic taxonomy of registers on the searchable web #cl2015



Doug Biber; Jesse Egbert; Mark Davies
Panel: A linguistic taxonomy of registers on the searchable web: Distribution, linguistic descriptions, and automatic register identification

Abstract book, pp. 52-54

Doug Biber

Oral-literate dimensions and a narrative dimension recur in every multi-dimensional analysis (MDA), across languages and registers

Oral-literate dimensions

3 dimensions here

Pronouns, questions, verbs, and dependent clauses are crucial to interactivity

These analyses show that there are major linguistic differences among the eight major user-defined register categories.

Can we automatically identify web registers?

Starting point: 150+ linguistic features as predictors

90% of the corpus was used for training and 10% for testing

Each document was assigned to a single category
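
A minimal sketch of the 90/10 split with one label per document. The talk did not say how the split was drawn, so the random shuffle, the seed, the function name split_corpus and the toy register labels are all assumptions for illustration.

```python
import random


def split_corpus(documents, test_fraction=0.10, seed=0):
    """Shuffle the documents and return (train, test) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(documents)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy corpus: each document carries exactly one register label.
corpus = [(f"doc{i}", "narrative" if i % 2 else "opinion") for i in range(100)]
train, test = split_corpus(corpus)
print(len(train), len(test))
```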

Stepwise discriminant analysis was used to select the strongest predictive features
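
To make the idea of stepwise selection concrete, here is a simplified stand-in, not the procedure used in the talk. Real stepwise discriminant analysis adds features using a statistic such as Wilks' lambda; this sketch instead greedily adds whichever feature most improves the training accuracy of a nearest-centroid classifier, and stops when nothing improves. All function names and the toy data are hypothetical.

```python
def centroids(rows, labels, feats):
    """Mean vector per class, restricted to the chosen feature indices."""
    sums, counts = {}, {}
    for row, lab in zip(rows, labels):
        acc = sums.setdefault(lab, [0.0] * len(feats))
        for j, f in enumerate(feats):
            acc[j] += row[f]
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(row, cents, feats):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((row[f] - c[j]) ** 2 for j, f in enumerate(feats))
    return min(sorted(cents), key=lambda lab: dist(cents[lab]))


def accuracy(rows, labels, feats):
    cents = centroids(rows, labels, feats)
    hits = sum(predict(r, cents, feats) == l for r, l in zip(rows, labels))
    return hits / len(rows)


def forward_select(rows, labels, max_features):
    """Greedy forward selection: add the best feature while accuracy improves."""
    selected, best = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < max_features:
        scored = [(accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feat = max(scored, key=lambda p: p[0])
        if score <= best:              # no candidate improves the model
            break
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected


# Toy data (assumption): feature 0 separates the classes, 1 is noisy, 2 is constant.
rows = [[1.0, 5.0, 0.0], [1.2, 1.0, 0.0], [0.9, 3.0, 0.0],
        [4.0, 4.0, 0.0], [4.2, 2.0, 0.0], [3.9, 5.0, 0.0]]
labels = ["a", "a", "a", "b", "b", "b"]
print(forward_select(rows, labels, max_features=2))
```

On this toy data only feature 0 is kept, since adding a second feature cannot improve on a perfect fit. With 150+ candidate features, the max_features cap plays the role of the 10-feature and 44-feature cut-offs.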

10-feature model: 0.34 precision

44-feature model: 0.44 precision
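
For reference, a small sketch of how precision can be computed from gold and predicted register labels. The talk did not state whether the reported figures were per-class or averaged, so the macro average here is an assumption, and the labels are invented toy values.

```python
def precision_per_class(gold, predicted):
    """Precision for each predicted class: true positives / all predictions of that class."""
    tp, fp = {}, {}
    for g, p in zip(gold, predicted):
        if g == p:
            tp[p] = tp.get(p, 0) + 1
        else:
            fp[p] = fp.get(p, 0) + 1
    classes = set(tp) | set(fp)        # classes never predicted have no defined precision
    return {c: tp.get(c, 0) / (tp.get(c, 0) + fp.get(c, 0)) for c in classes}


# Hypothetical toy labels, not data from the talk.
gold = ["news", "news", "blog", "blog", "forum"]
predicted = ["news", "blog", "blog", "blog", "news"]

per_class = precision_per_class(gold, predicted)
macro = sum(per_class.values()) / len(per_class)
print(per_class, round(macro, 2))
```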