Corpora and metadata - Pérez-Paredes

Lou Burnard:

[…] it is no exaggeration to say that without metadata, corpus linguistics would be virtually impossible. Why? Because corpus linguistics is an empirical science, in which the investigator seeks to identify patterns of linguistic behaviour by inspection and analysis of naturally occurring samples of language. A typical corpus analysis will therefore gather together many examples of linguistic usage, each taken out of the context in which it originally occurred, like a laboratory specimen. Metadata can restore that context by supplying information about it, thus enabling us to relate the specimen to its original habitat. Furthermore, since language corpora are constructed from pre-existing pieces of language, questions of accuracy and authenticity are all but inevitable when using them: without metadata, the investigator has no way of answering such questions. Without metadata, the investigator has nothing but disconnected words of unknowable provenance or authenticity[1].

[1] URL: http://users.ox.ac.uk/~lou/wip/metadata.html

References: Burnard, Lou; Aston, Guy (1998). The BNC handbook: exploring the British National Corpus. Edinburgh: Edinburgh University Press.
