Detection of Errors and Correction
in Corpus Annotation

Markus Dickinson and Walt Detmar Meurers

Proceedings of the Special Session on Treebanks for Spoken Language and Discourse at the 15th Nordic Conference of Computational Linguistics (NODALIDA-05).

Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation, more recently, work has also started to address errors in syntactic and other structural annotation.

Spoken language differs in many respects from written language, but to the best of our knowledge the issue of detecting errors in the annotation of spoken language corpora has not yet been systematically addressed. This is significant since spoken data is increasingly relevant for linguistic and computational research, and such corpora are becoming more readily available. This paper addresses the issue, building on the variation n-gram error detection approach developed in Dickinson and Meurers (2003). We use the German Verbmobil treebank as an exemplar of a spoken language corpus and discuss the properties of such corpora that are relevant when adapting the variation n-gram approach to detect errors in their syntactic annotation.

Bibtex entry:

  author =       {Markus Dickinson and W. Detmar Meurers},
  title =        {Detecting Annotation Errors in Spoken Language Corpora},
  booktitle =    {The Special Session on treebanks for spoken language 
                  and discourse at NODALIDA-05},
  pages =        {},
  url =          {},
  year =         {2005},
  address =      {Joensuu, Finland}