Detection of Errors and Correction
in Corpus Annotation

Detecting Errors in Part-of-Speech Annotation

Markus Dickinson and Walt Detmar Meurers

Proceedings of EACL'03.

We propose a new method for detecting errors in "gold-standard" part-of-speech annotation. The approach locates errors with high precision based on n-grams occurring in the corpus with multiple taggings. Two further techniques, closed-class analysis and finite-state tagging guide patterns, are discussed. The success of the three approaches is illustrated for the Wall Street Journal corpus as part of the Penn Treebank.
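To make the idea of "n-grams occurring in the corpus with multiple taggings" concrete, here is a minimal sketch. It is not the code used in the paper: the function name `variation_ngrams`, the data layout (sentences as lists of (word, tag) pairs), and the toy corpus are assumptions made for illustration, and the sketch simply checks all n-grams of one fixed length rather than following the procedure described in the paper.

```python
from collections import defaultdict


def variation_ngrams(tagged_corpus, n):
    """Find word n-grams in which some position is tagged inconsistently.

    tagged_corpus: list of sentences, each a list of (word, tag) pairs.
    Returns {word n-gram: {position: set of tags}} restricted to positions
    that were seen with more than one tag across the corpus.
    """
    # For every word n-gram, record the tags observed at each position.
    seen = defaultdict(lambda: defaultdict(set))
    for sentence in tagged_corpus:
        for start in range(len(sentence) - n + 1):
            window = sentence[start:start + n]
            words = tuple(word for word, _ in window)
            for pos, (_, tag) in enumerate(window):
                seen[words][pos].add(tag)

    # Keep only n-grams with at least one position that varies in its tag.
    result = {}
    for words, tags_by_pos in seen.items():
        varying = {pos: tags for pos, tags in tags_by_pos.items() if len(tags) > 1}
        if varying:
            result[words] = varying
    return result


# Toy data made up for illustration; the tags follow Penn Treebank naming.
corpus = [
    [("to", "TO"), ("ward", "VB"), ("off", "RP"), ("a", "DT"), ("takeover", "NN")],
    [("to", "TO"), ("ward", "VB"), ("off", "IN"), ("a", "DT"), ("hostile", "JJ")],
]
hits = variation_ngrams(corpus, 3)
for words, varying in hits.items():
    for pos, tags in varying.items():
        print(" ".join(words), "->", words[pos], sorted(tags))
```

In this toy corpus the word "off" receives two different tags in the same surrounding words, so both trigrams containing it in a repeated context are reported. A word tagged differently in identical contexts is a candidate annotation error; longer shared contexts make it more likely that the variation is an error rather than a genuine ambiguity.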

The variation n-gram code used in the paper (written in Python) is freely available. Just send me an e-mail at the address below.

Electronically available file formats:

