Word Sense Disambiguation
Edited by Eneko Agirre and Philip Edmonds

Chapter 4: Evaluation of WSD Systems

Martha Palmer, Hwee Tou Ng, Hoa Trang Dang


In this chapter we discuss the evaluation of automatic word sense disambiguation (WSD) systems. Some issues, such as evaluation metrics and the basic methodology for hand-tagging evaluation data, are widely agreed upon by the WSD community. However, other important issues remain to be resolved, including which sense distinctions are important and relevant to the sense-tagging task, and how to evaluate WSD systems in real NLP applications. We give an overview of previous evaluation exercises and investigate sources of human inter-annotator disagreement. These disagreements are at least partially reconciled by a more coarse-grained view of the senses, and we present the groupings that were used for quantitative coarse-grained evaluation. Well-defined sense groups can be of value in improving sense-tagging consistency for both humans and machines.


Senseval organization


4.1 Introduction

4.1.1 Terminology

4.1.2 Overview

4.2 Background

4.2.1 WordNet and Semcor

4.2.2 The line and interest corpora

4.2.3 The DSO corpus

4.2.4 Open Mind Word Expert

4.3 Evaluation using pseudo-words

4.4 Senseval evaluation exercises

4.4.1 Senseval-1

Evaluation and scoring

4.4.2 Senseval-2

English all-words task

English lexical sample task

4.4.3 Comparison of tagging exercises

4.5 Sources of inter-annotator disagreement

4.6 Granularity of sense: Groupings for WordNet

4.6.1 Criteria for WordNet sense grouping

4.6.2 Analysis of sense grouping

4.7 Senseval-3

4.8 Discussion

References

Copyright © 2006 Springer. All rights reserved.