Utility-Theoretic Ranking for Semi-Automated Text Classification
In Semi-Automated Text Classification (SATC), an automatic classifier h labels a set of unlabelled documents D, after which a human annotator inspects (and corrects when appropriate) the labels that h has assigned to a subset D' of D, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents so as to maximize the expected increase in effectiveness that derives from inspecting D'. An obvious strategy is to rank D so that the documents that h has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive from inspecting (and correcting, if necessary) a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a list generated by a given ranking method. We report the results of experiments showing that, compared with the lowest-confidence baseline and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
(Joint work with Andrea Esuli and Giacomo Berardi)
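
To make the contrast between the two ranking strategies concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method presented in the talk: it assumes a single binary class, F1 as the effectiveness measure, and that h outputs a calibrated probability of the positive class for each document. The function names (`rank_by_confidence`, `rank_by_utility`) and the way the gains are estimated from an expected contingency table are hypothetical choices made for this example.

```python
from typing import List


def f1(tp: float, fp: float, fn: float) -> float:
    """F1 from (possibly fractional) contingency counts."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def rank_by_confidence(probs: List[float]) -> List[int]:
    """Baseline: least confident documents first (probability closest to 0.5)."""
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))


def rank_by_utility(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Rank by P(label is wrong) * estimated gain in F1 from correcting it.

    Assumption of this sketch: the gains are estimated from the expected
    contingency table implied by the classifier's own probabilities.
    """
    labels = [p >= threshold for p in probs]
    # Expected counts of true positives, false positives and false negatives.
    tp = sum(p for p, pos in zip(probs, labels) if pos)
    fp = sum(1 - p for p, pos in zip(probs, labels) if pos)
    fn = sum(p for p, pos in zip(probs, labels) if not pos)
    base = f1(tp, fp, fn)
    # Correcting a false positive turns it into a true negative.
    gain_fp = f1(tp, max(fp - 1, 0.0), fn) - base
    # Correcting a false negative turns it into a true positive.
    gain_fn = f1(tp + 1, fp, max(fn - 1, 0.0)) - base

    def utility(i: int) -> float:
        if labels[i]:
            return (1 - probs[i]) * gain_fp  # may be a false positive
        return probs[i] * gain_fn            # may be a false negative

    return sorted(range(len(probs)), key=utility, reverse=True)


if __name__ == "__main__":
    probs = [0.9, 0.55, 0.45, 0.1, 0.8, 0.3]
    print("lowest confidence first:", rank_by_confidence(probs))
    print("highest utility first:  ", rank_by_utility(probs))
```

In this sketch the two rankings can differ because correcting a false negative and correcting a false positive generally change F1 by different amounts, so two documents with the same probability of being misclassified need not be equally worth inspecting.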