Utility-Theoretic Ranking for Semi-Automated Text Classification

Event date: Tuesday, 1 October 2013, 11:00
Sala Consiglio, West Building, ground floor
Fabrizio Sebastiani (ISTI-CNR, Pisa)

In Semi-Automated Text Classification (SATC), an automatic classifier h labels a set of unlabelled documents D, after which a human annotator inspects (and corrects when appropriate) the labels that h has assigned to a subset D' of D, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents so as to maximize the expected increase in effectiveness that derives from inspecting D'. An obvious strategy is to rank D so that the documents that h has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive from inspecting (and correcting, if necessary) a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a list generated by a given ranking method. We report the results of experiments showing that, compared with the confidence-based baseline described above and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
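
The contrast between the two ranking strategies can be illustrated with a minimal Python sketch. It is not the method presented in the talk: it assumes a binary classifier that outputs a probability of class membership, and it assumes that the gain from correcting a false positive (gain_fp) and the gain from correcting a false negative (gain_fn) are given as fixed numbers. The names Doc, error_probability, expected_gain, gain_fp and gain_fn are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    p_positive: float  # assumed: classifier's estimated probability of class membership


def predicted_label(doc: Doc) -> bool:
    # Assumed decision rule: label as positive when the probability is at least 0.5.
    return doc.p_positive >= 0.5


def error_probability(doc: Doc) -> float:
    # Estimated probability that the automatically assigned label is wrong.
    return 1.0 - doc.p_positive if predicted_label(doc) else doc.p_positive


def rank_by_low_confidence(docs):
    # Baseline: the documents most likely to be mislabelled come first.
    return sorted(docs, key=error_probability, reverse=True)


def expected_gain(doc: Doc, gain_fp: float, gain_fn: float) -> float:
    # Illustrative expected gain: probability of an error times the
    # (assumed) improvement obtained by correcting that type of error.
    gain = gain_fp if predicted_label(doc) else gain_fn
    return error_probability(doc) * gain


def rank_by_expected_gain(docs, gain_fp: float, gain_fn: float):
    # Utility-style ranking: highest expected gain first.
    return sorted(docs, key=lambda d: expected_gain(d, gain_fp, gain_fn), reverse=True)


docs = [Doc("a", 0.55), Doc("b", 0.30), Doc("c", 0.90)]
baseline_order = [d.doc_id for d in rank_by_low_confidence(docs)]
utility_order = [d.doc_id for d in rank_by_expected_gain(docs, gain_fp=1.0, gain_fn=2.0)]
```

With these made-up numbers, the baseline puts document "a" first because its label is the least certain, whereas the gain-weighted ranking puts "b" first: its label is less likely to be wrong, but correcting a false negative is assumed to be worth twice as much. How the gains are actually estimated is the subject of the talk and is not reproduced here.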

(Joint work with Andrea Esuli and Giacomo Berardi)