Most classic machine learning methods depend on the assumption that humans can annotate all the data available for training. However, many modern machine learning applications (including image and video classification, protein sequence classification, and speech processing) have massive amounts of unannotated or unlabeled data. As a consequence, there has been tremendous interest, both in machine learning and in its application areas, in designing algorithms that utilize the available data as efficiently as possible while minimizing the need for human intervention. An extensively used and studied technique is active learning, where the algorithm is presented with a large pool of unlabeled examples (such as all images available on the web) and can interactively ask for the labels of examples of its own choosing from the pool, with the goal of drastically reducing the labeling effort.
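To make the pool-based protocol concrete, the following is a minimal sketch of the interactive loop just described. The text does not prescribe a querying strategy or learner; the uncertainty-sampling rule and logistic-regression model below are illustrative choices only.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# The learner (logistic regression) and the query rule are illustrative
# assumptions, not prescribed by the text.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Unlabeled pool: stands in for, e.g., a large collection of web images.
X_pool, y_pool = make_classification(n_samples=1000, n_features=10, random_state=0)
labeled = list(rng.choice(len(X_pool), size=10, replace=False))  # small seed set

model = LogisticRegression()
for _ in range(20):  # label budget: 20 interactive queries
    model.fit(X_pool[labeled], y_pool[labeled])
    # Query the pool point the current model is least certain about,
    # i.e., whose predicted probability is closest to 1/2.
    probs = model.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(probs - 0.5)
    uncertainty[labeled] = np.inf  # never re-query an already labeled point
    query = int(np.argmin(uncertainty))
    labeled.append(query)  # an oracle (human annotator) supplies y_pool[query]
```

The point of the loop is that the algorithm, not the annotator, decides which labels are worth paying for; with a well-chosen query rule, far fewer than all 1000 labels are needed.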
Formal setup
We consider classification problems (such as classifying images by who is in them or classifying emails as spam or not), where the goal is to predict a label $y$ based on its corresponding input vector $x$. In the standard machine learning formulation, we assume that the data points $(x, y)$ are drawn from an unknown underlying distribution $D_{XY}$ over $X \times Y$; $X$ is called the feature (instance) space and $Y = \{0, 1\}$ is the label space. The goal is to output a hypothesis function $h$ of small error (or small 0/1 loss), where $\mathrm{err}(h) = \Pr_{(x,y) \sim D_{XY}}[h(x) \neq y]$. In the passive learning setting, the learning algorithm is given a set of labeled examples $(x_1, y_1), \ldots, (x_m, y_m)$ drawn i.i.d. from $D_{XY}$.
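Since $D_{XY}$ is unknown, $\mathrm{err}(h)$ is in practice estimated by the empirical 0/1 loss on a held-out i.i.d. sample. A minimal sketch, with illustrative function and variable names:

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical 0/1 loss: the fraction of held-out examples (x_i, y_i),
    assumed drawn i.i.d. from D_XY, on which the hypothesis h disagrees
    with the true label. This is an unbiased estimate of
    err(h) = Pr[h(x) != y]."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != y))

# Example: a trivial hypothesis that always predicts label 0.
X = np.array([[0.2], [0.7], [0.5]])
y = np.array([0, 1, 1])
print(empirical_error(lambda x: 0, X, y))  # two of three labels are 1 -> 0.667
```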