Noise-tolerant, Reliable Active Classification with Comparison Queries

Hopkins, Max; Kane, Daniel M.; Lovett, Shachar; Mahajan, Gaurav

doi:10.48550/arxiv.2001.05497

Cited by 3 publications

(5 citation statements)

References 12 publications


“…It would also be interesting to investigate whether our algorithmic insights can find applications for learning halfspaces under the challenging Tsybakov noise model (Hanneke, 2011). Finally, it would be interesting to extend our ideas to actively learn more general classes such as low degree polynomials, perhaps using additional comparison queries as explored in recent works (Kane et al., 2017; Xu et al., 2017; Hopkins et al., 2020).…”

Section: Conclusion and Discussion

mentioning

confidence: 95%

Efficient active learning of sparse halfspaces with arbitrary bounded noise

Zhang, Shen, Awasthi

2020

Preprint


In this work we study active learning of homogeneous s-sparse halfspaces in ℝ^d under label noise. Even in the absence of label noise this is a challenging problem and only recently have label complexity bounds of the form Õ(s · polylog(d, 1/ε)) been established in Zhang (2018) for computationally efficient algorithms under the broad class of isotropic log-concave distributions. In contrast, under high levels of label noise, the label complexity bounds achieved by computationally efficient algorithms are much worse. When the label noise satisfies the Massart condition (Massart and Nédélec, 2006), i.e., each label is flipped with probability at most η for a parameter η ∈ [0, 1/2), the work of Awasthi et al. (2016) provides a computationally efficient active learning algorithm under isotropic log-concave distributions with label complexity Õ(s^{poly(1/(1−2η))} · polylog(d, 1/ε)). Hence the algorithm is label-efficient only when the noise rate η is a constant. In this work, we substantially improve on the state of the art by designing a polynomial time algorithm for active learning of s-sparse halfspaces under bounded noise and isotropic log-concave distributions, with a label complexity of Õ(s/(1−2η)^4 · polylog(d, 1/ε)). Hence, our new algorithm is label-efficient even for noise rates close to 1/2. Prior to our work, such a result was not known even for the random classification noise model. Our algorithm builds upon existing margin-based algorithmic framework and at each iteration performs a sequence of online mirror descent updates on a carefully chosen loss sequence, and uses a novel gradient update rule that accounts for the bounded noise.
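The bounded (Massart) noise condition described in the abstract above can be illustrated with a small simulation. This is only a sketch of the noise model, not of the paper's learning algorithm: the names `clean_label`, `massart_label`, and `w_star`, the Gaussian data (one example of an isotropic log-concave distribution), and the constant flip rate 0.3 are all assumptions made for illustration.

```python
import random

def clean_label(w, x):
    """Label assigned by the homogeneous halfspace sign(<w, x>)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def massart_label(w, x, eta_of_x, rng):
    """Clean label of x, flipped with probability eta_of_x(x).

    Under the Massart condition eta_of_x(x) <= eta < 1/2 for every x.
    """
    y = clean_label(w, x)
    return -y if rng.random() < eta_of_x(x) else y

# Hypothetical 2-sparse target in R^5 (illustrative values only).
w_star = [1.0, 0.0, 0.0, -1.0, 0.0]
rng = random.Random(0)

# Standard Gaussian points, assumed here as the unlabeled distribution.
points = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(10000)]

# With a constant flip rate of 0.3, roughly 70% of noisy labels
# should agree with the clean halfspace labels.
agree = sum(
    massart_label(w_star, x, lambda _: 0.3, rng) == clean_label(w_star, x)
    for x in points
) / len(points)
```

As η approaches 1/2 the agreement rate approaches 1/2 as well, so each noisy label carries less information about the target halfspace; this is consistent with the label complexity bounds quoted above growing as 1 − 2η shrinks.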


“…Comparison queries consider four records (say 𝑣₁, 𝑣₂, 𝑣₃, 𝑣₄) as input and compare the relative distance between (𝑣₁, 𝑣₂) with that of (𝑣₃, 𝑣₄). Such queries have been used to study correlation clustering [5,59], classification [38,57], top-𝑘 selection [13,17,19,34,43,44,54,61], skyline computation [62] and many other machine learning tasks. Many empirical crowdsourcing studies have shown the ability of crowd members to answer such queries accurately [5].…”

Section: Related Work

mentioning

confidence: 99%

“…Such comparisons reveal the local hierarchical structure with respect to the queried records and can be answered without the knowledge of other records in the dataset. These oracle models have been widely popular to study fairness metrics [40], correlation clustering [59] and classification [38,57], identify maximum elements [34,61], top-𝑘 elements [13,17,19,43,44,54], information retrieval [42], skyline computation [62], and so on. In order to minimize the oracle workload, our framework prioritizes records to optimize the number of triplet comparisons.…”

mentioning

confidence: 99%

Hierarchical Entity Resolution using an Oracle

Galhotra, Firmani, Saha, et al.

2022

Proceedings of the 2022 International Conference on Management of Data


In many applications, entity references (i.e., records) and entities need to be organized to capture diverse relationships like type-subtype, is-A (mapping entities to types), and duplicate (mapping records to entities) relationships. However, automatic identification of such relationships is often inaccurate due to noise and heterogeneous representation of records across sources. Similarly, manual maintenance of these relationships is infeasible and does not scale to large datasets. In this work, we circumvent these challenges by considering weak supervision in the form of an oracle to formulate a novel hierarchical ER task. In this setting, records are clustered in a tree-like structure containing records at leaf-level and capturing record-entity (duplicate), entity-type (is-A) and subtype-supertype relationships. For effective use of supervision, we leverage triplet comparison oracle queries that take three records as input and output the most similar pair(s). We develop HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure. We show theoretically and empirically that HierER is effective under different similarity noise models and demonstrate empirically that HierER can scale up to million-size datasets.
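The triplet comparison query described in the abstract above (three records in, the most similar pair or pairs out) can be sketched as follows. This is a minimal illustration, not HierER itself: the function name `triplet_oracle` and the caller-supplied `similarity` function are hypothetical, and a real oracle would typically be a human or crowd judgment rather than a computed score.

```python
from itertools import combinations

def triplet_oracle(a, b, c, similarity):
    """Return the most similar pair(s) among three records.

    `similarity` is a hypothetical stand-in for the oracle's judgment;
    all pairs tied for the highest score are returned.
    """
    pairs = list(combinations((a, b, c), 2))
    scores = [similarity(p, q) for p, q in pairs]
    best = max(scores)
    return [pair for pair, score in zip(pairs, scores) if score == best]

# Toy records on the number line: closer numbers count as more similar.
def closeness(p, q):
    return -abs(p - q)

unique_answer = triplet_oracle(1, 2, 10, closeness)   # [(1, 2)]
tied_answer = triplet_oracle(0, 1, 2, closeness)      # [(0, 1), (1, 2)]
```

Returning a list rather than a single pair mirrors the "pair(s)" wording of the abstract, since two pairs can be equally similar.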


“…Distance based comparison oracles have been used to study a wide range of problems and we list a few of them - learning fairness metrics [34], top-down hierarchical clustering with a different objective [11,17,24], correlation clustering [49] and classification [32,48], identify maximum [30,53], top-𝑘 elements [14-16, 38, 40, 45], information retrieval [35], skyline computation [54]. To the best of our knowledge, there is no work that considers quadruplet comparison oracle queries to perform 𝑘-center clustering and single/complete linkage based hierarchical clustering.…”

Section: Other Related Work

mentioning

confidence: 99%

“…Motivated by the aforementioned observations, we consider a quadruplet comparison oracle that compares the relative distance between two pairs of points (𝑢₁, 𝑢₂) and (𝑣₁, 𝑣₂) and outputs the pair with smaller distance between them, breaking ties arbitrarily. Such oracle models have been studied extensively in the literature [11,17,24,32,34,48,49]. Even though quadruplet queries are easier than binary optimal queries, some oracle queries may be harder than the rest.…”

Section: Introduction

mentioning

confidence: 99%

How to design robust algorithms using noisy comparison Oracle

2021


Metric based comparison operations such as finding maximum, nearest and farthest neighbor are fundamental to studying various clustering techniques such as k-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on accurate estimation of pairwise distance between records. However, computing exact features of the records, and their pairwise distances is often challenging, and sometimes not possible. We circumvent this challenge by leveraging weak supervision in the form of a comparison oracle that compares the relative distance between the queried points such as 'Is point u closer to v or w closer to x?'. However, it is possible that some queries are easier to answer than others using a comparison oracle. We capture this by introducing two different noise models called adversarial and probabilistic noise. In this paper, we study various problems that include finding maximum, nearest/farthest neighbor search under these noise models. Building upon the techniques we develop for these problems, we give robust algorithms for k-center clustering and agglomerative hierarchical clustering. We prove that our algorithms achieve good approximation guarantees with a high probability and analyze their query complexity. We evaluate the effectiveness and efficiency of our techniques empirically on various real-world datasets.
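The comparison oracle described in the abstract above can be sketched as a function that answers distance comparisons without revealing the distances themselves. The names `quadruplet_oracle` and `farthest_from`, the Euclidean metric, and the independent coin-flip noise controlled by `flip_prob` are assumptions for illustration. The paper's adversarial and probabilistic noise models may be defined differently, and the linear scan below is a plain baseline, not one of the paper's robust algorithms.

```python
import math
import random

def quadruplet_oracle(u, v, w, x, flip_prob=0.0, rng=random):
    """Answer 'is u at least as close to v as w is to x?'.

    The true answer is flipped with probability flip_prob, an assumed
    simple noise model used only for this sketch.
    """
    truth = math.dist(u, v) <= math.dist(w, x)
    return (not truth) if rng.random() < flip_prob else truth

def farthest_from(q, points, oracle=quadruplet_oracle):
    """Linear scan for the point farthest from q using only oracle answers."""
    best = points[0]
    for p in points[1:]:
        # If (q, best) is the closer pair, p is at least as far from q.
        if oracle(q, best, q, p):
            best = p
    return best

candidates = [(1.0, 0.0), (3.0, 0.0), (2.0, 0.0)]
far = farthest_from((0.0, 0.0), candidates)   # (3.0, 0.0) with a noiseless oracle
```

With `flip_prob` greater than zero, a single wrong answer can make this scan discard the true farthest point, which illustrates why noise-robust algorithms are needed for such problems.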

