“…In fact, many state‐of‐the‐art algorithms search for a weighted combination of simpler rules (Germain et al): bagging (Breiman), boosting (Schapire et al; Schapire & Singer), and Bayesian approaches (Gelman et al), or even kernel methods (Vapnik) and neural networks (Bishop). The major open problems in this scenario are how to weight the different rules in order to obtain good performance (Berend & Kontorovitch; Catoni; Lever et al; Nitzan & Paroush; Parrado‐Hernández et al), how this performance can be assessed (Catoni; Donsker & Varadhan; Germain et al; Lacasse et al; Langford & Seeger; Laviolette & Marchand; Lever et al; London et al; McAllester; Shawe‐Taylor & Williamson; Tolstikhin & Seldin; Van Erven), and how this theoretical framework can be exploited to derive new learning approaches or to apply it in other contexts (Audibert; Audibert & Bousquet; Bégin et al; Germain et al; McAllester; Morvant; Ralaivola et al; Roy et al; Seeger; Seldin et al; Seldin & Tishby; Shawe‐Taylor & Langford). The PAC‐Bayes approach is one of the sharpest analysis frameworks in this context, since it can provide tight bounds on the risk of the Gibbs classifier (GC), also called the randomized (or probabilistic) classifier, and of the Bayes classifier (BC), also called the weighted majority vote classifier (Germain et al).…”
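To make the two objects in the last sentence concrete, the following is a minimal sketch of the standard definitions behind this terminology; the posterior $\rho$ over the set of voters $\mathcal{H}$, the data distribution $\mathcal{D}$, and the symbols $G_\rho$ and $B_\rho$ are notation introduced here for illustration, not taken from the excerpt. For binary labels $y \in \{-1,+1\}$, the Gibbs classifier draws a voter at random according to $\rho$ for each prediction, while the Bayes classifier takes the $\rho$-weighted majority vote:
\begin{align}
  % Gibbs (randomized) classifier: draw h ~ rho and predict h(x); its risk is the
  % rho-average of the individual voters' risks.
  R(G_\rho) &= \mathop{\mathbb{E}}_{h \sim \rho} \, \mathop{\mathbb{E}}_{(x,y) \sim \mathcal{D}} \mathbb{1}\!\left[ h(x) \neq y \right], \\
  % Bayes (weighted majority vote) classifier: predict the class receiving the
  % largest rho-weighted vote.
  B_\rho(x) &= \operatorname{sign}\!\Big( \mathop{\mathbb{E}}_{h \sim \rho} h(x) \Big), \\
  % Whenever the majority vote errs, at least half of the rho-weighted voters err,
  % which gives the classical factor-of-two relation between the two risks.
  R(B_\rho) &\leq 2\, R(G_\rho).
\end{align}
The last inequality is the classical argument that makes PAC‐Bayes bounds on the Gibbs risk informative for the weighted majority vote as well: since the majority vote can only err when at least half of the $\rho$-weighted voters err, any bound on $R(G_\rho)$ immediately yields a (possibly loose) bound on $R(B_\rho)$.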