Learning classification models with soft-label information

Nguyen, Quang Uy; Valizadegan, Hamed; Hauskrecht, Miloš

doi:10.1136/amiajnl-2013-001964

Cited by 53 publications

(32 citation statements)

References 28 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…As a summary, most existing works use patient data, especially patient EHRs, collected from people who were already suspected or diagnosed to be sick [121,90,75,96,102,141]. However, mining general health examination data is an area that has not yet been well-explored, except a few studies on risk prediction such as the chronic disease early warning system proposed in [64].…”

Section: Discussionmentioning

confidence: 99%

“…Recently Nguyen et al [96] introduced a learning approach that considered the auxiliary soft-label information collected from human experts to quantify label uncertainty. They included an additional term in the original SVM formulation to define the loss of not respecting the orders induced by the auxiliary subjective probabilities.…”

Section: Classification With Label Uncertaintymentioning

confidence: 99%

“…They either have expert-defined low-risk or control classes [47,121,90,125,75,96] or simply treat non-positive cases as negative [113,22,102]. Methods that consider unlabeled data [106,73,60,97,46,124,61,84,138] are generally based on Semi-Supervised Learning (SSL) [144] that learns from both labeled and unlabeled data.…”

Section: Problems and Challengesmentioning

confidence: 99%

“…Nguyen et al [96] introduced a classification method for heparin induced thrombocytopenia. Their model took into account the soft-label information that reflected how strongly the human expert felt about the original class labels.…”

mentioning

confidence: 99%

“…LITERATURE REVIEW ON HEALTHCARE DATA ANALYSIS AND MINING heart failure [106], mental health problems like depression or suicide [120,121], or special diseases [96,90]. Another strain focuses on mortality/survivability prediction, for example, on Intensive Care…”

mentioning

confidence: 99%

See 4 more Smart Citations

Healthcare data mining from multi-source data

Chen¹

View full text Add to dashboard Cite

The "big data" challenge is changing the way we acquire, store, analyse, and draw conclusions from data. How we effectively and efficiently "mine" the data from possibly multiple sources and extract useful information is a critical question. Increasing research attention has been drawn to healthcare data mining, with an ultimate goal to improve the quality of care. The human body is complex and so too the data collected in treating it. Data noise that is often introduced via the collection process makes building Data Mining models a challenging task.This thesis focuses on the classification tasks of mining healthcare data, with the goal of improving the effectiveness of health risk prediction. In particular, we developed algorithms to address issues identified from real healthcare data, such as feature extraction, heterogeneity, label uncertainty, and large unlabeled data.The three main contributions of this research are as follows. First, we developed a new health index called Personal Health Index (PHI) that scores a person's health status based on the examination records of a given population. Second, we identified the key characteristics of the real datasets and issues that were associated with the data. Third, we developed classification algorithms to cope with those issues, particularly, the label uncertainty and large unlabeled data issues.This research takes one step forward towards scoring personal health based on mining increasingly large health records. Particularly, it pioneers exploring the mining of GHE data and tackles the associated challenges. It is our anticipation that in the near future, more robust data-mining-based health scoring systems will be available for healthcare professionals to understand people's health status and thus improve the quality of care.

show abstract

Section: Discussionmentioning

confidence: 99%

Section: Classification With Label Uncertaintymentioning

confidence: 99%

Section: Problems and Challengesmentioning

confidence: 99%

mentioning

confidence: 99%

mentioning

confidence: 99%

See 3 more Smart Citations

Healthcare data mining from multi-source data

Chen¹

View full text Add to dashboard Cite

show abstract

Clinician-Driven AI: Code-Free Self-Training on Public Data for Diabetic Retinopathy Referral

Korot,

Gonçalves,

Huemer

et al. 2023

JAMA Ophthalmol

View full text Add to dashboard Cite

ImportanceDemocratizing artificial intelligence (AI) enables model development by clinicians with a lack of coding expertise, powerful computing resources, and large, well-labeled data sets.ObjectiveTo determine whether resource-constrained clinicians can use self-training via automated machine learning (ML) and public data sets to design high-performing diabetic retinopathy classification models.Design, Setting, and ParticipantsThis diagnostic quality improvement study was conducted from January 1, 2021, to December 31, 2021. A self-training method without coding was used on 2 public data sets with retinal images from patients in France (Messidor-2 [n = 1748]) and the UK and US (EyePACS [n = 58 689]) and externally validated on 1 data set with retinal images from patients of a private Egyptian medical retina clinic (Egypt [n = 210]). An AI model was trained to classify referable diabetic retinopathy as an exemplar use case. Messidor-2 images were assigned adjudicated labels available on Kaggle; 4 images were deemed ungradable and excluded, leaving 1744 images. A total of 300 images randomly selected from the EyePACS data set were independently relabeled by 3 blinded retina specialists using the International Classification of Diabetic Retinopathy protocol for diabetic retinopathy grade and diabetic macular edema presence; 19 images were deemed ungradable, leaving 281 images. Data analysis was performed from February 1 to February 28, 2021.ExposuresUsing public data sets, a teacher model was trained with labeled images using supervised learning. Next, the resulting predictions, termed pseudolabels, were used on an unlabeled public data set. Finally, a student model was trained with the existing labeled images and the additional pseudolabeled images.Main Outcomes and MeasuresThe analyzed metrics for the models included the area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, and F1 score. The Fisher exact test was performed, and 2-tailed P values were calculated for failure case analysis.ResultsFor the internal validation data sets, AUROC values for performance ranged from 0.886 to 0.939 for the teacher model and from 0.916 to 0.951 for the student model. For external validation of automated ML model performance, AUROC values and accuracy were 0.964 and 93.3% for the teacher model, 0.950 and 96.7% for the student model, and 0.890 and 94.3% for the manually coded bespoke model, respectively.Conclusions and RelevanceThese findings suggest that self-training using automated ML is an effective method to increase both model performance and generalizability while decreasing the need for costly expert labeling. This approach advances the democratization of AI by enabling clinicians without coding expertise or access to large, well-labeled private data sets to develop their own AI models.

show abstract

Hierarchical Active Learning with Proportion Feedback on Regions

Luo

Hauskrecht

2019

Machine Learning and Knowledge Discovery in Databases

Self Cite

View full text Add to dashboard Cite

Learning of classification models in practice often relies on human annotation effort in which humans assign class labels to data instances. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To solve this problem, instead of soliciting instance-based annotation we explore region-based annotation as the feedback. A region is defined as a hyper-cubic subspace of the input feature space and it covers a subpopulation of data instances that fall into this region. Each region is labeled with a number in [0,1] (in binary classification setting), representing a human estimate of the positive (or negative) class proportion in the subpopulation. To learn a classifier from region-based feedback we develop an active learning framework that hierarchically divides the input space into smaller and smaller regions. In each iteration we split the region with the highest potential to improve the classification models. This iterative process allows us to gradually learn more refined classification models from more specific regions with more accurate proportions. Through experiments on numerous datasets we demonstrate that our approach offers a new and promising active learning direction that can outperform existing active learning approaches especially in situations when labeling budget is limited and small.

show abstract

Learning classification models with soft-label information

Abstract: A new classification learning framework that lets us learn from auxiliary soft-label information provided by a human expert is a promising new direction for learning classification models from expert labels, reducing the time and cost needed to label data.

Cited by 53 publications

References 28 publications

Healthcare data mining from multi-source data

Healthcare data mining from multi-source data

Clinician-Driven AI: Code-Free Self-Training on Public Data for Diabetic Retinopathy Referral

Hierarchical Active Learning with Proportion Feedback on Regions

Contact Info

Product

Resources

About