We study semi-supervised learning when the data consist of multiple intersecting manifolds. We give a finite sample analysis to quantify the potential gain of using unlabeled data in this multi-manifold setting. We then propose a semi-supervised learning algorithm that separates different manifolds into decision sets, and performs supervised learning within each set. Our algorithm involves a novel application of Hellinger distance and size-constrained spectral clustering. Experiments demonstrate the benefit of our multi-manifold semi-supervised learning approach.
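For reference, the Hellinger distance between two discrete distributions p and q is the Euclidean norm of the difference of their element-wise square roots, divided by the square root of 2. The sketch below illustrates that standard definition only. The function name `hellinger` and the example histograms are assumptions made for illustration; the abstract does not say how the distance is estimated or where it enters the algorithm.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Hypothetical helper for illustration; p and q are sequences of
    non-negative numbers of equal length that each sum to 1.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Sum of squared differences between element-wise square roots.
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    # The 1/2 factor keeps the distance in [0, 1].
    return math.sqrt(squared / 2.0)


# Identical distributions are at distance 0; disjoint supports give 1.
same = hellinger([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
```

Because the distance is bounded between 0 and 1, values computed on different pairs of local neighborhoods are directly comparable, which may be one reason to prefer it over an unbounded divergence.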
Users of topic modeling methods often have knowledge about the composition of words that should have high or low probability in various topics. We incorporate such domain knowledge using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework. The prior is a mixture of Dirichlet tree distributions with special structures. We present its construction, and inference via collapsed Gibbs sampling. Experiments on synthetic and real datasets demonstrate our model’s ability to follow and generalize beyond user-specified domain knowledge.
Kernel conditional random fields (KCRFs) are introduced as a framework for discriminative modeling of graph-structured data. A representer theorem for conditional graphical models is given which shows how kernel conditional random fields arise from risk minimization procedures defined using Mercer kernels on labeled graphs. A procedure for greedily selecting cliques in the dual representation is then proposed, which allows sparse representations. By incorporating kernels and implicit feature spaces into conditional graphical models, the framework enables semi-supervised learning algorithms for structured data through the use of graph kernels. The framework and clique selection methods are demonstrated in synthetic data experiments, and are also applied to the problem of protein secondary structure prediction.
Background
Prior text analysis of R01 critiques suggested that female applicants
may be disadvantaged in NIH peer review, particularly for R01 renewals. NIH
altered its review format in 2009. The authors examined R01 critiques and
scoring in the new format for differences due to principal investigator (PI)
sex.
Method
The authors analyzed 739 critiques—268 from 88 unfunded and
471 from 153 funded applications for grants awarded to 125 PIs (M = 76,
61%; F = 49, 39%) at the University of
Wisconsin-Madison between 2010 and 2014. The authors used 7 word categories
for text analysis: ability, achievement, agentic, negative evaluation,
positive evaluation, research, and standout adjectives. The authors used
regression models to compare priority and criteria scores, and results from
text analysis for differences due to PI sex and whether the application was
for a new (Type 1) or renewal (Type 2) R01.
Results
Approach scores predicted priority scores for all PIs’
applications (P < .001), but scores and critiques differed
significantly for male and female PIs’ Type 2 applications.
Reviewers assigned significantly worse priority, approach, and significance
scores to female than male PIs’ Type 2 applications, despite using
standout adjectives (e.g., “outstanding,”
“excellent”) and making references to ability in more of
their critiques (P<.05 for all comparisons).
Conclusions
The authors’ analyses suggest that subtle gender bias may
continue to operate in the post-2009 NIH review format in ways that could
lead reviewers to implicitly hold male and female applicants to different
standards of evaluation, particularly for R01 renewals.