Statistical models of the amino acid composition of the proteins within a fold family are widely used in science and engineering. Existing techniques for learning probabilistic graphical models from multiple sequence alignments either make strong assumptions about the conditional independencies within the model (e.g., HMMs), or else use sub-optimal algorithms to learn the structure and parameters of the model. We introduce an approach to learning the topological structure and parameters of an undirected probabilistic graphical model. The learning algorithm uses block-L1 regularization and solves a convex optimization problem, thus guaranteeing a globally optimal solution at convergence. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Our model is generative, allowing for the design of new proteins whose statistical properties match those seen in nature. We apply our approach to two widely studied protein families: the WW and the PDZ folds. We demonstrate that our model is able to capture interactions that are important in folding and allostery. Our results additionally indicate that while the network of interactions within a protein is sparse, it is richer than previously believed.
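As a rough illustration of what block-L1 regularization does, the sketch below penalizes the Euclidean norm of each edge's parameter block and applies group soft-thresholding, which sets a whole block to zero at once and so removes that edge from the graph. This is not the authors' solver: the function names, the flattened edge blocks, and the threshold value are all hypothetical, chosen only to show the mechanism.

```python
import math


def block_l1_penalty(edge_blocks, lam):
    """Return lam times the sum of Euclidean norms of the edge parameter blocks."""
    return lam * sum(
        math.sqrt(sum(w * w for w in block)) for block in edge_blocks.values()
    )


def group_soft_threshold(block, threshold):
    """Proximal operator of threshold * ||block||_2.

    Shrinks the whole block toward zero, and zeroes it entirely
    when its norm does not exceed the threshold.
    """
    norm = math.sqrt(sum(w * w for w in block))
    if norm <= threshold:
        return [0.0] * len(block)
    scale = 1.0 - threshold / norm
    return [scale * w for w in block]


# Hypothetical flattened edge blocks, keyed by pairs of alignment positions.
edges = {(1, 2): [3.0, 4.0], (1, 7): [0.1, -0.1]}

# Assumed threshold of 1.0; in practice it would come from the regularization
# weight and the step size of whatever optimizer is used.
shrunk = {pair: group_soft_threshold(block, 1.0) for pair, block in edges.items()}

# Edges whose block survived shrinkage remain in the learned graph.
kept_edges = [pair for pair, block in shrunk.items() if any(w != 0.0 for w in block)]
```

Because the penalty acts on each block's norm rather than on individual entries, weak couplings are dropped as whole edges, which is how this kind of penalty yields a sparse interaction network.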
Purpose: Serum-biomarker based screening for pancreatic cancer could greatly improve survival in appropriately targeted high-risk populations. Experimental Design: Eighty-three circulating proteins were analyzed in sera of patients diagnosed with pancreatic ductal adenocarcinoma (PDAC, n = 333), benign pancreatic conditions (n = 144), and healthy control individuals (n = 227). Samples from each group were split randomly into training and blinded validation sets prior to analysis. A Metropolis algorithm with Monte Carlo simulation (MMC) was used to identify discriminatory biomarker panels in the training set. Identified panels were evaluated in the validation set and in patients diagnosed with colon (n = 33), lung (n = 62), and breast (n = 108) cancers. Results: Several robust profiles of protein alterations were present in sera of PDAC patients compared to the Healthy and Benign groups. In the training set (n = 160 PDAC, 74 Benign, 107 Healthy), the panel of CA 19-9, ICAM-1, and OPG discriminated PDAC patients from Healthy controls with a sensitivity/specificity (SN/SP) of 88/90%, while the panel of CA 19-9, CEA, and TIMP-1 discriminated PDAC patients from Benign subjects with an SN/SP of 76/90%. In an independent validation set (n = 173 PDAC, 70 Benign, 120 Healthy), the panel of CA 19-9, ICAM-1, and OPG demonstrated an SN/SP of 78/94% while the panel of CA 19-9, CEA, and TIMP-1 demonstrated an SN/SP of 71/89%. The CA 19-9, ICAM-1, OPG panel is selective for PDAC and does not recognize breast (SP = 100%), lung (SP = 97%), or colon (SP = 97%) cancer. Conclusions: The PDAC-specific biomarker panels identified in this investigation warrant additional clinical validation to determine their role in screening targeted high-risk populations.
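For readers unfamiliar with the SN/SP notation, the sketch below shows the standard definitions of sensitivity and specificity computed from confusion-matrix counts. The counts are hypothetical and are not taken from the study; the function name is likewise an illustration only.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = true positives / all diseased cases
    specificity = true negatives / all non-diseased cases
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts: 100 cases (80 detected) and 100 controls (90 correctly negative).
sn, sp = sensitivity_specificity(tp=80, fn=20, tn=90, fp=10)
print(f"SN/SP = {sn:.0%}/{sp:.0%}")
```

A panel's specificity against another cancer type is computed the same way, treating patients with that cancer as the non-PDAC group.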