Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning becomes more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community, due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.
Task / Approach / Techniques: Data discovery; Sharing; Collaborative Analysis [9]-[11]; Web [12]-[17]
The cytochrome P450 enzymes (CYP or P450) 46A1 and 27A1 play important roles in cholesterol elimination from the brain and retina, respectively, yet they have not been quantified in human organs because of their low abundance and association with membranes. Based on our previous development of a multiple reaction monitoring (MRM) workflow for measurements of low abundance membrane proteins, we quantified CYP46A1 and CYP27A1 in human brain and retina samples from four donors. These enzymes were quantified in the total membrane pellet, a fraction of the whole tissue homogenate, using 15N-labeled recombinant P450s as internal standards. The average P450 concentrations per mg of total tissue protein were 345 fmol of CYP46A1 and 110 fmol of CYP27A1 in the temporal lobe, and 60 fmol of CYP46A1 and 490 fmol of CYP27A1 in the retina. The corresponding P450 metabolites were then measured in the same tissue samples and compared to the P450 enzyme concentrations. Investigation of the enzyme-product relationships and analysis of the P450 measurements based on different signature peptides revealed a possibility of retina-specific post-translational modification of CYP27A1. The data obtained provide important insights into the mechanisms of cholesterol elimination from different neural tissues.
Approximately 30% of naturally occurring proteins are predicted to be embedded in biological membranes. Nevertheless, this group of proteins is traditionally understudied due to limitations of the available analytical tools. To facilitate the analysis of membrane proteins, the analytical methods for their soluble counterparts must be optimized or modified. Multiple reaction monitoring (MRM) assays have proven successful for the absolute quantification of proteins and for profiling protein modifications in cell lysates and human plasma/serum, but have found little application in the analysis of membrane proteins. We report on the optimization of sample preparation conditions for the quantification of two membrane proteins, cytochrome P450 11A1 (CYP11A1) and adrenodoxin reductase (AdR). These conditions can be used for the analysis of other membrane proteins. We have demonstrated that membrane proteins that are tightly associated with the membrane, such as CYP11A1, can be quantified in the total tissue membrane pellet obtained after high-speed centrifugation, whereas proteins that are weakly associated with the membrane, such as AdR, must be quantified in the whole tissue homogenate. We have compared quantifications of CYP11A1 using two different detergents, RapiGest SP and sodium cholate, and two different trypsins, sequencing grade modified trypsin and trypsin type IX-S from porcine pancreas. The measured concentrations in these experiments were similar and encouraged the use of any of these detergent/trypsin combinations for quantification of other membrane proteins. Overall, the CYP11A1 and AdR levels quantified in this work ranged from the hundred-pmol to the ten-fmol range per mg of tissue protein.