Background: Blood-based methods using cell-free DNA (cfDNA) are under development as an alternative to existing screening tests. However, early-stage detection of cancer using tumor-derived cfDNA has proven challenging because of the small proportion of cfDNA derived from tumor tissue in early-stage disease. A machine learning approach to discover signatures in cfDNA, potentially reflective of both tumor and non-tumor contributions, may represent a promising direction for the early detection of cancer. Methods: Whole-genome sequencing was performed on cfDNA extracted from plasma samples (N = 546 colorectal cancer and 271 non-cancer controls). Reads aligning to protein-coding gene bodies were extracted, and read counts were normalized. cfDNA tumor fraction was estimated using IchorCNA. Machine learning models were trained using k-fold cross-validation and confounder-based cross-validation to assess generalization performance. Results: In a colorectal cancer cohort heavily weighted towards early-stage cancer (80% stage I/II), we achieved a mean AUC of 0.92 (95% CI 0.91–0.93) with a mean sensitivity of 85% (95% CI 83–86%) at 85% specificity. Sensitivity generally increased with tumor stage and increasing tumor fraction. Stratification by age, sequencing batch, and institution demonstrated the impact of these confounders and provided a more accurate assessment of generalization performance. Conclusions: A machine learning approach using cfDNA achieved high sensitivity and specificity in a large, predominantly early-stage, colorectal cancer cohort. The possibility of systematic technical and institution-specific biases warrants similar confounder analyses in other studies. Prospective validation of this machine learning method and evaluation of a multi-analyte approach are underway. Electronic supplementary material: The online version of this article (10.1186/s12885-019-6003-8) contains supplementary material, which is available to authorized users.
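The abstract does not say how the confounder-based cross-validation was built. One common reading is that each fold holds out every sample sharing one confounder level (for example, one institution or one sequencing batch), so the model is always scored on a level it never saw in training. The sketch below illustrates that idea only; the function name `confounder_folds` and the institution labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def confounder_folds(groups):
    """Yield (held_out_level, train_idx, test_idx) for each confounder level.

    Assumption: one fold per distinct level (leave-one-group-out).
    The paper may have grouped levels differently.
    """
    by_group = defaultdict(list)
    for index, level in enumerate(groups):
        by_group[level].append(index)

    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        # Training indices come from every other level, never the held-out one.
        train_idx = [
            i
            for level in sorted(by_group)
            if level != held_out
            for i in by_group[level]
        ]
        yield held_out, train_idx, test_idx


# Hypothetical institution label for each of six samples.
institutions = ["A", "A", "B", "C", "B", "C"]
folds = list(confounder_folds(institutions))

for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Comparing scores from folds like these with scores from ordinary k-fold splits gives a rough sense of how much a model relies on institution- or batch-specific signal.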
In this paper we document our experiences with developing speech recognition for medical transcription, a system that automatically transcribes doctor-patient conversations. Towards this goal, we built a system along two different methodological lines: a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen, Attend and Spell (LAS) grapheme-based model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and the development of a matched language model were essential to the success of the CTC-based models. The LAS-based models, however, were found to be resilient to alignment and transcript noise and did not require the use of language models. CTC models achieved a word error rate of 20.1%, and the LAS models achieved 18.3%. Our analysis shows that both models perform well on important medical utterances and can therefore be practical for transcribing medical conversations.
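For readers unfamiliar with the metric quoted above, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the authors' scoring pipeline; the example sentences are invented, and text normalization (casing, punctuation) is assumed to have been done beforehand.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution ("takes" -> "take") and one deletion ("daily").
wer = word_error_rate("the patient takes aspirin daily", "the patient take aspirin")
print(wer)  # 0.4
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.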
Background: Despite population screening efforts, screening rates for colorectal cancer (CRC) remain suboptimal. A non-invasive, blood-based screening test with high sensitivity and specificity in early-stage disease should improve adherence and ultimately reduce mortality; however, tests based only on tumor-derived biomarkers have limited sensitivity. Here we used a multiomic, machine learning platform to discover, refine, and combine tumor- and immune-derived signals to develop a blood test for the detection of early-stage CRC. Methods: Samples from 591 participants enrolled in a prospective study including average-risk screening and case-control cohorts (NCT03688906) were included in this analysis (CRC: n = 43; colonoscopy-confirmed CRC-negative controls: n = 548). Participants with CRC were 60% male with a mean age of 63, and controls were 55% male with a mean age of 60. Stage distribution was 54% early (I/II) and 34% late (III/IV) with 11% unknown. Plasma was analyzed by whole-genome sequencing, bisulfite sequencing, and protein quantification methods. Computational methods were used to assess and infer the performance of individual and combined assays. Results: For colorectal adenocarcinoma, which represents ~95% of all CRCs, our multiomic test achieved a mean sensitivity of 92% in early stage (n = 17) and 84% in late stage (n = 11) at a specificity of 90%. Across all CRC pathological subtypes, our test achieved a mean sensitivity of 80% in early stage (n = 19) and 83% in late stage (n = 12) at a specificity of 90%; the test detected the single squamous cell carcinoma but missed both neuroendocrine tumors. Individual assays achieved a mean sensitivity of 50% in early stage and 66% in late stage at a specificity of 90%. Conclusions: In a prospective cohort, we demonstrated high sensitivity and specificity for early-stage adenocarcinoma by combining tumor- and immune-derived signals from cfDNA, epigenetic, and protein biomarkers. While most CRCs are adenocarcinomas, detection of all pathological subtypes is required to maximize sensitivity in a screening population. Further analysis of molecular and pathological subtypes, as well as the entire ~3000 patient cohort, is underway. Clinical trial information: NCT03688906.
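Results phrased as "sensitivity at a specificity of 90%" are usually obtained by choosing a score threshold from the controls and then counting how many cases exceed it. The sketch below shows one way to do that, under the assumption that higher scores mean "more likely cancer" and that a sample is called positive when its score is strictly above the threshold. The function name and all scores are hypothetical; the abstract does not describe how its operating point was chosen.

```python
def sensitivity_at_specificity(case_scores, control_scores, target_specificity):
    """Return (threshold, sensitivity, specificity) at the lowest threshold
    whose specificity on the controls reaches the target.

    A sample is called positive when score > threshold (an assumption).
    """
    if not case_scores or not control_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= target_specificity <= 1.0:
        raise ValueError("target_specificity must be between 0 and 1")

    # Candidate thresholds are the control scores themselves; the largest
    # one always yields specificity 1.0, so the loop is guaranteed to return.
    for threshold in sorted(control_scores):
        negatives = sum(score <= threshold for score in control_scores)
        specificity = negatives / len(control_scores)
        if specificity >= target_specificity:
            positives = sum(score > threshold for score in case_scores)
            return threshold, positives / len(case_scores), specificity


# Hypothetical classifier scores, for illustration only.
controls = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.41, 0.48, 0.70]
cases = [0.30, 0.55, 0.62, 0.81, 0.90]

threshold, sensitivity, specificity = sensitivity_at_specificity(cases, controls, 0.90)
print(threshold, sensitivity, specificity)  # 0.48 0.8 0.9
```

With small groups such as n = 17 or n = 11, a single additional detected or missed case moves the sensitivity by several percentage points, which is worth keeping in mind when reading per-stage figures.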