We describe a solution for fast indexing and searching within large heterogeneous data sets, whose main purpose is to support investigators who need to analyze forensic disk images originating from seizures or created from bodies of evidence. Our approach is based on a combination of techniques aimed at improving the efficiency and reliability of the indexing process. We do not rely on existing frameworks like Hadoop but borrow concepts from different contexts, including High Performance Computing and Database management.
Biopharmaceutical R&D organizations characterize drug candidate target effects and modes of action and create molecular models of target diseases. These data-intensive activities are informed by vast data resources, including publicly available data, internally generated data and partnered private data collections. However, the rapid evolution of computing, data management tools, and analytical and visualization methods, together with the complexity of data types and the data volumes that must be accommodated, presents significant technical and logistical hurdles. It is particularly difficult for a geographically dispersed R&D organization to make data resources easily available to scientists for search, visualization and exploration. Nevertheless, this is required for R&D scientists to gain insight into disease and drug mechanisms and to capture the knowledge needed to sustain the scientific enterprise. Standardized commercial solutions to R&D data challenges are unattractive because they require significant investment in platform configuration, user training and system maintenance. That investment necessarily delays the adoption of newly emerging technologies and creates an incentive not to adopt alternatives to existing systems. In contrast, our solution to R&D data demands was to build a cloud-deployed data platform using state-of-the-art tools developed and maintained by the open source software community at the Apache Software Foundation. Partnering with academic data scientists, we selected the best available tools to fit our specific needs. We integrated them into a platform accessible to our federated R&D scientific community while allowing the system to be freely modified and updated on demand to meet evolving user requirements.
Priorities for our data platform are to ingest, secure and index R&D source data of all types, make these indexed data assets available to computational scientists for analysis, and provide faceted search capability based on a comprehensive metadata model. Three products, LabKey Server, Apache OODT and ISATools, have been combined into a scientific data management system that provides a unified data resource, enhanced by a search platform powered by Apache Solr. The platform supports both internally generated data and data imported from public, contracted or partnered sources. All data are available for interactive exploration by our R&D community through integrated search, analysis and visualization tools. Deployment of this system to our R&D organization has been met with enthusiastic adoption. Feedback and requests for system enhancements and additional capabilities are rapidly addressed in this open source environment, leading to further adoption among the R&D scientists and providing the basis for accessible, stable institutional knowledge collections.
Citation Format: Lauren Intagliata, Selina Chu, Garth McGrath, Giuseppe Totaro, Daniel Civello, Nipurn Doshi, Shivika Thapar, Michael Livstone, Chris Mattmann, Paul Ramirez, Maureen Cronin. A cloud-enabled open source data management platform supporting a federated research and development organization. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 5282.
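As a rough illustration of the faceted search described in the abstract above, the following sketch builds a request URL for Solr's standard select handler using the documented `q`, `fq`, `facet` and `facet.field` parameters. The host, the core name `rnd_data`, and the metadata field names `data_type` and `source` are assumptions made for this example; the abstract does not describe the platform's actual schema or deployment.

```python
from urllib.parse import urlencode


def build_faceted_query(base_url, core, text, facet_fields, filters=None, rows=10):
    """Build a Solr select URL that requests facet counts for the given fields.

    The core name and field names passed in are hypothetical; they stand in
    for whatever metadata model a real deployment would define.
    """
    params = [
        ("q", text),          # main query string
        ("rows", str(rows)),  # number of documents to return
        ("wt", "json"),       # response format
        ("facet", "true"),    # switch faceting on
    ]
    # One facet.field parameter per metadata field to count.
    for field in facet_fields:
        params.append(("facet.field", field))
    # Each selected facet value narrows the result set via a filter query.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/{core}/select?{urlencode(params)}"


url = build_faceted_query(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "rnd_data",                    # hypothetical core name
    "kinase inhibitor",
    facet_fields=["data_type", "source"],
    filters={"source": "public"},
)
print(url)
```

The sketch only constructs the URL and does not contact a server. Keeping the selected facet values in `fq` rather than in `q` is a common Solr pattern, since filter queries are cached separately and do not affect relevance scoring.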
Biomedical information is available to research and development scientists as unstructured text in the form of scientific manuscripts and reports published in the literature and elsewhere. Scientists focused on specific research programs are burdened with surveying vast numbers of publications and reports to acquire information relevant to their efforts. Employing technology as a research aid provides a mechanism to cope with the information overload that characterizes the R&D environment. Text mining can extract knowledge from large corpora of biomedical text and make it available to support scientific research and knowledge collections [1, 2], and intelligent PDF reader tools able to search content and find related articles [3] are available. However, such reader tools are typically desktop applications limited to specific platforms and data sources, so they cannot easily support the broad, integrated scientific search needs of a dispersed R&D organization with a wide variety of content needs. Our team has developed a web-browser-based document reader with a built-in exploration tool and automatic concept extraction from biomedical text content. This provides R&D scientists with a simple tool to aid finding, reading, and exploring documents relevant to focused research objectives. The tool, Shangri-Docs, combines a document reader with automatic concept extraction and highlighting of relevant terms based on carefully selected ontologies combined with our custom corporate enterprise taxonomy. Shangri-Docs can evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text) and exploits the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and privately cataloged databases simultaneously.
Shangri-Docs incorporates Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific pathology, disease, drug, and biological terms mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. We have extended the cTAKES automatic knowledge extraction process to include the R&D biomedical research domain by improving the ontology-guided information extraction process. Shangri-Docs could be adapted to other science fields and further customized across our R&D scientific community via our open source, cloud-based data management system.
[1] Funk et al., BMC Bioinformatics 2014, 15:59 doi:10.1186/1471-2105-15-59
[2] Kang et al., BMC Bioinformatics 2014, 15:64 doi:10.1186/1471-2105-15-64
[3] Utopia Documents, http://utopiadocs.com
[4] Apache cTAKES, http://ctakes.apache.org
Citation Format: Chris Mattmann, Lauren Intagliata, Selina Chu, Garth McGrath, Giuseppe Totaro, Daniel Civello, David Ballard, Jeffrey Long, Nipurn Doshi, Shivika Thapar, Michael Livstone, Paul Ramirez, Maureen Cronin. Shangri-Docs: a browser based tool for document exploration and automatic knowledge extraction from unstructured biomedical text. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 5283.
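To make the idea of ontology-guided term highlighting in the Shangri-Docs abstract more concrete, here is a minimal sketch of dictionary-based concept tagging. It is not cTAKES and does not use UMLS; it swaps in a plain regular-expression lookup against a tiny, hypothetical vocabulary (`VOCABULARY`) so that the general technique can be seen in a few lines. The function name `highlight_concepts` and the bracketed markup are likewise assumptions made for this illustration.

```python
import re

# Hypothetical mini-vocabulary mapping a term to a semantic category.
# A real system would draw these entries from ontologies such as UMLS.
VOCABULARY = {
    "melanoma": "disease",
    "vemurafenib": "drug",
    "BRAF": "gene",
}


def highlight_concepts(text, vocabulary):
    """Wrap every vocabulary term found in text as [term|category].

    Returns the marked-up text and the list of (term, category) hits
    in order of appearance.
    """
    # Try longer terms first so multi-word entries win over their substrings.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): category for term, category in vocabulary.items()}
    found = []

    def mark(match):
        term = match.group(0)
        category = lookup[term.lower()]
        found.append((term, category))
        return f"[{term}|{category}]"

    return pattern.sub(mark, text), found


marked, concepts = highlight_concepts(
    "Vemurafenib targets BRAF-mutant melanoma.", VOCABULARY
)
print(marked)
print(concepts)
```

In a browser-based reader the bracketed markers would more plausibly be HTML spans styled per category. A pure dictionary lookup like this misses abbreviations, negation and context, which is part of what a full pipeline such as cTAKES is designed to handle.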