GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records

Tahsin, Tasnia; Weissenbacher, Davy; O’Connor, Karen; Magge, Arjun; Scotch, Matthew; Gonzalez-Hernandez, Graciela

doi:10.1093/bioinformatics/btx799

Cited by 11 publications

(13 citation statements)

References 8 publications


“…While georeferenced datasets describing environmental and climactic phenomena are more readily available, emerging genetic data sets present some challenges, as location of isolation are generally extracted manually from public records or publications. To address visualization of genetic sequence data, there have been efforts to extract geospatial metadata such as location and host from GenBank records, to ease automation of linking relevant sequence data for spatial modeling of disease [131]. Efforts to model outbreaks within a decision support environment which integrate data collected on different spatial scales need to address automated data extraction and transformation such as aggregation of case reports, host population densities, and locations from which isolates were sequenced for a region under study such that visualization of a multifaceted scenario is possible.…”

Section: Discussion (mentioning)

confidence: 99%

A systematic review of spatial decision support systems in public health informatics supporting the identification of high risk areas for zoonotic disease outbreaks

2018

Self Cite


Background: Zoonotic diseases account for a substantial portion of infectious disease outbreaks and burden on public health programs to maintain surveillance and preventative measures. Taking advantage of new modeling approaches and data sources have become necessary in an interconnected global community. To facilitate data collection, analysis, and decision-making, the number of spatial decision support systems reported in the last 10 years has increased. This systematic review aims to describe characteristics of spatial decision support systems developed to assist public health officials in the management of zoonotic disease outbreaks.

Methods: A systematic search of the Google Scholar database was undertaken for published articles written between 2008 and 2018, with no language restriction. A manual search of titles and abstracts using Boolean logic and keyword search terms was undertaken using predefined inclusion and exclusion criteria. Data extraction included items such as spatial database management, visualizations, and report generation.

Results: For this review we screened 34 full text articles. Design and reporting quality were assessed, resulting in a final set of 12 articles which were evaluated on proposed interventions and identifying characteristics were described. Multisource data integration, and user centered design were inconsistently applied, though indicated diverse utilization of modeling techniques.

Conclusions: The characteristics, data sources, development and modeling techniques implemented in the design of recent SDSS that target zoonotic disease outbreak were described. There are still many challenges to address during the design process to effectively utilize the value of emerging data sources and modeling methods. In the future, development should adhere to comparable standards for functionality and system development such as user input for system requirements, and flexible interfaces to visualize data that exist on different scales.

PROSPERO registration number: CRD42018110466.

Electronic supplementary material: The online version of this article (10.1186/s12942-018-0157-5) contains supplementary material, which is available to authorized users.



“…This paucity of high resolution geographic metadata has inspired researchers to develop new methods and tools to ascertain the LOIH for viral sequences represented in GenBank records (Tahsin et al, 2014; Tahsin et al, 2017; Magge et al, 2018). Indeed, available pipelines for discerning the LOIH are configured such that they output not only the most probable location for a specific sequence, but also a vector of other possible locations along with their relative probabilities (Magge et al, 2018).…”

Section: Introduction (mentioning)

confidence: 99%

Going back to the roots: Evaluating Bayesian phylogeographic models with discrete trait uncertainty

Vaiente

Scotch

2020

Infection, Genetics and Evolution

Self Cite


Phylogeography is a popular way to analyze virus sequences annotated with discrete, epidemiologically-relevant, trait data. For applied public health surveillance, a key quantity of interest is often the state at the root of the inferred phylogeny. In epidemiological terms, this represents the geographic origin of the observed outbreak. Since determining the origin of an outbreak is often critical for public health intervention, it is prudent to understand how well phylogeographic models perform this root state classification task under various analytical scenarios. Specifically, we investigate how discrete state space and sequence data set influence the root state classification accuracy. We performed phylogeographic inference on several simulated DNA data sets while i) increasing the number of sequences and ii) increasing the total number of possible discrete trait values. We show that phylogeographic models tend to perform best at intermediate sequence data set sizes. Further, we demonstrate that a popular metric used for evaluation of phylogeographic models, the Kullback-Leibler (KL) divergence, both increases with discrete state space and data set sizes. Further, by modeling phylogeographic root state classification accuracy using logistic regression, we show that KL is not supported as a predictor of model accuracy, indicating its limited utility for assessing phylogeographic model performance on empirical data. These results suggest that relying solely on the KL metric may lead to artificially inflated support for models with finer discretization schemes and larger data set sizes. These results will be important for public health practitioners seeking to use phylogeographic models for applied infectious disease surveillance.
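To make the KL metric mentioned in this abstract concrete, here is a minimal sketch that computes the KL divergence of a root-state posterior from a prior. It is not taken from the paper: the function names, the uniform prior, and the example posterior (90 per cent of the mass on one state) are all assumptions chosen for illustration.

```python
import math


def kl_divergence(posterior, prior):
    """Return KL(posterior || prior) in nats.

    States with zero posterior mass contribute nothing to the sum.
    """
    if len(posterior) != len(prior):
        raise ValueError("posterior and prior must have the same length")
    return sum(p * math.log(p / q) for p, q in zip(posterior, prior) if p > 0)


def uniform(k):
    """Uniform distribution over k discrete states (assumed prior)."""
    return [1.0 / k] * k


def concentrated(k, top=0.9):
    """Hypothetical posterior: `top` mass on state 0, rest spread evenly."""
    rest = (1.0 - top) / (k - 1)
    return [top] + [rest] * (k - 1)


# The same "shape" of posterior scores a larger KL as the state space grows.
for k in (2, 5, 10, 20):
    print(k, round(kl_divergence(concentrated(k), uniform(k)), 3))
```

With a uniform prior, an equally concentrated posterior yields a larger KL value as the number of discrete states grows, which is one simple way to see why the metric can favor finer discretization schemes independently of classification accuracy.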


“…In our prior work, we developed GeoBoost and other automated language processing methods to address the lack of geospatial certainty in sequence databases. GeoBoost improves the granularity of the location of the infected host (LOIH) for GenBank records (Tahsin et al. 2018).…”

Section: Introduction (mentioning)

confidence: 99%

“…From these, GeoBoost extracts all geospatial mentions and assigns a probability of the LOIH given the GenBank record, $P(L_i \mid R_i)$, where $L_i$ represents the unknown location and $R_i$ indicates the linked record information for taxon $i$. The probabilities are currently based on a set of predefined rules that assign higher probabilities to more specific and accurate locations found in papers that can be used jointly with information scanned from the GenBank record (Tahsin et al. 2018).…”

Section: Introduction (mentioning)

confidence: 99%

Incorporating sampling uncertainty in the geospatial assignment of taxa for virus phylogeography

et al. 2019

Self Cite


Discrete phylogeography using software such as BEAST considers the sampling location of each taxon as fixed; often to a single location without uncertainty. When studying viruses, this implies that there is no possibility that the location of the infected host for that taxa is somewhere else. Here, we relaxed this strong assumption and allowed for analytic integration of uncertainty for discrete virus phylogeography. We used automatic language processing methods to find and assign uncertainty to alternative potential locations. We considered two influenza case studies: H5N1 in Egypt; H1N1 pdm09 in North America. For each, we implemented scenarios in which 25 per cent of the taxa had different amounts of sampling uncertainty including 10, 30, and 50 per cent uncertainty and varied how it was distributed for each taxon. This includes scenarios that: (i) placed a specific amount of uncertainty on one location while uniformly distributing the remaining amount across all other candidate locations (correspondingly labeled 10, 30, and 50); (ii) assigned the remaining uncertainty to just one other location; thus ‘splitting’ the uncertainty among two locations (i.e. 10/90, 30/70, and 50/50); and (iii) eliminated uncertainty via two predefined heuristic approaches: assignment to a centroid location (CNTR) or the largest population in the country (POP). We compared all scenarios to a reference standard (RS) in which all taxa had known (absolutely certain) locations. From this, we implemented five random selections of 25 per cent of the taxa and used these for specifying uncertainty. We performed posterior analyses for each scenario, including: (a) virus persistence, (b) migration rates, (c) trunk rewards, and (d) the posterior probability of the root state. The scenarios with sampling uncertainty were closer to the RS than CNTR and POP. 
For H5N1, the absolute error of virus persistence had a median range of 0.005–0.047 for scenarios with sampling uncertainty—(i) and (ii) above—versus a range of 0.063–0.075 for CNTR and POP. Persistence for the pdm09 case study followed a similar trend as did our analyses of migration rates across scenarios (i) and (ii). When considering the posterior probability of the root state, we found all but one of the H5N1 scenarios with sampling uncertainty had agreement with the RS on the origin of the outbreak whereas both CNTR and POP disagreed. Our results suggest that assigning geospatial uncertainty to taxa benefits estimation of virus phylogeography as compared to ad-hoc heuristics. We also found that, in general, there was limited difference in results regardless of how the sampling uncertainty was assigned; uniform distribution or split between two locations did not greatly impact posterior results. This framework is available in BEAST v.1.10. In future work, we will explore viruses beyond influenza. We will also develop a w...
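Scenarios (i) and (ii) in this abstract can be sketched as simple probability distributions over candidate locations. The code below is one reading of the abstract, not the authors' implementation: the function names, the placeholder location labels, and the choice of which location receives the labeled percentage are all assumptions.

```python
def uniform_remainder(candidates, focal, focal_mass):
    """Scenario (i), as read here: put `focal_mass` on one location and
    spread the remaining mass uniformly over all other candidates."""
    others = [c for c in candidates if c != focal]
    if focal not in candidates or not others:
        raise ValueError("need the focal location plus at least one other")
    if not 0.0 <= focal_mass <= 1.0:
        raise ValueError("focal_mass must lie in [0, 1]")
    share = (1.0 - focal_mass) / len(others)
    dist = {c: share for c in others}
    dist[focal] = focal_mass
    return dist


def split_two(candidates, focal, other, focal_mass):
    """Scenario (ii), as read here: split all mass between two locations;
    every remaining candidate gets probability zero."""
    if focal == other or focal not in candidates or other not in candidates:
        raise ValueError("focal and other must be distinct candidates")
    if not 0.0 <= focal_mass <= 1.0:
        raise ValueError("focal_mass must lie in [0, 1]")
    dist = {c: 0.0 for c in candidates}
    dist[focal] = focal_mass
    dist[other] = 1.0 - focal_mass
    return dist


# Placeholder labels; real analyses would use actual sampling locations.
locations = ["A", "B", "C", "D"]
print(uniform_remainder(locations, "A", 0.10))
print(split_two(locations, "A", "B", 0.30))
```

Each function returns a mapping from location to probability that sums to one, which is the form a per-taxon location prior would need before being passed to a phylogeographic analysis.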



Abstract: Supplementary data are available at Bioinformatics online.
