Topic detection with large and noisy data collections such as social media must address both scalability and accuracy challenges. KeyGraph is an efficient method that improves on current solutions by considering keyword co-occurrence. We show that KeyGraph has accuracy comparable to state-of-the-art approaches on small, well-annotated collections, and that it can successfully filter irrelevant documents and identify events in large and noisy social media collections. An extensive evaluation using Amazon's Mechanical Turk demonstrated the increased accuracy and high precision of KeyGraph, as well as superior runtime performance compared to other solutions.
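The core data structure behind keyword co-occurrence methods like KeyGraph is a graph whose nodes are keywords and whose edges record how often two keywords appear in the same document. The sketch below is a minimal illustration of that construction step, not the authors' implementation; the threshold name `min_count` is chosen here for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(docs, min_count=2):
    """Build a keyword co-occurrence graph from tokenized documents.

    docs: iterable of keyword lists, one list per document.
    Returns a dict mapping each unordered keyword pair to the number
    of documents in which the pair co-occurs, keeping only pairs that
    clear the noise threshold.
    """
    pair_counts = Counter()
    for keywords in docs:
        # Count each unordered pair at most once per document.
        for u, v in combinations(sorted(set(keywords)), 2):
            pair_counts[(u, v)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

docs = [["quake", "rescue", "aid"],
        ["quake", "rescue"],
        ["aid", "donation"]]
print(cooccurrence_graph(docs))  # {('quake', 'rescue'): 2}
```

Community detection or clustering on this graph then groups keywords into candidate topics; thresholding the edges is what lets the method drop irrelevant, noisy documents at scale.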
In this paper, we focus on finding complex annotation patterns representing novel and interesting hypotheses in gene annotation data. We define a generalization of the densest subgraph problem by adding a distance restriction, defined by a separate metric, on the nodes of the subgraph. We show that while this generalization makes the problem NP-hard for arbitrary metrics, when the metric comes from the distance metric of a tree or an interval graph, the problem can be solved optimally in polynomial time. We also show that the densest subgraph problem with a specified subset of vertices that must be included in the solution can be solved optimally in polynomial time. In addition, we consider extensions in which not just one solution is needed: we wish to list all subgraphs of almost maximum density as well. We apply this method to a dataset of genes and their annotations obtained from The Arabidopsis Information Resource (TAIR). A user evaluation confirms that the patterns found in the distance-restricted densest subgraph for a dataset of photomorphogenesis genes are indeed validated in the literature; a control dataset confirms that these are not random patterns. Interestingly, the complex annotation patterns potentially lead to new and as yet unknown hypotheses. We perform experiments to determine the properties of the dense subgraphs as we vary parameters, including the number of genes and the distance threshold.
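For concreteness, the distance-restricted objective can be stated as follows, assuming the standard edge-density measure (the paper's exact formulation may differ in detail). Given a graph $G=(V,E)$, a metric $d$ on $V$, and a bound $r$:

\[
\max_{S \subseteq V} \; \frac{|E(S)|}{|S|} \qquad \text{subject to} \qquad d(u,v) \le r \;\; \text{for all } u, v \in S,
\]

where $E(S)$ is the set of edges of $G$ with both endpoints in $S$. Dropping the constraint recovers the classical densest subgraph problem, which is solvable in polynomial time; the hardness result above says it is the distance constraint, for an arbitrary metric $d$, that makes the problem NP-hard.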
Accessing many data sources aggravates problems for users of heterogeneous distributed databases. Database administrators must deal with fragile mediators, that is, mediators with schemas and views that must be significantly changed to incorporate a new data source. When implementing translators of queries from mediators to data sources, database implementors must deal with data sources that do not support all the functionality required by mediators. Application programmers must deal with graceless failures for unavailable data sources: queries simply return failure, and no further information, when data sources are unavailable for query processing. The Distributed Information Search COmponent (Disco) addresses these problems. Data modeling techniques manage the connections to data sources, and sources can be added transparently to the users and applications. The interface between mediators and data sources flexibly handles different query languages and different data source functionality. Query rewriting and optimization techniques rewrite queries so they are efficiently evaluated by sources. Query processing and evaluation semantics are developed to process queries over unavailable data sources. In this article we describe (a) the distributed mediator architecture of Disco, (b) the data model and its modeling of data source connections, (c) the interface to underlying data sources and the query rewriting process, and (d) the query processing semantics. We also describe several advantages of our system.
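To illustrate the flavor of query semantics over unavailable sources, the sketch below shows a mediator that returns whatever the reachable sources produced together with a record of what remains unanswered. It is a hypothetical illustration under assumed names (`mediate`, `source.evaluate`, `source.name`), not Disco's actual interface or algorithm.

```python
# Hypothetical sketch of partial-answer semantics; the names and
# structure are illustrative, not taken from the Disco system.
def mediate(query, sources):
    """Evaluate `query` against each source, tolerating unavailability.

    Returns (answers, pending): the tuples obtained from reachable
    sources, plus the names of sources whose subqueries still need to
    be evaluated once they come back online.
    """
    answers, pending = [], []
    for source in sources:
        try:
            answers.extend(source.evaluate(query))
        except ConnectionError:
            # Instead of failing the whole query, record the
            # unanswered subquery for later re-evaluation.
            pending.append(source.name)
    return answers, pending
```

The point of such semantics is that the application receives a partial answer plus enough information to resume, rather than the bare failure described above.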
Drug-target interaction studies are important because they can predict drugs' unexpected therapeutic or adverse side effects. In silico predictions of potential interactions are valuable and can focus effort on in vitro experiments. We propose a prediction framework that represents the problem using a bipartite graph of drug-target interactions augmented with drug-drug and target-target similarity measures, and makes predictions using probabilistic soft logic (PSL). Using probabilistic rules in PSL, we predict interactions with models based on triad and tetrad structures. We apply blocking techniques that make link prediction in PSL more efficient for drug-target interaction prediction. We then perform extensive experimental studies to highlight different aspects of the model and the domain, first comparing the models with different structures and then measuring the effect of the proposed blocking on prediction performance and efficiency. We demonstrate the importance of rule-weight learning in the proposed PSL model and show that PSL can effectively make use of a variety of similarity measures. We perform an experiment to validate the importance of collective inference and of using multiple similarity measures for accurate predictions, in contrast to non-collective and single-similarity assumptions. Finally, we illustrate that our PSL model achieves state-of-the-art performance with simple, interpretable rules and evaluate our novel predictions using online datasets.
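As a concrete illustration of the triad structure, consider a rule of the form SimilarDrug(D1,D2) & Interacts(D2,T) -> Interacts(D1,T). Under PSL's Lukasiewicz soft-logic semantics, such a rule incurs a hinge penalty to the degree that its body is more true than its head. The sketch below computes that distance to satisfaction for one grounding; it is a minimal illustration of the semantics, not the paper's full model or the pslpython API.

```python
def triad_penalty(sim_d1_d2, interacts_d2_t, interacts_d1_t, weight=1.0):
    """Distance to satisfaction of the triad rule
       SimilarDrug(D1,D2) & Interacts(D2,T) -> Interacts(D1,T)
    under Lukasiewicz logic, as used in PSL's hinge-loss potentials.
    All truth values are soft, in [0, 1]; `weight` is the rule weight.
    """
    # Lukasiewicz conjunction of the two body atoms.
    body = max(0.0, sim_d1_d2 + interacts_d2_t - 1.0)
    # The rule is violated to the degree the body exceeds the head.
    distance = max(0.0, body - interacts_d1_t)
    return weight * distance

# Two very similar drugs, one known interaction, one candidate target:
print(triad_penalty(0.9, 0.8, 0.2))  # 0.5 -> inference pushes the head up
```

Collective inference minimizes the weighted sum of such penalties over all groundings at once, which is why multiple similarity measures and rule-weight learning both matter: each similarity contributes its own rules, and the learned weights decide how strongly each one propagates interactions.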
Large-scale disasters bring together a diversity of organizations and produce massive amounts of heterogeneous data that must be managed by these organizations. The lack of effective ICT solutions can lead to a lack of coordination and chaos among these organizations as they track victims' needs and respond to the disaster. The result can be delayed or ineffective response, potential wastage of pledged support, imbalances in aid distribution, and a lack of transparency. ICT solutions to manage disasters can potentially improve efficiency and effectiveness. Sahana is a Free and Open Source Software (FOSS) application that aims to be a comprehensive solution for information management in relief operations, recovery, and rehabilitation. This paper addresses the alignment between FOSS development and humanitarian applications. It then describes the anatomy of the Sahana system. We follow up with a case study of a Sahana deployment and lessons learned.

I. INTRODUCTION

Recent disasters such as the 2003 SARS outbreak, the 2004 Asian tsunami, the 2005 Kashmir/Pakistan earthquake, and the 2005 hurricanes Katrina and Rita clearly identified the shortcomings of ICT solutions for disaster rescue and recovery.

Large-scale disasters are typically accompanied by the need to effectively manage massive amounts of data. This includes data about victims and about relief personnel; data about damage to buildings, infrastructure, and belongings; weather data; geographical data about roads and other landmarks; logistics data; communication and message data; financial data needed to manage the collection and distribution of donations; data in blogs; etc. Major disasters also involve multiple autonomous organizations (governmental, NGOs, INGOs, individuals, communities, and industry). This leads to a diversity of client needs that must be coordinated.

Despite the tremendous value of disaster management systems, very few such systems exist today, and none are widely deployed. The most widely used system appears to be non-Web-based and uses proprietary, non-standard database technology. While various specialized components exist, there is no single cohesive system that organizations such as the United Nations Disaster Assessment and Coordination (UNDAC) can routinely deploy.

There are disaster information systems that focus on specialized application or data requirements, including imagery and GIS data [1][2][3], early warning models using sensor data [4], mobile ad hoc networks and messaging, etc. [5].