Luis Gravano scite author profile

²

,

Papakonstantinou

³

2003

Applications in which plain text coexists with structured data are pervasive. Commercial relational database management systems (RDBMSs) generally provide querying capabilities for text attributes that incorporate state-of-the-art information retrieval (IR) relevance ranking strategies, but this search functionality requires that queries specify the exact column or columns against which a given list of keywords is to be matched. This requirement can be cumbersome and inflexible from a user perspective: good answers to a keyword query might need to be "assembled" (in perhaps unforeseen ways) by joining tuples from multiple relations. This observation has motivated recent research on free-form keyword search over RDBMSs. In this paper, we adapt IR-style document-relevance ranking strategies to the problem of processing free-form keyword queries over RDBMSs. Our query model can handle queries with both AND and OR semantics, and exploits the sophisticated single-column text-search functionality often available in commercial RDBMSs. We develop query-processing strategies that build on a crucial characteristic of IR-style keyword search: only the few most relevant matches (according to some definition of "relevance") are generally of interest. Consequently, rather than computing all matches for a keyword query, which leads to inefficient executions, our techniques focus on the top-k matches for the query, for moderate values of k. A thorough experimental evaluation over real data shows the performance advantages of our approach.

Learning similarity metrics for event identification in social media

Becker

¹

,

Naaman

²

,

³

2010

Social media sites (e.g., Flickr, YouTube, and Facebook) are a popular distribution outlet for users looking to share their experiences and interests on the Web. These sites host substantial amounts of user-contributed materials (e.g., photographs, videos, and textual content) for a wide variety of real-world events of different type and scale. By automatically identifying these events and their associated user-contributed social media documents, which is the focus of this paper, we can enable event browsing and search in state-of-the-art search engines. To address this problem, we exploit the rich "context" associated with social media content, including user-provided annotations (e.g., title, tags) and automatically generated information (e.g., content creation time). Using this rich context, which includes both textual and non-textual features, we can define appropriate document similarity metrics to enable online clustering of media to events. As a key contribution of this paper, we explore a variety of techniques for learning multi-feature similarity metrics for social media documents in a principled manner. We evaluate our techniques on large-scale, real-world datasets of event images from Flickr. Our evaluation results suggest that our approach identifies events, and their associated social media documents, more effectively than the state-of-the-art strategies on which we build.
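
The idea of combining textual and non-textual features into one document similarity, and then clustering documents into events in a single pass, can be illustrated with a minimal sketch. This is not the method of the paper: the weights here are fixed by hand rather than learned, and the feature functions, the `threshold`, the one-day `window`, and the document fields `tags` and `time` are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Tag-set overlap in [0, 1]; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def time_similarity(t1, t2, window=86400.0):
    """1.0 for identical timestamps (seconds), falling to 0.0 at `window` apart."""
    return max(0.0, 1.0 - abs(t1 - t2) / window)


def similarity(d1, d2, w_tags=0.5, w_time=0.5):
    """Hand-weighted combination of the two per-feature similarities."""
    return (w_tags * jaccard(d1["tags"], d2["tags"])
            + w_time * time_similarity(d1["time"], d2["time"]))


def cluster_online(docs, threshold=0.5):
    """Single-pass clustering: each document joins the most similar existing
    cluster (compared against that cluster's first document) if the similarity
    reaches `threshold`; otherwise it starts a new cluster."""
    clusters = []
    for doc in docs:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(doc, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([doc])
        else:
            best.append(doc)
    return clusters
```

Two photos tagged with overlapping concert tags an hour apart would land in the same cluster under these settings, while a photo with unrelated tags taken days later would start its own.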

Evaluating top-k queries over Web-accessible databases

Bruno

¹

,

²

,

³

A query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or "top" k pages for the query. This top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. For example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. A user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. Processing top-k queries efficiently is challenging for a number of reasons. One critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. In this article, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. We present a sequential algorithm for processing such queries, but observe that any sequential top-k query processing strategy is bound to require unnecessarily long query processing times, since web accesses exhibit high and variable latency. Fortunately, web sources can be probed in parallel, and each source can typically process concurrent requests, although sources may impose some restrictions on the type and number of probes that they are willing to accept. We adapt our sequential query processing technique and introduce an efficient algorithm that maximizes source-access parallelism to minimize query response time, while satisfying source-access constraints. We evaluate our techniques experimentally using both synthetic and real web-accessible data and show that parallel algorithms can be significantly more efficient than their sequential counterparts.
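
The restaurant example in this abstract can be made concrete with a small sketch. It is a naive baseline under stated assumptions, not the algorithm of the article: it probes every candidate in full (concurrently, via a thread pool) and then keeps the k highest scores, whereas the article's contribution is about avoiding unnecessary probes and respecting source-access constraints. The `CATALOG` data, the three scoring functions standing in for external web sources, and the weights are all hypothetical.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data; in the abstract's setting these attributes would sit
# behind external, autonomous web sources rather than in local memory.
CATALOG = {
    "A": {"xy": (0.0, 0.0), "price": 20.0, "rating": 5.0},
    "B": {"xy": (3.0, 4.0), "price": 40.0, "rating": 2.5},
    "C": {"xy": (0.0, 1.0), "price": 25.0, "rating": 4.0},
}


def make_sources(user_xy, target_price):
    """Build three stand-in 'sources', each mapping a name to a score in [0, 1]."""
    def proximity(name):
        x, y = CATALOG[name]["xy"]
        return 1.0 / (1.0 + math.hypot(x - user_xy[0], y - user_xy[1]))

    def price_match(name):
        return 1.0 / (1.0 + abs(CATALOG[name]["price"] - target_price))

    def rating(name):
        return CATALOG[name]["rating"] / 5.0

    return [proximity, price_match, rating]


def top_k(names, sources, weights, k, max_workers=4):
    """Score all candidates concurrently, then return the k best (score, name) pairs."""
    def total(name):
        return sum(w * src(name) for w, src in zip(weights, sources))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(total, names))  # results keep the order of `names`
    return heapq.nlargest(k, zip(scores, names))
```

For example, `top_k(list(CATALOG), make_sources((0.0, 0.0), 20.0), (0.4, 0.3, 0.3), k=2)` ranks the restaurant that sits at the user's location, matches the target price, and has the highest rating first.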

Hip and trendy: Characterizing emerging trends on Twitter

Naaman

¹

,

Becker

²

,

³

2011

J. Am. Soc. Inf. Sci.

Twitter, Facebook, and other related systems that we call social awareness streams are rapidly changing the information and communication dynamics of our society. These systems, where hundreds of millions of users share short messages in real time, expose the aggregate interests and attention of global and local communities. In particular, emerging temporal trends in these systems, especially those related to a single geographic area, are a significant and revealing source of information for, and about, a local community. This study makes two essential contributions for interpreting emerging temporal trends in these information systems. First, based on a large dataset of Twitter messages from one geographic area, we develop a taxonomy of the trends present in the data. Second, we identify important dimensions according to which trends can be categorized, as well as the key distinguishing features of trends that can be derived from their associated messages. We quantitatively examine the computed features for different categories of trends, and establish that significant differences can be detected across categories. Our study advances the understanding of trends on Twitter and other social awareness streams, which will enable powerful applications and activities, including user-driven real-time information services for local communities.

Received July 30, 2010; revised December 20, 2010; accepted December 21, 2010. © 2011 ASIS&T. Published online 7 March 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.21489

Introduction

In recent years, a class of communication and information platforms we call social awareness streams (SAS) has been shifting the manner in which we consume and produce information. Available from social media services such as Facebook, Twitter, FriendFeed, and others, these hugely popular networks allow participants to post streams of lightweight content artifacts, from short status messages to links, pictures, and videos. These SAS platforms have already shown considerable impact on the information, communication, and media infrastructure of our society (Johnson, 2009), as evidenced during major global events such as the Iran election or the reaction to the earthquake in Haiti (Kwak, Lee, Park, & Moon, 2010), as well as in response to local events and emergencies (Shklovski, Palen, & Sutton, 2008; Starbird, Palen, Hughes, & Vieweg, 2010).

SAS allow for rapid, immediate sharing of information aimed at known contacts or the general public. The content of the often-public shared items ranges from personal status updates to opinions and information sharing (Naaman, Boase, & Lai, 2010). In aggregate, however, the postings by hundreds of millions of users of Facebook, Twitter, and other systems expose global interests, happenings, and attitudes in almost real time (Kwak et al., 2010).

These interests and happenings as reflected in SAS data change rapidly. The strong temporal nature of SAS information allows for the detection of significant events and other temporal trends ...

Evaluating top-k queries over web-accessible databases

Marian

¹

,

Bruno

²

,

ACM Trans. Database Syst.

³

2004

A query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or "top" k pages for the query. This top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. For example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. A user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. Processing top-k queries efficiently is challenging for a number of reasons. One critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. In this article, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. We present a sequential algorithm for processing such queries, but observe that any sequential top-k query processing strategy is bound to require unnecessarily long query processing times, since web accesses exhibit high and variable latency. Fortunately, web sources can be probed in parallel, and each source can typically process concurrent requests, although sources may impose some restrictions on the type and number of probes that they are willing to accept. We adapt our sequential query processing technique and introduce an efficient algorithm that maximizes source-access parallelism to minimize query response time, while satisfying source-access constraints. We evaluate our techniques experimentally using both synthetic and real web-accessible data and show that parallel algorithms can be significantly more efficient than their sequential counterparts.