Fast candidate generation for real-time tweet search with bloom filter chains

Asadi, Nima; Lin, Jimmy

doi:10.1145/2493175.2493178

Cited by 24 publications

(16 citation statements)

References 63 publications


“…We found that the idf values in both conditions were nearly identical across all query terms; this is not surprising considering that idf is on a log scale, and it takes substantial variations in document frequencies to have a noticeable effect on the value. However, Figure 1 shows that there is a large difference in effectiveness for a few topics: MB15, MB17, and MB35.…”

Section: Results (supporting)

confidence: 50%


The Impact of Future Term Statistics in Real-Time Tweet Search

Wang

Lin

2014

Lecture Notes in Computer Science

Self Cite


Abstract. In the real-time tweet search task operationalized in the TREC Microblog evaluations, a topic consists of a query Q and a time t, modeling the task where the user wishes to see the most recent but relevant tweets that address the information need. To simulate the real-time aspect of the task in an evaluation setting, many systems search over the entire collection and then discard results that occur after the query time. This approach, while computationally efficient, "cheats" in that it takes advantage of term statistics from documents not available at query time (i.e., future information). We show, however, that such results are nearly identical to a "gold standard" method that builds a separate index for each topic containing only those documents that occur before the query time. The implications of this finding on evaluation, system design, and user task models are discussed.



“…In this architecture, our experiments consider the candidate generation stage. Additional work has shown that end-to-end retrieval effectiveness is insensitive to the candidate generation algorithm [6,3], which means that our experiments using simple query-likelihood accurately reflect real-world conditions.…”

Section: Discussion (mentioning)

confidence: 98%

The Impact of Future Term Statistics in Real-Time Tweet Search

Wang

Lin

2014

Lecture Notes in Computer Science

Self Cite


“…It requires methods and tools which can effectively extract data via APIs and then analyse these data to extract information of interest [3,10]. Social media research has introduced various methods for the effective collection of disaster-related posts, such as Bloom Filter Chains for real-time tweet search [9], TAKMI technology for content analysis [8] and the Twitter API functions 'crawl' and 'timeline' [11]. Krishnamurthy et al. (2008) used these two methods, both relying on API functions provided by Twitter, to collect large amounts of data through the crawl and timeline functions of Twitter.…”

Section: A Data Collection Methods (mentioning)

confidence: 99%

Characterization of the Use of Social Media in Natural Disasters: A Systematic Review

Abedin

Babar

Abbasi

2014

2014 IEEE Fourth International Conference on Big Data and Cloud Computing


Social media sites play a significant role in the rapid propagation of information when disasters occur. This communication platform is a highly useful tool for emergency (disaster) management agencies during all phases of the disaster management life cycle: prevention (mitigation), preparedness, response, and recovery. This study conducted a systematic review of the literature on social media use in disaster management to identify how social media sites have been used during these four critical phases of the disaster management life cycle, in order to recommend strategies for government officials. A systematic method was used to search four major academic databases for this review. The search yielded 40 articles, and the findings were categorized into six main themes: situational awareness, data collection methods, distributed sensor systems, news and rumors, sentiment analysis, and digital volunteerism.


“…This limits the applicability of probabilistic data structures to domain-specific use only, such as Genome Sequencing. Another application of probabilistic data structures is big data queries, for instance, BWand [32] for fast query on Twitter tweets, content filtering in MapReduce programs [21], and NoSQL databases such as Google BigTable, Apache HBase and Apache Cassandra. In these cases, the probabilistic data structures are used as an indexing technique to quickly locate information in a distributed storage system.…”

Section: Data Compression (mentioning)

confidence: 99%

Content-Aware Partial Compression for Textual Big Data Analysis in Hadoop

Dong

Herbert

2018

IEEE Trans. Big Data


Abstract. A substantial amount of information in companies and on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. Compression, as an effective means to reduce data size, has been employed by many emerging data analytic platforms, for which the main purpose of data compression is to save storage space and reduce data transmission cost over the network. Since general-purpose compression methods endeavour to achieve higher compression ratios by leveraging data transformation techniques and contextual data, this context-dependency forces access to the compressed data to be sequential. Processing such compressed data in parallel, as is desirable in a distributed environment, is extremely challenging. This work proposes techniques for more efficient textual big data analysis with an emphasis on content-aware compression schemes suitable for the Hadoop analytic platform. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of public and private real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.

