2016
DOI: 10.1007/978-3-319-49586-6_33
A Scalable Document-Based Architecture for Text Analysis

Abstract: Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps, and performance and scaling issues. Existing text analysis architectures only partly solve these issues: they provide restrictive data schemas, address only one aspect of text preprocessing, or focus on a single task when optimizing performance. Thus, we prop…


Cited by 7 publications (5 citation statements)
References 15 publications
“…Previous works have proven that the combination between topic model algorithm and weighting schema is really dependent on the document dataset [52], [53]. The quality of topic modeling can be influenced by the length of documents in the corpus, the frequency of rare words, etc.…”
Section: Results (confidence: 99%)
“…The textual datasets are processed and stored in a MongoDB database. The architecture used for processing and storing documents is presented in detail in [53].…”
Section: Methods (confidence: 99%)
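The citing paper stores preprocessed documents in MongoDB. As a rough sketch of how per-document term weights stored in such a document store could be aggregated into top-k keywords, here is a plain-Python illustration; the field names (`_id`, `text`, `terms`) and the sample weights are hypothetical assumptions, not taken from the architecture described in the paper:

```python
import heapq
from collections import defaultdict

# Hypothetical shape of preprocessed documents as they might be stored
# in a document store such as MongoDB (illustrative fields and weights).
documents = [
    {"_id": 1, "text": "big data text analysis",
     "terms": {"big": 0.4, "data": 0.4, "text": 0.1, "analysis": 0.1}},
    {"_id": 2, "text": "scalable text storage",
     "terms": {"scalable": 0.5, "text": 0.2, "storage": 0.3}},
]

def top_k_keywords(docs, k):
    """Sum per-document term weights, then return the k heaviest terms."""
    totals = defaultdict(float)
    for doc in docs:
        for term, weight in doc["terms"].items():
            totals[term] += weight
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
```

In a real deployment this aggregation would more likely run server-side (e.g. as a MongoDB aggregation pipeline) rather than in application code.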
“…We designed our benchmarks' queries by logging computational linguists working on a text analysis platform [12,13] on real-world data. After analyzing and clustering similar user queries, we ended up with 8 queries (4 for T²K² and 4 for T²K²D²) that we consider generic enough to benchmark any similar system.…”
Section: Introduction (confidence: 99%)
“…Hence, we propose in this paper the Twitter Top-K Keywords Benchmark (T²K²), which features a real tweet dataset and queries with various complexities and selectivities. We designed T²K² to be somewhat generic, i.e., it can compare various weighting schemes, database logical and physical implementations, and even text analytics platforms [18] in terms of computing efficiency. As a proof of concept of T²K²'s relevance and genericity, we show how to implement the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and relational and document-oriented database instantiations, on the other hand.…”
Section: Introduction (confidence: 99%)
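The two weighting schemes the benchmark compares, TF-IDF and Okapi BM25, can be sketched in a few lines of plain Python. The toy corpus, function names, and the BM25 defaults k1=1.2, b=0.75 below are illustrative assumptions, not values taken from T²K² itself:

```python
import math

# Toy corpus: each "document" is a list of tokens.
corpus = [
    ["big", "data", "text", "analysis"],
    ["topic", "model", "text"],
    ["big", "data", "storage"],
]

N = len(corpus)                                 # number of documents
avgdl = sum(len(d) for d in corpus) / N         # average document length

def df(term):
    """Document frequency: how many documents contain the term."""
    return sum(1 for d in corpus if term in d)

def tf_idf(term, doc):
    """Classic TF-IDF weight of a term in one document."""
    tf = doc.count(term) / len(doc)
    idf = math.log(N / (1 + df(term)))          # +1 smoothing avoids log(inf)
    return tf * idf

def bm25(term, doc, k1=1.2, b=0.75):
    """Okapi BM25 weight; k1 damps term frequency, b normalizes doc length."""
    f = doc.count(term)
    idf = math.log((N - df(term) + 0.5) / (df(term) + 0.5) + 1)
    return idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
```

Unlike TF-IDF, BM25 saturates with repeated term occurrences and explicitly penalizes long documents, which is why benchmarks like T²K² treat the two as distinct workloads.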