2016
DOI: 10.1007/978-3-319-49586-6_33
A Scalable Document-Based Architecture for Text Analysis

Abstract: Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps, and performance and scaling issues. Existing text analysis architectures only partly solve these issues: they provide restrictive data schemas, address only one aspect of text preprocessing, or focus on a single task when optimizing performance. Thus, we prop…


Cited by 7 publications (5 citation statements)
References 15 publications
“…Previous works have proven that the combination between topic model algorithm and weighting schema is really dependent on the document dataset [52], [53]. The quality of topic modeling can be influenced by the length of documents in the corpus, the frequency of rare words, etc.…”
Section: Results (confidence: 99%)
“…The textual datasets are processed and stored in a MongoDB database. The architecture used for processing and storing documents is presented in detail in [53].…”
Section: Methods (confidence: 99%)
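The citing paper stores preprocessed documents in MongoDB. As a rough sketch of how per-document term weights stored in such a document store could be aggregated into top-k keywords, here is a plain-Python illustration; the field names (`_id`, `text`, `terms`) and the sample weights are hypothetical assumptions, not taken from the architecture described in the paper:

```python
import heapq
from collections import defaultdict

# Hypothetical shape of preprocessed documents as they might be stored
# in a document store such as MongoDB (illustrative fields and weights).
documents = [
    {"_id": 1, "text": "big data text analysis",
     "terms": {"big": 0.4, "data": 0.4, "text": 0.1, "analysis": 0.1}},
    {"_id": 2, "text": "scalable text storage",
     "terms": {"scalable": 0.5, "text": 0.2, "storage": 0.3}},
]

def top_k_keywords(docs, k):
    """Sum per-document term weights, then return the k heaviest terms."""
    totals = defaultdict(float)
    for doc in docs:
        for term, weight in doc["terms"].items():
            totals[term] += weight
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
```

In a real deployment this aggregation would more likely run server-side (e.g. as a MongoDB aggregation pipeline) rather than in application code.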
“…We designed our benchmarks' queries by logging computational linguists working on a text analysis platform [12,13] on real-world data. After analyzing and clustering similar user queries, we ended up with 8 queries (4 for T²K² and 4 for T²K²D²) that we consider generic enough to benchmark any similar system.…”
Section: Introduction (confidence: 99%)
“…Hence, we propose in this paper the Twitter Top-K Keywords Benchmark (T²K²), which features a real tweet dataset and queries with various complexities and selectivities. We designed T²K² to be somewhat generic, i.e., it can compare various weighting schemes, database logical and physical implementations, and even text analytics platforms [18] in terms of computing efficiency. As a proof of concept of T²K²'s relevance and genericity, we show how to implement the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and relational and document-oriented database instantiations, on the other hand.…”
Section: Introduction (confidence: 99%)
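The two weighting schemes the benchmark compares, TF-IDF and Okapi BM25, can be sketched in a few lines of plain Python. The toy corpus, function names, and the BM25 defaults k1=1.2, b=0.75 below are illustrative assumptions, not values taken from T²K² itself:

```python
import math

# Toy corpus: each "document" is a list of tokens.
corpus = [
    ["big", "data", "text", "analysis"],
    ["topic", "model", "text"],
    ["big", "data", "storage"],
]

N = len(corpus)                                 # number of documents
avgdl = sum(len(d) for d in corpus) / N         # average document length

def df(term):
    """Document frequency: how many documents contain the term."""
    return sum(1 for d in corpus if term in d)

def tf_idf(term, doc):
    """Classic TF-IDF weight of a term in one document."""
    tf = doc.count(term) / len(doc)
    idf = math.log(N / (1 + df(term)))          # +1 smoothing avoids log(inf)
    return tf * idf

def bm25(term, doc, k1=1.2, b=0.75):
    """Okapi BM25 weight; k1 damps term frequency, b normalizes doc length."""
    f = doc.count(term)
    idf = math.log((N - df(term) + 0.5) / (df(term) + 0.5) + 1)
    return idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
```

Unlike TF-IDF, BM25 saturates with repeated term occurrences and explicitly penalizes long documents, which is why benchmarks like T²K² treat the two as distinct workloads.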