Proceedings of the 2017 ACM Symposium on Document Engineering
DOI: 10.1145/3103010.3121040

Distributing Text Mining tasks with librAIry

Abstract: We present librAIry, a novel architecture to store, process and analyze large collections of textual resources, integrating existing algorithms and tools into a common, distributed, high-performance workflow. Available text mining techniques can be incorporated into the framework as independent plug&play modules that work collaboratively. In the absence of a pre-defined flow, librAIry leverages the aggregation of operations executed by different components in response to an emergent chain of events. Exte…
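
The abstract describes an event-driven architecture in which independent modules react to an emergent chain of events rather than follow a fixed pipeline. Below is a minimal sketch of that pattern; all class, event, and function names are hypothetical illustrations, not librAIry's actual API.

```python
# Minimal sketch of the event-driven, plug-and-play pattern the abstract
# describes. All names here are hypothetical, not librAIry's real interface.
from collections import defaultdict
from typing import Callable

class EventBus:
    """Routes events to whichever modules subscribed to them."""
    def __init__(self):
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        for handler in self._handlers[event]:
            handler(payload)

bus = EventBus()

# Each text mining tool registers independently; there is no pre-defined flow.
def tokenizer(doc: dict) -> None:
    doc["tokens"] = doc["text"].split()
    bus.publish("document.tokenized", doc)   # emergent chain of events

def topic_tagger(doc: dict) -> None:
    doc["topics"] = ["topic-placeholder"]    # a real module would run LDA here

bus.subscribe("document.created", tokenizer)
bus.subscribe("document.tokenized", topic_tagger)

bus.publish("document.created", {"text": "librAIry distributes text mining tasks"})
```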

Cited by 8 publications (4 citation statements) · References 6 publications

“…In [14], an investigation of the influence of text length on IR tasks is shown using a Probabilistic Topic Model (PTM). The goal is to reduce the dimensions of the vectors in IR models to simplify the operations without affecting the system performance.…”
Section: Related Work
confidence: 99%
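
The statement above concerns replacing high-dimensional term vectors with low-dimensional topic proportions for IR. A hedged sketch of the idea follows, using gensim's LDA as a stand-in; the cited work's exact models and settings may differ.

```python
# Sketch: represent documents by K topic proportions instead of |V| term
# weights, then compare them in the reduced space. gensim is a stand-in here;
# the cited paper's exact setup may differ.
from gensim import corpora, models, matutils

docs = [["text", "mining", "pipeline"],
        ["topic", "models", "for", "retrieval"],
        ["distributed", "text", "processing"]]

dictionary = corpora.Dictionary(docs)              # |V|-dimensional vocabulary
bow = [dictionary.doc2bow(d) for d in docs]

K = 2                                              # K << |V| topics
lda = models.LdaModel(bow, num_topics=K, id2word=dictionary, random_state=0)

# Each document becomes a dense K-dimensional topic vector.
topic_vecs = [lda.get_document_topics(b, minimum_probability=0.0) for b in bow]

# Ranking by cosine similarity now costs O(K) per pair instead of O(|V|).
query = lda.get_document_topics(dictionary.doc2bow(["text", "mining"]),
                                minimum_probability=0.0)
for i, vec in enumerate(topic_vecs):
    print(i, matutils.cossim(query, vec))
```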
“…Then, we set the number of topics 𝐾 = 500 (several configurations were evaluated, but this was the closest to the performance obtained with the supervised model based on categories). We run the Gibbs samplers for 1000 training iterations on LDA from the open-source librAIry [1] software. The Dirichlet priors 𝛼 = 0.1 and 𝛽 = 0.01 were set following [20].…”
Section: Cross-lingual Models
confidence: 99%
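
For reference, here is a compact sketch of the collapsed Gibbs update implied by these settings, using the quoted priors 𝛼 = 0.1 and 𝛽 = 0.01. The corpus, vocabulary, K, and iteration count are toy-sized so the snippet runs quickly, whereas the statement uses K = 500 and 1000 iterations.

```python
# Compact collapsed Gibbs sampler for LDA with the Dirichlet priors quoted
# above (alpha = 0.1, beta = 0.01). Toy-sized K and corpus for illustration;
# the cited setup uses K = 500 and 1000 iterations.
import numpy as np

docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 2, 3]]   # word ids per document
V, K, ALPHA, BETA, ITERS = 4, 2, 0.1, 0.01, 200

rng = np.random.default_rng(0)
z = [rng.integers(K, size=len(d)) for d in docs]     # topic of each token
ndk = np.zeros((len(docs), K))                       # doc-topic counts
nkw = np.zeros((K, V))                               # topic-word counts
nk = np.zeros(K)                                     # tokens per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1

for _ in range(ITERS):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                              # remove current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # Collapsed conditional: p(z = k | rest)
            p = (ndk[d] + ALPHA) * (nkw[:, w] + BETA) / (nk + V * BETA)
            k = rng.choice(K, p=p / p.sum())         # resample the topic
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

theta = (ndk + ALPHA) / (ndk + ALPHA).sum(axis=1, keepdims=True)
print(theta)  # per-document topic proportions
```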
“…For each dataset, documents are mapped to two latent topic spaces with different dimensions using LDA. We perform parameter estimation using collapsed Gibbs sampling for LDA [16] from the open-source librAIry [4] software. It is a framework that combines natural language processing (NLP) techniques with machine learning algorithms on top of the Mallet toolkit [32], an open-source machine learning package.…”
Section: Datasets and Evaluation Metrics
confidence: 99%
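
Since this statement estimates LDA on top of Mallet, here is a hedged sketch of an equivalent Mallet invocation, driven from Python. The install location and file paths are assumptions. Note that Mallet's --alpha flag takes the sum of the per-topic priors, so a per-topic 𝛼 of 0.1 with K = 500 corresponds to --alpha 50.

```python
# Sketch of driving Mallet's collapsed Gibbs trainer with the settings quoted
# above. Assumes a local Mallet install at MALLET; file paths are examples.
# Mallet's --alpha is the SUM of per-topic priors: 500 topics * 0.1 = 50.
import subprocess

MALLET = "/opt/mallet/bin/mallet"         # hypothetical install location

subprocess.run([MALLET, "import-file",
                "--input", "corpus.tsv",  # one document per line
                "--output", "corpus.mallet",
                "--keep-sequence"], check=True)

subprocess.run([MALLET, "train-topics",
                "--input", "corpus.mallet",
                "--num-topics", "500",
                "--num-iterations", "1000",
                "--alpha", "50",          # 500 topics * 0.1 per-topic prior
                "--beta", "0.01",
                "--output-doc-topics", "doc_topics.txt",
                "--output-topic-keys", "topic_keys.txt"], check=True)
```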