Scaling Data Mining Algorithms to Large and Distributed Datasets

Totad, Shashikumar G.; Geeta, R. B.; Prasanna, Chennupati R; Santhosh, N Krishna; Reddy, Pvgd Prasad

doi:10.5121/ijdms.2010.2403

Cited by 9 publications

(5 citation statements)

References 26 publications

“…In this case, different mining techniques are needed, and partitioning is one such technique. Most disk-based partitioning techniques [11-14] find frequent patterns from each partition and check to discover all frequent patterns. This approach, however, has some drawbacks, because frequent patterns may look infrequent due to local support pruning.…”

Section: Methods (mentioning)

confidence: 99%

“…Another aspect to consider is the size of real DNA sequence databases, which is ever increasing. For the cases where a DNA sequence database can not fit into the main memory, disk-based mining has been studied, based on partitioning [11-14]. Most of these techniques, however, only consider local frequency counting, although many frequent patterns may look infrequent due to local support pruning.…”

Section: Introduction (mentioning)

confidence: 99%
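
The drawback named in the two citation statements above (a pattern that is frequent overall can look infrequent inside one partition) can be illustrated with a small sketch. This is not the method of the cited papers: it assumes fixed-length contiguous patterns, a relative support threshold, and a generic two-pass partition scheme, and all function names and the toy sequences are hypothetical.

```python
from collections import Counter

def support_counts(sequences, k):
    """For each length-k contiguous pattern, count the sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        # Using a set means each sequence contributes at most once per pattern.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def locally_frequent(partition, k, min_ratio):
    """Patterns whose support inside one partition reaches min_ratio."""
    counts = support_counts(partition, k)
    threshold = min_ratio * len(partition)
    return {p for p, c in counts.items() if c >= threshold}

def partition_mine(partitions, k, min_ratio):
    """Two-pass scheme: local candidates first, then a global recount."""
    # Pass 1: a globally frequent pattern must be locally frequent in at
    # least one partition, so the union of local results is a safe candidate set.
    candidates = set()
    for part in partitions:
        candidates |= locally_frequent(part, k, min_ratio)
    # Pass 2: recount every candidate over all partitions.
    total = sum(len(part) for part in partitions)
    global_counts = Counter()
    for part in partitions:
        counts = support_counts(part, k)
        for pattern in candidates:
            global_counts[pattern] += counts[pattern]
    return {p for p in candidates if global_counts[p] >= min_ratio * total}

# Toy data (hypothetical): two partitions of four short DNA sequences each.
part_a = ["ACGT", "ACCA", "GGAC", "TTTT"]
part_b = ["ACTT", "GGGG", "CCCC", "TTGG"]

frequent_in_a = locally_frequent(part_a, 2, 0.5)
frequent_in_b = locally_frequent(part_b, 2, 0.5)
globally_frequent = partition_mine([part_a, part_b], 2, 0.5)
```

With these toy inputs, "AC" occurs in three of four sequences in `part_a` but only one of four in `part_b`, so pruning on local support inside `part_b` alone would discard it, even though it occurs in four of the eight sequences overall. The global recount in the second pass is what keeps it.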

An Efficient Approach to Mining Maximal Contiguous Frequent Patterns from Large DNA Sequence Databases

et al. 2012

Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs that are responsible for similar expression of a group of genes. In order to reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require explicit specification of sequence lengths in advance. The challenge is to find longer sequences without specifying sequence lengths in advance. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that our proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.

“…In addition to BI, meteorology, petroleum exploration, and bioinformatics are among the scientific fields where big data and data mining are gaining popularity. Software, hardware, and sophisticated algorithms are required to support this data sequence [21]. M.Jayasree et al have proposed the difficulty in locating rules of association between products in a large database of sales transactions.…”

Section: Related Work (mentioning)

confidence: 99%

Modern Business Data Analysis and Data Visualization: A Real-Time Fusion Study

Priya J, Vijayadharsan, Vasumathi et al. 2023

ITM Web Conf.

In contemporary data science and analytics, data clustering is a small bucket that divides computation among various child nodes. The network’s capacity, specialized tools, and applications that cannot be trained quickly are among these methods’ drawbacks. In addition, the IoT-formed Big Data raw data can result in highly heterogeneous and unstructured data. This kind of data is difficult to analyze for real-time analytics. Real-time analytical challenges can be reduced by making computational values available locally rather than via distributed resources. Most of the time, it takes a long time and a lot of money to run these teams and skill sets. As an alternative, provide tools that let end users, professionals in the industry, and data scientists directly create and deploy complex data analytics application solutions with less technical knowledge. It highlights key advantages, disadvantages, and potential future directions by contrasting various current research and practice approaches to assisting end users with data analytics.

“…The hierarchy tree documents are ranked using CSI ranking, and It creates another level depending upon the document fitting to the same shard as the previous document. Further, a shard rank is determined using the Lex-S approach [22,23].…”

Section: Connected Shire (Conn-s) (mentioning)

confidence: 99%

HSSA: A Novel Hybrid Shard Selection Algorithm for performance enhancement of distributed processing system

Praveen, Totad 2022

Preprint

Distributed processing systems are widely used for query search operations, where large data is partitioned across a number of nodes for parallel processing and replication. In an exhaustive search approach, for a given query, all the data nodes or shards are searched to find relevant documents matching the user query. Using the sharding technique, we search selected shards to retrieve relevant data for the given query. Conventional shard selection algorithms face significant challenges: shard ranking, shard cutoff estimation, high latency, low throughput, and high cost in processing very large data. Among them, CORI, ReDDe, RankS, and SHiRE are the most popular. The limitation of these algorithms is that performance tends to decrease with increasing data size, affecting search efficiency and effectiveness. To overcome these challenges, we propose a novel hybrid shard selection algorithm (HSS) to enhance search effectiveness and efficiency. The proposed HSS algorithm is designed and tested with medium and large-size datasets (Gov2, clueweb 9) considering precision, recall, and MAP performance metrics. Considering average throughput, the HSS algorithm performs 21%, 16%, and 12% better than the CORI, RankS, and SHiRE algorithms, respectively. Similarly, in terms of average latency, the HSS algorithm performs 14.2%, 9.4%, and 8.2% better than the CORI, RankS, and SHiRE algorithms, respectively.
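
The contrast the abstract draws between exhaustive search (query every shard) and selective search (rank the shards, then query only the top few) can be sketched as follows. This is a minimal illustration only: the scoring rule (summed query-term frequency per shard) is an assumed stand-in and is not CORI, ReDDe, RankS, SHiRE, or the proposed HSS algorithm, and the function names, the `cutoff` parameter, and the toy shards are hypothetical.

```python
from collections import Counter

def build_shard_summary(docs):
    """Term-frequency summary of one shard (a stand-in for a resource description)."""
    summary = Counter()
    for doc in docs:
        summary.update(doc.lower().split())
    return summary

def rank_shards(shards, query):
    """Order shard names by summed query-term frequency, highest first."""
    terms = query.lower().split()
    scores = {}
    for name, docs in shards.items():
        summary = build_shard_summary(docs)
        scores[name] = sum(summary[t] for t in terms)
    # Ties are broken by shard name so the ordering is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))

def search(shards, query, selected):
    """Return (shard, document) pairs sharing at least one term with the query."""
    terms = set(query.lower().split())
    hits = []
    for name in selected:
        for doc in shards[name]:
            if terms & set(doc.lower().split()):
                hits.append((name, doc))
    return hits

def exhaustive_search(shards, query):
    """Query every shard."""
    return search(shards, query, sorted(shards))

def selective_search(shards, query, cutoff):
    """Query only the `cutoff` highest-ranked shards."""
    return search(shards, query, rank_shards(shards, query)[:cutoff])

# Toy shards (hypothetical data).
shards = {
    "s1": ["frequent pattern mining", "pattern growth"],
    "s2": ["weather forecast", "pattern of rainfall"],
    "s3": ["stock prices", "market news"],
}
ranking = rank_shards(shards, "pattern mining")
all_hits = exhaustive_search(shards, "pattern mining")
top_hits = selective_search(shards, "pattern mining", cutoff=1)
```

On this toy data, searching only the top-ranked shard touches one shard instead of three but misses the matching document in `s2`. That is the efficiency versus effectiveness trade-off that shard ranking and cutoff estimation are meant to balance.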

Abstract: In the contemporary world of global economy real-life data is
