Efficient jaccard-based diversity analysis of large document collections

Fan, Daidu; Siersdorfer, Stefan; Zerr, Sergej

doi:10.1145/2396761.2398445

Cited by 12 publications

(7 citation statements)

References 29 publications


“…The kappa can range between −1 and +1, and the kappa result is interpreted as follows: if the values are ≤0, there is no agreement; between 0.01 and 0.20, there is minor agreement; between 0.21 and 0.40, there is fair agreement; between 0.41 and 0.60, there is moderate agreement; between 0.61 and 0.80, there is substantial agreement; and from 0.81 to 1.00, there is nearly perfect agreement [29]. The Jaccard index measures how much the two datasets have in common [30]. The Jaccard coefficient is given in the following equation: …”

Section: Results (mentioning)

confidence: 99%

Classification of Alzheimer’s Disease and Mild Cognitive Impairment Based on Cortical and Subcortical Features from MRI T1 Brain Images Utilizing Four Different Types of Datasets

Toshkhujaev, Lee, Choi, et al. 2020

Journal of Healthcare Engineering


Alzheimer’s disease (AD) is one of the most common neurodegenerative illnesses (dementia) among the elderly. Recently, researchers have developed a new method for the instinctive analysis of AD based on machine learning and its subfield, deep learning. Recent state-of-the-art techniques consider multimodal diagnosis, which has been shown to achieve high accuracy compared to a unimodal prognosis. Furthermore, many studies have used structural magnetic resonance imaging (MRI) to measure brain volumes and the volume of subregions, as well as to search for diffuse changes in white/gray matter in the brain. In this study, T1-weighted structural MRI was used for the early classification of AD. MRI results in high-intensity visible features, making preprocessing and segmentation easy. To use this image modality, we acquired four types of datasets from each dataset’s server. In this work, we downloaded 326 subjects from the National Research Center for Dementia homepage, 123 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) homepage, 121 subjects from the Alzheimer’s Disease Repository Without Borders homepage, and 131 subjects from the National Alzheimer’s Coordinating Center homepage. In our experiment, we used the multiatlas label propagation with expectation–maximization-based refinement segmentation method. We segmented the images into 138 anatomical morphometry images (in which 40 features belonged to subcortical volumes and the remaining 98 features belonged to cortical thickness). The entire dataset was split into a 70:30 (training and testing) ratio before classifying the data. A principal component analysis was used for dimensionality reduction. Then, the support vector machine radial basis function classifier was used for classification between two groups: AD versus healthy control (HC), and early mild cognitive impairment (EMCI) versus late mild cognitive impairment (LMCI). The proposed method performed very well for all four types of dataset. For instance, for the AD versus HC group, the classifier achieved an area under curve (AUC) of more than 89% for each dataset. For the EMCI versus LMCI group, the classifier achieved an AUC of more than 80% for every dataset. Moreover, we also calculated Cohen kappa and Jaccard index statistical values for all datasets to evaluate the classification reliability. Finally, we compared our results with those of recently published state-of-the-art methods.
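The citation statement above refers to Cohen's kappa and the Jaccard index as reliability measures. As a minimal sketch only (not the citing paper's code), the two statistics can be computed as follows, assuming the standard set-based definition of the Jaccard index and the agreement bands quoted in the statement. The function names, the band labels, and the empty-set convention are illustrative assumptions.

```python
def jaccard_index(a, b):
    """Standard Jaccard index |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of positions where both labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected chance agreement from the marginal label frequencies.
    p_e = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # degenerate case: only one label is ever used
    return (p_o - p_e) / (1 - p_e)


def interpret_kappa(kappa):
    """Map kappa to the agreement bands listed in the quoted statement."""
    if kappa <= 0:
        return "no agreement"
    if kappa <= 0.20:
        return "minor agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "nearly perfect agreement"
```

For example, `cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])` has observed agreement 0.75 and chance agreement 0.5, which falls in the moderate band.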


“…The above definition is used to compute the pairwise diversity of a set of paths P. This in general requires O(|P|²) computations. To avoid pairwise computations, one can use min-wise hashing [32].…”

Section: Path Diversity (mentioning)

confidence: 99%

Building relatedness explanations from knowledge graphs

Pirrò 2019


Knowledge graphs (KGs) are a key ingredient to complement search results, discover entities and their relations and support several knowledge discovery tasks. We face the problem of building relatedness explanations, that is, graphs that can explain how a pair of entities is related in a KG. Explanations can be used in a variety of tasks; from exploratory search to query answering. We formalize the notion of explanation and present two algorithms. The first, E4D (Explanations from Data), assembles explanations starting from all paths interlinking the source and target entity in the data. The second algorithm E4S (Explanations from Schema) builds explanations focused on a specific relatedness perspective expressed by providing a predicate. E4S first generates candidate explanation patterns at the level of schema; then, it assembles explanations by proceeding to their verification in the data. Given a set of paths, found by E4D or E4S, we describe different criteria to build explanations based on information-theory, diversity and their combination. As a concrete use-case of relatedness explanations, we introduce relatedness-based KG querying, which revisits the query-by-example paradigm from the perspective of relatedness explanations. We implemented all machineries in the RECAP tool, which is based on RDF and SPARQL. We discuss an evaluation of the explanation building algorithms and a comparison of RECAP with related systems on real-world data.
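The citation statement for this paper mentions min-wise hashing as a way around pairwise Jaccard computations. The sketch below shows only the basic MinHash estimator for a single pair of sets, not the diversity-estimation procedure of either paper. The helper names, the use of `hashlib.blake2b` as a seeded hash family, and the signature length of 256 are assumptions made for illustration.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """For each seeded hash function, keep the minimum hash over the set."""
    items = set(items)
    if not items:
        raise ValueError("MinHash signature is undefined for an empty set")
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Each signature is computed once per set, so comparing two sets afterwards costs time proportional to the signature length rather than to the set sizes. The estimate carries sampling error that shrinks as `num_hashes` grows.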


“…To express the similarity between graph nodes, a meaningful similarity measure is required. One such measure is the Jaccard similarity, which has been applied successfully in areas such as duplicate detection [6,19], link prediction [15], similarity evaluation in wikipedia [4], triangle counting in massive graphs [5] and diversity analysis in documents [9].…”

Section: Related Work (mentioning)

confidence: 99%

“…In general, similarity is expressed by a function V ×V → [0,1], where a value close to 0 means low similarity and a value close to 1 denotes a high similarity between a node pair. In this work, we express similarity by means of the Jaccard similarity coefficient, which enjoys a widespread use in diverse areas such as link prediction and recommendation [15], data cleaning [3], near duplicate detection [19], diversity analysis [9], whereas it is one of the most important measures for set similarity. We associate with each node u the set of its immediate neighbors N (u) (u inclusive).…”

Section: Introduction (mentioning)

confidence: 99%

Continuous Similarity Computation over Streaming Graphs

Valari, Papadopoulos 2013

Advanced Information Systems Engineering


Abstract. Large network analysis is a very important topic in data mining. A significant body of work in the area studies the problem of node similarity. One way to express node similarity is to associate with each node the set of 1-hop neighbors and compute the Jaccard similarity between these sets. This information can be used subsequently for more complex operations like link prediction, clustering or dense subgraph discovery. In this work, we study algorithms to monitor the result of a similarity join between nodes continuously, assuming a sliding window accommodating graph edges. Since the arrival of a new edge or the expiration of an existing one may change the similarity between several node pairs, the challenge is to maintain the similarity join result as efficiently as possible. Our theoretical study is validated by a thorough experimental evaluation, based on real-world as well as synthetically generated graphs, demonstrating the superiority of the proposed technique in comparison to baseline approaches.
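The node-similarity definition in the statements above (each node u is associated with its immediate neighbors N(u), with u itself included, and node pairs are compared by Jaccard similarity) can be sketched for a static, undirected edge list as follows. This is an illustrative sketch only: it does not implement the sliding-window maintenance studied in the citing paper, and the function names are hypothetical.

```python
from collections import defaultdict


def closed_neighborhoods(edges):
    """Map each node u to N(u): its 1-hop neighbors plus u itself."""
    nbrs = defaultdict(set)
    for u, v in edges:
        # Undirected edge: each endpoint gains the other, and itself.
        nbrs[u].update((u, v))
        nbrs[v].update((u, v))
    return dict(nbrs)


def node_jaccard(nbrs, u, v):
    """Jaccard similarity of the closed neighborhoods of u and v."""
    # A node without edges is treated as having only itself as neighborhood.
    a = nbrs.get(u, {u})
    b = nbrs.get(v, {v})
    return len(a & b) / len(a | b)
```

With edges a-b, b-c, a-c and c-d, nodes a and b share the closed neighborhood {a, b, c} and so have similarity 1, while a and d share only c out of four nodes in the union.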

