In recent years Dynamic Time Warping (DTW) has emerged as the distance measure of choice for virtually all time series data mining applications. For example, virtually all applications that process data from wearable devices use DTW as a core sub-routine. This is the result of significant progress in improving DTW’s efficiency, together with multiple empirical studies showing that DTW-based classifiers at least equal (and generally surpass) the accuracy of all their rivals across dozens of datasets. Thus far, most of the research has considered only the one-dimensional case, with practitioners generalizing to the multi-dimensional case in one of two ways, dependent or independent warping. In general, it appears the community believes either that the two ways are equivalent, or that the choice is irrelevant. In this work, we show that this is not the case. The two most commonly used multi-dimensional DTW methods can produce different classifications, and neither one dominates over the other. This seems to suggest that one should learn the best method for a particular application. However, we will show that this is not necessary; a simple, principled rule can be used on a case-by-case basis to predict which of the two methods we should trust at the time of classification. Our method allows us to ensure that classification results are at least as accurate as the better of the two rival methods, and, in many cases, our method is significantly more accurate. We demonstrate our ideas with the most extensive set of multi-dimensional time series classification experiments ever attempted.
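To make the two generalizations concrete, here is a minimal sketch, assuming the usual definitions: independent warping sums a separate one-dimensional DTW per dimension, while dependent warping forces a single warping path shared by all dimensions. The function names (`dtw`, `dtw_independent`, `dtw_dependent`), the squared-difference cost, the absence of a warping-window constraint, and the toy data are illustrative assumptions rather than details taken from the paper. The sketch does not implement the selection rule the abstract describes.

```python
import math


def dtw(a, b, dist):
    """Unconstrained O(n*m) DTW between sequences a and b.

    dist(x, y) is the pointwise cost; only two rows of the
    cumulative-cost matrix are kept in memory.
    """
    n, m = len(a), len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def dtw_independent(Q, C):
    """Independent warping: each dimension gets its own warping path."""
    sq = lambda x, y: (x - y) ** 2
    return sum(dtw(q_dim, c_dim, sq) for q_dim, c_dim in zip(Q, C))


def dtw_dependent(Q, C):
    """Dependent warping: one warping path shared by all dimensions."""
    q_points = list(zip(*Q))  # one vector per time step
    c_points = list(zip(*C))
    sq = lambda p, r: sum((x - y) ** 2 for x, y in zip(p, r))
    return dtw(q_points, c_points, sq)


# Toy two-dimensional series (assumed data): the peak is shifted in
# opposite directions in the two dimensions.
Q = [[0, 1, 0, 0], [0, 0, 1, 0]]
C = [[0, 0, 1, 0], [0, 1, 0, 0]]

print(dtw_independent(Q, C))  # 0.0
print(dtw_dependent(Q, C))    # 2.0
```

On this toy pair the independent distance is 0 because each dimension can be warped separately, whereas the dependent distance is 2 because no single path aligns both shifted peaks. This is the kind of disagreement that can flip a nearest-neighbor classification.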
The ability to make predictions about future events is at the heart of much of science, so it is not surprising that prediction has been a topic of great interest in the data mining community for the last decade. Most of the previous work has attempted to predict the future based on the current value of a stream. However, for many problems the actual values are irrelevant, whereas the shape of the current time series pattern may foretell the future. The handful of research efforts that consider this variant of the problem have met with limited success. In particular, it is now understood that most of these efforts allow the discovery of spurious rules. We believe rule discovery in real-valued time series has failed thus far because most efforts have more or less indiscriminately applied the ideas of symbolic stream rule discovery to real-valued rule discovery. In this work, we show why these ideas are not directly suitable for rule discovery in time series. Beyond our novel definitions/representations, which allow for meaningful and extendable specifications of rules, we further show novel algorithms that allow us to quickly discover high-quality rules in very large datasets that accurately predict the occurrence of future events.
In the last decade, Dynamic Time Warping (DTW) has emerged as the distance measure of choice for virtually all time series data mining applications. This is the result of significant progress in improving DTW's efficiency, and multiple empirical studies showing that DTW-based classifiers at least equal the accuracy of all their rivals across dozens of datasets. Thus far, most of the research has considered only the one-dimensional case, with practitioners generalizing to the multi-dimensional case in one of two ways. In general, it appears the community believes either that the two ways are equivalent, or that the choice is irrelevant. In this work, we show that this is not the case. The two most commonly used multi-dimensional DTW methods can produce different classifications, and neither one dominates over the other. This seems to suggest that one should learn the best method for a particular application. However, we will show that this is not necessary; a simple, principled rule can be used on a case-by-case basis to predict which of the two methods we should give credence to. Our method allows us to ensure that classification results are at least as accurate as the better of the two rival methods, and in many cases, our method is strictly more accurate. We demonstrate our ideas with the most extensive set of multi-dimensional time series classification experiments ever attempted.
Clustering is arguably the most important primitive for data mining, finding use as a subroutine in many higher-order algorithms. In recent years, the community has redirected its attention from the batch case to the online case. This need to support online clustering is engendered by the proliferation of cheap ubiquitous sensors that continuously monitor various aspects of our world, from heartbeats as we exercise to the number of mosquitoes visiting a well in a village in Ethiopia. In this work, we argue that current online clustering solutions leave room for improvement. To some degree they all have at least one of the following shortcomings: they are parameter-laden, defined only for certain distance functions, sensitive to outliers, and/or approximate. This last point requires clarification; in some sense almost all clustering algorithms are approximate. For example, in general, k-means only approximately optimizes its objective function. However, streaming versions of the k-means algorithm further approximate this approximation, potentially leading to very poor solutions. In this work, we introduce an algorithm that mitigates these flaws. It is parameter-lite, defined for any distance function, insensitive to outliers, and produces the same output as the batch version of the algorithm. We demonstrate the utility and effectiveness of our ideas with case studies in entomology, cardiology and biological audio processing.
The discovery of repeated structure, i.e. motifs/near-duplicates, is often the first step in exploratory data mining. As such, the last decade has seen extensive research efforts in motif discovery algorithms for text, DNA, time series, protein sequences, graphs, images, and video. Surprisingly, there has been less attention devoted to finding repeated patterns in audio sequences, in spite of their ubiquity in science and entertainment. While there is significant work for the special case of motifs in music, virtually all this work makes many assumptions about data (often to the point of being genre specific) and thus these algorithms do not generalize to audio sequences containing animal vocalizations, industrial processes, or a host of other domains that we may wish to explore. In this work we introduce a novel technique for finding audio motifs. Our method does not require any domain-specific tuning and is essentially parameter-free. We demonstrate our algorithm on very diverse domains, finding audio motifs in laboratory mice vocalizations, wild animal sounds, music, and human speech. Our experiments demonstrate that our ideas are effective in discovering objectively correct or subjectively plausible motifs. Moreover, we show our novel probabilistic early abandoning approach is efficient, being two to three orders of magnitude faster than brute-force search, and thus faster than real-time for most problems.
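For readers unfamiliar with early abandoning, the following is a minimal sketch of the classic deterministic form of the idea: a distance computation stops as soon as its running total can no longer beat the best match found so far. This is not the probabilistic variant the abstract refers to, whose details are not given here. The function names, the squared Euclidean cost, and the toy data are assumptions made for illustration.

```python
import math


def early_abandon_sq_dist(a, b, best_so_far):
    """Squared Euclidean distance between equal-length sequences.

    Returns math.inf as soon as the running sum reaches best_so_far,
    since the candidate can no longer improve on the current best.
    """
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total >= best_so_far:
            return math.inf  # abandon: cannot beat the best match
    return total


def nearest_subsequence(query, series):
    """Slide query over series; return (start index, squared distance)."""
    m = len(query)
    best, best_pos = math.inf, -1
    for start in range(len(series) - m + 1):
        d = early_abandon_sq_dist(query, series[start:start + m], best)
        if d < best:
            best, best_pos = d, start
    return best_pos, best


# Toy data (assumed): the query occurs exactly at index 2.
print(nearest_subsequence([1, 2, 3], [0, 0, 1, 2, 3, 0]))  # (2, 0.0)
```

The savings grow as the best-so-far distance shrinks, because most candidates are then rejected after only a few terms.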