International audienceAbstract Text mining approaches are commonly used to discover relevant information and relationships in hugeamounts of text data. The term data mining refers to methods for analyzing data with the objective of finding patternsthat aggregate the main properties of the data. The merger between the data mining approaches and on-line analyticalprocessing (OLAP) tools allows us to refine techniques used in textual aggregation. In this paper, we propose a novel aggregation function for textual data based on the discovery of frequent closed patterns in a generated documents/keywords matrix. Our contribution aims at using a data mining technique, mainly a closed pattern mining algorithm, to aggregate keywords. An experimental study on areal corpus of more than 700 scientific papers collected on Microsoft Academic Search shows that the proposed algorithm largely outperforms four state-of-the-art textual aggregation methods in terms of recall, precision, F-measure and runtime
Data warehousing and On-Line Analytical Processing (OLAP) are essential elements to decision support. In the case of textual data, decision support requires new tools, mainly textual aggregation functions, for better and faster high level analysis and decision making. Such tools will provide textual measures to users who wish to analyse documents online. In this paper, we propose a new aggregation function for textual data in an OLAP context based on the K-means method. This approach will highlight aggregates semantically richer than those provided by classical OLAP operators. The distance used in K-means is replaced by the Google similarity distance which takes into account the semantic similarity of keywords for their aggregation. The performance of our approach is analyzed and compared to other methods such as Topkeywords, TOPIC, TuBE and BienCube. The experimental study shows that our approach achieves better performances in terms of recall, precision,F-measure complexity and runtime.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.