The analysis of multiple datasets on users’ behaviors opens interesting information fusion possibilities and, at the same time, creates a potential for re-identification and de-anonymization of users’ data. On the one hand, approaches of this kind can breach users’ privacy despite anonymization. On the other hand, combining different datasets is a key enabler for advanced context-awareness, in that information from multiple sources can complement and enrich each other. In this work we analyze different anonymized mobility datasets with the aim of highlighting re-identification and information fusion possibilities. In particular, we focus on call detail record (CDR) datasets released by mobile telecom operators and datasets comprising geo-localized messages released by social network sites. Results show that: (1) in line with previous findings, few (about 4) data points are enough to uniquely pinpoint the majority (90 %) of the users; (2) more than 20 % of CDR users have a single social network user exhibiting a number of matching data points, and we speculate that these two users might be the same person; (3) from an estimate of the probability of two users being the same person given the number of data points they have in common, for 3 % of the social network users we can find a CDR user very likely (>90 % probability) to be the same person.
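To make the notion of "few data points uniquely pinpoint a user" concrete, the following is a minimal sketch of how such a uniqueness rate could be estimated. It is not the authors' procedure: the function name `unicity`, the trace format (a dict mapping a user id to a set of `(cell_id, hour)` points) and the random sampling scheme are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def unicity(traces, k, trials=100, seed=0):
    """Estimate the fraction of sampled users who are the only match
    for k points drawn at random from their own trace.

    traces: dict of user id -> set of (cell_id, hour) points.
            This format is hypothetical, chosen for the sketch.
    k:      number of points revealed per trial (k >= 1).
    """
    rng = random.Random(seed)

    # Inverted index: point -> set of users whose trace contains it.
    index = defaultdict(set)
    for user, points in traces.items():
        for point in points:
            index[point].add(user)

    # Only users with at least k points can be sampled.
    eligible = [u for u, pts in traces.items() if len(pts) >= k]
    if not eligible:
        return 0.0

    unique = 0
    for _ in range(trials):
        user = rng.choice(eligible)
        # sorted() turns the set into a sequence, as random.sample requires.
        sample = rng.sample(sorted(traces[user]), k)
        # Users whose traces contain every sampled point.
        candidates = set.intersection(*(index[p] for p in sample))
        if candidates == {user}:
            unique += 1
    return unique / trials
```

Running this for increasing values of `k` on a real dataset would give a curve of uniqueness against the number of known points; the figures quoted in the abstract (about 4 points, 90 % of users) come from the paper, not from this sketch.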
Accurately forecasting how crowds of people are distributed in urban areas during daily activities is of key importance for the smart city vision and related applications. In this work we forecast the crowd density and distribution in an urban area by analyzing an aggregated mobile phone dataset. By comparing the forecasting performance of statistical and deep learning methods on the aggregated mobile data we show that each class of methods has its advantages and disadvantages depending on the forecasting scenario. However, for our time-series forecasting problem, deep learning methods are preferable when it comes to simplicity and immediacy of use, since they do not require a time-consuming model selection for each different cell. Deep learning approaches are also appropriate when aiming to reduce the maximum forecasting error. Statistical methods instead show their superiority in providing more precise forecasting results, but they require data domain knowledge and computationally expensive techniques in order to select the best parameters.
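The distinction drawn above between more precise forecasts and a lower maximum forecasting error can be illustrated with two simple error measures. This is a sketch under assumptions: the abstract does not name the metrics used, so mean absolute error and maximum absolute error are stand-ins, and the numbers below are invented purely for illustration.

```python
def mean_absolute_error(actual, forecast):
    """Average size of the forecasting error over all time steps."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def max_absolute_error(actual, forecast):
    """Largest single forecasting error over all time steps."""
    return max(abs(a - f) for a, f in zip(actual, forecast))


# Illustrative (made-up) crowd counts for one cell over four time steps.
actual = [100, 120, 90, 110]
forecast_a = [100, 120, 90, 150]   # exact except for one large miss
forecast_b = [115, 105, 105, 95]   # consistently off by a moderate amount

print(mean_absolute_error(actual, forecast_a), max_absolute_error(actual, forecast_a))
print(mean_absolute_error(actual, forecast_b), max_absolute_error(actual, forecast_b))
```

Here `forecast_a` has the lower mean error (10.0 against 15.0) but the higher maximum error (40 against 15), which is the kind of trade-off that can make one class of methods preferable on average and another preferable in the worst case.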