In a modernized statistical production process, non-traditional data sources such as ‘big’ data are increasingly considered as either the main or a supplementary source for official statistics. Their use, however, has brought new challenges: messy datasets, duplicate entries, missing information, and misspellings, to name a few. In many cases there is also no unique identifier that can unambiguously identify a record for the purpose of data integration. These challenges can be compounded by a non-English alphabet such as the Persian/Farsi alphabet used in Iran. This paper elaborates two innovative methods to address such data challenges. First, probabilistic record linkage using an ASCII coding system deals simultaneously with the data challenges and the lack of a unique identifier. Second, text mining addresses categorization and grouping systems that are not suitable for statistical purposes. Both approaches can improve the accuracy and coherence of datasets and, in data integration, result in higher-quality datasets. Results of research undertaken by the authors show that the innovations lead to more effective data integration and improve the quality of the resulting official statistics. The innovations have wide applicability, especially in countries with non-English alphabets.
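To make the idea of linking records on character codes concrete, the following is a minimal sketch and not the authors' implementation. It assumes Unicode code points stand in for the coding system mentioned above, that a few Arabic letter variants are folded to their Persian forms before comparison, and that field agreement is scored with Fellegi-Sunter style log-weights. The names `CHAR_MAP`, `to_codes`, `field_agrees`, and `match_weight`, the 0.85 threshold, and the m/u probabilities are all hypothetical choices made for illustration.

```python
import math
import unicodedata
from difflib import SequenceMatcher

# Hypothetical normalization table (an assumption, not from the paper):
# fold Arabic yeh/kaf to their Persian forms and drop the zero-width non-joiner.
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9", "\u200C": None})

def to_codes(text):
    """Normalize a string and return its characters as integer code points."""
    cleaned = unicodedata.normalize("NFC", text).translate(CHAR_MAP)
    cleaned = "".join(cleaned.split())  # remove all whitespace
    return [ord(ch) for ch in cleaned]

def field_agrees(a, b, threshold=0.85):
    """Treat two field values as agreeing if their code sequences are similar."""
    return SequenceMatcher(None, to_codes(a), to_codes(b)).ratio() >= threshold

def match_weight(rec_a, rec_b, params):
    """Sum log2 agreement/disagreement weights over the compared fields.

    params maps field name -> (m, u), where m is the assumed probability of
    agreement for true matches and u the same probability for non-matches.
    """
    total = 0.0
    for field, (m, u) in params.items():
        if field_agrees(rec_a[field], rec_b[field]):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

# Illustrative records: the same name typed with Arabic yeh vs Persian yeh.
rec_a = {"name": "\u0639\u0644\u064A", "city": "\u062A\u0647\u0631\u0627\u0646"}
rec_b = {"name": "\u0639\u0644\u06CC", "city": "\u062A\u0647\u0631\u0627\u0646"}
params = {"name": (0.9, 0.1), "city": (0.8, 0.2)}  # assumed m/u values
weight = match_weight(rec_a, rec_b, params)
```

In a scheme like this, a pair would be declared a link when `weight` exceeds an upper cut-off and a non-link when it falls below a lower one, with the pairs in between sent for clerical review. In practice the m and u probabilities are estimated from the data rather than fixed by hand.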
In this paper, monitoring of simple linear profiles is investigated in the presence of unequal variances (heteroscedasticity), specifically generalized autoregressive conditional heteroscedasticity (GARCH). Under this condition, applying the common methods while ignoring the heteroscedasticity leads to faulty interpretations. We consider a simple linear profile and assume that there is a GARCH(1,1) model within the profiles. We focus in particular on Phase II monitoring of simple linear regression. We study the effect of GARCH, briefly the GARCH effect, on the average run length criterion. As remedial measures, the weighted least squares method is used to estimate the regression parameters, and heteroscedasticity‐consistent approaches are used to estimate the covariance matrix of the regression parameters, in order to account for the GARCH effect. Two control chart methods, namely T² and exponentially weighted moving average 3 (EWMA‐3), are discussed for monitoring the simple linear profiles. Their performance is evaluated using the average run length criterion. Finally, a real case from industry is studied.
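The setting can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' procedure: within one profile the errors follow a GARCH(1,1) recursion h_j = omega + alpha * e_{j-1}^2 + beta * h_{j-1}, and the weighted least squares fit uses weights 1/h_j, treating the conditional variances as known. In practice they would have to be estimated. The function names and all parameter values are hypothetical.

```python
import random

def garch_errors(n, omega, alpha, beta, rng):
    """Simulate n GARCH(1,1) errors; returns (errors, conditional variances)."""
    h = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    errors, variances = [], []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0) * h ** 0.5
        errors.append(e)
        variances.append(h)
        h = omega + alpha * e * e + beta * h  # variance for the next point
    return errors, variances

def wls_line(x, y, w):
    """Weighted least squares estimates (intercept, slope) for y = b0 + b1*x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# One simulated profile with assumed in-control parameters b0 = 3, b1 = 2.
rng = random.Random(0)
x = [0.5 * j for j in range(1, 21)]
errors, variances = garch_errors(len(x), omega=0.1, alpha=0.1, beta=0.8, rng=rng)
y = [3.0 + 2.0 * xj + ej for xj, ej in zip(x, errors)]
b0_hat, b1_hat = wls_line(x, y, [1.0 / h for h in variances])
```

Repeating this over many simulated profiles and feeding the estimates to a control chart is one way to approximate the average run length by Monte Carlo. Setting all weights to 1 gives the ordinary least squares fit for comparison.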