Interval versions of statistical techniques with applications to environmental analysis, bioinformatics, and privacy in statistical databases

Advance Trends in Soft Computing

2014

Self Cite

Abstract. In many real-life situations, e.g., in medicine, it is necessary to process data while preserving the patients' confidentiality. One of the most efficient methods of preserving privacy is to replace the exact values with intervals that contain these values. For example, instead of an exact age, a privacy-protected database only contains the information that the age is, e.g., between 10 and 20, or between 20 and 30, etc. Based on this data, it is important to compute correlation and covariance between different quantities. For privacy-protected data, different values from the intervals lead, in general, to different estimates for the desired statistical characteristic. Our objective is then to compute the range of possible values of these estimates. Algorithms for effectively computing such ranges have been developed for situations when intervals come from the original surveys, e.g., when a person fills in whether his or her age is between 10 and 20, between 20 and 30, etc. These intervals, however, do not always lead to an optimal privacy protection; it turns out that more complex, computer-generated "intervalization" can lead to better privacy under the same accuracy, or, alternatively, to more accurate estimates of statistical characteristics under the same privacy constraints. In this paper, we extend the existing efficient algorithms for computing covariance and correlation based on privacy-protected data to this more general case of interval data.

Formulation of the Problem

Need for processing data in statistical databases. Often, we collect data for the purpose of finding possible dependencies between different quantities. For example, we collect all possible information about the medical patients with the hope of finding out which factors affect different illnesses and which factors affect the success of different cures. The resulting collection of records $r_i = (r_{i1}, \ldots, r_{ip})$, $1 \le i \le n$, is known as a statistical database since typically, statistical methods are used to look for possible dependencies; see, e.g., [7]. These statistical methods are usually based on computing statistical characteristics such as mean
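To make "the range of possible values" concrete, here is a minimal Python sketch that finds the exact range of the covariance for a handful of interval-valued records. It is not the efficient algorithm the abstract refers to: it is a brute-force enumeration that takes exponential time and is only usable for very small n. It relies on the fact that the covariance is linear in each individual value when the others are fixed, so its minimum and maximum over a box of intervals are attained at combinations of interval endpoints. The function names and the use of the population covariance (division by n) are assumptions made for illustration.

```python
from itertools import product
from math import inf


def covariance(xs, ys):
    """Population covariance (divides by n; an assumption of this sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def covariance_range(x_intervals, y_intervals):
    """Exact range of the covariance over all values inside the intervals.

    Each argument is a list of (low, high) pairs. The covariance is linear
    in each single value, so its extremes over the box occur at endpoint
    combinations; we simply try all of them (2**(2n) cases).
    """
    lo, hi = inf, -inf
    for xs in product(*x_intervals):
        for ys in product(*y_intervals):
            c = covariance(xs, ys)
            lo = min(lo, c)
            hi = max(hi, c)
    return lo, hi


if __name__ == "__main__":
    # Hypothetical privacy-protected records: age brackets and a second quantity.
    ages = [(10, 20), (20, 30), (30, 40)]
    other = [(1, 2), (2, 3), (3, 4)]
    print(covariance_range(ages, other))
```

With exact (zero-width) intervals the two endpoints of the returned range coincide with the ordinary covariance, which is a quick sanity check on the enumeration.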

Section: Formulation Of the Problem mentioning

confidence: 99%

Computing Covariance and Correlation in Optimally Privacy-Protected Statistical Databases: Feasible Algorithms

Day

Jalal-Kamali

Advance Trends in Soft Computing

2014

Self Cite

“…Interval computations - in particular, interval computations of statistical characteristics - have many applications, in particular, engineering applications; see, e.g., [1,4,5,7,8,9,10,11,13].…”

Section: Need To Take Into Account Interval Uncertainty mentioning

confidence: 99%

Estimating correlation under interval uncertainty

Jalal-Kamali

Mechanical Systems and Signal Processing

2013

Self Cite

In many engineering situations, we are interested in finding the correlation $\rho$ between different quantities $x$ and $y$ based on the values $x_i$ and $y_i$ of these quantities measured in different situations $i$. Measurements are never absolutely accurate; it is therefore necessary to take this inaccuracy into account when estimating the correlation $\rho$. Sometimes, we know the probabilities of different values of measurement errors, but in many cases, we only know the upper bounds $\Delta x_i$ and $\Delta y_i$ on the corresponding measurement errors. In such situations, after we get the measurement results $\tilde{x}_i$ and $\tilde{y}_i$, the only information that we have about the actual (unknown) values $x_i$ and $y_i$ is that they belong to the corresponding intervals $[\tilde{x}_i - \Delta x_i, \tilde{x}_i + \Delta x_i]$ and $[\tilde{y}_i - \Delta y_i, \tilde{y}_i + \Delta y_i]$. Different values from these intervals lead, in general, to different values of the correlation $\rho$. It is therefore desirable to find the range $[\underline{\rho}, \overline{\rho}]$ of possible values of the correlation when $x_i$ and $y_i$ take values from the corresponding intervals. In general, the problem of computing this range is NP-hard. In this paper, we provide a feasible (= polynomial-time) algorithm for computing at least one of the endpoints of this interval: for computing $\overline{\rho}$ when $\overline{\rho} > 0$ and for computing $\underline{\rho}$ when $\underline{\rho} < 0$.
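As an illustration of how the correlation varies over the intervals, the following Python sketch draws random values from each interval and records the smallest and largest correlation seen. This is a plain Monte Carlo probe, not the polynomial-time algorithm described in the abstract, and it only yields an inner estimate: the true range contains the sampled one and may be wider. The function names, the sample count, and the fixed seed are assumptions chosen for illustration.

```python
import random
from math import inf, sqrt


def correlation(xs, ys):
    """Pearson correlation; returns None if either variance is zero."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def sampled_correlation_range(x_mid, x_delta, y_mid, y_delta,
                              samples=2000, seed=0):
    """Inner estimate of the correlation range by random sampling.

    x_mid[i] and y_mid[i] are measurement results; x_delta[i] and
    y_delta[i] are the bounds on the measurement errors.
    """
    rng = random.Random(seed)
    lo, hi = inf, -inf
    # Include the measured values themselves as the first candidate.
    candidates = [(list(x_mid), list(y_mid))]
    for _ in range(samples):
        xs = [rng.uniform(m - d, m + d) for m, d in zip(x_mid, x_delta)]
        ys = [rng.uniform(m - d, m + d) for m, d in zip(y_mid, y_delta)]
        candidates.append((xs, ys))
    for xs, ys in candidates:
        r = correlation(xs, ys)
        if r is not None:
            lo = min(lo, r)
            hi = max(hi, r)
    return lo, hi


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [1.1, 1.9, 3.2, 3.9]
    print(sampled_correlation_range(x, [0.1] * 4, y, [0.1] * 4))
```

Because sampling can miss the extreme combinations, the output should be read as "the range is at least this wide", never as a guaranteed enclosure.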

“…Moreover, if the input intervals do not have a common non-empty intersection - e.g., if there is a value C for which every collection of C intervals has an empty intersection - then it is possible to have a feasible algorithm for computing the range of the variance [2][3][4][10][11][12].…”

Section: When We Can Expect the Variance To Be Small By Definition mentioning

confidence: 99%

No-Free-Lunch Result for Interval and Fuzzy Computing: When Bounds Are Unusually Good, Their Computation Is Unusually Slow

Ceberio

Advances in Soft Computing

2011

Self Cite

Abstract. On several examples from interval and fuzzy computations and from related areas, we show that when the results of data processing are unusually good, their computation is unusually complex. This makes us think that there should be an analog of Heisenberg's uncertainty principle well known in quantum mechanics: when we have an unusually beneficial situation in terms of results, it is not as perfect in terms of computations leading to these results. In short, nothing is perfect.

First Case Study: Interval Computations

Need for data processing. In science and engineering, we want to understand how the world works, we want to predict the results of the world processes, and we want to design a way to control and change these processes so that the results will be most beneficial for humankind.

For example, in meteorology, we want to know the weather now, we want to predict the future weather, and, if, e.g., floods are expected, we want to develop strategies that would help us minimize the flood damage.

Usually, we know the equations that describe how these systems change in time. Based on these equations, engineers and scientists have developed algorithms that enable them to predict the values of the desired quantities and to find the best values of the control parameters. As input, these algorithms take the current and past values of the corresponding quantities.

For example, if we want to predict the trajectory of a spaceship, we need to find its current location and velocity and the current positions of the Earth and of the celestial bodies; then we can use Newton's equations to find the future locations of the spaceship.

In many situations, e.g., in weather prediction, the corresponding computations require a large amount of input data and a large number of computation steps. Such computations (data processing) are the main reason why computers were invented in the first place: to be able to perform these computations in reasonable time.