Abstract - In this paper, we first identify semantic heterogeneities that, when not resolved, often cause serious data quality problems. We discuss the especially challenging problems of temporal and aggregational ontological heterogeneity, which concerns how complex entities and their relationships are aggregated and reinterpreted over time. Then we illustrate how the COntext INterchange (COIN) technology can be used to capture data semantics and reconcile semantic heterogeneities in a scalable manner, thereby improving data quality.

Index Terms - Data Semantics, Semantic Heterogeneity, Aggregation, Temporal, Ontology, Context.
I. INTRODUCTION

In our research, we have discovered that many problems arise due to confusion regarding data semantics. To illustrate how complex this can become, consider Fig. 1. These data summarize the P/E ratio for DaimlerChrysler obtained from four different financial information sources, all obtained on the same day within minutes of each other. Note that the four sources gave radically different values for the P/E ratio. They are all correct! The issue is, what do you really mean by "P/E ratio"?¹ The answer lies in the multiple interpretations and uses of the term "P/E ratio" in financial circles. The earnings are for the entire year in some sources but in one source are only for the last quarter. Even when earnings are for a full year, are they:
- the last 12 months?
- the last calendar year?
- the last fiscal year? or
- the last three historical quarters and the estimated current quarter (a popular usage)?

Such information, which we call context, is often not explicitly captured in a form that can be used by the query answering system to reconcile semantic differences in data from different sources.

Serious consequences can result from not being aware of the differences in contexts and data semantics. Consider a financial trader who used DBC to get P/E ratio information yesterday and got 19.19. Today he used Bloomberg and got 5.57 (a low P/E usually indicates a good bargain). Thinking that something wonderful had happened, he might decide to buy many shares of DaimlerChrysler today. In fact, nothing had actually changed except the source that he used. It would be natural for this trader (after possibly losing a significant amount of money due to this decision) to feel that he had encountered a data quality problem. We would argue that what appeared to be a data quality problem is actually a data misinterpretation problem.
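The effect of these differing earnings definitions can be illustrated with a minimal sketch. All figures below (the share price, the quarterly earnings per share, and the estimate) are hypothetical and chosen only for easy arithmetic; they are not taken from Fig. 1 or from any real source, and the context names are illustrative labels rather than part of COIN.

```python
# Hypothetical inputs: one share price and per-quarter earnings per share (EPS).
PRICE = 40.0
# Four most recent completed quarters, oldest first (hypothetical values).
HISTORICAL_QUARTERLY_EPS = [1.0, 2.0, 1.5, 0.5]
# Hypothetical estimate for the quarter currently in progress.
ESTIMATED_CURRENT_QUARTER_EPS = 4.0

# Illustrative labels for three of the interpretations discussed in the text.
CONTEXTS = (
    "last_12_months",
    "last_quarter",
    "three_historical_plus_estimate",
)


def earnings_for(context: str) -> float:
    """Return the 'E' in P/E under the given (hypothetical) source context."""
    if context == "last_12_months":
        return sum(HISTORICAL_QUARTERLY_EPS)
    if context == "last_quarter":
        return HISTORICAL_QUARTERLY_EPS[-1]
    if context == "three_historical_plus_estimate":
        return sum(HISTORICAL_QUARTERLY_EPS[-3:]) + ESTIMATED_CURRENT_QUARTER_EPS
    raise ValueError(f"unknown context: {context}")


def pe_ratio(context: str) -> float:
    """Compute price divided by earnings as defined by the context."""
    return PRICE / earnings_for(context)


# Same company, same price, same moment: only the meaning of "earnings" differs.
pe_by_context = {context: pe_ratio(context) for context in CONTEXTS}

for context, pe in pe_by_context.items():
    print(f"{context}: P/E = {pe:.2f}")
```

Under these assumed inputs, the same price yields three different ratios, each correct with respect to its own definition of earnings. A receiver who does not know which definition a source uses has no sound basis for comparing the values.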
The data source did not have any "error"; the data that it provided was exactly the data that it intended to provide. It just did not have the meaning that the receiver expected. In other words, the issue is not what is right or wrong; it is about how data in one context can be used in a different context.

Before going any further, it should be noted that if all sources and all receivers of data always had the exact same meanings, this problem would not occur. This is a desirable goal, one frequently sought through

¹ Some of these sites even provide a glossary which give...