Abstract. We describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabyte scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; design of a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; indexing of multi-attribute, multi-dimensional astronomical databases (beyond spatial indexing) for rapid querying of petabyte databases; and more.

Keywords: catalogs - surveys - methods: data analysis - astronomical data bases: miscellaneous
PACS: 95.80.+p, 95.75.Pq
DATA-INTENSIVE ASTRONOMY AND THE LSST SKY SURVEY

The development of models to describe and understand scientific phenomena has historically proceeded at a pace driven by new data. The more we know, the more we are driven to tweak or to revolutionize our models, thereby advancing our scientific understanding. This linkage between data-driven modeling and discovery has entered a new paradigm [1]. The acquisition of scientific data in all disciplines is now accelerating and causing a nearly insurmountable data avalanche [2]. In astronomy in particular, rapid advances in three technology areas (telescopes, detectors, and computation) have continued unabated [3], and all of these advances lead to more and more data [4]. With this accelerated advance in data generation capabilities, we will require novel, increasingly automated, and increasingly effective scientific knowledge discovery systems [5].

Astronomers have been doing data mining for centuries: "the data are mine, and you can't have them!". Seriously, astronomers are trained as data miners, because we are trained to: (a) characterize the known (i.e., unsupervised learning, clustering); (b) assign the new (i.e., supervised learning, classification); and (c) discover the unknown (i.e., semi-supervised learning, outlier detection) [6,7,8]. These skills are more critical than ever since astronomy is now a data-intensive science, and it will become even more data-intensive in the coming decade [4,9,10]. New surveys may produce from hundreds of terabytes (TB) up to 100 petabytes (PB) or more, both in the image data archive and in the object catalogs (databases). Discovering the ensuing hidden wealth of new scientific knowledge will require more sophisticated algorithms and networks that discover, integrate, and learn from distributed petascale databases more effectively [11,12].