Our analytics challenge is to identify, characterize, and visualize anomalous subsets of large collections of network connection data. We use a combination of HPC resources, advanced algorithms, and visualization techniques. To identify the salient portions of the data effectively and efficiently, we rely on a multi-stage workflow that includes data acquisition, summarization (feature extraction), novelty detection, and classification. Once these subsets of interest have been identified and automatically characterized, we use a state-of-the-art high-dimensional query system to extract data subsets for interactive visualization. Our approach is equally useful for other large-data analysis problems where it is more practical to identify interesting subsets of the data for visualization than to render all data elements. By reducing the size of the rendering workload, we enable highly interactive and useful visualizations. As a result of this work, we were able to analyze six months' worth of data interactively, with response times two orders of magnitude shorter than with conventional methods.

Motivating Example: Figure 1 shows an example of thousands of network connections (white lines) that one of our automatic clustering methods identifies as being correlated. Each of the shaded regions delimits a hyperbox for subsequent high-dimensional queries. Since the entire dataset may represent billions of connections, it is critical to have good methods for defining and executing subset queries over multiple dimensions.

In the past year, on the order of 500 terabytes (TB) crossed the boundary between the Internet and the unclassified networks at LANL, NERSC, and LBNL. Data collection tools at each of these boundaries gather summary information on approximately 10 billion distinct connections, or 1 TB, per year. In addition, router-based information saved for every subnet internal to the LANL unclassified network totals 46 billion records and 2.5 TB per year, representing several petabytes of network traffic. Post-processing of this data for analysis and indexing can increase its size several times. In the past, this data could only be analyzed as a whole by large batch-processing jobs or in small segments, usually 6 to 24 hours at a time.

The raw data consists of summary information on each session: start time, duration, protocol, source and destination byte counts, packet counts, IP addresses, port numbers, and flags describing the completeness of the connection. Some of these fields are subsequently decomposed further (the octets of IP addresses, for example), and statistical properties are derived on a per-host basis.
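To make the shape of the multi-stage workflow concrete, the following is a minimal, runnable sketch of the four stages named above. Every stage body is a stand-in of our own choosing, since the text does not specify the algorithms at this level; only the composition of acquisition, summarization, novelty detection, and classification reflects the workflow.

```python
# Minimal sketch of the workflow: acquisition -> summarization (feature
# extraction) -> novelty detection -> classification. Each stage body is an
# illustrative stand-in, not the actual method used in the system.
import numpy as np

def acquire(raw_rows):
    # Stand-in for data acquisition: coerce raw rows to a numeric array of
    # (duration, source bytes) per connection.
    return np.asarray(raw_rows, dtype=float)

def summarize(records):
    # Stand-in for feature extraction: log-scale the heavy-tailed fields.
    return np.log1p(records)

def detect_novelty(features, k=1.2):
    # Stand-in novelty detector: flag records whose z-score exceeds k in any
    # feature dimension.
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    return np.abs(z).max(axis=1) > k

def classify(features, flags):
    # Stand-in classifier: label each flagged record by its dominant feature.
    names = ("long-duration", "high-volume")
    return [names[int(f.argmax())] for f in features[flags]]

rows = [[0.4, 120], [12.0, 50000], [0.9, 90], [900.0, 8]]
feats = summarize(acquire(rows))
print(classify(feats, detect_novelty(feats)))  # -> ['high-volume', 'long-duration']
```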
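Because the hyperbox regions drive the subsequent queries, it may help to see what such a query computes: the conjunction of a closed interval test on each queried dimension. The sketch below assumes a simple columnar layout in NumPy arrays; the field names and bounds are illustrative, and the actual system uses a specialized high-dimensional query system rather than the linear scan shown here.

```python
# Sketch of a hyperbox query: select the connection records whose values fall
# inside a closed interval on every queried dimension.
import numpy as np

def hyperbox_query(columns, box):
    # columns: dict mapping field name -> 1-D NumPy array (one value per record).
    # box: dict mapping field name -> (low, high) inclusive bounds.
    # Returns the indices of records lying inside the hyperbox.
    n = len(next(iter(columns.values())))
    mask = np.ones(n, dtype=bool)
    for field, (low, high) in box.items():
        col = columns[field]
        mask &= (col >= low) & (col <= high)  # AND of per-dimension range tests
    return np.nonzero(mask)[0]

# Example: connections to port 22 lasting under a second with small payloads.
columns = {
    "dst_port": np.array([22, 80, 22, 443]),
    "duration": np.array([0.4, 12.0, 0.9, 3.1]),
    "src_bytes": np.array([120, 50000, 90, 2048]),
}
box = {"dst_port": (22, 22), "duration": (0.0, 1.0), "src_bytes": (0, 200)}
print(hyperbox_query(columns, box))  # -> [0 2]
```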
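The decomposition step mentioned above, splitting fields such as IP addresses into their octets and deriving per-host statistics, might look like the following sketch. The record layout and the particular statistics (connection count, mean duration, total bytes sent) are assumptions made for illustration.

```python
# Sketch of the field-decomposition step: split dotted-quad IPv4 addresses
# into integer octets and derive simple per-host aggregates.
from collections import defaultdict

def ip_octets(addr):
    # Decompose "a.b.c.d" into the tuple (a, b, c, d) of integer octets.
    return tuple(int(o) for o in addr.split("."))

def per_host_stats(records):
    # records: iterable of dicts with 'src_ip', 'duration', 'src_bytes' keys.
    acc = defaultdict(lambda: [0, 0.0, 0])  # count, duration sum, byte sum
    for r in records:
        a = acc[r["src_ip"]]
        a[0] += 1
        a[1] += r["duration"]
        a[2] += r["src_bytes"]
    return {h: {"connections": c, "mean_duration": d / c, "total_bytes": b}
            for h, (c, d, b) in acc.items()}

records = [
    {"src_ip": "192.0.2.7", "duration": 0.4, "src_bytes": 120},
    {"src_ip": "192.0.2.7", "duration": 0.9, "src_bytes": 90},
]
print(ip_octets("192.0.2.7"))   # -> (192, 0, 2, 7)
print(per_host_stats(records))  # -> one aggregate entry keyed by source host
```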