This research outlines the process of building a sample frame of US SMEs. The method starts with a list of patenting organizations and defines the boundaries of the population and subsequent frame using free to low-cost data sources, including search engines and websites. Generating high-quality data is of key importance throughout the process of building the frame and subsequent data collection; at the same time, there is too much data to curate by hand. Consequently, we turn to machine learning and other computational methods to apply a number of data matching, filtering, and cleaning routines. The results show that it is possible to generate a sample frame of innovative SMEs with reasonable accuracy for use in subsequent research: Our method provides data for 79% of the frame. We discuss implications for future work for researchers and NSIs alike and contend that the challenges associated with big data collections require not only new skillsets but also a new mode of collaboration.