Performance profiling tools are crucial for HPC specialists to identify performance bottlenecks in parallel codes at various levels of granularity (i.e., across nodes, ranks, and threads). Although numerous sophisticated profiling tools have been developed, achieving scalable performance introspection at large scale remains a challenge. This is particularly evident in efficiently writing profiles to disk at runtime and subsequently reading them back for post hoc analysis with constrained computing resources. In this paper, we present TinyProf, a performance introspection framework that tackles the I/O-related challenges of profiling performance data at scale. TinyProf's scalability stems from an optimized runtime built on three key components: (1) an efficient in-memory data structure that minimizes memory consumption and reduces communication overhead during parallel file I/O; (2) a customizable three-phase I/O scheme that generates I/O patterns capable of scaling to high core counts; and (3) a streamlined profile data format that guarantees minimal profile file sizes. Together, these techniques keep the profiler's overhead low (below 5%) even at high process counts. This low overhead makes it feasible to run the profiler with an application by default, whenever the application is running, enabling continuous performance introspection. We demonstrate the efficiency of our framework on large-scale parallel applications and perform a thorough evaluation against existing systems at up to 32k processes.
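To make the three-phase I/O scheme concrete, the following is a minimal sketch of a common "serialize, aggregate, then collectively write" pattern, assuming an MPI-based design; the aggregation group size, file name, and payload are hypothetical, and this is not TinyProf's actual implementation.

```cpp
// Sketch of a three-phase I/O pattern for scalable profile output.
// Group size G, file name, and payload contents are illustrative assumptions.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Phase 1: each rank serializes its profile into a compact byte buffer.
    // Here we fake a small per-rank payload; a real profiler would emit its
    // in-memory records in a compact binary format.
    std::vector<char> local(64 + rank % 16, static_cast<char>('A' + rank % 26));

    // Phase 2: gather buffers onto one aggregator per group of G ranks,
    // reducing the number of processes that touch the filesystem.
    const int G = 32;  // assumed aggregation group size
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / G, rank, &group);
    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    int mybytes = static_cast<int>(local.size());
    std::vector<int> counts(gsize), displs(gsize);
    MPI_Gather(&mybytes, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, group);

    std::vector<char> agg;
    if (grank == 0) {
        int total = 0;
        for (int i = 0; i < gsize; ++i) { displs[i] = total; total += counts[i]; }
        agg.resize(total);
    }
    MPI_Gatherv(local.data(), mybytes, MPI_BYTE,
                agg.data(), counts.data(), displs.data(), MPI_BYTE, 0, group);

    // Phase 3: aggregators compute disjoint file offsets via a prefix sum
    // and each issues one collective write into a single shared file.
    MPI_Comm writers;
    MPI_Comm_split(MPI_COMM_WORLD, grank == 0 ? 0 : MPI_UNDEFINED, rank, &writers);
    if (grank == 0) {
        long long mine = static_cast<long long>(agg.size()), end = 0;
        MPI_Scan(&mine, &end, 1, MPI_LONG_LONG, MPI_SUM, writers);
        long long offset = end - mine;  // exclusive prefix: start of my region

        MPI_File fh;
        MPI_File_open(writers, "profile.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, static_cast<MPI_Offset>(offset),
                              agg.data(), static_cast<int>(agg.size()),
                              MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Comm_free(&writers);
    }
    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}
```

The key design point this sketch illustrates is that only one rank per group acts as a filesystem client, so the number of concurrent writers grows with node count rather than core count, which is typically what limits I/O scalability at tens of thousands of processes.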