Use-after-free vulnerabilities have plagued software written in low-level languages, such as C and C++, becoming one of the most frequent classes of exploited software bugs. Attackers identify code paths where data is manually freed by the programmer, but later incorrectly reused, and take advantage by reallocating the data to themselves. They then alter the data behind the program's back, using the erroneous reuse to gain control of the application and, potentially, the system. While a variety of techniques have been developed to deal with these vulnerabilities, they often have unacceptably high performance or memory overheads, especially in the worst case.We have designed MarkUs, a memory allocator that prevents this form of attack at low overhead, sufficient for deployment in real software, even under allocation-and memory-intensive scenarios. We prevent use-after-free attacks by quarantining data freed by the programmer and forbidding its reallocation until we are sure that there are no dangling pointers targeting it. To identify these we traverse live-objects accessible from registers and memory, marking those we encounter, to check whether quarantined data is accessible from any currently allocated location. Unlike garbage collection, which is unsafe in C and C++, MarkUs ensures safety by only freeing data that is both quarantined by the programmer and has no identifiable dangling pointers. The information provided by the programmer's allocations and frees further allows us to optimise the process by freeing physical addresses early for large objects, specialising analysis for small objects, and only performing marking when sufficient data is in quarantine. Using MarkUs, we reduce the overheads of temporal safety in low-level languages to 1.1× on average for SPEC CPU2006, with a maximum slowdown of only 2×, vastly improving upon the state-of-the-art.
Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting proposition to solve this is software prefetching, where special non-blocking loads are used to bring data into the cache hierarchy just before being required. However, these are difficult to insert to effectively improve performance, and techniques for automatic insertion are currently limited. This paper develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which gain benefit from the technique. We then evaluate the extent to which good prefetch instructions are architecture dependent. Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3× and 1.1× for an Intel Haswell processor and an ARM Cortex-A57, both out-of-order cores, and performance improvements of 2.1× and 3.7× for the in-order ARM Cortex-A53 and Intel Xeon Phi.
Searches on large graphs are heavily memory latency bound, as a result of many high latency DRAM accesses. Due to the highly irregular nature of the access patterns involved, caches and prefetchers, both hardware and software, perform poorly on graph workloads. This leads to CPU stalling for the majority of the time. However, in many cases the data access pattern is well defined and predictable in advance, many falling into a small set of simple patterns. Although existing implicit prefetchers cannot bring significant benefit, a prefetcher armed with knowledge of the data structures and access patterns could accurately anticipate applications' traversals to bring in the appropriate data. This paper presents a design of an explicitly configured prefetcher to improve performance for breadth-first searches and sequential iteration on the efficient and commonly-used compressed sparse row graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data in advance of the application needing it. For a range of applications and graph sizes, our prefetcher achieves average speedups of 2.3×, and up to 3.3×, with little impact on memory bandwidth requirements.
Microprocessor error detection is increasingly important, as the number of transistors in modern systems heightens their vulnerability. In addition, many modern workloads in domains such as the automotive and health industries are increasingly error intolerant, due to strict safety standards. However, current detection techniques require duplication of all hardware structures, causing a considerable increase in power consumption and chip area. Solutions in the literature involve running the code multiple times on the same hardware, which reduces performance significantly and cannot capture all errors. We have designed a novel hardware-only solution for error detection, that exploits parallelism in checking code which may not exist in the original execution. We pair a high-performance out-of-order core with a set of small low-power cores, each of which checks a portion of the out-of-order core's execution. Our system enables the detection of both hard and soft errors, with low area, power and performance overheads.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.