Since the 1970s, microprocessor-based digital platforms have been riding Moore's law, which has allowed transistor density to double for the same area roughly every two years. However, whereas microprocessor fabrication has focused on increasing instruction execution rate, memory fabrication technologies have focused primarily on increasing capacity, with negligible increase in speed. This divergence in performance between processors and memory has led to a phenomenon referred to as the "Memory Wall."
To overcome the memory wall, designers have resorted to a hierarchy of cache memory levels, which rely on the principle of memory access locality to reduce the observed memory access time and the performance gap between processors and memory. Unfortunately, important workload classes exhibit adverse memory access patterns that baffle the simple policies built into modern cache hierarchies to move instructions and data across cache levels. As a result, processors often spend much time idling upon a demand fetch of memory blocks that miss in higher cache levels. Prefetching (predicting future memory accesses and issuing requests for the corresponding memory blocks in advance of explicit accesses) is an effective approach to hide memory access latency. A myriad of prefetching techniques have been proposed, and nearly every modern processor includes some hardware prefetching mechanisms targeting simple and regular memory access patterns. This primer offers an overview of the various classes of hardware prefetchers for instructions and data proposed in the research literature, and presents examples of techniques incorporated into modern microprocessors.
About the Author(s)
Babak Falsafi, EPFL, Switzerland
Babak Falsafi is a Professor in the School of Computer and Communication Sciences at EPFL, and the founding director of the EcoCloud research center, targeting future energy-efficient and environmentally friendly cloud technologies. He has made numerous contributions to computer system design and evaluation, including: a scalable multiprocessor architecture that laid the foundation for the Sun (now Oracle) WildFire servers; snoop filters; temporal stream prefetchers that are incorporated into IBM BlueGene/P and BlueGene/Q; and computer system simulation sampling methodologies that have been in use by AMD and HP for research and product development. His most notable contribution was being the first to show that, contrary to conventional wisdom, multiprocessor memory programming models (known as memory consistency models) prevalent in all modern systems are neither necessary nor sufficient to achieve high performance. He is a recipient of an NSF CAREER award, IBM Faculty Partnership Awards, and an Alfred P. Sloan Research Fellowship. He is a fellow of IEEE.
Thomas F. Wenisch, University of Michigan
Thomas Wenisch is an Associate Professor of Computer Science and Engineering at the University of Michigan, specializing in computer architecture. His prior research includes memory streaming for commercial server applications, store-wait-free multiprocessor memory systems, memory disaggregation, and rigorous sampling-based performance evaluation methodologies. His ongoing work focuses on computational sprinting, memory persistency, data center architecture, energy-efficient server design, and accelerators for medical imaging. Wenisch received the NSF CAREER award in 2009 and the University of Michigan Henry Russell Award in 2013. He received his Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University.