Locality-aware Cache Hierarchy Management for Multicore Processors

Title Locality-aware Cache Hierarchy Management for Multicore Processors
Pages 194
Release 2015


Next-generation multicore processors and applications will operate on massive data with significant sharing. A major implementation challenge is the storage required to track the sharers of data: in conventional directory-based cache coherence protocols, this bit overhead scales quadratically with the number of cores. Another major challenge is limited cache capacity and the data movement incurred by conventional cache hierarchy organizations when dealing with massive data. Both factors adversely impact memory access latency and energy consumption. This thesis proposes scalable, efficient mechanisms that improve effective cache capacity (i.e., utilization) and reduce data movement by exploiting locality and controlling replication.

First, ACKwise, a limited directory-based protocol, is proposed to track the sharers of data in a cost-effective manner. ACKwise leverages broadcasts to implement scalable cache coherence. Broadcast support can be implemented in a 2-D mesh network with simple changes to its routing policy and without any additional virtual channels.

Second, a locality-aware replication scheme that better manages the private caches is proposed. This scheme controls replication based on data reuse information and seamlessly adapts between private and logically shared caching of on-chip data at the fine granularity of cache lines. A low-overhead runtime profiling capability that measures the locality of each cache line is built into hardware, and private caching is allowed only for data blocks with high spatio-temporal locality.

Third, a timestamp-based memory ordering validation scheme is proposed that makes the locality-aware private cache replication scheme implementable in processors with out-of-order memory systems that employ popular memory consistency models. This method does not rely on cache coherence messages to detect speculation violations, and hence is applicable to the locality-aware protocol. The timestamp mechanism is efficient because consistency violations occur only between conflicting accesses with temporal proximity (i.e., within a few cycles of each other), so timestamps need to be stored only for a small time window.

Fourth, a locality-aware last-level cache (LLC) replication scheme that better manages the LLC is proposed. This scheme adapts replication at runtime based on fine-grained cache line reuse information and thereby balances data locality against off-chip miss rate for optimized execution.

Finally, all the above schemes are combined into a cache hierarchy replication scheme that provides optimal data locality and miss rates at all levels of the cache hierarchy. Its design is motivated by the experimental observation that locality-aware private cache replication and locality-aware LLC replication yield varying performance improvements across benchmarks. Together, these techniques make optimal use of the on-chip cache capacity and provide low-latency, low-energy memory access, while retaining the convenience of shared memory and preserving the same memory consistency model. On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication improves completion time by 15% and reduces energy consumption by 22% over a state-of-the-art baseline, while incurring a storage overhead of 30.7 KB per core (i.e., 10% of each core's aggregate cache capacity).
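
The limited-directory idea at the heart of ACKwise can be made concrete with a short sketch. The following C++ fragment is a minimal model, assuming a per-entry sharer list of K slots and an exact sharer count that survives overflow; the names, the value of K, and the interface are illustrative assumptions, not the thesis hardware.

```cpp
// Minimal sketch of an ACKwise-K style limited directory entry (assumed
// names/sizes): track up to K sharer IDs precisely; on overflow, keep only
// an exact sharer count and fall back to broadcast invalidation.
#include <array>
#include <cstdint>
#include <iostream>

constexpr int K = 4; // hardware sharer slots per entry (illustrative)

struct AckwiseEntry {
    std::array<int16_t, K> sharers{}; // precise sharer IDs while they fit
    int num_sharers = 0;              // exact count, kept even after overflow
    bool broadcast_mode = false;      // IDs lost; invalidations must broadcast

    void add_sharer(int core) {
        if (!broadcast_mode && num_sharers < K)
            sharers[num_sharers] = static_cast<int16_t>(core);
        else
            broadcast_mode = true;    // stop tracking IDs, keep counting
        ++num_sharers;                // ack count stays exact
    }

    // On a write: how many invalidation acks to expect, and whether the
    // invalidation must be broadcast on the 2-D mesh.
    int expected_acks() const { return num_sharers; }
    bool needs_broadcast() const { return broadcast_mode; }
};

int main() {
    AckwiseEntry e;
    for (int core = 0; core < 6; ++core) e.add_sharer(core);
    std::cout << "broadcast=" << e.needs_broadcast()
              << " acks=" << e.expected_acks() << '\n'; // broadcast=1 acks=6
}
```

Because the sharer count stays exact after overflow, a broadcast invalidation still knows precisely how many acknowledgments to collect, which is what keeps the protocol correct without quadratic sharer storage.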

Locality-aware Task Management on Many-core Processors

Title Locality-aware Task Management on Many-core Processors
Author Richard Myungon Yoo
Release 2012


The landscape of computing is changing. Due to limits in transistor scaling, the traditional approach of exploiting instruction-level parallelism through wide-issue, out-of-order execution cores now delivers diminishing performance gains. As a result, computer architects rely on thread-level parallelism to obtain sustainable performance improvement. In particular, many-core processors are designed to exploit parallelism by implementing multiple cores that can execute in parallel. Both industry and academia agree that scaling the number of cores to hundreds or thousands is the only way to continue scaling performance. Such a shift in design, however, places greater demands on the memory system: the cache hierarchies on many-core processors are becoming larger and increasingly complex, suffer from high latency and energy consumption, and exhibit prevalent non-uniform memory access effects.

Traditionally, exploiting locality was an option for reducing execution time and energy consumption. On a complex many-core cache hierarchy, however, failing to exploit locality can leave more cores stalled, undermining the very viability of parallelism. Locality can be exploited at various hardware and software layers. By implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimized for locality, but this is wasted if software scheduling does not arrange the execution to exploit the locality available in the programs themselves. In particular, the recent proliferation of runtime-based programming systems further stresses the importance of locality-aware scheduling. Although many efforts have been made to exploit locality in a runtime, they fail to take the underlying cache hierarchy into consideration, are limited to specific programming models, and suffer from high management costs.

This thesis shows that locality-aware schedules can be generated at low cost by utilizing high-level information. In particular, by optimizing a MapReduce runtime on a multi-socket many-core system, we show that runtimes can leverage explicit producer-consumer information to exploit locality; the locality of the data structures that buffer intermediate results proves especially important, and the optimization must be performed across all the software layers. To handle the case where explicit data dependency information is not available, we develop a graph-based locality analysis framework that allows key scheduling attributes to be analyzed independently of hardware specifics and scale. Using the framework, we also develop a reference scheduling scheme that shows significant performance improvement and energy savings. We then develop a novel class of practical locality-aware task managers that leverage workload pattern information and simple locality hints to approximate the reference scheduling scheme. Through experiments, we show that the quality of the generated schedules can match that of the reference scheme and that the schedule-generation costs are minimal. While exploiting significant locality, these managers keep the simple task programming interface intact.

We also show that task stealing can be made compatible with locality-aware scheduling. Traditional task management schemes assumed a fundamental tradeoff between locality and load balance and fixated on one at the expense of the other. We show that a stealing scheme can be made locality-aware by trying to preserve the original schedule while transferring tasks for load balancing. In summary, utilizing high-level information allows the construction of efficient locality-aware task management schemes that make programs run faster while consuming less energy.
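A minimal sketch of the locality-aware stealing idea described above: the thief prefers victims in its own cache domain and takes the oldest queued task, so the victim's original (locality-friendly) schedule is disturbed as little as possible. The data structures, the socket hint, and the two-pass victim search are assumptions for illustration, not Yoo's implementation; synchronization is omitted.

```cpp
// Locality-aware stealing sketch: prefer victims sharing the thief's cache
// domain, and steal the oldest task so the victim keeps its hot work.
#include <deque>
#include <optional>
#include <vector>

using Task = int;               // stand-in for a real task closure

struct Worker {
    int socket;                 // cache-domain hint for this worker (assumed)
    std::deque<Task> queue;     // local tasks, hot end at the front
};

std::optional<Task> steal(std::vector<Worker>& workers, int thief) {
    // Pass 1: only victims in the thief's cache domain; pass 2: anyone.
    for (int pass = 0; pass < 2; ++pass) {
        for (size_t v = 0; v < workers.size(); ++v) {
            if (static_cast<int>(v) == thief || workers[v].queue.empty())
                continue;
            bool same_domain = workers[v].socket == workers[thief].socket;
            if (pass == 0 && !same_domain) continue;
            Task t = workers[v].queue.back(); // oldest task: least locality lost
            workers[v].queue.pop_back();
            return t;
        }
    }
    return std::nullopt;        // no stealable work anywhere
}
```

The key design choice is that load balancing happens along the cache topology first, so a steal usually moves a task between cores that already share a cache level.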

Multi-Core Cache Hierarchies

Title Multi-Core Cache Hierarchies
Author Rajeev Balasubramonian
Publisher Morgan & Claypool Publishers
Pages 155
Release 2011-06-06
Genre Technology & Engineering
ISBN 1598297546


A key determinant of overall system performance and power dissipation is the cache hierarchy, since access to off-chip memory consumes many more cycles and far more energy than on-chip access. In addition, multi-core processors are expected to place ever higher bandwidth demands on the memory system. All these issues make it important to avoid off-chip memory access by improving the efficiency of the on-chip cache. Future multi-core processors will have many large cache banks connected by a network and shared by many cores. Hence, many important problems must be solved: cache resources must be allocated across many cores, data must be placed in cache banks that are near the accessing core, and the most important data must be identified for retention. Finally, difficulties in scaling existing technologies require adapting to and exploiting new technology constraints. The book attempts a synthesis of recent cache research that has focused on innovations for multi-core processors. It is an excellent starting point for early-stage graduate students, researchers, and practitioners who wish to understand the landscape of recent cache research, and it is suitable as a reference for advanced computer architecture classes as well as for experienced researchers and VLSI engineers. Table of Contents: Basic Elements of Large Cache Design / Organizing Data in CMP Last Level Caches / Policies Impacting Cache Hit Rates / Interconnection Networks within Large Caches / Technology / Concluding Remarks
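To make the "place data near the accessing core" problem concrete, here is a hedged C++ sketch in the spirit of the banked-cache policies the book surveys: a statically hashed home bank on a 2-D mesh is overridden for hot lines that live far from the requester. The mesh size, thresholds, and hop model are illustrative assumptions, not a policy from the book.

```cpp
// Distance-aware placement sketch for a banked shared cache on a 2-D mesh:
// hash each line to a static home bank, but keep a hot, far-away line in
// the requesting core's own bank instead.
#include <cstdint>
#include <cstdlib>

constexpr int MESH = 8;                       // 8x8 tiles, one bank per tile

int hops(int a, int b) {                      // Manhattan distance on the mesh
    return std::abs(a % MESH - b % MESH) + std::abs(a / MESH - b / MESH);
}

// Static placement hashes the line address across all banks...
int static_bank(uint64_t addr) {
    return static_cast<int>((addr >> 6) % (MESH * MESH)); // 64 B lines
}

// ...while a locality-aware policy keeps a frequently reused line in the
// requester's tile when its static home is many hops away.
int place(uint64_t addr, int core, int reuse_count) {
    int home = static_bank(addr);
    bool far = hops(core, home) > 2;          // assumed distance threshold
    bool hot = reuse_count >= 4;              // assumed reuse threshold
    return (far && hot) ? core : home;
}
```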

Handbook of Research on High Performance and Cloud Computing in Scientific Research and Education

Title Handbook of Research on High Performance and Cloud Computing in Scientific Research and Education
Author Despotović-Zrakić, Marijana
Publisher IGI Global
Pages 476
Release 2014-03-31
Genre Computers
ISBN 1466657855


As information systems used for research and educational purposes have become more complex, the need for new computing architectures has grown. High performance and cloud computing provide reliable and cost-effective information technology infrastructure that enhances research and educational processes. Handbook of Research on High Performance and Cloud Computing in Scientific Research and Education presents the applications of cloud computing in various settings, such as scientific research, education, e-learning, ubiquitous learning, and social computing. Providing various examples, practical solutions, and applications of high performance and cloud computing, this book is a useful reference for professionals and researchers discovering the applications of information and communication technologies in science and education, as well as for scholars seeking insight into how modern technologies support scientific research.

A Higher Order Theory of Locality and Its Application in Multicore Cache Management

Title A Higher Order Theory of Locality and Its Application in Multicore Cache Management
Author Xiaoya Xiang
Pages 186
Release 2014


"As multi-core processors become commonplace and cloud computing is gaining acceptance, applications are increasingly run in parallel over a shared memory hierarchy. While the traditional machine and program metrics such as miss ratio and reuse distance can precisely characterize the memory performance of a single program, they are not composable and therefore cannot model the dynamic interaction between simultaneously running programs. This dissertation presents an alternative metric called program footprint. Given a program execution, its footprint is the amount of data accessed in a given time period. The footprint is composable-- the aggregate footprint of a set of programs is the sum of the footprint of the individual footprints. The dissertation presents the following techniques: Near real-time footprint measurement, first by using two novel algorithms, one for footprint distribution and the other for footprint average, and then by run-time sampling. Higher order theory of cache locality, which shows that traditional metrics can be derived from the footprint and vice versa. (As a result, previous locality metrics can also be obtained in near real time.) Composable model of cache sharing, by footprint composition, which is faster and simpler to use than previous reuse-distance based models. Cache-conscious task regrouping, which reorganizes a parallel workload to minimize the interference in shared cache. Through these techniques, the dissertation establishes the thesis that program interaction in shared cache can be efficiently and accurately modeled and dynamically optimized"--Page vi-vii.

Reuse Aware Data Placement Schemes for Multilevel Cache Hierarchies

Title Reuse Aware Data Placement Schemes for Multilevel Cache Hierarchies
Author Jiajun Wang
Pages 296
Release 2019


Memory subsystems with larger capacity and deeper hierarchies have been designed to achieve the maximum performance of data-intensive workloads. What grows with the depth and capacity is the amount of data movement between different levels of the cache hierarchy and the associated energy consumption. Prior art [65] shows that the energy cost of moving data from memory to a register is two orders of magnitude higher than the cost of a register-to-register double-precision floating point operation. As the cache hierarchy grows deeper, the energy spent on data movement between cache layers has become non-negligible, and the energy dissipation of future systems will be dominated by the cost of data movement. Thus, reducing data movement by exploiting data locality is essential to building energy-efficient architectures.

A promising technique for improving the energy efficiency of modern memory subsystems is to adaptively guide data placement into the appropriate caches with both the performance benefit and the energy cost of data movement in mind. An intelligent data placement scheme should move into the cache only data blocks that will be re-referenced. As the working set size of emerging workloads exceeds cache capacity and the number of cores and IPs sharing caches keeps increasing, a data-movement-aware placement scheme can maximize the performance of cache-sensitive workloads and minimize the cache energy consumption of cache-insensitive workloads. Researchers have noticed that exclusive caches perform better than inclusive caches; however, high performance is at odds with low energy consumption here, since exclusive caches incur more data movement and energy consumption than inclusive ones. A few state-of-the-art CPU cache insertion/bypass policies have been proposed in the literature, but these techniques either incur great metadata overhead when adapted to exclusive caches or reduce data movement at the expense of performance. On the GPU side, designing efficient data placement schemes also faces great challenges. CPU caching schemes do not work for GPU memory subsystems because the SRAM capacity per GPU thread is far smaller than that per CPU thread; GPU on-chip SRAMs are too small to hold the large data structures of GPU workloads, so data with frequent reuse is evicted before it is re-referenced, resulting in high GPU cache miss rates.

Keeping these shortcomings and limitations of prior work in mind, this dissertation focuses on improving the performance and energy efficiency of modern CPU and GPU cache subsystems by proposing performance- and energy-sensitive data placement schemes. It first presents a data placement scheme for multilevel CPU caches that guides data into the appropriate cache layers based on data reuse patterns. The program counter (PC) of the memory instruction is used as the prediction heuristic, based on the observation that a memory instruction correlates well with the locality of the data it accesses. Unlike prior art that incurs great overhead for transmitting and storing metadata (e.g., the PC), a holistic approach to managing data placement is presented that leverages Bloom filters to record the memory-instruction PCs of data blocks. The proposed scheme incorporates quick detection and correction of stale or incorrect bypass decisions and an explicit mechanism for handling prefetches. This improves energy efficiency by cutting down wasteful cache block insertions and data movement.
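A hedged sketch of the PC-indexed Bloom-filter idea: PCs whose data blocks have shown reuse are recorded in a Bloom filter, and a fill whose triggering PC is absent from the filter bypasses the cache level. The filter size, hash functions, and training interface are assumptions for illustration, not the dissertation's design.

```cpp
// Bloom-filter sketch for PC-based insertion/bypass prediction: set bits
// for PCs whose blocks were re-referenced; query on a miss fill.
#include <bitset>
#include <cstdint>

struct PcBloomFilter {
    std::bitset<4096> bits;                            // assumed filter size
    static uint64_t h1(uint64_t pc) { return (pc * 0x9E3779B97F4A7C15ULL) >> 52; }
    static uint64_t h2(uint64_t pc) { return (pc * 0xC2B2AE3D27D4EB4FULL) >> 52; }

    // Training: a block fetched by this PC was later re-referenced.
    void record_reuse(uint64_t pc) {
        bits.set(h1(pc) % bits.size());
        bits.set(h2(pc) % bits.size());
    }
    // Query: Bloom filters never miss a trained PC but may give false
    // positives, which here only cost an extra (harmless) insertion.
    bool predicts_reuse(uint64_t pc) const {
        return bits.test(h1(pc) % bits.size()) &&
               bits.test(h2(pc) % bits.size());
    }
};

// On a miss fill: insert into this cache level only if the missing
// instruction's PC has a history of reuse; otherwise bypass the level.
bool should_insert(const PcBloomFilter& f, uint64_t miss_pc) {
    return f.predicts_reuse(miss_pc);
}
```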
To overcome the challenges on the GPU side, this dissertation presents an explicitly managed data placement scheme for the GPU memory hierarchy. To improve data reuse in a popular HPC application and eliminate redundant memory accesses, the data access sequence is rearranged by fusing the execution of multiple GPU kernels. Bank-level, fine-grained on-chip SRAM data placement and replacement is designed around the microarchitecture of the GPU memory hierarchy to maximize capacity utilization and interconnect bandwidth. The proposed scheme achieves the best performance and the least energy consumption by reducing memory access latency and eliminating redundant data movement.
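The payoff of fusion can be illustrated in plain C++ (the dissertation fuses GPU kernels; loop fusion is the language-neutral analogue): two producer/consumer loops become one, so the intermediate value stays in a register instead of round-tripping through a memory buffer. The function names and the toy computation are illustrative.

```cpp
// Fusion sketch: eliminating the intermediate buffer removes one full
// write pass and one full read pass over memory.
#include <vector>

// Unfused: "kernel A" writes tmp[] to memory, "kernel B" reads it back.
void unfused(const std::vector<float>& x, std::vector<float>& y) {
    std::vector<float> tmp(x.size());
    for (size_t i = 0; i < x.size(); ++i) tmp[i] = x[i] * 2.0f; // kernel A
    for (size_t i = 0; i < x.size(); ++i) y[i] = tmp[i] + 1.0f; // kernel B
}

// Fused: the intermediate never touches memory.
void fused(const std::vector<float>& x, std::vector<float>& y) {
    for (size_t i = 0; i < x.size(); ++i) y[i] = x[i] * 2.0f + 1.0f;
}
```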

Cyberspace Safety and Security

Title Cyberspace Safety and Security
Author Arcangelo Castiglione
Publisher Springer
Pages 335
Release 2018-10-24
Genre Computers
ISBN 3030016897


This book constitutes the proceedings of the 10th International Symposium on Cyberspace Safety and Security, CSS 2018, held in Amalfi, Italy, in October 2018. The 25 full papers presented in this volume were carefully reviewed and selected from 79 submissions. The papers focus on cybersecurity; cryptography, data security, and biometric techniques; and social security, ontologies, and smart applications.