Reuse Aware Data Placement Schemes for Multilevel Cache Hierarchies

Title Reuse Aware Data Placement Schemes for Multilevel Cache Hierarchies
Author Jiajun Wang
Publisher
Pages 296
Release 2019
Genre
ISBN

Memory subsystems with larger capacity and deeper hierarchies have been designed to maximize the performance of data-intensive workloads. What grows with depth and capacity is the amount of data movement between cache levels and the associated energy consumption. Prior art [65] shows that the energy cost of moving data from memory to a register is two orders of magnitude higher than the cost of a register-to-register double-precision floating-point operation. As the cache hierarchy grows deeper, the energy cost of the large amount of data movement between cache layers has become non-negligible, and the energy dissipation of future systems will be dominated by the cost of data movement. Reducing data movement by exploiting data locality is therefore essential to building energy-efficient architectures. A promising technique for improving the energy efficiency of modern memory subsystems is to adaptively guide data placement into appropriate caches with both the performance benefit and the energy cost of data movement in mind. An intelligent data placement scheme should move into the cache only those data blocks that will be re-referenced in the future. As the working set size of emerging workloads exceeds cache capacity and the number of cores and IPs sharing caches keeps increasing, a data-movement-aware placement scheme can maximize the performance of cache-sensitive workloads and minimize the cache energy consumption of cache-insensitive workloads. Researchers have observed that exclusive caches perform better than inclusive caches. However, high performance is at odds with low energy consumption: exclusive caches incur more data movement and consume more energy than inclusive ones. A few state-of-the-art CPU cache insertion/bypass policies have been proposed in the literature.
However, these techniques either incur a large metadata overhead when adapted to exclusive caches or reduce data movement at the expense of performance. On the GPU side, designing efficient data placement schemes is also challenging. CPU caching schemes do not work for GPU memory subsystems, because the SRAM capacity per GPU thread is far smaller than the capacity per CPU thread. The GPU's on-chip SRAMs are too small to hold the large data structures in GPU workloads, so data with frequent reuse is evicted before it is re-referenced, which results in a high GPU cache miss rate. With these shortcomings of prior work and key limitations in mind, this dissertation focuses on improving the performance and energy efficiency of modern CPU and GPU cache subsystems by proposing performance- and energy-sensitive data placement schemes. The dissertation first presents a data placement scheme for multilevel CPU caches that guides data into appropriate cache layers based on data reuse patterns. The program counter (PC) is used as the prediction heuristic, based on the observation that a memory instruction correlates well with the locality of the data it accesses. Unlike prior art, which incurs a large overhead for metadata (e.g., PC) transmission and storage, the dissertation presents a holistic approach to managing data placement that uses Bloom filters to record the memory instruction PCs of data blocks. The proposed scheme incorporates quick detection and correction of stale or incorrect bypass decisions and an explicit mechanism for handling prefetches. This improves energy efficiency by cutting down on wasteful cache block insertions and data movement. To overcome the challenges on the GPU side, the dissertation presents an explicitly managed data placement scheme for the GPU memory hierarchy.
To improve data reuse in a popular HPC application and eliminate redundant memory accesses, the data access sequence is rearranged by fusing the execution of multiple GPU kernels. Bank-level, fine-grained on-chip SRAM data placement and replacement is designed around the microarchitecture of the GPU memory hierarchy to maximize capacity utilization and interconnect bandwidth. The proposed scheme achieves the best performance and the lowest energy consumption by reducing memory access latency and eliminating redundant data movement.
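
As a rough illustration of the PC-based prediction idea in the abstract above, the sketch below records in a small Bloom filter the PCs of memory instructions whose blocks were evicted without reuse, and consults the filter to decide whether a new insertion should bypass the cache. The class names, filter size, hash construction, training rule and periodic reset are all assumptions made for illustration; the abstract does not specify the dissertation's actual design.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over integer keys (hypothetical sizing)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive k bit positions by salting a SHA-256 digest of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def clear(self):
        self.bits = bytearray(self.num_bits)


class BypassPredictor:
    """Predicts whether a block fetched by a given PC should skip the cache."""

    def __init__(self):
        self.no_reuse_pcs = BloomFilter()

    def train_on_eviction(self, pc, was_reused):
        # Only blocks evicted without reuse train the filter. Bloom filters
        # cannot delete entries, so this sketch assumes stale decisions are
        # corrected by calling reset() periodically.
        if not was_reused:
            self.no_reuse_pcs.add(pc)

    def should_bypass(self, pc):
        return pc in self.no_reuse_pcs

    def reset(self):
        self.no_reuse_pcs.clear()


predictor = BypassPredictor()
predictor.train_on_eviction(pc=0x400A10, was_reused=False)
predictor.train_on_eviction(pc=0x400B20, was_reused=True)
```

Because a Bloom filter can report false positives, a predictor of this kind will occasionally bypass a block that would have been reused; the filter trades that inaccuracy for a storage cost far smaller than keeping a full PC per cache block.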

Data Placement Optimizations for Multilevel Cache Hierarchies

Title Data Placement Optimizations for Multilevel Cache Hierarchies
Author Clark L. Coleman
Publisher
Pages 344
Release 2004
Genre
ISBN

Locality-aware Cache Hierarchy Management for Multicore Processors

Title Locality-aware Cache Hierarchy Management for Multicore Processors
Author
Publisher
Pages 194
Release 2015
Genre
ISBN

Next-generation multicore processors and applications will operate on massive data with significant sharing. A major challenge in their implementation is the storage requirement for tracking the sharers of data. The bit overhead for such storage scales quadratically with the number of cores in conventional directory-based cache coherence protocols. Another major challenge is limited cache capacity and the data movement incurred by conventional cache hierarchy organizations when dealing with massive data scales. These two factors adversely impact memory access latency and energy consumption. This thesis proposes scalable, efficient mechanisms that improve effective cache capacity (i.e., by improving utilization) and reduce data movement by exploiting locality and controlling replication. First, a limited directory-based protocol, ACKwise, is proposed to track the sharers of data in a cost-effective manner. ACKwise leverages broadcasts to implement scalable cache coherence. Broadcast support can be implemented in a 2-D mesh network by making simple changes to its routing policy without requiring any additional virtual channels. Second, a locality-aware replication scheme that better manages the private caches is proposed. This scheme controls replication based on data reuse information and seamlessly adapts between private and logically shared caching of on-chip data at the fine granularity of cache lines. A low-overhead runtime profiling capability that measures the locality of each cache line is built into hardware. Private caching is allowed only for data blocks with high spatio-temporal locality. Third, a timestamp-based memory ordering validation scheme is proposed that makes the locality-aware private cache replication scheme implementable in processors with out-of-order memory that employ popular memory consistency models.
This method does not rely on cache coherence messages to detect speculation violations, and hence is applicable to the locality-aware protocol. The timestamp mechanism is efficient because consistency violations occur only due to conflicting accesses that have temporal proximity (i.e., within a few cycles of each other), so timestamps need to be stored only for a small time window. Fourth, a locality-aware last-level cache (LLC) replication scheme that better manages the LLC is proposed. This scheme adapts replication at runtime based on fine-grained cache line reuse information and thereby balances data locality and off-chip miss rate for optimized execution. Finally, all the above schemes are combined to obtain a cache hierarchy replication scheme that provides optimal data locality and miss rates at all levels of the cache hierarchy. The design of this scheme is motivated by the experimental observation that locality-aware private cache replication and LLC replication yield varying performance improvements across benchmarks. These techniques enable optimal use of the on-chip cache capacity and provide low-latency, low-energy memory access, while retaining the convenience of shared memory and preserving the same memory consistency model. On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication improves completion time by 15% and energy by 22% over a state-of-the-art baseline while incurring a storage overhead of 30.7 KB per core (i.e., 10% of the aggregate cache capacity of each core).
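
The reuse-based replication decision described in the abstract above can be pictured with a minimal software sketch: each (core, cache line) pair accumulates an access count, and the line is allowed into the private cache only once the count reaches a threshold. The threshold value, the class and method names, and the use of a simple counter as the locality measure are assumptions for illustration, not details taken from the thesis.

```python
from collections import defaultdict

PRIVATE_CACHING_THRESHOLD = 4  # hypothetical value, not taken from the thesis


class LocalityClassifier:
    """Tracks per-(core, line) reuse and picks a caching mode."""

    def __init__(self, threshold=PRIVATE_CACHING_THRESHOLD):
        self.threshold = threshold
        self.reuse = defaultdict(int)  # (core_id, line_addr) -> access count

    def access(self, core_id, line_addr):
        """Record one access and return the mode used to serve it."""
        key = (core_id, line_addr)
        self.reuse[key] += 1
        if self.reuse[key] >= self.threshold:
            return "private"  # replicate the line in the core's private cache
        return "remote"       # serve the access from the shared cache

    def evict(self, core_id, line_addr):
        # Forget the history so the line must re-earn private caching.
        self.reuse.pop((core_id, line_addr), None)


classifier = LocalityClassifier()
modes = [classifier.access(core_id=0, line_addr=0x1000) for _ in range(5)]
```

In this sketch the first three accesses are served remotely and the line is privately cached from the fourth access onward; a hardware implementation would presumably use a few saturating counter bits per line rather than an unbounded table.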

A Framework for Cache-conscious Data Placement Simulation

Title A Framework for Cache-conscious Data Placement Simulation
Author Amy Margaret Henning
Publisher
Pages 80
Release 2005
Genre
ISBN

Searchable Storage in Cloud Computing

Title Searchable Storage in Cloud Computing
Author Yu Hua
Publisher Springer
Pages 204
Release 2019-02-08
Genre Computers
ISBN 9811327211

This book presents state-of-the-art work on searchable storage in cloud computing. It introduces new schemes for exploring and exploiting searchable storage via cost-efficient semantic hashing computation. Specifically, the book covers basic hashing structures (Bloom filters, locality-sensitive hashing, cuckoo hashing), semantic storage systems, and searchable namespaces, which support multiple applications such as cloud backups, exact and approximate queries, and image analytics. Readers will be interested in these searchable techniques because of their simplicity and ease of use. More importantly, all of the structures and techniques mentioned have been implemented to support real-world applications, and some of them offer open-source code for public use. Readers with basic knowledge of data structures and computer systems will gain a solid background, new insights, and implementation experience.

Data Access and Storage Management for Embedded Programmable Processors

Title Data Access and Storage Management for Embedded Programmable Processors
Author Francky Catthoor
Publisher Springer Science & Business Media
Pages 316
Release 2013-03-14
Genre Computers
ISBN 1475749031

Data Access and Storage Management for Embedded Programmable Processors gives an overview of the state of the art in system-level data access and storage management for embedded programmable processors. The targeted application domain covers complex embedded real-time multimedia and communication applications. Many of these applications are data-dominated in the sense that their cost-related aspects, namely power consumption and footprint, are heavily influenced (if not dominated) by data access and storage. The material is mainly based on research at IMEC in this area in the period 1996-2001. To deal with the stringent timing requirements and the data-dominated characteristics of this domain, we have adopted a target architecture style that is compatible with modern embedded processors, and we have developed a systematic step-wise methodology that makes the exploration and optimization of such applications feasible in a source-to-source precompilation approach.

Resource Management for Big Data Platforms

Title Resource Management for Big Data Platforms
Author Florin Pop
Publisher Springer
Pages 509
Release 2016-10-27
Genre Computers
ISBN 3319448811

Serving as a flagship driver for advanced research in the area of Big Data platforms and applications, this book provides a platform for disseminating advanced topics in theory, research efforts and analysis, and implementation, with a focus on methods, techniques and performance evaluation. Its 23 chapters present several important formulations of architecture design, optimization techniques, and advanced analytics methods, along with biological, medical and social media applications. These chapters discuss the research of members of the ICT COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications (cHiPSet). This volume is ideal as a reference for students, researchers and industry practitioners working in, or interested in joining, interdisciplinary work in the area of intelligent decision systems using emergent distributed computing paradigms. It will also allow newcomers to grasp the key concerns and their potential solutions.