Locality-aware Task Management on Many-core Processors

Title: Locality-aware Task Management on Many-core Processors
Author: Richard Myungon Yoo
Publisher:
Pages:
Release: 2012
Genre:
ISBN:


The landscape of computing is changing. Due to limits in transistor scaling, the traditional approach of exploiting instruction-level parallelism through wide-issue, out-of-order cores now provides diminishing performance gains. As a result, computer architects rely on thread-level parallelism to obtain sustainable performance improvements. In particular, many-core processors are designed to exploit parallelism by implementing multiple cores that execute in parallel. Both industry and academia agree that scaling the number of cores to hundreds or thousands is the only way to continue scaling performance. However, such a shift in design increases the demands on the processor's memory system. As a result, the cache hierarchies of many-core processors are becoming larger and increasingly complex. These hierarchies suffer from high latency and energy consumption, and non-uniform memory access effects become prevalent. Traditionally, exploiting locality was optional, a way to reduce execution time and energy consumption. On a complex many-core cache hierarchy, however, failing to exploit locality can leave more cores stalled, undermining the very viability of parallelism.

Locality can be exploited at various hardware and software layers. By implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimized for locality. These optimizations are wasted, however, if software scheduling does not arrange execution in a manner that exposes the locality available in the programs themselves. In particular, the recent proliferation of runtime-based programming systems further underscores the importance of locality-aware scheduling. Although many efforts have been made to exploit locality in a runtime, they fail to take the underlying cache hierarchy into account, are limited to specific programming models, or incur high management costs.

This thesis shows that locality-aware schedules can be generated at low cost by utilizing high-level information. In particular, by optimizing a MapReduce runtime on a multi-socket many-core system, we show that runtimes can leverage explicit producer-consumer information to exploit locality. Specifically, locality on the data structures that buffer intermediate results becomes significantly important, and the optimization should be performed across all software layers. To handle the case where explicit data dependency information is not available, we develop a graph-based locality analysis framework that allows key scheduling attributes to be analyzed while remaining independent of hardware specifics and scale. Using the framework, we also develop a reference scheduling scheme that shows significant performance improvement and energy savings. We then develop a novel class of practical locality-aware task managers that leverage workload pattern information and simple locality hints to approximate the reference scheduling scheme. Through experiments, we show that the quality of the generated schedules can match that of the reference scheme, and that the schedule generation costs are minimal. While exploiting significant locality, these managers keep the simple task programming interface intact. We also point out that task stealing can be made compatible with locality-aware scheduling. Traditional task management schemes assumed a fundamental tradeoff between locality and load balance, and fixated on one at the expense of the other. We show that a stealing scheme can be made locality-aware by preserving the original schedule as much as possible while transferring tasks for load balancing. In summary, utilizing high-level information allows the construction of efficient locality-aware task management schemes that make programs run faster while consuming less energy.
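To make the scheduling ideas above concrete, the sketch below shows one way simple locality hints and locality-preserving stealing might fit together. It is an illustrative assumption, not the thesis's task manager; Task, LocalityTaskPool, and locality_hint are names invented for this example. Each task names the cache domain (e.g., socket) whose data it touches, workers drain their own domain's queue in submission order, and a worker that runs out of work steals from the far end of another domain's queue so the victim's original schedule is disturbed as little as possible.

```cpp
// A minimal sketch of a locality-hinted task pool with locality-preserving
// stealing. This is NOT the thesis's task manager; Task, LocalityTaskPool,
// and locality_hint are names invented for this example.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

struct Task {
    int locality_hint;           // cache domain (e.g., socket) whose data the task touches
    std::function<void()> work;  // the task body
};

class LocalityTaskPool {
public:
    explicit LocalityTaskPool(int num_domains)
        : queues_(num_domains), locks_(num_domains) {}

    // Enqueue the task on the queue of the domain named by its locality hint,
    // preserving submission order (the intended, locality-ordered schedule).
    void submit(Task t) {
        int d = t.locality_hint % static_cast<int>(queues_.size());
        std::lock_guard<std::mutex> g(locks_[d]);
        queues_[d].push_back(std::move(t));
    }

    // A worker bound to `domain` first pops from the front of its own queue.
    // Only when that queue is empty does it steal, scanning other domains in
    // increasing index distance and taking from the *back* of the victim's
    // queue, so the victim keeps the front (near-future) part of its schedule.
    std::optional<Task> next(int domain) {
        if (auto t = pop_front(domain)) return t;
        int n = static_cast<int>(queues_.size());
        for (int dist = 1; dist < n; ++dist) {
            if (auto t = steal_back((domain + dist) % n)) return t;
        }
        return std::nullopt;  // no work anywhere
    }

private:
    std::optional<Task> pop_front(int d) {
        std::lock_guard<std::mutex> g(locks_[d]);
        if (queues_[d].empty()) return std::nullopt;
        Task t = std::move(queues_[d].front());
        queues_[d].pop_front();
        return t;
    }

    std::optional<Task> steal_back(int d) {
        std::lock_guard<std::mutex> g(locks_[d]);
        if (queues_[d].empty()) return std::nullopt;
        Task t = std::move(queues_[d].back());
        queues_[d].pop_back();
        return t;
    }

    std::vector<std::deque<Task>> queues_;  // one FIFO queue per cache domain
    std::vector<std::mutex> locks_;         // one lock per queue (simplicity over scalability)
};
```

Stealing from the back of the victim's queue hands over the tasks the victim would have reached last, so the victim keeps executing the locality-ordered prefix of its schedule; this is one simple way to reconcile load balancing with a locality-aware schedule, in the spirit described above.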

Task Scheduling for Multi-core and Parallel Architectures

Title: Task Scheduling for Multi-core and Parallel Architectures
Author: Quan Chen
Publisher: Springer
Pages: 251
Release: 2017-11-23
Genre: Computers
ISBN: 9811062382


This book presents task-scheduling techniques for emerging complex parallel architectures, including heterogeneous multi-core architectures, warehouse-scale datacenters, and distributed big-data processing systems. The demand for high computational capacity has led to the growing popularity of multi-core processors, which have become mainstream in both research and real-world settings. Yet to date, no book has explored the current task-scheduling techniques for these emerging complex parallel architectures. Addressing this gap, the book discusses state-of-the-art task-scheduling techniques that are optimized for different architectures and can be applied directly in real parallel systems. Further, it provides an overview of the latest advances in task-scheduling policies for parallel architectures, helping readers understand and overcome current and emerging issues in this field.

Locality-aware Cache Hierarchy Management for Multicore Processors

Title: Locality-aware Cache Hierarchy Management for Multicore Processors
Author:
Publisher:
Pages: 194
Release: 2015
Genre:
ISBN:


Next-generation multicore processors and applications will operate on massive data with significant sharing. A major implementation challenge is the storage required for tracking the sharers of data: in conventional directory-based cache coherence protocols, this bit overhead scales quadratically with the number of cores. Another major challenge is limited cache capacity and the data movement incurred by conventional cache hierarchy organizations when dealing with data at massive scale. Both factors adversely impact memory access latency and energy consumption. This thesis proposes scalable, efficient mechanisms that improve effective cache capacity (i.e., improve utilization) and reduce data movement by exploiting locality and controlling replication.

First, a limited directory-based protocol, ACKwise, is proposed to track the sharers of data in a cost-effective manner. ACKwise leverages broadcasts to implement scalable cache coherence; broadcast support can be added to a 2-D mesh network through simple changes to its routing policy, without requiring any additional virtual channels. Second, a locality-aware replication scheme that better manages the private caches is proposed. This scheme controls replication based on data reuse information and seamlessly adapts between private and logically shared caching of on-chip data at the fine granularity of cache lines. A low-overhead runtime profiling capability that measures the locality of each cache line is built into hardware, and private caching is allowed only for data blocks with high spatio-temporal locality. Third, a timestamp-based memory ordering validation scheme is proposed that makes the locality-aware private cache replication scheme implementable in processors with out-of-order memory that employ popular memory consistency models. This method does not rely on cache coherence messages to detect speculation violations and hence is applicable to the locality-aware protocol. The timestamp mechanism is efficient because consistency violations occur only between conflicting accesses with temporal proximity (i.e., within a few cycles of each other), so timestamps need to be stored only for a small time window. Fourth, a locality-aware last-level cache (LLC) replication scheme that better manages the LLC is proposed. This scheme adapts replication at runtime based on fine-grained cache-line reuse information and thereby balances data locality and off-chip miss rate for optimized execution. Finally, all of the above schemes are combined into a cache hierarchy replication scheme that provides optimal data locality and miss rates at all levels of the cache hierarchy. Its design is motivated by the experimental observation that locality-aware private cache and LLC replication yield varying performance improvements across benchmarks.

These techniques enable optimal use of the on-chip cache capacity and provide low-latency, low-energy memory access, while retaining the convenience of shared memory and preserving the same memory consistency model. On a 64-core multicore processor with out-of-order cores, locality-aware cache hierarchy replication improves completion time by 15% and energy by 22% over a state-of-the-art baseline, while incurring a storage overhead of 30.7 KB per core (i.e., 10% of the aggregate cache capacity of each core).
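As a rough illustration of the reuse-gated replication decision described above, the following software model keeps a small reuse counter per (core, cache line) pair, serves accesses from the shared copy while the counter is below a threshold, and allows a private replica only once the line has demonstrated sufficient reuse. This is a minimal sketch under assumed names and a made-up threshold, not the hardware mechanism proposed in the thesis.

```cpp
// A minimal software model of a reuse-gated replication decision. Not the
// thesis's hardware; ReplicationPolicy, the callbacks, and the threshold
// value are assumptions made for this example.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

enum class Placement { RemoteSharedOnly, ReplicateInPrivate };

class ReplicationPolicy {
public:
    explicit ReplicationPolicy(std::uint32_t reuse_threshold = 3)
        : threshold_(reuse_threshold) {}

    // Called on every access by `core` to the line-aligned address `addr`.
    // Returns where this access should be served from.
    Placement on_access(int core, std::uint64_t addr) {
        std::uint32_t& reuse = reuse_[{core, addr}];
        if (reuse < threshold_) {
            ++reuse;                           // still profiling this line's locality
            return Placement::RemoteSharedOnly;
        }
        return Placement::ReplicateInPrivate;  // high reuse: allow a private copy
    }

    // On invalidation or eviction, the profiled reuse is discarded, so the line
    // must re-demonstrate locality at this core before it is replicated again.
    void on_invalidate(int core, std::uint64_t addr) { reuse_.erase({core, addr}); }

private:
    struct Key {
        int core;
        std::uint64_t addr;
        bool operator==(const Key& o) const {
            return core == o.core && addr == o.addr;
        }
    };
    struct KeyHash {
        std::size_t operator()(const Key& k) const {
            return std::hash<std::uint64_t>()(k.addr) * 31u +
                   static_cast<std::size_t>(k.core);
        }
    };

    std::uint32_t threshold_;
    std::unordered_map<Key, std::uint32_t, KeyHash> reuse_;  // per (core, line) reuse counters
};
```

Gating replication on demonstrated reuse trades a few extra remote accesses during profiling for less wasteful replication and higher effective private-cache capacity, which is the balance the locality-aware replication schemes above aim to strike.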

Euro-Par 2018: Parallel Processing

Title: Euro-Par 2018: Parallel Processing
Author: Marco Aldinucci
Publisher: Springer
Pages: 849
Release: 2018-08-20
Genre: Computers
ISBN: 3319969838


This book constitutes the proceedings of the 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018, held in Turin, Italy, in August 2018. The 57 full papers presented in this volume were carefully reviewed and selected from 194 submissions. They were organized in topical sections named: support tools and environments; performance and power modeling, prediction and evaluation; scheduling and load balancing; high performance architectures and compilers; parallel and distributed data management and analytics; cluster and cloud computing; distributed systems and algorithms; parallel and distributed programming, interfaces, and languages; multicore and manycore methods and tools; theory and algorithms for parallel computation and networking; parallel numerical methods and applications; and accelerator computing for advanced applications.

OpenMP: Memory, Devices, and Tasks

Title: OpenMP: Memory, Devices, and Tasks
Author: Naoya Maruyama
Publisher: Springer
Pages: 352
Release: 2016-09-28
Genre: Computers
ISBN: 3319455508


This book constitutes the proceedings of the 12th International Workshop on OpenMP, IWOMP 2016, held in Nara, Japan, in October 2016. The 24 full papers presented in this volume were carefully reviewed and selected from 28 submissions. They were organized in topical sections named: applications, locality, task parallelism, extensions, tools, accelerator programming, and performance evaluations and optimization.

Algorithms and Architectures for Parallel Processing

Title: Algorithms and Architectures for Parallel Processing
Author: Jesus Carretero
Publisher: Springer
Pages: 397
Release: 2016-11-30
Genre: Computers
ISBN: 3319499564


This book constitutes the refereed workshop proceedings of the 16th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2016, held in Granada, Spain, in December 2016. The 30 full papers presented were carefully reviewed and selected from 58 submissions. They cover many dimensions of parallel algorithms and architectures, encompassing fundamental theoretical approaches, practical experimental projects, and commercial components and systems that try to push beyond the limits of existing technologies, including experimental efforts, innovative systems, and investigations that identify weaknesses in existing parallel processing technology.

Programming Multicore and Many-core Computing Systems

Title: Programming Multicore and Many-core Computing Systems
Author: Sabri Pllana
Publisher: John Wiley & Sons
Pages: 522
Release: 2017-01-23
Genre: Computers
ISBN: 1119331994


Edited by Sabri Pllana (Linnaeus University, Sweden) and Fatos Xhafa (Technical University of Catalonia, Spain), this book provides state-of-the-art methods for programming multi-core and many-core systems. It comprises a selection of twenty-two chapters covering: fundamental techniques and algorithms; programming approaches; methodologies and frameworks; scheduling and management; testing and evaluation methodologies; and case studies for programming multi-core and many-core systems. Program development for multi-core processors, especially for heterogeneous multi-core processors, is significantly more complex than for single-core processors, yet programmers have traditionally been trained to develop sequential programs, and only a small percentage of them have experience with parallel programming. In the past, only a relatively small group of programmers interested in High Performance Computing (HPC) was concerned with parallel programming issues, but the situation has changed dramatically with the appearance of multi-core processors in commonly used computing systems. With the pervasiveness of multi-core processors, which affects a large spectrum of systems from embedded and general-purpose to high-end computing, parallel programming is expected to become mainstream. This book assists programmers in mastering the efficient programming of multi-core systems, which is of paramount importance for the software-intensive industry and a more effective product-development cycle. Key features: lessons, challenges, and roadmaps ahead; real-world examples and case studies; guidance on mastering the efficient programming of multi-core and many-core systems. The book serves as a reference for a larger audience of practitioners, young researchers, and graduate-level students; a basic level of programming knowledge is required to use it.