High-Performance Matrix Multiplication: Hierarchical Data Structures, Optimized Kernel Routines, and Qualitative Performance Modeling
Author: Wu, Wenhao
Item Type: Thesis
Advisor: Skjellum, Anthony
Committee: Luke, Edward A.; Reese, Donna S.
Abstract
The optimal implementation of matrix multiplication on modern computer architectures is of great importance for scientific and engineering applications. However, achieving optimal performance for matrix multiplication is continually challenged both by the ever-widening performance gap between the processor and the memory hierarchy and by the introduction of new architectural features.
The conventional way of dealing with these challenges relies on the blocking algorithm, which improves data locality in the cache memory, and on highly tuned inner kernel routines, which exploit the architectural features of the specific processor to deliver near-peak performance.
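For illustration, the blocking idea can be sketched in a few lines of C; the routine name and the tile size NB below are assumptions chosen for exposition, not the tuned kernel developed in this thesis:

    #include <stddef.h>

    /* Cache-blocking sketch: C += A * B for n-by-n row-major matrices.
     * NB is an illustrative tile size chosen so that three NB-by-NB
     * tiles fit in cache; it is not a value tuned in the thesis. */
    #define NB 64

    void matmul_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += NB)
            for (size_t kk = 0; kk < n; kk += NB)
                for (size_t jj = 0; jj < n; jj += NB)
                    /* Multiply one pair of tiles; the bounds handle
                     * matrix edges when n is not a multiple of NB. */
                    for (size_t i = ii; i < ii + NB && i < n; i++)
                        for (size_t k = kk; k < kk + NB && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + NB && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

Because each pair of tiles is reused from cache rather than fetched repeatedly from main memory, traffic to memory drops roughly by a factor of NB compared with the naive triple loop.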
A state-of-the-art improvement of the blocking algorithm is the self-tuning approach, which applies "heroic" combinatorial optimization over parameter spaces. Other recent approaches include one that explicitly blocks for the TLB (Translation Lookaside Buffer) and a hierarchical formulation that employs memory-friendly Morton Ordering (a space-filling curve methodology).
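For illustration, a Morton (Z-order) index can be computed by interleaving the bits of a block's row and column coordinates; the helpers below are a hypothetical sketch, not code from the thesis:

    #include <stdint.h>

    /* Spread the 32 bits of x so that they occupy the even bit
     * positions of a 64-bit word (a standard bit-interleaving step). */
    static uint64_t spread_bits(uint32_t x)
    {
        uint64_t v = x;
        v = (v | (v << 16)) & 0x0000FFFF0000FFFFULL;
        v = (v | (v <<  8)) & 0x00FF00FF00FF00FFULL;
        v = (v | (v <<  4)) & 0x0F0F0F0F0F0F0F0FULL;
        v = (v | (v <<  2)) & 0x3333333333333333ULL;
        v = (v | (v <<  1)) & 0x5555555555555555ULL;
        return v;
    }

    /* Morton index of block (row, col): row bits in the odd positions,
     * column bits in the even positions. */
    uint64_t morton_index(uint32_t row, uint32_t col)
    {
        return (spread_bits(row) << 1) | spread_bits(col);
    }

Storing matrix blocks in increasing Morton-index order keeps blocks that are neighbors in (row, column) space close together in memory, which is the locality property the hierarchical formulation exploits.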
This thesis compares and contrasts the TLB-blocking-based and Morton-Order-based methods for dense matrix multiplication and offers a qualitative model to explain the observed performance behavior. Comparisons with the performance of a self-tuning library and the vendor library on the Alpha architecture are also offered.
Benchmark experiments demonstrate that neither conventional blocking-based implementations nor self-tuning libraries achieve consistently high performance for dense matrix multiplication at relatively large square matrix sizes. Instead, architectural constraints evidently restrict the critical path and the options available for optimal performance, so that the relatively simple strategy and framework presented in this study offer higher and flatter overall performance.
Interestingly, maximal inner-kernel efficiency does not guarantee globally minimal multiplication time. Moreover, efficient, flat performance is achievable at all problem sizes that fit in main memory, in contrast to the "jagged" performance curves often observed with blocking and self-tuned blocking libraries.