Lakshminarayanan Renganarayana
Colorado State University
ln@cs.colostate.edu
Tiling is a widely used loop
transformation for exposing/exploiting parallelism and data locality. Effective
use of tiling requires selection and tuning of the tile sizes. This is usually
achieved by developing cost models that characterize the performance of the
tiled program as a function of tile sizes. All previous approaches to optimal
tile size selection (TSS) are cost model specific. Due to this they are neither
extensible (to richer program classes/newer architectures) nor scalable (to
multiple levels of tiling).
Instead of proposing yet
another TSS method, we take a fundamentally different approach and derive a
unified TSS framework based on a mathematical property shared by almost all TSS
models. Here are a few terms commonly used in the TSS models: total execution
time, number of cache/TLB misses, size of communication
messages/temporary storage, loop overhead, communication latency/bandwidth,
cache size, miss penalty. Although these functions and parameters are very
diverse, they all share a simple mathematical property, viz., positivity. We
describe this property and use it to identify a class functions called
posynomials which can be used to derive an efficient, scalable and cost-model
independent framework for optimal TSS. Our framework not only unifies the
variety of models proposed in the literature, but also lays the foundations to
build more sophisticated models.
The compute and data intensive
kernels of several important applications are loops. Tiling [11,25,14,28]
partitions the computations of the loop into groups called tiles. The
tiles often have better data locality and serve as units of coarse grained
parallelism. Typically multiple levels of tilings are used, e.g., a level of
tiling to obtain coarse grained parallelism, a level for cache locality, and
possibly another level for register reuse [5,29,21]. Recently, multi-level tiling has also been
successfully used in automatic mapping of loops onto GPUs [3]. With the advent of multi- and many-core
architectures multi-level tiling has become the standard design pattern for
deriving high-performance implementations.
Effective use of tiling requires techniques for tile
shape / size selection and tiled code generation. In this paper we focus on the
key step of tile size selection (TSS) and our solution can be combined with the
existing techniques for tiled code generation and tile shape selection. Tile
size selection is an important problem due its performance impact: often there
is a significant performance difference between a good tile size and a bad one.
The importance is also evident from the extensive study of TSS methods across
the last two decades.
Optimal Tile Size Selection is the problem
of selecting the tile sizes that are optimal with respect to some cost
model. For example, in the use of tiling to improve cache locality, consider
the selection of sizes x and y which form the sides of a 2D tile.
A widely used cost function is the number of cache misses. This cost function
is used together with the constraint that the data accessed by a given tile-tile
footprint-fits in the cache. The optimal TSS problem can be stated as
follows:
|
where, Misses(x,y) estimates the
number of misses experienced with tile of size x×y, FootPrint(x,y)
estimates the number of cache lines touched by a tile of size x×y,
and CacheCapacity is the capacity of the cache in number of lines. The
cost function together with the constraint is called the cost model. One
can view the optimal TSS problem as a constrained optimization problem and in
such a view the cost function is also referred to as the objective
function.
Solutions ranging from closed form solutions to heuristic algorithms and empirical search methods have been proposed (e.g., see [10,28,26,29]). However, all these solutions are cost model specific, i.e., their applicability is very sensitive to the particular properties of the functions used in the cost model. Due to this cost model specificity they are neither extensible to newer architectures / program class, nor scalable to multiple levels of tiling. The next section describes these limitations with an example.
In this section, we
discuss the two primary limitations of current TSS methods, viz., (i)
non-extensibility to newer architectures and program classes and (ii)
non-scalability to multiple levels of tiling. We first describe the design
process followed by current methods and using that highlight the limitations.
All TSS solutions proposed currently in the literature follow a design process
which can be summarized as follows:
1.
Design a cost
model. This includes the design of a cost metric (objective
function) that estimates a desired quantity as a function of tile sizes and
constraints that qualify tile sizes as valid or not. The cost models seek to
estimate quantities that are related to the execution characteristics and hence
are inherently strongly tied to the class of programs and architectural features
for which they are designed.
2.
Reason about
the structure of the cost functions. For example, one can check whether
the objective function is linear or quadratic in terms of the tile size
variables.
3.
Exploit the
properties of functions to derive a closed form solution or a heuristic/search
algorithm.
As an illustration, consider the optimal TSS problem
proposed by Andonov et al. [1]. They study the problem of tiling 2D iteration spaces
with uniform dependencies for parallel SPMD style execution. After a detailed
study of the class programs they want to tile, the architectural parameters,
and the execution characteristics they come up with a cost model. The objective
function T(x,y) estimates the
total (parallel) execution time of the tile program and the goal is to pick the
tile sizes that minimize this metric. The objective function and the
constraints can be abstractly viewed as
|
where x,y are the tile size variables and A,B,C,D,E
are constants. Then they use the following reasoning to obtain a closed form
solution: for xy=K, K Î R the
function T(x,y) monotonically
decreases with x. As a result, the optimal solution is on certain
boundaries of the feasible space, and using this information, one of the
variables can be eliminated, yielding a closed form solution for x and y.
A subtle but
important feature of the above process is the following: the cost model is
strongly coupled to the program class/architectural features and the solution
(method) is derived by exploiting the properties of the functions used in the
cost model. Any extension of the cost model to a different architecture, richer
program class, or to multiple levels of tiling, change the structure
of the functions used in the cost model, and hence leave the solution
(method) inapplicable. For example, an extension of the Andonov et al.'s model
to a richer program class, viz., 3D iteration spaces, leaves their solution
method inapplicable.
All optimal TSS
solutions proposed in the literature are cost model specific and do not lend
themselves to extensions. Any non-trivial extension typically requires an
effort equal to or more than the earlier one, and are often publishable results
(e.g., extension from direct mapped caches to set associative caches, from 2D
to nD iteration spaces, etc.). Generally, one wants to use an TSS solution for a program class or architecture which is
slightly different than the one considered by the author of the solution. But
accounting for the differences lead to changes in the cost model, which leave
the solution inapplicable. This is in fact an important reason for the
popularity of exhaustive search (run the program for
different tile sizes and pick the best).
The scalability
limitation of the current approaches also stems from their strong dependence on
the properties of the functions used in the cost model. For example, in a 2D
one level tiling, the optimal TSS problem has the two tile sizes as variables
and functions used in the cost models are of degree at most two (linear,
quadratic, hyperbolic, etc.) and are easy to reason about. However, when we
move to two levels of tiling there are four variables and functions of four
variables with degree up to four are much harder to reason about. Single level
empirical search solutions do not scale to multiple levels due to the
combinatorial explosion of the search space size (induced by the increase in
the number of tile size variables and also the possible values for the tile
sizes). A simple solution is to find the tile sizes in a level-by-level
approach (find tile sizes for one level, and use them to find the ones for the
next level). However, with the multi-core architectures, there are complex
interactions between different levels of resources (e.g., parallelism and
locality) and due to this such a level-by-level
approach is clearly sub-optimal. This sub-optimality of the level-by-level
approach is also shown, in three different architectural scenarios, by Mitchell
et al. [16].
To summarize, the
cost-model specificity of the solution methods lead to their non-extensibility
and non-scalability. Our framework overcomes these limitations by providing a
cost-model independent solution method. When using our framework one does not
have to perform the second and third steps of the traditional design process
described above.
Instead of
proposing yet another TSS method, we take a fundamentally different approach
and seek to identify the common property shared by the variety of TSS models.
Several authors have exploited particular properties such as linear, quadratic,
hyperbolic, etc., of the cost functions to derive optimal TSS solutions.
Instead of exploiting the specific properties of a cost model to derive a
solution, we identify a fundamental mathematical property that is inherent to
the TSS models and use it derive a unified TSS framework. Table 1 lists several functions and parameters that are used
in TSS models. There is a fundamental mathematical property shared by all them,
viz., positivity. All the machine and system parameters are positive
quantities and the functions model positive quantities. The tile sizes which
appear as variables in these functions are also positive. Essentially, the
functions used in TSS models estimate positive quantities using positive
parameters and positive variables. This positivity property might seem to be a
simple one, but it has profound implications. This positivity property
distinguishes the class of optimization problems that are solvable in
polynomial time1 and those that are
not [4]. As we show in
the coming sections we can use this property as a basis to identify a class of
polynomials called posynomials which can be used to formulate optimal
TSS problems that can be solved efficiently.
|
Functions modeling quantities of interest |
|
|
Cache/TLB miss
penalty Cache/TLB sizes Number of
registers Number of
functional units Latency of
functional units Network latency Network bandwidth
MPI Call start-up
cost |
Tile volume Number of tiles Number of cache
misses Cache/register
foot print Idle time in
parallel execution Communication
volume Temporary
storage size Array pad size |
Table 1: Parameters and functions
that are widely used in TSS models.
Our
framework provides a single unified solution for TSS which is useful in the
model-based approach as well as the hybrid model-driven empirical search
approach. In the later case, our framework would be used in the first filtering
or starting point selection phase [30]. There by, our
framework is a good candidate for inclusion in compilers as well as
auto-tuners. Further, it is insightful to find that such a unified framework
can be derived by exploiting a simple but fundamental property shared by all
TSS models.
We would like to emphasize that the
goal of this paper is not to show the benefits of loop tiling-several authors
across two decades have shown that. Our goal is to propose an efficient
framework which not only unifies the variety of models proposed in the
literature, but also lays the foundations to build more sophisticated models.
To this end, we substantiate our claim by (i) taking five different TSS models
proposed in the literature by different authors and casting them into our framework
and (ii) showing that almost all the cost functions used in the TSS literature
are posynomials. This paper makes the following contributions:
Monomials and posynomials are used
in building the cost models. TSS models built with posynomials can be
transformed into a particular class of numerical convex optimization problems
called Geometric Programs (GP) which can solved efficiently [8,4]. We refer to GPs solved for integer solutions as
Integer Geometric Programs (IGP).
Let x denote the vector (x1,x2,¼,xn) of n real, positive
variables. A function f is called a posynomial function of x
if it has the form
|
where cj ³ 0 and aij Î R. Note
that the coefficients ck must be non negative, but the
exponents aij can be any real numbers, including negative or
fractional. When there is exactly one nonzero term in the sum, i.e., t=1
and c1 > 0, we call f a monomial function.2 For example,
0.7+2x1/x32+x20.3
is a posynomial (but not a monomial); 2.3(x1/x2)1.5
is a monomial (and, hence a posynomial); while 2x1/x32-x20.3 is neither.
Monomials and posynomials enjoy a
rich set of closure properties, which are very useful in composition of smaller
(say single level) TSS models to build larger (multi-level) ones. Monomials are
closed under product, division, non-negative scaling, power and inverse.
Posynomials are closed under sum, product, non-negative scaling, division by monomials, and positive integer powers.
Recent advances [4] in convex optimization provide efficient polynomial
time solution methods. GPs can be transformed into convex optimization problems
using a variable substitution and solved efficiently using polynomial time
interior point methods [12,4]. The positivity
property of the posynomials is extensively exploited in this transformation of
GPs to convex optimization problems. The computational complexity
of solving GPs are similar to that of solving linear programs.
Continuous real solutions can be found very quickly in polynomial time. Integer
solutions need a branch and bound style algorithm, which in the worst-cast can
take exponential time. However, we have found (cf. Sec 4.3) that for optimal TSS problems the IGPs are very small
(few tile size variables and constraints) and solutions can be found quickly.
Further, in the context of TSS, it is very common to solve for real solutions
and round them to obtain integer solutions. In such an approach we can obtain
the solution in polynomial time irrespective of the complexity of the model.
Posynomials are
well suited for modeling TSS problems. The appropriateness is evident from the
fact that almost all TSS cost functions considered in the literature turn out
to be posynomials. A few of them are discussed in this section.
The fact that such a large number of TSS
models-proposed across two decades by a several different authors-can all be reduced to the posynomial based TSS framework shows its
generality and wide applicability. We have given in [18] detailed evidence of the wide applicability of our
framework via a reduction of five different tiling models (from a wide range of
tiling contexts) to our framework. Due to space limitations we do not discuss
these reductions here.
|
Cost function used for selecting the optimal tile
size |
|
|
ESS [9] |
C/(h*w) |
|
LRW [13]
|
1/h+1/w+(2h+w)/C |
|
TSS [7] |
(2h+w)/h*w |
|
EUC [22]
|
1/h+1/w |
|
MOON [17] |
1/h+1/w+(h+w)/C |
|
TLI [6]
|
1/h+1/w+(h+w)/C+h*w/C2 |
|
WMC [27] |
C/h*w |
|
MHCF [16] |
(1/h+1/w)(1/n+1/l)+2/(h*w) |
Table 2: Cost functions used in the literature for optimal cache locality tiling are shown, where C is the cache size, h,w represent the height and width of the rectangular tile, n represents the size of a 2D array and l represents the cache line size. A simple inspection shows that they are all posynomials. This table is derived from Hsu and Kremer [10,table 2].
We have implemented the TSS framework as a tool called PosyOpt. The
implementation uses MATLAB and YALMIP [15]
a tool which provides a symbolic interface to several optimization tools on top
of MATLAB. The symbolic interface allows a high level specification of the
optimal TSS problems. The overall structure of our tool PosyOpt is shown in
Figure 1. The optimal TSS problems are
specified at a high level using posynomials as a IGP.
These problems are then automatically transformed to a convex optimization
problem. The transformed problem is then fed to the convex optimization solver
of MATLAB and solved for real solutions. Integer solutions are found via a
branch-bound algorithm which internally uses the MATLAB solver for solving
continuous relaxations. The output of our tool is the set of optimal tile
sizes.
Figure 1: Overall structure of the PosyOpt tool.
Note that the specification and subsequent refinement and extensions are
performed at the posynomial level (cf. top box in Figure 1). These steps are done without any concern
about the solution method. The only concern is that the specifications and
extensions use posynomials. Further, as shown in the Figure 1 (top box) different models can be combined
or composed together to form multi-level tiling models. We have found the
closure properties of monomials and posynomials to be very useful during the
extensions and compositions.
|
Real |
Real |
Integer |
Integer |
|
|
|
Yalmip |
Solver |
Yalmip |
Solver |
|
tm1n2 |
0.087 |
0.045 |
0.09 |
0.06 |
|
tm1n3 |
0.1 |
0.05 |
0.11 |
1.21 |
|
tm2n2 |
0.09 |
0.07 |
0.09 |
0.26 |
|
tm2n3 |
0.1 |
0.05 |
0.11 |
0.05 |
|
SOB |
0.08 |
0.04 |
0.08 |
0.04 |
Table 3: Running times in seconds of the PosyOpt tool
on various optimal TSS problems.
Using our tool, we formulated
and solved a variety of single level and two level optimal TSS problems. The
number of variables in any optimization problem is determined by the loop nest
depth and the number of levels of tiling. For example, a 2D loop nest tiled
twice, would have 4 variables in the optimal TSS problem. The time our tool
takes to find the integer and real (continuous) solutions are shown in
Table 3. The time taken to find
real and integer solutions are shown with the break up of the time taken by
Yalmip to process the problem specified in symbolic form and the time taken by
the solver to find the solutions. For the first four problems (rows) tmXnY means a
loop nest of depth Y tiled X number of levels and are taken from data locality
tiling models in [19]. For example, tm2n3
means a 3D loop nest tiled twice. The last row is from a TSS problem for single
level tiling of 3D stencil computations for parallelism, taken from [21]. We can note that on the average
the overhead for Yalmip's preprocessing is around 0.1 seconds. The average
solver time to find the real solutions is around 0.05 and the solver time for
integer solutions is between 1.21 and 0.06 seconds. One can observe that for
many of the cases the solver time for real and integer solutions are very
close. These are the cases for which a rounding of the real solution is equal
to the integer solution. For the uses in the context of an auto-tuner or a
compiler, the current speed of our tool seems very reasonable, particularly
given the ease with which the problem can stated, refined, and solved.
We got the insight about the
positivity property of the TSS models after we developed three different cost
model specific TSS methods [19,20,21]. While deriving
these solutions we observed that we were repeatedly using the same class of
functions (posynomials) and the same optimization problem (GPs). This led us to
seek for the common properties shared by the TSS models.
We have proposed a framework
based on a simple yet fundamental property of functions used in TSS models. Our
framework not only generalizes the TSS models proposed in the literature, but
also provides the foundation for developing more sophisticated and particularly
multi-level tiling models. The closure properties of posynomials can be
directly exploited to build multi-level tiling models by composing the well
understood single level models. We are currently developing one such posynomial
based multi-level TSS models for OpenMP based parallel programs. Our framework
provides a single unified solution for TSS that is useful in both compilers and
auto-tuners.
R. Andonov
and S. Rajopadhye. Optimal
orthogonal tiling of 2-d iterations. Journal of Parallel and
Distributed Computing, 45(2):159-165, September 1997.
Rumen Andonov, Stephan Balev,
Sanjay V. Rajopadhye, and Nicola Yanev. Optimal semi-oblique tiling. IEEE
Trans. Parallel Distrib. Syst., 14(9):944-960, 2003.
Muthu Manikandan
Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam,
Atanas Rountev, and
P. Sadayappan. Automatic data movement and computation mapping for
multi-level parallel architectures with explicitly managed memories. In PPoPP '08: Proceedings of the 13th ACM SIGPLAN
Symposium on Principles and practice of parallel programming, pages 1-10,
New York, NY, USA, 2008. ACM.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization.
Larry Carter, Jeanne Ferrante,
and Susan Flynn Hummel. Hierarchical tiling for improved superscalar performance. In
IPPS '95: Proceedings of the 9th International Symposium on Parallel
Processing, pages 239-245,
Jacqueline Chame and Sungdo Moon. A tile selection algorithm for data locality and cache
interference. In Proceedings of the 13th international conference on
Supercomputing, pages 492-499. ACM Press, 1999.
Stephanie Coleman and Kathryn S. McKinley. Tile size selection using cache
organization and data layout. In Proceedings of the ACM SIGPLAN 1995
conference on Programming language design and implementation, pages
279-290. ACM Press, 1995.
R.J. Duffin, E.L. Peterson,
and C. Zener. Geometric Programming - Theory
and Applications. John Wiley, 1967.
Karim Esseghir. Improving data locality for
caches. Master's thesis,
Chung hsing Hsu and Ulrich
Kremer. A
quantitative analysis of tile size selection algorithms. J. Supercomput., 27(3):279-294, 2004.
F. Irigoin and R. Triolet. Supernode
partitioning. In 15th ACM Symposium on Principles of Programming
Languages, pages 319-328. ACM, Jan 1988.
K. O. Kortanek, Xiaojie Xu, and Yinyu Ye. An infeasible interior-point algorithm for solving primal and dual
geometric programs. Math. Program.,
76(1):155-181, 1997.
Monica D. Lam, Edward E. Rothberg, and
Michael E. Wolf. The cache performance and optimizations of blocked algorithms.
In Proceedings of the fourth international conference on Architectural support
for programming languages and operating systems, pages 63-74. ACM Press, 1991.
Monica S. Lam and Michael E. Wolf. A data locality optimizing
algorithm (with retrospective). In Best of PLDI, pages 442-459,
1991.
J. Löfberg. YALMIP : A toolbox for
modeling and optimization in MATLAB. In Proceedings of
the CACSD Conference,
N. Mitchell,
N. Hogstedt, L. Carter, and J. Ferrante. Quantifying the multi-level
nature of tiling interactions. International Journal of Parallel
Programming, 26(6):641-670, 1998.
S. Moon
and R. Saavedra. Hyperblocking:
A data reorganization method to eliminate cache conflicts in tiled loop nests.
Technical Report TR-98-671,
Lakshminarayanan Renganarayana. Scalable
and Efficient Tools for Multi-level Tiling. PhD
thesis, Department of Computer Science,
Lakshminarayanan Renganarayana
and Sanjay Rajopadhye. A geometric programming framework
for optimal multi-level tiling. In SC '04: Proceedings of the 2004
ACM/IEEE conference on Supercomputing, page 18,
Lakshminarayanan Renganarayana,
Ramakrishna Upadrasta, and Sanjay Rajopadhye. Optimal ILP and register tiling: Analytical model and
optimization framework. In LCPC 2005: 12th International Workshop on Languages
and Compilers for Parallel Computing. Springer Verlag,
2005.
Lakshminarayanan Renganarayanan,
Manjukumar Harthi-kote, Rinku Dewri, and Sanjay Rajopadhye. Towards optimal multi-level tiling for stencil computations.
In 21st IEEE International Parallel and Distributed
Processing Symposium (IPDPS), 2007.
Gabriel Rivera and Chau-Wen
Tseng. A
comparison of compiler tiling algorithms. In CC '99: Proceedings of
the 8th International Conference on Compiler Construction, pages 168-182. Springer-Verlag, 1999.
V. Sarkar. Automatic selection of high-order transformations
in the ibm xl fortran compilers. IBM J. Res. Dev.,
41(3):233-264, 1997.
Vivek Sarkar. Optimized unrolling of nested
loops. International Journal of Parallel Programming,
29(5):545-581, 2001.
R. Schreiber
and J. Dongarra. Automatic
blocking of nested loops. Technical Report 90.38, RIACS,
R. Clint
Whaley and Jack J. Dongarra. Automatically
tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE
conference on Supercomputing (CDROM), pages 1-27. IEEE
Computer Society, 1998.
M. Wolf,
D. Maydan, and D. Chen. Combining loop
transformations considering caches and scheduling. In 29th
International Symposium on Microarchitecture,
December 1996.
Jingling Xue. Loop Tiling For Parallelism. Kluwer Academic
Publishers, 2000.
K. Yotov, Xiaoming Li, Gang Ren, M. J. S. Garzaran,
D.
Kamen Yotov, Keshav Pingali, and Paul Stodghill.
Think globally, search locally. In ICS '05: Proceedings of the 19th annual
international conference on Supercomputing, pages 141-150,
1Use of polynomial functions with this property leads to convex optimization problems which can be solved for real solutions in polynomial time. On the other hand, optimization problems formulated with arbirtrary polynomials are not solvable in polynomial time.
2Note that this definition of monomial is different from the standard one used in algebra.