SC18:G: RACE - Recursive Algebraic Coloring Engine

Christie Louis Alappat and Gerhard Wellein
Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg, Germany

1 Problem & Motivation
Sparse linear algebra is a key component in many scientific simulations ranging from quantum physics to fluid and structural mechanics. However, iterative numerical methods and important building blocks of sparse linear algebra frequently feature strong data dependencies, making them difficult to parallelize. Typically, loop-carried dependencies occur in iterative solvers (e.g., Kacmarz, Gauss-Seidel) or preconditioners and write conflicts show up in the parallelization of building blocks such as symmetric sparse matrix-vector multiplication. Scalable, hardware-efficient parallelization of such methods and kernels is known to be a challenge. Multi-coloring is a widely used approach to enable parallelization of iterative solvers with distance-\(k\) dependency; e.g., the red-black Gauss-Seidel algorithm solves the distance-1 dependency problem. However, most of those standard solutions suffer from low performance on modern hardware, are highly problem specific, or require tailored sparse matrix storage formats.

RACE addresses these shortcomings by combining ideas from graph traversal and multi-coloring to ensure data locality, to generate appropriate levels of parallelism, and to enable hardware-efficient parallelization schemes. It is applicable to many problems (i.e., matrix structures) and general sparse data storage formats.

Outline
The paper is structured as follows. Section 2 describes the underlying dependency problems and conventional solutions. Section 3 demonstrates the major drawbacks of the existing approaches. We then introduce the RACE method in Section 4, its uniqueness and how its basic design addresses the existing problems. In Section 5 we compare RACE performance for thread-level parallelization of symmetric sparse matrix-vector multiplication (SymmSpMV) to available standard solutions including Intel MKL. Finally we use RACE to parallelize a sparse eigenvalue solver provided by Intel MKL and demonstrate RACE’s superiority in terms of performance and attainable problem sizes.

2 Background & Related Work
Data dependencies often prevent a straightforward parallelization of sparse linear algebra kernels. As a representative and highly relevant example for a distance-2 dependency problem, we use symmetric sparse matrix-vector multiplication (SymmSpMV). Algorithm 1 shows the pseudo-code of the basic SymmSpMV kernel for upper triangular matrices stored in Compressed Row Storage (CRS) [7] format. The kernel exploits the symmetry of the matrix (\(A_{ij} = A_{ji}\)) to reduce storage size and overall memory traffic, which is known to be pivotal hardware bottleneck for this operation on all modern compute devices. However, SymmSpMV cannot be parallelized easily as different threads working on different rows in parallel could potentially write to the same element \(b[col[idx]]\), causing write conflicts. In terms of graph theory this means a vertex (row in a matrix) and its distance-2 neighbors [9] cannot be operated on in parallel. Here we concentrate on such distance-2 dependency problems, although the underlying method and library is capable of handling the general case of distance-\(k\) dependencies as well.

A popular approach to solve the above problem is multi-coloring (MC). The earliest work on coloring is the red-black Gauss-Seidel scheme [6], which was applied to matrices with a known regular sparsity pattern. Later multi-coloring techniques were expanded using graph theory for general sparse matrices [10, 15]. Recent variants like algebraic block multi-coloring (ABMC) [14] tried to improve the performance of MC methods. In [8], MC was applied to the Kaczmarz iterative solver [16], which has the same distance-2 dependency as SymmSpMV. Specifically for SymmSpMV there has been no previous attempt to use multi-coloring techniques. General solutions for SymmSpMV are lock-based methods and thread-private target arrays [5, 11]. Depending on the matrix structure these solutions can lead to performance degradation due to serialization and massive increase in data traffic.

Recent research in this direction uses specialized storage formats like CSB [2] or RSB [20], but this requires rewriting of existing code and substantial tuning efforts.

3 Uniqueness of the Approach
Multi-coloring methods can extract parallelism for kernels with data dependencies like SymmSpMV. For distance-2 coloring of a matrix, MC groups rows that do not overlap in

---

Algorithm 1 SymmSpMV kernel, \(b = Ax\), in CRS format.

//Loop over all matrix rows
1: \textbf{for} row = 1 \textbf{to} nrows \textbf{do}
2: \hspace{1em} diag_idx = rowPtr\[row]\n3: \hspace{1em} b[row] += A[diag_idx] \times x[row] //Loop over all non-zero entries in a row
4: \textbf{for} idx = rowPtr\[row] + 1 \textbf{to} rowPtr[\row + 1] \textbf{do}
5: \hspace{1em} b[row] += A[idx] \times x[col[idx]]
6: \hspace{1em} b[col[idx]] += A[idx] \times x[row]
ACM-SRC Grand Finals, 2019, USA

C. Alappat

(see algorithm 1) the original matrix has good data locality. However this process comes at the cost of destroying data locality in the matrix by the required permutations. In the SymmSpMV example (see algorithm 1) threads within a color operate on different rows having entirely different col[idx] avoiding write conflicts in b vector. Note, that within a color (for e.g., red) none of the rows share same column index. As the matrix is traversed row by row (see algorithm 1) the original matrix has good data locality and most of the indirect vector accesses (v[col[idx]] and b[col[idx]]) correspond to nearby elements that were loaded in the computation of previous rows. This ensures these vectors need to be loaded only once from main memory, and the rest of the accesses are served by fast caches. However coloring the matrix destroys this data locality. For example in Figure 1b computing all the red colored rows leads to loading the entire vector completely. If the cache holds only six elements, computation on green and blue rows require loading almost the entire vector again from the slow main memory.

Destroying data locality along with secondary effects like synchronization costs and false sharing may, thus lead to severe performance degradation for MC methods. We demonstrate the impacts on performance and data transfer volumes for the SymmSpMV computations in Figure 2 for a single 10-core Intel Ivy Bridge EP (E5-2660 v2) CPU clocked at 2.2 GHz. The experiment was done on a large (number of rows = 10400600) Spin^p ∼ 26 [24] matrix taken from quantum physics application. We find that performance of MC methods scale decently within a socket but are far off the RACE performance which saturates main memory bandwidth at 6-7 cores (Figure 2a). The reason for the large performance difference is given in Figure 2b which shows the average main memory data traffic per non-zero of the general matrix during SymmSpMV execution. It can be clearly seen that the memory traffic is almost 4× higher for the MC method compared to ideal traffic (red line) predicted by an appropriate performance model. The extra data traffic is mainly due to the low data locality and thereby incurred extra accesses of the indirectly accessed vectors. Algebraic block multi-coloring (ABMC) tries to reduce the memory traffic by first partitioning the matrix into blocks and then applying coloring. This improves (reduces) the data traffic compared to MC but is still far from optimal in this case.

As main memory bandwidth is the main bottleneck on modern compute devices, this extra traffic reflects directly on the performance. This is seen in Figure 2a, where the performance is shown in giga floating point operations in seconds (GF/s). The ideal performance as predicted by performance model is ≈ 7.6 GF/s (not shown in figure) for this matrix, but MC and ABMC are well below this limit. However our RACE method closely approaches the ideal values both for the data traffic and performance and provides a speed-up of almost 4× compared to other methods.

4 RACE Method

RACE was designed with the shortcomings of coloring approaches in mind. The idea is to have a general hardware-friendly approach applicable even for simple matrix storage formats like CRS. The RACE method consists of three steps: (1) level construction, (2) distance-k coloring, and (3) load balancing. Depending on the matrix and hardware the steps are applied recursively if required. To illustrate the method we choose a simple matrix which is associated with an artificially constructed two-dimensional-seven-point (2d-7pt) stencil. Figure 3a shows the corresponding graph and the

Figure 1. Illustration of data locality degradation due to MC. Numbers represent thread id. Note that this figure shows only rows of matrix permuted according to MC, but in practice one would permute both rows and columns.

any column entries [9] (structurally orthogonal rows). These groups of rows are referred to as colors and parallelization can be done across the rows of a color (see Figure 1 for a simple example). However this process comes at the cost of destroying data locality in the matrix by the required permutations. In the SymmSpMV example (see algorithm 1) threads within a color operate on different rows having entirely different col[idx] avoiding write conflicts in b vector. Note, that within a color (for e.g., red) none of the rows share same column index. As the matrix is traversed row by row (see algorithm 1) the original matrix has good data locality and most of the indirect vector accesses (v[col[idx]] and b[col[idx]]) correspond to nearby elements that were loaded in the computation of previous rows. This ensures these vectors need to be loaded only once from main memory, and the rest of the accesses are served by fast caches. However coloring the matrix destroys this data locality. For example in Figure 1b computing all the red colored rows leads to loading the entire vector completely. If the cache holds only six elements, computation on green and blue rows require loading almost the entire vector again from the slow main memory.

Destroying data locality along with secondary effects like synchronization costs and false sharing may, thus lead to severe performance degradation for MC methods. We demonstrate the impacts on performance and data transfer volumes for the SymmSpMV computations in Figure 2 for a single 10-core Intel Ivy Bridge EP (E5-2660 v2) CPU clocked at 2.2 GHz. The experiment was done on a large (number of rows = 10400600) Spin^p ∼ 26 [24] matrix taken from quantum physics application. We find that performance of MC methods scale decently within a socket but are far off the RACE performance which saturates main memory bandwidth at 6-7 cores (Figure 2a). The reason for the large performance difference is given in Figure 2b which shows the average main memory data traffic per non-zero of the general matrix during SymmSpMV execution. It can be clearly seen that the memory traffic is almost 4× higher for the MC method compared to ideal traffic (red line) predicted by an appropriate performance model. The extra data traffic is mainly due to the low data locality and thereby incurred extra accesses of the indirectly accessed vectors. Algebraic block multi-coloring (ABMC) tries to reduce the memory traffic by first partitioning the matrix into blocks and then applying coloring. This improves (reduces) the data traffic compared to MC but is still far from optimal in this case.

As main memory bandwidth is the main bottleneck on modern compute devices, this extra traffic reflects directly on the performance. This is seen in Figure 2a, where the performance is shown in giga floating point operations in seconds (GF/s). The ideal performance as predicted by performance model is ≈ 7.6 GF/s (not shown in figure) for this matrix, but MC and ABMC are well below this limit. However our RACE method closely approaches the ideal values both for the data traffic and performance and provides a speed-up of almost 4× compared to other methods.

4 RACE Method

RACE was designed with the shortcomings of coloring approaches in mind. The idea is to have a general hardware-friendly approach applicable even for simple matrix storage formats like CRS. The RACE method consists of three steps: (1) level construction, (2) distance-k coloring, and (3) load balancing. Depending on the matrix and hardware the steps are applied recursively if required. To illustrate the method we choose a simple matrix which is associated with an artificially constructed two-dimensional-seven-point (2d-7pt) stencil. Figure 3a shows the corresponding graph and the

Figure 1. Illustration of data locality degradation due to MC. Numbers represent thread id. Note that this figure shows only rows of matrix permuted according to MC, but in practice one would permute both rows and columns.

See Section 5 for more details on modeling.

Figure 2. (a) Performance of SymmSpMV with MC and ABMC compared to RACE. (b) Average main memory data traffic in bytes (B) per nonzero entry (N_{nz}) of the full matrix as measured with LIKWID tool [26]. The ideal data traffic as predicted by performance model is shown for reference.
The distance-vertex corresponding to each level in a data structure called vertices in the vertex numbering in the permuted graph has changed to resolve dependencies. Two vertices are called distance-k independent if they are not distance-k neighbors. Based on this definition it can be proven that vertices between levels \( L(i) \) and \( L(i ± (k + j)) \) are distance-k independent \( \forall j ≥ 1 \). The levels that satisfy this criterion are called distance-k independent levels.

The above approach allows for many choices to form distance-k independent levels. Figure 4 shows one such possibility for distance-1 and distance-2 coloring each. As \( L(i) \) and \( L(i ± 2) \) are distance-1 independent, the distance-1 coloring assigns two colors to alternating levels. In case of distance-2 we group two adjacent levels and apply distance-1 coloring to the groups. These groups of levels are called level-groups and the \( i-th \) level-group is denoted as \( T(i) \) (see Figure 4b). For distance-1 coloring shown in Figure 4a the levels and level-groups coincide \( (L(i) = T(i)) \). In both cases all the vertices between level-groups of same color are distance-1/distance-2 independent and can be executed in parallel. For example, in case of distance-2, level-groups \( T(0), T(2), T(4) \) and \( T(6) \) can be executed by four threads in parallel. After synchronization the remaining four blue level-groups can be executed in parallel. Note that within a level-group/level the vertices are computed serially without destroying any data locality.

Choosing the same number of levels per level-group may cause severe load imbalance depending on the matrix. For example, in Figure 4b level-groups at extreme ends \( T(0), T(7) \) have a relatively low number of vertices (proportional to computational work) compared to the level-groups in the middle \( (T(3), T(4)) \).

4.3 Load Balancing

RACE applies a load-balancing scheme among the threads within each color. It generates just the right number of level-groups as required by the hardware (i.e., the number of available threads) and then applies a load-balancing algorithm that minimizes the variance among the number of available threads.

4.1 Level Construction

In the first step we determine the levels of a graph and permute the data structure accordingly. Here, we use well-known bandwidth reduction algorithms like Reverse Cuthill McKee (RCM)[3] or Breadth-first search (BFS)[19]. Although RCM is implemented in RACE, in the following we apply BFS reordering for better illustration. We start with choosing a root vertex and assign it to the first level \( L(0) \). The next levels \( L(i) \) are defined to contain all vertices that are directly related to the previous level \( L(i-1) \) but are not in \( L(i-2) \). This implies that the \( i-th \) level consists of all vertices that have a minimum distance of \( i \) from the root node. In Figure 3 the level numbers \( (i) \) are denoted in the superscript of the vertices.

After the levels are determined we permute (reorder) the matrix (and graph) according to the levels such that the vertices in \( L(i) \) appear before \( L(i+1) \). Figure 3b shows the graph and matrix after applying the permutation. Note that the vertex numbering in the permuted graph has changed compared to the original lexicographically ordered matrix. It is well known that such a permutation improves data locality, and it was previously applied to sparse matrix computations without dependencies [21].

In order to resolve dependencies, RACE additionally keeps information about the levels by storing the index of the first vertex corresponding to each level in a data structure called level_ptr (see Figure 3c).

4.2 Distance-k Coloring

The distance-k coloring step uses the information of the level_ptr to resolve dependencies. Two vertices are called distance-k neighbors if the shortest path connecting them consists of at most \( k \) edges [9]. This implies two vertices are distance-k independent if they are not distance-k neighbors. Based on this definition it can be proven that vertices between levels \( L(i) \) and \( L(i ± (k + j)) \) are distance-k independent \( \forall j ≥ 1 \). The levels that satisfy this criterion are called distance-k independent levels.
We evaluate the performance of RACE by parallelizing the SymmSpMV kernel shown in Algorithm 1. This allows a clear picture of the performance advantage of RACE. Finally, we use RACE to parallelize an eigenvalue solver, and compare against standard approaches.

5.1 Analysis of SymmSpMV Performance

Matrix-vector multiplication is frequently used in numerical algorithms. In many cases, however, the lack of an efficient and generic SymmSpMV implementation leads to the full (general) matrix being stored and used even if it is symmetric, which wastes not only CPU cycles but also memory. Modern HBM (High Bandwidth Memory) technology with its rather limited memory sizes makes this problem even more severe.

In this section we carry out experiments using a SymmSpMV kernel. We choose most of the test matrices from the public SuiteSparse Matrix Collection [4], that are frequently used in related publications [20, 22], as well as some from the quantum physics context in which RACE was developed [1]. The experiments are run on one Intel Skylake SP Gold 6148 CPU (20 threads) at a fixed clock speed of 2.4 GHz. The reported performance is purely for the SymmSpMV computation as in practical applications these kernels are called multiple times, making other costs (like setup time) negligible.

To establish a sensible performance baseline we use the Roofline model (RLM) [27] along the lines of [18] but adjusted for the SymmSpMV kernel. Figure 6a shows the performance of RACE on different matrices along with the range of upper performance bounds based on two saturated memory bandwidth measurements (RLM-load and RLM-copy). In almost all cases, RACE attains more than 85% of the possible maximum. This can be attributed to the good data locality and minimal data traffic, which we have already demonstrated in Figure 2b for the Spin-26 matrix.

In Figure 6a we compare against the SymmSpMV implementation of the latest version of Intel MKL [13], which uses the Inspector-Executor routines. The comparisons show that RACE outperforms MKL by a factor of 1.5× on an average. A simple analysis shows that the performance of the Intel MKL SymmSpMV kernel coincides with the matrix-vector multiplication using the full matrix (SpMV). We can only speculate (due to it being closed source) that MKL converts the symmetric matrix to a full matrix internally and then does a general SpMV operation.

We also compare RACE with two widely used coloring methods, MC and ABMC. For MC we apply the multi-coloring scheme generated by the COLPACK [10] library to parallelize the SymmSpMV kernel. In the ABMC method we first partition the matrix into blocks using METIS [17] and then apply coloring via COLPACK. The size of blocks was determined by a parameter scan (range 4 . . . 128, see [14]). Figure 6b shows the resulting performance data. Overall the MC method is not competitive, while ABMC delivers modest performance (85% of RACE) for small matrices. For large matrices, however, where data locality plays a vital role, ABMC falls substantially behind RACE as the indirect access to the vectors impairs temporal and spatial access locality. Overall, RACE shows an average speedup of 1.6× compared to ABMC, while in some cases the speedup is as high as 3×.

5.2 FEAST with RACE

FEAST[23] is a modern algorithm to compute inner eigenvalues. It uses contour integration to generate a subspace
(reduced system) containing the eigenvalues, which are then solved using a classic Rayleigh-Ritz procedure. The solver is well-suited for very large sparse systems, and is of particular interest in the field of quantum mechanics. The hot spot of the algorithm (more than 95%) is a solver for shifted linear systems \((A - \sigma I = b)\). These systems are, however, highly ill-conditioned, posing severe convergence problems for most linear iterative solvers. The standard approach is therefore to use direct solvers. However, in [8] it has been shown that the Kaczmarz iterative solver accelerated by a Conjugate Gradient (CG) method (the so-called CGMN solver [12]) is a robust alternative to direct solvers. Similar to SymmSpMV, the Kaczmarz method has a distance-2 dependency, making it difficult to parallelize. In [8], multi-coloring was used to parallelize the kernel. We have implemented a shared-memory parallel version of CGMN with RACE for use in FEAST.

We use the FEAST implementation of Intel MKL, which by default employs the PARDISO direct solver [25], but its Reverse Communication Interface (RCI) allows us to plug our CGMN implementation instead. In the following experiment we find ten inner eigenvalues of a simple discrete Laplacian matrix to an accuracy of \(10^{-8}\). Figure 7 shows the measured time and memory footprint of the default MKL version (using PARDISO) and the CGMN versions parallelized using both RACE and ABMC for different matrix sizes. In line with the observations in Section 5.1, ABMC is a factor of 4× slower than RACE. The time required by the default MKL with PARDISO is smaller than with CGMN using RACE for small sizes; however, the gap gets smaller as the size grows due to the direct solvers having a higher time complexity (here \(\approx O(n^2)\), see Figure 7) compared to iterative methods \(\approx O(n^{1.5})\)). Moreover, the direct solver requires more memory, and the memory requirement grows much faster (see Figure 7(b)) than with CGMN. In our experiment the direct solver ran out of the memory at problem sizes beyond \(140^3\), while CGMN using RACE used less than 10% of space at this point. Thus, CGMN with RACE can solve much larger problems compared to direct solvers, which is a major advantage in fields like quantum physics.

**Conclusion**

We have shown the parallelization problems that commonly occur in sparse computations and discussed on the drawbacks of existing approaches. We then introduced and described the RACE method, its novelties and how it mitigates the shortcomings of existing methods. Finally we have seen the application of the method in numerical linear algebra for achieving high performance and solving large problems on modern compute devices.

**Figure 6.** SymmSpMV performance of RACE compared to other methods. The Roofline model for SymmSpMV is shown in fig. 6a for reference. Note that the matrices are ordered according to increasing number of rows.

**Figure 7.** Comparison of FEAST with default Intel MKL direct solver and iterative solver CGMN, parallelized using RACE and run on one Skylake SP Platinum 8160 CPU (24 threads).
Acknowledgments

We would like to thank Georg Hager, Olaf Schenk, Jonas Thies and Thomas Gruber for providing valuable insights, comments and technical support throughout the work. We gratefully acknowledge the compute resources and support provided by the Erlangen Regional Computing Center (RRZE) and RWTH Aachen.

References