1	Numerical Linear Algebra An Introduction
2	Levels of multiplication vector-vector a[i]b[i] matrix-vector A[i,j]b[j] matrix-matrix A[I,k]*B[k,j]
3	Matrix-Matrix for (i=0; i<n; i++) for (j=0; j<n; j++) { C[i,j]=0.0; for (k=0; k<n; k++) C[i,j]=C[i,j] + A[i,k]*B[k,j]; } Note:O(n³) work
4	Block matrix Serial version - same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication Let n be a power of two, and generate a recursive algorithm. Terminates with an explicit formula for the elementary 2X2 multiplications. Allows for parallelism. Can get O(n^2.8) work
5	Pipeline method Pump data left to right and top to bottom recv(&A,P[i,j-1]); recv(&B,P[i-1,j]); C=C+A*B send(&A,P[i,j+1]); send(&B,P[i+1, j]);
6	Pipeline method
7	Pipeline method Similar method for matrix-vector multiplication. But you lose some of the cache reuse
8	A sense of speed – vector ops
9	A sense of speed – vector ops
10	Observation s Simple do loops not effective Cache and memory hierarchy bottlenecks For better performance, combine loops minimize memory transfer
11	LINPACK library of subroutines to solve linear algebra example – LU decomposition and system solve (dgefa and dgesl, resp.) In turn, built on BLAS see netlib.org
12	BLAS Basic Linear Algebra Subprograms a library of subroutines designed to provide efficient computation of commonly-used linear algebra routines, like dot products, matrix-vector multiplies, and matrix-matrix multiplies. The naming convention is not unlike other libraries - the fist letter indicates precision, the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM. The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up usually in level 3.
13	BLAS Level 1
14	BLAS Level 2
15	BLAS Level 3
16	How efficient is the BLAS? load/store float ops refs/ops level 1 SAXPY 3N 2N 3/2 level 2 SGEMV MN+N+2M 2MN 1/2 level 3 SGEMM 2MN+MK+KN 2MNK 2/N
17	Matrix-vector read x(1:n) into fast memory read y(1:n) into fast memory for i = 1:n read row i of A into fast memory for j = 1:n y(i) = y(i) + A(i,j)*x(j) write y(1:n) back to slow memory
18	Matrix-vector m=# slow memory refs = n^2 +3n f=# arithmetic ops = 2n^2 q=f/m ~2 Mat-vec multiple limited by slow memory
19	Matrix-matrix
20	Matrix Multiply - unblocked for i = 1 to n read row i of A into fast memory for j = 1 to n read C(i,j) into fast memory read column j of B into fast memory for k = 1 to n C(i,j) = C(i,j) + A(i,k) * B(k,j) write C(i,j) back to slow memory
21	Matrix Multiply unblocked Number of slow memory references on unblocked matrix multiply m = n^3 read each column of B n times + n^2 read each column of A once for each i + 2n^2 read and write each element of C once = n^3 + 3n^2 So q = f/m = (2n^3)/(n^3 + 3n^2) ~= 2 for large n, no improvement over matrix-vector multiply
22	Matrix Multiply blocked Consider A,B,C to be N by N matrices of b by b subblocks where b=n/N is called the blocksize for i = 1 to N for j = 1 to N read block C(i,j) into fast memory for k = 1 to N read block A(i,k) into fast memory read block B(k,j) into fast memory C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks} write block C(i,j) back to slow memory
23	Matrix Multiply blocked Number of slow memory references on blocked matrix multiply m = Nn^2 read each block of B N^3 times (N^3 n/N * n/N) + Nn^2 read each block of A N^3 times + 2n^2 read and write each block of C = (2N + 2)n^2 So q = f/m = 2n^3 / ((2N + 2)*n^2) ~= n/N = b for large n
24	Matrix Multiply blocked So we can improve performance by increasing the blocksize b Can be much faster than matrix-vector multiplty (q=2) Limit: All three blocks from A,B,C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3)
25	More on BLAS Industry standard interface(evolving) Vendors, others supply optimized implementations History BLAS1 (1970s): vector operations: dot product, saxpy m=2n, f=2n, q ~1 or less BLAS2 (mid 1980s) matrix-vector operations m=n^2, f=2*n^2, q~2, less overhead somewhat faster than BLAS1
26	More on BLAS BLAS3 (late 1980s) matrix-matrix operations: matrix matrix multiply, etc m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2 Good algorithms used BLAS3 when possible (LAPACK) www.netlib.org/blas, www.netlib.org/lapack
27	BLAS on an IBM RS6000/590