Notes
Slide Show
Outline
1
Numerical Linear Algebra
An Introduction
2
Levels of multiplication
  • vector-vector  a[i]*b[i]
  • matrix-vector  A[i,j]*b[j]
  • matrix-matrix  A[I,k]*B[k,j]




3
Matrix-Matrix
  • for (i=0; i<n; i++)
  •   for (j=0; j<n; j++)
  •     {
  •       C[i,j]=0.0;
  •        for (k=0; k<n; k++)
  •          C[i,j]=C[i,j] + A[i,k]*B[k,j];
  •      }
  • Note:O(n3) work
4
Block matrix
  • Serial version - same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication
  • Let n be a power of two, and generate a recursive algorithm. Terminates with an explicit formula for the elementary 2X2 multiplications. Allows for parallelism. Can get O(n2.8) work
5
Pipeline method
  • Pump data left to right and top to bottom
  • recv(&A,P[i,j-1]);
  • recv(&B,P[i-1,j]);
  • C=C+A*B
  • send(&A,P[i,j+1]);
  • send(&B,P[i+1, j]);
6
Pipeline method
7
Pipeline method
  • Similar method for matrix-vector multiplication. But you lose some of the cache reuse
8
A sense of speed – vector ops
9
A sense of speed – vector ops
10
Observation s
  • Simple do loops not effective
  • Cache and memory hierarchy bottlenecks
  • For better performance,
    • combine loops
    • minimize memory transfer

11
LINPACK
  • library of subroutines to solve linear algebra
  • example – LU decomposition and system solve (dgefa and dgesl, resp.)
  • In turn, built on BLAS
  • see netlib.org


12
BLAS
 Basic Linear Algebra Subprograms
  • a library of subroutines designed to provide efficient computation of commonly-used linear algebra routines, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
  • The naming convention is not unlike other libraries - the fist letter indicates precision, the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.
  • The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up usually in level 3.
13
BLAS
  • Level 1
14
BLAS
  • Level 2
15
BLAS
  • Level 3
16
How efficient is the BLAS?
  • load/store float ops refs/ops
  • level 1
  • SAXPY 3N          2N   3/2
  • level 2
  • SGEMV MN+N+2M    2MN            1/2
  • level 3
  • SGEMM    2MN+MK+KN    2MNK            2/N
17
Matrix-vector
  • read x(1:n) into fast memory
  •   read y(1:n) into fast memory
  •     for i = 1:n
  •     read row i of A into fast memory
  •     for j = 1:n
  •    y(i) = y(i) + A(i,j)*x(j)
  • write y(1:n) back to slow memory


18
Matrix-vector
  • m=# slow memory refs = n^2 +3n
  • f=# arithmetic ops = 2n^2
  • q=f/m ~2
  • Mat-vec multiple limited by slow memory


19
Matrix-matrix
20
Matrix Multiply - unblocked
  • for i = 1 to n
  •   read row i of A into fast memory
  •    for j = 1 to n
  •        read C(i,j) into fast memory
  •        read column j of B into fast memory
  •        for k = 1 to n
  •            C(i,j) = C(i,j) + A(i,k) * B(k,j)
  •        write C(i,j) back to slow memory
21
Matrix Multiply unblocked
  • Number of slow memory references on unblocked matrix multiply
  • m =  n^3    read each column of B  n times
  •          + n^2    read each column of A once for each i
  •          + 2*n^2 read and write each element of C once
  •         =  n^3 + 3*n^2
  • So q = f/m = (2*n^3)/(n^3 + 3*n^2)
  •         ~= 2 for large n, no improvement over matrix-vector multiply
22
Matrix Multiply blocked
  • Consider A,B,C to be N by N matrices of b by b subblocks where b=n/N is called the blocksize
  • for i = 1 to N
  •      for j = 1 to N
  •        read block C(i,j) into fast memory
  •        for k = 1 to N
  •                   read block A(i,k) into fast memory
  •                   read block B(k,j) into fast memory
  •                   C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks}
  •       write block C(i,j) back to slow memory
23
Matrix Multiply blocked
  • Number of slow memory references on blocked matrix multiply
  • m =  N*n^2    read each block of B  N^3 times (N^3 * n/N * n/N)
  •          + N*n^2    read each block of A  N^3 times
  •          + 2*n^2     read and write each block of C
  • =  (2*N + 2)*n^2
  • So q = f/m = 2*n^3 / ((2*N + 2)*n^2)
  •         ~= n/N = b  for large n


24
Matrix Multiply blocked

  • So we can improve performance by increasing the blocksize b
  • Can be much faster than matrix-vector multiplty (q=2)


  • Limit:   All three blocks from A,B,C must fit in fast memory (cache), so we
  •  cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3)
25
More on BLAS
  • Industry standard interface(evolving)
  • Vendors, others supply optimized implementations
  • History
    • BLAS1 (1970s):
      • vector operations: dot product, saxpy
      • m=2*n, f=2*n, q ~1 or less
    • BLAS2 (mid 1980s)
      • matrix-vector operations
      • m=n^2, f=2*n^2, q~2, less overhead
      • somewhat faster than BLAS1
26
More on BLAS
    • BLAS3 (late 1980s)
      • matrix-matrix operations: matrix matrix multiply, etc
      • m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
  • Good algorithms used BLAS3 when possible (LAPACK)
  • www.netlib.org/blas, www.netlib.org/lapack
27
BLAS on an IBM RS6000/590