|
1
|
- Some material from lectures of J. Demmel, UC Berkeley Dept of CS
|
|
2
|
- Multigrid is a “unite-and conquer” algorithm for solving the elliptic
PDEs. Multigrid begins by obtaining (or improving) a solution on an NxN
grid by using and (N/2)x(N/2) grid (and proceeding recursively). This
strategy, in addition to reducing computational work, shifts the
frequency structure of errors present in the approximate solution.
- For a more detailed mathematical introduction to the multigrid
algorithm, see "A Multigrid Tutorial" by W. Briggs.
|
|
3
|
- Consider an (2^m-1)-by-(2^m-1) grid of unknowns (a convenient number to
work with). With boundary nodes, e add the nodes at the boundary, get a
(2^m+1)-by-(2^m+1) grid on which the algorithm will operate. We let
n=2^m+1.
- Let P(i) denote the problem of solving a discrete Poisson equation on a
(2^i+1)-by-(2^i+1) grid, with (2^i-1)^2 unknowns.
- The problem is specified by the grid size i, the coefficient matrix
T(i), and the right hand side b(i).
- Generate a sequence of related problems P(m), P(m-1), P(m-2), ... , P(1)
on coarser and coarser grids, where the solution to P(i-1) is a good
approximation to the solution of P(i).
|
|
4
|
|
|
5
|
- Let b(i) be the right-hand-side of the linear system P(i), and x(i) an
approximate solution to P(i). Thus, x(i) and b(i) are (2^i-1)-by-(2^i-1)
arrays of values at each grid point.
- Next define operators that transfer information between grids.
- The restriction operator R(i) takes a pair (b(i),x(i)), consisting of a
problem P(i) and its approximate solution x(i), and maps it to
(b(i-1),x(i-1)), the companion problem on the next coarser grid. On
grid (i-1), start with a guess at the solution x(i-1): ( b(i-1), x(i-1) ) = R(i)(
b(i), x(i) ) We will see that restriction is implemented simply by
computing a weighted average of each grid point value with its nearest
neighbors.
|
|
6
|
- The interpolation operator In(i-1) takes an approximate solution x(i-1)
for P(i-1) and converts it to an approximation x(i) for the problem
P(i) on the next finer grid: ( b(i), x(i) ) = In(i-1)( b(i-1), x(i-1) )
Its implementation also requires just a weighted average with nearest
neighbors.
- The solution operator S(i) take a problem P(i) and approximate solution
x(i), and computes an improved x(i). x_improved(i) = S(i)( b(i), x(i) )
We often a variation of Jacobi's method as the solution operator
(recall symmetry of Jacobi, in contrast to, say, GS)
|
|
7
|
|
|
8
|
|
|
9
|
- Starting with a problem on a fine grid ( b(i), x(i) ), we want to damp
the high frequency error by coarsening the grid, and improve the low
frequency error by a smoothing operator.
- First a cut at the high frequency error by smoothing/approximate
solution:
- x(i) = S(i)( b(i),
x(i) ).
- Coarsen the grid: R(i)( b(i), x(i) ).
- Solve the coarser problem recursively
- MGV( R(i)( b(i), x(i) ) ).
|
|
10
|
- Coarse grid solution to fine grid In(i-1)( MGV( R(i)( b(i), x(i) ) ) ).
- Subtract the correction computed on the coarse grid from the fine grid
solution x(i) = x(i) - d(i)
- Improve the solution more by smoothing: x(i) = S(i)( b(i), x(i) ).
|
|
11
|
|
|
12
|
- The work at each point in the algorithm is proportional to the number of
unknowns, since the value at each grid point is just averaged with its
nearest neighbors. So each grid point on level i will cost (2^i-1)^2 =
O(4^i) operations. If the finest grid is at level m, the total serial
work will be given by the geometric sum
- which is O(m) = O( log
#unknowns ).
|
|
13
|
- The full Multigrid algorithm uses the V-cycle as a building block.
- FMG( b(m), x(m) ) begins by solving the simplest problem P(1) exactly.
- Given a solution x(i-1) of the coarse problem P(i-1), prolongate to a
starting guess x(i) for the next finer problem P(i): In(i-1)( b(i-1),
x(i-1) ).
- Solve the finer problem using the Multigrid V-cycle with this starting
guess: MGV( In(i-1)( b(i-1), x(i-1) ) )
|
|
14
|
- To get a sense of the work, note there is one V cycle for each call to
MGV in the inner loop of FMG. On a serial machine, this means that, at
level i, it costs O(4^i) computations. Thus the total serial cost is
again O( 4^m ) = O( # unknowns )
- The serial complexity is optimal, in that a constant amount of work per
unknown is performed. Analysis of a parallel version is considered
later.
|
|
15
|
- Projection and prolongation are easier in 1D, and it is easier to get a
sense of the algorithmic details.
- Consider P(i), the problem on a 1D grid with 2^i+1 points, with the
middle 2^i-1 being the unknowns.
|
|
16
|
|
|
17
|
- Let T(i) be the coefficient matrix of problem P(i), that is a scaled
(2^i-1)-by-(2^i-1) tridiagonal matrix with 2s on the diagonal and -1s on
the off-diagonal.
- The scale factor 4^(-i) is the
1/h^2 term that shows up because T(i) is approximating a second
derivative on a grid with grid spacing h=2^(-i).
|
|
18
|
- [ 2 -1
]
- [ -1 2 -1 ]
- T(i) = 4^(-i) * [ -1 2 -1 ]
- [ ... ... ... ]
- [ -1 2 -1 ]
- [ -1 2 ]
|
|
19
|
- Recall from Fourier analysis, any function (in this case, the error in
our approximate solution) can be considered as a linear combination of
basis vectors - here, sine functions of different frequencies.
- These sine functions are in fact the eigenvectors of the T(i) matrix.
The following figure shows some of these sine-curves when i=5, so T(5)
is 31-by-31. The top plot is a plot of the eigenvalues (frequencies) of
4^5*T(5), from lowest to highest. The subsequent plots are the
eigenvectors (sine-curves), starting with the lowest 4 frequencies, and
then a few more frequencies up to the highest (number 31).
|
|
20
|
|
|
21
|
- Let v(j) be the j-th eigenvector of T(i), and V the eigenvector matrix
V=[v(1),v(2),...] whose j-th column is v(j). V is an orthogonal matrix,
so any vector x can be written as a linear combination of the v(j).
- alpha(j) the j-th frequency component of x,
- On the upper half of the frequency domain, multigrid acting on P(m),
damps the error in the solution by the solution operator S(i). On the
next coarser grid, with half as many points, multigrid damps the upper
half of the remaining frequency components in the error. On the next
coarser grid, the upper half of the remaining frequency components are
damped, and so on, until we solve the exact (1 unknown) problem P(1).
|
|
22
|
- Here we consider a so-called weighted Jacobi smoothing operator S(i).
- Pure Jacobi would replace the j-th component of the approximate solution
x(i) by the weighted average:
- x(i)(j) = .5*( x(i)(j-1) + x(i)(j+1) + 4^i * b(i)(j) )
- = x(i)(j) + .5*(
x(i)(j-1) - 2*x(i)(j) + x(i)(j+1)
- + 4^i * b(i)(j)
)
- Weighted Jacobi instead uses
- x(i)(j) = x(i)(j) + w*.5*( x(i)(j-1) - 2*x(i)(j) + x(i)(j+1)
- + 4^i * b(i)(j)
)
|
|
23
|
- Choose the weight w to damp the highest frequencies; w=2/3 is a good
choice
- x(i)(j) = (1/3)*( x(i)(j-1) + x(i)(j) + x(i)(j+1)
- + 4^i * b(i)(j) )
|
|
24
|
- Take i=6, so there are 2^6-1=63 unknowns. In the figures to follow, the
true solution is a sine curve, shown as a dotted line in the leftmost
plot in each row. The approximate solution is a solid line in the same
plot. The middle plot shows the error alone. The rightmost plot shows
the frequency components of the error.
- Initially, the norm of the vector decreases rapidly, from 1.78 to 1.11,
but then decays more gradually, because there is little more error in
the high frequencies to damp. Thus, it only makes sense to do 1 or
perhaps 2 iterations of S(i) at a time.
|
|
25
|
|
|
26
|
|
|
27
|
|
|
28
|
- The restriction operator R(i) takes a problem P(i) with right-hand-side
b(i) and an approximate solution x(i), and maps it to a simpler problem
P(i-1) with right-hand-side b(i-1) and approximate solution x(i-1).
- Let r(i) be the residual of the approximate solution x(i):
- r(i) = T(i)*x(i) - b(i)
- If x(i) were the exact solution, r(i) would be zero.
- If we could solve the equation T(i) * d(i) = r(i) exactly for the correction
d(i), then x(i)-d(i) would be the solution we seek, because
- T(i) * ( x(i)-d(i) ) =
T(i)*x(i) - T(i)*d(i)
-
= (r(i) + b(i)) - (r(i))
= b(i)
|
|
29
|
- Thus, to apply R(i) to ( b(i), x(i) ) and get P(i-1),
- compute the residual r(i),
- restrict it to the next coarser grid to get b(i-1) = r(i-1)
- letting the initial guess x(i-1) be all zeros.
|
|
30
|
- Thus, P(i-1) will be the problem of solving T(i-1)*d(i-1) = r(i-1), for
the correction d(i-1), starting with the initial guess x(i-1) of all
zeros.
- The restriction will be accomplished by averaging nearest neighbors of
r(i) on the fine grid (with 2^i-1 unknowns) to get an approximation on
the coarse grid (with 2^(i-1)-1 unknowns): The value at a coarse grid
point is .5 times the value at the corresponding fine grid point, plus
.25 times each of the fine grid point neighbors.
|
|
31
|
- The interpolation operator In(i-1) takes the solution of the residual
equation
- T(i-1)*x(i-1) = r(i-1)
- and uses it to correct the
fine grid approximation x(i).
|
|
32
|
- The next graphs summarize the convergence of Multigrid in two ways. The
graph on the right shows the residual
- residual(k) =
norm(T(i)x(i)-b(i))
- after k iterations of Full
Multigrid (FMG). The graph on the left shows the ratio of consecutive
residuals, showing how the error (residual) is decreased by a factor
bounded away from 1 at each step (for less smooth solutions, the
residual will tend to decrease by a factor closer to .5 at each step.)
|
|
33
|
|
|
34
|
- Look more carefully at a parallel version of multigrid in 2D, including
communication costs.
- Multigrid requires each grid point to be updated depending on as many as
8 neighbors (those to the N, E, S, W, NW, SW, SE and NE). This will
determine communication costs.
- Assume an n=(2^m+1)-by-n grid of data, and that p=s^2 is a perfect
square.
- Assume the data is laid out on an s-by-s grid of processors, with each
processor owning an ((n-1)/s)-by-((n-1)/s) subgrid.
|
|
35
|
- This is illustrated in the following diagram by the pink dotted line,
which encircles the grid points owned by the corresponding processor.
The grid points in the top processor row have been labeled (and color
coded) by the grid number i of the problem P(i) in which they
participate. There is exactly one point labeled 2 per processor. The
only grid point in P(1) with a nonboundary value is owned by the
processor above the one with the pink dotted line. In general, if
p>=16, there will be .5*log_2(p)-1 grids P(i) in which fewer than all
processor own participating grid points.
|
|
36
|
|
|
37
|
- The processor indicated by the pink dotted line requires certain grid
point data from its neighbors to execute the algorithm. For example, to
update its own (blue) grid points for P(5), the processor requires 8
blue grid point values from its N, S, E, and W neighbors, as well as
single blue grid point values from its NW, SW, SE and NE neighbors.
Similarly, updating the (red) grid points for P(4) requires 4 red grid
point values from the N, W, E and W neighbors, and one each from the NW,
SW, SE and NE neighbors. This pattern continues until each processor has
only one grid point. After this, only some processors participate in the
computation, requiring one value each from 8 other processors.
|
|
38
|
- From this, we can do a simple complexity analysis of the communication
costs. For simplicity, let p=2^(2*k), so each processor owns a
2^(m-k)-by-2^(m-k) subgrid. Consider a V-cycle starting at level m. Then
in levels m through k of the V-cycle, each processor will own at least
one grid point. After that, in levels k-1 through 1, fewer than all
processors will own an active grid point, and so some processors will
remain idle. Let f be the time per flop, alpha the time for a message
startup, and beta the time per word to send a message.
|
|
39
|
- At level m >= i >= k, the time in the V-cycle will be
- Time at level i = O( 4^(i-k) ) * f
... number of
-
flops, proportional to the
-
number of grid points
-
per processor
- + O(
1 ) * alpha ... send a
constant
-
number of messages
-
to neighbors
- + O(
2^(i-k) ) * ... number of
words sent
|
|
40
|
- Summing all these terms from i=k to m yields Time at levels k to m =
- O( 4^(m-k) )*f
+ O( m-k )*alpha
- + O( 2^(m-k)
)*beta
- = O( n^2/p )*f + O(
log(n/p) )*alpha
- + O( n/sqrt(p)
)*beta
|
|
41
|
- On levels k >= i >= 1, this effect disappears, because each
processor needs to communicate about as many words as it does floating
point operations:
- Time at level i = O( 1 ) * f
... number of flops, proportional
-
to # of grid points per processor
- + O( 1 )
* alpha ... send a constant
number of
-
messages to neighbors
- + O( 1 )
* beta ... number of words
sent
|
|
42
|
- Summing all these terms yields
- Time at levels 1 to k-1 = O( k-1 )*f
- + O( k-1
)*alpha + O( k-1 )*beta
- = O( log(p) )*f + O(
log(p) )*alpha
- + O( log(p)
)*beta
|
|
43
|
- The total time for a V-cycle starting at the finest level is therefore
- Time = O( n^2/p + log(p) )*f
- + O( log(n) )*alpha
- + O( n/sqrt(p) +
log(p) )*beta
|
|
44
|
- For Full Multigrid, assuming n^2 >= p, so that each processor has at
least 1 grid point:
- Time = O( N/p + log(p)*log(N) )*f
- + O( (log(N))^2
)*alpha
- + O( sqrt(N/p) +
log(p)*log(N) )*beta
- where N=n^2 is the number of unknowns. The speedup over the serial
floating point work O( N ), is nearly perfect, O( N/p + log(p)*log(N) ),
when N>>p, but reduces to log(N)^2 when p=N (the PRAM model).
|
|
45
|
- In practice, the time spent on the coarsest grids, when each processor
has one or zero active grid points, while negligible on a serial
machine, is significant on a parallel machine, enough so to make the
efficiency noticeably lower.
|