1
Load-Balancing
2
Load-Balancing
  • What is load-balancing?
    • Dividing up the total work between processes when running codes on a parallel machine
  • Load-balancing constraints
    • Minimize interprocess communication
  • Also called:
    • partitioning, mesh partitioning, (domain decomposition)
3
Know your data and memory
  • Memory is organized in banks. After an access to a bank, there is a latency period before that same bank can be accessed again.
  • Matrix entries are stored column-wise in Fortran.


4
 
5
Addressing Memory
  • For illustration purposes, let’s imagine 8 banks [128 or 256 are common on chips today], with a bank busy time (bbt) of 8 cycles between accesses to the same bank. A 4×4 matrix stored column-wise is then laid out as:
  • bank     1      2      3      4      5      6      7      8
  • data     a11    a21    a31    a41    a12    a22    a32    a42
  • data     a13    a23    a33    a43    a14    a24    a34    a44
6
Addressing Memory
  • If we access data column-wise, we proceed through the banks in order, one per cycle. By the time we reach a13 (bank 1 again, on cycle 9), the 8-cycle bbt since a11 has (just) elapsed, so there is no stall.
  • On the other hand, if we access data row-wise, we get a11 from bank 1 and a12 from bank 5, but a13 is in bank 1 again, so instead of accessing it on clock cycle 3 we have to wait until cycle 9. Then we get a14 from bank 5 on cycle 10, etc.
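The cycle counts above can be reproduced with a small model. This is only a sketch under simplifying assumptions that are not in the slides: accesses are issued strictly in order, at most one per cycle, and an access to a busy bank stalls everything behind it. The function names bank_of and access_cycles are made up for this illustration.

```python
def bank_of(i, j, n_rows=4, n_banks=8):
    """Bank (1-based) holding a(i+1, j+1) for column-wise storage; i, j are 0-based."""
    return (i + j * n_rows) % n_banks + 1

def access_cycles(banks, bbt=8):
    """Cycle on which each access is served (assumed: in order, one per cycle at best)."""
    next_free = {}            # bank -> first cycle on which it may be accessed again
    cycle, served = 0, []
    for bank in banks:
        cycle = max(cycle + 1, next_free.get(bank, 1))
        served.append(cycle)
        next_free[bank] = cycle + bbt
    return served

# Sweep the 4x4 matrix in both orders.
column_wise = [bank_of(i, j) for j in range(4) for i in range(4)]
row_wise = [bank_of(i, j) for i in range(4) for j in range(4)]
col_cycles = access_cycles(column_wise)
row_cycles = access_cycles(row_wise)
print(col_cycles)
print(row_cycles)
```

Under these assumptions the column-wise sweep finishes on cycle 16, while the row-wise sweep starts 1, 2, 9, 10 as described above and finishes on cycle 40.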


7
Indirect addressing
  • If addressing is indirect (for example, through an index array), accesses may jump all over memory, and performance suffers as a result.
8
Shared Memory
  • Bank conflicts depend on the granularity of memory
  • With p processors each issuing N memory references per cycle and a bbt of b cycles, p*N*b memory banks are needed for uninterrupted access to data
  • With B banks, the granularity is
  •                g = B/(p*N*b)
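As a quick worked example of the formula, the sketch below evaluates it for one hypothetical machine. The numbers (4 processors, 2 references per cycle, bbt of 8 cycles, 128 banks) are invented for illustration and do not describe any particular system.

```python
def banks_needed(p, N, b):
    """Banks required for uninterrupted access: p processors, N refs/cycle, bbt of b cycles."""
    return p * N * b

def granularity(B, p, N, b):
    """g = B / (p*N*b); g >= 1 means the machine has at least the required number of banks."""
    return B / banks_needed(p, N, b)

# Hypothetical machine, values chosen only for the example.
needed = banks_needed(p=4, N=2, b=8)
g = granularity(B=128, p=4, N=2, b=8)
print(needed, g)
```

Here 64 banks are needed and g = 2, so twice the minimum is available; a value of g below 1 would indicate that some references have to wait on busy banks.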
9
Moral
  • Separate selection of data from its processing
  • Each subtask requires its own data structure. Be prepared to change structures between tasks
10
Load-balancing nomenclature
11
Partitioning
12
Work/Edge Weights
  • Need a good measure of what the expected work may be
    • Molecular dynamics:
      • number of molecules
      • regions
    • FEM/finite difference/finite volume, etc:
      • Degrees of freedom
      • Cells/elements


  • If edge weights are used, also need a good measure of how strongly objects are coupled to each other
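To make the two kinds of weight concrete, the sketch below scores a partition by the work assigned to each part (sum of object weights) and by the total weight of the edges it cuts. The graph, the weights, and the function names part_loads and edge_cut are hypothetical and chosen only for illustration.

```python
def part_loads(vertex_weights, part, n_parts):
    """Total work assigned to each part; part[v] is the owner of object v."""
    loads = [0.0] * n_parts
    for v, w in enumerate(vertex_weights):
        loads[part[v]] += w
    return loads

def edge_cut(edges, part):
    """Sum of the weights of edges whose endpoints lie in different parts."""
    return sum(w for u, v, w in edges if part[u] != part[v])

# Four objects in a chain 0-1-2-3; objects 1 and 2 are strongly coupled.
vertex_weights = [1.0, 1.0, 1.0, 1.0]
edges = [(0, 1, 1.0), (1, 2, 5.0), (2, 3, 1.0)]

balanced = [0, 0, 1, 1]     # equal work, but cuts the heavy edge
unbalanced = [0, 1, 1, 1]   # unequal work, cuts only a light edge
print(part_loads(vertex_weights, balanced, 2), edge_cut(edges, balanced))
print(part_loads(vertex_weights, unbalanced, 2), edge_cut(edges, unbalanced))
```

The two candidate partitions show the trade-off: the first balances the work but cuts the strongly coupled pair, the second keeps that pair together at the price of uneven work.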
13
Static/Dynamic Load-Balancing
  • Static load-balancing
    • Done as a “preprocessing” step before the actual calculation
    • If the objects and edges change little or not at all, static load-balancing is sufficient
  • Dynamic load-balancing
    • Done during the calculation
    • Needed when the objects and/or edges change significantly
14
Dynamic Load-Balancing Example
15
Static vs. Dynamic Load Balancing
  • Static partitioning insufficient for many applications
    • Adaptive mesh refinement
    • Multi-phase/Multi-physics computations
    • Particle simulations
    • Crash simulations
    • Parallel mesh generation
    • Heterogeneous computers
  • Need dynamic load balancing


16
Dynamic Load-Balancing Constraints
  • Minimize load-balancing time
    • Memory constraints
  • Minimize data migration -- incremental partitions
    • Small changes in the computation should result in small changes in the partitioning
    • Calculating the new partition and migrating the data should together take less time than is saved by performing the computation on the newly partitioned grid
  • The load-balancing must itself be done in parallel
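The time trade-off in the constraints above can be written as a simple check. This is a sketch under assumptions that are not in the slides: the time per step is set by the most heavily loaded process, the loads stay fixed until the next check, and the partitioning and migration costs are known or estimated. The function name and all numbers are hypothetical.

```python
def should_rebalance(loads_now, loads_new, steps_until_next_check,
                     t_partition, t_migrate):
    """True if the time saved by the new partition exceeds the cost of obtaining it.

    loads_* are per-process times per step (seconds); the slowest process
    determines the time per step (assumed bulk-synchronous computation).
    """
    saved_per_step = max(loads_now) - max(loads_new)
    return saved_per_step * steps_until_next_check > t_partition + t_migrate

# Hypothetical numbers: rebalancing costs 3 s + 5 s and saves 2 s per step.
print(should_rebalance([4, 1, 1, 2], [2, 2, 2, 2], 10, 3.0, 5.0))
print(should_rebalance([4, 1, 1, 2], [2, 2, 2, 2], 3, 3.0, 5.0))
```

With ten steps before the next check the rebalance pays for itself; with only three it does not.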



17
Methods of Load-Balancing
  • Geometric
    • Based on geometric location
    • Faster load-balancing time with medium quality results
  • Graph-based
    • Create a graph to represent the objects and their connections
    • Slower load-balancing time but high quality results
  • Incremental methods
    • Use graph representation and “shuffle” around objects


18
Choosing a Load-Balancing Algorithm/Method
  • No algorithm/method is appropriate for all applications!


  • Graph load-balancing algorithms for:
    • Static load-balancing
    • Computations where computation to load-balancing time ratio is high
      • Implicit schemes with linear and non-linear solution schemes
19
Choosing a Load-Balancing Algorithm/Method

  • Geometric load-balancing algorithms for:
    • Computations where computation to load-balancing time ratio is low
      • For explicit time stepping calculations with many time steps and varying workload (MD, FEM crash simulations, etc.)
      • Problems with many load-balancing objects
20
Geometric Load-Balancing
  • Based on the objects’ coordinates
    • Want a unique coordinate associated with an object
      • Node coordinates, element centroid, molecule coordinate/centroid, etc.
  • Partition “space” which results in a partition of the load-balancing objects
  • Edge cuts are usually not explicitly dealt with
21
Geometric Load-Balancing Assumptions
  • Objects that are close will likely need to share information
    • Want compact partitions
      • High volume-to-surface-area or high area-to-perimeter ratios
  • Coordinate information is available for the objects
  • The domain is bounded


22
Geometric Load-Balancing Algorithms
  • Recursive Coordinate Bisection (RCB)
    • Berger & Bokhari
  • Recursive Inertial Bisection (RIB)
    • Taylor & Nour-Omid
  • Space Filling Curves (SFC)
    • Warren & Salmon, Ou, Ranka, & Fox, Baden & Pilkington
  • Octree Partitioning/Refinement-tree Partitioning
    • Loy & Flaherty, Mitchell
23
Recursive Coordinate Bisection
  • Choose an axis for the cut
  • Find the proper location of the cut
  • Group objects together according to location relative to cut
  • If more partitions are needed, repeat from the first step within each group
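The four steps can be sketched as follows. This is a minimal illustration, not the Berger & Bokhari implementation: choosing the axis with the largest extent and cutting at the weighted split point are common choices assumed here, and the sketch requires positive weights and no more parts than objects.

```python
def rcb(points, weights, n_parts):
    """Recursive coordinate bisection; returns one list of object indices per part.

    Assumes positive weights and n_parts <= len(points).
    """
    def split(ids, k):
        if k == 1:
            return [ids]
        dims = range(len(points[ids[0]]))
        # Step 1: cut along the axis with the largest extent (assumed choice).
        axis = max(dims, key=lambda d: max(points[i][d] for i in ids)
                                       - min(points[i][d] for i in ids))
        ordered = sorted(ids, key=lambda i: points[i][axis])
        # Step 2: place the cut so the first group holds about k_left/k of the weight.
        k_left = k // 2
        target = sum(weights[i] for i in ordered) * k_left / k
        acc, cut = 0.0, 0
        while acc < target and cut < len(ordered):
            acc += weights[ordered[cut]]
            cut += 1
        # Keep enough objects on each side for the parts still to be created.
        cut = min(max(cut, k_left), len(ordered) - (k - k_left))
        # Steps 3 and 4: group by side of the cut, then recurse on each group.
        return split(ordered[:cut], k_left) + split(ordered[cut:], k - k_left)

    return split(list(range(len(points))), n_parts)

# Eight unit-weight objects on a 4 x 2 grid, split into four parts.
grid = [(x, y) for x in range(4) for y in range(2)]
parts = rcb(grid, [1.0] * len(grid), 4)
print(parts)
```

Because each cut is perpendicular to a coordinate axis, the resulting parts are axis-aligned boxes.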
24
Recursive Inertial Bisection
  • Choose a direction for the cut
  • Find the proper location of the cut
  • Group objects together according to location relative to cut
  • If more partitions are needed, repeat from the first step within each group
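A single inertial bisection can be sketched in 2D as below. The assumption here is that the cut direction is the principal axis of the weighted point distribution, computed from the second moments about the centroid, with the cut made perpendicular to it at the half-weight point. The function name inertial_bisect is made up, and only one level of the recursion is shown.

```python
import math

def inertial_bisect(points, weights):
    """One 2D inertial bisection: returns (direction, first_group, second_group)."""
    total = sum(weights)
    # Weighted centroid of the objects.
    cx = sum(w * x for (x, y), w in zip(points, weights)) / total
    cy = sum(w * y for (x, y), w in zip(points, weights)) / total
    # Second moments about the centroid.
    sxx = sum(w * (x - cx) ** 2 for (x, y), w in zip(points, weights))
    syy = sum(w * (y - cy) ** 2 for (x, y), w in zip(points, weights))
    sxy = sum(w * (x - cx) * (y - cy) for (x, y), w in zip(points, weights))
    # Step 1: principal (longest) axis of the distribution.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # Step 2: order objects by their projection onto that axis, cut at half the weight.
    def projection(i):
        return (points[i][0] - cx) * direction[0] + (points[i][1] - cy) * direction[1]
    ordered = sorted(range(len(points)), key=projection)
    acc, cut = 0.0, 0
    while acc < total / 2 and cut < len(ordered) - 1:
        acc += weights[ordered[cut]]
        cut += 1
    # Step 3: group objects by side of the cut.
    return direction, ordered[:cut], ordered[cut:]

# Four unit-weight objects along the diagonal y = x.
direction, first, second = inertial_bisect([(0, 0), (1, 1), (2, 2), (3, 3)], [1.0] * 4)
print(direction, first, second)
```

Unlike RCB, the cut here follows the orientation of the objects rather than a coordinate axis, which is why the diagonal test case is bisected along the 45-degree direction.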
25
Space Filling Curves
26
Load-Balancing with Space Filling Curves
  • The SFC gives a 1-dimensional ordering of objects located in an n-dimensional domain
    • Easier to work with objects in 1 dimension than in n dimensions
  • Algorithm:
    • Sort objects by their location on the SFC
    • Calculate cuts along the SFC
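The two-step algorithm can be sketched with a Morton (Z-order) curve, which is one common space filling curve; the slides do not say which curve is intended, so this choice is an assumption. Objects are assumed to have non-negative integer grid coordinates, and the function names are made up for this illustration.

```python
def morton_key(ix, iy, bits=16):
    """Position of integer cell (ix, iy) on a 2D Morton curve (bit interleaving)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

def sfc_partition(cells, weights, n_parts):
    """Sort objects along the curve, then cut the 1D ordering into equal-weight pieces."""
    order = sorted(range(len(cells)), key=lambda i: morton_key(*cells[i]))
    total = sum(weights)
    parts = [[] for _ in range(n_parts)]
    acc = 0.0
    for i in order:
        # Assign by where the midpoint of this object's weight falls along the curve.
        p = min(int((acc + 0.5 * weights[i]) * n_parts / total), n_parts - 1)
        parts[p].append(i)
        acc += weights[i]
    return parts

# Sixteen unit-weight cells on a 4 x 4 grid, split into four parts.
cells = [(x, y) for x in range(4) for y in range(4)]
parts = sfc_partition(cells, [1.0] * len(cells), 4)
print([[cells[i] for i in p] for p in parts])
```

In this example each part is one 2 x 2 quadrant of the grid, which shows how locality along the curve turns into reasonably compact regions in space.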
27
Octree Partitioning/Refinement-Tree Partitioning
  • Tree based algorithms for applications with multiple levels of data, simulation accuracy, etc.
    • Tree is usually built from specific computational schemes
    • Tightly coupled with the simulation
28
Comparisons of RCB, RIB, and SFC
  • RCB and RIB usually give slightly better partitions than SFC
  • SFC is usually a little faster
  • SFC is a little better for incremental partitions
    • RIB can be quite unstable for incremental partitions
29
Load-Balancing Libraries
  • There are many load-balancing libraries downloadable from the web
    • Mostly graph partitioning libraries
      • Static:  Chaco, Metis, Party, Scotch
      • Dynamic:  ParMetis, DRAMA, Jostle, Zoltan
  • Zoltan (www.cs.sandia.gov/Zoltan)
    • Dynamic load-balancing library with:
      • SFC, RCB, RIB, Octree, ParMetis, Jostle
    • Same interface to all load-balancing algorithms
30
Methods to Avoid Communication
  • Avoiding load-balancing
    • Load-balancing is not needed every time the workload and/or edge connectivity changes
  • Ghost cells
  • Predictive load-balancing
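One simple way to avoid unnecessary load-balancing, as in the first item above, is to measure the imbalance and repartition only when it exceeds a tolerance. The sketch below uses the ratio of maximum to average load as the measure; both that measure and the 10% tolerance are assumptions made for illustration, not values from the slides.

```python
def imbalance(loads):
    """Ratio of the largest per-process load to the average load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))

def needs_rebalance(loads, tolerance=1.1):
    """Repartition only when the imbalance exceeds the (assumed) tolerance."""
    return imbalance(loads) > tolerance

print(imbalance([2, 2, 2, 2]), needs_rebalance([2, 2, 2, 2]))
print(imbalance([4, 1, 1, 2]), needs_rebalance([4, 1, 1, 2]))
```

A small drift in the workload then leaves the partition untouched, and the cost of repartitioning and migration is paid only when the slowest process is noticeably overloaded.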



31
Accessing Information on Other Processors
  • Need communication between processors
  • Use ‘ghost’ cells – need to maintain consistency of data in ghost cells
32
Ghost Cells
  • Copies of cells assigned to other processors
  • Make needed information available
  • No solution values are computed at the ghost cells
  • Ghost cell information needs to be updated whenever necessary
  • Ghost cells need to be calculated dynamically because of changing mesh and dynamic load-balancing
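The points above can be illustrated with a 1D example in which two subdomains are held in ordinary lists instead of on separate processors. This is only a sketch: in a real parallel code the copies in exchange_ghosts would be messages between processors, and the layout (one ghost cell at each end of every subdomain) and the three-point averaging stencil are assumptions chosen for the example.

```python
def exchange_ghosts(subdomains):
    """Refresh ghost cells; each subdomain is [ghost_left, owned cells..., ghost_right]."""
    for r, local in enumerate(subdomains):
        if r > 0:
            local[0] = subdomains[r - 1][-2]    # left neighbour's last owned cell
        if r < len(subdomains) - 1:
            local[-1] = subdomains[r + 1][1]    # right neighbour's first owned cell

def smooth(subdomains):
    """One three-point averaging step; values are computed at owned cells only."""
    exchange_ghosts(subdomains)                 # ghost data must be current first
    updated = []
    for local in subdomains:
        new = local[:]
        for i in range(1, len(local) - 1):      # skip the ghost cells at both ends
            new[i] = (local[i - 1] + local[i] + local[i + 1]) / 3.0
        updated.append(new)
    return updated

# Two subdomains owning cells 1..3 and 4..6; the outer ghosts act as boundary values.
subdomains = [[0.0, 1.0, 2.0, 3.0, 0.0], [0.0, 4.0, 5.0, 6.0, 0.0]]
result = smooth(subdomains)
print(result)
```

Without the exchange, the cells next to the subdomain interface would be updated from stale ghost values and the result would differ from the single-processor computation.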


33
Predictive Load-Balancing
  • Predict the workload and/or edge connectivity and load-balance with that information
    • Assumes that you can predict the workload and/or edge connectivity
  • Still need to perform communication but reduces data migration
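A minimal sketch of the idea, assuming the workload can be predicted from refinement flags: each cell marked for refinement is weighted by the number of children it will produce, and the partition is computed from those predicted weights before the refinement happens. The 1D layout, the factor of 4 children per refined cell, and the function names are hypothetical.

```python
def predicted_weights(refine_flags, children_per_refined=4):
    """Weight each cell by the number of cells it will become after refinement."""
    return [children_per_refined if flag else 1 for flag in refine_flags]

def split_1d(weights, n_parts):
    """Owner of each object when a 1D ordering is cut into pieces of equal weight."""
    total = sum(weights)
    owners, acc = [], 0.0
    for w in weights:
        # Assign by where the midpoint of this object's weight falls.
        owners.append(min(int((acc + 0.5 * w) * n_parts / total), n_parts - 1))
        acc += w
    return owners

# Eight coarse cells; the last two are flagged for refinement.
flags = [False] * 6 + [True] * 2
current = split_1d([1] * 8, 2)
predicted = split_1d(predicted_weights(flags), 2)
print(current)
print(predicted)
```

In this example the predictive partition moves two coarse cells to the first part before refinement, instead of migrating the larger number of refined cells that would otherwise have to move afterwards.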
34
Predictive Load-Balancing
  • Refine then load-balance – 4 objects migrated
  • Predictive load-balance then refine – 1 object migrated