1
High Performance Computing: Paradigms
2
Standard Measure
  • flops – floating-point operations per second
    • mega, giga, tera, peta
    • often quoted as the aggregate over all the processors of a system
  • current HPC machines are in the low-teraflops regime
  • workstations are in the gigaflops range
  • aggregate flops is a misleading measure
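A rough sketch of how an aggregate peak figure is usually obtained, which also hints at why it can mislead: it assumes every processor completes its maximum number of operations on every cycle. The machine parameters below are hypothetical, chosen only for illustration.

```python
def peak_flops(nodes, procs_per_node, clock_hz, flops_per_cycle):
    """Theoretical aggregate peak: every processor busy on every cycle."""
    return nodes * procs_per_node * clock_hz * flops_per_cycle


def with_prefix(flops):
    """Format a flops value with the largest SI prefix that fits."""
    prefixes = ((1e15, "peta"), (1e12, "tera"), (1e9, "giga"), (1e6, "mega"))
    for factor, name in prefixes:
        if flops >= factor:
            return f"{flops / factor:.2f} {name}flops"
    return f"{flops:.0f} flops"


# Hypothetical cluster: 256 nodes, 4 processors per node,
# 1 GHz clock, 2 floating-point operations per cycle.
aggregate = peak_flops(256, 4, 1.0e9, 2)
print(with_prefix(aggregate))  # prints "2.05 teraflops"
```

The figure is an upper bound; sustained performance on real codes is typically well below it.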





3
flops
  • clock speed of chips is very fast
  • most chips can perform 2-4 floating-point ops per clock cycle and can usually do some other arithmetic at the same time
  • but memory cannot deliver data to the processor fast enough to keep up with the clock
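The gap between clock rate and data delivery can be illustrated with a simple bandwidth-bound estimate (in the spirit of the roofline model, which is not part of these slides). The chip parameters are hypothetical.

```python
def attainable_flops(peak_flops, mem_bandwidth_bytes, flops_per_byte):
    """Sustained rate is capped by whichever is slower: compute or memory."""
    return min(peak_flops, mem_bandwidth_bytes * flops_per_byte)


# Hypothetical chip: 4 gigaflops peak, 1 GB/s memory bandwidth.
peak = 4.0e9
bandwidth = 1.0e9

# y[i] = a * x[i] + y[i] in double precision:
# 2 flops per 24 bytes moved (load x, load y, store y, 8 bytes each).
axpy_intensity = 2 / 24

sustained = attainable_flops(peak, bandwidth, axpy_intensity)
print(f"{sustained / peak:.1%} of peak")
```

Under these assumed numbers the loop is limited by memory, not by the arithmetic units, which is the situation the bullets above describe.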





4
Pipelining
  • one way to keep the processor busy: pipelining
  • break up an operation into several sub-operations that execute in different sub-units, one cycle each
  • vector pipelining used to be the mark of supercomputers
  • commodity clusters replaced vector machines, but vector machines are making a comeback
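A minimal sketch of the cycle count behind pipelining, assuming an idealized pipeline in which every stage takes exactly one cycle and nothing stalls. The function names are illustrative only.

```python
def cycles_unpipelined(n_ops, n_stages):
    """Each operation passes through all stages before the next one starts."""
    return n_ops * n_stages


def cycles_pipelined(n_ops, n_stages):
    """The first result appears after n_stages cycles, then one per cycle."""
    return n_stages + (n_ops - 1)


def speedup(n_ops, n_stages):
    """Ratio of unpipelined to pipelined cycle counts."""
    return cycles_unpipelined(n_ops, n_stages) / cycles_pipelined(n_ops, n_stages)


for n_ops in (1, 10, 1000):
    print(n_ops, round(speedup(n_ops, n_stages=4), 3))
```

For long streams of operations the speedup approaches the number of stages, which is why pipelining pays off most on long vectors.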
5
Instruction Pipelining


6
Instruction Pipelining (cont)
  • Almost all current computers use some pipelining, e.g. the IBM RS/6000
  • Speedup of instruction pipelining cannot always be achieved
    • Next instruction may not be known until execution, e.g. after a branch
    • Data for execution may not be available
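The effect of these two hazards can be estimated with a simple cycles-per-instruction model. This is a sketch under stated assumptions: an ideal pipeline issues one instruction per cycle, and each branch or unavailable operand adds a fixed number of stall cycles. All fractions and penalties below are hypothetical.

```python
def effective_cpi(branch_fraction, branch_penalty,
                  load_fraction=0.0, load_penalty=0):
    """Average cycles per instruction once stall cycles are included."""
    return 1.0 + branch_fraction * branch_penalty + load_fraction * load_penalty


def pipeline_speedup(n_stages, cpi):
    """Speedup over an unpipelined machine needing n_stages cycles per instruction."""
    return n_stages / cpi


# Hypothetical 5-stage pipeline: 20% branches costing 2 stall cycles each,
# 10% of instructions waiting 3 cycles for data.
cpi = effective_cpi(0.2, 2, load_fraction=0.1, load_penalty=3)
print(round(cpi, 2), round(pipeline_speedup(5, cpi), 2))
```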
7
Speeding up machines
  • reduce clock cycle
  • pipelining
  • internal parallelism
  • external parallelism (multiple processors)


8
Hardware Classification (Flynn)
  • SISD
    • single instruction, single data. this is the traditional von Neumann model of computing – that is, a basic serial machine
  • SIMD
    • single instruction, multiple data. classical vector supercomputers, and the old Connection Machines


9
Classification (cont)
  • MISD
    • multiple instruction, single data. no current HPC machine is based on this principle
  • MIMD
    • multiple instruction, multiple data. usual parallelism model
    • could be shared memory, or distributed memory
10
Shared Memory
  • Use multiple processors
    • Shared Memory (SMP: Symmetric Multi-processors)
      • many processors accessing the same memory
      • limited by memory-processor bandwidth
      • SUN Ultra2, SGI Origin, Compaq
11
Distributed Memory
    • Distributed memory
      • many processors each with local memory and some type of high speed interconnect




12
Today’s machines
  • multiple nodes, each with several processors that share memory locally
  • best – and worst – of both models
  • programming difficulties
13
Programming Model
  • SPMD (Single Program Multiple Data)
    • single program is run on all processors with different data
    • each processor knows its ID, thus
      • if (procID .eq. N) then
        • ….
      • else
        • ….
      • end if
    • such constructs can be used for program control
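A minimal sketch of the SPMD idea, simulated serially so that it runs anywhere. On a real machine each rank would be a separate process, typically started by a message-passing library; the function name, the strided data split, and the role strings here are assumptions made for illustration.

```python
def spmd_program(rank, size, data):
    """The same program text runs on every processor; only rank differs."""
    # Each rank works on its own strided share of the data.
    local = data[rank::size]
    partial = sum(local)

    # Branching on the processor ID gives different behaviour per processor.
    if rank == 0:
        role = "root: would collect and combine the partial sums"
    else:
        role = "worker: would send its partial sum to rank 0"
    return partial, role


# Simulate 4 processors one after another.
size = 4
data = list(range(1, 101))
results = [spmd_program(rank, size, data) for rank in range(size)]

total = sum(partial for partial, _ in results)
print(total)  # prints 5050
```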


14
Programming Model
  • MPMD (Multiple Program Multiple Data)
    • Different programs run on different processors
    • often a master-slave model is used
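The master-slave pattern can be sketched with two different routines, one for the master and one for the workers. Threads stand in for separate processors here, and the task (squaring integers) is a hypothetical placeholder.

```python
import queue
import threading


def worker_program(tasks, results):
    """Worker: repeatedly take a task, compute, and report back."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel from the master: no more work
            break
        results.put((item, item * item))


def master_program(work_items, n_workers):
    """Master: hand out tasks, shut the workers down, gather the results."""
    tasks = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=worker_program, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for item in work_items:
        tasks.put(item)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for w in workers:
        w.join()

    # All workers have finished, so the results queue is complete.
    collected = {}
    while not results.empty():
        item, value = results.get()
        collected[item] = value
    return collected


squares = master_program(range(8), n_workers=3)
print(sorted(squares.items()))
```

Because workers pull tasks as they become free, the load balances itself even when tasks take unequal time.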

15
Topology of connections
  • Hypercube
  • Torus
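Both topologies have a compact description in terms of node addresses. The sketch below assumes nodes of a hypercube are numbered with dim-bit integers and nodes of a 2D torus are addressed by (x, y) grid coordinates; the function names are illustrative.

```python
def hypercube_neighbors(node, dim):
    """Neighbors differ from node in exactly one of its dim address bits."""
    return [node ^ (1 << bit) for bit in range(dim)]


def torus_neighbors(x, y, nx, ny):
    """Four nearest neighbors on an nx-by-ny grid whose edges wrap around."""
    return [
        ((x - 1) % nx, y),
        ((x + 1) % nx, y),
        (x, (y - 1) % ny),
        (x, (y + 1) % ny),
    ]


print(hypercube_neighbors(0b101, 3))  # prints [4, 7, 1]
print(torus_neighbors(0, 0, 4, 4))    # prints [(3, 0), (1, 0), (0, 3), (0, 1)]
```

In a hypercube of dimension d every node has d links and any two nodes are at most d hops apart; in the torus every node has four links regardless of grid size.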


16
Hypercube
17
Torus
  •        o---o---o---o
  •        o---o---o---o
  •        o---o---o---o
  •        o---o---o---o
18
Chips (oversimplified)
  • Processor
  • Registers
  • Cache (multiple levels)
  • Memory
  • Disk
19
Fetch data
  • registers: O(1) clock cycles, capacity in bytes
  • L1 cache: O(10) cycles, capacity in Kbytes
  • L2 cache: O(20) cycles, capacity in Mbytes
  • Memory: O(100) cycles, capacity in Gbytes
  • Disk: O(1000) cycles, capacity in Tbytes
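A rough sketch of how these latencies combine into an average cost per access. It uses the order-of-magnitude cycle counts from the list above, while the hit rates are hypothetical. The model is deliberately simplified: an access is charged only the latency of the level where the data is found.

```python
def average_access_cycles(cache_levels, final_latency):
    """Expected cycles per access.

    cache_levels: (latency_cycles, hit_rate) pairs, fastest level first.
    final_latency: latency of the last level, which is assumed always to hit.
    """
    expected = 0.0
    reach = 1.0  # probability that an access gets as far as this level
    for latency, hit_rate in cache_levels:
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return expected + reach * final_latency


# L1 at 10 cycles, L2 at 20 cycles, main memory at 100 cycles.
# Hit rates of 90% (L1) and 80% (L2) are assumed for illustration.
good_locality = average_access_cycles([(10, 0.9), (20, 0.8)], 100)
poor_locality = average_access_cycles([(10, 0.5), (20, 0.5)], 100)
print(round(good_locality, 1), round(poor_locality, 1))
```

Even a small fraction of accesses that go all the way to memory contributes noticeably to the average.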
20
Speed of computation
  • Intelligent use of cache
  • Smart units pre-fetch data
  • multi-tasking to hide latency
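What "intelligent use of cache" means in practice can be illustrated with a toy model: a small direct-mapped cache that counts misses for two traversal orders of the same matrix. The cache geometry and matrix size are hypothetical, and real caches (set-associative, with hardware prefetch) behave less starkly.

```python
def count_misses(addresses, n_lines=64, line_words=8):
    """Count misses in a small direct-mapped cache."""
    cache = [None] * n_lines        # memory block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_words  # memory block containing this word
        index = block % n_lines     # the single cache line it may occupy
        if cache[index] != block:
            cache[index] = block
            misses += 1
    return misses


def row_major(n):
    """Word addresses of an n-by-n row-stored matrix, visited row by row."""
    return (i * n + j for i in range(n) for j in range(n))


def column_major(n):
    """The same matrix visited column by column (stride of n words)."""
    return (i * n + j for j in range(n) for i in range(n))


n = 256
row_misses = count_misses(row_major(n))
col_misses = count_misses(column_major(n))
print(row_misses, col_misses)  # prints 8192 65536
```

Walking memory in storage order uses every word of each fetched line, so only one access in eight misses; the strided walk misses on every access in this model.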