1
High Performance Computing: Performance Issues
2
Chips
  • Basic Architecture
    • CISC vs. RISC
    • Superscalar





3
Measure
  • Ttheor: theoretical peak performance, obtained by multiplying the clock rate by the number of CPUs and the number of FPUs per CPU
  • Treal: real performance on a specific operation, e.g. vector add and multiply
  • Tsustained: sustained performance on an application, e.g. CFD
    • Tsustained << Treal << Ttheor
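The peak-performance product described on this slide can be sketched in a few lines. This is only an illustration: the function names, the `results_per_fpu_per_cycle` parameter, and the example numbers (500 MHz, 4 CPUs, 2 FPUs per CPU, 300 MFLOPS sustained) are assumptions, not figures from the slides.

```python
def theoretical_peak_mflops(clock_mhz, n_cpus, fpus_per_cpu,
                            results_per_fpu_per_cycle=1):
    """Ttheor in MFLOPS: clock rate x number of CPUs x FPUs per CPU.

    results_per_fpu_per_cycle is an assumed extra factor for units that
    deliver more than one floating-point result per cycle.
    """
    return clock_mhz * n_cpus * fpus_per_cpu * results_per_fpu_per_cycle


def efficiency(measured_mflops, peak_mflops):
    """Fraction of the theoretical peak that a measurement reaches."""
    return measured_mflops / peak_mflops


# Hypothetical machine: 500 MHz clock, 4 CPUs, 2 FPUs per CPU.
peak = theoretical_peak_mflops(clock_mhz=500, n_cpus=4, fpus_per_cpu=2)
print(peak)                   # 4000
print(efficiency(300, peak))  # 0.075
```

A sustained figure far below the peak, as in the hypothetical 7.5% here, is the situation the inequality Tsustained << Treal << Ttheor describes.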




4
Memory
  • Performance degrades if the CPU has to wait for the data it operates on


  • A fast CPU needs adequately fast memory


  • Rule of thumb:
    • Memory in MB = Ttheor in MFLOPS
5
Chip Architecture
  • Use multiple Functional Units per processor
    • Cray T90 has 2 track vector units; NEC SX4, Fujitsu VPP300 -- 8 track vector units
    • superscalar e.g. IBM RS6000 Power2 uses 2 arithmetic units
  • Need to provide data to multiple functional units => fast memory access
  • The limiting factor is memory-processor bandwidth


6
Chips (oversimplified)
  • Processor
  • Registers
  • Cache (multiple levels)
  • Memory
  • Disk
7
Fetch data
  • Registers: O(1) clock cycles, capacity in bytes
  • L1 cache: O(10) cycles, capacity in Kbytes
  • L2 cache: O(20) cycles, capacity in Mbytes
  • Memory: O(100) cycles, capacity in Gbytes
  • Disk: O(1000) cycles, capacity in Tbytes
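To see how these latencies combine, the sketch below computes an expected fetch cost as a weighted average over the levels. The latencies are the order-of-magnitude values from this slide; the function name and the fractions of accesses served by each level are hypothetical.

```python
# Order-of-magnitude fetch latencies (in cycles) taken from the slide.
LATENCY_CYCLES = {"registers": 1, "L1": 10, "L2": 20,
                  "memory": 100, "disk": 1000}


def expected_cycles(fraction_served_by):
    """Average cycles per fetch, given the fraction of fetches served by each level.

    The fractions are an assumption of this sketch and must sum to 1.
    """
    total = sum(fraction_served_by.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(frac * LATENCY_CYCLES[level]
               for level, frac in fraction_served_by.items())


# Hypothetical workloads: mostly cache hits versus everything from memory.
cache_friendly = expected_cycles({"L1": 0.90, "L2": 0.08, "memory": 0.02})
memory_bound = expected_cycles({"memory": 1.0})
print(cache_friendly, memory_bound)
```

Even with only 2% of the fetches going to main memory in the first workload, those fetches contribute a noticeable share of the average cost.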
8
Speed of computation
  • Intelligent use of cache
  • Smart units pre-fetch data
  • multi-tasking to hide latency
  • Blocked algorithms
  • Contiguous storage
  • Avoid strides and random/non-deterministic access
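Why contiguous storage beats strided access can be illustrated by counting how many cache lines a loop touches. This is a minimal sketch: the function name and the line size of 8 words are assumptions.

```python
def cache_lines_touched(n_elements, stride, words_per_line=8):
    """Number of distinct cache lines touched when reading n_elements
    words that are `stride` words apart (line size is an assumed value)."""
    return len({(i * stride) // words_per_line for i in range(n_elements)})


# Contiguous access: 8 consecutive words share one cache line.
print(cache_lines_touched(1024, stride=1))  # 128
# Stride of 8 words: every access lands in a different cache line.
print(cache_lines_touched(1024, stride=8))  # 1024
```

Both loops use the same 1024 words, but the strided one must bring in eight times as many cache lines from memory.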
9
Machines
  • Prototype processors
    • Vector Processors
    • Superscalar Processors


  • Prototype Parallel Computers
    • Shared Memory
      • Without Cache
      • With Cache SMP
    • Distributed Memory



10
Vector Processors
  • Components
    • Vector registers
    • ADD/Logic pipeline and MULTIPLY Pipelines
    • Load/Store pipelines
    • Scalar registers + pipelines
11
Vector Processor
12
Vector Registers
  • Vector registers have a finite length: 32/64/128 elements
    • Strip mining to operate on longer vectors
    • Codes often manually restructured to vector-length loops
    • Sawtooth performance curve -- maximum at multiples of vector length
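Strip mining and the sawtooth curve can be sketched as follows. The vector length of 64 elements and all function names are assumptions; a vectorizing compiler would generate the equivalent loop structure itself.

```python
def strip_mine(n, vector_length=64):
    """Split the index range 0..n-1 into strips of at most vector_length."""
    return [(start, min(start + vector_length, n))
            for start in range(0, n, vector_length)]


def vector_add(a, b, vector_length=64):
    """Add two equal-length lists one strip (one 'vector register') at a time."""
    c = [0.0] * len(a)
    for lo, hi in strip_mine(len(a), vector_length):
        c[lo:hi] = [x + y for x, y in zip(a[lo:hi], b[lo:hi])]
    return c


def register_utilization(n, vector_length=64):
    """Fraction of vector-register slots doing useful work for a loop of length n."""
    strips = -(-n // vector_length)  # ceiling division
    return n / (strips * vector_length)


print(strip_mine(150))            # [(0, 64), (64, 128), (128, 150)]
print(register_utilization(128))  # 1.0
print(register_utilization(129))  # 0.671875
```

The drop from 1.0 at n = 128 to about 0.67 at n = 129 is one tooth of the sawtooth: a single extra element costs a whole additional strip.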
13
Vector Processors
  • Memory-processor bandwidth
    • performance depends completely on keeping the vector registers supplied with operands from memory
  • Size of main memory and extended memory
    • main memory has much higher bandwidth than extended memory but is more expensive
    • size determines -- size of problem that can be run
  • scalar registers/scalar processors for scalar instructions
  • I/O goes through a special processor
    • The T90 can produce data at 14400 MB/s, but a disk accepts only 20 MB/s, a ratio of 720. A single word can thus take 720 cycles on the Cray T90!
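The 720 quoted on this slide appears to be the ratio of the two data rates. The arithmetic is shown below; the function name is hypothetical, and reading the ratio as "cycles per word" follows the slide's own interpretation.

```python
def io_slowdown(cpu_rate_mb_s, disk_rate_mb_s):
    """How many times faster the processor produces data than the disk absorbs it."""
    return cpu_rate_mb_s / disk_rate_mb_s


# Rates quoted on the slide: 14400 MB/s for the T90, 20 MB/s for the disk.
print(io_slowdown(14400, 20))  # 720.0
```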
14
Superscalar Processor
  • main components are
    • Multiple ALU and FPU
    • data and instruction caches


    • superscalar since the ALUs and FPUs can operate in parallel, producing more than one result per cycle


    • e.g. IBM POWER2: 2 FPUs/ALUs that can operate in parallel, producing up to 4 results per cycle if the operands are in registers
15
Superscalar Processor
  • Used in workstations and as nodes of parallel supercomputers


16
Superscalar Processor
  • RISC architecture operating at very high clock speeds (>1GHz now --  more in a year)


  • The processor works only on data in registers, which come only from and go only to the data cache. If the data is not in cache (a “cache miss”), the processor is idle while another cache line (4-16 words) is fetched from memory!
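The cost of cache misses can be made concrete with a toy cache model. This is a minimal sketch assuming a direct-mapped cache with 64 lines of 8 words each; the slide does not describe the cache organization, and the function name is hypothetical.

```python
def count_misses(addresses, words_per_line=8, n_lines=64):
    """Count cache misses for a sequence of word addresses in a
    direct-mapped cache (assumed organization: n_lines lines of
    words_per_line words each)."""
    resident = [None] * n_lines  # which memory line each cache slot holds
    misses = 0
    for addr in addresses:
        line = addr // words_per_line
        slot = line % n_lines
        if resident[slot] != line:
            resident[slot] = line  # fetch the whole line from memory
            misses += 1
    return misses


# One sequential sweep: one miss per 8-word line.
print(count_misses(range(1024)))              # 128
# Two sweeps over data that fits in the cache: the second sweep is free.
print(count_misses(list(range(256)) * 2))     # 32
# Two sweeps over data larger than the cache: nothing is reused.
print(count_misses(list(range(1024)) * 2))    # 256
```

Multiplying the miss count by the memory latency gives a rough estimate of the cycles the processor spends idle.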




17
Superscalar Processor
  • Large off-chip Level 2 caches help with data availability. L1 cache data is accessed in ~1 cycle and L2 cache data in ~4 cycles, while memory access can take several tens of times longer!


  • Efficiency directly related to reuse of data in cache


  • Remedies:
    • blocked algorithms
    • contiguous storage
    • avoid strides and random/non-deterministic access


18
Superscalar Processor
  • Remedies:
    • Blocked algorithms, e.g. replace the single loop
      • do i=1,1000
        • a(i)=...
    • with 20 blocks of 50 elements each
      • do j=1,20
        • do i=(j-1)*50+1,j*50
          • a(i)=...
    • contiguous storage
    • avoid strides and random/non-deterministic access, e.g. indirect addressing
      • a(ix(i)) = ...
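The blocking idea on this slide pays off when several passes are made over the same data. The sketch below uses two made-up passes (scale, then shift) over 1000 elements in blocks of 50, the sizes used on the slide; the function names and the operations are assumptions. Python indices start at 0, unlike the Fortran-style loops above.

```python
def two_pass_unblocked(a):
    """Sweep the whole array twice: by the second sweep, the start of a
    large array may already have been evicted from cache."""
    a = list(a)
    for i in range(len(a)):
        a[i] *= 2.0
    for i in range(len(a)):
        a[i] += 1.0
    return a


def two_pass_blocked(a, block=50):
    """Apply both passes to one block before moving on, so each block
    is reused while it is still in cache."""
    a = list(a)
    for start in range(0, len(a), block):
        end = min(start + block, len(a))
        for i in range(start, end):
            a[i] *= 2.0
        for i in range(start, end):
            a[i] += 1.0
    return a


data = [float(i) for i in range(1000)]
assert two_pass_blocked(data) == two_pass_unblocked(data)
```

Both versions compute the same result. Only the order in which memory is visited changes, which is what makes blocking a safe transformation here.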

19
Superscalar Processors
  • Memory bandwidth critical to performance
    • Many engineering applications are difficult to optimize for cache efficiency
    • Application efficiency is then governed by memory bandwidth

  • Size of memory determines size of problem that can be solved
  • DMA (direct memory access) channels take memory-access duties for external requests (I/O, remote processor requests) away from the CPU
20
Shared Memory Parallel Computer
  • Memory in banks is accessed equally through a switch (crossbar) by the processors (usually vector)


  • Processors run “p” independent tasks with possibly shared data


  • Usually some compilers and preprocessors can extract the fine-grained parallelism available


  • Figure: Shared Memory Computer
21
Shared Memory Parallel ...
  • Memory contention and bandwidth limit the number of processors that may be connected


  • Memory contention can be reduced by increasing banks and reducing the bank busy time (bbt)


  • This type of parallel computer is closest in programming model to the general purpose single processor computer
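The effect of bank count and bank busy time on contention can be sketched with a small timing model. It assumes word-interleaved banks, one access issued per cycle, and a processor that stalls whenever the target bank is still busy; the function name and the default values (16 banks, bbt of 4 cycles) are hypothetical.

```python
def access_cycles(n_accesses, stride, n_banks=16, bank_busy_time=4):
    """Cycles needed to issue n_accesses memory references with a given stride.

    Assumed model: address i*stride lives in bank (i*stride) % n_banks,
    and a bank cannot accept a new access for bank_busy_time cycles.
    """
    free_at = [0] * n_banks  # cycle at which each bank becomes free again
    cycle = 0
    for i in range(n_accesses):
        bank = (i * stride) % n_banks
        cycle = max(cycle, free_at[bank])     # stall until the bank is free
        free_at[bank] = cycle + bank_busy_time
        cycle += 1                            # at most one access per cycle
    return cycle


print(access_cycles(64, stride=1))                      # 64: no stalls
print(access_cycles(64, stride=16))                     # 253: every access hits bank 0
print(access_cycles(64, stride=16, bank_busy_time=2))   # 127: shorter bbt helps
```

In this model, a stride equal to the number of banks sends every access to the same bank, so throughput is set entirely by the bank busy time.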
22
Symmetric Multiprocessors (SMP)
  • Processors are usually superscalar -- SUN Ultra, MIPS R10000 with large cache
  • Bus/crossbar used to connect to memory modules


  • With a bus, only one processor can access memory at a time
  • Figure: SMP Computer
23
Symmetric Multi-processors
  • With an interconnect (crossbar) instead of a bus, there will be memory contention


  • Data flows from memory to cache to processors;


  • Cache coherence:
    • If a piece of data is changed in one cache, then all other caches that contain that data must update the value. Hardware and software must take care of this.
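The update rule described above can be sketched as a toy write-update scheme. Real SMP hardware uses more elaborate protocols; the class name, the write-through policy, and the default value of 0 for untouched memory are assumptions made for this illustration.

```python
class CoherentCaches:
    """Toy model: several caches in front of one shared memory,
    kept coherent by updating every cached copy on a write."""

    def __init__(self, n_caches):
        self.memory = {}
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, cache_id, addr):
        cache = self.caches[cache_id]
        if addr not in cache:                 # miss: load from shared memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cache_id, addr, value):
        self.memory[addr] = value             # assumed write-through policy
        self.caches[cache_id][addr] = value
        for other in self.caches:             # update every cache holding a copy
            if addr in other:
                other[addr] = value


system = CoherentCaches(n_caches=2)
system.read(0, 100)          # both processors cache address 100
system.read(1, 100)
system.write(0, 100, 7)      # processor 0 changes the value
print(system.read(1, 100))   # 7: processor 1 sees the update
```

Without the update loop in `write`, processor 1 would keep reading its stale copy of address 100.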
24
Symmetric Multi-Processors
  • Performance depends dramatically on the reuse of data in cache;
    • Fetching data from larger memory with potential memory contention can be expensive!
    • Caches and cache lines also are bigger


  • The large L2 cache really plays the role of local fast memory, while the memory banks act more like extended memory accessed in blocks
25
Distributed Memory Parallel Computer
  • Prototype DMP
  • Processors are superscalar RISC with only LOCAL memory


  • Each processor can only work on data in local memory
  • Communication required for access to remote memory




26
Distributed Memory Parallel Computer
  • Problems need to be broken up into independent tasks with independent memory -- naturally matches a data-based decomposition of the problem using an “owner computes” rule


  • Parallelization mostly at high granularity level controlled by user -- difficult for compilers/ automatic parallelization tools
  • Computers are scalable to very large numbers of processors
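A data-based decomposition with the “owner computes” rule can be sketched as a block partition of an index range. The function names and the way leftover items are spread over the first processors are assumptions; other partitioning schemes are equally valid.

```python
def block_range(rank, n_procs, n_items):
    """Half-open index range [start, end) owned by processor `rank`."""
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)   # first `extra` ranks get one more item
    size = base + (1 if rank < extra else 0)
    return start, start + size


def owner(i, n_procs, n_items):
    """Rank of the processor that owns, and therefore computes, item i."""
    for rank in range(n_procs):
        lo, hi = block_range(rank, n_procs, n_items)
        if lo <= i < hi:
            return rank
    raise IndexError(i)


# 10 items on 4 processors.
print([block_range(r, 4, 10) for r in range(4)])
# [(0, 3), (3, 6), (6, 8), (8, 10)]
print(owner(6, 4, 10))  # 2
```

Each processor then loops only over its own range and needs communication only for values owned by someone else.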


27
Distributed Memory Parallel Computer
  • Hybrid Parallel Computer
  • NUMA: classification based on non-uniform memory access


  • Intel Paragon (1st teraflop machine had 4 Pentiums per node with a bus)
  • HP Exemplar has a bus at each node
28
Distributed Memory Parallel Computer
  • Semi-autonomous memory: a processor can access remote memory using memory control units (MCU)


  • CRAY T3E and SGI Origin 2000 (ccNUMA)
29
Distributed Memory Parallel Computer
  • Fully autonomous memory
  • Memory and processors are equally distributed over the network


  • Tera MTA is the only example


  • Latency and data transfer from memory are at the speed of the network!
30
Accessing Distributed Memory
  • Message Passing
    • User transfers all data using explicit send/receive instructions
    • Synchronous message passing can be slow
    • Requires programming with a NEW programming model!
    • User must optimize communication
    • Asynchronous/one-sided get and put are faster but need more care in programming
    • Codes used to be machine specific (Intel NEXUS etc.) until standardized to PVM (Parallel Virtual Machine) and subsequently MPI (Message Passing Interface)
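The explicit send/receive style can be imitated on a single machine with threads and queues. This is only an emulation for illustration: on a real distributed-memory machine the `put` and `get` calls below would be message-passing library calls such as MPI send and receive, and all names here are hypothetical.

```python
import queue
import threading


def worker(rank, inboxes, local_data, results):
    """Each worker sees only its own local_data; anything else must arrive as a message."""
    if rank == 0:
        inboxes[1].put(sum(local_data))    # explicit send to rank 1
    else:
        partial = inboxes[rank].get()      # explicit blocking receive
        results["total"] = partial + sum(local_data)


def run():
    inboxes = [queue.Queue(), queue.Queue()]   # one mailbox per "processor"
    data = [[1, 2, 3], [4, 5, 6]]              # data distributed over two ranks
    results = {}
    threads = [threading.Thread(target=worker,
                                args=(r, inboxes, data[r], results))
               for r in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["total"]


print(run())  # 21
```

The blocking `get` plays the role of a synchronous receive: rank 1 does nothing until the message from rank 0 has arrived.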


31
Accessing Distributed Memory

  • Global distributed memory
    • Physically distributed and globally addressable -- Cray T3E/ SGI Origin 2000
    • User formally accesses remote memory as if it were local -- operating system/compilers will translate such accesses to fetches/stores over the communication network
    • High Performance FORTRAN (HPF) -- software realization of distributed memory -- arrays etc. when declared can be distributed using compiler directives. Compiler translates remote memory access to appropriate calls (message passing/ OS calls as supported by the hardware)
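The translation from a global index to a remote access can be sketched with a block-distributed array. The class and method names are hypothetical, and the lists below stand in for the local memories of separate processors; in a real system the `get` and `put` bodies would be network fetches and stores generated by the compiler or operating system.

```python
class GlobalArray:
    """Toy globally addressable array, physically split into blocks,
    one block per processor (block size = ceiling of n_items / n_procs)."""

    def __init__(self, n_items, n_procs):
        self.block = -(-n_items // n_procs)  # ceiling division
        self.local = [
            [0.0] * min(self.block, max(0, n_items - p * self.block))
            for p in range(n_procs)
        ]

    def locate(self, i):
        """Translate a global index into (owning processor, local offset)."""
        return divmod(i, self.block)

    def get(self, i):
        proc, offset = self.locate(i)
        return self.local[proc][offset]      # would be a remote fetch

    def put(self, i, value):
        proc, offset = self.locate(i)
        self.local[proc][offset] = value     # would be a remote store


a = GlobalArray(n_items=10, n_procs=4)
print(a.locate(7))   # (2, 1): element 7 lives on processor 2 at offset 1
a.put(7, 3.5)
print(a.get(7))      # 3.5
```

The user writes `a.get(7)` as if the array were local; the `locate` step is what the compiler or runtime inserts behind the scenes.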
32
Processor interconnects/topologies
  • Buses
    • Lower cost, but only one pair of devices (processors, memories, etc.) can communicate at a time, e.g. Ethernet used to link workstation networks

  • Switches
    • Like the telephone network: can sustain many-to-many communications; higher cost!
    • Critical measure is bisection bandwidth: how much data can be passed between the two halves of the machine
35
Processor interconnects/topologies
  • Workstation network on ethernet




  • Very high latency -- processors must participate in communication
36
Topology of connections
  • Hypercube
  • Torus


37
Processor interconnects/topologies
  • 1D and 2D meshes and rings/tori
38
Processor interconnects/topologies
  • 3D meshes and rings/tori
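The difference between a mesh and a ring/torus is only whether the edges wrap around. A minimal sketch for any number of dimensions follows; the function name and coordinate convention are assumptions.

```python
def neighbors(coord, shape, wrap):
    """Direct neighbors of a node in a mesh (wrap=False) or torus (wrap=True).

    coord and shape are tuples with one entry per dimension.
    Assumes every dimension has at least 3 nodes when wrap is True.
    """
    result = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            c = coord[axis] + step
            if wrap:
                c %= size                 # wrap around the edge
            elif not 0 <= c < size:
                continue                  # mesh: no link past the edge
            node = list(coord)
            node[axis] = c
            result.append(tuple(node))
    return result


# Corner node of a 4x4 grid.
print(neighbors((0, 0), (4, 4), wrap=False))  # [(1, 0), (0, 1)]
print(neighbors((0, 0), (4, 4), wrap=True))   # [(3, 0), (1, 0), (0, 3), (0, 1)]
```

In the torus every node has the same number of links, while mesh nodes on an edge or corner have fewer.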
39
Processor interconnects/topologies
  • D-dimensional hypercubes
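In a D-dimensional hypercube, the 2^D nodes can be numbered so that neighbors differ in exactly one bit. The sketch below uses that numbering; the function names are hypothetical.

```python
def hypercube_neighbors(node, dim):
    """Neighbors of `node` in a dim-dimensional hypercube: flip one bit at a time."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hop_distance(a, b):
    """Minimum number of links between two nodes: the number of differing bits."""
    return bin(a ^ b).count("1")


def bisection_links(dim):
    """Links cut when the hypercube is split into two equal halves."""
    return 2 ** (dim - 1)


print(hypercube_neighbors(0b000, 3))  # [1, 2, 4]
print(hop_distance(0b000, 0b111))     # 3
print(bisection_links(3))             # 4
```

Each node has D links and no two nodes are more than D hops apart, which is why the hypercube scales well in latency at the price of more wiring per node.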
40
Processor Scheduling
  • Space Sharing
    • Processor banks of 4/8/16 etc. assigned to users for specific times

  • Time sharing on processor partitions



  • Livermore Gang Scheduling