1	High Performance Computing: Paradigms
2	Standard Measure flops – floating operations per second mega, giga, tera, peta often taken as aggregate ops of all the processors of a system technology of HPC machines is in the low teraflops regime workstations in the gigaflops range aggregate flops misleading measure
3	flops clockspeed of chips is very fast most chips can perform 2-4 ops per second and can usually do some other arithmetic but the chips can’t deliver data to the processors fast enough to keep up with clock
4	Pipelining for example, pipelining break up data into several sub-operations that execute in different sub-units in one cycle vector pipelining used to be the mark of supercomputers commodity clusters replaced vector machines; but making a comeback
5	Instruction Pipelining
6	Instruction Pipelining (cont) Almost all current computers use some pipelining e.g. IBM RS6000 Speedup of instruction pipelining cannot always be achieved Next instruction may not be known till execution - e.g. branch Data for execution may not be available
7	Speeding up machines reduce clock cycle pipelining internal parallelism external parallelism (multiple processors)
8	Hardware Classification (Flynn) SISD single instruction, single data. this is the traditional von Neumann model of computing – that is, a basic serial machine SIMD single instruction, multiple data. classical vector supercomputers, and the old Connection Machines
9	Classification (cont) MISD multiple instruction, single data. no current hpc machine based on this principal MIMD multiple instruction, multiple data. usual parallelism model could be shared memory, or distributed memory
10	Shared Memory Use multiple processors Shared Memory (SMP: Symmetric Multi-processors) many processors accessing the same memory limited by memory-processors bandwidth SUN Ultra2, SGI Origin, Compaq
11	Distributed Memory Distributed memory many processors each with local memory and some type of high speed interconnect
12	Today’s machines multiple nodes, each with several processors that share memory locally best – and worst – of both models programming difficulties
13	Programming Model SPMD (Single Program Multiple Data) single program is run on all processors with different data each processor knows its ID -- thus if(proc ID .eq. N) then …. Else …. Constructs can be used for program control
14	Programming Model MPMD (Multiple Program Multiple Data) Different programs run on different processors often a master-slave model is used
15	Topology of connections Hypercube Torus
16	Hypercube
17	Torus ·¾·¾·¾· ·¾·¾·¾· ·¾·¾·¾· ·¾·¾·¾·
18	Chips (oversimplified) Processor Registers Cache (multiple levels) Memory Disk
19	Fetch data registers O(1) clock cycle bytes L1 cache O(10) cycles Kbytes L2 cache O(20) cycles Mbytes Memory O(100) cycles Gbytes Disk O(1000) cycles Tbytes
20	Speed of computation Intelligent use of cache Smart units pre-fetch data multi-tasking to hide latency