1	High Performance Computing: Performance Issues
2	Chips: basic architecture; CISC vs. RISC; superscalar
3	Measure: T_theor: theoretical peak performance, obtained by multiplying the clock rate by the number of CPUs and the number of FPUs per CPU. T_real: real performance on some specific operation, e.g. vector add and multiply. T_sustained: sustained performance on an application, e.g. CFD. In general, T_sustained << T_real << T_theor.
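The peak-performance product can be written out as a small sketch. The function name and the example machine (4 CPUs at 500 MHz with 2 FPUs each) are hypothetical, and the sketch assumes each FPU completes one floating-point result per clock cycle.

```python
def theoretical_peak_mflops(clock_mhz, n_cpus, fpus_per_cpu):
    """T_theor in MFLOPS, assuming one result per FPU per clock cycle."""
    return clock_mhz * n_cpus * fpus_per_cpu


# Hypothetical machine, not taken from the slides.
peak = theoretical_peak_mflops(clock_mhz=500, n_cpus=4, fpus_per_cpu=2)
```

T_real and T_sustained cannot be computed this way; they have to be measured on a kernel or an application.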
4	Memory: Performance degrades if the CPU has to wait for data to operate on. A fast CPU therefore needs adequately fast memory. Rule of thumb: memory in MB = T_theor in MFLOPS.
5	Chip Architecture: Use multiple functional units per processor. The Cray T90 has 2-track vector units; the NEC SX4 and Fujitsu VPP300 have 8-track vector units. Superscalar example: the IBM RS6000 Power2 uses 2 arithmetic units. Multiple functional units must be supplied with data, which requires fast memory access. The limiting factor is memory-processor bandwidth.
6	Chips (oversimplified): Processor, Registers, Cache (multiple levels), Memory, Disk
7	Fetch data
	Level       Access time        Size
	Registers   O(1) clock cycle   bytes
	L1 cache    O(10) cycles       Kbytes
	L2 cache    O(20) cycles       Mbytes
	Memory      O(100) cycles      Gbytes
	Disk        O(1000) cycles     Tbytes
8	Speed of computation: Intelligent use of cache; smart units pre-fetch data; multi-tasking to hide latency; blocked algorithms; contiguous storage; avoid strides and random/non-deterministic access
9	Machines: Prototype processors: vector processors; superscalar processors. Prototype parallel computers: shared memory (without cache; with cache, SMP); distributed memory
10	Vector Processors: Components: vector registers; ADD/logic pipeline and MULTIPLY pipelines; load/store pipelines; scalar registers + pipelines
11	Vector Processor
12	Vector Registers: Vector registers have finite length (32/64/128 elements). Strip mining is used to operate on longer vectors. Codes are often manually restructured into vector-length loops. Sawtooth performance curve, with maxima at multiples of the vector length.
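Strip mining and the sawtooth shape can be illustrated with a minimal sketch. The helper names are hypothetical, and register utilization is used here only as a rough stand-in for the performance curve: it reaches 1.0 at multiples of the vector length and drops right after each multiple, because a nearly empty extra strip is needed.

```python
import math


def strip_mine(n, vector_length):
    """Split a loop of n iterations into half-open strips [start, end)
    of at most vector_length iterations each."""
    return [(start, min(start + vector_length, n))
            for start in range(0, n, vector_length)]


def register_utilization(n, vector_length):
    """Fraction of vector register slots doing useful work over all strips."""
    if n <= 0:
        raise ValueError("n must be positive")
    strips = math.ceil(n / vector_length)
    return n / (strips * vector_length)


# Hypothetical vector length of 64: 150 iterations need three strips.
strips = strip_mine(150, 64)
```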
13	Vector Processors: Memory-processor bandwidth: performance depends completely on keeping the vector registers supplied with operands from memory. Size of main memory and extended memory: the bandwidth of main memory is much higher, but main memory is more expensive; its size determines the size of problem that can be run. Scalar registers/scalar processors handle scalar instructions. I/O goes through a special processor: the T90 can produce data at 14400 MB/s, while a disk handles 20 MB/s. Thus a single word can take 720 cycles on the Cray T90!
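The 720 figure is the ratio of the two data rates quoted on this slide. The sketch below only reproduces that arithmetic; the function name is hypothetical, and reading the ratio as "cycles per word" assumes the processor could otherwise produce one word per cycle.

```python
def rate_mismatch(producer_mb_per_s, consumer_mb_per_s):
    """Words the producer could generate while the consumer absorbs one."""
    return producer_mb_per_s / consumer_mb_per_s


# Rates quoted on the slide: Cray T90 output vs. disk.
t90_vs_disk = rate_mismatch(14400, 20)
```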
14	Superscalar Processor: Main components are multiple ALUs and FPUs plus data and instruction caches. It is called superscalar because the ALUs and FPUs can operate in parallel, producing more than one result per cycle. E.g. the IBM POWER2 has 2 FPUs/ALUs that can operate in parallel, producing up to 4 results per cycle if the operands are in registers.
15	Superscalar Processor: Workstations and nodes of parallel supercomputers
16	Superscalar Processor: RISC architecture operating at very high clock speeds (>1 GHz now, more in a year). The processor works only on data in registers, which come only from and go only to the data cache. If data is not in cache (a “cache miss”), the processor is idle while another cache line (4-16 words) is fetched from memory!
17	Superscalar Processor: Large off-chip Level 2 caches help with data availability. L1 cache data is accessed in ~1 cycle, L2 cache in ~4 cycles, and memory can take several tens of times that! Efficiency is directly related to reuse of data in cache. Remedies: blocked algorithms, contiguous storage, avoiding strides and random/non-deterministic access.
18	Superscalar Processor: Remedies: (1) Blocked algorithms, e.g. the loop
	    do i=1,1000
	      a(i)=...
	    end do
	becomes
	    do j=1,20
	      do i=(j-1)*50+1,j*50
	        a(i)=...
	      end do
	    end do
	(2) Contiguous storage. (3) Avoid strides and random/non-deterministic access such as a(ix(i)) = ...
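The blocking remedy on this slide can be sketched as follows. The loop body (doubling each element) is a hypothetical placeholder, since the slide leaves the body elided, and the function names are illustrative. The sketch only shows that the blocked index pattern visits the same elements once each; in interpreted Python no cache benefit should be expected, and in compiled code the gain appears when several operations reuse a block while it is still in cache.

```python
def update_unblocked(a):
    """Single sweep over the whole array."""
    for i in range(len(a)):
        a[i] = 2.0 * a[i]          # placeholder body
    return a


def update_blocked(a, block=50):
    """Same sweep, split into contiguous blocks of `block` elements."""
    n = len(a)
    for start in range(0, n, block):
        for i in range(start, min(start + block, n)):
            a[i] = 2.0 * a[i]      # placeholder body
    return a


data = [float(i) for i in range(1000)]
same = update_blocked(list(data)) == update_unblocked(list(data))
```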
19	Superscalar Processors: Memory bandwidth is critical to performance. Many engineering applications are difficult to optimize for cache efficiency, so application efficiency is determined by memory bandwidth. The size of memory determines the size of problem that can be solved. DMA (direct memory access) channels take memory access duties for external requests (I/O, remote processor requests) away from the CPU.
20	Shared Memory Parallel Computer: Memory in banks is accessed equally through a switch (crossbar) by the processors (usually vector). Processors run “p” independent tasks with possibly shared data. Some compilers and preprocessors can usually extract the fine-grained parallelism available. [Figure: Shared Memory Computer]
21	Shared Memory Parallel ...: Memory contention and bandwidth limit the number of processors that may be connected. Memory contention can be reduced by increasing the number of banks and reducing the bank busy time (bbt). This type of parallel computer is closest in programming model to the general-purpose single-processor computer.
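Why more banks and a shorter bank busy time reduce contention can be seen in a small sketch. It assumes word-interleaved banks (bank = address mod number of banks) and one memory access issued per cycle; neither assumption is stated on the slide, and the function names are hypothetical.

```python
from math import gcd


def banks_touched(stride, n_banks):
    """Distinct banks hit by a constant-stride access stream (stride >= 1)
    when consecutive words are interleaved across n_banks banks."""
    return n_banks // gcd(stride, n_banks)


def stream_stalls(stride, n_banks, bank_busy_time):
    """True if some bank is revisited before its busy time (in cycles)
    has elapsed, with one access issued per cycle."""
    return banks_touched(stride, n_banks) < bank_busy_time
```

With 16 banks and a busy time of 4 cycles, a unit-stride stream cycles through all 16 banks and never waits, while a stride of 8 alternates between 2 banks and stalls.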
22	Symmetric Multiprocessors (SMP): Processors are usually superscalar (SUN Ultra, MIPS R10000) with large caches. A bus or crossbar connects them to the memory modules. With a bus, only 1 processor can access memory at a time. [Figure: SMP Computer]
23	Symmetric Multi-processors: If an interconnect is used, there will be memory contention. Data flows from memory to cache to processors. Cache coherence: if a piece of data is changed in one cache, then all other caches that contain that data must update the value. Hardware and software must take care of this.
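The coherence requirement can be illustrated with a toy model. This is a minimal write-update, write-through sketch with hypothetical class and method names; it is not a description of any real SMP's protocol, which would typically track per-line states in hardware.

```python
class ToySystem:
    """Shared memory plus one private cache (a dict) per processor."""

    def __init__(self, n_caches):
        self.memory = {}
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                 # cache miss: fetch from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        self.memory[addr] = value             # write through to memory
        self.caches[cpu][addr] = value
        for other, cache in enumerate(self.caches):
            if other != cpu and addr in cache:
                cache[addr] = value           # update every other cached copy


system = ToySystem(n_caches=2)
system.memory[16] = 5
before = (system.read(0, 16), system.read(1, 16))   # both caches now hold a copy
system.write(0, 16, 7)
after = system.read(1, 16)                           # CPU 1 sees the new value
```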
24	Symmetric Multi-Processors: Performance depends dramatically on the reuse of data in cache; fetching data from the larger memory, with potential memory contention, can be expensive! Caches and cache lines are also bigger. The large L2 cache really plays the role of local fast memory, while the memory banks are more like extended memory accessed in blocks.
25	Distributed Memory Parallel Computer: Processors are superscalar RISC with only LOCAL memory. Each processor can only work on data in its local memory. Communication is required for access to remote memory. [Figure: Prototype DMP]
26	Distributed Memory Parallel Computer: Problems need to be broken up into independent tasks with independent memory, which naturally matches a data-based decomposition of the problem using an “owner computes” rule. Parallelization is mostly at a high granularity level controlled by the user, and is difficult for compilers/automatic parallelization tools. These computers are scalable to very large numbers of processors.
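A data-based decomposition with the "owner computes" rule can be sketched as below. The balanced block distribution, the function names, and the squaring operation are illustrative assumptions; the loop over ranks is a serial stand-in for tasks that would run on separate processors.

```python
def owned_range(rank, n_procs, n):
    """Half-open range [lo, hi) of global indices owned by `rank`
    when n items are split into n_procs nearly equal blocks."""
    base, extra = divmod(n, n_procs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def owner_computes(data, n_procs):
    """Each rank updates only the elements it owns."""
    result = list(data)
    for rank in range(n_procs):               # serial stand-in for n_procs tasks
        lo, hi = owned_range(rank, n_procs, len(data))
        for i in range(lo, hi):
            result[i] = data[i] * data[i]     # placeholder computation
    return result


ranges = [owned_range(r, 4, 10) for r in range(4)]
```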
27	Distributed Memory Parallel Computer: Hybrid Parallel Computer. NUMA: non-uniform memory access based classification. Intel Paragon (1st teraflop machine had 4 Pentiums per node with a bus); HP Exemplar has a bus at the node.
28	Distributed Memory Parallel Computer: Semi-autonomous memory: a processor can access remote memory using memory control units (MCU). Examples: CRAY T3E and SGI Origin 2000 (ccNUMA).
29	Distributed Memory Parallel Computer: Fully autonomous memory: memory and processors are equally distributed over the network. The Tera MTA is the only example. Latency and data transfer from memory are at the speed of the network!
30	Accessing Distributed Memory: Message Passing: The user transfers all data using explicit send/receive instructions. Synchronous message passing can be slow. This is programming with a NEW programming model! The user must optimize communication. Asynchronous/one-sided get and put are faster but need more care in programming. Codes used to be machine specific (Intel NEXUS etc.) until standardized to PVM (parallel virtual machine) and subsequently MPI (message passing interface).
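The explicit send/receive style can be imitated in a few lines. This sketch is not MPI or PVM: it uses two threads in one process with standard-library queues standing in for the network, so memory is in fact shared and only the programming pattern is illustrated. The function name is hypothetical.

```python
import queue
import threading


def run_exchange(values):
    """Master sends a chunk to a worker and blocks until the partial sum returns."""
    to_worker = queue.Queue()
    to_master = queue.Queue()

    def worker():
        chunk = to_worker.get()        # blocking receive
        to_master.put(sum(chunk))      # send the result back

    thread = threading.Thread(target=worker)
    thread.start()
    to_worker.put(values)              # explicit send
    total = to_master.get()            # blocking receive
    thread.join()
    return total


result = run_exchange([1, 2, 3, 4])
```

Both receives block until data arrives, which is the synchronous behaviour the slide warns can be slow.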
31	Accessing Distributed Memory: Global distributed memory: Physically distributed and globally addressable (Cray T3E/SGI Origin 2000). The user formally accesses remote memory as if it were local; the operating system/compilers translate such accesses to fetches/stores over the communication network. High Performance FORTRAN (HPF) is a software realization of distributed memory: arrays etc., when declared, can be distributed using compiler directives. The compiler translates remote memory accesses to appropriate calls (message passing/OS calls, as supported by the hardware).
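How a global index might be translated into an owner and a local offset can be sketched with a toy globally addressable array. The class, its methods, and the fixed-size block distribution are illustrative assumptions, not the mechanism of any particular compiler or machine; the counter stands in for traffic over the communication network.

```python
class GlobalArray:
    """Array of n values split into fixed-size blocks over n_procs local stores."""

    def __init__(self, n, n_procs):
        self.block = -(-n // n_procs)          # ceil(n / n_procs)
        self.local = [[0.0] * max(0, min(self.block, n - p * self.block))
                      for p in range(n_procs)]
        self.remote_accesses = 0

    def locate(self, i):
        """Translate global index i to (owner, local offset)."""
        return divmod(i, self.block)

    def get(self, rank, i):
        owner, offset = self.locate(i)
        if owner != rank:
            self.remote_accesses += 1          # would be a network fetch
        return self.local[owner][offset]

    def put(self, rank, i, value):
        owner, offset = self.locate(i)
        if owner != rank:
            self.remote_accesses += 1          # would be a network store
        self.local[owner][offset] = value


g = GlobalArray(n=10, n_procs=4)
g.put(0, 7, 3.5)        # rank 0 writes an element owned by rank 2 (remote)
value = g.get(2, 7)     # rank 2 reads its own element (local)
```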
32	Processor interconnects/topologies: Buses: lower cost, but only one pair of devices (processors/memories etc.) can communicate at a time; e.g. ethernet used to link workstation networks. Switches: like the telephone network, can sustain many-to-many communications; higher cost! The critical measure is bisection bandwidth: how much data can be passed between units.
33	Processor interconnects/topologies
34	Processor interconnects/topologies
35	Processor interconnects/topologies: Workstation network on ethernet: very high latency; processors must participate in communication
36	Topology of connections: Hypercube, Torus
37	Processor interconnects/topologies: 1D and 2D meshes and rings/toruses
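Neighbour connections in a ring and a 2D torus can be computed with modular arithmetic, as in this minimal sketch. The function names and the coordinate numbering are assumptions; a mesh differs only in that the wrap-around links at the edges are absent.

```python
def ring_neighbors(i, n):
    """Left and right neighbours of node i in a ring of n nodes."""
    return ((i - 1) % n, (i + 1) % n)


def torus_neighbors(x, y, nx, ny):
    """Four neighbours of node (x, y) in an nx-by-ny torus with wrap-around."""
    return [((x - 1) % nx, y), ((x + 1) % nx, y),
            (x, (y - 1) % ny), (x, (y + 1) % ny)]
```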
38	Processor interconnects/topologies: 3D meshes and rings/toruses
39	Processor interconnects/topologies: D-dimensional hypercubes
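In a D-dimensional hypercube with nodes numbered 0 to 2^D - 1, two nodes are linked when their binary labels differ in exactly one bit. The sketch below uses that numbering convention (an assumption, though a common one) with hypothetical function names.

```python
def hypercube_neighbors(node, dim):
    """Neighbours of `node`: flip each of the dim label bits in turn."""
    return [node ^ (1 << k) for k in range(dim)]


def hops(a, b):
    """Minimum number of links between nodes a and b (bits that differ)."""
    return bin(a ^ b).count("1")


def bisection_links(dim):
    """Links cut when the hypercube is split into two equal halves (dim >= 1)."""
    return 2 ** (dim - 1)
```

Each node has D links and the longest route is D hops, so the number of links cut by a bisection grows with machine size.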
40	Processor Scheduling: Space sharing: processor banks of 4/8/16 etc. assigned to users for specific times. Time sharing on processor partitions: Livermore Gang Scheduling.