1	IBM SP
2	IBM RS/6000 SP
3	POWER2 Processor: different versions, with different clock frequency, cache size, and bandwidth
4	POWER2 ARCHITECTURE
5	POWER2: Dual fixed-point and floating-point units, with a multiply/add in each floating-point unit. Max. 4 floating-point results per cycle. The ICU (with a 32 KB instruction cache) can execute a branch and a condition per cycle. Up to 8 instructions may be issued and executed per cycle: truly SUPERSCALAR!
6	Wide 77 Node Performance: Theoretical peak performance: 2 × 77 MHz = 154 MFLOPS for the dyad; 4 × 77 MHz = 308 MFLOPS for the triad. Cache effects dominate performance. 256 KB cache and a 256-bit path to cache and from cache to memory: 2 words (8 bytes each) may be fetched and 2 words stored per cycle.
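The peak figures on this slide follow from simple arithmetic, sketched below. The function and constant names are illustrative, and the sketch assumes that a dyad yields one floating-point result per FPU per cycle while a multiply/add yields two.

```python
# Theoretical peak of the 77 MHz wide node, following the slide's arithmetic.
CLOCK_MHZ = 77   # clock rate implied by the slide's 2 x 77 and 4 x 77
NUM_FPUS = 2     # dual floating-point units

def peak_mflops(flops_per_fpu_per_cycle, clock_mhz=CLOCK_MHZ, num_fpus=NUM_FPUS):
    """Peak MFLOPS = FPUs x flops per FPU per cycle x clock in MHz."""
    return num_fpus * flops_per_fpu_per_cycle * clock_mhz

dyad_peak = peak_mflops(1)    # one add or one multiply per FPU per cycle
triad_peak = peak_mflops(2)   # a multiply/add counts as two flops

print(dyad_peak, triad_peak)  # 154 308
```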
7	Expected Performance: For the dyad a_i = b_i * c_i or a_i = b_i + c_i, each result needs 2 loads and 1 store, i.e. 6 memory references per cycle to feed 2 FPUs, but only 4 are available: (2 × 77)(4/6) = 102.7 MFLOPS. For the linked triad a_i = b_i + s * c_i (2 loads, 1 store): (4 × 77)(4/6) = 205.3 MFLOPS. For the vector triad a_i = b_i + c_i * d_i (3 loads, 1 store): (4 × 77)(4/8) = 154 MFLOPS.
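The same bandwidth argument can be written as a small model: peak performance is scaled by the ratio of memory references available per cycle to those needed to keep both FPUs busy. This is a minimal sketch with illustrative names; it assumes data is already in cache and that the scalar s of the linked triad stays in a register.

```python
def expected_mflops(peak, loads, stores, num_fpus=2, refs_per_cycle=4):
    """Scale peak MFLOPS by available / required memory references per cycle."""
    needed = num_fpus * (loads + stores)   # references needed to feed all FPUs
    return peak * min(1.0, refs_per_cycle / needed)

# (peak MFLOPS, loads, stores) per result, taken from the slide
kernels = {
    "dyad":         (154, 2, 1),   # a_i = b_i * c_i  or  a_i = b_i + c_i
    "linked triad": (308, 2, 1),   # a_i = b_i + s * c_i
    "vector triad": (308, 3, 1),   # a_i = b_i + c_i * d_i
}

for name, (peak, loads, stores) in kernels.items():
    print(f"{name}: {expected_mflops(peak, loads, stores):.1f} MFLOPS")
# dyad: 102.7 MFLOPS
# linked triad: 205.3 MFLOPS
# vector triad: 154.0 MFLOPS
```

The `min(1.0, ...)` guard keeps the estimate at peak for any kernel that needs no more references than the memory path can supply.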
8	Cache Hit/Miss: The performance numbers above assume that the data is available in cache. If data is not in cache, it must be fetched from memory in cache lines of 256 bytes each, at a much slower pace.
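To see why the line size matters, the sketch below counts how many 8-byte words fit in one 256-byte line and estimates the fraction of accesses that touch a new line. It is a rough model, not a measurement: it assumes one sweep over a long array with a fixed stride and no reuse of lines already in cache. The names are illustrative.

```python
LINE_BYTES = 256                            # cache line size from the slide
WORD_BYTES = 8                              # one double-precision word
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES   # 32 words per line

def miss_fraction(stride_words):
    """Approximate fraction of accesses that start a new cache line."""
    return min(1.0, stride_words / WORDS_PER_LINE)

print(WORDS_PER_LINE)       # 32
print(miss_fraction(1))     # 0.03125 (one miss per 32 contiguous words)
print(miss_fraction(32))    # 1.0 (every access fetches a new line)
```

Under these assumptions a stride of 32 words or more uses only one word of each line fetched, so the slow memory path is exercised on every access.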
10	TERM PAPER: Based on the analysis of the POWER2 processor and IBM SP presented here, prepare a similar analysis (including estimates of performance) for the new NEC SX chips of the Earth Simulator or for the POWER4 chips.