|
1
|
- Some material from a lecture by
- David H. Bailey
- NERSC
|
|
2
|
- 1 Pflop/s (10^15 flop/s) in computing power.
- Will likely need between 10,000 and 1,000,000 processors.
- With ~10 Tbyte - 1 Pbyte main memory,
- ~1 Pbyte - 100 Pbyte on-line storage,
- and between 100 Pbyte and 10 Ebyte archival storage.
|
|
3
|
- The system will require I/O bandwidth of similar scale
- Estimated cost today ~ $50 billion
- It would consume about 1,000 MW of electric power.
- Demand will be in place by 2010; may be affordable by then too
|
|
4
|
- Nuclear weapons stewardship.
- Cryptology and digital signal processing.
- Satellite data processing.
- Climate and environmental modeling.
- Design of advanced aircraft and spacecraft.
- Nanotechnology.
|
|
5
|
- Design of practical fusion energy systems.
- Large-scale DNA sequencing.
- 3-D protein molecule simulations.
- Global-scale economic modeling.
- Virtual reality design tools
|
|
6
|
- Characteristic                 1999   2001   2003   2006   2009
- Feature size (micron)          0.18   0.15   0.13   0.10   0.07
- DRAM size (Mbit)                256   1024   1024   4096    16K
- RISC processor (MHz)           1200   1400   1600   2000   2500
- Transistors (millions)           21     39     77    203    521
- Cost per transistor (ucents)   1735   1000    580    255    100
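The growth rates implied by the table can be estimated with a short calculation. This is a minimal sketch that assumes steady exponential growth between the 1999 and 2009 columns; the function name `doubling_time_years` is illustrative and not from the lecture.

```python
import math

def doubling_time_years(start_value, end_value, years):
    """Years needed for a quantity to double, assuming steady exponential growth."""
    return years / math.log2(end_value / start_value)

# Transistor counts (millions) from the table: 21 in 1999, 521 in 2009.
transistor_doubling = doubling_time_years(21, 521, 2009 - 1999)

# RISC clock rates (MHz) from the table: 1200 in 1999, 2500 in 2009.
clock_doubling = doubling_time_years(1200, 2500, 2009 - 1999)

print(f"transistor count doubles every {transistor_doubling:.1f} years")
print(f"clock rate doubles every {clock_doubling:.1f} years")
```

Under this assumption, transistor counts double roughly every two years while clock rates take most of the decade to double.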
|
|
7
|
- Observations:
- Moore’s Law of increasing density will continue until at least 2009.
- Clock rates of RISC processors and DRAM memories are not expected to be
more than about twice today’s rates.
- Conclusion: Future high-end systems will feature tens of thousands of
processors, with deeply hierarchical memories.
|
|
8
|
- Commodity technology design:
- 100,000 nodes, each of which is a 10 Gflop processor.
- Clock rate = 2.5 GHz; each processor can do four flops per clock.
- Multi-stage switched network.
|
|
9
|
- Hybrid technology, multi-threaded (HTMT) design:
- 10,000 nodes, each with one superconducting RSFQ processor.
- Clock rate = 100 GHz; each processor sustains 100 Gflop/s.
|
|
10
|
- Multi-threaded processor design handles a large number of outstanding
memory references.
- Multi-level memory hierarchy (CRAM, SRAM, DRAM, etc.).
- Optical interconnection network.
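Both candidate designs are sized to reach the same aggregate peak rate. The sketch below multiplies out the node counts and clock rates quoted for each design. The helper `peak_flops` is an illustrative name, and the figure of one flop per clock for the HTMT processor is inferred from "100 Gflop/s at 100 GHz" rather than stated in the lecture.

```python
def peak_flops(nodes, clock_hz, flops_per_clock):
    """Aggregate peak rate in flop/s for a homogeneous system."""
    return nodes * clock_hz * flops_per_clock

# Commodity design: 100,000 nodes, 2.5 GHz, four flops per clock.
commodity_peak = peak_flops(nodes=100_000, clock_hz=2.5e9, flops_per_clock=4)

# HTMT design: 10,000 nodes, 100 GHz, 100 Gflop/s each (one flop per clock, inferred).
htmt_peak = peak_flops(nodes=10_000, clock_hz=100e9, flops_per_clock=1)

print(f"commodity: {commodity_peak:.1e} flop/s")
print(f"HTMT:      {htmt_peak:.1e} flop/s")
```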
|
|
11
|
- Little’s Law:
- Average number of waiting customers = average arrival rate x average wait time per customer.
|
|
12
|
- Assume:
- Single processor-memory system.
- Computation deals with data in local main memory.
- Pipeline between main memory and processor is fully utilized.
- Then by Little’s Law, the number of words in transit between CPU and
memory (i.e. length of vector pipe, size of cache lines, etc.)
= memory latency x bandwidth.
|
|
13
|
- This observation generalizes to multiprocessor systems:
- concurrency = latency x bandwidth,
- where “concurrency” is aggregate system concurrency, and “bandwidth” is
aggregate system memory bandwidth.
- This form of Little’s Law was first noted by Burton Smith of Tera.
|
|
14
|
- Proof:
- Set f(t) = cumulative number of arrived customers, and g(t) = cumulative
number of departed customers.
- Assume f(0) = g(0) = 0, and f(T) = g(T) = N.
- Consider the region between f(t) and g(t).
|
|
15
|
- By Fubini’s theorem of measure theory, one can evaluate this area by
integration along either axis.
- Thus Q T = D N, where Q is average length of queue, and D is
average delay per customer.
- In other words, Q = (N/T) D.
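The area argument can be checked numerically. The sketch below measures the region between f(t) and g(t) in two ways: by summing each customer's delay (horizontal strips, giving D N) and by integrating the queue length over time (vertical strips, giving Q T). The arrival and departure times are made-up sample data, and the function name is illustrative.

```python
def area_two_ways(arrivals, departures):
    """Return (Q*T, D*N) for paired arrival and departure times starting at t = 0."""
    T = max(departures)
    N = len(arrivals)

    # Horizontal strips: each customer contributes a strip as long as its delay.
    D = sum(d - a for a, d in zip(arrivals, departures)) / N

    # Vertical strips: integrate the queue length f(t) - g(t) over [0, T].
    events = sorted([(a, +1) for a in arrivals] + [(d, -1) for d in departures])
    area, queue_len, prev_t = 0.0, 0, 0.0
    for t, step in events:
        area += queue_len * (t - prev_t)
        queue_len += step
        prev_t = t
    Q = area / T

    return Q * T, D * N

# Hypothetical sample: four customers, first arrival at t = 0.
arrivals = [0.0, 1.0, 1.5, 4.0]
departures = [2.0, 3.0, 5.0, 6.0]
qt, dn = area_two_ways(arrivals, departures)
print(qt, dn)
```

Both values equal the same area, which is the content of Q T = D N.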
|
|
16
|
- Assume:
- DRAM memory latency = 100 ns.
- There is a 1-1 ratio between memory bandwidth (word/s) and sustained
performance (flop/s).
- Cache and/or processor system can maintain sufficient outstanding memory
references to cover latency.
|
|
17
|
- Commodity design:
- Clock rate = 2.5 GHz, so latency = 250 CP. Then system concurrency = 100,000 x 4
x 250 = 10^8.
- HTMT design:
- Clock rate = 100 GHz, so latency = 10,000 CP. Then system concurrency = 10,000 x
10,000 = 10^8.
|
|
18
|
- But by Little’s Law, system concurrency
- = 10^-7 x 10^15 = 10^8 in each case.
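The agreement between the per-design counts and Little's Law can be reproduced directly. This sketch uses the stated assumptions (100 ns DRAM latency, one memory word per flop); the function names are illustrative, and one memory reference per clock for the HTMT processor is inferred from its 100 Gflop/s rate at 100 GHz.

```python
DRAM_LATENCY_S = 100e-9  # assumed DRAM latency from the slide: 100 ns

def concurrency_from_design(nodes, clock_hz, refs_per_clock):
    """Outstanding references needed: nodes x references per clock x latency in clocks."""
    latency_clocks = DRAM_LATENCY_S * clock_hz
    return nodes * refs_per_clock * latency_clocks

def concurrency_from_littles_law(latency_s, bandwidth_words_per_s):
    """Little's Law: concurrency = latency x bandwidth."""
    return latency_s * bandwidth_words_per_s

commodity = concurrency_from_design(nodes=100_000, clock_hz=2.5e9, refs_per_clock=4)
htmt = concurrency_from_design(nodes=10_000, clock_hz=100e9, refs_per_clock=1)
# 1-1 ratio of words to flops, so 1 Pflop/s implies 10^15 word/s.
littles = concurrency_from_littles_law(DRAM_LATENCY_S, 1e15)

print(f"commodity: {commodity:.1e}")
print(f"HTMT:      {htmt:.1e}")
print(f"Little:    {littles:.1e}")
```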
|
|
19
|
- Assume:
- Commodity petaflops system -- 100,000 CPUs, each of which can sustain 10
Gflop/s.
- 90% of operations can fully utilize 100,000 CPUs.
- 10% can only utilize 1,000 or fewer processors.
|
|
20
|
- Then by Amdahl’s Law,
- Sustained performance < 10^10 / [0.9/10^5 + 0.1/10^3]
- = 9.2 x 10^13 flop/s,
- which is only about 9% of the system’s presumed achievable performance.
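The bound can be evaluated from the stated assumptions alone (100,000 CPUs at 10 Gflop/s each, 90% of operations fully parallel, 10% limited to 1,000 CPUs). The sketch below is one way to write it: the time per operation is the sum, over each class of work, of its fraction divided by the rate available to it. The function name and the list-of-pairs input are illustrative choices, not from the lecture.

```python
def sustained_flops(per_cpu_flops, fractions_and_cpus):
    """Amdahl's Law bound: reciprocal of the average time per operation."""
    time_per_op = sum(frac / (cpus * per_cpu_flops)
                      for frac, cpus in fractions_and_cpus)
    return 1.0 / time_per_op

PER_CPU = 10e9          # 10 Gflop/s per CPU
TOTAL_CPUS = 100_000

bound = sustained_flops(PER_CPU, [(0.9, TOTAL_CPUS), (0.1, 1_000)])
peak = PER_CPU * TOTAL_CPUS

print(f"sustained bound: {bound:.2e} flop/s")
print(f"fraction of peak: {bound / peak:.1%}")
```

Changing the second pair to a smaller processor count shows how quickly the bound collapses when even a small share of the work has limited parallelism.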
|
|
21
|
- Conclusion: No matter what type
of processor technology is used, applications on petaflops computer
systems must exhibit roughly 100 million way concurrency at virtually
every step of the computation, or else performance will be
disappointing.
|
|
22
|
- This assumes that most computations access data from local DRAM memory,
with little or no cache re-use (typical of many applications).
- If substantial long-distance communication is required, the concurrency
requirement may be even higher!
|
|
23
|
- Key question: Can applications
for future systems be structured to exhibit these enormous levels of
concurrency?
|
|
24
|
- Latency
- System                        Sec.      Clocks
- SGI O2, local DRAM            320 ns        62
- SGI Origin, remote DRAM       1 us         200
- IBM SP2, remote node          40 us      3,000
- HTMT system, local DRAM       50 ns      5,000
- HTMT system, remote memory    200 ns    20,000
- SGI cluster, remote memory    3 ms     300,000
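Dividing the clock count by the latency in seconds recovers the processor clock rate implied by each row of the latency table, which is a quick consistency check on the figures. The sketch below only rearranges the table's own numbers; the variable and function names are illustrative.

```python
# (system, latency in seconds, latency in clocks), taken from the latency table.
LATENCY_TABLE = [
    ("SGI O2, local DRAM",         320e-9,      62),
    ("SGI Origin, remote DRAM",    1e-6,       200),
    ("IBM SP2, remote node",       40e-6,    3_000),
    ("HTMT system, local DRAM",    50e-9,    5_000),
    ("HTMT system, remote memory", 200e-9,  20_000),
    ("SGI cluster, remote memory", 3e-3,   300_000),
]

def implied_clock_hz(latency_s, latency_clocks):
    """Clock rate implied by quoting the same latency in seconds and in clocks."""
    return latency_clocks / latency_s

implied = {name: implied_clock_hz(sec, clocks) for name, sec, clocks in LATENCY_TABLE}

for name, hz in implied.items():
    print(f"{name:28s} {hz / 1e6:12.1f} MHz")
```

Both HTMT rows come out at 100 GHz, matching the clock rate quoted for that design.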
|
|
25
|
- Can we quantify the inherent data locality of key algorithms?
- Do there exist “hierarchical” variants of key algorithms?
- Do there exist “latency tolerant” variants of key algorithms?
- Can bandwidth-intensive algorithms be substituted for latency-sensitive
algorithms?
- Can Little’s Law be “beaten” by formulating algorithms that access data
lower in the memory hierarchy? If
so, then systems such as HTMT can be used effectively.
|
|
26
|
- For the solvers used in most of today’s codes, condition numbers of the
linear systems increase linearly
or quadratically with grid resolution.
- The number of iterations required for convergence is directly
proportional to the condition number.
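A standard model problem illustrates the quadratic case. The sketch below uses the 1-D second-difference matrix tridiag(-1, 2, -1), whose eigenvalues are known in closed form as 4 sin^2(k pi / (2m)) for a grid of m intervals. This model problem is an assumption chosen for illustration; it is not taken from the lecture.

```python
import math

def model_condition_number(intervals):
    """2-norm condition number of tridiag(-1, 2, -1) on a grid with `intervals` intervals."""
    n = intervals - 1  # number of interior grid points
    lam_min = 4 * math.sin(math.pi / (2 * intervals)) ** 2
    lam_max = 4 * math.sin(n * math.pi / (2 * intervals)) ** 2
    return lam_max / lam_min

for intervals in (64, 128, 256):
    print(intervals, round(model_condition_number(intervals)))

# Doubling the resolution should multiply the condition number by about four.
ratio = model_condition_number(128) / model_condition_number(64)
print(f"growth factor when resolution doubles: {ratio:.3f}")
```

If iteration counts are proportional to the condition number, each doubling of resolution in this model roughly quadruples the number of iterations, on top of the growth in work per iteration.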
|
|
27
|
- Conclusions:
- Solvers used in most of today’s applications are not numerically
scalable.
- Novel techniques, e.g. domain decomposition and multigrid, may yield
fundamentally more efficient methods.
|
|
28
|
- Studies must be made of future computer system and network designs,
years before they are constructed.
- Scalability assessments must be made of future algorithms and
applications, years before they are implemented on real computers.
|
|
29
|
- Approach:
- Detailed cost models derived from analysis of codes.
- Statistical fits to analytic models.
- Detailed system and algorithm simulations, using discrete event
simulation programs.
|
|
30
|
- Commodity technology or advanced technology?
- How can the huge projected power consumption and heat dissipation
requirements of future systems be brought under control?
- Conventional RISC or multi-threaded processors?
|
|
31
|
- Distributed memory or distributed shared memory?
- How many levels of memory hierarchy?
- How will cache coherence be handled?
- What design will best manage latency and hierarchical memories?
|
|
32
|
- 5-10 years ago: One word (8 byte)
per sustained flop/s.
- Today: One byte per sustained
flop/s.
- 5-10 years from now: 1/8 byte per
sustained flop/s may be adequate.
|
|
33
|
- 3/4 rule: For many 3-D
computational physics problems, main memory scales as d^3, while
computational cost scales as d^4.
- However:
- Advances in algorithms, such as domain decomposition and multigrid, may
overturn the 3/4 rule.
- Some data-intensive applications will still require one byte per flop/s
or more.
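The 3/4 rule implies that the memory needed per sustained flop/s falls as systems grow. If memory scales as d^3 and cost as d^4, then memory scales as cost^(3/4). The sketch below additionally assumes run time is held fixed, so that sustained performance grows in proportion to cost. That fixed-run-time assumption, the growth factors tried, and the function names are illustrative rather than from the lecture.

```python
def memory_growth(performance_growth):
    """Memory growth factor under the 3/4 rule, for a fixed run time."""
    return performance_growth ** (3 / 4)

def bytes_per_flops(performance_growth, start_bytes_per_flops=1.0):
    """Memory per sustained flop/s after scaling up performance by the given factor."""
    return start_bytes_per_flops * memory_growth(performance_growth) / performance_growth

# Hypothetical growth factors, starting from one byte per sustained flop/s.
for growth in (10, 1000, 4096):
    print(f"{growth:5d}x performance -> {memory_growth(growth):7.1f}x memory, "
          f"{bytes_per_flops(growth):.3f} byte per flop/s")
```

In this model, a 4096-fold increase in performance is what takes the ratio from one byte to 1/8 byte per sustained flop/s.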
|
|
34
|
- MPI, PVM, etc.
- Difficult to learn, use and debug.
- Not a natural model for any notable body of applications.
- Inappropriate for distributed shared memory (DSM) systems.
- The software layer may be an impediment to performance.
|
|
35
|
- HPF, HPC, etc.
- Performance significantly lags behind MPI for most applications.
- Inappropriate for a number of emerging applications, which feature large
numbers of asynchronous tasks.
|
|
36
|
- Java, SISAL, Linda, etc.
- Each has its advocates, but none has yet proved its superiority for a
large class of highly parallel scientific applications.
|
|
37
|
- High-level features for application scientists.
- Low-level features for performance programmers.
- Handles both data and task parallelism, and both synchronous and
asynchronous tasks.
- Scalable for systems with up to 1,000,000 processors.
|
|
38
|
- Appropriate for parallel clusters of distributed shared memory nodes.
- Permits both automatic and explicit data communication.
- Designed with a hierarchical memory system in mind.
- Permits the memory hierarchy to be explicitly controlled by performance
programmers.
|
|
39
|
- How can tens or hundreds of thousands of processors, running possibly
thousands of separate user jobs, be managed?
- How can hardware and software faults be detected and rectified?
- How can run-time performance phenomena be monitored?
- How should the mass storage system be organized?
|
|
40
|
- How can real-time visualization be supported?
- Exotic techniques, such as expert systems and neural nets, may be needed
to manage future systems.
|
|
41
|
- Until recently, the high performance computing field was sustained by
- Faith in highly parallel computing technology.
- Hope that current faults will be rectified in the next generation.
- Charity of federal government(s).
|
|
42
|
- Results:
- Numerous firms have gone out of business.
- Government funding has been cut.
- Many scientists and lab managers have become cynical.
- Where do we go from here?
|
|
43
|
- Quantitative assessments of architecture scalability.
- Quantitative measurements of latency and bandwidth.
- Quantitative analyses of multi-level memory hierarchies.
|
|
44
|
- Quantitative analyses of algorithm and application scalability.
- Quantitative assessments of programming languages.
- Quantitative assessments of system software and tools.
|