Additional Deuce Coupe Thoughts

February 15, 2004
Russell Fish

General Idea of the Multiprocessor

A plurality of very lightweight, high-speed miniprocessors is distributed within a memory chip.  Each processor is responsible for a small task involving a small number of instructions.  The optimum number of instructions is the number that can be fetched from RAM in a single memory cycle.  A 1024-bit row would equate to 128 8-bit instructions.
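
The arithmetic, spelled out as illustrative C constants (the names are mine, not from any actual design):

    #define ROW_BITS             1024   /* one RAM row, fetched per cycle */
    #define INSTRUCTION_BITS     8
    #define INSTRUCTIONS_PER_ROW (ROW_BITS / INSTRUCTION_BITS)   /* = 128 */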

The processors' usage would be allocated much like tasks in a multitasker.  They would be called by other processors.  Possibly a master processor would allocate tasks to sub-processors.  Some processes would run to completion, announce DONE, deliver RESULTS, and make themselves available for the next task.
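
A minimal C sketch of that lifecycle, assuming a shared status word per processor.  All names here are hypothetical, chosen only to illustrate the DONE/RESULTS handshake described above:

    typedef enum { SLAVE_IDLE, SLAVE_RUNNING, SLAVE_DONE } slave_state;

    typedef struct {
        volatile slave_state state;  /* written by the slave, read by caller */
        unsigned int task;           /* which chunk of code to execute       */
        unsigned int result;         /* RESULTS delivered on completion      */
    } slave;

    /* Caller side: dispatch a task, wait for DONE, collect RESULTS,
       then return the slave to the available pool. */
    unsigned int run_task(slave *s, unsigned int task)
    {
        s->task  = task;
        s->state = SLAVE_RUNNING;
        while (s->state != SLAVE_DONE)
            ;                        /* spin until the slave announces DONE */
        unsigned int r = s->result;
        s->state = SLAVE_IDLE;       /* available for the next task */
        return r;
    }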

Applications

The PC as we know it in 2004 is nearing the end of its innovative life.  CPU speed is approaching the ultimate limits of the current architecture due to the constraints of:

1. Memory speed (the Von Neumann bottleneck).

2. Power consumption (increase speed 2X, increase power 4X).

3. Heat production (increase speed 2X, increase power dissipated 4X).

4. Software limitations (Microsoft has stretched the old DOS model about as far as it can go).

The current architecture consumes 90% of the CPU just keeping itself running.  It does not appear possible to make it scale well to multiprocessors.

Software

The Windows monopoly is coming to an end.  Microsoft made visual computing mainstream by building a front end on DOS (which was really a grown-up CP/M).  That software model is now more than 35 years old.  Many of the programming innovations of the last 5 years have involved putting various graphical and code-generating interfaces on this very old dinosaur: Java, Visual Basic, etc.  When a "Hello World" program consumes 200K of memory (as it does in some Java implementations), smart people get a clue that things are not right.

Russell's Razor of Programming Efficiency: A programming language's figure of merit is determined by how many bits of memory are required to transform the programmer's desire into machine-executable action.

Programming efficiency directly leads to execution efficiency because of memory bandwidth and the Von Neumann bottleneck.

The often-repeated maxim that "memory is cheap, so programming efficiency doesn't matter" is wrong.  Memory is cheap and may be effectively infinite, but memory bandwidth is expensive and very limited.

Future of Software

The future of software is "multitasking" and "multiprocessing".  This statement has been made by various pundits for 30 years, but the marketplace has been skewed by Microsoft/Intel dominance.  The original IBM DOS could easily have been made multitasking but was intentionally crippled to prevent the original PC from taking market share away from IBM's various low-end minicomputers.

The market-dominating PCs took their architectural cues from 1970-era minicomputers.  (I know.  I was there.)  The 6800 (and eventually the 68000) were near copies of DEC's PDP-11.  The 8080 shared many architectural features with the Data General NOVA and other popular minicomputers.  These architectures were optimized for the minicomputer tasks of the day and not well suited for multiprocessing.  (They were way too bulky for either the hardware or the software to be efficiently organized into large groups.)

Resulting generations of CPUs, including the Pentium, PowerPC, and DEC Alpha, incrementally enhanced the basic minicomputer idea without much rethinking of the underlying limitations.  The most significant enhancements of the past 20 years of CPU progress have been:

1. Faster clock speeds (1 MHz to 4 GHz).  This improvement is largely due to improvements in semiconductor processing and the resulting faster transistors (7-micron to 0.1-micron design rules) rather than architectural innovation.

2. A larger range of memory access (64 Kbytes to 4 Gbytes).  Memory address buses increased from 16 bits to 64 bits.  It is difficult to describe this growth as architecturally innovative.

3. Larger ALU widths (8 bits to 64 bits).  The increase in arithmetic width does architecturally increase the speed of mathematical operations by doing more work in the same amount of time.

Once again, it is difficult to describe the increase as architecturally innovative.  The design model remained minicomputers and, eventually, mainframes.

Much of the design effort of the preceding two decades has been spent dealing with the progressively worse match between memory bandwidth and CPU speed.  Various techniques have been used (a sketch of one follows the list), including:

1. Instruction prefetch queues

2. Instruction caches

3. Instruction pipelines

4. Multi-level caches
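
As one concrete illustration, a direct-mapped instruction cache (item 2) can be sketched in a few lines of C.  This is a toy model under assumed parameters, not any particular CPU's implementation:

    #define LINE_BYTES  16                 /* bytes per cache line         */
    #define CACHE_LINES 256                /* power of two, cheap indexing */

    typedef struct {
        unsigned int  tag;                 /* which block of RAM is cached */
        int           valid;
        unsigned char data[LINE_BYTES];
    } cache_line;

    static cache_line icache[CACHE_LINES];

    /* Nonzero on a hit.  On a miss the line must be fetched from RAM
       across the slow chip-to-chip path described in the text below. */
    int icache_hit(unsigned int addr)
    {
        unsigned int line = (addr / LINE_BYTES) % CACHE_LINES;
        unsigned int tag  = addr / (LINE_BYTES * CACHE_LINES);
        return icache[line].valid && icache[line].tag == tag;
    }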

The battle is a losing one due to the limitations inherent in moving a bit of information out of one chip, through a huge metal pin, across a huge metal trace, back up another huge metal pin, and finally into another chip.  The laws of physics are against you.  As speeds increase, immense amounts of energy are required to move bits a few mm.  The problem is also mathematically "second order": if the speed doubles, the power increases by a factor of four.

By definition this is the sort of problem that has an upper physical bound.  Most real-world resources scale linearly (power availability, power dissipation, the number of wires in a given area, etc.).  When a problem consumes resources in a nonlinearly increasing manner, there comes a point at which no amount of resources will improve the result.  Furthermore, the final few percentage points of improvement come at a horrendous cost.
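
A first-order way to see the quadratic claim above is the textbook dynamic-power approximation (a standard CMOS model, not a statement about any particular process):

    P_{dynamic} = C \cdot V^{2} \cdot f

where C is the switched capacitance, V the supply voltage, and f the clock frequency.  Power is already linear in f; because raising f in practice also requires raising V, the V^2 term makes the growth superlinear.  Under the rough assumption that V must rise as the square root of f, power grows as f^2: double the speed, quadruple the power.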

Implications for the Future of Software and CPU Hardware

Traditionally, CPU hardware design has been distinct from software design.  CPU design has involved a fair amount of science and engineering.  Software design has proceeded more like "art" (and a black art at that).

As my good friend Chuck Moore observed, optimizing computer systems is really a factoring problem.  The simplest factoring is the optimal result, and it may only be achieved by optimizing hardware and software together.

Resolving the asymptotic behavior of PC performance requires rethinking designs along the following lines:

1. CPU resources must be tightly coupled to memory.  The closer the CPU is to memory, the less the effect of the Von Neumann bottleneck.

2. CPU resources must be widely distributed rather than centralized.  Very few application problems are inherently centralized, and their solutions need not be.

3. The higher up the application's solution scheme communication is pushed, the lower the bandwidth required.  This, along with items 1 and 2, argues for lightweight CPU/memory combinations producing intermediate results which may be communicated to other CPU/memory combinations.

4. Programs do not distribute CPU time equally across their instructions.  Much time is spent waiting for results.  Much time is also spent inside inner loops performing repetitive operations, and these loops are often surprisingly small.  This argues for programs allocating multiple dedicated CPU/memory combinations to time-consuming operations (see the sketch following this list).

5. Distributing solution resources across the physical and cyber application space will result in a superior solution compared to a centralized one.  This is due to the many local optimizations possible with a distributed solution that are not available to the centralized approach, and to the communications limitations of the centralized approach.  It is also the reason centrally planned economies are inherently less efficient than distributed, locally optimized (free-market, democratic) ones.
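
A sketch of item 4, using hypothetical master/slave primitives in the spirit of the Master Processor described later in this note (slave_allocate, slave_dispatch, slave_wait, and SUM_CHUNK are all invented names):

    typedef struct slave slave;                   /* opaque slave handle      */
    slave *slave_allocate(void);                  /* hypothetical: take an    */
    void slave_dispatch(slave *s, int task,       /* idle slave and start it  */
                        const unsigned int *data, /* on one chunk of work     */
                        unsigned int n);
    unsigned int slave_wait(slave *s);            /* block until DONE         */
    #define SUM_CHUNK 1                           /* id of the loop's chunk   */
    #define WORKERS   4

    /* Farm a small, time-consuming inner loop (summing n values) out to
       dedicated CPU/memory pairs, then combine the partial results.
       Remainder handling is omitted for brevity. */
    unsigned int parallel_sum(const unsigned int *data, unsigned int n)
    {
        slave *workers[WORKERS];
        unsigned int total = 0;
        unsigned int share = n / WORKERS;

        for (int i = 0; i < WORKERS; i++) {
            workers[i] = slave_allocate();
            slave_dispatch(workers[i], SUM_CHUNK, data + i * share, share);
        }
        for (int i = 0; i < WORKERS; i++)
            total += slave_wait(workers[i]);      /* collect partial sums */
        return total;
    }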

Future of the PC

The PC will be superseded by other personal intelligent devices.  Early models are various PDAs, cell phones, and portable video games.  The new computing paradigm will be distinguished by:

1. Small size (handheld or smaller).

2. Low power consumption (battery or solar powered).

3. Innovative information output or display (small LCDs, digital paper, audio).

4. Heavy reliance on wireless communications (802.11b/g, local IR).  Wired communications such as a LAN are a good match for heavy, immobile PCs but not for a lightweight, inherently portable device.

Master Processor

The Master Processor can do its own computations, but more importantly it can dispatch Slave Processors to do work and report back with results.  The Master Processor maintains a list of available slaves.  The Master can configure a slave to associate with a certain amount of memory.  Memory is allocated in chunks.  A chunk is one row's worth of RAM (1024 bits, for example).
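
A minimal C sketch of the bookkeeping this implies, assuming the 1024-bit row size used as the example above (all structure and field names are mine, invented for illustration):

    #define CHUNK_BITS 1024                 /* one row of RAM                  */
    #define MAX_SLAVES 64

    typedef struct {
        int          busy;                  /* 0 = available for the next task */
        unsigned int data_chunk;            /* first chunk of data memory      */
        unsigned int data_chunks;           /* how many data chunks            */
        unsigned int program_chunk;         /* first chunk of program memory   */
        unsigned int program_chunks;        /* how many program chunks         */
    } slave_entry;

    static slave_entry slaves[MAX_SLAVES];  /* the Master's list of slaves */

    /* Find an available slave, or return -1 if all are busy. */
    int find_available_slave(void)
    {
        for (int i = 0; i < MAX_SLAVES; i++)
            if (!slaves[i].busy)
                return i;
        return -1;
    }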

Program processes are bounded by chunks.  A process may pass across multiple chunk boundaries by calling another chunk.  Programs may only loop inside a chunk.  Two levels of nesting are allowed within a single chunk.

The Master Processor calls a slave and requests:

1. A CPU (of a particular characteristic),

2. A pointer to a certain number of chunks of data memory,

3. A pointer to a certain number of chunks of program memory.
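
Those three items map naturally onto a single request structure.  The C encoding below is hypothetical, sketched only to make the shape of the call concrete:

    typedef enum {                  /* "a particular characteristic"           */
        CPU_8_BIT,
        CPU_16_BIT,
        CPU_32_BIT,
        CPU_WITH_MULTIPLIER,
        CPU_WITH_IO
    } cpu_kind;

    typedef struct {
        cpu_kind     cpu;           /* 1. a CPU of a particular characteristic */
        unsigned int data_ptr;      /* 2. first chunk of data memory           */
        unsigned int data_len;      /*    ...and how many chunks               */
        unsigned int program_ptr;   /* 3. first chunk of program memory        */
        unsigned int program_len;   /*    ...and how many chunks               */
    } slave_request;

The cpu_kind field anticipates the variety of slave configurations described in the next section.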

Slave Processor

Slave processors may come in a variety of configurations, including different CPU widths, I/O capability, and special math or logic capability.