CS225,  Fall 2006, Wednesday, Nov.29, Day 39

HW Day 39, Due Friday Day 40   (3 class days after today) 
Reading: How to make a faster running computer, more, plus intro to PC architecture:
Null & Lobur: pp. 151-2: 4.1.3, clocks;
  pp.179-189: 4.7 (chip structure) & 4.8.1(PC)  4.8.2 (MIPS)
             [I think there's a misprint p. 188--end_while should be label of    la $t4, sum]
 Null & Lobur pp. 210-227, 5.3 on--Addressing, ISA's, pipelining.
   p.411 on: Ch. 9 Alternative architectures: thru 9.3 (CISC vs. RISC) and on till you get tired.  Pipelining p. 478
Warford, Operating system & process management: 8.1, 8.2 thru p. 396.

A) Continue with programming. 
Presentation to the world, Friday, Dec. 8, (in Study Week.) Time???   We will NOT be reworking any more code in class.
Post on the Wiki ASAP,  what you have working (however simple); or if not, something whose bugs you can't solve.
Have ready to hand in, last class or at Presentation, printed out: your code, cover pages (Sub documentation. Algorithm if not obvious), Self-evaluation (modified 11/27)

Amnesty extended again:  Any  old "Programs" handed in  by last day of classes will be only 1 day late.
More extra credit:  Investigate another (simulated) assembly language/architecture  Link
= = = = = = = = = = = == = = = =
Notes:
TICTACTOE: (James Fallows, New York Times, Nov. 27, 2004:  "When I worked briefly on a product design team at Microsoft, I was sobered to learn that fully one-fourth of the company's typical two-year "product cycle time" was devoted to testing.  Programmers spend 18 months designing and debugging a system.  Then testers spend the next six months finding the problems they missed.")

Architecture development: 
As processors get faster, accessing memory takes proportionately longer compared to register-to-register ops.
As more programming happens at a higher level, sophisticated tools at assembly level become less useful (less likely to be used by compiler)

Von Neumann architecture:
Repeat: {Fetch instruction, Decode, Execute}  "Von Neumann bottleneck": one instruction at a time.  One cook in the kitchen.
    CPU has to do this in smaller steps, each taking one or more clock cycles.
 Repeat:  {Fetch opcode, Decode opcode, (Calculate effective address of operand(s!), Fetch operand(s!)), Execute(multiple steps?), Store result}
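
A minimal sketch of the repeat loop above, in Python rather than Pep/8. The machine is made up for illustration: the opcode names LOAD/ADD/STORE/STOP, the tuple encoding, and the `run` function are all assumptions, not any real instruction set.

```python
# Toy accumulator machine: invented opcodes, NOT the Pep/8 instruction set.
def run(program, memory):
    """Run until STOP; return (accumulator, number of instructions executed)."""
    pc = 0          # program counter
    acc = 0         # the single accumulator register
    executed = 0
    while True:
        opcode, operand = program[pc]   # Fetch
        pc += 1
        executed += 1
        if opcode == "LOAD":            # Decode, then Execute
            acc = memory[operand]       # fetch operand from memory
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc       # store result
        elif opcode == "STOP":
            return acc, executed
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

memory = {"num1": 5, "num2": 15, "sum": 0}
program = [("LOAD", "num1"), ("ADD", "num2"), ("STORE", "sum"), ("STOP", None)]
acc, count = run(program, memory)
```

One instruction is handled per trip around the loop, which is the "one cook in the kitchen" picture.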

Crude historic overview:
Early ISA's tended to be like the PEP8.  Few registers, 1 or 2 operands (if you count the register A or X as an operand), lots of interaction with memory.  Chips and memory were (relatively) expensive.
"Accumulator-based architecture"--one operand is in the "accumulator" register, result gets stored there.
--> "General purpose register architecture" --More registers,  more flexibility  (add two, store in third.  Slower decoding)
-->  More and more specialized and complex operations got designed.  Instructions have Wildly different lengths (clock cycles) (see Day38, 68000 chip)
CISC Complex instruction set computers
          (Many very specialized ones for traversing arrays, managing stacks. e.g. push=store + move stack pointer)
        Microprogrammed: (layer of code between "wires" and Assembly/machine language). 
                   Allow short programs.  Not necessarily short runtime.    Not clear that compilers use specialized instructions.
Amdahl's law:  the performance enhancement from an improved feature is limited by the fraction of time the improved feature is actually USED.  (Think food processor.)
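
A small sketch of Amdahl's law as it is usually written: overall speedup = 1 / ((1 - f) + f/k), where f is the fraction of run time spent in the improved feature and k is how many times faster that feature got. The function name and the sample numbers are made up for illustration.

```python
def amdahl_speedup(fraction_used, feature_speedup):
    """Overall speedup when a feature that takes `fraction_used` of the
    run time is made `feature_speedup` times faster."""
    return 1.0 / ((1.0 - fraction_used) + fraction_used / feature_speedup)

rarely = amdahl_speedup(0.10, 10)   # feature used 10% of the time, made 10x faster
often = amdahl_speedup(0.90, 10)    # same improvement, used 90% of the time
```

With these numbers the rarely used feature buys only about a 10% overall gain, while the heavily used one gives better than 5x.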

Ways to ameliorate the one-cook bottleneck:

Shorten maximum clock time per instruction: 
Load/store architecture (oddly named):  The ONLY ops involving memory are LOAD from, STORE to. 
All others involve only registers. 
So               ADDA Num, d
must be done by  LDB Num, d
                 ADDA, B
  
Some chips add 3-operand possibility:  Add C to B and put into A.
Simplify instructions back to the primitive (ASR one byte,  not ASR n bytes)
Hard wire (little or no "microcode" needed)
Results:
Shorten Maximum clock time of an operation.
(So leave out dedicated multiplication:  for 5*15, add 15 to itself 5 times (or faster, shift and add):  4+4+4*5 = 28 cycles in Motorola.  Built in, 70.)
Make all ops run in the same length of time, (short maximum.  Some may "waste" a little.)
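
A sketch of the "shift and add" idea for multiplying without a multiply instruction, in Python. The function name and the step counter are made up for illustration, and the step count is loop iterations, not clock cycles on any real chip.

```python
def multiply_shift_add(a, b):
    """Multiply two non-negative integers using only shifts, adds, and bit tests.
    Returns (product, number of loop steps)."""
    product = 0
    steps = 0
    while b > 0:
        if b & 1:        # low bit of multiplier is 1: add the current multiplicand
            product += a
        a <<= 1          # shift multiplicand left (doubles it)
        b >>= 1          # shift multiplier right (move to next bit)
        steps += 1
    return product, steps

result, steps = multiply_shift_add(15, 5)
```

The loop runs once per bit of the multiplier, so 15*5 takes 3 trips instead of 5 repeated adds; the gap widens quickly for bigger multipliers.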

Pipelining: N&L 216-17, 478   "Everybody" uses it now...
 Break each instruction into n sequential stages; have a part of the hardware for each one, all working simultaneously:
   e.g. n=3:   Fetch op,  Decode,  Execute
Potentially 3 times as fast:   But branches, other things can mess it up.
Time slot    1   2   3   4   5   6   7   8   9
    load     F   D   E
    and          F   D   E
    add              F   D   E
    breq B               F   D   E
    load                     F   D ||E      Useless beginnings, if branched
    add                          F ||D   E
    sub                              F   D   E
B:  store                                F   D   E      (no branch)
B:  store                            F   D   E          (if branched)

Usually more stages are used, branches "lose" more.  (My son tells me one chip's language uses "branch after the next instruction" to flag potential branches early, ameliorate this!)
Other problems: If Fetch, Load/store use same bus to transfer: conflicts?  Load a data item  before it's been stored from previous op?
Pipelining is a lot easier to structure if each operation requires the same number of memory accesses, and/or the same time to complete each piece.
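
A rough Python sketch that reproduces the time slots in the diagram above. Assumptions (mine, for illustration): every stage takes one time slot, a new instruction is fetched every slot, and a taken branch is only known when it finishes its last stage, so whatever was fetched behind it is thrown away. The `schedule` function and the trace format are made up.

```python
def schedule(instructions, stages=3):
    """instructions: list of (name, is_taken_branch) for the instructions
    actually executed, in order.
    Returns a list of (name, fetch_slot, finish_slot)."""
    slot = 1
    rows = []
    for name, taken_branch in instructions:
        rows.append((name, slot, slot + stages - 1))
        # After a taken branch, the next useful fetch waits until the
        # branch has finished its last stage; otherwise fetch next slot.
        slot += stages if taken_branch else 1
    return rows

branched = schedule([("load", False), ("and", False), ("add", False),
                     ("breq B", True), ("store", False)])
straight = schedule([("load", False), ("and", False), ("add", False),
                     ("breq B", False), ("load", False), ("add", False),
                     ("sub", False), ("store", False)])
unpipelined_slots = 5 * 3   # 5 instructions, 3 slots each, one at a time
```

In this model a taken branch costs (stages - 1) wasted slots, so a deeper pipeline loses more per branch.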

RISC architecture:  "Reduced Instruction Set computer"   The paradigm:
Load/store, fixed length instructions, simple ops (few addressing modes), 3 operands, many registers (parameter passing by a bank of registers reserved for the procedure:  register "window") .  Instructions hardwired rather than microcode.  Pipelining. 
     Complexity happens at compiler/programmer/opsys level rather than at  machine-instruction level. 
Most modern chips are RISC-like but  use the RISC features erratically. Intels--lots of instructions in microcode on a RISC core.  (Backward compatibility is an issue.)

Cut memory use where possible--keep it in the CPU.
Passing parameters in registers (vs. passing parameters purely on the stack)
--Simple to program.
--Much faster, especially if you don't have to protect too much.
 (Register windowing:  "New" chips have many many registers (say 100). Some are organized into smaller "windows"--each contains private registers, registers for passing out and passing in.  Hardware has a way of hooking up the passing registers).
Registers:  
r r r r r r r r r r r r r r r r r r
m m m m m m m m                 Main's window
          s s s s s s s s          Subroutine1's window (what it feels like)
                    t t t t t t t t   Subroutine of Sub1's window (what it feels like)
--Save/restore in local memory:  Can't do recursion.  Usually we save and restore registers to a stack
   If you have too many parameters to pass in registers, rest go on stack.
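
A toy Python model of the overlapping windows in the picture above. The sizes (18 registers, windows of 8, overlap of 3) are read off the diagram; the class and method names are made up, and no real chip's layout is implied.

```python
class WindowedRegisters:
    """One big register file; each procedure sees only its own window."""

    def __init__(self, total=18, window=8, overlap=3):
        self.regs = [0] * total
        self.window = window
        self.stride = window - overlap   # how far the window slides per call
        self.base = 0                    # start of the current window

    def _index(self, n):
        if not 0 <= n < self.window:
            raise IndexError("register outside the current window")
        return self.base + n

    def read(self, n):
        return self.regs[self._index(n)]

    def write(self, n, value):
        self.regs[self._index(n)] = value

    def call(self):
        # Assumption: when windows run out, something must be saved to memory.
        if self.base + self.stride + self.window > len(self.regs):
            raise OverflowError("out of windows; would have to spill to memory")
        self.base += self.stride

    def ret(self):
        if self.base == 0:
            raise RuntimeError("already in the outermost window")
        self.base -= self.stride

rf = WindowedRegisters()
rf.write(5, 4)                          # main puts two arguments in its
rf.write(6, 6)                          # last (shared) registers
rf.call()                               # slide the window; nothing is copied
rf.write(2, rf.read(0) + rf.read(1))    # sub sees them as its registers 0 and 1
rf.ret()
answer = rf.read(7)                     # main finds the result in its register 7
```

The parameters are never moved: caller and callee just use different names for the same physical registers.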

More parallelism:
Multiple processors:  e.g. separate processor for floating point arithmetic (early); for graphical output, for other I/O. Or  2 or more of the "same" processor running at the same time ("Dual" processors).  Issues:  How to manage use of common memory (if there is common memory),   who does what instructions (obvious for special purpose processors, not for parallel ones), waiting for stuff.
Distributed computing:  "Separate" computers networked, working on a common problem.  "Cluster management" software: e.g. BEOWULF.org--open source Linux-based structures.  Organizes a cluster of heterogeneous computers to look like a massively parallel machine.  (This concept is Cheapest for the power).
  
 If 5 cooks are good, are 50 cooks 10 times as good?  500 cooks? (Ha!)  Almost?--if the problem has a lot of parallelism  (Big matrix computations, searching masses of data, similar number crunching on different inputs, all feeding together). 
But more energy into the planning, organizing, managing, than into the actual "work"?

How programming is changing to improve coding efficiency and still make use of the hardware changes:
People benefit from programming in large abstractions: Objects, Classes, Procedures, data structures. 
Layers:  "Virtual" machines on top of RISC bases. (Backward compatibility).  You think you're programming in a CISC architecture in C++, but by the time it gets to the core computer, it may be broken down into RISC type instructions.  Extra instructions in op.sys (like DECI, STRO) or built in to the chip in microcode.  Java Virtual Machine (JVM): Java (C++ like) code to JVM code.  JVM to machine instructions, different for each chip.

Optimizing compilers/code--from clearest for humans to most efficient for machine.
  -- Expand short loops into straight code, rewrite procedures into "in-line" code.
  -- Replace variable with constant if variable doesn't change.
  -- Reorganize mathematical formulas for most efficient computation a*(b+c) + d*(b+c) = (a+d)*(b+c)
  -- Keep "scratch" (intermediate) variables in registers rather than store them, if not needed for passing or output.
  -- Rearrange instructions to get longest paths without branches.
  -- "Vector" operations (like matrix arithmetic) naturally parallel.  Compile to take advantage of architecture that can handle this.
  -- Organize code into chunks which can run independently.  (Processors can do this too, to optimize parallelism and/or pipelining)
  -- Rearrange instructions to avoid pipeline conflicts like waiting for results of a previous op (interleave independent ops).
                 ("Superscalar" hardware--multiple execution units in one processor--does this too.)
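
Two of the rewrites in the list above, written out by hand in Python so the before and after can be compared. A real optimizing compiler would do this on the machine-level code; the function names are made up for illustration.

```python
# Reorganizing a formula: same answer, fewer operations.
def formula_before(a, b, c, d):
    return a * (b + c) + d * (b + c)      # 2 multiplies, 3 adds

def formula_after(a, b, c, d):
    return (a + d) * (b + c)              # 1 multiply, 2 adds

# Expanding a short loop into straight code: no counter, no branches.
def sum3_loop(values):
    total = 0
    for i in range(3):
        total += values[i]
    return total

def sum3_unrolled(values):
    return values[0] + values[1] + values[2]
```

For integers the two formulas always agree. With floating point the rounding can differ slightly, which is one reason compilers are cautious about reorganizing arithmetic.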

Algorithm analysis: to develop ways to share out a problem to parallel processors or feed to a pipeline, efficiently.
(who should do the optimizing?  Compiler (software) or Processor (hardware))

BACK TO:
Operating system:  What does it have to do, and how (a little) does it do it?  Warford, Ch. 8, lightly.  8.1, beginning 8.2
   First job:  Load a program and turn control over to it:  (PEP8)
       Loader: Is a subroutine in the operating system.  Load button puts its first instruction address in PC.
           Copies the program into memory starting at 0000,
               puts FBCF into SP
               puts 0000 into PC,
                  next instruction to run is the one at 0000.
   Our programs STOP.  In a "real" system, control would revert to the operating system, which would load another program, never (rarely) stop.
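
A sketch of what the loader steps above amount to, in Python. The 64K memory size, the `load` function, and the register dictionary are assumptions for illustration; the sample bytes are placeholders, not a meaningful program.

```python
USER_STACK_TOP = 0xFBCF    # the value the notes say goes into SP

def load(object_code, memory_size=0x10000):
    """Copy the program into memory starting at 0000 and set up SP and PC."""
    if len(object_code) > memory_size:
        raise ValueError("program does not fit in memory")
    memory = bytearray(memory_size)
    memory[0:len(object_code)] = object_code         # program starts at 0000
    registers = {"SP": USER_STACK_TOP, "PC": 0x0000}
    return memory, registers

object_code = bytes([0xC0, 0x00, 0x05, 0x00])        # placeholder bytes
memory, registers = load(object_code)
```

After `load` returns, the next instruction to run is whatever sits at address 0000, because that is where PC points.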

Trap:  (similar to "interrupt"--language varies)  Like subroutine but part of operating system.  DECI, DECO, STRO are traps in PEP8.
   Need to leave the running program, execute the trap, return, invisibly except for desired effect.
Process Control Block  on entering the trap, all info about the running program (process) is saved into memory in this block.  (On system stack or special memory blocks).  Restored at return from trap. 
What?  All Registers! (including Flags, Stack Pointer, PC).  That contains all info needed to pick up where we left off (unless memory has been messed with in a wrong way...)
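
A sketch of the save/restore idea in Python: copy every register into a block on the way into the trap, let the trap routine do what it likes, then put everything back. The register names, `run_trap`, and the `deco_like` handler are made up for illustration; this is not the Pep/8 operating system code.

```python
def run_trap(cpu, handler, output):
    """Run `handler` as a trap, invisibly except for its desired effect."""
    pcb = dict(cpu)             # Process Control Block: copy of ALL registers
    try:
        handler(cpu, output)    # the trap routine may change any register
    finally:
        cpu.clear()
        cpu.update(pcb)         # restore, so the program picks up where it left off

def deco_like(cpu, output):
    """Hypothetical trap: print register A in decimal, trashing registers on the way."""
    output.append(str(cpu["A"]))      # the desired effect
    cpu["A"] = 0                      # scratch use by the trap routine
    cpu["X"] = 0xFFFF
    cpu["PC"] = 0xFC00

cpu = {"A": 42, "X": 7, "SP": 0xFBCF, "PC": 0x0012, "NZVC": 0b0000}
before = dict(cpu)
output = []
run_trap(cpu, deco_like, output)
```

The running program sees the same registers after the trap as before it; only the output has changed.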

Next. 8.3, Concurrent processes   

CS225-Fall06/Day39.htm 
10pm, 11/29/06
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility