CS225, Fall 2006, Wednesday, Nov.29, Day 39
HW Day 39, Due Friday
Day 40 (3 class days after today)
Reading: How to make a faster running computer, more, plus intro to PC architecture:
Null & Lobur: pp. 151-2: 4.1.3, clocks; pp. 179-189: 4.7 (chip structure) & 4.8.1 (PC), 4.8.2 (MIPS)
[I think there's a misprint p. 188--end_while should be label of la $t4, sum]
Null & Lobur pp. 210-227, 5.3 on--Addressing, ISA's, pipelining.
p. 411 on: Ch. 9 Alternative architectures: thru 9.3 (CISC vs. RISC) and on till you get tired. Pipelining p. 478
Warford, Operating system & process management: 8.1, 8.2 thru p. 396.
A) Continue with programming.
Presentation to the world, Friday, Dec. 8 (in Study Week). Time???
We will NOT be reworking any more code in class.
Post on the Wiki ASAP, what you have working
(however simple); or if not, something whose bugs you can't solve.
Have ready to hand in, last class or at Presentation, printed out: your code, cover pages (Sub documentation. Algorithm if not obvious), Self-evaluation (modified 11/27).
Amnesty extended again: Any old "Programs"
handed in by last day of classes will be only 1 day late.
More extra credit: Investigate another (simulated) assembly language/architecture.
= = = = = = = = = = = == = = = =
Notes:
TICTACTOE:
(James Fallows, New York Times, Nov. 27, 2004: "When I worked briefly on a product design team at Microsoft, I was sobered to learn that fully one-fourth of the company's typical two-year "product cycle time" was devoted to testing. Programmers spend 18 months designing and debugging a system. Then testers spend the next six months finding the problems they missed.")
Architecture development:
As processors get faster, accessing memory takes proportionately longer compared to in-register or between-register ops.
As more programming happens at a higher level, sophisticated tools at assembly level become less useful (less likely to be used by a compiler).
Von Neumann architecture:
Repeat: {Fetch instruction, Decode, Execute}
"Von Neumann bottleneck": one instruction at a time, each one fetched over the single path between memory and CPU. One cook in the kitchen.
CPU has to do this in smaller steps, each taking one or more clock cycles.
Repeat: {Fetch opcode, Decode opcode, (Calculate effective address of operand(s!), Fetch operand(s!)), Execute (multiple steps?), Store result}
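The Fetch / Decode / Execute loop can be sketched as a toy accumulator machine in Python. This is only an illustration: the opcode names (LOAD, ADD, STORE, STOP) and the (opcode, address) instruction format are made up for the sketch and are not PEP8 encodings.

```python
# Toy accumulator machine: each loop pass is one Fetch / Decode / Execute cycle.
# Opcode names and the (opcode, address) format are invented for this sketch.

def run(program, memory):
    """Run until STOP; return the final accumulator and memory."""
    pc = 0    # program counter
    acc = 0   # accumulator register
    while True:
        opcode, addr = program[pc]    # Fetch the instruction
        pc += 1                       # point at the next instruction
        if opcode == "LOAD":          # Decode, then Execute
            acc = memory[addr]        # fetch operand from memory
        elif opcode == "ADD":
            acc = acc + memory[addr]
        elif opcode == "STORE":
            memory[addr] = acc        # store result
        elif opcode == "STOP":
            return acc, memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

program = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("STOP", None)]
acc, mem = run(program, [5, 15, 0])
print(acc, mem)   # 20 [5, 15, 20]
```

Note that every instruction except STOP touches memory, which is the "lots of interaction with memory" pattern of the early accumulator machines.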
Crude historic overview:
Early ISA's tended to be like the PEP8. Few registers, 1 or 2 operands (if you count the register A or X as an operand), lots of interaction with memory. Chips and memory were (relatively) expensive.
"Accumulator-based architecture"--one operand is in the "accumulator" register, result gets stored there.
--> "General purpose register architecture"--More registers, more flexibility (add two, store in third). Slower decoding.
--> More and more specialized and complex operations got designed. Instructions take wildly different lengths of time (clock cycles) (see Day38, 68000 chip).
CISC: Complex instruction set computers (many very specialized instructions, e.g. for traversing arrays, managing stacks: push = store + move stack pointer).
Microprogrammed (layer of code between "wires" and Assembly/machine language).
Allow short programs. Not necessarily short runtime.
Not clear that compilers use specialized instructions.
Amdahl's law: the performance enhancement from an improved feature is limited by the fraction of time the improved feature is actually USED. (Think Food processor.)
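Amdahl's law is usually written as overall speedup = 1 / ((1 - f) + f/s), where f is the fraction of run time that uses the improved feature and s is how much faster the feature got. A small Python sketch (the function name and the sample numbers are just for illustration):

```python
def amdahl_speedup(fraction_used, feature_speedup):
    """Overall speedup when a fraction of run time is sped up by a factor."""
    return 1.0 / ((1.0 - fraction_used) + fraction_used / feature_speedup)

# A feature that is 10 times faster but used only 10% of the time:
print(round(amdahl_speedup(0.10, 10), 3))   # 1.099
# The same feature used 90% of the time:
print(round(amdahl_speedup(0.90, 10), 3))   # 5.263
```

A tenfold improvement that is rarely used barely shows up in the total, which is the worry about specialized CISC instructions that compilers seldom emit.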
Ways to ameliorate the one-cook bottleneck:
Shorten maximum clock time per instruction:
Load/store architecture (oddly named): The ONLY ops involving memory are LOAD from, STORE to. All others involve only registers.
So ADDA Num, d
must be done by
LDB Num, d
ADDA, B
Some chips add a 3-operand possibility: Add C to B and put into A.
Simplify instructions back to the primitive (ASR one bit, not ASR n bits).
Hard wire (little or no "microcode" needed).
Results:
Shorten the maximum clock time of an operation. (So leave out dedicated multiplication: for 5*15, add 15 to itself 5 times (or faster, shift and add): 4+4+4*5 = 28 cycles in Motorola. Built in, 70.)
Make all ops run in the same length of time (short maximum; some may "waste" a little).
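Here is a minimal Python sketch of the two ways to multiply without a multiply instruction: repeated addition, and shift and add. It assumes non-negative integers, the function names are made up, and it makes no claim about cycle counts on any real chip.

```python
def multiply_repeated_add(a, b):
    """a * b by adding b to a running total a times (a, b non-negative)."""
    total = 0
    for _ in range(a):
        total += b
    return total

def multiply_shift_add(a, b):
    """a * b using shift and add; one pass per bit of a (a, b non-negative)."""
    total = 0
    while a > 0:
        if a & 1:      # low bit of a is set: add the current shifted b
            total += b
        a >>= 1        # shift a right one bit
        b <<= 1        # shift b left one bit (doubles it)
    return total

print(multiply_repeated_add(5, 15), multiply_shift_add(5, 15))   # 75 75
```

Repeated addition loops a times; shift and add loops once per bit of a, so it wins quickly as the numbers grow.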
Pipelining: N&L 216-17, 478. "Everybody" uses it now...
Break each instruction into n sequential stages; have a part of the hardware for each one, all working simultaneously:
e.g. n=3: Fetch op, Decode, Execute
Potentially 3 times as fast. But branches, other things can mess it up.
Time slot    1  2  3  4  5  6  7  8  9  10
load         F  D  E
and             F  D  E
add                F  D  E
breq B                F  D  E
load                     F  D ||E           Useless beginnings, if branched
add                         F ||D  E
sub                            F  D  E
B: store                          F  D  E   (not branched)
B: store                       F  D  E      (branched)
Usually more stages are used, so branches "lose" more. (My son tells me one chip's language uses "branch after the next instruction" to flag potential branches early and ameliorate this!)
Other problems: If Fetch and Load/store use the same bus to transfer: conflicts? Load a data item before it's been stored from previous op?
Pipelining is a lot easier to structure if each operation requires the same number of memory accesses, and/or the same time to complete each piece.
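A rough way to count time slots, sketched in Python. It assumes every stage takes exactly one slot and that a taken branch is only known at the end of its Execute stage, so in a 3-stage pipeline the 2 instructions fetched behind it are thrown away. The function names and the flush_penalty parameter are made up for this sketch.

```python
def unpipelined_slots(n_instructions, n_stages):
    """Each instruction runs all its stages before the next one starts."""
    return n_instructions * n_stages

def pipelined_slots(n_instructions, n_stages, taken_branches=0, flush_penalty=0):
    """The first instruction needs n_stages slots; each later one finishes one
    slot after its predecessor; each taken branch wastes flush_penalty slots."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + taken_branches * flush_penalty

# load, and, add, breq, store with 3 stages (Fetch, Decode, Execute):
print(unpipelined_slots(5, 3))                                   # 15
print(pipelined_slots(5, 3))                                     # 7
print(pipelined_slots(5, 3, taken_branches=1, flush_penalty=2))  # 9
```

With more stages the flush penalty grows, which is why branches "lose" more on deeper pipelines.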
RISC architecture: "Reduced Instruction Set Computer". The paradigm:
Load/store, fixed length instructions, simple ops (few addressing modes), 3 operands, many registers (parameter passing by a bank of registers reserved for the procedure: register "window"). Instructions hardwired rather than microcode. Pipelining.
Complexity happens at compiler/programmer/opsys level rather than at machine-instruction level.
Most modern chips are RISC-like but use the RISC features erratically.
Intels--lots of instructions in microcode on a RISC core. (Backward compatibility is an issue.)
Cut memory use where possible--keep it in the CPU.
Passing parameters in registers (vs. passing parameters purely on the stack):
--Simple to program.
--Much faster, especially if you don't have to protect too much.
(Register windowing: "New" chips have many many registers (say 100). Some are organized into smaller "windows"--each contains private registers, registers for passing out and passing in. Hardware has a way of hooking up the passing registers.)
Registers:  r r r r r r r r r r r r r r r r r r
            m m m m m m m m                        Main's window
                      s s s s s s s s              Subroutine1's window (what it feels like)
                                t t t t t t t t    Subroutine of Sub1's window (what it feels like)
--Save/restore in local memory: Can't do recursion. Usually we save and restore registers to a stack.
If you have too many parameters to pass in registers, the rest go on the stack.
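One way to picture register windowing is a sliding view onto one big register file. The Python sketch below assumes an 18-register file, windows of 8 registers, and an overlap of 3 registers between caller and callee; these sizes, the class name, and the method names are all chosen for illustration and do not describe any particular chip.

```python
class WindowedRegisters:
    """Register file viewed through overlapping windows (sizes are assumptions)."""

    def __init__(self, total=18, window=8, overlap=3):
        self.regs = [0] * total
        self.window = window
        self.step = window - overlap   # how far the window slides per call
        self.base = 0                  # start of the current window

    def read(self, i):
        return self.regs[self.base + i]

    def write(self, i, value):
        self.regs[self.base + i] = value

    def call(self):
        """Slide the window forward for a subroutine."""
        if self.base + self.step + self.window > len(self.regs):
            raise OverflowError("out of windows; would have to spill to memory")
        self.base += self.step

    def ret(self):
        """Slide the window back for the caller."""
        self.base -= self.step

rf = WindowedRegisters()
rf.write(5, 42)      # main puts a parameter in its first "passing out" register
rf.call()            # subroutine gets a new window
print(rf.read(0))    # 42: main's register 5 is the subroutine's register 0
rf.write(0, 43)      # subroutine passes a result back the same way
rf.ret()
print(rf.read(5))    # 43
```

No copying happens on call or return; only the base moves, which is why this is faster than pushing parameters on a stack in memory.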
More parallelism:
Multiple processors: e.g. separate processor for floating point arithmetic (early); for graphical output, for other I/O. Or 2 or more of the "same" processor running at the same time ("Dual" processors). Issues: How to manage use of common memory (if there is common memory), who does what instructions (obvious for special purpose processors, not for parallel ones), waiting for stuff.
Distributed computing: "Separate" computers networked, working on a common problem. "Cluster management" software: e.g. BEOWULF.org--open source Linux-based structures. Organizes a cluster of heterogeneous computers to look like a massively parallel machine. (This concept is cheapest for the power.)
If 5 cooks are good, are 50 cooks 10 times as good? 500 cooks? (Ha!) Almost?--if the problem has a lot of parallelism (big matrix computations, searching masses of data, similar number crunching on different inputs, all feeding together).
But more energy into the planning, organizing, managing, than into the actual "work"?
How programming is changing to improve coding efficiency and still make use of the hardware changes:
People benefit from programming in large abstractions: Objects, Classes, Procedures, data structures.
Layers: "Virtual" machines on top of RISC bases. (Backward compatibility). You think you're programming in a CISC architecture in C++, but by the time it gets to the core computer, it may be broken down into RISC type instructions. Extra instructions in op.sys (like DECI, STRO) or built in to the chip in microcode.
Java Virtual Machine (JVM): Java (C++ like) code to JVM code. JVM to machine instructions, different for each chip.
Optimizing compilers/code--from clearest for humans to most efficient for machine.
-- Expand short loops into straight code, rewrite procedures into "in-line" code.
-- Replace variable with constant if variable doesn't change.
-- Reorganize mathematical formulas for most efficient computation: a*(b+c) + d*(b+c) = (a+d)*(b+c)
-- Keep "scratch" (intermediate) variables in registers rather than store them, if not needed for passing or output.
-- Rearrange instructions to get longest paths without branches.
-- "Vector" operations (like matrix arithmetic) are naturally parallel. Compile to take advantage of architecture that can handle this.
-- Organize code into chunks which can run independently. (Processors can do this too, to optimize parallelism and/or pipelining.)
-- Rearrange instructions to avoid pipeline conflicts like waiting for results of a previous op (interleave independent ops). ("Superscalar" hardware--2 or more execution units in one processor--does this too.)
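To see what reorganizing a formula buys, here is a small Python sketch that counts the additions and multiplications in a*(b+c) + d*(b+c) versus (a+d)*(b+c). The Counted wrapper class and count_ops helper are invented for this illustration; a real compiler does this analysis on its own internal form of the code, not at run time.

```python
class Counted:
    """Number wrapper that counts the additions and multiplications done on it."""
    ops = {"add": 0, "mul": 0}

    def __init__(self, v):
        self.v = v

    def __add__(self, other):
        Counted.ops["add"] += 1
        return Counted(self.v + other.v)

    def __mul__(self, other):
        Counted.ops["mul"] += 1
        return Counted(self.v * other.v)

def count_ops(expr, *values):
    """Evaluate expr on the values; return (result, operation counts)."""
    Counted.ops = {"add": 0, "mul": 0}
    result = expr(*(Counted(v) for v in values))
    return result.v, dict(Counted.ops)

original = lambda a, b, c, d: a * (b + c) + d * (b + c)
factored = lambda a, b, c, d: (a + d) * (b + c)

print(count_ops(original, 2, 3, 4, 5))   # (49, {'add': 3, 'mul': 2})
print(count_ops(factored, 2, 3, 4, 5))   # (49, {'add': 2, 'mul': 1})
```

Same answer, one fewer addition and one fewer multiplication, and multiplication is typically the expensive one.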
Algorithm analysis: to develop ways to share out a problem to parallel processors or feed to a pipeline, efficiently. (Who should do the optimizing? Compiler (software) or Processor (hardware)?)
BACK TO:
Operating system: What does it have to do, and how (a little) does it do it? Warford, Ch. 8, lightly. 8.1, beginning 8.2
First job: Load a program and turn control over to it: (PEP8)
Loader: Is a subroutine in the operating system. Load button puts its first instruction address in PC.
Copies the program into memory starting at 0000,
puts FBCF into SP,
puts 0000 into PC,
next instruction to run is the one at 0000.
Our programs STOP. In a "real" system, control would revert to the operating system, which would load another program, never (rarely) stop.
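The loader's steps can be sketched on a simulated machine in Python. The Machine class, the load function, and the 64K memory size are assumptions made for this sketch, and the sample bytes are arbitrary stand-ins for object code; only the three steps (copy to 0000, SP = FBCF, PC = 0000) come from the notes above.

```python
MEMORY_SIZE = 0x10000   # 64K bytes of simulated memory (an assumption)

class Machine:
    def __init__(self):
        self.memory = bytearray(MEMORY_SIZE)
        self.pc = 0   # program counter
        self.sp = 0   # stack pointer

def load(machine, object_code):
    """Copy the program to address 0000, then set SP and PC."""
    machine.memory[0:len(object_code)] = object_code
    machine.sp = 0xFBCF
    machine.pc = 0x0000   # next instruction to run is the one at 0000

m = Machine()
load(m, bytes([0xC0, 0x00, 0x05, 0x00]))   # arbitrary sample bytes
print(hex(m.sp), m.pc, m.memory[0:4].hex())   # 0xfbcf 0 c0000500
```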
Trap: (similar to "interrupt"--language varies) Like a subroutine but part of the operating system. DECI, DECO, STRO are traps in PEP8.
Need to leave the running program, execute the trap, return, invisibly except for desired effect.
Process Control Block: on entering the trap, all info about the running program (process) is saved into memory in this block. (On system stack or special memory blocks.) Restored at return from trap.
What? All Registers! (including Flags, Stack Pointer, PC). That contains all info needed to pick up where we left off (unless memory has been messed with in a wrong way...)
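The save-and-restore idea, sketched in Python. The Registers class, run_trap, and the handler are hypothetical names for this illustration, and the address 0xFC00 is made up; the sketch only shows that saving every register on a system stack lets the trap code scribble freely and still return invisibly.

```python
import copy
from dataclasses import dataclass

@dataclass
class Registers:
    a: int = 0
    x: int = 0
    pc: int = 0
    sp: int = 0
    flags: int = 0

def run_trap(cpu, system_stack, handler):
    """Save all registers (the PCB), run the trap code, then restore them."""
    system_stack.append(copy.copy(cpu))   # save the Process Control Block
    handler(cpu)                          # trap code may use any register
    saved = system_stack.pop()
    vars(cpu).update(vars(saved))         # restore every register

def noisy_handler(cpu):
    cpu.a = 999          # trap code clobbers the accumulator
    cpu.pc = 0xFC00      # and runs somewhere else (made-up address)

cpu = Registers(a=7, x=3, pc=0x0012, sp=0xFBCF, flags=0b0100)
before = copy.copy(cpu)
stack = []
run_trap(cpu, stack, noisy_handler)
print(cpu == before, len(stack))   # True 0
```

A real trap also has a desired effect (input, output) that survives the restore because it lands in memory or on a device, not in the saved registers.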
Next. 8.3, Concurrent processes
This page belongs to Sally Sievers, who is solely responsible for its content.