| Hand in Monday
p. 122, 5.3b only (verify formula). Find the means, s.d.'s and r in the answers in the back of the book, and use them to calculate a and b and write the formula for the regression line.
p. 141, 5.30 husbands and wives. (Note: you have to find the equation of the line to draw the graph, though it doesn't explicitly tell you to.)
p. 125, 5.5 (SPSS) corn again; straight line is a "bad fit". Let SPSS find the regression line. Get the mean yield and mean planting rate too; you need them for part c.
. . . . . . . . . . .
Residuals
p. 129, 5.7 (SPSS) does fast driving waste fuel? residuals. There is a data file for problem 5.7, and its third column is the residuals. Do all the parts. Also with 5.7, in SPSS, make a variable containing the residuals (Handout, bottom p. 4. Also middle-bottom of this page.) The values should match the ones in the book/SPSS file.
SPSS Handout p. 3 (Governors' salaries): You can now finish #12, the last question. Hand it all in next time.
p. 133, 5.9 (SPSS) Farm population. Do a, b, c (read p. 132 for a good word to use in part c). Also, make a variable containing the residuals, and plot it against the x (year) values. Draw (in pencil) a horizontal line at height 0. What pattern do you see in the residuals?
B. Use Residuals.xls from the website or the lab to graph these data sets, along with a graph of the residuals. Print the results, and describe the shape of the residuals (it may help to connect the dots with pencil, to see the pattern).
a) x 1 2 8 4 6 9
   y 1 3 6 6 7 5
b) x 1 2 7 4 6 9
   y 7 6 2 4 2 1
p. 179, 7.28, 29, 30 (SPSS) Soap in the shower.
Also, look carefully at the graph and guess why there is no data after
day 21. (Read p. 132 for the word to describe using the line for
day 30, and a discussion of the issue) |
Read, to discuss
Look at this, especially with reference to the r standard deviations in y for every 1 standard deviation in x:
A. Open the Excel file RegressionSlope (or in the folder RegressionDemosExcel for D&V in ClassMaterial\Math151 D&V). Change x-y values in the yellow boxes and watch the line change. Change x-values in col. F and watch the "run" (red line) change, in the rightmost 2 graphs. Notice the slope = the coefficient of x = the rise/run = increase in y per unit increase in x. Fix it so the increase in x (the "run") is exactly 1. Also, look at the leftmost graph, where the lengths of the standard deviations are shown, and note that in standard-deviation units, the rise is r s.d.'s in y for each s.d. of run in x.
C. Use Applet http://www.whfreeman.com/BPS4e
Correlation/regression. Make a cloud of data (about 15
points), put in the regression line. Play with an outlier: drag a
point to the far left (or right) and drag it up and down. Postpone: |
Optional p. 179, 7.27 (review Normal) Postpone p. 136, 5.11, lurking variables
|
NOTE: The standard deviation doesn't say anything about the distance of any individual point from the mean; it's only about a kind of "average" variability.
r² doesn't say anything about how close the line is to any particular (x,y) pair; it's only about a kind of "average" goodness of the explanatory power of the line for the data.
HW questions? Day 15
5.42 p. 146, a computer game, revisited. Can it really be
that only about 9% of the variability in speed of the right
hand is accounted for by the distance? The eye is fooled by the
graph, with the right hand data squashed down at the bottom and looking
really linear. Here is the right
hand by itself. (SPSS output
file)
Income depends on height?!
What is "$789", and what kind of analysis
did they do? (HW) How much of the variation in salary is
explained by height?
- - - - - - - - - - -
Fact 1: Regressing Variable A on Variable B doesn't
give the same line as regressing Variable B on Variable A. The line gives the "best"
vertical value for a given horizontal value. See "residual"
lines for govsal on avgpay.
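Fact 1 can be checked numerically. The sketch below is only an illustration, not part of the handout: it borrows data set a) from the homework, and the helper names `correlation` and `least_squares_line` are made up for this example. It fits the line both ways and compares the slopes on the same axes.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: average product of standardized values (n - 1 divisor)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (n - 1)

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line predicting ys from xs."""
    r = correlation(xs, ys)
    slope = r * stdev(ys) / stdev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return intercept, slope

# Data set a) from the homework list
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

a_yx, b_yx = least_squares_line(x, y)  # predict y from x
a_xy, b_xy = least_squares_line(y, x)  # predict x from y

# Redraw the x-on-y line on the usual axes (y vertical): its slope is 1 / b_xy.
slope_xy_on_same_axes = 1 / b_xy
r = correlation(x, y)

print(f"y on x: slope {b_yx:.3f}")
print(f"x on y, same axes: slope {slope_xy_on_same_axes:.3f}")
print(f"r = {r:.3f}")
```

The two slopes agree only when r is exactly +1 or -1; otherwise the x-on-y line is the steeper one on these axes.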
- - - continuing- - - - - - - - -
Facts 2 & 3 give the line formula, and more! (Moore pp. 123-125)
(For details see Day 15)
b = r times (s.d. of y)/(s.d. of x) (Equation p. 120)
ybar = a + b (xbar), i.e., the line passes through the point (xbar, ybar).
Solve this for a: a = ybar - b (xbar). (Other equation p. 120)
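The two formulas above are all you need to get the line from summary numbers, as in problem 5.3b. Here is a minimal sketch; the function name `line_from_summary` and the summary values are invented for illustration and are not the book's answers.

```python
def line_from_summary(xbar, ybar, sx, sy, r):
    """Return (a, b) for the regression line yhat = a + b*x."""
    b = r * sy / sx        # slope: r s.d.'s of y per s.d. of x
    a = ybar - b * xbar    # intercept: forces the line through (xbar, ybar)
    return a, b

# Hypothetical summary statistics, for illustration only
xbar, ybar = 10.0, 50.0
sx, sy = 2.0, 8.0
r = 0.5

a, b = line_from_summary(xbar, ybar, sx, sy, r)
print(f"yhat = {a} + {b} x")

# Check: plugging in xbar should give back ybar.
print(a + b * xbar)
```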
Start
here Monday:
Least Squares Property, and Residuals
"Residual at x"
= (y - yhat)
= distance between observed y and predicted y (= what's left over after predicting)
( Positive if observed is bigger than predicted, negative
if observed is smaller than predicted)
Residual: Look at an individual observed (x,y) data pair.
The residual is the "leftover" amount of y after predicting a y using the line.
Visually, length of vertical line drawn from y to regression line (+ if point
is above line, - if point is below line)
Residual = observed
y - predicted y = "prediction
error" p. 119
Calculating: Montana (avgpay, Govsal) = (17895, 55502)
Govsal = 28,569.69 + 2.71*avgpay
Predicted Govsal = 28,569.69 + 2.71*17,895 = 28,569.69 + 48,495.45 = 77,065.14
Residual = 55,502 - 77,065 = -21,563, i.e. $21,563 below expected value.
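The Montana calculation can be redone in a few lines. This sketch uses the rounded coefficients quoted above; the function names are made up for the example.

```python
INTERCEPT = 28569.69   # a, from the Govsal regression line above
SLOPE = 2.71           # b, dollars of Govsal per dollar of avgpay

def predicted_govsal(avgpay):
    """Predicted governor's salary from the regression line."""
    return INTERCEPT + SLOPE * avgpay

def residual(avgpay, govsal):
    """Residual = observed y - predicted y."""
    return govsal - predicted_govsal(avgpay)

montana_pred = predicted_govsal(17895)
montana_resid = residual(17895, 55502)

# Prints roughly 77065.14 and -21563.14
print(round(montana_pred, 2), round(montana_resid, 2))
```

The negative sign says Montana's point lies below the line.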
Least squares principle: Find the line that minimizes the
sum of the squared residuals. (Here,
or in Mac 101, ClassMaterials\Math151 BPS4e\ RegressionDemosExcel
BPS4e\RegressionLeastSqs.xls, Squares tab)
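A quick way to see the principle without the spreadsheet: compute the sum of squared residuals for the least-squares line, then for a few nearby lines. This is only a sketch, using data set a) from the homework and hypothetical helper names; the slope formula used here, Sxy/Sxx, is algebraically the same as b = r (s.d. of y)/(s.d. of x).

```python
def sse(a, b, xs, ys):
    """Sum of squared residuals for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    return ybar - b * xbar, b

# Data set a) from the homework list
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

a, b = least_squares_line(x, y)
best = sse(a, b, x, y)

# Nudge the intercept and slope a little in every direction.
nudged = [
    sse(a + da, b + db, x, y)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print(f"least-squares line: yhat = {a:.3f} + {b:.3f} x, SSE = {best:.3f}")
print(f"smallest SSE among the nudged lines: {min(nudged):.3f}")
```

Every nudged line has a larger sum of squares than the least-squares line.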
This method of finding
a "best fit" straight line for predicting y's from x's was derived mathematically
to work well with "joint normal" data, that is, elliptical clouds. (Same idea as mean &
st. dev.) For data of this sort, the line does give the mean
of the y's for each given x (at least in the abstract).
Residuals drawn to line: Govsal-Deviations.doc; in SPSS (handout, p. 3, bottom): in Edit mode, Insert>Spikes, Spike to: Regression.
Drawback if the data is not the "elliptical cloud" type:
Outliers get their residual distance squared:
they may be very influential in determining the slope of the line. Especially
if at the lowest or highest x-values, they may change the slope of the line a lot.
Applet, http://bcs.whfreeman.com/BPS4e, ...Correlation & regression. Play with an outlier.
(Outliers toward the
middle x's may not change the slope, but may affect r and r².)
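The applet experiment can be imitated numerically. In this sketch the base points are data set a) from the homework; the extra "dragged" point, at x = 20 or x = 5 with y = 0 or 12, is hypothetical and chosen only to make the effect visible.

```python
def slope(xs, ys):
    """Least-squares slope b = Sxy / Sxx."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

# Data set a) from the homework list
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

def slope_with_extra_point(x_extra, y_extra):
    """Slope after adding one hypothetical extra point to the data."""
    return slope(x + [x_extra], y + [y_extra])

# Extra point far to the right (x = 20), dragged down and up
far_low = slope_with_extra_point(20, 0)
far_high = slope_with_extra_point(20, 12)

# Extra point at the mean of the x's (x = 5), dragged down and up
mid_low = slope_with_extra_point(5, 0)
mid_high = slope_with_extra_point(5, 12)

print(f"far-right point: slope goes from {far_low:.3f} to {far_high:.3f}")
print(f"middle point:    slope goes from {mid_low:.3f} to {mid_high:.3f}")
```

The point at the mean of the x's leaves the slope alone however far it is dragged, while the far-right point swings the line.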
~ ~ ~ ~ ~ ~Do plotting
residuals Wed. Day 18?~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Plotting residuals: If you graph
residual values against x (or against predicted y's), you eliminate
visually the linear portion of the association. (The regression line "becomes"
the new x-axis; a "shear" transformation.) Curving or
other structure may stand out more visibly. (No structure in residuals
= straight line is a "good" fit.) (Here or ClassMaterials\Math151 BPS4e\ RegressionDemosExcel BPS4e\Residuals.xls)
SPSS can make a new variable of residuals,
which you then can use to make a scatterplot. (Handout, bottoms of pp. 3 and 4)
Do Analyze>Regression>Linear (a new menu for us).
Click your variables into Independent (X) and Dependent (Y).
Hit the button "Save...": check the box Residuals: Unstandardized. Continue, OK out
of the menus. You'll get output; ignore it.
You'll get a new variable, the residuals. You can now use this
on the vertical axis of a scatterplot: "Residual plot"
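Outside SPSS, the same "new variable of residuals" takes only a few lines to compute. This sketch uses data set b) from the homework; the helper name is made up, and the printout is a stand-in for the residual plot, listing each residual against x with its side of the horizontal line at 0.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    return ybar - b * xbar, b

# Data set b) from the homework list
x = [1, 2, 7, 4, 6, 9]
y = [7, 6, 2, 4, 2, 1]

a, b = least_squares_line(x, y)

# The "new variable": one residual (observed y - predicted y) per data pair
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# List residuals in order of x, as they would appear left to right on the plot
for xi, ri in sorted(zip(x, residuals)):
    side = "above 0" if ri > 0 else "below 0"
    print(f"x = {xi}: residual = {ri:+.2f} ({side})")

print(f"sum of residuals = {sum(residuals):.10f}")
```

The residuals from a least-squares line always add up to 0 (up to rounding), which is why the horizontal line at height 0 is the natural reference.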
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Cautions, pp. 132-136 (mostly done Mon., Day 17)
Plot the data: Summary formulas and numbers don't tell the whole story. In particular,
correlation and the regression line only describe a linear relationship properly.
Correlation and regression are not resistant to outliers and influential
points. ("Anscombe's quartet", Moore p. 142, 5.34) (Overhead slide. You can reconstruct these pictures
using SPSS and Moore's problem, if you like.)
Extrapolation: extra (outside) polation (putting a point). Using the line to predict
outside the range of x's you have data for. Linear relationships don't go
on forever; a straight line is often a first approximation to a
more complicated relationship.
Government projections of national budget surplus/deficit:
(www.cbo.gov publications>search)
Jan. 2001 http://www.cbo.gov/showdoc.cfm?index=2727&sequence=6
Projection used to justify Bush tax cuts.
Jan. 2002
http://www.cbo.gov/showdoc.cfm?index=3277&sequence=6
August 2006
http://www.cbo.gov/ftpdocs/74xx/doc7492/08-17-BudgetUpdate.pdf
Pdf p. 19, single line projection--10 years,
p. 36, uncertainty--6 years.
March 2007 (p. 2; pdf p. 8)
http://www.cbo.gov/ftpdocs/78xx/doc7837/03-05-Uncertain.pdf
June 2000, conservative think tank analysis http://www.hoover.org/publications/policyreview/3487697.html
Fig. 1, budget surplus/deficit 1901 on. Notice the only previous long-term surplus is the 1920's.
Fig. 6, 1960 on, & projections
"Lurking" variable: has an important effect, but is not one of the variables studied.
Meatloaf shrinkage vs. placement in oven? (cooking thermometer/not had greatest influence)
Time sequence of observations is a common one. (Learning, tiring, aging)
The trouble with lurking variables is that by definition you don't know they're there.
Look behind every tree.
Do this next time: Association does not imply causation
Strong association/correlation between A and B could be:
A causes B / B causes A / C causes both A and B (lurking C) / just chance that they go together in this data set.
Direction? Rooster causes sun to rise by
crowing?
Both variables "caused" by a lurking variable?
Lurking variable can be part of the cause.
--Women with a history of heavy antibiotic use have higher rates of
breast cancer.
--Baby rats whose mothers licked and groomed
them more grew up to be more exploratory, social, less
timid.
Cause? Effect? How to tell?
Establishing that x "causes" y is difficult:
Best: Do an experiment in which we change x and keep lurking variables under control. (Ch. 9. Rats.)
Otherwise: Strong association. Consistent over many studies. Higher x --> stronger y.
X precedes y in time. A plausible mechanism exists (parallel studies?)
Generalize rat grooming to humans?
E.g., partially hydrogenated oils ("trans fats") --> heart
disease? Homocysteines --> heart disease?
| Sievers home | Math151-Fall07/Dayf16.htm | 3:30pm | 10/1/07 |