Math 151, Fall '08, Fri. Day 16, Oct. 3. Hit reload. After class.
Reading: Ch. 5, Regression. Reread through p. 125. (Check p. 137, 5.14 through 5.20: basic line and
regression line facts and tools. 5.21: r and slope signs. 5.22 is
harder--changing units--don't worry about it. 5.23: If you sketch the
graph and draw a line through the points, you should be able to
guesstimate the slope well enough to choose among the 3
answers.) Now: the equation of the least-squares
line (p. 120) and Fact 2, p. 123. And, continuing regression,
pp. 126-137.
|
Hand in: More Regression
Line formula & Facts:
p. 122, 5.3b only. Verify the formula: find the means, s.d.'s and r
in the answers in the back of the book, and use them to calculate a and
b and write the formula for the regression line.
p. 141, 5.30 husbands and wives (Note, you have to find
the equation of the line to draw the graph, tho it doesn't explicitly
tell you to...)
p. 125, 5.5 (SPSS). Let SPSS find the regression line.
Get the mean yield and mean planting rate too; you need them for part (c). Corn
again; a straight line is a "bad fit." My book has a misprint in
(c): it should be "when xbar is the
mean planting rate".
. . . . . . . . . . .
pp. 143-4, 5.35, 37 (SPSS) Drilling into the past, silicon (one
clear outlier) To graph the lines with and without the outlier on
the same graph, make a new variable and put 1's in every case but the
outlier--give the outlier 0. Then use this variable as your Set
Markers By or your Panel By variable. Use Fit Line at Subgroups. You'll
also get a "nuisance" horizontal line at the outlier; ignore it.
To get the formula for the line without the outlier, use your new 0-1
variable as the Selection Variable (see p. 4 of Handout for details.)
Postpone the rest:
Residuals, Cautions
p. 129, 5.7 (SPSS) Does fast driving waste fuel? Residuals.
There is a data file for problem 5.7, and its third column is
the residuals. Do all the parts.
Also with 5.7: in SPSS, make a variable containing the residuals
(Handout, bottom p. 4; also middle-bottom of this page). The values should
match the ones in the book/SPSS file.
SPSS Handout p. 3 (Governors'
salaries): You can now finish #12, the last question.
Hand it all in Next time.
p.133, 5.9 Farm population Do a, b, c (read p. 132 for a good
word to use in part c). Also, make a variable containing the
residuals, and plot it against the x (year) values. Draw (in
pencil) a horizontal line at height 0. What pattern do you see in
the residuals?
B. Use Residuals07.xls or Residuals.xls from the website or the lab to
graph these data sets, along with a graph of the residuals. Print
the results, and describe the shape of the residuals (it may help to
connect the dots with pencil, to see the pattern.)
a) x 1 2 8 4 6 9
y 1 3 6 6 7 5
b) x 1 2 7 4 6 9
y 7 6 2 4 2 1
p. 179, 7.28, 29, 30 (SPSS) Soap in the shower.
Also, look carefully at the graph and guess why there is no data after
day 21. (Read p. 132 for the word to describe using the line for
day 30, and a discussion of the issue)
p. 136 5.13 hospitals: big = bad?
|
Read,
to discuss
A. Practice Calculating line
formula, following up class work, notes below.
Highlight space after question to see worked solution.
|
Optional
|
= = = = = = = = = = = = = = = = = = = = = =
Exam 2 a week from Friday: Day 19 (Oct. 10), the day
before break. Let me know right away if you can't take
the exam Friday. Starts with Ch. 3 (Normal
distribution, tables), through Ch. 4, and what we cover of Ch. 5
(& 7) through Monday. (All questions on the sample exam will be covered?)
Sample exam (handout), solutions (link) + Normal
probability practice. As of the beginning of today, you
can do all but #4d,e,f and #5. As of the end, you can now do #5.
One sheet of notes; I will give you paper
copies of the Normal table.
Are you having trouble seeing which
variable goes on the x axis? If there
is any sense that one is the cause of the other, or can/will
be used to predict or estimate the other, that's
the explanatory (x) variable. The other one is the
response (y) variable. (Sometimes you can choose
the x-values and see the response for each x in the corresponding
y, like the corn plant density problem. That's an experiment,
Ch. 9. Sometimes you can only observe.)
Language: "Regress heating oil ON temperature" means Temperature = x
= horizontal, Heating oil = y = vertical.
HW questions? Regression
Day 15
Leftover: Timeplots are
scatterplots where the x axis shows time. (Time is often a
lurking variable: plot the data against the order in which the observations were taken.)
"Trend" in a timeplot = "slope" in a usual scatterplot.
- - - - - - - - - - -
Regression line: Ch. 5, "Regressing y ON x".
Predicts or estimates a y (vertical) value for a given
x (horizontal) value: Straight line!
Experimenting: http://www.whfreeman.com/bps4e,
Correlation and Regression Applet.
SPSS--back of handout. Govsal on avgpay.
Formula: yhat = a + b x. Govsal = a + b avgpay
Govsal = 28,569.69 + 2.709*avgpay
Calculating: Montana (17,895, 55,502)
Govsal = 28,569.69 + 2.709*avgpay
Predicted Govsal
= 28,569.69 + 2.709*17,895 = 28,569.69 + 48,477.56 = 77,047.25
(higher than actual)
a is the y-intercept.
b is the slope:
If x increases one unit, yhat increases b units.
Governors' salaries increase (on the average across the states)
$2.71 for every increase of $1 of average pay.
(In a straight-line relationship, the amount that y increases
for a one-unit increase in x is the same no matter what value of
x you start with.) RegressionSlope.xls
or
in ClassMaterial\Math151-BPS4e \RegressionDemos Excel BPS4e
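The slope statement above can be checked numerically. This is only a minimal Python sketch, not part of the SPSS or Excel class materials; the function name `yhat` and the starting values 15,000 and 30,000 are made up for illustration, while the intercept, slope, and Montana's average pay come from the notes.

```python
import math

# Fitted line from the notes: Govsal = 28,569.69 + 2.709 * avgpay
A = 28569.69  # intercept a
B = 2.709     # slope b

def yhat(avgpay):
    """Predicted governor's salary for a given average pay (hypothetical helper)."""
    return A + B * avgpay

# Montana's avgpay (17,895) is from the notes; the other two starts are made up.
starts = (15000, 17895, 30000)
changes = [yhat(x + 1) - yhat(x) for x in starts]

for x, change in zip(starts, changes):
    print(f"from avgpay {x}: a $1 increase raises yhat by {change:.3f}")
```

Whatever starting value is used, the predicted salary should rise by the slope b (about $2.71) per extra dollar of average pay.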
r2 ("Coefficient of Determination") = fraction of the variation in y-values
explained/predicted by knowing x and using the least squares regression
line. (Fact 4)
HW: Income depends on height?!
What is "$789", and what kind of analysis
did they do? Regression.
Less than 15% of the variability in Pay is explainable by (regression
on ) height.
5.42 p. 146, a computer game, revisited. Can it really be
that only about 9% of the variability in speed of the right
hand is accounted for by the distance? The eye is fooled by the
graph, with the right hand data squashed down at the bottom and looking
really linear. Here is the right
hand by itself.
We all get the same line from a batch of data because we use the
"least-squares best fit" criterion (p. 119); we'll investigate this more closely later.
Facts: 1, 2 lite, 3 first. Then 4. Then 2 & Formulas p. 120, from 2 & 3.
Facts again
(Moore pp. 123-125)
- Which is explanatory, which is response, is crucial for
regression! The Regression line is trying to predict the
"average y" for a given x (with the added requirement that it is a
straight line). See "residual"(deviation) lines for govsal on avgpay.
Unless the data lies perfectly on a straight line, the line for
predicting weight from height -- "regressing weight on height" --(for
example) will NOT be the same line
as that for predicting height from weight--"regressing height on
weight". (In-class demonstration, on overhead projector.)
(Example 5.3, Fig. 5.4 pp.123-4 is about this. )
- Lite: The correlation coefficient r and the
slope b of the regression line have the same sign!
+ or - .
Negative/positive: trend=slope
~association~correlation
Heavy: A change of one standard deviation in x
corresponds to a change of r standard deviations in y, along the
regression line. We'll return to this today.
- The regression line goes through the point given by the
two means, (xbar, ybar).
Applet http://www.whfreeman.com/bps4e
We'll return to this today.
- r2 ("Coefficient of Determination") = fraction of
the variation in y-values explained/predicted by knowing x and using
the least squares regression line. SPSS writes "R
Square", or "R Sq Linear". (Exactly what that means
mathematically is hard. Just get
used to it as a measurement.)
Closer to 0, more scatter around the line. Closer to 1, tighter
clustering around the line. R-Squared (or RSquared.xls:
ClassMaterial\Math151-BPS4e\RegressionDemosOlderExcel) (Excel07? RSquared07.xls in RegressionDemosExcel07)
(Optional: Further explanation of
r2)
(Demos yet to come.)
r2 is the square of the
correlation coefficient r! (The - or + sign gets lost.)
If r = .7, about half (.49) of the variation in
the y's is explained by using the regression line relationship to
predict y from x. (If weight and height have a correlation of .7, then
about half of the variability in weight can be explained by knowing height.
Or vice versa.)
NOTE: The standard deviation doesn't say anything about
the distance of any individual point from the mean; it's only
about a kind of "average" variability.
R2 doesn't say anything about the line and any
particular (x,y) pair --just about a kind of "average"
goodness of the explanatory power of the line for the data.
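One way to get used to r2 as a measurement is to compute it two ways and see that they agree. The sketch below is an illustration only: the data values are made up, and the helper names `correlation` and `least_squares_line` are hypothetical, not anything from SPSS or the textbook. It compares the square of r with 1 minus (sum of squared residuals) / (total variation of the y's around ybar).

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: average product of standardized x and y (divide by n - 1)."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - xbar) / sx * (y - ybar) / sy for x, y in zip(xs, ys)) / (n - 1)

def least_squares_line(xs, ys):
    """Return (a, b) using b = r * sy / sx and a = ybar - b * xbar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 2.9, 4.5, 4.1, 6.2, 6.8]

a, b = least_squares_line(xs, ys)
r = correlation(xs, ys)
ybar = mean(ys)

total_variation = sum((y - ybar) ** 2 for y in ys)              # y's around their mean
leftover = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # y's around the line
explained_fraction = 1 - leftover / total_variation

print(f"r squared          = {r ** 2:.4f}")
print(f"explained fraction = {explained_fraction:.4f}")
```

The two printed values should match, which is the sense in which r2 is the "fraction of variation explained."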
New: Facts 2 & 3 give the line formula! (Moore pp. 123-125)
2. A change of one standard deviation in x corresponds
to a change of r standard deviations in y, along the regression
line.
The slope b expresses change in y-units
per x-unit. (Suppose x is inches, y is
pounds. Then b is in pounds per inch.) You can find b by
multiplying r by the standard deviation of the y's (that's in
pounds) and dividing by the standard deviation of the x's (that's
in inches).
In "algebra", b = r times (s.d. of
y)/(s.d. of x) (Equation p. 120)
If we standardize both the
x-values and the y-values, the slope will just = r!
govsalstd.sav, Govsalstd2.doc, RegressionSlope.xls or RegressionSlope07.xls (for Excel07)
3. The regression line goes through the point given
by the two means, (xbar, ybar). http://www.whfreeman.com/bps4e
--If you know this, you know ybar = a
+ b (xbar). You can solve this for a: a = ybar - b (xbar). (Other equation, p. 120)
--So knowing 2 and 3 gives you the equation of the line from the means,
s.d.'s, and r.
--And if you draw the two lines, y on x and x on y, they will intersect
at (xbar, ybar)
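The claim that the two regression lines differ but cross at the point of means can be checked with a short sketch. Everything here is an assumption for illustration: the height and weight values are made up, and `correlation` and `regress` are hypothetical helper names.

```python
import math
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: average product of standardized values (divide by n - 1)."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - xbar) / sx * (y - ybar) / sy for x, y in zip(xs, ys)) / (n - 1)

def regress(response, explanatory):
    """Line for predicting `response` from `explanatory`: returns (a, b)."""
    b = correlation(explanatory, response) * stdev(response) / stdev(explanatory)
    a = mean(response) - b * mean(explanatory)
    return a, b

# Made-up data: heights in inches, weights in pounds.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 130, 128, 155, 150, 180]

a_w, b_w = regress(weights, heights)  # "regressing weight on height"
a_h, b_h = regress(heights, weights)  # "regressing height on weight"
r = correlation(heights, weights)
hbar, wbar = mean(heights), mean(weights)

print(f"weight on height: weight-hat = {a_w:.2f} + {b_w:.3f} * height")
print(f"height on weight: height-hat = {a_h:.2f} + {b_h:.4f} * weight")
# Drawn on the same height (x) vs. weight (y) axes, the second line has slope 1 / b_h.
print(f"slopes on the same axes: {b_w:.3f} versus {1 / b_h:.3f}")
print(f"both lines pass through the means ({hbar:.2f}, {wbar:.2f})")
```

A side fact this sketch also shows: the product of the two slopes, b_w times b_h, equals r squared, so the lines coincide only when r is exactly +1 or -1.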
The line formula yhat = a + bx
from xbar, ybar, sx, sy, r:
Find b:
b = r sy / sx
(Fact 2: r is the slope if x and y are standardized. Equation p. 120)
Find a:
Solve ybar = a + b xbar for a: a = ybar - b xbar
(Fact 3: (xbar, ybar) lies on the
regression line(s). Equation p. 120)
Example. x is measured in Rangs, y in Zobs.
xbar = 5 Rangs, ybar = 8 Zobs, sx = 10
Rangs, sy = 6 Zobs, r = -.3:
b = -.3×6/10 (Zobs/Rang) = -0.18 Zobs/Rang.
8 = a + (-0.18)×5
8 Zobs = a Zobs + (-0.18)(Zobs/Rang) × 5 Rangs
8 = a - .9, so a = 8.9 Zobs
yhat = 8.9 - 0.18x Zobs
"A." Try it at
home: xbar = 7 cm, ybar = 8 oz,
sx = 4 cm, sy = 10 oz, r = .6 (highlight the
space just below here for the solution.)
b = .6×10/4 (oz/cm) = 1.5 oz/cm.
8 = a + (1.5)×7
8 oz = a oz + (1.5)(oz/cm) × 7 cm
8 = a + 10.5
a = 8 - 10.5 = -2.5 oz
yhat = -2.5 + 1.5x oz
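The two-step recipe (find b from Fact 2, then a from Fact 3) is short enough to write as a function. This is a sketch only; `line_from_summary` is a hypothetical name, and the inputs are the "try it at home" numbers above.

```python
import math

def line_from_summary(xbar, ybar, sx, sy, r):
    """Return (a, b) for yhat = a + b*x from the means, s.d.'s, and r."""
    b = r * sy / sx       # Fact 2: slope = r * (s.d. of y) / (s.d. of x)
    a = ybar - b * xbar   # Fact 3: the line passes through (xbar, ybar)
    return a, b

# "Try it at home" numbers: x in cm, y in oz.
a, b = line_from_summary(xbar=7, ybar=8, sx=4, sy=10, r=0.6)
print(f"b = {b:.2f} oz/cm, a = {a:.2f} oz")
print(f"yhat = {a:.2f} + {b:.2f} x")
```

The result should agree with the hand calculation: b = 1.5 oz/cm and a = -2.5 oz.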
Start here Monday
Least Squares Property, and Residuals
"Residual at x" = (y - yhat)
= distance between observed y and predicted y (= what's left over after
predicting). Also called "deviation".
(Positive if observed is bigger than predicted,
negative if observed is smaller than predicted.)
Residual: Look at an individual observed (x,y)
data pair. The residual is the "leftover" amount of y after
predicting a y using the line. Visually, it is the length of the vertical line
drawn from the point to the regression line (+ if the point is above the line, - if
the point is below the line).
Residual = observed y - predicted y
= "prediction error" (p. 119)
Calculating: Montana (17,895, 55,502)
Govsal = 28,569.69 + 2.709*avgpay
Predicted Govsal = 28,569.69 + 2.709*17,895
= 28,569.69 + 48,477.56 = 77,047.25
Residual = 55,502.00 - 77,047.25
= -21,545.25: $21,545 below the expected value.
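The Montana calculation can be reproduced in a few lines. This is a hedged sketch, not the SPSS procedure: it uses Python's `decimal` module so the dollars-and-cents rounding matches the hand calculation (the slope term is rounded to the cent, as in the notes), and the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Fitted line from the notes: Govsal = 28,569.69 + 2.709 * avgpay
A = Decimal("28569.69")
B = Decimal("2.709")
CENT = Decimal("0.01")

def predicted_govsal(avgpay):
    """yhat = a + b*x, with the slope term rounded to the cent as in the notes."""
    slope_part = (B * Decimal(avgpay)).quantize(CENT, rounding=ROUND_HALF_UP)
    return A + slope_part

def residual(observed, predicted):
    """Residual = observed y - predicted y."""
    return Decimal(observed) - predicted

montana_avgpay, montana_govsal = 17895, 55502  # the (x, y) pair from the notes
yhat = predicted_govsal(montana_avgpay)
res = residual(montana_govsal, yhat)

print(f"predicted Govsal: {yhat}")
print(f"residual:         {res}")
```

The negative sign says Montana's point lies below the regression line.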
Least squares principle: Find the line that
minimizes the sum of the squared residuals. (RegressionLeastSqs.xls,
or in Mac 101, ClassMaterials\Math151 BPS4e\ RegressionDemosOlderExcel
\RegressionLeastSqs.xls, Squares tab for older Excel, or RegressionLeastSqs07.xls
in RegressionDemosExcel07 for Excel07)
This method
of finding a "best fit" straight line for predicting y's from x's was
derived mathematically to work well with "joint normal"
data--elliptical clouds. (Same idea as mean & st. dev.) For
data of this sort, the line does give the mean of the y's
for each given x (at least in the abstract).
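The least squares principle can be tried out by comparing the fitted line with a few nearby lines. This sketch is an illustration with made-up data and hypothetical helper names. It computes the slope from sums of deviations, which is algebraically the same as b = r sy / sx.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b); the slope is computed from sums of deviations."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    return ybar - b * xbar, b

def sum_squared_residuals(a, b, xs, ys):
    """Add up (observed y - predicted y)^2 over all the data points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 2.9, 4.5, 4.1, 6.2, 6.8]

a, b = least_squares_line(xs, ys)
best = sum_squared_residuals(a, b, xs, ys)

# Nudge the intercept and/or slope and recompute the sum of squares.
others = [
    sum_squared_residuals(a + da, b + db, xs, ys)
    for da in (-0.5, 0, 0.5)
    for db in (-0.2, 0, 0.2)
    if (da, db) != (0, 0)
]

print(f"least-squares line:    sum of squared residuals = {best:.3f}")
print(f"best of 8 nudged lines: sum of squared residuals = {min(others):.3f}")
```

Every nudged line should give a larger sum of squared residuals than the least-squares line.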
Residuals drawn to line: Govsal-Deviations.doc, SPSS (handout, p. 3, bottom: In Edit mode,
Insert>Spikes: Spike to: Regression)
Drawback if the data is not the "elliptical cloud" type:
Outliers get their residual distance squared, so they
may be very influential in
determining the slope of the line.
Especially if at the lowest or highest x-values, an outlier may change the slope
of the line a lot.
Applet, http://bcs.whfreeman.com/BPS4e, ...Correlation and Regression. Play with
an outlier.
(Outliers
toward the middle x's may not change the slope, but may affect r and r2.)
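As a stand-in for playing with the applet, here is a small sketch of the same experiment. All the data are made up: six points exactly on the line y = 1 + x, then one added outlier, placed either at the mean of the x's or far out at a high x. The helper name `slope_and_r` is hypothetical.

```python
from math import sqrt
from statistics import mean

def slope_and_r(xs, ys):
    """Return (least-squares slope b, correlation r) from sums of deviations."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Made-up data lying exactly on y = 1 + x.
base_x = [1, 2, 3, 4, 5, 6]
base_y = [2, 3, 4, 5, 6, 7]

b_base, r_base = slope_and_r(base_x, base_y)
# Outlier at the middle of the x's (x = 3.5 is the mean of base_x).
b_mid, r_mid = slope_and_r(base_x + [3.5], base_y + [12])
# Outlier at an extreme x-value.
b_end, r_end = slope_and_r(base_x + [12], base_y + [2])

print(f"no outlier:      slope = {b_base:.3f}, r = {r_base:.3f}")
print(f"middle outlier:  slope = {b_mid:.3f}, r = {r_mid:.3f}")
print(f"extreme outlier: slope = {b_end:.3f}, r = {r_end:.3f}")
```

With these numbers the middle outlier leaves the slope at 1 but pulls r well below 1, while the extreme outlier flattens the slope to roughly zero.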
~ ~ ~ ~ ~ ~~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Plotting residuals: If you
graph residual values against x (or against predicted y's), you
eliminate visually the linear portion of the association. (The
regression line "becomes" the new x-axis; a "shear"
transformation.) Curving or other structure may stand out more
visibly. (No structure in residuals = Straight line is a "Good"
fit.) (Here or ClassMaterials\Math151
BPS4e\ RegressionDemosExcel BPS4e\Residuals.xls)
SPSS can make a new variable of residuals,
which you then can use to make a scatterplot. (Handout p. 3) govsal vs pay
Do Analyze>Regression>Linear
Click your variables into Independent (X) and Dependent(Y).
Hit the Button "Save...": Checkbox Residuals: Unstandardized. Continue,
Ok out of the menus. You'll get output; ignore it.
You'll get a new variable, the residuals. You can now
use this on the vertical axis of a scatterplot: "Residual plot."
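Outside SPSS, the same "make a residual variable" step can be sketched in Python. This is an illustration under stated assumptions: the data are made up (deliberately curved, y = x squared), and `least_squares_line` is a hypothetical helper. It prints the residuals against x in place of a plot.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    return ybar - b * xbar, b

# Made-up curved data: y = x squared.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

a, b = least_squares_line(xs, ys)

# The new "residual variable": observed y minus predicted y, one per case.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, res in zip(xs, residuals):
    print(f"x = {x}: residual = {res:+.1f}")
```

For this data the residuals run 5, 0, -3, -4, -3, 0, 5: positive at both ends and negative in the middle. That U shape is the kind of structure a residual plot makes visible, and it signals that a straight line is not a good fit. The residuals also add up to zero, as least-squares residuals always do.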
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Cautions
pp. 132-136
Plot the data:
Summary formulas and numbers don't tell the whole story.
In particular, correlation and the regression line only describe a linear
relationship properly.
Correlation and regression are not resistant to
outliers or influential points. ("Anscombe's quartet", Moore p. 142, 5.34) (Overhead slide. You can reconstruct
these pictures using SPSS and Moore's problem, if you like.)
Extrapolation-- extra (outside) polation (putting a point): Using the
line to predict outside the range of x's you have data for. Dangerous!
Linear relationships don't go on forever; a straight
line is often a first approximation to a more complicated
relationship.
"Lurking" variable: has an important effect, but is not one of the variables
studied.
Meatloaf shrinkage vs. placement in oven? (cooking thermometer/not had greatest
influence)
Time sequence of observations is a common one. (Learning, tiring, aging)
The trouble with lurking
variables is that by definition you don't know they're there.
Look behind every tree.
This page belongs to Sally Sievers who is solely
responsible
for its content. Please see our statement
of responsibility.