MATH 251, P&S I, Fall 2007, Monday Sept. 17, Day 11.
HW comments added; hit reload.
Reading: finish 2.3, read 2.4, Cautions/residuals/influentials (I'll
demonstrate graphing residuals in class. Focus on uses
tonight.)
Hand in:
Problems A and B below
finish 2.42 a, b, c next time basketball NO SPSS
finish 2.47a, b, c next time social distress (SPSS)
2.54 (SPSS) Better predictor of GPA?
2.53 (SPSS) metabolic rate Also, Make 2
graphs, each with one of the two regression lines
2.55 h&w heights to formula
2.57 icicles in inches. Look in the back of the book for the
answers to part a, use them to do parts b and c.
2.58 Julie's exam (formula and R^2)
2.59 attendance and grades
p. 169, 2.79 (This is a continuous-data version
of "Simpson's Paradox", p. 590.) Read, be able to discuss.
2.81 heart attacks
Make a rule of thumb for choosing a hospital for your heart attack (As
if one had a choice--closer is better, and most people don't get to
decide)
p. 181, 2.97 habitat diversity
p. 183, 2.103 heating deg. days, solar (optional)
A. (Not hard) If you know the means, standard deviations, and r for a
pair of variables, you can calculate the equation of the regression
line yhat = a + bx. Memorizing 2 facts is enough: "b = r (s_y / s_x)"
(= the correlation coefficient readjusted into "raw" units), and "the
pair of means (xbar, ybar) lies on the line".
Show that these are enough; that is, show how to get the formula for a,
if you know these facts. (#2.56 is the same problem "backwards".)
B. The least-squares best-fit line is the line yhat = a + bx that
minimizes the sum of the squared residuals (vertical distances from
each y_i point to the line). Two things can vary: the slope b, and how
high the line sits on the page (given by a, the intercept). (The
calculus to get the formula requires partial derivatives--Calculus
III(?)) Here's a simpler case:
You might ask (I know, you wouldn't--but you should...) what is the
best single point w to describe all the y-values, using this criterion:
"The sum of the squared distances of the y_i values from w is the
smallest possible"? (Another way of thinking of this, in the
scatterplot setting: what horizontal line best summarizes all the y's,
if we can't use the x-information?)
Find w: That is, find the w that makes f(w) = Sum (y_i - w)^2 the
minimum. (I can't make sigmas here: "Sum" = big sigma, sum from i = 1
to n.) (How? Find the derivative f'(w), set it equal to 0.)
If you aren't comfortable with big sigma sums, let n = 3:
f(w) = (y_1 - w)^2 + (y_2 - w)^2 + (y_3 - w)^2
(You should get a small "aha" experience, especially if you haven't
read p. 51 really carefully.)
Quizzes
returned. Mostly good. But most people
reversed direction on "at least"! Think about it!
Missed quiz? I expect advance notice if you need to miss, and
an arrangement to make it up promptly. If you can't know ahead of
time (sudden illness or emergency), I expect to hear as soon as
possible after the fact. Makeup may be possible, though points
may be "docked". It is your responsibility to initiate
this process.
HW questions?
Comments, Day 7: People still forget to check Measure: (F)

C: Almost J shaped, "0" is most frequent value(?)
What's with the weird gaps? They are an artifact of the choice of
histogram bin widths. Bins are a little narrower than 1 wide; the data
are whole numbers, so about every 5 bins, there's a gap.
Note that all the whole-number values actually have data, as you can
see by looking at the bar graph.
Linear regression, cont.
--Vertical distance from a point to the regression line: "Error" =
"Residual" = "Deviation" = (y_i - yhat_i)
The regression line minimizes the "Sum of Squared Errors" = the "Sum
of squared deviations" = the "Sum of squared residuals."
See Residuals, RegressionLeastSquares (or in
Math251-IPS5e\RegressionDemosExcel); Govsal-deviations.doc (in Word);
Govsal-deviations.spo (Math251-IPS5e\SPSSforClass, output file)
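Here is a small Python sketch of what "minimizes the sum of squared residuals" means. The data are made up for illustration (they are not the Govsal data), and the helper names `sse` and `least_squares` are invented here. The slope is computed with the standard least-squares formula. Nudging a or b away from the fitted values should only make the sum bigger.

```python
# Hypothetical data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

def sse(a, b, xs, ys):
    """Sum of squared residuals (y_i - yhat_i)^2 for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

a, b = least_squares(xs, ys)
best = sse(a, b, xs, ys)

# Try lines with a slightly different intercept and/or slope.
nudged = [
    sse(a + da, b + db, xs, ys)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.2, 0.0, 0.2)
    if (da, db) != (0.0, 0.0)
]

print("SSE of fitted line:", best)
print("smallest SSE among nudged lines:", min(nudged))
```

Every nudged line has a larger sum of squared residuals than the fitted one, which is all "least squares" claims.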
--"Regressing weight ON height":
Height on the x axis, predicting weight from height.
--Unless the data lie perfectly on a straight line, the line for
predicting weight from height ("regressing weight on height", for
example) will NOT be the same line as that for predicting height from
weight ("regressing height on weight"), because you are measuring the
deviations from the line in different directions! (In-class
demonstration.) (The picture on p. 140 is about this.)
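To see the two lines disagree, here is a Python sketch with made-up heights and weights (hypothetical numbers, not class data; `fit` is a helper invented for this illustration). It fits weight on height, then height on weight, and rewrites the second line's slope in the same "pounds per inch" terms so the two can be compared.

```python
# Hypothetical data, made up for illustration only.
heights = [60.0, 62.0, 65.0, 68.0, 70.0, 72.0]        # inches
weights = [115.0, 120.0, 140.0, 150.0, 175.0, 170.0]  # pounds

def fit(xs, ys):
    """Least-squares line for predicting ys from xs: returns (a, b)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

# Regress weight ON height: weight-hat = a_wh + b_wh * height
a_wh, b_wh = fit(heights, weights)

# Regress height ON weight: height-hat = a_hw + b_hw * weight
a_hw, b_hw = fit(weights, heights)

# Drawn on the same axes (height across, weight up), the second line
# rises 1 / b_hw pounds per inch.
slope_hw_same_axes = 1.0 / b_hw

print("weight on height, pounds per inch:", b_wh)
print("height on weight, pounds per inch on the same axes:", slope_hw_same_axes)
```

With a positive correlation that is less than perfect, the height-on-weight line comes out steeper than the weight-on-height line when both are drawn on the same axes.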
Formulas for computing the regression line: IPS pp. 137-8
(From data, no computer? Find an old textbook...)
- A change of one standard deviation in x corresponds to a change of r
standard deviations in y, along the regression line. (RegressionSlope)
The slope b expresses change in y-units per x-unit. (Suppose x is
inches, y is pounds. Then b is in pounds per inch.) You can find b by
multiplying r by the standard deviation of the y's (that's in pounds)
and dividing by the standard deviation of the x's (that's in inches).
In "algebra", b = r times (s.d. of y)/(s.d. of x) (Equation p. 137)
If we standardize both the x-values and the y-values, the slope will
just = r! See Govsalstd2.doc, govsalstd.sav, govsalstd.spo
(in Math251-IPS5e\SPSS for Class).
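A quick Python check of both claims, using made-up data (hypothetical values, not the govsalstd file). The helpers `correlation`, `standardize`, and `slope` are names invented for this sketch, and only the standard library's `statistics` module is used.

```python
from statistics import mean, stdev

# Hypothetical data, made up for illustration only.
xs = [61.0, 64.0, 66.0, 69.0, 71.0, 74.0]        # e.g. inches
ys = [120.0, 135.0, 128.0, 160.0, 155.0, 180.0]  # e.g. pounds

def correlation(xs, ys):
    """r = average product of standardized values, dividing by n - 1."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

def standardize(values):
    """Convert raw values to z-scores."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    xbar, ybar = mean(xs), mean(ys)
    return (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))

r = correlation(xs, ys)
b_raw = slope(xs, ys)                            # pounds per inch
b_std = slope(standardize(xs), standardize(ys))  # s.d.'s of y per s.d. of x

print("r =", r)
print("raw slope =", b_raw, " r * sy / sx =", r * stdev(ys) / stdev(xs))
print("slope after standardizing =", b_std)
```

The raw slope matches r times (s.d. of y)/(s.d. of x), and the slope on the standardized values matches r itself.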
- The regression line goes through the point given by the two means,
(xbar, ybar). Applet: http://www.whfreeman.com/ips5e
If you know this, you know ybar = a + b (xbar). Solve for a (problem A):
--a = ybar - b (xbar). (Other equation, p. 137)
--So knowing these two facts (the slope formula and the point of means)
gives you the equation of the line from the means, s.d.'s, and r.
--And if you draw the two lines, y on x and x on y, they will intersect
at (xbar, ybar).
The line formula yhat = a + bx from xbar, ybar, s_x, s_y, r:
Find b: b = r s_y / s_x
Find a: Solve ybar = a + b xbar for a: a = ybar - b xbar
Example. xbar = 5, ybar = 8, s_x = 10, s_y = 6, r = -.3:
b = -.3×6/10 = -0.18; 8 = a + (-0.18)×5 = a - 0.9, so a = 8.9;
yhat = 8.9 - 0.18x
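The two-step recipe is short enough to write as a Python function, which is a handy way to check the arithmetic in an example like this one. The function name `line_from_summary` is invented for this sketch; the inputs are the five summary numbers from the example.

```python
def line_from_summary(xbar, ybar, sx, sy, r):
    """Regression line yhat = a + b*x from means, s.d.'s, and r."""
    b = r * sy / sx        # slope: r rescaled into raw units
    a = ybar - b * xbar    # intercept: the line passes through (xbar, ybar)
    return a, b

# The example's numbers: xbar = 5, ybar = 8, sx = 10, sy = 6, r = -.3
a, b = line_from_summary(xbar=5, ybar=8, sx=10, sy=6, r=-0.3)

print(f"b = {b:.2f}")
print(f"a = {a:.2f}")
# Sanity check: plugging in xbar should give back ybar.
print(f"yhat at xbar = {a + b * 5:.2f}")
```

Plugging xbar back into the line is a good habit: if you don't get ybar, the intercept is wrong.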
r^2 ("Coefficient of Determination") = proportion of variability in
y-values explained/predicted by knowing x and using the least-squares
regression line. IPS pp. 141-3. Written "R-Square" in SPSS graphs.
(R-Squared, in Math251-IPS5e\RegressionDemosExcel: further explanation
of r^2.)
r^2 is the square of the correlation coefficient r! (The -, + sign
gets lost.)
If r = .7, about half (.49) of the variability
in the y's is explained by using the regression line relationship to
predict
y from x. (If weight and height have a correlation of .7, then half of
the variability in weight can be explained by knowing height.)
The formula Moore gives, p. 142, is the "same" as the formula often
used (divide top & bottom by n-1):
r^2 = (variance of predicted values yhat) / (variance of observed values y)
    = [Sum of explained squared variation/(n-1)]
      / [Sum of observed (total) squared variation/(n-1)]
(Unaccounted-for variability = (1 - r^2) = variance of residuals
/ total variance of observed y's)
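These variance ratios can be checked numerically. The Python sketch below uses made-up data (hypothetical, not from the text) and helper names invented here (`fit`, `correlation`). It computes r^2 three ways: by squaring r, as variance of predicted values over variance of observed values, and as 1 minus the residual share.

```python
from statistics import mean, stdev, variance

# Hypothetical data, made up for illustration only.
xs = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
ys = [3.0, 6.5, 5.0, 9.0, 8.5, 12.0]

def fit(xs, ys):
    """Least-squares line for predicting ys from xs: returns (a, b)."""
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def correlation(xs, ys):
    """r = average product of standardized values, dividing by n - 1."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

a, b = fit(xs, ys)
yhat = [a + b * x for x in xs]                 # predicted values
resid = [y - yh for y, yh in zip(ys, yhat)]    # residuals

r_squared = correlation(xs, ys) ** 2
explained_share = variance(yhat) / variance(ys)
unexplained_share = variance(resid) / variance(ys)

print("r^2 =", r_squared)
print("var(yhat) / var(y) =", explained_share)
print("var(residuals) / var(y) =", unexplained_share)
```

The first two numbers agree, and the third is 1 minus them, so the explained and unexplained shares add up to all of the variability in y.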
NOTE: The standard deviation doesn't say anything about the distance
of any individual point from the mean; it's only about a kind of
"average" variability. Likewise, r^2 doesn't say anything about the
line and any particular (x, y) pair, just about a kind of "average"
goodness of the explanatory power of the line for the data.
This page belongs to Sally Sievers who is solely
responsible
for its content. Please see our statement
of responsibility.