MATH 251, Probability and Statistics I, Fall 2005, Friday Sept. 16, Day 10
Reading: finish 2.3, read 2.4, Cautions/residuals/influentials (I'll
demonstrate graphing residuals in class. Focus on uses
tonight.)
Hand in:
Problems A and B below
finish 2.42 a, b, c next time basketball NO SPSS
finish 2.47a, b, c next time social distress (SPSS)
2.54 (SPSS) Better predictor of GPA?
2.53 (SPSS) metabolic rate. Also, make 2 graphs, each with one of the
two regression lines
2.55 h&w heights to formula
2.57 icicles in inches. Look in the back of the book for the
answers to part a, use them to do parts b and c.
2.58 Julie's exam (formula and R^2)
2.59 attendance and grades
p. 169, 2.79 (This is a continuous-data version of "Simpson's
Paradox", p. 590)
Read, be able to discuss
2.81 heart attacks
Make a rule of thumb for choosing a hospital for your heart attack
(As if one had a choice--closer is better, and most people don't get
to decide)
p. 181, 2.97 habitat diversity
p. 183 2.103 heating deg. days, solar
Optional
A. (Not hard) If you know the means, standard deviations, and r for a
pair of variables, you can calculate the equation of the regression
line yhat = a + bx. Memorizing 2 facts is enough: "b = r (sY/sX)"
(= the correlation coefficient readjusted into "raw" units), and "the
pair of means (xbar, ybar) lies on the line".
Show that these are enough; that is, show how to get the formula for
a, if you know these facts. (#2.56 is the same problem "backwards".)
B. The least-squares best-fit line is the line yhat = a + bx that
minimizes the sum of the squared residuals (vertical distances from
each yi point to the line). Two things can vary--the slope b, and how
high the line sits on the page (given by a, the intercept). (The
calculus to get the formula requires partial derivatives--Calculus
III(?).) Here's a simpler case:
You might ask (I know, you wouldn't--but you should...): what is the
best single point w to describe all the y-values, using the criterion
that the sum of the squared distances of the yi values from w is the
smallest possible? (Another way of thinking of this, in the
scatterplot setting: what horizontal line best summarizes all the
y's, if we can't use the x-information?)
Find w: That is, find the w that makes f(w) = Sum (yi - w)^2 the
minimum. (I can't make sigmas here: "Sum" = big sigma, sum from i = 1
to n.) (How? Find the derivative f'(w), set it equal to 0.)
If you aren't comfortable with big sigma sums, let n = 3:
f(w) = (y1 - w)^2 + (y2 - w)^2 + (y3 - w)^2
(You should get a small "aha" experience, especially if you haven't
read p. 51 really carefully.)
HW questions?
Linear regression, cont.
--Vertical distance from point to regression line: "Error" =
"Residual" = "Deviation" = (yi - yhati)
The regression line minimizes the "Sum of squared errors", also called
the "Sum of squared deviations" or the "Sum of squared residuals."
See Residuals, RegressionLeastSquares (or in
Math251\RegressionDemosExcel) Govsal-deviations.spo
(Math251\SPSSforClass, output file)
--"Regressing weight ON height":
Height on the x axis, predicting weight from height.
--Unless the data lie perfectly on a straight line, the line for
predicting weight from height--"regressing weight on height" (for
example)--will NOT be the same line as that for predicting height
from weight--"regressing height on weight"--because you are measuring
the deviations from the line in different directions! (In-class
demonstration) (The picture on p. 140 is about this.)
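The "different directions" point above can be checked numerically. The following is only a sketch in Python (not SPSS, and not the in-class demonstration); the height/weight numbers and the helper names `fit` and `sse` are made up for illustration. It fits both lines and compares their squared errors measured vertically (in pounds) and horizontally (in inches).

```python
from statistics import mean

# Made-up illustration data (NOT from the course files).
height = [60, 62, 65, 68, 70, 72]        # inches
weight = [115, 120, 140, 155, 160, 180]  # pounds

def fit(u, v):
    """Least-squares line for predicting v from u: returns (intercept, slope)."""
    ubar, vbar = mean(u), mean(v)
    slope = (sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v))
             / sum((ui - ubar) ** 2 for ui in u))
    return vbar - slope * ubar, slope

def sse(predicted, observed):
    """Sum of squared differences between predicted and observed values."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

a, b = fit(height, weight)   # regressing weight on height: weight_hat = a + b*height
c, d = fit(weight, height)   # regressing height on weight: height_hat = c + d*weight

# Vertical errors (in pounds): use each line to predict weight from height.
# The second line is solved for weight: weight = (height - c) / d.
vert_line1 = sse([a + b * h for h in height], weight)
vert_line2 = sse([(h - c) / d for h in height], weight)

# Horizontal errors (in inches): use each line to predict height from weight.
# The first line is solved for height: height = (weight - a) / b.
horiz_line1 = sse([(w - a) / b for w in weight], height)
horiz_line2 = sse([c + d * w for w in weight], height)

print("slopes on the same axes (pounds per inch):", b, 1 / d)
print("vertical SSE:   weight-on-height", vert_line1, " height-on-weight", vert_line2)
print("horizontal SSE: weight-on-height", horiz_line1, " height-on-weight", horiz_line2)
```

Each line should win only on the kind of error it was built to minimize, which is why the two lines differ unless the points are exactly collinear.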
Formulas for computing regression line: IPS 137-8
(from data, no computer? Find an old textbook...)
- A change of one standard deviation in x corresponds to a change of r
standard deviations in y, along the regression line. RegressionSlope
The slope b expresses change in y-units per x-unit. (Suppose x is
inches, y is pounds. Then b is in pounds per inch.) You can find b by
multiplying r by the standard deviation of the y's (that's in pounds)
and dividing by the standard deviation of the x's (that's in inches).
In "algebra", b = r times (s.d. of y)/(s.d. of x) (Equation p. 137)
If we standardize both the x-values and the y-values, the slope will
just = r!
govsalstd.sav, govsalstd.spo (In Math251\SPSS for Class)
- The regression line goes through the point given by the two means,
(xbar, ybar). http://www.whfreeman.com/ips
If you know this, you know ybar = a + b (xbar). Solve for a
(problem A):
--a = ybar - b (xbar). (Other Equation p. 137)
--So knowing these two facts (the slope formula and the point of
means) gives you the equation of the line from the means, s.d.'s,
and r.
--And if you draw the two lines, y on x and x on y, they will
intersect at (xbar, ybar)
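As a rough check of the two facts above, here is a small Python sketch (plain Python rather than SPSS). The data are made up for illustration, and the variable names are my own. It builds the line from the means, s.d.'s, and r only, compares the slope with one computed directly from the data, and refits after standardizing.

```python
from statistics import mean, stdev

# Made-up illustration data (NOT from the course files).
x = [60, 62, 65, 68, 70, 72]         # inches
y = [115, 120, 140, 155, 160, 180]   # pounds

n = len(x)
xbar, ybar = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)          # sample s.d.'s (n - 1 divisor)

# Standardized values, and r as the average product of the z-scores.
zx = [(xi - xbar) / sx for xi in x]
zy = [(yi - ybar) / sy for yi in y]
r = sum(u * v for u, v in zip(zx, zy)) / (n - 1)

# Fact 1: slope from r and the two s.d.'s (pounds per inch).
b = r * sy / sx
# Fact 2: the line passes through (xbar, ybar), so solve for the intercept.
a = ybar - b * xbar

# Slope computed directly from the raw data, for comparison.
b_direct = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))

# Least-squares slope for the standardized data (both means are 0).
b_std = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)

print("r =", r)
print("line from summary statistics: yhat =", a, "+", b, "* x")
print("slope computed directly from the data:", b_direct)
print("slope after standardizing x and y:", b_std)
```

The two slopes should agree up to rounding, and the standardized slope should match r.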
r^2 ("Coefficient of Determination") = Proportion of variability in
y-values explained/predicted by knowing x and using the least squares
regression line. IPS pp. 141-3. Written "R-Square" in SPSS graphs.
R-Squared (Math251\RegressionDemosExcel) (Further explanation of r^2)
r^2 is the square of the correlation coefficient r! (The -/+ sign gets
lost.)
If r = .7, about half (.49) of the variability in the y's is explained
by using the regression line relationship to predict y from x. (If
weight and height have a correlation of .7, then about half of the
variability in weight can be explained by knowing height.)
The formula Moore gives, p. 142, is the "same" as the formula often
used (divide top & bottom by n-1):
r^2 = (variance of predicted values yhat) / (variance of observed values y)
    = [Sum of explained squared variation/(n-1)]
      / [Sum of observed (total) squared variation/(n-1)]
(Unaccounted-for variability = (1 - r^2) = variance of residuals /
total variance of observed y's)
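The variance version of r^2 can be checked with a short Python sketch. This is an illustration only: the data are made up, and the variable names are mine, not Moore's or SPSS's. It fits the least-squares line, then compares the variance ratios with r^2 computed from the correlation.

```python
from statistics import mean, variance

# Made-up illustration data (NOT from the course files).
x = [60, 62, 65, 68, 70, 72]         # inches
y = [115, 120, 140, 155, 160, 180]   # pounds

xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Least-squares line yhat = a + b*x, and the correlation r.
b = sxy / sxx
a = ybar - b * xbar
r = sxy / (sxx * syy) ** 0.5

yhat = [a + b * xi for xi in x]                 # predicted values
resid = [yi - yh for yi, yh in zip(y, yhat)]    # residuals yi - yhat_i

# statistics.variance uses the n - 1 divisor, matching the formula above.
explained = variance(yhat) / variance(y)
unexplained = variance(resid) / variance(y)

print("r^2 from the correlation:", r ** 2)
print("variance of yhat / variance of y:", explained)
print("variance of residuals / variance of y:", unexplained)
```

The explained and unexplained proportions should add to 1, up to rounding.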
This page belongs to Sally Sievers who is solely
responsible
for its content. Please see our statement
of responsibility.