MATH 251, Probability and Statistics I, Fall 2005, Friday Sept. 16, Day 10

Reading: finish 2.3, read 2.4, Cautions/residuals/influentials (I'll demonstrate graphing residuals in class.  Focus on uses tonight.)
Hand in: 
Problems A and B  below
finish 2.42 a, b, c next time basketball NO SPSS
finish 2.47a, b, c next time social distress (SPSS)
2.54 (SPSS) Better predictor of GPA?
2.53 (SPSS) metabolic rate  Also, Make 2 graphs, each with one of the two regression lines
2.55 h&w heights to formula
2.57 icicles in inches. 
Look in the back of the book for the answers to part a, use them to do parts b and c.
2.58 Julie's exam (formula and R2)
2.59 attendance and grades
p. 169, 2.79 (This is a continuous-data version of "Simpson's Paradox", p. 590)
Read, be able to discuss 
2.81 heart attacks  Make a rule of thumb for choosing a hospital for your heart attack (As if one had a choice--closer is better, and most people don't get to decide)
p. 181, 2.97 habitat diversity
p. 183 2.103 heating deg. days, solar
Optional
A. (Not hard) If you know the means, standard deviations, and r for a pair of variables, you can calculate the equation of the regression line  yhat = a + bx.  Memorizing 2 facts is enough: "b = r (sY/sX)" (= the correlation coefficient readjusted into "raw" units), and "the pair of means (xbar, ybar) lies on the line".  Show that these are enough; that is, show how to get the formula for a if you know these facts.  (#2.56 is the same problem "backwards".)


B.  The least-squares best-fit line is the line yhat = a + bx that minimizes the sum of the squared residuals (a residual is the vertical distance from a yi point to the line).  Two things can vary: the slope b, and how high the line sits on the page (given by a, the intercept). (The calculus to get the formula requires partial derivatives--Calculus III(?).) Here's a simpler case:

You might ask (I know, you wouldn't--but you should...) what is the best single point w to describe all the y-values, using the criterion that the sum of the squared distances of the yi values from w is the smallest possible? (Another way of thinking of this, in the scatterplot setting: what horizontal line best summarizes all the y's, if we can't use the x-information?)
Find w:  That is, find the w that makes f(w) = Sum (yi - w)^2 a minimum (I can't make sigmas here: "Sum" = big sigma, sum from i = 1 to n).  (How?  Find the derivative f'(w) and set it equal to 0.)
If you aren't comfortable with big sigma sums, let n = 3:  f(w) = (y1 - w)^2  +  (y2 - w)^2  +  (y3 - w)^2
(You should get a small "aha" experience, especially if you haven't read p. 51 really carefully.)
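If you want to check your calculus answer for problem B numerically, here is a minimal sketch. The three y-values are made up for illustration, and the helper names f and grid_minimizer are mine, not from the book. It just tries many candidate values of w and keeps the one with the smallest sum of squares; it is a brute-force check, not a substitute for the derivative.

```python
# Hypothetical y-values, chosen only for illustration (n = 3, as in the hint).
ys = [2.0, 3.0, 7.0]

def f(w, ys):
    """Sum of squared distances from each y-value to the single point w."""
    return sum((y - w) ** 2 for y in ys)

def grid_minimizer(ys, step=0.01):
    """Try w on a fine grid between min(ys) and max(ys); return the best w."""
    lo, hi = min(ys), max(ys)
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + k * step for k in range(n_steps + 1)]
    return min(candidates, key=lambda w: f(w, ys))

best_w = grid_minimizer(ys)
print("w that minimizes f(w) on the grid:", round(best_w, 2))
print("f at that w:", round(f(best_w, ys), 4))
```

Run it, then compare the w it reports with the summary statistics of the y's that you already know how to compute. Change the y-values and see whether the pattern holds.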


HW questions?
Linear regression, cont.

--Vertical Distance from point to regression line: "Error" = "Residual" = "Deviation" = (yi - yhati)
   The regression line minimizes the "Sum of Squared Errors",  the "Sum of squared deviations", "Sum of squared residuals."
See ResidualsRegressionLeastSquares  (or in Math251\RegressionDemosExcel)  Govsal-deviations.spo (Math251\SPSSforClass, output file)
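The "minimizes the sum of squared errors" claim can also be checked by hand on a small data set. The sketch below uses made-up heights and weights (not the class data), fits the least-squares line, lists the residuals (yi - yhati), and then nudges the intercept and slope a little to see what happens to the sum of squared residuals. The function names are my own choices.

```python
from statistics import mean

# Hypothetical data (height in inches, weight in pounds), for illustration only.
xs = [60.0, 62.0, 65.0, 68.0, 70.0, 72.0]
ys = [115.0, 120.0, 140.0, 150.0, 165.0, 170.0]

def least_squares_line(xs, ys):
    """Return intercept a and slope b of the least-squares line yhat = a + b*x."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared residuals for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = least_squares_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
best = sse(a, b, xs, ys)
print("residuals:", [round(e, 2) for e in residuals])
print("sum of squared residuals:", round(best, 2))

# Nudge the intercept or the slope: each nudged line has a larger sum of squares.
nudges = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.1), (0.0, -0.1)]
increases = [sse(a + da, b + db, xs, ys) - best for da, db in nudges]
print("increase in sum of squares for each nudge:", [round(d, 2) for d in increases])
```

Notice also that the residuals themselves (not squared) add up to zero, apart from rounding.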

--"Regressing weight ON height":  Height on the x axis, predicting weight from height.

--Unless the data lie perfectly on a straight line, the line for predicting weight from height ("regressing weight on height"), for example, will NOT be the same line as that for predicting height from weight ("regressing height on weight"), because you are measuring the deviations from the line in different directions (vertically for one, horizontally for the other)! (In-class demonstration) (The picture on p. 140 is about this.)
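Here is a rough way to see the two different lines in numbers rather than in a picture. The data are hypothetical (the same kind of made-up heights and weights as you might type into SPSS), and slope_intercept is just a helper name I chose. The idea: fit weight on height, fit height on weight, then rewrite the second line on the same axes (height horizontal, weight vertical) so the two slopes can be compared.

```python
from statistics import mean

# Hypothetical data (height in inches, weight in pounds), for illustration only.
heights = [60.0, 62.0, 65.0, 68.0, 70.0, 72.0]
weights = [115.0, 120.0, 140.0, 150.0, 165.0, 170.0]

def slope_intercept(xs, ys):
    """Least-squares line for predicting ys from xs: returns (a, b) in yhat = a + b*x."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    return ybar - b * xbar, b

# Regressing weight ON height: weight-hat = a_wh + b_wh * height
a_wh, b_wh = slope_intercept(heights, weights)

# Regressing height ON weight: height-hat = a_hw + b_hw * weight
a_hw, b_hw = slope_intercept(weights, heights)

# Redraw the second line with height on the horizontal axis:
# weight = (height - a_hw) / b_hw, so its slope on those axes is 1 / b_hw.
slope_hw_same_axes = 1.0 / b_hw

print("slope of weight-on-height line (lb per inch):", round(b_wh, 3))
print("slope of height-on-weight line, same axes   :", round(slope_hw_same_axes, 3))
```

The two slopes agree only when the points fall exactly on a line. Both lines still pass through the point of means (mean height, mean weight).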

Formulas for computing regression line: IPS 137-8 (from data, no computer?  Find an old textbook...)

  1. A change of one standard deviation in x corresponds to a change of r standard deviations in y, along the regression line.  RegressionSlope

  2.  The slope b expresses change in y-units per x-unit. (Suppose x is inches, y is pounds. Then b is in pounds per inch.) You can find b by multiplying r by the standard deviation of the y's (that's in pounds)  and dividing by the standard deviation of the x's (that's in inches)
    In "algebra", b = r times (s.d. of y)/(s.d. of x)  (Equation p. 137)
           If we standardize both the x-values and the y-values, the slope will just = r !
                         govsalstd.sav govsalstd.spo . (In Math251\SPSS for Class)
     
  3. The regression line goes through the point given by the two means, (xbar, ybar). http://www.whfreeman.com/ips
  4. If you know this, you know ybar = a + b (xbar).  Solve for a (problem A):
    --a = ybar - b (xbar). (Other equation, p. 137)
    --So knowing 2 and 3 gives you the equation of the line from the means, s.d.'s, and r.
    --And if you draw the two lines, y on x and x on y, they will intersect at (xbar, ybar)
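The facts in the list above can be tried out on a small data set. This is only a sketch with made-up numbers: it computes r, the means, and the standard deviations, builds the line from b = r (sY/sX) and a = ybar - b (xbar), and compares that slope with the slope computed directly from the least-squares formula. It then standardizes both variables and checks that the slope becomes r. The helper names (correlation, direct_slope) are mine.

```python
from statistics import mean, stdev

# Hypothetical data (x in inches, y in pounds), for illustration only.
xs = [60.0, 62.0, 65.0, 68.0, 70.0, 72.0]
ys = [115.0, 120.0, 140.0, 150.0, 165.0, 170.0]

def correlation(xs, ys):
    """r = sum of products of standardized x and y values, divided by n - 1."""
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)) / (n - 1)

def direct_slope(xs, ys):
    """Least-squares slope computed straight from the data."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

xbar, ybar = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)
r = correlation(xs, ys)

# Fact 2: slope from r and the two standard deviations (pounds per inch).
b = r * sy / sx
# Facts 3 and 4: the line goes through (xbar, ybar), so solve for a.
a = ybar - b * xbar
print("from summary statistics: yhat =", round(a, 2), "+", round(b, 3), "x")
print("slope computed directly from the data:", round(direct_slope(xs, ys), 3))

# Standardize both variables: the slope of the new line is just r.
zx = [(x - xbar) / sx for x in xs]
zy = [(y - ybar) / sy for y in ys]
b_std = direct_slope(zx, zy)
print("r =", round(r, 4), " slope after standardizing =", round(b_std, 4))
```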
r^2 ("Coefficient of Determination") = proportion of variability in y-values explained/predicted by knowing x and using the least squares regression line. IPS pp. 141-3.  Written "R-Square" in SPSS graphs.
       R-Squared  (in Math251\RegressionDemosExcel)    (Further explanation of r^2)
r^2 is the square of the correlation coefficient r!  (The -/+ sign gets lost.) If r = .7, about half (.49) of the variability in the y's is explained by using the regression line relationship to predict y from x. (If weight and height have a correlation of .7, then about half of the variability in weight can be explained by knowing height.)

The formula Moore gives, p. 142, is the "same" as the formula often used (divide top & bottom by n-1):

  r^2 = (variance of predicted values yhat) / (variance of observed values y)
      = [Sum of explained squared variation/(n-1)] / [Sum of observed (total) squared variation/(n-1)]

(Unaccounted-for variability = (1 - r^2) = variance of residuals / total variance of observed y's)
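To see these r^2 relationships with actual numbers, here is a short sketch on made-up data (not from IPS). It fits the least-squares line, then computes the variance of the predicted values, the variance of the observed values, and the variance of the residuals, and compares the ratios with r squared. The variable names are my own.

```python
from statistics import mean, stdev, variance

# Hypothetical data, for illustration only.
xs = [60.0, 62.0, 65.0, 68.0, 70.0, 72.0]
ys = [115.0, 120.0, 140.0, 150.0, 165.0, 170.0]

xbar, ybar = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)
n = len(xs)

# Correlation coefficient r, then the regression line yhat = a + b*x.
r = sum(((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)) / (n - 1)
b = r * sy / sx
a = ybar - b * xbar

yhat = [a + b * x for x in xs]                 # predicted values
resid = [y - yh for y, yh in zip(ys, yhat)]    # residuals (yi - yhati)

explained = variance(yhat) / variance(ys)      # variance of yhat / variance of y
unexplained = variance(resid) / variance(ys)   # variance of residuals / variance of y

print("r squared                :", round(r ** 2, 4))
print("var(yhat) / var(y)       :", round(explained, 4))
print("var(residuals) / var(y)  :", round(unexplained, 4))
print("explained + unexplained  :", round(explained + unexplained, 4))
```

The first two printed numbers should match, and the last line should come out to 1, apart from rounding.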


Sievers home  Math251-Fall05/Dayps10.htm    9pm    9/15/05
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility.