MATH 251, P&S I, Fall 2007, Monday Sept. 17, Day 11. HW comments added; hit reload.

Reading: finish 2.3, read 2.4, Cautions/residuals/influentials (I'll demonstrate graphing residuals in class.  Focus on uses tonight.)
Hand in: 
Problems A and B  below
finish 2.42 a, b, c next time basketball NO SPSS
finish 2.47a, b, c next time social distress (SPSS)
2.54 (SPSS) Better predictor of GPA?
2.53 (SPSS) metabolic rate  Also, Make 2 graphs, each with one of the two regression lines
2.55 h&w heights to formula
2.57 icicles in inches. 
Look in the back of the book for the answers to part a, use them to do parts b and c.
2.58 Julie's exam (formula and R^2)
2.59 attendance and grades
p. 169, 2.79 (This is a continuous-data version of "Simpson's Paradox", p. 590)
Read, be able to discuss 
2.81 heart attacks  Make a rule of thumb for choosing a hospital for your heart attack (As if one had a choice--closer is better, and most people don't get to decide) 
p. 181, 2.97 habitat diversity
p. 183 2.103 heating deg. days, solar
Optional
A. (Not hard) If you know the means, standard deviations, and r for a pair of variables, you can calculate the equation of the regression line  yhat = a + bx.  Memorizing 2 facts is enough: "b = r (sY/sX)" (= the correlation coefficient readjusted into "raw" units),  and "the pair of means (xbar, ybar) lies on the line".  Show that these are enough; that is, show how to get the formula for  a  if you know these facts.  (#2.56 is the same problem "backwards".)


B.  The least-squares best fit line is the line yhat = a + bx that minimizes the sum of the squared residuals (a residual is the vertical distance from a point yi to the line).  Two things can vary--the slope b, and how high the line sits on the page (given by a, the intercept). (The calculus to get the formula requires partial derivatives--Calculus III(?).) Here's a simpler case:

You might ask (I know, you wouldn't--but you should...) what is the best single point w to describe all the y-values, using this criterion: "The sum of the squared distances of the yi values from w is the smallest possible"? (Another way of thinking of this, in the scatterplot setting: what horizontal line best summarizes all the y's, if we can't use the x-information?)
Find w:  That is, find the w that makes f(w) = Sum (yi - w)^2 the minimum (I can't make sigmas here: "Sum" = Big sigma, sum from i = 1 to n).  (How?  Find the derivative f'(w), set it equal to 0.)
If you aren't comfortable with big sigma sums, let n = 3:  f(w) = (y1 - w)^2 + (y2 - w)^2 + (y3 - w)^2
(You should get a small "aha" experience, especially if you haven't read p. 51 really carefully.)
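
If you'd like to poke at Problem B numerically before doing the calculus, here is a minimal Python sketch. The three y-values are made up for illustration (they are not from the text), and the function name sum_sq_dist is just a name chosen here. It only evaluates f(w) for a few candidate values of w so you can watch the sum go down and back up; the derivative is still your job.

```python
def sum_sq_dist(ys, w):
    """f(w) = sum of the squared distances of the y-values from w."""
    return sum((y - w) ** 2 for y in ys)

# Made-up y-values, for illustration only (n = 3, as in the hint above).
ys = [2, 4, 9]

# Try several candidate values of w and compare f(w).
for w in [3, 4, 5, 6, 7]:
    print("w =", w, " f(w) =", sum_sq_dist(ys, w))
```

Try your own y-values and a finer set of candidates for w, then compare what you see with what the derivative tells you.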


Quizzes returned.  Mostly good.  But most people reversed direction on "at least"!  Think about it!
Missed quiz?
  I expect advance notice if you need to miss, and an arrangement to make it up promptly.  If you can't know ahead of time (sudden illness or emergency), I expect to hear as soon as possible after the fact.  Makeup may be possible, though points may be "docked".  It is your responsibility to initiate this process.
HW questions?
Comments, Day 7:  People still forget to check Measure: (F)

C:  Almost J shaped, "0" is most frequent value(?)
What's with the weird gaps?  They are an artifact of the choice of histogram bin widths.  The bins are a little narrower than 1 unit wide, and the data are whole numbers, so every so often (about every 5) a bin catches no whole number and shows up as a gap.

Note that all the digits actually have data, as you can see by looking at the bar graph.


Linear regression, cont.
--Vertical Distance from point to regression line: "Error" = "Residual" = "Deviation" = (yi - yhati)
   The regression line minimizes the "Sum of Squared Errors",  the "Sum of squared deviations", "Sum of squared residuals."
See ResidualsRegressionLeastSquares  (or in Math251-IPS5e\RegressionDemosExcel),  Govsal-deviations.doc (in Word),  Govsal-deviations.spo (Math251-IPS5e\SPSSforClass, output file)
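
A minimal Python sketch of the "sum of squared residuals" idea, in case you want to try it away from SPSS. The data are made up, and the helper names (least_squares_line, sse) are chosen here for illustration. The slope is computed as Sum (xi - xbar)(yi - ybar) / Sum (xi - xbar)^2, which is algebraically the same as b = r times (s.d. of y)/(s.d. of x). The loop then nudges the line a little and checks that the sum of squared residuals only gets bigger.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    a = ybar - b * xbar  # the line passes through (xbar, ybar)
    return a, b

def sse(xs, ys, a, b):
    """Sum of squared residuals (yi - yhat_i)^2 for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
best = sse(xs, ys, a, b)
print("a =", a, " b =", b, " SSE =", best)

# Nudge the intercept or the slope: every nudged line should do worse.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]
worse = [sse(xs, ys, a + da, b + db) > best for da, db in nudges]
print("every nudged line has larger SSE:", all(worse))
```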

--"Regressing weight ON height":  Height on the x axis, predicting weight from height.

--Unless the data lie perfectly on a straight line, the line for predicting weight from height ("regressing weight on height") will NOT be the same as the line for predicting height from weight ("regressing height on weight"). That's because you are measuring the deviations from the line in different directions! With height on the x axis, the first line minimizes vertical deviations and the second minimizes horizontal ones. (In-class demonstration.) (The picture on p. 140 is about this.)
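
Here is a small Python sketch of the two-lines point, using made-up heights and weights (not class data) and a helper named summary that is just a name chosen here. It fits both regressions from the means, s.d.'s, and r, then compares the two lines on the same axes (height horizontal, weight vertical). On those axes the height-on-weight line has slope 1/b2, and that equals b1 only when r is +1 or -1.

```python
from math import sqrt

def summary(xs, ys):
    """Return xbar, ybar, sx, sy, r (sample s.d.'s, dividing by n-1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    return xbar, ybar, sx, sy, r

# Made-up data, for illustration only.
heights = [60, 62, 65, 68, 70]       # inches
weights = [115, 120, 140, 150, 175]  # pounds

xbar, ybar, sx, sy, r = summary(heights, weights)

# Regressing weight ON height: weight_hat = a1 + b1 * height
b1 = r * sy / sx
a1 = ybar - b1 * xbar

# Regressing height ON weight: height_hat = a2 + b2 * weight
b2 = r * sx / sy
a2 = xbar - b2 * ybar

print("slope of weight-on-height line:", b1)
print("slope of height-on-weight line on the same axes:", 1 / b2)
print("b1 * b2 =", b1 * b2, " r^2 =", r ** 2)
```

The product of the two slopes comes out to r^2, which is another way to see that the lines coincide only for a perfect correlation.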

Formulas for computing regression line: IPS 137-8 (from data, no computer?  Find an old textbook...)

  1. A change of one standard deviation in x corresponds to a change of r standard deviations in y, along the regression line.  RegressionSlope

  2.  The slope b expresses change in y-units per x-unit. (Suppose x is inches, y is pounds. Then b is in pounds per inch.) You can find b by multiplying r by the standard deviation of the y's (that's in pounds)  and dividing by the standard deviation of the x's (that's in inches)
    In "algebra", b = r times (s.d. of y)/(s.d. of x)  (Equation p. 137)
        If we standardize both the x-values and the y-values, the slope will just = r !       
             Govsalstd2.doc  govsalstd.sav govsalstd.spo . (In Math251-IPS5e\SPSS for Class)
     
  3. The regression line goes through the point given by the two means, (xbar, ybar). Applet: http://www.whfreeman.com/ips5e
  4. If you know this, you know ybar = a + b (xbar).  (Solve for a: problem A.)
    --a = ybar - b (xbar). (Other equation, p. 137)
    --So knowing 2 and 3 gives you the equation of the line from the means, s.d.'s, and r.
    --And if you draw the two lines, y on x and x on y, they will intersect at (xbar, ybar)
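
The remark in point 2 that standardizing both variables makes the slope equal r can be checked with a short Python sketch. The data are made up, and all the helper names are chosen here for illustration; nothing below comes from the govsal files.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation (divide by n-1)."""
    m = mean(v)
    return sqrt(sum((t - m) ** 2 for t in v) / (len(v) - 1))

def standardize(v):
    """Convert values to z-scores: (value - mean) / s.d."""
    m, s = mean(v), sd(v)
    return [(t - m) / s for t in v]

def slope(xs, ys):
    """Least-squares slope for predicting y from x."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

def corr(xs, ys):
    """Correlation r as the 'average' product of z-scores (divide by n-1)."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(p * q for p, q in zip(zx, zy)) / (len(xs) - 1)

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = corr(xs, ys)
b_raw = slope(xs, ys)                            # in y-units per x-unit
b_std = slope(standardize(xs), standardize(ys))  # in s.d.'s per s.d.

print("r =", r)
print("raw slope =", b_raw, " r * sy / sx =", r * sd(ys) / sd(xs))
print("slope after standardizing =", b_std)
```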
The line formula yhat = a + bx  from xbar, ybar, sx , sy , r:
     Find b:   b = r  sy / sx
       Find a:  Solve  ybar = a + b xbar for a:  a = ybar - b xbar
           Example.  xbar = 5,   ybar = 8,  sx = 10, sy = 6,   r = -.3: 
         b = -.3×6/10 = -0.18,   8 = a + (-0.18)×5 = a - 0.90,     a = 8.90,       yhat = 8.90 - 0.18x
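
The two-step recipe above is easy to put into a few lines of Python if you want to check your hand arithmetic. This is only a sketch: the function name line_from_summary is made up here, and the inputs are the numbers from the example.

```python
def line_from_summary(xbar, ybar, sx, sy, r):
    """Return (a, b) for yhat = a + b*x from the means, s.d.'s, and r."""
    b = r * sy / sx      # slope: r readjusted into raw units
    a = ybar - b * xbar  # intercept: the line passes through (xbar, ybar)
    return a, b

# Numbers from the example above.
a, b = line_from_summary(xbar=5, ybar=8, sx=10, sy=6, r=-0.3)
print(f"b = {b:.2f}, a = {a:.2f}")

# Sanity check: plugging in xbar should give back ybar.
print("yhat at xbar:", a + b * 5)
```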
r^2 ("Coefficient of Determination") = Proportion of variability in y-values explained/predicted by knowing x and using the least squares regression line. IPS pp. 141-3.   Written R-Square in SPSS graphs.
       R-Squared  (Math251-IPS5e\RegressionDemosExcel)    (Further explanation of r^2)
r^2 is the square of the correlation coefficient r!  (The sign, - or +, gets lost.) If r = .7, about half (.49) of the variability in the y's is explained by using the regression line relationship to predict y from x. (If weight and height have a correlation of .7, then about half of the variability in weight can be explained by knowing height.)

The formula Moore gives, p. 142, is the "same" as the formula often used (divide top & bottom by n-1):
  Moore:        r^2 = (variance of predicted values yhat) / (variance of observed values y)
  Often used:   r^2 = (Sum of explained squared variation) / (Sum of observed (total) squared variation)
  Dividing the top and bottom of the second by (n-1) turns each sum into a variance, which gives the first.
(Unaccounted-for variability = (1 - r^2) = variance of residuals / total variance of observed y's)
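
To see these r^2 relationships with actual numbers, here is a minimal Python sketch on made-up data (the data and the helper names are chosen here for illustration, not taken from the book or the class files). It computes r^2 three ways: by squaring r, as variance of the predicted values over variance of the observed values, and as 1 minus the residual share.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def var(v):
    """Sample variance (divide by n-1)."""
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / (len(v) - 1)

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

xbar, ybar = mean(xs), mean(ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)

r = sxy / sqrt(sxx * syy)  # correlation coefficient
b = sxy / sxx              # least-squares slope
a = ybar - b * xbar        # least-squares intercept

yhat = [a + b * x for x in xs]                 # predicted values
resid = [y - yh for y, yh in zip(ys, yhat)]    # residuals

r2_from_r = r ** 2
r2_from_var = var(yhat) / var(ys)    # explained share of the variance
unexplained = var(resid) / var(ys)   # should equal 1 - r^2

print("r^2 by squaring r:          ", r2_from_r)
print("var(yhat) / var(y):         ", r2_from_var)
print("var(residuals) / var(y):    ", unexplained)
```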

NOTE:  The standard deviation doesn't say anything about the distance of any individual point from the mean; it's only about a kind of "average" variability.  Similarly, r^2 doesn't say anything about how well the line fits any particular (x,y) pair; it's only about a kind of "average" goodness of the explanatory power of the line for the data.


Sievers home  Math251-Fall07/Day2s11.htm    9pm    9/16/07
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility.