Math 151, Spring '09, Mon. Day 16, Mar. 2.  Hit reload.  After class, corrected.

Reading:  Ch. 5, Regression, reread thru p. 125.  (Check p. 137:  5.14 through 5.20, basic line and regression line facts and tools.  5.21: r and slope signs.  5.22 is harder (changing units); don't worry about it.  5.23: If you sketch the graph and draw a line thru the points, you should be able to guesstimate the slope well enough to choose among the 3 answers.)   Now, the equation of the least-squares line (p. 120) & Facts 2 & 3.  Read p. 132 for Extrapolation.  Next, Fact 1, p. 123.  And, continuing regression, pp. 126-137.  Ch. 7, review: for this test, E, F, G (except for #8: calculate 1 residual, yes; residual plot, no), H (#4 we'll focus on with Ch. 8 & 9).

Bring questions for exam.
Hand in: Note, some problems have been rearranged, near the "Postpone" line.
More Regression 
Line formula & Facts:
p. 122, 5.3b only: verify formula.  Find the means, s.d.'s and r in the answers in the back of the book, and use them to calculate a and b and write the formula for the regression line.
p. 141, 5.30 husbands and wives  (Note, you have to find the equation of the line to draw the graph, tho it doesn't explicitly tell you to...)

p. 125, 5.5 (SPSS. Let SPSS find the regression line. Get the mean yield and mean planting rate too--you need it for part c) Corn again, straight line is a "bad fit." My book has a misprint in (c).  Should be "when xbar is the mean planting rate".

. . . . . . . . . . .
pp. 143-4, 5.35 (SPSS) Drilling into the past, silicon (one clear outlier).  To get the r with and without the outlier, you can just find r, then delete the outlier, then find r again.  Or you can do this (useful also for 5.37, see below): Make a new variable and put 1's in every case but the outlier; give the outlier 0.  Then, in the Data Editor (First SPSS Handout, p. 5 bottom), do Data>Select Cases: choose your new variable as the Filter variable.  The outlier will be excluded till you return to Select Cases and select All cases.  5.35 covers material on Exam 2.  5.37 (effect of outlier on regression line) won't be on the exam, but you may want to get the graph for it while you're doing 5.35.  See details just below "Postpone."

p. 179, 7.28, 29, 30 (SPSS) Soap in the shower.  Also, look carefully at the graph and guess why there is no data after day 21.  (Read p. 132 for the word to describe using the line for day 30, and a discussion of the issue.)

Postpone the rest.
pp. 143-4, 5.37 (SPSS) Drilling into the past, silicon (one clear outlier) To graph the lines with and without the outlier on the same graph, make a new variable and put 1's in every case but the outlier--give the outlier 0.  Then use this variable as your Set Markers By  or your  Panel By variable. Use Fit Line at Subgroups. You'll also get a "nuisance" horizontal line at the outlier; ignore it.  To get the formula for the line without the outlier, use your new 0-1 variable as the Selection Variable (see p. 4 of Scatterplot Handout for details.)

Cautions
p. 136 5.13 hospitals: big = bad?

Residuals
SPSS Handout p. 3 (Governors' salaries):  You can now finish #12, the last question.  Hand it all  in Next time.
A.  Use Residuals07.xls or Residuals.xls from the website or the lab to graph these data sets, along with a graph of the residuals.  Print the results, and describe the shape of the residuals (it may help to connect the dots with pencil, to see the pattern.) 
a)  x 1 2 8 4 6 9 
    y 1 3 6 6 7 5 
b) x 1 2 7 4 6 9
   y 7 6 2 4 2 1

p. 129, 5.7 (SPSS) Does fast driving waste fuel? Residuals.  There is a data file for problem 5.7, and its third column is the residuals.  Do all the parts.
Also with 5.7: in SPSS, make a variable containing the residuals (Handout, top of p. 3; also see the middle-bottom of this page).  The values should match the ones in the book/SPSS file.

p.133, 5.9 Farm population Do a, b, c (read p. 132 for a good word to use in part c).  Also, make a variable containing the residuals, and plot it against the x (year) values.  Draw (in pencil) a horizontal line at height 0.  What pattern do you see in the residuals?

Read, to discuss

A. Practice Calculating line formula, following up class work, notes below. Highlight space after question to see worked solution.

B. Look at Excel spreadsheet RegressionSlope07, especially with reference to the r standard deviations in y for every 1 standard deviation in x:
Change x-y values in the yellow boxes and watch the line change.  Change x-values in col. F and watch the "run" (red line) change, in the rightmost 2 graphs.  Notice the slope = the coefficient of x = the rise/run = increase in y per unit increase in x.  Fix it so the increase in x (the "run") is exactly 1.   Also, look at the leftmost graph, where the lengths of the standard deviations are shown, and note that in standard-deviation units, the rise is r s.d.'s in y for each s.d. run in x.

Optional

Added:
p. 179, 7.27 (review Normal)

 

= = = = = = = = = = = = = = = = = = = = = =
Exam 2 this Friday: Day 18 (March 6).  Let me know Right Away if you can't take the exam Friday.  Starts with Ch. 3, Normal distribution, tables.  Thru Ch. 4, and what we cover of Ch. 5 (& 7) through today.
Sample exam handed out today.  Solutions.   Normal probability practice.  Beginning of class, you can do everything except #7a; 6--second d and e. ("Word" for second f, mentioned in class last time. Read p. 132.)
(All questions on the sample exam will be covered?) As of end of class, you can now do #7a also. 6--second d and e will NOT be on exam 2.
One sheet of notes: I will give you paper copies of the Normal table.

Are you having trouble seeing which variable goes on the x axis?  If there is any sense that one is the cause of the other, or can/will be used to predict or estimate the other,   that's the explanatory (x) variable.  The other one is the response (y) variable.  (Sometimes you can choose the x-values and see the response for that x, in the corresponding y:  like the corn plant density problem (It's an experiment, Ch.9.)  Sometimes you can only observe.)  Language: Regress  heating oil used ON temperature:  Temperature = x = horizontal, Heating oil = y = vertical.

HW questions? Regression  
Day 15
   Leftover:  Timeplots:  are scatterplots, where the x axis shows time. (Time is often a lurking variable: plot data against order of taking observations) "Trend" in timeplot ="slope" in usual scatterplot.
- - - - - - - - - - -
Regression line: Ch. 5, "Regressing y ON x".  Predicts or estimates a y (vertical) value for a given x (horizontal) value:   Straight line!

Experimenting  http://www.whfreeman.com/bps4e,  Correlation and Regression Applet.
SPSS--back of handout.  Govsal on avgpay

Formula yhat = a + b x.    Govsal = a + b avgpay   Govsal = 28,569.69 + 2.709*avgpay 
         Calculating:  Montana (17,895, 55,502)   Govsal = 28,569.69 + 2.709*avgpay
           Predicted Govsal = 28,569.69 + 2.709*17,895 = 28,569.69 + 48,477.56 = 77,047.25 (higher than actual)

 a is y-intercept.  b is slope:  If x increases one unit, yhat increases b units.
            Governors' salaries increase (on the average across the states) $2.71 for every increase of $1 of average pay.

 (In a straight-line relationship, the amount that y increases for one unit increase in x is the same no matter what value of x you start with)  RegressionSlope.xls or in ClassMaterial\Math151-BPS4e \RegressionDemos Excel BPS4e
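The Montana calculation above can be sketched in a few lines of Python. The intercept and slope are the values quoted in these notes; the function name predict_govsal is only an illustrative label, not something SPSS provides.

```python
# Coefficients quoted in the notes for the regression of govsal on avgpay.
A = 28569.69   # intercept a (dollars)
B = 2.709      # slope b (dollars of governor's salary per dollar of average pay)

def predict_govsal(avgpay):
    """Return yhat = a + b*x (illustrative helper, not an SPSS function)."""
    return A + B * avgpay

montana_pred = predict_govsal(17895)
print(f"Predicted Govsal for Montana: {montana_pred:,.2f}")

# In a straight-line relationship, a one-unit increase in x changes yhat by b,
# no matter which x you start from.
step_low = predict_govsal(17896) - predict_govsal(17895)
step_high = predict_govsal(30001) - predict_govsal(30000)
print(round(step_low, 3), round(step_high, 3))
```

The two printed steps should agree with the slope, which is the point of the parenthetical remark about straight-line relationships.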

r^2 ("Coefficient of Determination") = fraction of the variation in y-values explained/predicted by knowing x and using the least squares regression line.   (Fact 4)

HW:  Income depends on height?!
    What is "$789", and what kind of analysis did they do?  Regression.  $789 is the slope of the regression of Pay on Height.  Less than 15% of the variability in Pay is explainable by (regression on) height.
5.42 p. 146, a computer game, revisited.
  Can it really be that only about 9% of the variability in speed of the right hand is accounted for by the distance?  The eye is fooled by the graph, with the right hand data squashed down at the bottom and looking really linear.  Here is the right hand by itself. 

We all get the same line from a batch of data because we use the "least-squares best fit" criterion (p. 119): we'll investigate this more closely later.

Facts:  1, 2 lite, 3 first.  Then 4.   Then 2 & formulas p. 120, from 2 & 3.

Facts again (Moore pp. 123-125)

  1. Which is explanatory, which is response, is crucial for regression!  The Regression line is trying to predict the "average y" for a given x (with the added requirement that it is a straight line).  See "residual"(deviation) lines for govsal on avgpay.
    Unless the data lies perfectly on a straight line, the line for predicting weight from height -- "regressing weight on height" --(for example) will NOT be the same line as that for predicting height from weight--"regressing height on weight".  (In-class demonstration, on overhead projector Soon.) (Example 5.3, Fig. 5.4 pp.123-4 is about this. )
     
  2. Lite:  The correlation coefficient r and the slope b of the regression line have the same sign!  + or - .
       Negative/positive:  trend=slope ~association~correlation
    Heavy: A change of one standard deviation in x corresponds to a change of r standard deviations in y, along the regression line.  We'll return to this today.

  3. The regression line goes through the point given by the two means, (xbar, ybar)
    Applet
    http://www.whfreeman.com/bps4e
    We'll return to this today.

  4. r^2 ("Coefficient of Determination") = fraction of the variation in y-values explained/predicted by knowing x and using the least squares regression line.  SPSS writes "R Square", or "R Sq Linear".  (Exactly what that means mathematically is hard.  Just get used to it as a measurement.)
    Closer to 0, more scatter around the line. Closer to 1, tighter clustering around the line. R-Squared (or RSquared.xls: ClassMaterial\Math151-BPS4e\RegressionDemosOlderExcel) (Excel07? RSquared07.xls in RegressionDemosExcel07) (Optional:  Further explanation of r^2)
  5.   (Demos yet to come.)
    r^2 is the square of the correlation coefficient r!  (The -, + sign gets lost.)
    If r = .7, about half (.49) of the variation in the y's is explained by using the regression line relationship to predict y from x. (If weight and height have a correlation of .7, then about half of the variability in weight can be explained by knowing height. Or vice versa.)
    NOTE:  The standard deviation doesn't say anything about the distance of any individual point from the mean; it's only about a kind of "average" variability.
    r^2 doesn't say anything about the line and any particular (x,y) pair, just about a kind of "average" goodness of the explanatory power of the line for the data.
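One way to get a feel for Fact 4 is to compute the "fraction of variation explained" directly and compare it with the square of r. The sketch below uses data set (a) from the residuals exercise on this page; the variable names are illustrative only, and this is a hand calculation, not what SPSS does internally.

```python
from math import sqrt

# Data set (a) from the residuals exercise on this page.
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)      # total variation in the y's
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / sqrt(sxx * syy)    # correlation coefficient
b = sxy / sxx                # least-squares slope
a = ybar - b * xbar          # least-squares intercept

# Variation left over after using the line (sum of squared residuals).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
fraction_explained = 1 - sse / syy

print(f"r = {r:.3f}, r squared = {r * r:.3f}, fraction explained = {fraction_explained:.3f}")
```

The last two printed numbers should match, which is what "fraction of the variation explained by the line" means in practice.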

New:  Facts 2 & 3 give line formula!  (Moore pp. 123-125)

2.   A change of one standard deviation in x corresponds to a change of r standard deviations in y, along the regression line.
y = a + bx:  The slope b expresses change in y-units per x-unit. (Suppose x is inches, y is pounds. Then b is in pounds per inch.) You can find b by multiplying r by the standard deviation of the y's (that's in pounds)  and dividing by the standard deviation of the x's (that's in inches)
In "algebra", b = r times (s.d. of y)/(s.d. of x)  (Equation p. 120)
       If we standardize both the x-values and the y-values, the slope will just = r !
        
govsalstd.sav Govsalstd2.doc    RegressionSlope.xls or RegressionSlope07.xls(for Excel07)

3.   The regression line goes through the point given by the two means, (xbar, ybar). http://www.whfreeman.com/bps4e 
--If you know this, you know ybar = a + b (xbar).  You can solve this for a:  a = ybar - b (xbar).  (Other equation, p. 120)
--So knowing 2 and 3 gives you the equation of the line from the means, s.d.'s, and r.
--And if you draw the two lines, y on x and x on y, they will intersect at (xbar, ybar)
[Algebra lovers:  have point (xbar, ybar) and slope b of a line; can write equation]
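A small numerical check of Facts 2 and 3, and of the remark about the two lines, can be sketched as follows. It reuses data set (a) from the residuals exercise on this page; all variable names are illustrative.

```python
from math import sqrt

x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))   # sample s.d. of x
sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))   # sample s.d. of y
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

# Regressing y on x (predict y from x).
b_yx = r * sy / sx
a_yx = ybar - b_yx * xbar

# Regressing x on y (predict x from y): the roles of the variables swap.
b_xy = r * sx / sy
a_xy = xbar - b_xy * ybar

# Both lines pass through (xbar, ybar) ...
y_at_xbar = a_yx + b_yx * xbar
x_at_ybar = a_xy + b_xy * ybar

# ... but drawn on the same axes they have different slopes unless r is +1 or -1.
slope_yx_line = b_yx
slope_xy_line = 1 / b_xy    # the x-on-y line rewritten as "y in terms of x"
print(f"y on x slope: {slope_yx_line:.3f}, x on y slope (same axes): {slope_xy_line:.3f}")
```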
The line formula   yhat = a + bx    from xbar, ybar, sx, sy, r:
     Find b:   b = r sy / sx
                (uses Fact 2:  r is the slope if x and y are standardized.  Equation p. 120)
      Find a:  Solve  ybar = a + b xbar  for a:   a = ybar - b xbar
               (uses Fact 3:  (xbar, ybar) lies on the regression line(s).  Equation p. 120)
 Example.   x is measured in Rangs, y in Zobs
 xbar = 5 Rangs,   ybar = 8 Zobs,    sx = 10 Rangs,  sy = 6 Zobs,   r = -.3:
        b = -.3×6/10 (Zobs/Rang) = -0.18 Zobs/Rang.
        8 = a + (-0.18)×5              8 Zobs = a Zobs + (-0.18)(Zobs/Rang) × 5 Rangs
        8 = a - .90      a = 8.90 Zobs      yhat = 8.90 - 0.18x  Zobs
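The two-step recipe (find b from Fact 2, then a from Fact 3) can be written as a short Python function. The name line_from_summary is made up for this sketch; the numbers are the Rangs/Zobs example from these notes.

```python
def line_from_summary(xbar, ybar, sx, sy, r):
    """Return (a, b) for yhat = a + b*x, using b = r*sy/sx and a = ybar - b*xbar."""
    b = r * sy / sx
    a = ybar - b * xbar
    return a, b

# The Rangs/Zobs example from the notes.
a, b = line_from_summary(xbar=5, ybar=8, sx=10, sy=6, r=-0.3)
print(f"yhat = {a:.2f} + ({b:.2f})x")   # prints: yhat = 8.90 + (-0.18)x
```

The same function can be used to check your own answer to the practice problem: plug in your five summary numbers and compare.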

"A." Try it at desk/home:  xbar = 7 cm,   ybar = 8 oz.    sx = 4 cm,  sy = 10 oz ,   r = .6 (highlite space just below here for solution.)
         b = .6×10/4 (oz/cm) = 1.5 oz/cm.
         8 = a + (1.5)×7             8 oz = a oz + (1.5)(oz/cm) × 7 cm
         8 = a + 10.5      a = 8 - 10.5 = -2.5 oz      yhat = -2.5 + 1.5x  oz


Exam ends here.  Start here Wed.
Least Squares Property, and Residuals

"Residual at x" = (y - yhat) = signed vertical distance between observed y and predicted y (= what's left over after predicting).  Also called "deviation".
    (Positive if observed is bigger than predicted, negative if observed is smaller than predicted.)
Residual:  Look at an individual observed (x,y) data pair.  The residual is the "leftover" amount of y after predicting a y using the line.  Visually, it is the length of the vertical line drawn from the point (x,y) to the regression line (+ if point is above line, - if point is below line).
   Residual = observed y - predicted y    = "prediction error" p. 119
      Calculating:  Montana (17895, 55502)       Govsal = 28,569.69 + 2.709*avgpay
           Predicted Govsal = 28,569.69 + 2.709*17895 = 28,569.69 + 48,477.56 = 77,047.25
           Residual = 55,502.00 - 77,047.25 = -21,545.25,  $21,545 below expected value.
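The Montana residual can be reproduced with a short Python sketch. The intercept and slope are the values quoted in these notes; the function name residual is just an illustrative helper.

```python
A, B = 28569.69, 2.709   # intercept and slope quoted in the notes

def residual(x, y_observed):
    """Residual = observed y - predicted y, with predicted y = A + B*x."""
    return y_observed - (A + B * x)

# Montana: avgpay = 17,895 and actual governor's salary = 55,502.
montana_residual = residual(17895, 55502)
print(f"Montana residual: {montana_residual:,.2f}")
```

A negative result means the point lies below the line: the observed salary is less than the line predicts.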
Least squares principle:  Find the line that minimizes the sum of the squared residuals. (RegressionLeastSqs.xls, or in Mac 101, ClassMaterials\Math151 BPS4e\ RegressionDemosOlderExcel \RegressionLeastSqs.xls, Squares tab for older Excel, or RegressionLeastSqs07.xls in RegressionDemosExcel07 for Excel07)
       This method of finding a "best fit" straight line for predicting y's from x's was derived mathematically to work well with "joint normal" data--elliptical clouds. (Same idea as mean& st.dev.)  For data of this sort, the line does  give the mean of the y's for each given x (at least in the abstract.)
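To see the least squares principle at work without the spreadsheet, here is a rough Python sketch. It uses data set (a) from the residuals exercise on this page, computes the least-squares line, and then nudges the intercept and slope to confirm that every nudged line has a larger sum of squared residuals. The nudge sizes are arbitrary choices for illustration.

```python
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]
n = len(x)

def sse(a, b):
    """Sum of squared residuals for the candidate line yhat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

xbar, ybar = sum(x) / n, sum(y) / n
b_ls = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        / sum((xi - xbar) ** 2 for xi in x))
a_ls = ybar - b_ls * xbar
best = sse(a_ls, b_ls)

# Nudge the intercept and slope a little in every direction; each nudged
# line should have a larger sum of squared residuals than the least-squares line.
nudged = [sse(a_ls + da, b_ls + db)
          for da in (-0.5, 0, 0.5)
          for db in (-0.1, 0, 0.1)
          if (da, db) != (0, 0)]
print(f"least-squares SSE = {best:.3f}, smallest nudged SSE = {min(nudged):.3f}")
```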
Residuals drawn to line:  Govsal-Deviations.doc
Drawback if the data is not the "elliptical cloud" type:
     Outliers get their residual distance squared:  May be very influential in determining slope of line;
             especially if at lowest or highest x-values, may change slope of line a lot.
            Applet, http://bcs.whfreeman.com/BPS4e, ...Correlation&regression.   Play with an outlier.
 (Outliers toward the middle x's may not change the slope, but may affect r and r^2.)
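If the applet is not handy, the same experiment can be tried in Python. The sketch below starts from data set (a) on this page and adds one extra point; both extra points are hypothetical, invented only to illustrate the effect (one at an extreme x, one at the mean x).

```python
from math import sqrt

def slope_and_r(x, y):
    """Return (least-squares slope b, correlation r) for paired data."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    syy = sum((v - ybar) ** 2 for v in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / sqrt(sxx * syy)

x = [1, 2, 8, 4, 6, 9]   # data set (a) from this page
y = [1, 3, 6, 6, 7, 5]

b0, r0 = slope_and_r(x, y)
b_far, r_far = slope_and_r(x + [20], y + [0])   # hypothetical outlier at an extreme x
b_mid, r_mid = slope_and_r(x + [5], y + [15])   # hypothetical outlier at the mean x

print(f"original:          slope {b0:.3f}, r {r0:.3f}")
print(f"outlier at x = 20: slope {b_far:.3f}, r {r_far:.3f}")
print(f"outlier at x = 5:  slope {b_mid:.3f}, r {r_mid:.3f}")
```

With these made-up points, the far-out outlier drags the slope below zero, while the outlier in the middle leaves the slope alone but weakens r.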
~ ~ ~ ~ ~ ~~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Plotting residuals (postpone plotting?):  If you graph residual values against x (or against predicted y's), you eliminate visually the linear portion of the association. (The regression line "becomes" the new x-axis; a "shear" transformation.) Curving or other structure may stand out more visibly.  No structure in residuals = straight line is a "good" fit. (Here or ClassMaterials\Math151 BPS4e\ RegressionDemosExcel BPS4e\Residuals.xls)
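Here is a small Python sketch of why residuals reveal curvature. The data are hypothetical (y = x squared for x = 0 to 6), chosen only because the curve is obvious; the code fits a straight line and lists the residuals you would plot against x.

```python
# Hypothetical curved data: y = x squared for x = 0..6.
x = list(range(7))
y = [v ** 2 for v in x]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

# These are the values that go on the vertical axis of a residual plot.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
for xi, res in zip(x, residuals):
    print(f"x = {xi}: residual = {res:+.1f}")
```

The residuals are positive at both ends and negative in the middle: a U-shaped pattern, which is the kind of structure that says a straight line is not a good fit.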

SPSS can make a new variable of residuals, which you then can use to make a scatterplot. (Handout p. 3) govsal vs pay
 Do Analyze>Regression>Linear
Click your variables into Independent (X) and Dependent(Y). 
Hit the Button "Save...": Checkbox Residuals: Unstandardized. Continue, Ok out of the menus.  You'll get output; ignore it. 
You'll get a new variable, the residuals.   You can now use this on the vertical axis of a scatterplot:  "Residual plot."
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Cautions  pp. 132-136
Plot the data: Summary formulas and numbers don't tell the whole story.  In particular, correlation and regression line only describe a linear relationship properly.
Correlation and regression are not resistant to outliers or influential points. ("Anscombe's quartet", Moore p. 142, 5.34)
(Overhead slide.   You can reconstruct these pictures using SPSS and Moore's problem, if you like.)

Extrapolation-- extra (outside) polation (putting a point): Using the line to predict outside the range of x's you have data for.   Dangerous!   Linear relationships don't go on forever; straight line  is often a first approximation to a more complicated relationship.
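A quick way to see the danger is a made-up case where the true relationship is known. In this Python sketch the data are hypothetical (y = x squared, observed only for x = 0 to 6); a straight line fitted to them does tolerably inside that range and badly outside it.

```python
# Hypothetical example: the true relationship is y = x squared,
# but we only have data for x = 0..6 and fit a straight line to it.
x = list(range(7))
y = [v ** 2 for v in x]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

def predict(x_new):
    """Predicted y from the fitted straight line."""
    return a + b * x_new

inside_error = abs(4 ** 2 - predict(4))      # x = 4 is inside the range of the data
outside_error = abs(10 ** 2 - predict(10))   # x = 10 is an extrapolation
print(f"error at x = 4: {inside_error:.1f}, error at x = 10: {outside_error:.1f}")
```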


"Lurking" variable:  has an important effect, but is not one of the variables studied.
    Govsal vs. pay:  Size of state (population and/or area) should affect salary.

    Meatloaf shrinkage vs. placement in oven?  (cooking thermometer/not had greatest influence)
    Time sequence of observations a common lurker.  (Learning, tiring, aging)
    The trouble with lurking variables is that by definition you don't know they're there.  Look behind every tree.


Sievers home  Math151-Sp09/Daysp16.htm  3:40pm 3/2/09
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility.