Math 151 , Fall 2006, Friday Day 16, Sept. 29 After class. Hit reload...

HW assignment Day 16 Reading:  Ch. 5, Regression, Reread thru p. 125  (check p. 137:  5.14 through 20 cover basic line and regression line facts and tools; you should have done those.  21 is r and slope with Fact 2.  22 is harder (changing units); don't worry about it.  23: if you sketch the graph and draw a line thru the points, you should be able to guesstimate the slope well enough to choose among the 3 answers.)  Now, continue regression, p. 126-137.  (We'll skip Ch 6)

Hand in Monday:   (note, some problems have been re-ordered from before class)
p. 122, 5.3b only: verify the formula. Use the means, s.d.'s and r from the answers in the back of the book.
p. 141, 5.30 husbands and wives  (Note, you have to find the equation of the line to draw the graph, tho it doesn't explicitly tell you to...)
p. 125, 5.5 (SPSS)corn again, straight line is a "bad fit"

A. moved later.
p. 142, 5.32 going to class 
p. 140, 5.28 social rejection, reading other software  (Read the text for how to read software)

pp. 143-4, 5.35, 37 (SPSS) Drilling into the past, silicon (one clear outlier) To graph the lines with and without the outlier on the same graph, make a new variable and put 1's in every case but the outlier--give the outlier 0.  Then use this variable as your legend or panel variable.  You'll also get a "nuisance" horizontal line at the outlier; ignore it.

SPSS Handout p. 3 (Governors' salaries):  You can now finish all the questions but the last.  Hand it all  in Wednesday.

POSTPONE The rest:

p. 129, 5.7 (SPSS) does fast driving waste fuel? residuals.  There is a data file for problem 5.7, and its third column is the residuals.  Do all the parts.
Also with 5.7: in SPSS, make a variable containing the residuals (Handout, bottom p. 4; also bottom of this page).  The values should match the ones in the book/SPSS file.

A.  Use the Excel RSquared page (R-Squared, or RSquared.xls: ClassMaterial\Math151BPS4e\RegressionDemosExcel BPS4e). Shift points around and get an r^2 close to .8 (80%). (Between .75 and .85 is good enough.)  Note that if r = +.9, then r^2 = .81.   Now shift the points so that r is negative and r^2 is close to .8.  Print the resulting page to hand in. (Data and graph)

p.133, 5.9 Farm population (SPSS)  Do a, b, c (read p. 132 for a good word to use in part c).  Also, make a variable containing the residuals, and plot it against the x (year) values.  Draw (in pencil) a horizontal line at height 0.  What pattern do you see in the residuals?

pp. 143-4, 5.35, 37 (SPSS) Drilling into the past, silicon (one clear outlier) To graph the lines with and without the outlier on the same graph, make a new variable and put 1's in every case but the outlier--give the outlier 0.  Then use this variable as your legend or panel variable.  You'll also get a "nuisance" horizontal line at the outlier; ignore it.

B.  Use Residuals.xls from the website or the lab to graph these data sets, along with a graph of the residuals.  Print the results, and describe the shape of the residuals (it may help to connect the dots with pencil, to see the pattern.) 
a)  x 1 2 8 4 6 9 
    y 1 3 6 6 7 5 
b) x 1 2 7 4 6 9
   y 7 6 2 4 2 1

Read, to discuss
 Look at this especially, with reference to the "r standard deviations in y for every 1 standard deviation in x" idea: A. Open the Excel file RegressionSlope (or in the folder RegressionDemosExcel for D&V in ClassMaterial\Math151 D&V).  Change x-y values in the yellow boxes and watch the line change.  Change x-values in col. F and watch the "run" (red line) change, in the rightmost 2 graphs. Notice the slope = the coefficient of x = the rise/run = increase in y per unit increase in x.  Fix it so the increase in x (the "run") is exactly 1.   Also, look at the leftmost graph, where the lengths of the standard deviations are shown, and note that in standard-deviation units, the rise is r s.d.'s in y for each s.d. run in x.
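The standard-deviation-units statement above can be checked numerically. Here is a minimal Python sketch, assuming a small made-up data set (the numbers and the helper names mean, sd, corr and ls_slope are illustrative only, not from the book or the Excel file). It fits the least-squares slope to the raw data and to the standardized data, and the standardized slope should come out equal to r.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def sd(v):
    # Sample standard deviation (divides by n - 1).
    m = mean(v)
    return sqrt(sum((t - m) ** 2 for t in v) / (len(v) - 1))

def corr(x, y):
    # Correlation r: average product of standardized values (n - 1 divisor).
    mx, my, sx, sy = mean(x), mean(y), sd(x), sd(y)
    return sum(((a - mx) / sx) * ((c - my) / sy) for a, c in zip(x, y)) / (len(x) - 1)

def ls_slope(x, y):
    # Least-squares slope of y on x.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# Made-up data, for illustration only.
x = [1, 3, 4, 6, 8, 9]
y = [2, 3, 6, 5, 9, 8]

r = corr(x, y)
b = ls_slope(x, y)  # slope in original units

# Standardize both variables, then fit again.
zx = [(a - mean(x)) / sd(x) for a in x]
zy = [(c - mean(y)) / sd(y) for c in y]
b_std = ls_slope(zx, zy)  # slope in standard-deviation units

print("r =", round(r, 4))
print("slope in s.d. units =", round(b_std, 4))
print("slope in original units =", round(b, 4), "= r*sy/sx =", round(r * sd(y) / sd(x), 4))
```

Changing the made-up x and y values should change r and both slopes, but the two printed comparisons should keep agreeing.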

C. Use Applet http://www.whfreeman.com/BPS4e Correlation/regression.   Make a cloud of data (about 15 points), put in the regression line.  Play with an outlier: drag a point to the far left (or right) and drag it up and down.  Try it if it's in the middle range of x's.  Answer: Where is it most influential? Now add a bunch more points (50 is max.)  Play with an outlier again.  Does the outlier have more or less influence with a larger data set?
Optional  
 


 

Exam 2 Friday Day 19, a week from today.  Sample exam & solutions available Monday. 
   Chapters 3, Normal distribution (with tables), 4&5, Scatterplots, Correlation, Regression.

Regression-- Review
ANY Straight line y = a + bx  (or bx + a):  b, the coefficient of x, is the slope of the line.  If x changes one unit, y changes b units, so b is the rate of change of y with respect to x.  (If y is weight in pounds, and x is height in inches, b is the number of pounds we expect to see weight go up by, per inch that height goes up by.)
Heard on NPR Fall '04:  The World Bank says:  For every $5 increase in the price of a barrel of oil, the world economic growth rate drops  3/10 of 1%.  What kind of analysis did they do?  They have restated what statistical thing?
Homework questions?

"Regression line of weight on height":  height = horizontal (x) axis, weight = vertical (y) axis.  
       Predicts (if suitable)  an average, typical y for each x.
Four Facts (Day 15)

The line formula yhat = a + bx  from xbar, ybar, sx , sy , r:
     Find b:   b = r  sy / sx
                (Fact 2:  r is the slope if x and y are standardized. Equation p. 120)
      Find a:  Solve  ybar = a + b xbar for a:  a = ybar - b xbar
               (Fact 3:  (xbar, ybar) lies on the regression line(s).  Equation p. 109)
 Example.  xbar = 5,  ybar = 8,  sx = 10,  sy = 6,  r = -.3:
        OOPS! (sx and sy were swapped, so this version is wrong):  b = -.3×10/6 = -0.5.   8 = a + (-0.5)×5 = a - 2.5.   a = 10.5.   yhat = 10.5 - 0.5x
        Corrected:  b = -.3×6/10 = -0.18.   8 = a + (-0.18)×5 = a - 0.9.   a = 8.9.   yhat = 8.9 - 0.18x
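The two-step recipe above (slope from r, sy and sx, then intercept from the point of means) can be sketched in Python. The function name line_from_summary is made up for this illustration; the numbers are the ones from the example.

```python
def line_from_summary(xbar, ybar, sx, sy, r):
    """Return (a, b) for the regression line yhat = a + b*x."""
    b = r * sy / sx        # slope: r times (s.d. of y) over (s.d. of x)
    a = ybar - b * xbar    # intercept: the line passes through (xbar, ybar)
    return a, b

# Summary statistics from the example in these notes.
a, b = line_from_summary(xbar=5, ybar=8, sx=10, sy=6, r=-0.3)
print(f"yhat = {a:.2f} + ({b:.2f})x")

# Check Fact 3: plugging in xbar should give back ybar.
print("prediction at xbar:", round(a + b * 5, 6))
```

Note that sy goes on top and sx on the bottom; swapping them gives a different (wrong) slope.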

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
The Line formula yhat = a + bx tells us our best prediction or estimate of a response (y) value for a particular value of the explanatory (x) value.  It says NOTHING about how good that "best" is--that is, it says nothing about how tight or scattered the data is around the line.  R-squared does that job.

    r^2 is the square of the correlation coefficient r!  (The +/- sign gets lost.)
    If r = .7, about half (.49) of the variability  in the y's is explained by using the regression line relationship to predict y from x. (If weight and height have a correlation of .7, then half of the variability in weight can be explained by knowing height.)
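A small numerical check of the "fraction of variability explained" reading, as a sketch only: the data are made up, and the helper fit_and_r is a hypothetical name. It fits the least-squares line, computes the fraction 1 - (sum of squared residuals)/(total squared variation of y about ybar), and compares that to the square of r. The data trend downward, so r is negative while the squared value is positive.

```python
from math import sqrt

def fit_and_r(x, y):
    """Return intercept a, slope b, and correlation r for y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((t - mx) ** 2 for t in x)
    syy = sum((c - my) ** 2 for c in y)
    sxy = sum((t - mx) * (c - my) for t, c in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    r = sxy / sqrt(sxx * syy)
    return a, b, r

# Made-up data with a downward trend, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [9, 7, 8, 4, 5, 2]

a, b, r = fit_and_r(x, y)
ybar = sum(y) / len(y)

sse = sum((c - (a + b * t)) ** 2 for t, c in zip(x, y))  # variation left over
sst = sum((c - ybar) ** 2 for c in y)                    # total variation in y
explained = 1 - sse / sst

print("r =", round(r, 4))
print("r squared =", round(r ** 2, 4))
print("fraction of variation explained =", round(explained, 4))
```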
Start here Monday
Least Squares Property, and Residuals
"Residual at x" = (y - yhat)  = distance between observed y and  predicted y (= what's left over after predicting)
    ( Positive if observed is bigger than predicted, negative if observed is smaller than predicted)
Residual:  Look at an individual observed (x,y) data pair.  The residual is the "leftover" amount of y after predicting a y using the line.  Visually, length of vertical line drawn from y to regression line (+ if point is above line, -  if point is below line)
   Residual = observed y - predicted y    = "prediction error" p. 119
      Calculating:  Montana (17895, 55502)       Govsal = 28,569.69 + 2.71*avgpay
           Predicted Govsal = 28,569.69 + 2.71*17895 = 28,569.69 + 48495.45 = 77065.14
           Residual = 55,502 - 77065 =  -21563,  $21,563 below expected value.
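The Montana calculation can be written out as a short Python sketch. The intercept and slope are the ones quoted above; the function name predicted_govsal is just a label chosen for this illustration.

```python
def predicted_govsal(avgpay, intercept=28569.69, slope=2.71):
    """Predicted governor's salary from the line Govsal = 28,569.69 + 2.71*avgpay."""
    return intercept + slope * avgpay

# Montana: (avgpay, Govsal) = (17895, 55502)
avgpay, govsal = 17895, 55502

predicted = predicted_govsal(avgpay)
residual = govsal - predicted   # observed y minus predicted y

print("predicted:", round(predicted, 2))
print("residual:", round(residual, 2))
if residual < 0:
    print("Observed salary is below the line.")
else:
    print("Observed salary is on or above the line.")
```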
Least squares principle:  Find the line that minimizes the sum of the squared residuals. (Here, or in Mac 101, ClassMaterials\Math151 BPS4e\ RegressionDemosExcel BPS4e\RegressionLeastSqs.xls, Squares tab)
       This method of finding a "best fit" straight line for predicting y's from x's was derived mathematically to work well with "joint normal" data--elliptical clouds. (Same idea as mean & st.dev.)  For data of this sort, the line does give the mean of the y's for each given x (at least in the abstract).
Residuals drawn to line Govsal-Deviations.doc,   SPSS (handout, p. 3, bottom:  In Edit mode, Insert>Spikes: Spike to: Regression)
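To see the least squares principle in action without the Excel demo, here is a rough Python sketch on made-up data (the helper names sse and least_squares are illustrative). It finds the least-squares line, then nudges the intercept and slope a little in each direction and confirms that every nudged line has a larger sum of squared residuals.

```python
def sse(a, b, x, y):
    """Sum of squared residuals for the line yhat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Made-up data, for illustration only.
x = [1, 2, 4, 5, 7, 8]
y = [2, 1, 4, 6, 5, 9]

a, b = least_squares(x, y)
best = sse(a, b, x, y)
print(f"least-squares line: yhat = {a:.3f} + {b:.3f}x, sum of squares = {best:.3f}")

# Try nearby lines: shift the intercept and/or tilt the slope a bit.
nudges = [(da, db) for da in (-0.5, 0, 0.5) for db in (-0.2, 0, 0.2) if (da, db) != (0, 0)]
worse = [sse(a + da, b + db, x, y) for da, db in nudges]
for (da, db), s in zip(nudges, worse):
    print(f"intercept {da:+.1f}, slope {db:+.1f}: sum of squares = {s:.3f}")
```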

Drawback if the data is not the "elliptical cloud" type:
     Outliers get their residual distance squared:  May be very influential  in determining where line sits.
             Especially if at lowest or highest x-values, may change slope of line a lot.
            Applet, http://www.whfreeman.com/BPS4e, ...Correlation&regression.   Play with an outlier.
 (Outliers toward the middle x's may not change the slope, but may affect r, and r^2.)
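The applet experiment can be imitated with a few lines of Python. This is a sketch under made-up data: nine points close to a straight line, then one extra point added either far out in x or at the middle x-value. The helper slope_and_r is a hypothetical name. In this particular setup the middle outlier sits exactly at the mean of the x's, so it leaves the slope unchanged while lowering r.

```python
from math import sqrt

def slope_and_r(x, y):
    """Least-squares slope and correlation r for y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((t - mx) ** 2 for t in x)
    syy = sum((c - my) ** 2 for c in y)
    sxy = sum((t - mx) * (c - my) for t, c in zip(x, y))
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Made-up base data: nearly on a line with slope about 1.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 8.0, 8.9]

slope_base, r_base = slope_and_r(x, y)

# Outlier far to the right in x, well below the trend.
slope_far, r_far = slope_and_r(x + [20], y + [2])

# Outlier at the middle x-value (x = 5 is the mean of the x's), far above the trend.
slope_mid, r_mid = slope_and_r(x + [5], y + [15])

print(f"base:         slope = {slope_base:.3f}, r = {r_base:.3f}")
print(f"far outlier:  slope = {slope_far:.3f}, r = {r_far:.3f}")
print(f"mid outlier:  slope = {slope_mid:.3f}, r = {r_mid:.3f}")
```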
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Plotting residuals:  If you graph residual values against x (or against predicted y's), you visually eliminate the linear portion of the association. (The regression line "becomes" the new x-axis; a "shear" transformation.) Curving or other structure may stand out more visibly.  (Straight line is a "good" fit = no structure in residuals.)  (Here or ClassMaterials\Math151 BPS4e\ RegressionDemosExcel BPS4e\Residuals.xls)
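As a rough illustration of structure showing up in residuals, the Python sketch below fits a straight line to deliberately curved made-up data (y = x squared for x = 1 to 6) and lists the residuals against x. The data and the crude text "plot" are assumptions for this example, not from Residuals.xls.

```python
# Made-up curved data: y = x^2, so a straight line is a "bad fit".
x = [1, 2, 3, 4, 5, 6]
y = [t ** 2 for t in x]

# Least-squares line of y on x.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((t - mx) * (c - my) for t, c in zip(x, y)) / sum((t - mx) ** 2 for t in x)
a = my - b * mx

# Residual = observed y - predicted y, one per data point.
residuals = [c - (a + b * t) for t, c in zip(x, y)]

print(f"line: yhat = {a:.3f} + {b:.3f}x")
for t, e in zip(x, residuals):
    # Crude text plot: '+' marks for positive residuals, '-' for negative.
    mark = ("+" if e > 0 else "-") * max(1, round(abs(e)))
    print(f"x = {t}: residual = {e:7.3f}  {mark}")

print("sum of residuals:", round(sum(residuals), 6))
```

The residuals are positive at both ends and negative in the middle, a curved pattern, even though they add up to zero as least-squares residuals always do.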

SPSS can make a new variable of residuals, which you then can use to make a scatterplot. (Handout p. 4 and 3 bottoms)
 Do Analyze>Regression>Linear (a new menu for us) 
Click your variables into Independent (X) and Dependent(Y). 
Hit the Button "Save...": Checkbox Residuals: Unstandardized. Continue, Ok out of the menus.  You'll get output; ignore it. 
You'll get a new variable, the residuals.   You can now use this on the vertical axis, of a scatterplot:  "Residual plot"


Sievers home  Math151-Fall06/Daym16.htm  11:30am 9/29/06
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility.