Math 151, Fall 2007, Wednesday, Day 17, Mar. 7.  After class: extrapolation links renewed 3/11.  Hit reload...

HW:  Reading:   Reread, Finish Ch.5    Not on the exam, the equation of the least-squares line (p. 120) & Fact 2, p.123. Continuing regression, p. 126-137.   Next,  Read Ch. 7, summary.  (Skip Ch. 6)

Hand in  Monday
p. 122, 5.3b only. Verify the formula: use the means, s.d.'s and r from the answers in the back of the book.
p. 141, 5.30 husbands and wives  (Note, you have to find the equation of the line to draw the graph, tho it doesn't explicitly tell you to...)
p. 125, 5.5 (SPSS. Let SPSS find the regression line. Get the mean yield and mean planting rate too--you need them for part c) corn again, straight line is a "bad fit"

p. 142, 5.32 going to class 
p. 140, 5.28 social rejection, reading other software  (Read the text for how to read software)

Postpone the rest:
pp. 143-4, 5.35, 37 (SPSS) Drilling into the past, silicon (one clear outlier) To graph the lines with and without the outlier on the same graph, make a new variable and put 1's in every case but the outlier--give the outlier 0.  Then use this variable as your legend or panel variable.  You'll also get a "nuisance" horizontal line at the outlier; ignore it.

Residuals
p. 129, 5.7 (SPSS) does fast driving waste fuel? residuals
  There is a data file for problem 5.7, and its third column is the residuals.  Do all the parts.
Also with 5.7: in SPSS, make a variable containing the residuals (Handout, bottom of p. 4; also bottom of this page).  The values should match the ones in the book/SPSS file.

SPSS Handout p. 3 (Governors' salaries):  You can now finish #12, the last question.  Hand it all in Monday(?).

p.133, 5.9 Farm population

B.  Use Residuals.xls from the website or the lab to graph these data sets, along with a graph of the residuals.  Print the results, and describe the shape of the residuals (it may help to connect the dots with pencil, to see the pattern.) 
a)  x 1 2 8 4 6 9 
    y 1 3 6 6 7 5 
b) x 1 2 7 4 6 9
   y 7 6 2 4 2 1
(SPSS)  Do a, b, c (read p. 132 for a good word to use in part c).  Also, make a variable containing the residuals, and plot it against the x (year) values.  Draw (in pencil) a horizontal line at height 0.  What pattern do you see in the residuals?

p. 179, 7.28, 29, 30 (SPSS) Soap in the shower.  Also, look carefully at the graph and guess why there is no data after day 21.  (Read p. 132 for the word to describe using the line for day 30, and a discussion of the issue.)
p. 136 5.13 hospitals: big = bad?

Read, to discuss
 Look at this, especially with reference to the r standard deviations in y for every 1 standard deviation in x: A. Open the Excel file RegressionSlope (or in the folder RegressionDemosExcel for D&V in ClassMaterial\Math151 D&V).  Change x-y values in the yellow boxes and watch the line change.  Change x-values in col. F and watch the "run" (red line) change, in the rightmost 2 graphs. Notice the slope = the coefficient of x = the rise/run = increase in y per unit increase in x.  Fix it so the increase in x (the "run") is exactly 1.   Also, look at the leftmost graph, where the lengths of the standard deviations are shown, and note that in standard-deviation units, the rise is r s.d.'s in y for each s.d. of run in x.

Postpone the rest:

C. Use Applet http://www.whfreeman.com/BPS4e Correlation/regression.   Make a cloud of data (about 15 points), put in the regression line.  Play with an outlier: drag a point to the far left (or right) and drag it up and down. 
Try it if it's in the middle range of x's.  (Drag it up and down.)  Answer: Where is it most influential? Now add a bunch more points (50 is max.)  Play with an outlier again.  Does the outlier have more or less influence with a larger data set?

p. 136,  5.12 lurking variables


Optional 
p. 179, 7.27 (review Normal)


Postpone the rest:
p. 136, 5.11, lurking variables 

Exam 2 this Friday (next class: Day 18, March 9).  Starts with Ch. 3, Normal distrib., thru Ch. 4, and what was covered of Ch. 5 Monday.  One sheet of notes; I will give you paper copies of the Normal table.
Sample exam handout, outside my door after class, and linked Here.  (Exam will cover all the problems given.)
     Solutions: 2 outside my door, 2 on reserve, linked here
 
Sign up for the time you'll start, on clipboard today. 
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
HW questions?  Day 16
5.42 p. 146, a computer game, revisited.
  Can it really be that only about 9% of the variability in speed of the right hand is accounted for by the distance?  The eye is fooled by the graph, with the right hand data squashed down at the bottom and looking really linear.  Here is the right hand by itself.  (SPSS output file)

Income depends on height?!

    What is "$789", and what kind of analysis did they do?  (HW)  How much of the variation in salary is explained by height?

The line formula yhat = a + bx tells us our best prediction or estimate of a response (y) value for a particular value of the explanatory (x) variable.  It says NOTHING about how good that "best" is--that is, it says nothing about how tight or scattered the data is around the line.  R-squared does that job.
    r^2 is the square of the correlation coefficient r!  (The -, + sign gets lost.)
    If r = .7, about half (.49) of the variability in the y's is explained by using the regression line relationship to predict y from x. (If weight and height have a correlation of .7, then half of the variability in weight can be explained by knowing height.)
NOTE:  The standard deviation doesn't say anything about the distance of any individual point from the mean; it's only about a kind of "average" variability.  r^2 doesn't say anything about the line and any particular (x,y) pair--just about a kind of "average" goodness of the explanatory power of the line for the data.
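A minimal Python sketch of the "variability explained" idea, for anyone who wants to check it outside SPSS. The data are made up for illustration, and the function names are my own, not from Moore or the handout. It compares r squared with the fraction of the variation in y that is left over after using the line.

```python
import math

def correlation(xs, ys):
    """Correlation r, computed from sums of deviations from the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fraction_explained(xs, ys):
    """1 - (squared residuals around the line) / (squared deviations around ybar)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    leftover = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - ybar) ** 2 for y in ys)
    return 1 - leftover / total

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]

r = correlation(xs, ys)
print("r =", round(r, 4))
print("r squared =", round(r ** 2, 4))
print("fraction of variation explained =", round(fraction_explained(xs, ys), 4))
```

The last two printed numbers should agree, which is the sense in which r squared measures how much of the variability in y the line accounts for.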
Other questions for exam?

- - - (most of) The rest of Chapter 5- - - - - - - - -
Facts
2 & 3 together give the line formula!
(Moore pp. 123-125)  (Day 16)

2.   A change of one standard deviation in x corresponds to a change of r standard deviations in y, along the regression line.
The slope b expresses change in y-units per x-unit. (Suppose x is inches, y is pounds. Then b is in pounds per inch.) You can find b by multiplying r by the standard deviation of the y's (that's in pounds)  and dividing by the standard deviation of the x's (that's in inches)
In "algebra", b = r times (s.d. of y)/(s.d. of x)  (Equation p. 120)
       If we standardize both the x-values and the y-values, the slope will just = r !
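A small Python sketch of Fact 2, assuming made-up height (inches) and weight (pounds) values; none of these numbers come from the book. It checks that the least-squares slope equals r times (s.d. of y)/(s.d. of x), and that the slope becomes r once both variables are standardized.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation (n - 1 divisor)."""
    m = mean(v)
    return math.sqrt(sum((t - m) ** 2 for t in v) / (len(v) - 1))

def ls_slope(xs, ys):
    """Slope of the least-squares line for predicting y from x."""
    xbar, ybar = mean(xs), mean(ys)
    return (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))

def correlation(xs, ys):
    """r = sum of products of standardized values, divided by n - 1."""
    xbar, ybar, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up data: x in inches, y in pounds.
xs = [62, 64, 66, 68, 70, 72]
ys = [120, 135, 130, 150, 165, 160]

r = correlation(xs, ys)
b = ls_slope(xs, ys)                      # pounds per inch
print("b =", round(b, 4), " r*sy/sx =", round(r * sd(ys) / sd(xs), 4))

# Standardize both variables; the slope is now in s.d.'s of y per s.d. of x.
zx = [(x - mean(xs)) / sd(xs) for x in xs]
zy = [(y - mean(ys)) / sd(ys) for y in ys]
print("standardized slope =", round(ls_slope(zx, zy), 4), " r =", round(r, 4))
```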
        
govsalstd.sav Govsalstd2.doc    RegressionSlope.xls

3.   The regression line goes through the point given by the two means, (xbar, ybar). http://www.whfreeman.com/bps4e 
--If you know this, you know ybar = a + b (xbar).  You can solve this for a: a = ybar - b (xbar). (Other equation, p. 120)
--So knowing 2 and 3 gives you the equation of the line from the means, s.d.'s, and r.
--And if you draw the two lines, y on x and x on y, they will intersect at (xbar, ybar)

The line formula yhat = a + bx  from xbar, ybar, sx , sy , r:
     Find b:   b = r  sy / sx
                (Fact 2: r is the slope if x and y are standardized. Equation p. 120)
      Find a:  Solve  ybar = a + b xbar for a:  a = ybar - b xbar
               (Fact 3:  (xbar, ybar) lies on the regression line(s).  Other equation, p. 120)
 Example.  xbar = 5   ybar = 8 
sx = 10, sy = 6 , r = -.3: 
        b = -.3×6/10 = -0.18.   8 = a + (-0.18)×5 = a - 0.90    a = 8.90       yhat = 8.90 - 0.18x
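The two steps (find b, then solve for a) can be sketched in Python as a check on hand arithmetic. The function name line_from_summary is my own label, not something from the text; the input values are the ones in the example.

```python
def line_from_summary(xbar, ybar, sx, sy, r):
    """Return (a, b) for yhat = a + b*x, using Facts 2 and 3."""
    b = r * sy / sx          # Fact 2: slope = r * (s.d. of y) / (s.d. of x)
    a = ybar - b * xbar      # Fact 3: the line goes through (xbar, ybar)
    return a, b

# Values from the example: xbar = 5, ybar = 8, sx = 10, sy = 6, r = -.3
a, b = line_from_summary(xbar=5, ybar=8, sx=10, sy=6, r=-0.3)
print(f"yhat = {a:.2f} + ({b:.2f})x")

# The line should give back ybar when x = xbar.
print("prediction at xbar:", round(a + b * 5, 2))
```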

Start here Monday

Least Squares Property, and Residuals
"Residual at x" = (y - yhat)  = distance between observed y and  predicted y (= what's left over after predicting)
    ( Positive if observed is bigger than predicted, negative if observed is smaller than predicted)
Residual:  Look at an individual observed (x,y) data pair.  The residual is the "leftover" amount of y after predicting a y using the line.  Visually, length of vertical line drawn from y to regression line (+ if point is above line, -  if point is below line)
   Residual = observed y - predicted y    = "prediction error" p. 119
      Calculating:  Montana (17895, 55502)       Govsal = 28,569.69 + 2.71*avgpay
           Predicted Govsal = 28,569.69 + 2.71*17895 = 28,569.69 + 48495.45 = 77065.14
           Residual = 55,502 - 77065 =  -21563,  $21,563 below expected value.
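The Montana calculation as a short Python sketch, using the intercept and slope quoted above. The function name predict_govsal is just a label for this illustration.

```python
def predict_govsal(avgpay, a=28569.69, b=2.71):
    """Predicted governor's salary from average pay, using the line above."""
    return a + b * avgpay

observed = 55502                      # Montana governor's salary
predicted = predict_govsal(17895)     # Montana average pay
residual = observed - predicted       # observed y - predicted y

print("predicted:", round(predicted, 2))
print("residual:", round(residual, 2))
```

A negative residual means the observed point sits below the line.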
Least squares principle:  Find the line that minimizes the sum of the squared residuals. (Here, or in Mac 101, ClassMaterials\Math151 BPS4e\ RegressionDemosExcel BPS4e\RegressionLeastSqs.xls, Squares tab)
       This method of finding a "best fit" straight line for predicting y's from x's was derived mathematically to work well with "joint normal" data--elliptical clouds. (Same idea as mean& st.dev.)  For data of this sort, the line does  give the mean of the y's for each given x (at least in the abstract.)
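A rough Python sketch of what "minimizes the sum of the squared residuals" means, in the spirit of the Squares tab demo. The data are made up for illustration. It finds the least-squares line, then nudges the intercept and slope and shows that the sum of squares only gets bigger.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line yhat = a + b*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def sum_sq_residuals(xs, ys, a, b):
    """Sum of (observed y - predicted y)^2 for the line yhat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

a, b = least_squares(xs, ys)
best = sum_sq_residuals(xs, ys, a, b)
print("least-squares line: sum of squares =", round(best, 4))

# Any other line has a larger sum of squared residuals.
nudges = [(0.5, 0), (-0.5, 0), (0, 0.2), (0, -0.2)]
for da, db in nudges:
    other = sum_sq_residuals(xs, ys, a + da, b + db)
    print(f"a{da:+}, b{db:+}: sum of squares =", round(other, 4))
```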
Residuals drawn to line Govsal-Deviations.doc,   SPSS (handout, p. 3, bottom:  In Edit mode, Insert>Spikes: Spike to: Regression)
Drawback if the data is not the "elliptical cloud" type:
     Outliers get their residual distance squared:  May be very influential  in determining where line sits.
             Especially if at lowest or highest x-values, may change slope of line a lot.
            Applet, http://bcs.whfreeman.com/BPS4e, ...Correlation&regression.   Play with an outlier.
 (Outliers toward the middle x's may not change the slope, but may affect r, and r2.)
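If the applet isn't handy, here is a Python sketch of the same experiment. The data are made up (nine points lying exactly on a line), so this only illustrates the idea: an outlier at an extreme x pulls the slope a lot, while one at the middle x leaves the slope alone but weakens r.

```python
import math

def slope_and_r(xs, ys):
    """Least-squares slope and correlation r for the given points."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Made-up data: nine points exactly on the line y = x.
xs = list(range(1, 10))
ys = list(range(1, 10))

slope_base, r_base = slope_and_r(xs, ys)
slope_far, r_far = slope_and_r(xs + [20], ys + [0])    # outlier at far right
slope_mid, r_mid = slope_and_r(xs + [5], ys + [20])    # outlier at middle x

print("no outlier:      slope", round(slope_base, 3), " r", round(r_base, 3))
print("outlier at x=20: slope", round(slope_far, 3), " r", round(r_far, 3))
print("outlier at x=5:  slope", round(slope_mid, 3), " r", round(r_mid, 3))
```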
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Plotting residuals:  If you graph residual values against x (or against predicted y's), you eliminate visually the linear portion of the association. (The regression line "becomes" the new x-axis; a "shear" transformation.) Curving or other structure may stand out more visibly.  (Straight line is a "good" fit = no structure in residuals.) (Here or ClassMaterials\Math151 BPS4e\ RegressionDemosExcel BPS4e\Residuals.xls)

SPSS can make a new variable of residuals, which you then can use to make a scatterplot. (Handout, bottoms of pp. 3 and 4)
 Do Analyze>Regression>Linear (a new menu for us) 
Click your variables into Independent (X) and Dependent(Y). 
Hit the Button "Save...": Checkbox Residuals: Unstandardized. Continue, Ok out of the menus.  You'll get output; ignore it. 
You'll get a new variable, the residuals.   You can now use this on the vertical axis of a scatterplot:  "Residual plot"
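For comparison with what SPSS saves, a Python sketch that builds the residual variable by hand. The data are made up and deliberately curved (y = x squared), so the residuals show structure; this is not one of the homework data sets.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line yhat = a + b*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

# Made-up curved data: y = x^2.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # observed - predicted

# These (x, residual) pairs are what goes on the residual plot.
for x, e in zip(xs, residuals):
    print(x, round(e, 2))
print("sum of residuals:", round(sum(residuals), 6))
```

The residuals are positive at both ends and negative in the middle: a curved pattern, so a straight line is not a good fit here even though the residuals sum to zero.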

Cautions  pp. 132-136
Plot the data: Summary formulas and numbers don't tell the whole story.  In particular, correlation and regression line only describe a linear relationship properly.
Correlation and regression are not resistant to outliers or influential points. ("Anscombe's quartet", Moore p. 142, 5.34)
(Overhead slide.   You can reconstruct these pictures using SPSS and Moore's problem, if you like.)

Extrapolation-- extra (outside) polation (putting a point): Using the line to predict outside the range of x's you have data for.  Linear relationships don't go on forever; straight line  is often a first approximation to a more complicated relationship.
Government projections of national budget surplus/deficit:  (www.cbo.gov publications>search)
 Jan. 2001 http://www.cbo.gov/showdoc.cfm?index=2727&sequence=6  Projection used to justify Bush tax cuts.
Jan. 2002   http://www.cbo.gov/showdoc.cfm?index=3277&sequence=6
August 2006 http://www.cbo.gov/ftpdocs/74xx/doc7492/08-17-BudgetUpdate.pdf  
     Pdf p. 19, single line projection--10 years, p. 36, uncertainty--6 years.
March 2007 (p. 2, pdf p. 8)  http://www.cbo.gov/ftpdocs/78xx/doc7837/03-05-Uncertain.pdf

  June 2000, conservative think tank analysis  http://www.hoover.org/publications/policyreview/3487697.html
      Fig 1, budget surplus/deficit 1901 on.  Notice only previous longterm surplus is 1920's,
      Fig. 6 --1960 on, & projections


"Lurking" variable has an important effect, but not one of the variables studied.
    Meatloaf shrinkage vs. placement in oven?  (cooking thermometer/not had greatest influence)
    Time sequence of observations a common one.  (Learning, tiring, aging)
    The trouble with lurking variables is that by definition you don't know they're there.  Look behind every tree.

Association does not imply causation
Strong association/correlation between A and B could be:
     A causes B/   B causes A/  C causes both A and B (lurking C)/  just Chance that they go together in this data set.    
Direction?  Rooster causes sun to rise by crowing?
Both variables "caused" by a lurking variable?   Lurking variable can be part of the cause.
--Women with a history of heavy antibiotic use have higher rates of breast cancer.
--Baby rats whose mothers licked and groomed them more   grew up to be more exploratory, social, less timid.
            Cause? Effect?  How to tell?

Establishing that x "causes" y:  difficult:
    Best: Do an experiment in which we change x, keep lurking variables under control. (Ch. 9  Rats. )
    Otherwise: Strong association. Consistent over many studies. Higher x-->stronger y.  X precedes y in time.  A plausible mechanism exists (parallel studies?)
                Generalize rat grooming to humans?

         E.g., partially hydrogenated oils --> heart disease?  Homocysteine --> heart disease?


Sievers home   Math151-Sp07/Daysp17.htm  9:30pm 3/11/07
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility.