Math 151, Spring 2005, Day 17, Wed. March 9.  Hit reload after class.

Day 17 (Wed. Mar 9): Reading: Read D&V Ch8 & Ch9(all), Do AS8 Regression. (AS9)
Ahead, we'll skip Ch 10, do Ch11  and AS11, then  12&13
Hand in Fri.  (D&V p.152 ff, unless otherwise noted)

C.  Use Residuals.xls from here or the lab (in ClassMaterial\Math151 D&V\RegressionDemosExcel for D&V) to graph these data sets, along with a graph of the residuals.  Print the results, and describe the shape of the residuals (it may help to connect the dots with pencil, to see the pattern).
   a) x 1 2 8 4 6 9
      y 1 3 6 6 7 5
   b) x 1 2 7 4 6 9
      y 7 6 2 4 2 1
3 Residuals
32a-h Birthrates (type the data into SPSS. Make a plot of residuals also, to help with c) 
RSquared
SPSS Handout p. 3:  You can now finish all the questions.  Hand it in as part of Day 17!
36 e Gators, how good.  Also, Graph this line, and use it to estimate the weight of a 60-inch alligator. 
21 d,e,f Used cars
31 a-g El Nino
7 Real Estate  (for e, f, remember the regression equation in z's form, p.138 middle)

A. Income depends on height?! Read the article and answer this.
If your browser doesn't get the link, it's at http://aurora.wells.edu/~srs/Math151-Sp05/tallpeoplewin.htm
   a) What is "$789", and what kind of analysis did they do?
   b) What does my footnote at the end tell you about the data that the article did not?
The rest will be part of Friday Day 18's work, since I haven't lectured on it yet (but you can do most of it now!)
25&27 Burgers (type the data into SPSS)
19 SAT scores (You did 17 last night.  Look in the answers in the back of the book for the formula you calculated which you need for part e.)

Read,
to discuss 
Optional: 
Use the ActivStats Least Squares tool (see below) and play with datasets; especially, drag points around and see what they do to the line.
HW, questions?  Day 16
Heard on NPR last fall:  The World Bank says:  For every $5 increase in the price of a barrel of oil, the world economic growth rate drops  3/10 of 1%.  What kind of analysis did they do?  They have restated what statistical thing?

Regression line: D&V Ch 8&9, AS8&9, "Regressing y ON x"
 Formula: yhat = b0 + b1 x
     b1 = r times (s.d. of y)/(s.d. of x) = r sy / sx;    b1 is in y-units per x-unit
     b0 = ybar - b1(xbar), because the line passes through (xbar, ybar): ybar = b0 + b1(xbar).
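
A minimal sketch of these formulas in Python (not one of the course tools, which are SPSS, Excel and ActivStats). It uses data set (a) from part C of the homework above; the variable names are illustrative, and r is computed as the sum of z-score products divided by n - 1.

```python
from statistics import mean, stdev

# Data set (a) from part C of the Day 17 homework.
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

n = len(x)
xbar, ybar = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)      # sample standard deviations (divide by n - 1)

# Correlation r: sum of products of z-scores, divided by n - 1.
r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy) for xi, yi in zip(x, y)) / (n - 1)

b1 = r * sy / sx                 # slope, in y-units per x-unit
b0 = ybar - b1 * xbar            # intercept; forces the line through (xbar, ybar)

print(f"r = {r:.3f}, b1 = {b1:.3f}, b0 = {b0:.3f}")
```

For this data set the slope works out to 0.5 y-units per x-unit, which you can check against the Excel or SPSS output.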

Residual:  Look at an individual observed (x,y) data pair.  The residual is the "leftover" amount of y after predicting a y using the line.  Visually, it is the length of the vertical line drawn from the data point to the regression line (+ if the point is above the line, - if the point is below the line).
   Residual = observed - predicted
    SPSS (handout, p. 3, bottom:  In Edit mode, Insert>Spikes: Spike to: Regression) Govsal-deviations.spo
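
As a rough illustration of "observed - predicted", here is a short Python sketch using data set (a) from the homework. The helper name fit_line is made up for this example; it computes the slope as Sxy/Sxx, which is algebraically the same as r sy / sx.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line yhat = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Data set (a) from part C of the Day 17 homework.
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]   # observed - predicted

for xi, yi, yhat, res in zip(x, y, predicted, residuals):
    side = "above" if res > 0 else "below"
    print(f"x={xi}  observed={yi}  predicted={yhat:.2f}  residual={res:+.2f}  ({side} the line)")
```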

"Least squares" (D&Vp.144, AS8-3Activity1&2) The regression line is the line that minimizes the sum of the squared residuals.  (RegressionLeastSqs.xls, or in Mac 101, ClassMaterials\Math151 D&V\ RegressionDemosExcel for D&V\RegressionLeastSqs.xls)
       &&This method of finding a "best fit" straight line for predicting y's from x's was derived mathematically to work well with "joint normal" data--elliptical clouds.  For data of this sort, the line does  give the mean of the y's for each given x (at least in the abstract.)
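
A small Python sketch of the least-squares idea, in the spirit of dragging the line around in the ActivStats tool: compute the sum of squared residuals for the regression line, then for a few nearby lines, and see that every nudge makes the sum bigger. The data are set (a) from the homework; the nudge sizes are arbitrary choices for illustration.

```python
# Data set (a) from part C of the Day 17 homework.
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

def sse(b0, b1):
    """Sum of squared residuals for the line yhat = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Least-squares slope and intercept.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

best = sse(b0, b1)
print(f"regression line: b0={b0:.3f}, b1={b1:.3f}, sum of squares={best:.3f}")

# Nudge the intercept and slope (arbitrary amounts) and recompute.
others = []
for db0 in (-0.5, 0.0, 0.5):
    for db1 in (-0.2, 0.0, 0.2):
        if (db0, db1) != (0.0, 0.0):
            s = sse(b0 + db0, b1 + db1)
            others.append(s)
            print(f"  b0{db0:+.1f}, b1{db1:+.1f}: sum of squares={s:.3f}")
```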
ActivStats Least Squares tool: AS8-3, rightmost button, with line and red dots. "Show" button.  Checkmark all possibilities. Uncheck "ShowLS Line". Choose the number of points, then do "Regenerate". Move the green line to minimize the Sum of Squares (red bar), and observe the residuals as you do.  Confirm your result by checking "ShowLSLine".
"Regenerate" creates "good clouds" of data.  To use your own data, do "Reset"; click in the picture to make dots (but not too close to the green line or it will think you're dragging that).

Pattern in graph of residuals:  (Ch9 p.162-3) If you graph residual values against x (or against predicted y's), you eliminate visually the linear portion of the association--eliminate the distraction of the slanted line. (The regression line "becomes" the new x-axis; a "shear" transformation)
   Excel Residuals.xls in  ClassMaterial\Math151 D&V\RegressionDemosExcel for D&V.
Curving or other structure may stand out more visibly.  "Good" fit = no structure in residuals.
SPSS:  (old wing) (Handout bottom p.4&3)  Analyze>Regression>Linear. Plots button, *ZRESID on *ZPRED. Save button, Residuals: Unstandardized calculates all the residuals and saves them as a new variable, which you can then graph.  If you graph residuals on x and add the regression line, it's now a flat line at 0: the mean of all the residuals is 0, and no linear trend is left in them.
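
To check the "flat line at 0" claim numerically, here is a Python sketch using data set (b) from the homework. It fits the regression line, computes the residuals, and then fits a second line to the residuals against x. The helper name fit_line is invented for this example.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line yhat = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Data set (b) from part C of the Day 17 homework.
x = [1, 2, 7, 4, 6, 9]
y = [7, 6, 2, 4, 2, 1]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

mean_residual = sum(residuals) / len(residuals)
res_b0, res_b1 = fit_line(x, residuals)    # regression of the residuals on x

print(f"mean residual = {mean_residual:.6f}")
print(f"line fitted to residuals: intercept = {res_b0:.6f}, slope = {res_b1:.6f}")
```

Both the intercept and slope of the second line come out as 0 (up to rounding), so any shape still visible in the residual plot is something other than a straight-line trend.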
 

R-squared: The line formula yhat = b0 + b1 x tells us our best prediction or estimate of a response (y) value for a particular value of the explanatory (x) variable.  It says NOTHING about how good that "best" is; that is, it says nothing about how tight or scattered the data are around the line.  R-squared does that job.
 R2 (= r2 = "Coefficient of Determination") = Proportion of variability in y-values explained/accounted for by knowing x and using the  regression line model.
  Un-accounted-for-variability =(1-r2) = variance-of-residuals / total-variance-of-y's
More: R-Squared (ClassMaterials\Math151 D&V\ RegressionDemosExcel for D&V\RSquared.xls)
(Optional: Further explanation of r2)
r2 is the square of the correlation coefficient r!  (The + or - sign gets lost.)
If r = .7, about half (.49) of the variability in the y's is accounted for by using the regression line model to predict y from x. (If weight and height have a correlation of .7, then about half of the variability in weight can be accounted for by height.)
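
A quick numerical check of the identity (1 - r2) = variance-of-residuals / total-variance-of-y's, sketched in Python with data set (a) from the homework. Variable names are illustrative only.

```python
from statistics import variance

# Data set (a) from part C of the Day 17 homework.
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

r = sxy / (sxx * syy) ** 0.5            # correlation coefficient
b1 = sxy / sxx                          # least-squares slope
b0 = ybar - b1 * xbar                   # least-squares intercept
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

r_squared = r ** 2
unexplained = variance(residuals) / variance(y)   # share of y-variability left over

print(f"r = {r:.3f}, r2 = {r_squared:.3f}")
print(f"1 - r2 = {1 - r_squared:.3f}, var(residuals)/var(y) = {unexplained:.3f}")
```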
Start here Friday:
Line is not symmetric: The regression of weight on height uses a different line from the regression of height on weight.  (Minimizing vertical residuals pulls the line "flatter" than the line that just goes through the middle of the cloud, which would rise 1 s.d. of y for each 1 s.d. of x.  Related to the idea of "regression to the mean" p. 139)
   Demonstration on overhead projector; flip transparency to exchange axes.
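
The same point in numbers, as a Python sketch with data set (a) from the homework: regress y on x, then x on y, and compare both with the "middle of the cloud" line whose slope is sy/sx. To compare on the same axes, the x-on-y line is redrawn with y vertical, which turns its slope into 1/(slope of x on y). All names here are made up for the example.

```python
# Data set (a) from part C of the Day 17 homework.
x = [1, 2, 8, 4, 6, 9]
y = [1, 3, 6, 6, 7, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

slope_y_on_x = sxy / sxx                 # minimizes vertical residuals
slope_x_on_y = sxy / syy                 # minimizes horizontal residuals
sd_line_slope = (syy / sxx) ** 0.5       # sy / sx: 1 s.d. up per 1 s.d. over
r_squared = sxy ** 2 / (sxx * syy)

print(f"y on x:                  slope = {slope_y_on_x:.3f}")
print(f"s.d. line:               slope = {sd_line_slope:.3f}")
print(f"x on y, drawn on y-vs-x: slope = {1 / slope_x_on_y:.3f}")
print(f"product of the two regression slopes = {slope_y_on_x * slope_x_on_y:.3f} (= r2 = {r_squared:.3f})")
```

The two regression lines coincide only when r is +1 or -1; otherwise the y-on-x line is flatter than the s.d. line and the x-on-y line is steeper.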

Chapter 9:  Regression (& correlation) wisdom:  What  can go wrong, things to watch out for.
Groups (subsets) may benefit from being considered separately.  Sometimes analyzing residuals can alert us to important subsets.
SPSS:  Fitting lines to groups:  Put a grouping variable in a Legend Variables box and Insert >FitLine>Regression will make a line for the whole and lines for each group. In Edit mode: Click on a regression equation; then Edit>Regression Parameters allows eliminating the line for the Total (or the Subgroups lines)  (Using Panel Variables box makes each group on a separate graph)  Residuals graphs and variables are only generated for the total group.  To do separately, you'd need to do Data>Select Cases in the editor, and work with one group at a time.

Shape after linear trend removed: see above, Patterns in Graphs of residuals.
Extrapolation p.163: last class also.  Linear approximations may be good for short term segments, lousy in long term.
Association does not imply causation: a "lurking" variable (p. 168) has an important effect, but is not one of the variables studied.
    Meatloaf shrinkage vs. placement in oven?  (Whether or not a cooking thermometer was used had the greatest influence.)
    The time sequence of observations is a common one.  (Learning, tiring, aging)
    The trouble with lurking variables is that by definition you don't know they're there.  Look behind every tree.

Outliers and Influential points:  pp. 165-7 (Use Moore http://www.whfreeman.com/scc, or ASLeastSquaresTool)
 A point (or more) outside of the pack--an outlier--can:
-- Weaken or strengthen r (& r2):
        If it's in the same direction as the general trend, it strengthens; against the trend, it weakens.
-- Affect the slope of the regression line a lot (it has high "leverage" = is an "influential point"), if it's an outlier in the x measurement.  (Teeter-totter principle)  We won't calculate leverage.
-- Affect the slope little, but:
     -- strengthen r2 if it's along the main trend but farther out.
     -- pull the whole line up or down a bit, if it's in the center of the data on the x-measurement and an outlier in y (not an "influential point").
&& Two clusters with little internal trend could look like a strong association when "combined."
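
A sketch in Python of the teeter-totter principle. The five base points and the two added outliers are invented for illustration (they are not from D&V). One outlier sits far out in x, the other sits at the center of the x-values but far off in y.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line yhat = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Hypothetical base data with a clear upward trend (made up for this example).
base_x = [1, 2, 3, 4, 5]
base_y = [1.2, 1.9, 3.1, 3.9, 5.0]

b0_base, b1_base = fit_line(base_x, base_y)

# Outlier in x (far right, well below the trend): high leverage.
b0_lever, b1_lever = fit_line(base_x + [15], base_y + [3.0])

# Outlier in y, placed at the center of the x-values (x = 3 = xbar).
b0_center, b1_center = fit_line(base_x + [3], base_y + [8.0])

print(f"base data:          b0 = {b0_base:.3f}, b1 = {b1_base:.3f}")
print(f"plus point (15, 3): b0 = {b0_lever:.3f}, b1 = {b1_lever:.3f}")
print(f"plus point (3, 8):  b0 = {b0_center:.3f}, b1 = {b1_center:.3f}")
```

The far-out point drags the slope almost flat, while the point at the center leaves the slope alone and only lifts the whole line.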

Anscombe's Quartet--4 (made-up) data sets with identical summary statistics (AS9HW: MRB127-46 Always Plot your Data!)
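
A Python sketch that computes the summary statistics for the four Anscombe sets. The numbers are the commonly published Anscombe (1973) values, typed in here as an assumption; verify them against your own copy before relying on them. The helper name summarize is made up for this example.

```python
def summarize(x, y):
    """Return (ybar, b0, b1, r) for one data set."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    b1 = sxy / sxx
    return ybar, ybar - b1 * xbar, b1, sxy / (sxx * syy) ** 0.5

# Commonly published Anscombe values (assumed correct; check against a source).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, (ybar, b0, b1, r) in results.items():
    print(f"set {name:>3}: ybar = {ybar:.2f}, yhat = {b0:.2f} + {b1:.3f} x, r = {r:.3f}")
```

The four printed lines agree to about two decimal places, yet the scatterplots look completely different, which is the point of "Always Plot your Data!"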

Summary values p.169:  If the x and/or y data have already been averaged or summarized, the relationship you plot and/or describe with correlation/regression will look stronger than it would if you used raw data (you've already gotten rid of much of the variability).  && Watch out for data relating states, nations, groups.
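
A small Python sketch of the summary-values warning. The data are entirely made up: five groups of three individuals each, with a lot of individual scatter in y around each group's center. The correlation is computed once on the 15 raw points and once on the 5 group averages.

```python
from statistics import mean

def corr(x, y):
    """Correlation coefficient r."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical individual deviations (dx, dy) from each group's center.
offsets = [(-1, 2), (0, -4), (1, 2)]

raw_x, raw_y, group_mean_x, group_mean_y = [], [], [], []
for g in range(1, 6):                       # five made-up groups centered at (g, g)
    gx = [g + dx for dx, _ in offsets]
    gy = [g + dy for _, dy in offsets]
    raw_x += gx
    raw_y += gy
    group_mean_x.append(mean(gx))           # averaging throws away the scatter
    group_mean_y.append(mean(gy))

r_raw = corr(raw_x, raw_y)
r_means = corr(group_mean_x, group_mean_y)

print(f"r using the 15 raw points:   {r_raw:.3f}")
print(f"r using the 5 group averages: {r_means:.3f}")
```

Real data will not be this extreme, but the direction is the same: averages hide the individual variability, so the association among averages looks tighter.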


Sievers home  Math151-Sp05/Days17.htm 1pm 3/9/05
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility.