Math 151, Fall 2005, Day 18, Wed. Oct. 5

Day 18 (Wed. Oct.5): Reading: Read D&V Ch8 & Ch9(all), Do AS8 Regression. (AS9)
Looking ahead: we'll skip Ch10, do Ch11 and AS11, then Ch12 & 13.
Hand in Fri. 

(D&V p.152 ff.)
y on x, x on y: 
25&27 Burgers (type the data into SPSS)
19 SAT scores (You did 17 last night.  Look in the answers in the back of the book for the formula you calculated which you need for part e.)

Chapter 9:  D&V p. 174ff. unless otherwise noted:
1,3 Marriage age (Type your data into SPSS.  Your answers may vary. Cf.#13p.73)
2 Age difference

Groups, from ActivStats Ch7 HW (same datasets you used before, different questions) Find Datasets in SPSS from this page
A) Metabolic rate (MRA-81-4) on Lean Body Mass.  i) Make a scatterplot with the regression lines for the 2 sexes and the whole group.  (A graph with Panel variable Sex is good in addition.)  The R^2 for Males is quite low compared to that for Females and compared to that for the whole group.  ii) Explain how the scatter that you see for the two groups is consistent with their different R^2 values.  iii) How can it be that the more scattered Males, added on to the Females, end up producing an R^2 close to that of the Females alone?
B) Bear neck/weight (TRE-58-26)  i) Make a single scatterplot with the regression lines for the 2 sexes.  (Get rid of the Total line)  ii) Then do a graph with Panel variable Sex and regression lines, better to see them separately.  iii) Describe any bears which are outliers and/or  influential points, and any ways in which the data are not well modeled by the straight lines.

15, 16 Gestation  (note, these are summarized data)
20 Life expectancy (SPSS)(again, summarized data)  For b, if an outlier is "impossible", delete it!
18 Smoking. (SPSS) Graph it and discuss what you might do to model it; don't DO it.

Read, to discuss:



Ch.9
p. 176:
11, 12, 13, 14 
(all about outliers)

7a-d Reading

 
Optional: 
Use the ActivStats Least Squares tool (see below) and play with datasets; especially, drag points around and see what they do.
HW, questions?  Day 17
Pick a digit (from 0,1,2,3,4,5,6,7,8,9).  Write it down.  Write it next to your name on the sign in sheet.
Heightism:  Income depends on height?!
  a) What is "$789"? Slope of the regression line of yearly pay on height (in inches).
  b) What does my footnote at the end tell you about the data that the article did not? R < .15, so less than 15% of the variation in pay is accounted for by the regression on height.

Regression line: D&V Ch 8&9, AS8&9, "Regressing y ON x"
 Formula: yhat = b0 + b1 x,      b1 = r times (s.d. of y)/(s.d. of x) = r s_y / s_x,    b1 is in y-units per x-unit
     b0 = ybar - b1(xbar), from ybar = b0 + b1(xbar).
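
A small sketch of the two formulas above, in Python rather than SPSS. The summary numbers (r, the means, the s.d.'s) are made up for illustration; they are not from D&V or any class dataset.

```python
# Made-up summary statistics (hypothetical, for illustration only).
r = 0.7                      # correlation between x and y
x_bar, s_x = 68.0, 3.0       # mean and s.d. of x (say, height in inches)
y_bar, s_y = 160.0, 20.0     # mean and s.d. of y (say, weight in pounds)

b1 = r * s_y / s_x           # slope, in y-units per x-unit
b0 = y_bar - b1 * x_bar      # intercept, chosen so the line passes through (x_bar, y_bar)


def predict(x):
    """Predicted y (yhat) for a given x, using yhat = b0 + b1 x."""
    return b0 + b1 * x


print(f"yhat = {b0:.2f} + {b1:.3f} x")
print("prediction at x_bar:", predict(x_bar))       # equals y_bar, up to rounding
print("prediction one s.d. above x_bar:", predict(x_bar + s_x))
```

Note that moving one s.d. up in x moves the prediction only r s.d.'s up in y (here 0.7 of 20 = 14 pounds), not a full s.d.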

Residual:    Residual = observed - predicted
"Least squares" (D&V p.144, AS8-3 Activity 1&2): The regression line is the line that minimizes the sum of the squared residuals. See Day 16.
 
R-squared: The line formula yhat = b0 + b1 x tells us our best prediction or estimate of a response (y) value for a particular value of the explanatory (x) variable.  It says NOTHING about how good that "best" is; that is, it says nothing about how tight or scattered the data are around the line.  R-squared does that job.

  R^2 (= r^2 = "Coefficient of Determination") = Proportion of variability in y-values explained/accounted for by knowing x and using the regression line model.
  Un-accounted-for variability = (1 - r^2) = variance-of-residuals / total-variance-of-y's
More: R-Squared (ClassMaterials\Math151 D&V\ RegressionDemosExcel for D&V\RSquared.xls)
(Optional: Further explanation of r^2)
r^2 is the square of the correlation coefficient r!  (The sign, - or +, gets lost.)
If r = .7, about half (.49) of the variability  in the y's is accounted for by  using the regression line model to predict y from x. (If weight and height have a correlation of .7, then half of the variability in weight can be accounted for by height.)
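
To see that the two descriptions of R^2 agree, here is a minimal Python sketch on a tiny made-up dataset (not one of the class datasets). It computes r directly, then computes 1 - (variance of residuals)/(variance of y's), and the two match.

```python
# Made-up data, for illustration only.
xs = [2, 4, 6, 8, 10]
ys = [3, 7, 6, 10, 9]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares line of y on x.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual = observed - predicted. The residuals average to 0.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

var_resid = sum(e ** 2 for e in residuals) / (n - 1)
var_y = syy / (n - 1)

r = sxy / (sxx * syy) ** 0.5
r_squared = 1 - var_resid / var_y      # proportion of variability accounted for

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
print(f"1 - var(residuals)/var(y) = {r_squared:.3f}")
```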

Line is not symmetric: The regression of weight on height uses a different line from the regression of height on weight.  (Minimizing vertical residuals pulls the line "flatter" than the line that just goes through the middle of the cloud, which would rise 1 s.d. in y for each 1 s.d. run in x.  This is related to the idea of "regression to the mean," p. 139.)
   Demonstration on overhead projector; flip transparency to exchange axes.
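
A rough numerical version of the transparency flip, as a Python sketch. The heights and weights are made up. It fits weight on height, then height on weight, and redraws the second line on the same (height, weight) axes so the slopes can be compared with the "1 s.d. up per 1 s.d. run" line.

```python
# Made-up heights (inches) and weights (pounds), for illustration only.
heights = [62, 64, 66, 68, 70, 72]
weights = [120, 140, 135, 160, 155, 180]
n = len(heights)

h_bar = sum(heights) / n
w_bar = sum(weights) / n
shh = sum((h - h_bar) ** 2 for h in heights)
sww = sum((w - w_bar) ** 2 for w in weights)
shw = sum((h - h_bar) * (w - w_bar) for h, w in zip(heights, weights))

r = shw / (shh * sww) ** 0.5

slope_w_on_h = shw / shh            # pounds per inch (minimizes vertical residuals)
slope_h_on_w = shw / sww            # inches per pound (minimizes horizontal residuals)
sd_line_slope = (sww / shh) ** 0.5  # rises 1 s.d. of weight per 1 s.d. of height

# Draw the height-on-weight line on the same axes (height across, weight up):
# its slope there is the reciprocal, in pounds per inch.
flipped = 1 / slope_h_on_w

print(f"r = {r:.3f}")
print(f"weight on height: {slope_w_on_h:.2f} lb/in (flatter)")
print(f"s.d. line:        {sd_line_slope:.2f} lb/in")
print(f"height on weight: {flipped:.2f} lb/in on the same axes (steeper)")
print(f"product of the two regression slopes = {slope_w_on_h * slope_h_on_w:.3f} = r^2")
```

The two regression lines coincide only when r is +1 or -1.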

Chapter 9:  Regression (& correlation) wisdom:  What  can go wrong, things to watch out for.
Groups (subsets) may benefit from being considered separately.  Sometimes analyzing residuals can alert us to important subsets.
SPSS:  Fitting lines to groups: Govsal_vs_pay. Put a grouping variable in a Legend Variables box; Insert > FitLine > Regression will then make a line for the whole group and a line for each subgroup. In Edit mode, click on a regression equation; then Edit > Regression Parameters allows eliminating the line for the Total (or the Subgroups lines). (Using the Panel Variables box puts each group on a separate graph.) Residuals graphs and variables are only generated for the total group. To do the groups separately, you'd need to use Data > Select Cases in the editor and work with one group at a time.

Shape after linear trend removed: discussed with Patterns in Graphs of residuals.
Extrapolation p.163: last class also.  Linear approximations may be good for short term segments, lousy in long term.

Outliers and Influential points:  pp. 165-7 (Use Moore http://www.whfreeman.com/scc, or ASLeastSquaresTool)
 A point (or more) outside of the pack--an outlier--can:
-- Weaken or strengthen r (& r^2):
        If it's in the same direction as the general trend, it strengthens r. Against the trend, it weakens r.
-- Affect the slope of the regression line a lot (it has high "leverage" = is an "influential point"), if it's an outlier in the x measurement.  (Teeter-totter principle.)  We won't calculate leverage.
-- Affect the slope little, but:
     -- strengthen r^2 if it's along the main trend but farther out.
     -- pull the whole line up or down a bit, if it's in the center of the data on the x-measurement and an outlier in y. (Not an "influential point.")
Also: two clusters with little internal trend could look like a strong association when "combined."
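
If you don't have the Least Squares tool handy, the same "drag a point around" experiment can be sketched in Python. The five base points and the three added points are made up; the helper name fit is just for this example.

```python
def fit(points):
    """Return (b0, b1, r) for the least-squares line of y on x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return b0, b1, r


# Made-up base data with a strong upward trend.
base = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.0)]

cases = {
    "base data": base,
    "add (15, 15): far out, along the trend": base + [(15, 15)],
    "add (15, 2): far out in x, against the trend": base + [(15, 2)],
    "add (3, 9): middle of the x's, outlier in y": base + [(3, 9)],
}

for label, points in cases.items():
    b0, b1, r = fit(points)
    print(f"{label:46s} b0 = {b0:6.2f}  b1 = {b1:6.3f}  r = {r:6.3f}")
```

In this sketch the point at (15, 2) nearly flattens the line, while the point at (3, 9) leaves the slope unchanged and only lifts the line and weakens r.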

Anscombe's Quartet--4 (made-up) data sets with identical summary statistics (AS9HW: MRB127-46 Always Plot your Data!)
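
A short Python sketch that computes the summary statistics for the four sets. The numbers are Anscombe's published values as I have them; check them against the ActivStats dataset before relying on them. The helper name summarize is just for this example.

```python
# Anscombe's quartet: sets I-III share the same x-values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (x_bar, y_bar, b0, b1, r) for the regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return x_bar, y_bar, y_bar - b1 * x_bar, b1, sxy / (sxx * syy) ** 0.5


for name, (xs, ys) in quartet.items():
    x_bar, y_bar, b0, b1, r = summarize(xs, ys)
    print(f"Set {name:3s} xbar = {x_bar:.2f}  ybar = {y_bar:.2f}  "
          f"yhat = {b0:.2f} + {b1:.3f} x  r = {r:.3f}")
```

The printed lines agree to about two decimal places, yet scatterplots of the four sets look completely different, which is the whole point: the numbers alone cannot tell you whether a line is a sensible model.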

Summary values p.169:  If the x and/or y data have already been averaged or summarized, the relationship you plot and/or use correlation/regression to describe will look stronger than it would if you used raw data (you've already gotten rid of much of the variability).  Also watch out for data relating states, nations, or groups.
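
A made-up illustration in Python of how averaging inflates the apparent strength. There are 4 "individuals" at each of 10 x-values, with large individual deviations built in by hand (no random numbers). The correlation is computed first on the raw data, then on the 10 group averages.

```python
# Made-up data: y = x plus a large individual deviation.
deviations = [-6, -2, 2, 6]          # the 4 individuals at each x-value
raw_x, raw_y = [], []
for x in range(1, 11):
    for d in deviations:
        raw_x.append(x)
        raw_y.append(x + d)


def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


# Summarize: replace the 4 individuals at each x by their average y.
group_x = list(range(1, 11))
group_y = [
    sum(y for x, y in zip(raw_x, raw_y) if x == gx) / len(deviations)
    for gx in group_x
]

r_raw = correlation(raw_x, raw_y)
r_avg = correlation(group_x, group_y)
print(f"raw data (40 points):       r = {r_raw:.3f}, r^2 = {r_raw ** 2:.3f}")
print(f"group averages (10 points): r = {r_avg:.3f}, r^2 = {r_avg ** 2:.3f}")
```

Here the averages fall exactly on a line because the deviations were chosen to cancel; with real data the effect is less extreme but goes the same direction.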

Association does not imply causation: a "lurking" variable (p. 168) has an important effect, but is not one of the variables studied.
    Meatloaf shrinkage vs. placement in oven?  (Whether or not a cooking thermometer was used had the greatest influence.)
    Time sequence of observations is a common one.  (Learning, tiring, aging)
    The trouble with lurking variables is that by definition you don't know they're there.  Look behind every tree.


Sievers home  Math151-Fall05/Dayf18.htm 10pm 10/4/05
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility.