MATH 251, Probability and Statistics I, Fall 2001, Sept. 24, Day 11

Corrections to Day 10:
--Vertical distance from a point to the regression line: "Error" = "Residual" = "Deviation" = (yi - yhati)
--Answer to problem B, Day 10:  the w that minimizes Sum (yi - w)^2  is ybar, the mean of the y's.
(Note that with ybar in place of w, Sum (yi - w)^2 is the top of the variance formula.)
So in the context of mean/s.d., the least squares criterion for the line fits right in.
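
A quick numerical check of the problem B answer, as a sketch only: the y values are made up, and the grid search is just a brute-force stand-in for the calculus argument. It tries many candidate values of w and reports the one with the smallest sum of squares next to ybar.

```python
from statistics import mean, variance

def sum_sq(ys, w):
    """Sum of squared deviations of the y's from a candidate center w."""
    return sum((y - w) ** 2 for y in ys)

ys = [2.0, 3.0, 5.0, 10.0]  # made-up values, not from the text
ybar = mean(ys)

# Try candidate centers from 0 to 12 in steps of 0.01 and keep the best one.
grid = [k / 100 for k in range(0, 1201)]
best_w = min(grid, key=lambda w: sum_sq(ys, w))

print("ybar =", ybar, " best w on the grid =", best_w)
# Dividing the minimized sum by n - 1 gives the sample variance.
print(sum_sq(ys, ybar) / (len(ys) - 1), variance(ys))
```

Changing the y values (and widening the grid if needed) should still leave the best w sitting at the mean, up to the grid spacing.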

SPSS: Residuals and DIFFITs--Linear Regression, Save button--adds columns of these values to your data file; then you can analyze them however you want.
SPSS gives 5 choices for residuals. Unstandardized is the raw residual.  What Moore calls "Studentized" (p. 163) is called "Deleted" in SPSS--"Deleted" will emphasize outliers, since it compares each value with the pack not including itself.  (SPSS's Studentized uses a "standard deviation" that factors in how far a point is from the middle on the x-line.  It also tends to emphasize outliers.)  It doesn't seem to make a lot of difference which one you use.
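
To see how these flavors of residual relate, here is a rough sketch using the usual textbook formulas for a straight-line fit. The data are made up, the function name residual_table is hypothetical, and the assumption is that "deleted" means e/(1 - h) and "studentized" means e/(s*sqrt(1 - h)), where h is the leverage; SPSS's saved columns may not match these digit for digit.

```python
from math import sqrt

def residual_table(xs, ys):
    """Raw, studentized, and deleted residuals for a least-squares line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = ybar - b * xbar      # intercept

    raw = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = sqrt(sum(e ** 2 for e in raw) / (n - 2))   # residual standard deviation
    # Leverage: grows with distance from the middle of the x's.
    lev = [1 / n + (x - xbar) ** 2 / sxx for x in xs]
    studentized = [e / (s * sqrt(1 - h)) for e, h in zip(raw, lev)]
    # Deleted residual: what the residual would be if the line were fit without that point.
    deleted = [e / (1 - h) for e, h in zip(raw, lev)]
    return raw, studentized, deleted

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 15.0, 12.1, 13.8, 16.1]  # made up; fifth point is a planted outlier

raw, studentized, deleted = residual_table(xs, ys)
for i, (e, t, d) in enumerate(zip(raw, studentized, deleted), start=1):
    print(f"{i}: raw={e:6.2f}  studentized={t:6.2f}  deleted={d:6.2f}")
```

Since 1 - h is between 0 and 1, a deleted residual is never smaller in size than the raw one, which is one way to see why it emphasizes outliers.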

Perhaps start with the normal probability plots to spot oddballs.  If you used Analyze>Desc.Stats.>Explore: Set Markers By, then Plots: Normality plots, you can identify the oddballs with Point ID. Explore's normal probability plot sometimes leaves off the smallest value.  Bad bug!  Use Graphs/QQ plot instead (at least to see if they're the same).  Downside--can't label points except with row numbers. But ask for Outliers in the Explore: Statistics button and it will give you a list, with numbers and labels if you chose labels in the main box.

Other profitable explorations are the residuals vs. the independent variable ("detrending" the values) and DIFFITs vs. the independent variable, to see where the outliers came from.
Also plot either one vs. order of observation, looking for a "fatigue" or "running in" factor (Graphs>Sequence).
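
For a feel of what a DIFFITs-style value measures, here is a brute-force sketch. It assumes the statistic is the change in a point's predicted value when the line is refit without that point; the data and the function names fit_line and dffit_values are made up for illustration, and SPSS may report a standardized version as well.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y on x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return ybar - b * xbar, b

def dffit_values(xs, ys):
    """For each point: prediction with all points minus prediction without it."""
    a, b = fit_line(xs, ys)
    out = []
    for i in range(len(xs)):
        # Refit the line with point i left out.
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        out.append((a + b * xs[i]) - (a_i + b_i * xs[i]))
    return out

# Made-up data: seven points near y = 2x, plus one far-out x with a low y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 20.0]

dffits = dffit_values(xs, ys)
for x, d in zip(xs, dffits):
    print(f"x={x:5.1f}  change in fit={d:7.2f}")
```

Plotting these values against x (or against order of observation) is the by-hand version of the explorations described above.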
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CAUTIONS:
Association does not prove causation! (Sec. 2.7)
Correlation/regression only capture linear association (lots of things are almost linear over a short interval)
   Extrapolation  (but maybe not linear over a longer interval)
   Restricted-range problem (range not enough to uncover true relationship)
Lurking variables
   Influential points, outliers (squared errors make the least-squares fit very non-resistant)
Mixing 2 (or more) groups can diffuse or even reverse association (pp. 167-8--"Simpson's Paradox")
Averaged data will show a stronger correlation than non-averaged data (e.g., country data).
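
The caution about mixing groups can be seen in a tiny made-up example (the numbers and the helper name corr are invented for illustration only): within each group y goes up with x, but lumping the two groups together reverses the sign of the correlation.

```python
from math import sqrt

def corr(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Group A: low x, high y.  Group B: high x, low y.  (Made-up values.)
xa, ya = [1.0, 2.0, 3.0, 4.0], [11.0, 12.5, 12.5, 14.0]
xb, yb = [6.0, 7.0, 8.0, 9.0], [1.0, 2.5, 2.5, 4.0]

r_a = corr(xa, ya)
r_b = corr(xb, yb)
r_all = corr(xa + xb, ya + yb)   # the two groups mixed together

print(f"group A r = {r_a:.2f}, group B r = {r_b:.2f}, combined r = {r_all:.2f}")
```

Here the group label is acting as the lurking variable: ignoring it makes an increasing relationship look like a decreasing one.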

Day 11, Monday Sept 24, finishing text 2.3, 2.4. SPSS manual sec. 2.2, pp. 62-66 (top).
Next:  Proceed onward through ch. 2: 2.5 next, then 2.6, 2.7
Hand in: 
p. 151, 2.42 degree days, predict both ways.  Also graph both.
If you have the computer skills, bring both graphs into 
Word, and use the drawing tools to flip one around the 
diagonal so they both have the same axes.  This is (more 
or less) what was done for Fig. 2.16 (Hubble). 
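
Why "both ways" gives two different lines, as a sketch with made-up numbers (not the degree-day data; the function name slope is hypothetical): the slope for predicting y from x is r*sy/sx, the slope for predicting x from y is r*sx/sy, so their product is r^2. Flipped onto the same axes, the second line has slope 1/b_xy, which matches the first only when r is +1 or -1.

```python
from math import sqrt

def slope(us, vs):
    """Least-squares slope for predicting v from u."""
    n = len(us)
    ubar = sum(us) / n
    vbar = sum(vs) / n
    suv = sum((u - ubar) * (v - vbar) for u, v in zip(us, vs))
    suu = sum((u - ubar) ** 2 for u in us)
    return suv / suu

xs = [10.0, 20.0, 30.0, 40.0, 50.0]   # made-up values
ys = [3.0, 4.5, 7.5, 8.0, 11.0]

b_yx = slope(xs, ys)   # predict y from x
b_xy = slope(ys, xs)   # predict x from y

# Correlation written as r = b_yx * (sx/sy); the 1/(n-1) factors cancel in the ratio.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sx = sqrt(sum((x - xbar) ** 2 for x in xs))
sy = sqrt(sum((y - ybar) ** 2 for y in ys))
r = b_yx * sx / sy

print("slope of y on x:", b_yx)
print("slope of x on y, flipped to the same axes:", 1 / b_xy)
print("product of the two slopes:", b_yx * b_xy, " r squared:", r ** 2)
```

Both lines pass through (xbar, ybar); with a positive r below 1, the flipped x-on-y line is the steeper of the two.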


Residuals/influential points 
p. 171, 2.54 gas chromatography-plot residuals
p. 176, 2.64 particulates. Also, with part c,
    find the DIFFITs values, plot them
   (do a histogram and a QQ plot from the Graphs menu)
   and see if the results match your eyeball.
Read, discuss
2.67 mean stride rates/raw
2.68 Baseball salaries--resid

Optional
2.53 golf   Use SPSS, and find the DIFFITS for the 11 points. See how these pick out the outlier/influential point.


Math251-Fall01/DayP11.htm      10pm    9/23/01
This page belongs to Sally Sievers, who is solely responsible for its content.