Hand in Fri. (D&V p. 152 ff., unless otherwise noted):
C. Use Residuals.xls (from here, or in the lab in ClassMaterial\Math151 D&V\RegressionDemosExcel for D&V) to graph these data sets, along with a graph of the residuals. Print the results, and describe the shape of the residuals. (It may help to connect the dots with pencil to see the pattern.)
A. Income depends on height?! Read the article and answer this.
Read, to discuss:
Optional: Use the ActivStats Least Squares tool (see below) and play with datasets; especially, drag points around and see what they do.
Regression line: D&V Ch 8&9, AS8&9, "Regressing y ON x"
Formula: yhat = b0 + b1 x
b1 = r times (s.d. of y)/(s.d. of x) = r sy / sx; b1 is in y-units per x-unit.
b0 = ybar - b1(xbar), from ybar = b0 + b1(xbar).
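The formulas above can be tried out in a few lines of Python. This is only a sketch: Python is not one of the course tools, and the height/weight numbers are made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical data (not from D&V): heights in inches (x), weights in pounds (y).
x = [60, 62, 65, 68, 70, 72]
y = [115, 120, 140, 155, 160, 180]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation r: sum of products of deviations, divided by (n - 1) * sx * sy.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x        # slope, in y-units per x-unit (pounds per inch here)
b0 = y_bar - b1 * x_bar   # intercept; forces the line through (xbar, ybar)

def predict(x_value):
    """yhat = b0 + b1 x"""
    return b0 + b1 * x_value

print(f"r = {r:.3f}, b1 = {b1:.3f}, b0 = {b0:.3f}")
print(f"predicted weight at the mean height: {predict(x_bar):.2f} (ybar = {y_bar:.2f})")
```

Because b0 is defined from ybar = b0 + b1(xbar), the prediction at xbar always comes out equal to ybar.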
Residual: Look at an individual observed (x,y) data pair. The residual is the "leftover" amount of y after predicting a y using the line. Visually, it is the length of the vertical line drawn from the point to the regression line (+ if the point is above the line, - if the point is below the line).
Residual = observed y - predicted y
SPSS (handout, p. 3, bottom): In Edit mode, Insert>Spikes: Spike to: Regression. Govsal-deviations.spo
"Least squares" (D&V p.144, AS8-3 Activity 1&2): The regression line is the line that minimizes the sum of the squared residuals. (RegressionLeastSqs.xls, or in Mac 101, ClassMaterials\Math151 D&V\ RegressionDemosExcel for D&V\RegressionLeastSqs.xls)
&& This method of finding a "best fit" straight line for predicting y's from x's was derived mathematically to work well with "joint normal" data (elliptical clouds). For data of this sort, the line does give the mean of the y's for each given x (at least in the abstract).
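The "least squares" idea can be checked numerically. The sketch below uses a small made-up data set, fits the line from the formulas, and then nudges the intercept and slope a little in each direction; every nudged line has a larger sum of squared residuals. The function name sum_sq_resid is just an illustrative choice.

```python
# Hypothetical data, chosen so the arithmetic is easy to check by hand.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

def sum_sq_resid(b0, b1):
    """Sum of squared vertical distances from the points to the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

best = sum_sq_resid(b0, b1)
print(f"least squares line: yhat = {b0:.2f} + {b1:.2f} x, sum of squares = {best:.2f}")

# Nudge the line a little in every direction; each nudged line does worse.
for d0 in (-0.1, 0.0, 0.1):
    for d1 in (-0.1, 0.0, 0.1):
        if (d0, d1) != (0.0, 0.0):
            nudged = sum_sq_resid(b0 + d0, b1 + d1)
            print(f"b0 {d0:+.1f}, b1 {d1:+.1f}: sum of squares = {nudged:.2f}")
```

This is the same experiment as dragging the green line in the ActivStats tool, done with numbers instead of a mouse.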
ActivStats Least Squares tool: AS8-3, rightmost button, with line and red dots. "Show" button: checkmark all possibilities, then uncheck "ShowLS Line". Choose the number of points and do "Regenerate". Move the green line to minimize the Sum of Squares (red bar), and observe the residuals as you do. Confirm your result by checking "ShowLSLine".
"Regenerate" creates "good clouds" of data. To use your own data, do "Reset", then click in the picture to make dots (but not too close to the green line, or the tool will think you're dragging the line).
Pattern in graph of residuals: (Ch9 p.162-3) If you graph residual values against x (or against predicted y's), you visually eliminate the linear portion of the association, removing the distraction of the slanted line. (The regression line "becomes" the new x-axis; a "shear" transformation.)
Excel: Residuals.xls in ClassMaterial\Math151 D&V\RegressionDemosExcel for D&V.
Curving or other structure may stand out more visibly. "Good"
fit = no structure in residuals.
SPSS: (old wing) (Handout, bottom of pp. 4 & 3) Analyze>Regression>Linear. Plots button: *ZRESID on *ZPRED. Save button, Residuals: Unstandardized calculates all the residuals and saves them as a new variable, which you can then graph. If you graph the residuals against x and add the regression line, it is now a flat line at 0 (the mean of all the residuals is 0).
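To see how structure shows up in the residuals without SPSS or Excel, here is a small Python sketch. The data are invented: they follow a curve (y = x squared), so a straight line is the wrong model, and the residuals show it.

```python
# Hypothetical curved data: y = x^2, which a straight line cannot fit well.
x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residual = observed y - predicted y
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for xi, res in zip(x, residuals):
    print(f"x = {xi}: residual = {res:+.1f}")
print(f"mean of residuals = {sum(residuals) / n:.1f}")
```

The residuals are positive at both ends and negative in the middle: a U shape, which is exactly the kind of structure a "good" fit should not leave behind. Their mean is still 0, as it is for any least squares line.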
R-squared: The line formula
yhat = b0 + b1 x
tells us our best prediction or estimate of a response (y) value for a particular value of the explanatory (x) variable. It says NOTHING about how good that "best" is; that is, it says nothing about how tight or scattered the data are around the line. R-squared does that job.
R^2 (= r^2 = "Coefficient of Determination") = proportion of variability in y-values explained/accounted for by knowing x and using the regression line model.
Un-accounted-for variability = (1 - r^2) = variance-of-residuals / total-variance-of-y's
More: R-Squared (ClassMaterials\Math151 D&V\ RegressionDemosExcel for D&V\RSquared.xls)
(Optional: Further explanation of r^2)
r^2 is the square of the correlation coefficient r! (The + or - sign gets lost.)
If r = .7, about half (.49) of the variability in the y's is accounted for by using the regression line model to predict y from x. (If weight and height have a correlation of .7, then about half of the variability in weight can be accounted for by height.)
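The two descriptions of R-squared above (the square of r, and 1 minus variance-of-residuals over variance-of-y's) can be checked against each other. A minimal sketch with made-up data, using only Python's standard library:

```python
from statistics import variance

# Hypothetical data, small enough to verify by hand.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sxy / (sxx * syy) ** 0.5   # correlation coefficient
b1 = sxy / sxx                 # same slope as r * sy / sx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
unexplained = variance(residuals) / variance(y)   # the (1 - r^2) part

print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}")
print(f"variance of residuals / variance of y's = {unexplained:.2f}")
print(f"1 - that ratio = {1 - unexplained:.2f}")
print(f"if r = .7, then r^2 = {0.7 ** 2:.2f}")
```

For these numbers r = .9, so r^2 = .81, and the residuals carry the remaining .19 of the variance in y.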
Start here Friday:
Line is not symmetric: The regression of weight on height uses a different line from the regression of height on weight. (Minimizing vertical residuals pulls the line "flatter" than the line that just goes through the middle of the cloud, which would rise 1 s.d. of y for each 1 s.d. of x. Related to the idea of "regression to the mean", p. 139.)
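The asymmetry can be seen in numbers. The sketch below (made-up data again) computes the slope for regressing y on x, the slope for regressing x on y, and the slope of the "middle of the cloud" line that rises 1 s.d. of y per 1 s.d. of x. To compare all three on the same axes, the x-on-y slope is inverted.

```python
# Hypothetical data, small enough to verify by hand.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
r = sxy / (sxx * syy) ** 0.5

slope_y_on_x = sxy / sxx               # minimizes vertical residuals
slope_x_on_y = sxy / syy               # minimizes horizontal residuals (x-units per y-unit)
sd_line_slope = (syy / sxx) ** 0.5     # sy / sx: 1 s.d. up for 1 s.d. across

# Drawn on the usual axes (y vertical), the x-on-y line has slope 1 / slope_x_on_y.
x_on_y_drawn = 1 / slope_x_on_y

print(f"y on x:  slope {slope_y_on_x:.3f}  (flatter than the s.d. line)")
print(f"s.d. line: slope {sd_line_slope:.3f}")
print(f"x on y:  slope {x_on_y_drawn:.3f}  (steeper than the s.d. line)")
print(f"product of the two regression slopes = {slope_y_on_x * slope_x_on_y:.2f} = r^2")
```

The two regression lines coincide only when r is exactly +1 or -1; otherwise the s.d. line sits between them.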
Demonstration on overhead
projector; flip transparency to exchange axes.
Chapter 9:
Regression (& correlation) wisdom: What
can go wrong, things to watch out for.
Groups (subsets) may benefit from being considered separately.
Sometimes analyzing residuals can alert us to important subsets.
SPSS: Fitting lines to groups: Put a grouping variable in a Legend Variables box, and Insert>FitLine>Regression will make a line for the whole data set and a line for each group. In Edit mode: click on a regression equation; then Edit>Regression Parameters allows eliminating the line for the Total (or the Subgroups lines). (Using the Panel Variables box instead puts each group on a separate graph.) Residuals graphs and variables are only generated for the total group. To do the groups separately, you'd need to do Data>Select Cases in the editor and work with one group at a time.
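Here is a small numeric illustration of why groups may need to be looked at separately. The two groups are invented, and corr is just a helper written for this sketch: neither group shows any linear trend on its own, yet the combined data look strongly associated.

```python
def corr(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    syy = sum((yi - y_bar) ** 2 for yi in ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

# Hypothetical groups: each is a small cluster with no internal trend.
group_a_x, group_a_y = [1, 2, 3], [2, 1, 2]
group_b_x, group_b_y = [11, 12, 13], [12, 11, 12]

r_a = corr(group_a_x, group_a_y)
r_b = corr(group_b_x, group_b_y)
r_all = corr(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"group A alone: r = {r_a:.2f}")
print(f"group B alone: r = {r_b:.2f}")
print(f"both groups combined: r = {r_all:.2f}")
```

The combined r is close to 1 only because the two clusters sit far apart; within either group, x tells you nothing about y.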
Shape after linear trend removed: see above, Patterns
in Graphs of residuals.
Extrapolation p.163: last class also. Linear
approximations may be good for short term segments, lousy in long term.
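A quick sketch of the extrapolation danger, with invented data that actually double at each step (y = 2^x). A line fitted to the first few points is tolerable inside that range and badly wrong far outside it.

```python
# Hypothetical data: the true relationship is y = 2^x, observed only for x = 0..3.
x = [0, 1, 2, 3]
y = [2 ** xi for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

def predict(x_value):
    """Straight-line prediction yhat = b0 + b1 x."""
    return b0 + b1 * x_value

for x_value in (3, 10):
    print(f"x = {x_value}: line predicts {predict(x_value):.1f}, true value is {2 ** x_value}")
```

Inside the observed range the line misses by less than 1; at x = 10 it predicts about 23 where the true value is 1024.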
Association does not imply causation. "Lurking" variable (p. 168): a variable that has an important effect but is not one of the variables studied.
Meatloaf shrinkage vs. placement in oven? (Cooking thermometer or not had the greatest influence.)
Time sequence of observations is a common one. (Learning, tiring, aging)
The trouble with lurking variables is that by definition you don't know they're there. Look behind every tree.
Outliers and Influential points: pp. 165-7 (Use
Moore http://www.whfreeman.com/scc,
or ASLeastSquaresTool)
A point (or more) outside of the pack, an outlier, can:
-- Weaken or strengthen r (& r^2): if it's in the same direction as the general trend, it strengthens; against the trend, it weakens.
-- Affect the slope of the regression line a lot (it has high "leverage" = is an "influential point"), if it's an outlier in the x measurement. (Teeter-totter principle.) We won't calculate leverage.
-- Affect the slope little, but:
   -- strengthen r^2 if it's along the main trend but farther out.
   -- pull the whole line up or down a bit, if it's in the center of the data on the x measurement and an outlier in y. (Not an "influential point".)
&& Two clusters with little internal trend could look like
a strong association when "combined."
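The teeter-totter principle for outliers can be tried with numbers as well as by dragging points. In this sketch (invented data; fit_line is a helper written for the example) five points lie exactly on y = 2x, and one extra point is added in two different places.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line yhat = b0 + b1 x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
          / sum((xi - x_bar) ** 2 for xi in xs))
    return y_bar - b1 * x_bar, b1

# Hypothetical base data lying exactly on y = 2x.
base_x = [1, 2, 3, 4, 5]
base_y = [2, 4, 6, 8, 10]

# Case 1: outlier in y only, at the center of the x values.
b0_mid, b1_mid = fit_line(base_x + [3], base_y + [20])

# Case 2: outlier in x, far to the right and off the trend.
b0_far, b1_far = fit_line(base_x + [15], base_y + [0])

b0_base, b1_base = fit_line(base_x, base_y)
print(f"no outlier:        yhat = {b0_base:.2f} + {b1_base:.2f} x")
print(f"y-outlier at xbar: yhat = {b0_mid:.2f} + {b1_mid:.2f} x  (line shifts up, slope unchanged)")
print(f"x-outlier:         yhat = {b0_far:.2f} + {b1_far:.2f} x  (slope changes sign)")
```

The point in the middle of the x values only lifts the line; the point far out in x has the leverage to tip it.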
Anscombe's Quartet--4 (made-up) data sets with identical summary statistics (AS9HW: MRB127-46 Always Plot your Data!)
Summary values p.169: If the x and/or y data have already been averaged or summarized, the relationship you plot and/or describe with correlation/regression will look stronger than it would if you used raw data (you've already gotten rid of much of the variability).
&& Watch out for data relating states, nations, groups.
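The effect of averaging can be shown with a tiny invented example. Each of four groups has three individuals whose y values scatter widely around the group's x value; the correlation is computed once from the raw data and once from the group means. The helper corr and all numbers are hypothetical.

```python
def corr(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    syy = sum((yi - y_bar) ** 2 for yi in ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

# Hypothetical raw data: four groups (x = 1..4), three individuals per group,
# with y = x plus a large individual deviation of -3, 0, or +3.
group_x = [1, 2, 3, 4]
deviations = [-3, 0, 3]
raw_x = [xi for xi in group_x for _ in deviations]
raw_y = [xi + d for xi in group_x for d in deviations]

# Summarized data: one averaged y per group.
mean_y = [sum(xi + d for d in deviations) / len(deviations) for xi in group_x]

r_raw = corr(raw_x, raw_y)
r_means = corr(group_x, mean_y)

print(f"r from raw individual data: {r_raw:.2f}")
print(f"r from group averages:      {r_means:.2f}")
```

Averaging wipes out the individual scatter, so the group means line up perfectly even though the raw association is only moderate.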
| Sievers home | Math151-Sp05/Days17.htm | 1pm | 3/9/05 |