| Hand in Fri. (D&V p.152 ff.) Chapter 9: D&V p. 174ff. unless otherwise noted: Groups, from ActivStats Ch7 HW (same datasets you used before, different questions) Find Datasets in SPSS from this page 15, 16 Gestation (note, these are summarized data) |
Read, to discuss Ch.9 7a-d Reading |
Optional: Use ActivStats Least Squares tool (see below) and play with datasets; especially drag points around and see what they do. |
Regression line: D&V Ch 8&9, AS8&9, "Regressing y ON x"
Formula: yhat = b0 + b1 x
b1 = r times (s.d. of y)/(s.d. of x) = r sy/sx; b1 is in y-units per x-unit.
b0 = ybar - b1(xbar), from ybar = b0 + b1(xbar): the line passes through the point (xbar, ybar).
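A minimal sketch of these two formulas in Python, in case it helps to see them worked on numbers. The function name and the height/weight values are made up for illustration; they are not one of the class datasets.

```python
import statistics as st

def regression_from_summaries(xs, ys):
    """Return (b0, b1) using b1 = r*sy/sx and b0 = ybar - b1*xbar."""
    n = len(xs)
    xbar, ybar = st.mean(xs), st.mean(ys)
    sx, sy = st.stdev(xs), st.stdev(ys)  # sample s.d. (n - 1 divisor)
    # r = sum of products of z-scores, divided by n - 1
    r = sum((x - xbar) / sx * (y - ybar) / sy for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * sy / sx          # slope, in y-units per x-unit
    b0 = ybar - b1 * xbar     # intercept, so the line goes through (xbar, ybar)
    return b0, b1

# Made-up data: heights (inches) and weights (pounds)
heights = [60, 62, 64, 66, 68, 70]
weights = [110, 120, 125, 140, 150, 165]

b0, b1 = regression_from_summaries(heights, weights)
print(f"yhat = {b0:.2f} + {b1:.2f} x")
```

Here b1 comes out in pounds per inch, matching the "y-units per x-unit" reading of the slope.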
Residual: Residual = observed - predicted = y - yhat
"Least squares" (D&V p.144, AS8-3 Activity 1&2): The regression line is the line that minimizes the sum of the squared residuals. See Day 16
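One way to convince yourself of the "least squares" property is to nudge the line and watch the sum of squared residuals go up. The sketch below does that on a small made-up data set; the helper names `sse` and `least_squares` are chosen here for illustration.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared residuals (observed - predicted) for the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Intercept and slope of the least-squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1

# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

b0, b1 = least_squares(xs, ys)
best = sse(b0, b1, xs, ys)
print(f"least-squares line: yhat = {b0:.2f} + {b1:.2f} x, SSE = {best:.2f}")

# Nudge the intercept or the slope: every nudged line has a larger SSE.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]
nudged = [sse(b0 + d0, b1 + d1, xs, ys) for d0, d1 in nudges]
for (d0, d1), value in zip(nudges, nudged):
    print(f"intercept {d0:+.1f}, slope {d1:+.1f}: SSE = {value:.2f}")
```

This is the same experiment as dragging the line around in the ActivStats Least Squares tool, just done with numbers.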
R-squared: The line formula yhat = b0 + b1 x tells us our best prediction or estimate of a response (y) value for a particular value of the explanatory (x) variable. It says NOTHING about how good that "best" is--that is, it says nothing about how tight or scattered the data are around the line. R-squared does that job.
R2 (= r2 = "Coefficient of Determination") = proportion of variability in y-values explained/accounted for by knowing x and using the regression line model.
Un-accounted-for variability = (1 - r2) = variance-of-residuals / total-variance-of-y's
More: R-Squared (ClassMaterials\Math151 D&V\ RegressionDemosExcel for D&V\RSquared.xls)
(Optional: Further explanation of r2)
r2 is the square of the correlation coefficient r! (The - or + sign gets lost.)
If r = .7, about half (.49) of the variability in the y's is accounted for by using the regression line model to predict y from x. (If weight and height have a correlation of .7, then about half of the variability in weight can be accounted for by height.)
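The two descriptions of r2 (square of r, and 1 minus the unexplained share of variability) can be checked against each other. A small sketch on made-up data, with a hypothetical function name:

```python
def r_squared_two_ways(xs, ys):
    """Return (r**2, 1 - variance of residuals / variance of y)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

    # Residuals average to zero, so the ratio of variances equals the
    # ratio of sums of squares (the divisor cancels).
    unexplained = sum(e ** 2 for e in residuals) / syy

    r = sxy / (sxx * syy) ** 0.5
    return r ** 2, 1 - unexplained

# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

from_r, from_residuals = r_squared_two_ways(xs, ys)
print(f"r squared            = {from_r:.4f}")
print(f"1 - unexplained part = {from_residuals:.4f}")
```

Both numbers agree, which is the point: squaring r and comparing residual variance to total variance are the same measurement.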
Line is not symmetric: The regression of weight on height uses a different line from the regression of height on weight. (Minimizing vertical residuals pulls the line "flatter" than the line that just goes through the middle of the cloud, which would rise 1 s.d. of y for each 1 s.d. of x. The regression line rises only r s.d.'s of y per s.d. of x. Related to the idea of "regression to the mean" p. 139)
Demonstration on overhead projector; flip transparency to exchange axes.
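A numeric version of the transparency flip, as a rough sketch on made-up height/weight values. It compares three slopes on the same axes (height across, weight up): the weight-on-height regression line, the "s.d. line" through the middle of the cloud, and the height-on-weight regression line redrawn on those axes.

```python
import statistics as st

def corr(xs, ys):
    """Correlation coefficient r."""
    xbar, ybar = st.mean(xs), st.mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up data: heights (inches) and weights (pounds)
heights = [60, 62, 64, 66, 68, 70]
weights = [110, 120, 125, 140, 150, 165]

r = corr(heights, weights)
sx, sy = st.stdev(heights), st.stdev(weights)

slope_w_on_h = r * sy / sx   # pounds per inch, predicting weight from height
slope_h_on_w = r * sx / sy   # inches per pound, predicting height from weight
sd_line = sy / sx            # rises 1 s.d. of y per 1 s.d. of x

# Drawn on the same axes, the height-on-weight line has slope 1/slope_h_on_w.
h_on_w_same_axes = 1 / slope_h_on_w

print(f"weight on height: {slope_w_on_h:.3f}")
print(f"s.d. line:        {sd_line:.3f}")
print(f"height on weight: {h_on_w_same_axes:.3f} (same axes)")
```

When r is between 0 and 1 the three slopes come out in that order, flattest first, and the two regression slopes multiply to r2.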
Chapter 9: Regression (& correlation) wisdom: What can go wrong, things to watch out for.
Groups (subsets) may benefit from being considered separately. Sometimes analyzing residuals can alert us to important subsets.
SPSS: Fitting lines to groups: Govsal_vs_pay
Put a grouping variable in a Legend Variables box, and Insert>FitLine>Regression will make a line for the whole and lines for each group.
In Edit mode: Click on a regression equation; then Edit>Regression Parameters allows eliminating the line for the Total (or the Subgroups lines). (Using the Panel Variables box puts each group on a separate graph.)
Residuals graphs and variables are only generated for the total group. To do the groups separately, you'd need to do Data>Select Cases in the editor, and work with one group at a time.
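Outside SPSS, the same "one line for the total, one line per group" idea can be sketched in a few lines of Python. The data and the group labels "A" and "B" below are made up for illustration; they are not the Govsal_vs_pay file.

```python
def slope_intercept(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    b1 = sxy / sxx
    return b1, ybar - b1 * xbar

# Made-up rows: (group label, x, y)
rows = [
    ("A", 1, 2), ("A", 2, 3), ("A", 3, 4), ("A", 4, 5),
    ("B", 5, 1), ("B", 6, 2), ("B", 7, 3), ("B", 8, 4),
]

# Line for the total group
total_slope, total_intercept = slope_intercept([(x, y) for _, x, y in rows])
print(f"Total: yhat = {total_intercept:.2f} + {total_slope:.3f} x")

# One line per group (the analogue of selecting one group's cases at a time)
group_slopes = {}
for label in sorted({g for g, _, _ in rows}):
    subset = [(x, y) for g, x, y in rows if g == label]
    b1, b0 = slope_intercept(subset)
    group_slopes[label] = b1
    print(f"Group {label}: yhat = {b0:.2f} + {b1:.3f} x")
```

In this made-up case each group rises steadily while the line for the total is nearly flat, which is the kind of situation where considering the groups separately pays off.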
Shape after linear trend removed: discussed with Patterns in Graphs of residuals.
Extrapolation p.163: last class also. Linear approximations may be good for short term segments, lousy in long term.
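A quick illustration of the extrapolation warning, assuming a made-up quantity that grows 10% per period (an assumption for this sketch, not an example from the text). A line is fitted to the first six periods and then used both inside and far outside that range.

```python
def least_squares(xs, ys):
    """Intercept and slope of the least-squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1

def actual(t):
    """Made-up curved relationship: 10% growth per period."""
    return 100 * 1.1 ** t

# Fit a line to the short-term segment t = 0..5
ts = list(range(6))
b0, b1 = least_squares(ts, [actual(t) for t in ts])

def relative_error(t):
    """How far the line's prediction is from the actual value, as a fraction."""
    predicted = b0 + b1 * t
    return abs(predicted - actual(t)) / actual(t)

short_term = relative_error(5)    # inside the range used for the fit
long_term = relative_error(30)    # far outside it

print(f"relative error at t = 5:  {short_term:.1%}")
print(f"relative error at t = 30: {long_term:.1%}")
```

The line tracks the curve closely over the segment it was fitted to and misses badly far beyond it.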
Outliers and Influential points: pp. 165-7 (Use Moore http://www.whfreeman.com/scc, or ASLeastSquaresTool)
A point (or more) outside of the pack--an outlier--can:
-- Weaken or strengthen r (& r2): If it's in the same direction as the general trend, it strengthens r. Against the trend, it weakens r.
-- Affect the slope of the regression line a lot (has high "leverage" = is an "influential point"), if it's an outlier in the x measurement. (Teeter-totter principle) We won't calculate leverage.
-- Affect the slope little, but:
   -- strengthen r2 if it's along the main trend but farther out.
   -- pull the whole line up or down a bit, if it's in the center of the data on the x-measurement and an outlier in y. (Not an "influential point")
Two clusters with little internal trend could look like a strong association when "combined."
Anscombe's Quartet--4 (made-up) data sets with identical summary statistics (AS9HW: MRB127-46 Always Plot your Data!)
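The outlier effects listed above can be tried out numerically, much like dragging a point in the Least Squares tool. The sketch below adds one extra point at a time to a small made-up data set and refits; the point coordinates are chosen here purely for illustration.

```python
def fit(points):
    """Least-squares slope, intercept and correlation r for (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    syy = sum((y - ybar) ** 2 for _, y in points)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    b1 = sxy / sxx
    return b1, ybar - b1 * xbar, sxy / (sxx * syy) ** 0.5

# Made-up base data, plus one added point per case
base = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 6)]
cases = {
    "no outlier": base,
    "far out in x, along trend": base + [(15, 13.8)],
    "far out in x, against trend": base + [(15, 2)],
    "middle of x, outlier in y": base + [(3, 12)],
}

results = {name: fit(points) for name, points in cases.items()}
for name, (b1, b0, r) in results.items():
    print(f"{name:28s} slope={b1:7.3f}  intercept={b0:7.3f}  r={r:6.3f}")
```

The point far out in x along the trend leaves the slope alone and strengthens r; the one far out in x against the trend drags the slope across zero; the y-outlier in the middle of the x's leaves the slope alone, lifts the whole line, and weakens r.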
Summary values p.169: If the x and/or y data have already been averaged or summarized, the relationship you plot and/or use correlation/regression to describe will look stronger than it would if you used raw data (you've already gotten rid of much of the variability).
Watch out for data relating states, nations, groups.
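To see why summarized data look stronger, here is a small sketch with made-up numbers: five groups of four individuals each, where individuals scatter widely around their group's average. The correlation is computed once on the raw individuals and once on the group averages.

```python
def corr(xs, ys):
    """Correlation coefficient r."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up raw data: in group g every individual has x = g,
# and y scatters around g by -3, -1, +1, +3.
raw_x, raw_y = [], []
for g in range(1, 6):
    for scatter in (-3, -1, 1, 3):
        raw_x.append(g)
        raw_y.append(g + scatter)

# Summarized data: one (x, average y) pair per group
groups = sorted(set(raw_x))
avg_y = [
    sum(y for x, y in zip(raw_x, raw_y) if x == g) / raw_x.count(g)
    for g in groups
]

r_raw = corr(raw_x, raw_y)
r_avg = corr(groups, avg_y)
print(f"r using raw individuals: {r_raw:.3f}")
print(f"r using group averages:  {r_avg:.3f}")
```

Averaging wipes out the individual scatter, so the group averages fall on a perfect line even though the raw individuals show only a moderate association.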
Association does not imply causation. "Lurking" variable (p. 168): a variable that has an important effect but is not one of the variables studied.
Meatloaf shrinkage vs. placement in oven? (Whether or not a cooking thermometer was used had the greatest influence.)
Time sequence of observations is a common one. (Learning, tiring, aging)
The trouble with lurking variables is that by definition you don't know they're there. Look behind every tree.
| Sievers home | Math151-Fall05/Dayf18.htm | 10pm | 10/4/05 |