Math 151, Spring 2002, Wednesday Day 17, March 6 (After Class)

Regression-- Review comments
ANY straight line y = a + bx  (or bx + a):  b, the coefficient of x, is the slope of the line.  If x changes one unit, y changes b units, so b is the rate of change of y with respect to x.  (If y is weight in pounds, and x is height in inches, b is the number of pounds we expect weight to go up by, per inch that height goes up.)

"Regression line of weight on height":  height = horizontal (x) axis, weight = vertical (y) axis.

LEAST SQUARES PROPERTY
"Residual at x" = y - yhat  = vertical distance between observed y and predicted y (what's left over after predicting)
    (Positive if observed is bigger than predicted, negative if observed is smaller than predicted)
Least squares principle:  Find the line that minimizes the sum of the squared residuals. (Here, or in Mac 101, ClassMaterials\Math151\RegressionDemos\RegressionLine.xls, Squares tab)
       This method of finding a "best fit" straight line for predicting y's from x's was derived mathematically to work well with "joint normal" data--elliptical clouds.  For data of this sort, the line does  give the mean of the y's for each given x (at least in the abstract.)
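A minimal sketch of the least squares principle in Python, using the standard textbook formulas b = Sxy/Sxx and a = ybar - b*xbar. The five data points and the function names are made up for illustration; they are not from the course files. Nudging a or b away from the fitted values should only make the sum of squared residuals bigger.

```python
# Made-up data for illustration only (not from the course files).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line yhat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of (observed y - predicted y)^2 for the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)
print(f"least squares line: yhat = {a:.2f} + {b:.2f}x, sum of squares = {best:.3f}")

# Any other line does worse: try nudging the intercept or the slope.
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    worse = sum_squared_residuals(xs, ys, a + da, b + db)
    print(f"a = {a + da:.2f}, b = {b + db:.2f}: sum of squares = {worse:.3f}")
```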

Drawback if the data is not the "elliptical cloud" type:
     Each residual is squared, so an outlier's large residual counts heavily:  it may be very influential in determining where the line sits.
Especially if it is at the lowest or highest x-values, it may change the slope of the line a lot.
(Activstats Least Squares tool: p. 9-2, Show residuals, show #points.)
 (Outliers toward the middle x's may not change the slope, but may affect r and r^2.)
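The effect of where an outlier sits can be sketched in Python. The data here are made up for illustration: nine points exactly on the line y = x, plus one extra point 5 units above the line, placed either at the middle x or at the highest x. The helper name slope_and_r is hypothetical.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return (least squares slope, correlation r) for the paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Made-up data: nine points exactly on y = x.
base_x = list(range(1, 10))
base_y = list(range(1, 10))

slope_base, r_base = slope_and_r(base_x, base_y)
# Outlier 5 units above the line at the middle x (x = 5).
slope_mid, r_mid = slope_and_r(base_x + [5], base_y + [10])
# Outlier 5 units above the line at the highest x (x = 9).
slope_end, r_end = slope_and_r(base_x + [9], base_y + [14])

print(f"no outlier:        slope = {slope_base:.3f}, r = {r_base:.3f}")
print(f"outlier at middle: slope = {slope_mid:.3f}, r = {r_mid:.3f}")
print(f"outlier at end:    slope = {slope_end:.3f}, r = {r_end:.3f}")
```

In this made-up case the middle outlier leaves the slope alone but pulls r down, while the end outlier tilts the line.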

Education and mortality in cities (ACT p. 8-3, bottom).  Outliers?

Plotting residuals:  This amounts to making the regression line into a new x-axis--If you plot the residuals themselves vs. the original x values, without the distraction of the slanted line, outliers and patterns other than the linear (if any) can emerge.
(Here or  ClassMaterials\Math151\RegressionDemos\ResidualsRSquared.xls , Graph of Residuals tab.)
SPSS can make a new variable of residuals. Optional HW.
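A small Python sketch of what a residual plot can reveal, assuming made-up data that follow a curve (y = x squared) rather than a line. The data and names are for illustration only. Listing the residuals against x shows a positive, negative, positive pattern, which is the sign of a curve that the straight line misses.

```python
# Made-up curved data for illustration: y = x^2.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line yhat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

a, b = least_squares_line(xs, ys)

# Residual = observed y - predicted y, one per data point.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# "Plot" residuals vs. the original x values as a simple listing.
for x, res in zip(xs, residuals):
    print(f"x = {x}: residual = {res:+.2f}")
```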
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Will do all of 2.4 next time.  Cautions  Sec. 2.4
Plot the data: Summary formulas and numbers don't tell the whole story.  (Anscombe's quartet, p.127, 2.46-7, also in ACT HW ch.9)

Extrapolation-- extra (outside) polation (putting a point): Using the line to predict outside the range of x's you have data for.  Unavoidable if x is time, but always risky--nothing says the mechanism you see will persist over a wider range.
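A sketch of how extrapolation can go wrong, in Python. The "true" relationship here is an assumption chosen for illustration: a made-up curve that levels off below 100. A line fitted to x = 1 to 4 looks reasonable inside that range but predicts far too high at x = 10.

```python
def true_y(x):
    """Made-up relationship that levels off: never reaches 100."""
    return 100 * (1 - 0.5 ** x)

# We only have data for x = 1..4.
xs = [1, 2, 3, 4]
ys = [true_y(x) for x in xs]

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line yhat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

a, b = least_squares_line(xs, ys)

# Predict far outside the range of x's we have data for.
predicted_at_10 = a + b * 10
true_at_10 = true_y(10)
print(f"line predicts {predicted_at_10:.1f} at x = 10; the curve gives {true_at_10:.1f}")
```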

Averaged data will produce a stronger relationship (higher correlation, r^2) than the merged raw data from individuals (the averaging hides much of the variability).  You did a problem on height vs. age--those were averaged values.
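A Python sketch of the averaging effect, with made-up data: at each x from 1 to 5 there are five "individuals" scattered evenly around 2x. The correlation of the group averages comes out much higher than the correlation of the raw individual values. All numbers and names are assumptions for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r for the paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Made-up individuals: at each x, five y-values spread around 2x.
offsets = [-4, -2, 0, 2, 4]
raw_x, raw_y = [], []
mean_x, mean_y = [], []
for x in range(1, 6):
    group = [2 * x + d for d in offsets]
    raw_x.extend([x] * len(group))          # one x per individual
    raw_y.extend(group)
    mean_x.append(x)
    mean_y.append(sum(group) / len(group))  # average hides the spread

r_raw = correlation(raw_x, raw_y)
r_means = correlation(mean_x, mean_y)
print(f"individuals: r = {r_raw:.3f}, r^2 = {r_raw ** 2:.3f}")
print(f"averages:    r = {r_means:.3f}, r^2 = {r_means ** 2:.3f}")
```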

Lurking variables and association/causation next time.



PreClass assignment Day 17  for Day18
I'll finish Moore 2.4, so read it.  I probably won't start Ch. 3 till Monday, but here's (all or most of) the preclass work for it.
Activstats, Sample surveys, 10-1  Know Sample/Population, Simple Random Sample.  (Don't get bogged down in taking your own potato sample.  If it's confusing, skip it) Do the last activity p. 10-1, pop. size doesn't matter.
10-2 Know Bias, Voluntary Response bias, Nonresponse, Undercoverage. 
Do 2nd activity, write down Literary Digest Prediction, Actual vote percentages.

HW assignment Day 17, Wednesday March 6
ACT: from Activstats Homework.  Moore: from The Basic Practice of Statistics.
Reading:  Finish 2.3, read 2.4.   Skip 2.5. Ahead in Ch. 3.
Hand in 
A.  Use ResidualsRSquared to graph these data sets, along with a graph of the residuals.  Print the results, and describe the shape of the residuals (it may help to connect the dots with pencil, to see the pattern.)
a)  x  1 2 8 4 6 9
    y  1 3 6 6 7 5
b)  x  1 2 7 4 6 9
    y  7 6 2 4 2 1
Moore p. 122, 2.36 speed&gas again a, b, c, d.   There is a data file for problem 2.36, and its third column is the residuals (check them against the book).

Moore p. 123, 2.38 Gesell first word: a point in the middle of the x range. Get the data into SPSS, delete child 19, graph and get the regression line and r^2.  Use the formula on p.117 and graph the line for the full data set by hand on your printout.   r^2 for the full data set is on p. 122.

Moore p. 122, 2.37 Calories (this data set is in ACT ch 8 HW MRB-3, or, from Moore's files, in TA02-04).  Graph and get lines in SPSS with and without the outliers.  Graph the line for "without outliers" by hand on the printout for "with outliers" so you can compare them better.  Print one more graph (with outliers) and keep it for problem B below.
= = = = = = = = = = = = = = = = 
Sec. 2.4, all Moore, Postponed to Day 18 HW
 p. 131, 2.53 farm population (SPSS)
   Also connect the dots, or plot the residuals--is there any curve to the relationship?
p. 132  2.54 Dow average/stocks
p. 138 2.63 math&verbal r, states/individuals
B.  Look again at p. 122, 2.37.   These values are averages over a bunch of people's guesses.  What would the graph look like if all the individuals' separate guesses had been graphed?  Add points to your graph.

Read,  Optional 
SPSS will make residuals:  Do Analyze>Regression>Linear (a new menu for us)
Click your variables into Independent (X) and Dependent(Y). 
Hit the Button "Save...": Checkbox Residuals: Unstandardized. (Also here is Distance:Leverage values, as in ACT 9-4) Continue, Ok out of the menus.  You'll get output; ignore it. 
You'll get a new variable, the residuals. (and another, the leverages, if you do that)
Try it with the data file for problem 2.36, with speed and gas.  You'll get a fourth variable that should be the same as the residuals variable already in the file (the third column).
 

 


Sievers home  Math151-Sp02/Day17.htm  11pm 3/05/02
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement of responsibility.