MATH 251, P&S I, Fall 2007, W. Sept. 19, Day 12.after class. Addition

Reading: finishing text 2.3, 2.4.  Proceed onward through ch. 2: 2.5 (Causality), then return to transforming relationships (Handout, + pp. 143-145 + virtual text "Sec. 2.6").

Handout (log transformation, 1-variable). + IPS 5th ed. pp. 143-5
 + Sec. 2.6: 1 copy outside my door, + IPS4e on reserve.  This was Sec. 2.6 in IPS 4th ed. (pp. 187-203 for the text; figure numbers there are 2 less than in the download, e.g. fig. 2.30 in the 4th ed. is 2.32 in the 5th).  Or download it (Website, "Supplemental Material"; it may also be on your CD, though mine was missing all the figures and tables).  We want pp. 2-18 for the text.  Download the Acrobat file of pp. 2-25.  I'm giving you the HW problems I'm asking for in the Handout.  Problems 2.118 on are in 2.6.

Hand in: 
Sec. 2.4
p. 164, 2.63 infant growth (SPSS) The residuals are already in the problem's file.
(Draw the horizontal line at 0 on your residuals plot by hand. SPSS will do it with difficulty; don't bother.)
Also, follow the directions on the Scatterplot handout (p. 4 bottom, cont'd p. 3 bottom) &/or below to create a new variable containing the residuals.  Check that it duplicates the given residuals column.
2.65 infant growth is averages

Governors' Salaries HW:  do 10, 12, which completes the questions.  (Create the residuals and graph them vs. average pay.  Note your graph is the reverse of that on p. 4 of the handout.)  Hand everything in.

2.67 and 2.68  lurking variables  (Recall 2.79 from Day 10)
2.73, 2.75  mileage again. (SPSS) The easiest thing to do with the unwanted cases is to delete them and save the data file under a different name.  (To use Data>Select cases: if... with string variables, put the string in single quotes.)
p. 186, 2.106 speed/strideM/F (SPSS)(Regress stride on speed). The file as pre-supplied is not organized right for SPSS.  Here it is correctly, in SPSS form  ex2-106runstride.sav   and text form ex2-106runstride.dat.

Sec. 2.5 causality
2.88  health and wealth
2.89 music
2.91  miscarriage and transistors. What information would be helpful to study/eliminate the confounding variable of standing up?
2.92 hospital stay/size

Preview for Transforming:
2.107  bacteria death (SPSS) (Read pp. 143-5 with this.) (For b: use Transform:Compute: lncount = ln(count) to make a new variable of the natural logarithm of count.  You can paste in the formula from the Functions box.  To check this is the right one, do Help: get Computing variables; pick Functions, then Arithmetic Functions, and read.)

- - - - - Postpone the rest.- - - - - - - - - - - - - 
Transforming:  For the following you may need to Transform your x or y-data to a new variable in SPSS.  Use Transform>compute:  Use the function LG10( ) for the log base 10, LN( ) for natural log,  x^3 for x cubed.  Use  log base 10  unless told otherwise; but it really doesn't matter much. 

A. (SPSS)  Table 1.5 (tornado damage) and Table 1.8 (guinea pig survival) gave histograms highly skewed right.  For each of these data sets:  Make a histogram, take the log of the data and make a new histogram.  Tell if this transformation makes a "nicer" (more symmetric) graph.

Problems are on handout. SPSS files are linked to from here. (The .sav files are now on the website. They don't seem to want to open directly into SPSS, at least on my office machine, though they should. You'll probably need to download them, then open with SPSS. The .por files are still there on the website, but not directly linked to any more.)
2.118 (not spss)  b, d Monotonic
2.123 (SPSS) fish weight
2.124 (SPSS) fish width (above file)
2.129 (SPSS) American population
2.121 (SPSS)  isotope decay
2.136 heart rate
2.131 tree biomass
2.138 (SPSS) tree seeds

Read, discuss 
2.78 Applet exploration of outlier.  Watch also r, and think about r-squared.

2.67 grade inflation
2.69 fidgeting or BMR? look in the back for the numbers.
2.76 mean stride rates/raw
2.83 baseball pay--reading residuals

 

p. 179
2.85 marriage
- - - - - 
Postpone:
On handout:
2.118 a, d
2.119 sin

2.134, 2.135 strength, weight. 
 

Optional
Postpone:
 For problem A, if the log transformation didn't do a good job, work through the ladder of powers and look for one that does better.

On handout:
2.120 transistors , Moore's law


HW Questions?
--Answer to problem B, Day 10:  the w that minimizes Sum (y_i - w)^2  is ybar, the mean of the y's.
(Note that with ybar in place of w, Sum (y_i - w)^2 is the top of the variance formula.)
So in the context of mean/s.d., the least squares criterion for the line fits right in.
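
A quick numerical check of the answer to problem B, as a sketch only. The five y values are made up for illustration, and the grid search stands in for the calculus argument.

```python
# Check numerically that w = ybar minimizes Sum (y_i - w)^2.
# The data values are invented for illustration.
y = [2, 4, 4, 5, 10]
ybar = sum(y) / len(y)

def sum_sq(w):
    """Sum of squared deviations of the y's from a candidate center w."""
    return sum((yi - w) ** 2 for yi in y)

# Candidate centers 0.0, 0.1, ..., 12.0.
candidates = [k / 10 for k in range(0, 121)]
best_w = min(candidates, key=sum_sq)

print("ybar               =", ybar)
print("best w on the grid =", best_w)
print("Sum of squares at ybar:", sum_sq(ybar))
print("Sum of squares at the median (4):", sum_sq(4))
```

Any other center, including the median, gives a larger sum of squares than the mean does.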

R-squared?  Day 11

 Residuals (2.4):  "DEtrend" the data by graphing residuals--then y=0 line replaces slanted regression line.  Residuals should show no clear  patterns, if the regression line's a good fit.  By "detrending" the data set, sometimes subtle characteristics (like a curve) are uncovered.  Excel Residuals

SPSS: Analyze> Linear Regression, horizontal axis variable to Independent box, vertical axis variable to Dependent box.   Save button--adds columns of these values to your data file; then you can analyze them however you want.  Choose Residuals: Unstandardized  and Predicted values: Unstandardized .
See Scatterplot handout, bottom of pp. 4 and 3.  The Plots button graphs residuals against the predicted y values, not against the x-variable as IPS shows.  This doesn't matter much, since y-predicted is a linear transformation of x, but if the slope is negative, the plot will look "backward".
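
For anyone who wants to see what the Save button is computing, here is a minimal sketch in Python (not SPSS). The data are made up, with a negative slope chosen on purpose so the "backward" ordering shows up; the variable names are my own.

```python
# Least-squares line, predicted values, and residuals by hand.
# Data invented for illustration (negative slope on purpose).
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 4, 1]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar

predicted = [intercept + slope * xi for xi in x]        # "Predicted values: Unstandardized"
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # "Residuals: Unstandardized"

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"x={xi}  y={yi}  predicted={pi:5.2f}  residual={ri:5.2f}")

# Left-to-right order of the points when plotted against x vs. against predicted y.
order_by_x = sorted(range(n), key=lambda i: x[i])
order_by_predicted = sorted(range(n), key=lambda i: predicted[i])
print("order by x:        ", order_by_x)
print("order by predicted:", order_by_predicted)
```

With a negative slope the two orderings are exact reverses of each other, which is why the residual plot against y-predicted is the mirror image of the one against x.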

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CAUTIONS:
Correlation/regression only capture linear association (lots of things are almost linear over a short interval)
   Extrapolation is dangerous-- (maybe not linear over a longer interval)
   Restricted-range problem (range not enough to uncover true relationship, which could be more strongly linear if x's had a bigger range (IPS).  OR:  it might be curved--Extrapolation.)
Influential points, outliers (squared errors make the least-squares line very non-resistant).  Explore with Applet, http://www.whfreeman.com/ips5e
Outlier may or may not be "influential", in terms of changing line.
   May increase r-squared (if "in line" and outlying  in x direction.)
   May decrease r-squared (if outlying in y-direction)
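
If you can't get to the Applet, the two r-squared effects above can be sketched with a few invented points. The data and the helper name `fit` are assumptions for illustration only.

```python
def fit(points):
    """Return (slope, intercept, r_squared) of the least-squares line for (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    syy = sum((y - ybar) ** 2 for _, y in points)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    slope = sxy / sxx
    return slope, ybar - slope * xbar, sxy ** 2 / (sxx * syy)

# Invented base data; its least-squares line is y = 0.6 + 0.8 x.
base = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
# One extra point far out in x but exactly on that line.
with_x_outlier = base + [(15, 12.6)]
# One extra point at the mean of x but far out in y.
with_y_outlier = base + [(3, 12)]

for label, pts in [("base", base),
                   ("outlier in x, in line", with_x_outlier),
                   ("outlier in y", with_y_outlier)]:
    slope, intercept, r2 = fit(pts)
    print(f"{label:22s} slope={slope:.2f}  intercept={intercept:.2f}  r-squared={r2:.3f}")
```

In this example neither added point changes the slope. The in-line point raises r-squared a lot, while the y-outlier shifts the line upward and drops r-squared sharply.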

Lurking variables.  Check residuals, x, y, against time or order of observation (timeplot)--(looking for a "fatigue" or "running in" lurking variable.)
  Mixing 2 (or more) groups can diffuse or even reverse association (pp. 167-8--"Simpson's Paradox")
Averaged data will make stronger correlation than nonaveraged.  (e.g. country data)
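
A small made-up illustration of the group-mixing caution: within each group the association is perfectly negative, but pooling the groups gives a fairly strong positive correlation. The two groups and the helper `corr` are invented for this sketch.

```python
from math import sqrt

def corr(points):
    """Correlation r for a list of (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    syy = sum((y - ybar) ** 2 for _, y in points)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    return sxy / sqrt(sxx * syy)

# Two invented groups, each with y falling as x rises.
group_a = [(1, 3), (2, 2), (3, 1)]
group_b = [(6, 8), (7, 7), (8, 6)]
combined = group_a + group_b

print("r within group A:", round(corr(group_a), 3))
print("r within group B:", round(corr(group_b), 3))
print("r for the groups mixed together:", round(corr(combined), 3))
```

The lurking variable here is group membership: group B sits higher on both x and y, and that between-group difference swamps the within-group trend.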

"Anscombe's quartet:"  summary numbers are not sufficient to describe relationship! (Data p. 169, ex. 2.80)

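To see the point of the quartet without SPSS, here is a sketch that computes the summary numbers for the four data sets. The values are the commonly published Anscombe data as I have them; check them against the data on p. 169 before relying on them. The helper `summarize` is my own.

```python
from math import sqrt

def summarize(x, y):
    """Return (slope, intercept, r) of the least-squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, ybar - slope * xbar, sxy / sqrt(sxx * syy)

# Anscombe's four data sets (sets 1-3 share the same x values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "A": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, (slope, intercept, r) in results.items():
    print(f"Set {name}: y-hat = {intercept:.2f} + {slope:.3f} x,  r = {r:.3f}")
```

All four sets give nearly the same line and the same r, yet their scatterplots look nothing alike, so always graph the data.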
2.5 Causation:
Association (correlation) does not imply causation!
     Association diagrams:  dotted lines= association, solid = causation.  Good tool (p.174)
 x causes y?  Maybe y causes x.
 Common response to another variable (lurker)?
 Confounding:  2 or more "explanatory" variables are associated strongly; can't sort out which one response is "due to".  (And they may be lurkers.)

How to establish causation? x causes y
   Experiment: control all variables except the potential explanatory variables; randomize out uncontrollable factors (Ch. 3)
   Otherwise: p. 178, criteria:
       Strong association; consistent in different contexts.  Higher "dose" of x--> stronger response of y.
       x precedes y.  Plausible "mechanism" why x should cause y.

- - - - - - - -Start here Fri.- - - - - - - - - - - - - - - - -
Transforming variables (handout, plus Sec. 2.6)

Exponential growth.  (growth by percentages)
--  In an actual "growth" situation, taking logarithms often turns the growth curve into a straight line, or at least does the "growth" analog of "detrending" and makes deviations from the expected percentage growth more visible.

--Many other kinds of data benefit from log transformations:
>Where numbers are all >0,  and larger values can be thought of naturally as multiples of smaller ones.
>Where the histogram distribution is J-shaped, many observations at small values and fewer and fewer at larger and larger values.  E.g. earthquake severity (Richter scale is already log of amplitude), populations of all nations.
> Other times...

--We usually use log base 10, for ease in interpretation.  Then the leading log digit tells what place the leading raw digit takes:
   raw value     log
     1-10        0-1
    10-100       1-2
   100-1000      2-3
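
A tiny sketch of the table above in Python (SPSS's LG10( ) plays the role of `math.log10` here). The raw values and the function name `leading_log_digit` are invented for illustration.

```python
import math

def leading_log_digit(raw):
    """Whole-number part of log10(raw): 0 for 1-10, 1 for 10-100, 2 for 100-1000, ..."""
    return math.floor(math.log10(raw))

for raw in [5, 50, 500, 3200]:
    print(f"raw = {raw:5d}   log10 = {math.log10(raw):.3f}   leading log digit = {leading_log_digit(raw)}")
```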

Other transformations: powers, reciprocals.
Need monotonic  transformation to retain the order of data points.  If necessary, shift the data by adding a constant so all values are > 0.

"Ladder of Powers" (Fig. 2.36) x^p    log x lies at p=0.  (If p is negative,  x^p reverses the order of the data, < to >.  Use -x^p instead.)
  p > 1:  will pull in  the left tail of a distribution and stretch out the right tail. (Making a left skewed distribution more symmetrical.)  Stronger for higher p.
  p < 1:  will stretch out  the left tail of a distribution and pull in the right tail. (Making a right skewed distribution more symmetrical.)  Stronger for lower p.
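
A rough sketch of walking down the ladder, using a skewness number in place of eyeballing histograms. The data set is invented (each value double the previous, so it is strongly right-skewed), and the skewness formula used is the simple moment version, which is an assumption of this sketch and not something from IPS.

```python
import math

def skewness(data):
    """Moment skewness: > 0 for right skew, < 0 for left skew, 0 for symmetric."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

def ladder(data, p):
    """Transform on the ladder of powers: x^p, log x at p = 0, -x^p for negative p."""
    if p == 0:
        return [math.log10(v) for v in data]
    if p < 0:
        return [-(v ** p) for v in data]   # minus sign keeps the original order
    return [v ** p for v in data]

raw = [1, 2, 4, 8, 16, 32, 64]   # invented, right-skewed

skews = {p: skewness(ladder(raw, p)) for p in [2, 1, 0.5, 0, -1]}
for p, s in skews.items():
    label = "log" if p == 0 else f"p = {p}"
    print(f"{label:8s} skewness = {s:6.3f}")
```

For these data, going up the ladder (p = 2) makes the right skew worse, p = 1/2 helps, the log makes the values symmetric, and p = -1 overshoots into left skew.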

Relationships
Exponential growth  y = a b^x  becomes  log y = log(a) + x log(b).
   (x, log y) values have a linear relationship.  Fit with regression, solve back for y.  (Can use log10 or ln.)
       e.g. log y = 2 + 3x   -->   y = 10^(2 + 3x)  =  10^2 * 10^(3x)  =  100(1000^x)
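
The "fit with regression, solve back for y" recipe, sketched on artificial data generated exactly from the example y = 100(1000^x), so the fit should recover log y = 2 + 3x. Real data would not fall exactly on the line; the helper `fit_line` is my own.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept for y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return slope, ybar - slope * xbar

# Artificial exponential data: y = 100 * 1000^x.
x = [0, 0.5, 1, 1.5, 2]
y = [100 * 1000 ** xi for xi in x]

# Step 1: transform y only, then regress log y on x.
log_y = [math.log10(yi) for yi in y]
slope, intercept = fit_line(x, log_y)
print(f"log y = {intercept:.3f} + {slope:.3f} x")

# Step 2: solve back.  log y = log(a) + x log(b), so a = 10^intercept, b = 10^slope.
a = 10 ** intercept
b = 10 ** slope
print(f"y = {a:.1f} * ({b:.1f})^x")
```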
Powers  y = a x^p  becomes  log y = log a + p log x.
   (log x, log y) values have a linear relationship, and the fitted slope p "is" the power.
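
The same idea for a power relationship, sketched on artificial data generated from y = 2 x^3 (my choice of a and p, not from the text). Here both variables are logged, and the fitted slope should come back as the power 3.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept for y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return slope, ybar - slope * xbar

# Artificial power-law data: y = 2 * x^3.
x = [1, 2, 4, 8, 16]
y = [2 * xi ** 3 for xi in x]

# Transform both x and y, then regress log y on log x.
log_x = [math.log10(xi) for xi in x]
log_y = [math.log10(yi) for yi in y]
p, log_a = fit_line(log_x, log_y)
a = 10 ** log_a

print(f"log y = {log_a:.3f} + {p:.3f} log x")
print(f"y = {a:.2f} * x^{p:.2f}")
```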


Sievers home  Math251-Fall07/Day2s12.htm   1:46pm    9/20/07
This page belongs to Sally Sievers who is solely responsible for its content.