### Math 151, Day 12, Friday, Feb. 17, 2012

HW Day 12: Read Ch. 4 (Scatterplots and correlation) to p. 104; Check p. 112: 4.14, 4.15, 4.16. Then read pp. 104-112 (correlation); Check 4.16 through 4.22. You do not have to be able to calculate r by hand. You should be able to guess roughly at an r for a swarm of data, as on pp. 108-9, and know and be able to use Facts 1-4, p. 107, and Cautions 1-4, pp. 108, 110.
Please also look ahead: Ch. 5, Regression, through p. 135. Check, p. 137: 5.17 through 5.23, basic line and regression line facts and tools. (5.18: those are not very satisfactory answers, but you should be able to eliminate at least one.) 5.24: r and slope. 5.26: Don't calculate! If you sketch the graph by hand and draw a line through the points, you should be able to guesstimate the slope well enough to choose among the 3 answers. 5.25: r^2 is the square of r. Then, continuing regression, pp. 126-147.

Exam 1 returned: Comments, Solutions
Sample exam solutions

I haven't been mentioning Science Colloquium (every Friday, 12:40-1:20), but the talks are often fascinating, and often have Statistics in Action (unpredictably, unfortunately). This spring: mostly student theses. Today: "The Independence Option; Business Knowledge for Science Majors," Prof. Ellis. Please come!

HW questions--  Going from x to area (proportion), & backward--area to x: Day 11,   Normal probability practice

What happens further out in normal tails?  Almost (but not quite) 0.  Rounds to .0000.
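A quick way to see "almost (but not quite) 0": the sketch below computes the area above several z values using Python's standard `math.erfc`. The helper name `upper_tail` and the particular z values are my own choices for illustration, not from the text or Table A.

```python
import math

def upper_tail(z):
    """Area under the standard Normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# z values chosen for illustration; a printed table usually stops well before these.
for z in (3.0, 3.5, 4.0, 5.0, 8.0):
    area = upper_tail(z)
    print(f"z = {z}: area above = {area:.2e}, to 4 decimal places = {area:.4f}")
```

The areas keep shrinking but never reach exactly 0; from about z = 4 on, they just round to .0000.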
p. 90, 3.43:  Difference in tails, M/F math.  Other evidence relevant to the question:  Across countries, the difference in math scores M/F is related to the level of gender equality in the country--the more equal the sexes are in general, the smaller the differential in math scores, and vice versa. (would be good on a scatterplot but I don't have the data in that form).  Evidence for nurture not nature.

A. What proportion of pregnancies last 310 days or more? Find the mean and s.d. on p. 87, 3.19: N(266, 16).
z = (310-266)/16 = 44/16= 2.75.  Area above 2.75 = .0030.  3 in a thousand! Pretty rare!
Why do I ask?  (see "San Diego Reader" below )
Is "San Diego Reader" one of the 3-in-a-thousand, or is she lying?  (this is the kind of question we deal with in Significance Testing, part 3 of the course)  Discussion.
These days a DNA test could be done to determine paternity; not then.
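The calculation in part A can be checked in a few lines. This is only a sketch: it uses the N(266, 16) model from the text, and `upper_tail` is my own helper name for the area to the right of z.

```python
import math

def upper_tail(z):
    """Area under the standard Normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mean_days, sd_days = 266, 16   # N(266, 16), from the text
x = 310                        # ten months and five days, roughly

z = (x - mean_days) / sd_days  # standardize: how many s.d.'s above the mean?
area = upper_tail(z)
print(f"z = {z:.2f}, proportion lasting {x} days or more = {area:.4f}")
```

Going backward (area to x) would reverse the steps: find z for the area, then x = mean + z times s.d.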

New today:
Normal distribution mechanism:
Thing measured is the result of many small independent influences.
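A small simulation can make the mechanism plausible. This is a sketch under my own assumptions (50 influences per measurement, each uniform between -1 and 1, 20,000 measurements); none of these numbers come from the text. If the sums are close to Normal, about 68% should fall within 1 s.d. of the mean and about 95% within 2 s.d.

```python
import random
import statistics

random.seed(151)  # arbitrary seed so the run is repeatable

def one_measurement(n_influences=50):
    """One 'thing measured': the sum of many small independent pushes."""
    return sum(random.uniform(-1, 1) for _ in range(n_influences))

data = [one_measurement() for _ in range(20000)]
m = statistics.mean(data)
s = statistics.stdev(data)

# Compare with the 68-95-99.7 rule.
within_1 = sum(abs(v - m) <= s for v in data) / len(data)
within_2 = sum(abs(v - m) <= 2 * s for v in data) / len(data)
print(f"within 1 s.d.: {within_1:.3f}   within 2 s.d.: {within_2:.3f}")
```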
"Real" data may not be perfectly normal:
-- Just because of natural variation in a particular data set, especially in small data sets.
-- Data falls only on (a lot of) integers, not really continuous. In a Continuous Model, no individual value has Area above it--only intervals have area above them. (So the proportion who are exactly = 27 is 0 by the model, and Proportion ≥ 27 = Proportion > 27.) (A fix, if you need a better approximation: for integer data, use the model's area above 27.5 for Prop. > 27, and the area above 26.5 for Proportion ≥ 27. We won't bother.)
-- Model may not hold for extreme values.  The Normal Model says there is still a (tiny!) proportion of individuals out at 4, 5, 8 standard deviations away from the mean. These may not even make sense in your real world situation (off the scale). Tails
-- The model may just not be quite right; the mechanism is not quite the Normal one.
But Normal may be a good enough approximation.
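To see why the half-unit fix for integer data helps, here is a sketch with made-up numbers: I assume a N(25, 4) model (hypothetical, for illustration only) and "real" data that are Normal values recorded to the nearest integer.

```python
import math
import random

def upper_tail(z):
    """Area under the standard Normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

random.seed(12)       # arbitrary seed so the run is repeatable
mu, sigma = 25, 4     # hypothetical model, not from the text
n = 100000

# "Real" data: Normal values recorded to the nearest integer.
data = [round(random.gauss(mu, sigma)) for _ in range(n)]
observed = sum(v >= 27 for v in data) / n

model_at_27 = upper_tail((27 - mu) / sigma)      # plain model: area above 27
model_at_26_5 = upper_tail((26.5 - mu) / sigma)  # the fix: area above 26.5

print(f"observed proportion >= 27: {observed:.4f}")
print(f"model area above 27:       {model_at_27:.4f}")
print(f"model area above 26.5:     {model_at_26_5:.4f}")
```

In this setup the area above 26.5 should land much closer to the observed proportion, because every value between 26.5 and 27 was recorded as 27.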

= = = = = = = = = = = = = = = = = = = =
Start here Monday

Relationships: (BPS5e Ch.4, at first to p. 104)
Two Related quantitative variables  (We used side by side stemplots, boxplots, histograms to relate a quantitative variable to a categorical variable)
"Just Related" or "explanatory & response?"
(Scatterplots)
explanatory = independent = "x" = horizontal axis (= "cause", sometimes but not always) = predictOR
response = dependent = "y" = vertical axis (= "effect") = predictED

(Living histograms:  Height vs. weight, Height vs. gpa)

Discussing Scatterplot
General Pattern:
Clusters?
Form (linear, curved, ...?)
Strength of relationship (how unfuzzy): "Weak, moderate, strong"
Direction:
Positively associated: y increases as x increases (generally).
Negatively associated: y decreases as x increases.
Deviations:
Outliers? (label if possible)

Mark subgroups differently to do comparisons. (Subgroups defined by categorical variable, like Sex, Region of country)

Get SPSS Scatterplot handout, link Governors' Salaries HW sheet,or outside my door, if you missed class. (BPS Ch. 4&5)
SPSS:   Graphs>Legacy Dialogs>Scatter/Dot > Simple Scatterplot.  Move variables from the lefthand  list to the X-axis (horizontal)  and Y-axis (vertical) boxes. See Handout for more.  Files from text? Don't forget to check Measure, and to add Labels.

Some scatterplot data: educ-v-mortality.sav. The file used for the handout is govsal_vs_pay.sav.
(BPS Ch. 4&5)

Correlation: (pp. 104-112)
The (Pearson) correlation coefficient r is a numerical measure of how strongly linear (and in what direction) the relationship is. It doesn't substitute for a scatterplot.
Use if data is:  2 quantitative variables, & "nice":
One cluster/cloud/band.
Pretty straight.
Outlier(s)? Do with/without & be cautious.
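You won't compute r by hand, but it may help to see what the software is doing. The sketch below uses the standard definition: standardize each x and each y, multiply the pairs, add up, and divide by n - 1. The data values and the function name `pearson_r` are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation as the (n - 1)-average of products of standardized values."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))  # s.d. of x
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))  # s.d. of y
    products = [((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data: one cloud, roughly straight, no outliers.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.3f}")
```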
Correlation experiments:
Website: http://www.whfreeman.com/bps5e, "Statistical Applets", Correlation/Regression. Play with data points, observing the Correlation Coefficient. Check the "Show Mean X & Mean Y lines" box. See how much data is in each quadrant, and compare with the correlation coefficient.

Using SPSS (p. 4 top, Scatterplot Handout): Analyze>Correlate>Bivariate, move both variables across.

Properties (p. 107):

1. Measures relationship--same whichever variable is on the x-axis.
2. "Unitless"--original measurement units (cm., inches) are "standardized out".
3. Sign of correlation coefficient matches direction of relationship: + positive, - negative.
4. Between -1 and +1. 0: no linear relationship; +1 or -1: perfect straight line.

Cautions (pp. 108, 110):

1. Between two quantitative variables only!
2. Does NOT give info about curved relationships (only measures linear part of relationship).
3. NOT resistant to outliers--quite sensitive.
4. Not a complete summary, even for nice linear data. Need means, s.d.'s too.
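Several of these facts can be checked numerically. The sketch below is only an illustration: the heights, weights, and the added outlier are invented, and `pearson_r` is my own helper using the standard definition of r.

```python
import math

def pearson_r(xs, ys):
    """Correlation as the (n - 1)-average of products of standardized values."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical heights (inches) and weights (pounds).
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 160, 180]

r_xy = pearson_r(height_in, weight_lb)
r_yx = pearson_r(weight_lb, height_in)                    # swap x and y
r_cm = pearson_r([h * 2.54 for h in height_in], weight_lb)  # inches -> cm

# Caution 2: a perfect curve can still have r = 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [v ** 2 for v in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# Caution 3: one invented outlier (80 inches, 100 pounds) changes r a lot.
r_outlier = pearson_r(height_in + [80], weight_lb + [100])

print(f"r = {r_xy:.3f}, swapped = {r_yx:.3f}, in cm = {r_cm:.3f}")
print(f"curved data: r = {r_curve:.3f}; with outlier: r = {r_outlier:.3f}")
```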

--You won't have to calculate a correlation coefficient by hand. This formula is a bad one for hand computation (roundoff error); if you must do one by hand, find the computational formula in an old textbook.
--Eyeballing:  sketch xbar and ybar lines, see how much data is in + quadrants, how much in - quadrants.
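The eyeballing idea can be written out as a count. This sketch uses made-up data: a point is in a "+" quadrant when it is above both means or below both means, and in a "-" quadrant when it is above one mean and below the other.

```python
# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)

plus = minus = 0
for xi, yi in zip(x, y):
    product = (xi - xbar) * (yi - ybar)  # sign tells which kind of quadrant
    if product > 0:
        plus += 1      # upper right or lower left
    elif product < 0:
        minus += 1     # upper left or lower right

print(f"+ quadrants: {plus}, - quadrants: {minus}")
```

More points in the + quadrants suggests a positive r; how lopsided the split is gives a rough feel for strength, though points far from the means count for more in the actual r.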

Strength of correlation says NOTHING about causality!  Strong correlation could be:
A causes B/   B causes A/  C causes both A and B (lurking C)/  just Chance that they go together in this data set.
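Here is a sketch of the lurking-variable case, with everything invented for illustration: C is random, A is C plus its own noise, B is C plus its own separate noise. A never influences B and B never influences A, yet they come out clearly correlated.

```python
import math
import random

def pearson_r(xs, ys):
    """Correlation as the (n - 1)-average of products of standardized values."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

random.seed(4)   # arbitrary seed so the run is repeatable
n = 5000

c = [random.gauss(0, 1) for _ in range(n)]    # the lurking variable
a = [ci + random.gauss(0, 1) for ci in c]     # A depends on C only
b = [ci + random.gauss(0, 1) for ci in c]     # B depends on C only

r_ab = pearson_r(a, b)
print(f"r between A and B = {r_ab:.2f}")
```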

 Sievers  home Math151-Sp12/Days12.htm 2:30pm 2/17/12
This page belongs to Sally Sievers who is solely responsible for its content. Please see our statement  of responsibility.

**[In 1973] the following item appeared in Dear Abby's column:

Dear Abby: You wrote in your column that a woman is pregnant for 266 days. Who said so? I carried my baby for ten months  and five days, and there is no doubt about it because I know the exact date my baby was conceived. My husband is in the Navy  and it couldn't have possibly been conceived any other time because I saw him only once for an hour, and I didn't see him again  until the day before the baby was born. I don't drink or run around, and there is no way this baby isn't his, so please print a retraction about that 266-day carrying time because otherwise I am in a lot of trouble.
Abby's answer was consoling and gracious but not very statistical:

Dear Reader: The average gestation period is 266 days. Some babies come early. Others come late. Yours was late.

The question here is not whether the baby was late. That fact is already known. At issue is the credibility of the length of the delay. Ten months and five days is approximately 310 days, which means that the pregnancy exceeded the norm by 44 days. [How unusual is that?]

*Bear in mind that there were around 400,000 births in California in 1970. (I'm guesstimating: there were 605,694 births in 1990, and the population of California in 1970 was 2/3 of that in 1990.)
So a 3-in-a-thousand event would occur in 3 x 400 = 1200 births--there would be 1200 women in San Diego Reader's position (many of whom wouldn't know it).
Rare events DO happen--it's not really fair to only notice and question them AFTER the fact.
Note--pregnancy in 1970 usually didn't involve the level of medical intervention (ultrasound, induction of labor, Caesarean section, etc.) it often gets now.
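The guesstimate in the footnote is just two multiplications; a sketch, using only the numbers given there (the variable names are mine):

```python
births_1990 = 605694        # California births in 1990, from the footnote
population_ratio = 2 / 3    # 1970 population as a fraction of 1990's, per the footnote

births_1970_estimate = births_1990 * population_ratio
print(f"estimated 1970 births: about {round(births_1970_estimate):,}")

rare_rate = 0.003           # area above z = 2.75, i.e. 3 in a thousand
expected_rare = 400000 * rare_rate   # using the rounded figure of 400,000 births
print(f"expected 310-day-or-longer pregnancies: about {round(expected_rare)}")
```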