Class notes for Chapter 5 (will last all week, probably): Deriving information from data:
Distributions--Many data sets seem to follow the "Normal"
(Gaussian, Bell-curve) distribution.
"Living Histograms", F and M/F; "quincunx" pinball box.
We let the area under the bell curve over an interval stand for probability
(approximately the proportion of the data set in that interval).
Probability of something happening:
Number between 0 and 1: closer to 1 if it is more likely, closer
to 0 if it is less likely.
Idealized proportion of times the something would happen if we
could repeat the situation producing it a very large
number of times.
(The probability of choosing one Wells student who is 5'3" or under equals
the proportion of all Wells students who are 5'3" or under, assuming I
choose her randomly, so that each student has an equal chance of being chosen.)
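A small simulation can illustrate the "proportion of times" idea. This is only a sketch in Python (not part of the text or the Excel workbook). The heights are made up with a random number generator and are NOT real Wells data, and the names `population` and `estimate` are just for illustration.

```python
import random

rng = random.Random(1)

# Made-up population: 500 heights in inches (an assumption, not real data).
population = [round(rng.gauss(64.5, 2.5), 1) for _ in range(500)]

# Proportion of the whole population that is 5'3" (63 inches) or under.
true_proportion = sum(h <= 63 for h in population) / len(population)

def estimate(trials):
    """Choose one student at random, `trials` times over;
    return the fraction of choices that were 63 inches or under."""
    hits = sum(rng.choice(population) <= 63 for _ in range(trials))
    return hits / trials

# The more times we repeat the choice, the closer the fraction
# should settle to the population proportion.
for trials in (10, 1000, 100000):
    print(trials, estimate(trials), "vs", true_proportion)
```

With only 10 repetitions the fraction can be far off. With very many it should sit close to the population proportion.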
Probability density function:
Idealized histogram. (of results of repeating the situation very many
times--e.g. choosing many random students.)
"Population" is all possible results. Sometimes a real population
of individuals. Prob. density function is histogram of population.
"Sample" is a set drawn from all the possible results, or individuals
in the population. "Random sample" from a real population gives every
individual in the population an equal chance of being chosen.
The histogram of the sample should look like the probability density function (that is, the histogram of the population).
But variability in sampling will give substantial differences from
one sample to another.
Population parameters: single-number descriptives for the population
(like mean, standard deviation). Mean of sample should be close to mean
of population--it is IF sample size is big enough (but how big is big enough?).
Similarly for standard deviation.
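To get a feel for "how big is big enough", one could compare samples of different sizes against a known population. The sketch below uses Python's standard `random` and `statistics` modules. The population is invented (values spread evenly between 0 and 100), and `sample_summary` is a hypothetical helper name.

```python
import random
import statistics

rng = random.Random(2)

# Made-up population of 20,000 values spread evenly between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(20000)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

def sample_summary(n):
    """Draw a random sample of size n (every individual equally likely);
    return the sample's mean and standard deviation."""
    sample = rng.sample(population, n)
    return statistics.mean(sample), statistics.stdev(sample)

print("population", round(pop_mean, 2), round(pop_sd, 2))
for n in (5, 50, 5000):
    m, s = sample_summary(n)
    print(n, round(m, 2), round(s, 2))
```

Small samples can land well away from the population values. The large sample should land close to them.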
Normal distribution family:
To compare one value with the "pack", standardize it: find the
"z-score", which is how many standard deviations away from the
mean a particular value is
(can be +, above the mean; or -, below the mean).
z-score = (value - mean) divided by (standard deviation).
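As a quick illustration of standardizing, here is a minimal Python sketch. The function name `z_score` and the numbers (mean 64.5 inches, standard deviation 2.5 inches) are made up for the example.

```python
def z_score(value, mean, sd):
    """How many standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical heights in inches, with mean 64.5 and s.d. 2.5.
print(z_score(63, 64.5, 2.5))    # -0.6  (below the mean)
print(z_score(69.5, 64.5, 2.5))  # 2.0   (above the mean)
```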
68-95-99.7 rule: about 68% of values are within 1 s.d. of
the mean, about 95% are within 2 s.d.'s of the mean, about 99.7% are within
3 s.d.'s of the mean, IF the data comes from a normal distribution.
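The three percentages can be checked from the normal curve itself. The sketch below is one way to do it in Python, using the standard library's error function `math.erf`; the helper name `within_k_sd` is just for illustration.

```python
import math

def within_k_sd(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sd(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

So "68-95-99.7" is a rounded version of the exact areas.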
"shoulders" of curve (where it changes from curving right, to curving
left, or vice versa) happen one s.d. away from the mean. (Not in your text.)
Standard normal distribution has mean = 0, standard deviation =
1. What you get when you measure data in standard deviations from
the mean. The "Probability Density Function" in the Concepts97 workbook
is standard normal.
Distribution of sample means: If I take many different samples,
what does the distribution of the values of the sample means look like?
If the sample size is "large", it looks roughly NORMAL, NO
MATTER WHAT THE POPULATION LOOKS LIKE THAT YOU DREW THE DATA FROM!
"Central Limit Theorem"
The mean of the sample means (average value from your many samples)
is the mean of the population. (Sometimes the mean of a sample will
be higher, sometimes lower, than the mean of the population. On average,
it is right there.)
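The two claims above (sample means look roughly normal, and they average out to the population mean) can be tried out with a simulation. This is a sketch only. The population here is an assumed, strongly skewed one (exponential with mean 1), the sample size of 50 is an arbitrary choice of "large", and the helper names are hypothetical.

```python
import random
import statistics

rng = random.Random(3)

def sample_mean(n):
    """Mean of one sample of size n from a skewed population (exponential, mean 1)."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

def share_within_one_sd(values):
    """Fraction of values within one standard deviation of their own mean."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(abs(v - m) <= sd for v in values) / len(values)

individuals = [rng.expovariate(1.0) for _ in range(5000)]
means = [sample_mean(50) for _ in range(5000)]

# The mean of the sample means should come out close to the population mean, 1.
print("mean of sample means:", round(statistics.mean(means), 3))
# Individual values are skewed, so they need not follow the 68% rule...
print("individuals within 1 s.d.:", round(share_within_one_sd(individuals), 3))
# ...but the sample means should come out near 68%, as a normal curve would.
print("sample means within 1 s.d.:", round(share_within_one_sd(means), 3))
```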
The variability of the sample means (from many different samples)
is less than the variability (spread) of the population.
Formula: Standard deviation of sample means
= (standard deviation of population) divided by (square root of the sample size).
(Means from a bunch of samples of size 4 will have half the standard deviation
of the population.
Means from a bunch of samples of size 9 will have one-third the standard deviation
of the population.
Means from a bunch of samples of size 16 will have one-fourth the standard deviation
of the population.)
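The one-half, one-third, one-fourth pattern can be checked by simulation. The sketch below assumes a population of values spread evenly between 0 and 1, whose standard deviation is the square root of 1/12. The name `sd_of_sample_means` and the choice of 20,000 samples are for illustration only.

```python
import random
import statistics

rng = random.Random(4)
POP_SD = (1 / 12) ** 0.5  # s.d. of values spread evenly between 0 and 1

def sd_of_sample_means(n, how_many=20000):
    """Draw `how_many` samples of size n; return the s.d. of their means."""
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(how_many)]
    return statistics.pstdev(means)

# Compare the simulated ratio with 1 / (square root of the sample size).
for n in (4, 9, 16):
    ratio = sd_of_sample_means(n) / POP_SD
    print(n, round(ratio, 3), "formula says", round(1 / n ** 0.5, 3))
```

The simulated ratios should come out close to 0.5, 0.333 and 0.25, though not exactly, because of sampling variability.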
So WHY is the normal distribution
so common? Any mechanism where the value measured is the sum
or average of a lot of little independent contributions, no one thing dominating,
will probably be normal (because of the Central Limit Theorem).
And WHY is the normal distribution
important to statistical inference (making claims about a population based
on a sample)? It allows us to study how a sample mean is likely to
relate to a population mean, for a given sample size, even if we don't
know much about the population. (Take Math 151...)
Learning Excel: Most of ch. 5 is about concepts, little new
"Excel" here.
pp. 139-41, 146-8 Generating Random Numbers.
These are really only pseudo-random numbers--they come in a fixed sequence,
but they "look" random. So every time you turn on Excel and go directly
to this tool you will get the same list! If you want to control
which list you get, you can put a number in the Random Seed box: a different
number will give a different list.
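The same seed idea shows up in most software. The sketch below uses Python's standard `random` module, which is a different generator from Excel's, so the actual numbers will not match Excel's lists. `random_list` is a hypothetical helper name.

```python
import random

def random_list(seed, count=5):
    """Return `count` pseudo-random numbers between 0 and 1 for a given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]

print(random_list(1))
print(random_list(1))  # same seed, same list
print(random_list(2))  # different seed, different list
```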
You can get numbers from different distributions by choosing a different
name in the Distribution box.
StatPlus Histograms: fragile? Sometimes
I've had trouble with this. I think it doesn't much like negative
numbers. If it gives you a lot of trouble just do the usual histogram.
You won't get the superimposed normal curve. Look at the book and
imagine the normal curve.
Lost your spreadsheet? How
to avoid losing it again, with automatic backups and/or Autosave.
Assignment 8: Due (probably)
Wednesday, April 25 (Day 14, in Week 5) Friday,
April 27 (Day 15)
Get handout for this, do it as you work through the text.