This document will contain notes to the class regarding assignments and other material. Check back often.

Congratulations on the classroom presentations. Staying up most of the night working on the evaluation tool was worth it, and the snafu from the morning class has been fixed. It went fairly smoothly, and I was much happier than in previous semesters, when the evaluation was done on paper with a single score and everyone got a 10 because "hey, I have to get up there too." The ability to send comments to the presenters was nice, and you can go back in and read the comments that were made about your presentation.

Speaking of comments, I'm really impressed with how many were made. Here's some data for you.

Section 1 had 23 comments out of 117 evaluations, while section 2 had 129 comments out of 252 evaluations. Now granted, the morning class was rushed because some people took much longer than the 3 to 5 minutes, but it got me wondering: did section 2 have a higher proportion of comments? Intuitively, yes, but since this is a statistics class, let's check it out.

H_{0}: p_{1} = p_{2} or p_{1} - p_{2} = 0 (claimed difference is 0)

H_{1}: p_{1} < p_{2} or p_{1} - p_{2} < 0 (left tail test)

Here are the Minitab results:

Sample     X     N  Sample p
1         23   117  0.196581
2        129   252  0.511905

Difference = p (1) - p (2)
Estimate for difference:  -0.315324
95% upper bound for difference:  -0.235733
Test for difference = 0 (vs < 0):  Z = -5.73  P-Value = 0.000

There are three approaches to hypothesis testing. Let's examine each of them.

- The p-value of 0.000 is less than the significance level of 0.05, so we reject the null hypothesis that p_{1} = p_{2}.
- The test statistic of z = -5.73 is to the left of the critical value of z = -1.645, so we reject the null hypothesis that p_{1} = p_{2}.
- The claimed difference of p_{1} - p_{2} = 0 does not fall in the confidence interval for the difference of p_{1} - p_{2} < -0.235733, so we reject the null hypothesis that p_{1} = p_{2}.
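If you want to check Minitab's numbers yourself, here's a quick sketch of the pooled two-proportion z-test in Python. Python isn't part of this course; this is just to show the computation isn't magic.

```python
import math

# Counts from the two sections (comments out of evaluations)
x1, n1 = 23, 117     # section 1
x2, n2 = 129, 252    # section 2

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pool the proportions, since H0 says they're equal
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se               # test statistic

# Left-tail p-value from the standard normal CDF
p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(p1 - p2, 6))  # estimate for the difference: -0.315324
print(round(z, 2))        # matches Minitab's Z = -5.73
print(p_value)            # essentially 0
```

The proportions are pooled in the standard error because the null hypothesis assumes the two proportions are the same.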

All of the decision rules agree and say reject the null hypothesis.

There is enough evidence to support the claim that section 2 commented at a higher rate than section 1.

See how easy it is to come up with hypothesis tests? They're everywhere!

Some of you have noticed that the links for the National Highway Traffic Safety Administration in question 2 are returning errors. The NHTSA shut down that portion of their website because of technical difficulties on April 18, 2006. Google had the page cached and I was able to get the data. The information that you need is contained in the Minitab instructions for technology project 10.

I'm just letting the class know that the software that allows me to take over the computers and see what you're doing is now installed. If your computer magically shuts down during the middle of the class, you should ask yourself if you were doing something you shouldn't have been doing.

I'm always looking for ways to improve the class so that more people understand the material. There seems to be more confusion this semester than normal, and although I'm sure part of it is due to people not paying attention in class (the Internet is too compelling) and not reading their books, I think part of it is a failure to see the big picture and tie things together.

So, let me run my latest idea past you and see what you think. You're nearing the end of the course and are in a position to comment.

The first week (or however long it takes), we would do a project that did a little bit of everything. Call it an introduction to the course so the students could see what it was all about. Then we would go back later and fill in the details by working through the book in a more traditional method. Don't get too bogged down with details at this point and definitely don't introduce messy formulas (just let the computer find them for now).

Here is an example.

Claim: The temperatures this year are the same as last year.

Divide the class into groups of two or three each, probably three since it's early in the semester. Each group would be given a season (winter, spring, summer, fall) and with a larger class there would be more than one group working on the same season (that way we can show that different samples of the same data might give different results).

The groups would then use Minitab to randomly select dates from last year and dates from this year. We might have one group select different dates (independent samples) and another group with the same season use the same dates (dependent samples). They would then go to the weather.gov website and collect the temperatures for the dates that were selected, much like they did on my chapter 8 tech project. This introduces them to random sampling (might as well show them the best way first) and also independent vs. dependent samples.
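As a sketch of what the random date selection might look like (in class we'd use Minitab; the start dates, season length, and sample size below are made up for illustration):

```python
import random
from datetime import date, timedelta

random.seed(1)  # fixed seed so the demonstration is reproducible

def random_dates(year, month, day, n, span=90):
    """Pick n distinct random dates from a span of days starting at the given date."""
    start = date(year, month, day)
    offsets = random.sample(range(span), n)   # sampling without replacement
    return sorted(start + timedelta(days=d) for d in offsets)

# Independent samples: different randomly chosen dates for each winter
last_winter = random_dates(2004, 12, 1, 10)
this_winter = random_dates(2005, 12, 1, 10)

# Dependent samples: the SAME offsets applied to both winters
shared = random.sample(range(90), 10)
paired = [(date(2004, 12, 1) + timedelta(days=d),
           date(2005, 12, 1) + timedelta(days=d)) for d in shared]
```

The independent groups end up with unrelated dates, while the dependent groups compare the same calendar day across the two years.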

Once we have collected the data, we would use it to make histograms, dot plots, and box plots. We might even show them a scatterplot and why it's not appropriate (especially with independent samples). We could code the data into categories (cold, normal, warm) and then look at the pie chart and bar charts. We can compare the two sets of data and look for differences. Basically it would be a graphical exploration of the data, all done with Minitab. We won't concentrate on how Minitab finds the numbers just yet, just ways to look at the data graphically.

Then we'll describe the data numerically. Find the mean and the standard deviation. We'll redo the histogram so that each bar is one SD wide and then we can talk about the 68-95-99.7 rule and Chebyshev's rule. Find the five number summary and tie it back into the box plots that were made. Kind of show how the numerical data appears in the different graphs. Don't really focus on how to find the mean or standard deviation at this point, just let Minitab do it.
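To make that numerical-description step concrete, here's a sketch in Python with made-up temperatures (in class, Minitab would produce all of this for us):

```python
import statistics as st

# Hypothetical winter temperatures for illustration only
temps = [28, 31, 25, 35, 30, 27, 33, 29, 26, 36]

mean = st.mean(temps)
sd = st.stdev(temps)             # sample standard deviation
print(mean, round(sd, 2))        # mean 30, sd about 3.74

# Five-number summary: min, Q1, median, Q3, max
q1, q2, q3 = st.quantiles(temps, n=4)
print(min(temps), q1, q2, q3, max(temps))
```

The five-number summary is exactly what the box plot drew earlier, which is the tie-in between the graphs and the numbers.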

Then we'll move on to confidence intervals and hypothesis testing, based on an "anything more than 2 standard deviations away from the mean" definition of unusual. We're not ready for a discussion of critical values or t distributions at this point, but we could mention that anything that has less than a 5% chance of occurring we'll consider to be unusual. We might also have another side claim like "the average temperature in the winter is 22 degrees" that we could check against the confidence interval. That would be best, since we're not supposed to use overlapping confidence intervals when testing the means of two independent samples. Then we can move to the actual test of last year's data versus this year's data, not looking at the test statistic, just the p-value, and leaving out all the other terminology. We'll tell the students which test to run, of course, since they won't know otherwise.
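Here's a rough sketch of that "2 standard deviations" check in Python, using made-up temperatures and the hypothetical 22-degree claim (a rule-of-thumb interval, not the formal t procedure we'd cover later):

```python
import math
import statistics as st

# Hypothetical winter sample for illustration only
temps = [19, 24, 21, 26, 18, 23, 25, 20, 22, 27]

xbar = st.mean(temps)
se = st.stdev(temps) / math.sqrt(len(temps))   # standard error of the mean

# Rough 95% interval: within 2 standard errors of the sample mean
low, high = xbar - 2 * se, xbar + 2 * se
print(low <= 22 <= high)   # does the claimed 22 degrees fall in the interval?
```

If the claimed value lands inside the interval, it isn't unusual and we have no reason to doubt the claim; if it lands outside, we would.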

Throughout all of this, they should take their output from Minitab, paste it into Word, and type a sentence or two about what they discovered or found out. Then they are encouraged to take the relevant material and create a PowerPoint slide show. I don't want everything they do to be in there, just the highlights. They'll take the PowerPoint slide show and give a short demonstration to the class about what they found out.

The Word document, PowerPoint, and presentation are the first "test," worth 100 points. Everyone in the group gets the same grade.

After the introduction project is over, we'll go back and fill in the details in a more traditional format. We'll get into sampling, simulation, probability, distributions, hypothesis testing, correlation, and chi-square / ANOVA stuff.

I'm hoping that by providing the overview at the beginning, the students will be better able to put it all together and see the big picture early on. Hopefully, by spending so much time up front getting used to the technology, things will go faster later on and I can spend less time on each topic, which should still give me enough time to get everything covered.

I have no idea how long this will take. Now that I look at what I wrote out, 2 weeks may be more realistic. I may try this in the summer so that I have a better idea before fall gets here.

I have a feeling that I would have to sacrifice some of the classroom activities to do this, but that's okay with me. I need to re-work the class anyway. I'm spending too much time explaining the different test statistics in chapters 7-8, and the students are missing that it's all the same thing, getting bogged down in the formulas even though we don't need to use them because the computer will do them for us. They're getting tired of "put this in your notes for completeness, but we won't do it by hand."

I'm open to comments or suggestions. Good idea, bad idea, neutral? I haven't worked out the details; this is just off the top of my head. Email me your comments.

I mentioned to the class during review on Friday that this was going to be the part that gave you the most difficulty and I'm seeing that in the technology projects that you're sending to me.

Think about the type of data you're collecting. When you collect each piece of data are you going to collect a number or a category (yes/no) response? Think about how you would graph the data: would you use a pie chart or a histogram? If you are collecting numerical data or would use a histogram to graph it, then you are (at this point) going to be talking about means and the symbol you should be using is μ. If you are collecting categorical data or would use a pie chart to graph it, then you are (at this point) going to be talking about proportions and the symbol you should be using is p.

Here are some examples.

- Claim: "Over eight in ten adults are overweight." For each person, you would ask "Are you overweight?" and the response would be "yes" or "no" (or maybe "none of your business"), but definitely a categorical response. You would make a pie chart showing the percent of responses in each category. We're talking about proportions, so we use p. Since there is only one sample, we need a number to compare the proportion to, and that's where the "eight in ten" comes in. Eight in ten is 8/10 or 0.8. The claim is p > 0.80.
- Claim: "More than 20 new computer viruses are reported each day." When you go to the website and count the number of new viruses, you're collecting numerical data. You would make a histogram to display those numbers. You could define success to be "there are more than 20 new viruses" and record a "yes" or "no" for each day, but then your claim would have to be something like "more than 20 new viruses are reported on 70% of days," and that's just not what this problem asks. We're talking about means, and there is only one sample, so the claim is μ > 20. When you only have one population, you need a number to compare the parameter to.
- Claim: "Echo Boomers are less likely than Gen Xers to watch network broadcast or cable news daily or several times a week." Think about how the data was collected. What did you ask to get this data? You asked "Do you watch ...?" and that has a yes or no answer. Okay, Harris Interactive probably asked how many times a week they watched, but it was still a category; we've just simplified it for where we are in the course. There are two samples here, Echo Boomers (EB) and Gen Xers (GX). When we make the pie charts, there would be two, one for each group. We're talking about two proportions, and the claim is p_{EB} < p_{GX}. When you have two populations (EB and GX), you don't need a number to compare them to; you compare them to each other. Note: If the claim had been "people who watch network broadcast or cable news daily or several times a week are less likely to be Echo Boomers than Gen Xers," then there would have been only one sample (those who watch the news), and if you define success to be that the person is an Echo Boomer, you could say p < 0.50 (meaning that less than half of the people are Echo Boomers). However, that assumes all people who watch the news are in one of those two groups, and that's not really the case, so a χ^{2} goodness-of-fit test from chapter 10 is a better way to test it.
- Claim: "There is no difference in the temperatures between last year and this year." What kind of data are the temperatures? That's right, numerical, and you would make a histogram to graph them, so we're talking about means, not proportions. We have two samples, last year and this year, so we don't need an actual value to compare the temperatures to, just each other. So the original claim is μ_{1} = μ_{2}, where sample 1 is from last year and sample 2 is from this year. Be sure to define subscripts for the samples when it isn't obvious what they stand for.
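If it helps, the whole decision process in these examples boils down to two questions, which you could even write as a toy function (the function and its names are made up for illustration, not anything we'll use in class):

```python
def parameter_symbol(data_kind, n_samples=1):
    """Toy version of the decision rule above (hypothetical helper).

    'categorical' data (pie chart)  -> proportions, symbol p
    'numerical' data (histogram)    -> means, symbol mu
    One sample compares the parameter to a number from the claim;
    two samples compare the parameters to each other.
    """
    symbol = "p" if data_kind == "categorical" else "mu"
    return symbol if n_samples == 1 else f"{symbol}_1 vs {symbol}_2"

print(parameter_symbol("categorical"))     # "Are you overweight?" -> p
print(parameter_symbol("numerical", 2))    # last year vs this year -> mu_1 vs mu_2
```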

For more examples with explanations, look at the review material for the test. Look at the "parameter" group from the part 1 challenge or the "claim" group from the part 2 challenge.

Another place to look is the paper I gave you for your semester project. The "What can we test?" section has examples of categorical and quantitative data.

I've corrected the mistakes on hypothesis test review part 2, where twice 0.375 was incorrectly listed as 0.650 instead of 0.750. I also changed an explanation of finding the margin of error from a critical value and standard error. If you find any other mistakes, please let me know.

The link takes you to the NOAA website, but you need to follow the same instructions you did on a previous technology project for finding the average temperature. This means clicking on the Preliminary Climatology Data and then choosing the Decatur Airport.

You do not need to write the fractions for all of the percents in the table for question 1 on the paper you turn in. The goal is to notice a pattern, and the pattern shows up when they are written as fractions, not when they are written as percents.

Let's say your percents were 66% for three doors, 75% for four doors, and 80% for five doors. You need to write those as 66% = ?/3, 75% = ?/4, and 80% = ?/5. The numbers may not come out exactly, but they should be really close to a whole number so the fraction is really nice. Then look at all of the fractions and try to notice a pattern.

If (and this is not the right answer) your fractions were 1/3, 2/4, 3/5, 4/6, 5/7, etc., then you would hopefully notice that the numerator is 2 less than the denominator. Since the denominator is represented by n, then two less than n would be written as n-2 and a suitable pattern would be (n-2)/n. Feel free to use the equation editor to enter that so that it looks nice.
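If you want to check the arithmetic, the percent-to-fraction conversion is easy to script (using the made-up 66/75/80 percents from the example above, not the right answers):

```python
# doors -> percent, using the illustrative values from the example above
percents = {3: 66, 4: 75, 5: 80}

fractions = {}
for n, pct in percents.items():
    numerator = round(pct / 100 * n)   # nearest whole number: the "?" in ?/n
    fractions[n] = numerator
    print(f"{pct}% is about {numerator}/{n}")
```

With these numbers it prints 2/3, 3/4, and 4/5; your job is then to describe that numerator in terms of n.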

Many people are asking me "What about 12? It's a multiple of 3 and a multiple of 4." The instructions say to stop as soon as the first rule is matched, so you only check for multiples of 3 if the number is not prime. You only check for multiples of 4 if the number is not prime and not a multiple of 3. Another common mistake is that people are cheating 2 out of its primeness.

What I suggest doing is writing the numbers 2, 3, 4, ..., 12 down in a row. Then underneath each number, write its probability (we did this as an example of a probability distribution in class). Finally, on the third row, write P underneath each number that is prime. Then, for the numbers you have left, write 3 under all the ones that are multiples of 3. Then, for the numbers that you have left, write 4 under all the ones that are multiples of 4.

Finally, to find the probability of getting a prime, add up all the probabilities of those numbers that are prime. Since you lose $10 when the value is prime, enter the probability of being prime in the table under the -$10.
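The three-row table above can be sketched in Python as well, using the two-dice probability distribution we built in class (the category labels are just names for the rows):

```python
from fractions import Fraction

def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

# Probability of each sum (2..12) for two fair dice
prob = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}

def category(n):
    # Stop at the FIRST rule that matches, as the instructions say
    if is_prime(n):
        return "prime"
    if n % 3 == 0:
        return "multiple of 3"
    if n % 4 == 0:
        return "multiple of 4"
    return "other"

totals = {}
for s, p in prob.items():
    totals[category(s)] = totals.get(category(s), Fraction(0)) + p

print(totals["prime"])   # 5/12, the probability of landing on the -$10 outcome
```

Notice that 12 ends up under "multiple of 3" because the prime check fails and the multiple-of-3 check matches first, and that 2 counts as prime.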

Be sure that you are using the matching parentheses from the lower left of the toolbar or using the keyboard shortcut of control-9 (a single parenthesis is normally shift-9, so just hold control instead of shift). Do not type ( and ) separately from the keyboard as those don't grow with what's inside. This is especially obvious with the formula for the variance near the bottom of the chapter 3-4 notation. Failure to use the matching parentheses will keep you from getting full credit for the notation.

The labels Discrete and Continuous and the application of units apply to Quantitative/Numerical data only. They do not apply to Qualitative/Categorical data.

I will say this now (and probably many more times this semester): "It is important that you read!" Not just the problem, but the instructions and the book and the explanations of how to use Minitab on the website and ...

The goal of question 1 is for you to understand how manipulating the data affects the statistics. You need to be able to predict a new measure of center or spread based on the old data and transformation.

For example, if the mean is 50 and the median is 60 and you subtract 20 from each of the data values, then the mean becomes 30 and the median becomes 40. I do not want you to give me an answer of "it gets smaller" or "they get larger." The statement should not refer to the value of the constant (20) either.

An appropriate response might be: "When a constant is added to or subtracted from all of the data, the same constant is added to or subtracted from the measures of position."

Be careful with multiplication and division. There is a dashed line separating the variance from the range and standard deviation for a reason. Your answer should address both groups.
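If you want to check your prediction before writing your rule, a few lines of Python with made-up data will do it:

```python
import statistics as st

data = [40, 50, 60, 70, 80]           # hypothetical data for illustration
shifted = [x - 20 for x in data]      # subtract a constant from every value
scaled = [x * 3 for x in data]        # multiply every value by a constant

# Measures of position shift along with the data...
print(st.mean(data), st.mean(shifted))
# ...but measures of spread ignore the shift entirely
print(st.stdev(data), st.stdev(shifted))

# Multiplying scales the standard deviation (and range) by the constant,
# but the variance by the SQUARE of the constant
print(st.stdev(scaled) / st.stdev(data))        # ratio of 3
print(st.variance(scaled) / st.variance(data))  # ratio of 9
```

That last pair is the reason for the dashed line: the variance behaves differently from the range and standard deviation under multiplication.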