Technology Exercise 7: Inference When Variables Are Related
Internet Delays (Question 1)
Use the information that you gathered for technology exercise 6. If you don't have this information saved from last time, then you can regather
the information using the instructions for technology exercise 6.
- Choose File / Open Project. Navigate to the tech6 folder and
open your project.
- Highlight the window that has your data for the Internet delays in it. You may
need to go to the Window command to find it.
- Choose File / Save Current Worksheet As. Navigate to the tech7 folder and save
your worksheet with a name that's unique to your group.
- Choose File / New / Minitab Project
- Choose File / Open Worksheet. Find the worksheet you saved in step 3 and open
it.
- Choose File / Save Project As and give it a name that's unique to your group.
Is there a difference between the sites? (Part A)
We're testing the claim that there is no difference in the mean times of each
of the different sites. Since there are more than two samples, we need to use
the One Way Analysis of Variance.
- Choose Stat / ANOVA / One-Way. Choose the first "One-Way", not
"One-Way (unstacked)"; the unstacked version is for data stored in different
columns rather than identified by subscripts.
- The response variable is the time
- The factor is the site.
- Go into Graphs and turn on the Normal Plot of Residuals. The residuals from the
One Way ANOVA should be normally distributed. Be sure to address this as
one of your conditions.
- Click OK
Minitab will give you an ANOVA table and a set of confidence intervals. The confidence
intervals can be used to show which means are different. You don't specifically
have to list each one that is different, but explain how to interpret the
graph.
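If you want to see where the F statistic in the ANOVA table comes from, here is a minimal Python sketch of the one-way calculation. The site names and times are made up for illustration (they are not your Internet delay data), and the function name one_way_anova is just a label chosen for this example. Minitab does the same arithmetic for you.

```python
from statistics import mean

# Hypothetical delay times for three made-up sites; NOT real data.
times_by_site = {
    "site_a": [10.0, 12.0, 14.0],
    "site_b": [20.0, 22.0, 24.0],
    "site_c": [30.0, 32.0, 34.0],
}


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of label -> list of values."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)

    # Between-group (factor) sum of squares
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within-group (error) sum of squares
    ss_within = sum(
        (x - mean(values)) ** 2
        for values in groups.values()
        for x in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    # F is the ratio of the two mean squares
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_anova(times_by_site)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F means the sample means are spread out more than the variation within each site would suggest.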
Is there a difference between the sampling times? (Part B)
Repeat the last part, except this time, use "sample" as the factor, rather than
"site".
Hate Crimes (Question 2)
Start a new worksheet for this question.
Gathering the Data
- Visit the National Archive of Criminal Justice Data and their online
analysis of the 2001 Hate Crime Data at http://www.icpsr.umich.edu/cgi-bin/SDA12/hsda?nacjd+03720-0001.
- Check the Run frequency or crosstabulation box
- Click Start
- Enter ir10 for the row variable. IR10 contains the race of the offenders
- Enter ir11 for the column variable. IR11 contains the code for the 1st offense.
- In the Selection Filters box, enter "ir10(2,6) ir11(10-12,33)" but without the
quote marks. This will select races 2 (black) and 6 (white) and offense codes 10 (aggravated
assault), 11 (simple assault), 12 (intimidation) and 33 (destruction / vandalism).
- Turn off all percentaging
- Click Run the Table
Entering the Data into Minitab
- Label the columns as race, crime, and frequency.
- Enter "black" four times in the race column and "white" four times in the race
column.
- Come up with some abbreviations for the four offense codes (you can enter the
whole thing, but Minitab only displays the first eight characters in the
output) and enter those four abbreviations into Minitab for the blacks and
whites (each race should have the four crimes).
- Enter the frequencies from the Internet output into the frequency column.
Conducting the Hypothesis Test
- Choose Stat / Tables / Cross Tabulation
- The classification variables are race and crime (enter them both).
- Turn off all counts.
- Turn on the Chi-Square Analysis, check the "Above and standard residual" box.
- Tell it that Frequencies are in the frequency column
- Click OK
Each cell in the table will have three numbers. The first number is the observed
frequency, the second number is the expected frequency (expected under the
null hypothesis that race and crime are independent), and the last number is
the standardized residual (observed - expected) / sqrt(expected).
If the standardized residual is negative, then the observed frequencies were
lower than expected and if it's positive, then they were higher than expected.
This helps you find where there might be differences by looking for the large
standardized residuals.
The chi-square test statistic is the sum of the squares
of the standardized residuals. You can look up a critical value in Table
X and/or use the p-value to make your decision.
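The three numbers in each cell and the test statistic can be reproduced by hand. The following is a rough Python sketch using a made-up 2 x 4 table of counts (these are not the hate crime frequencies); all variable names are chosen just for this illustration.

```python
from math import sqrt

# Hypothetical 2 x 4 table of counts (2 races x 4 offenses); NOT real data.
observed = [
    [10, 20, 30, 40],
    [30, 20, 10, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Standardized residual: (observed - expected) / sqrt(expected)
std_resid = [
    [(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# Test statistic: sum of the squares of the standardized residuals
chi_square = sum(z ** 2 for row in std_resid for z in row)

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected:", expected)
print("chi-square:", chi_square, "df:", df)
```

Cells whose standardized residual is far from zero contribute the most to the test statistic.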
M&M Candies (Question 3)
Getting the Claimed Proportions
- Visit the M&M Mars website at http://www.mms.com/
- Click on the United States
- Click on About M&Ms
- Click on Products
- Click on Peanut
The color distribution is given at the lower left of the page. If you're color
blind, then write M&M Mars a note telling them their website isn't accessible
to persons with disabilities. In the meantime, have your partner(s) help you
out.
Entering the Data into Minitab
Start a new worksheet for this problem.
- Label columns as color, observed, proportion, expected, residuals
- Enter the six colors into the color column, the observed frequencies from activity
6 into the observed column, and the corresponding proportions from the M&M
website (as decimals, 20% = 0.2) into the proportion column. Do not enter
the total number of M&Ms into Minitab.
- Find the expected frequencies. Choose Calc / Calculator.
- Store the results in the expected column
- The expression is "proportion * TOTAL" (where TOTAL is the actual numeric value
of the total number of M&Ms). Alternatively, you can let Minitab figure that
out for you by going "proportion * sum(observed)"
- Click OK
- Calculate the residuals. Choose Calc / Calculator
- Store the results into the residuals column.
- The expression is "( observed - expected )**2 / expected". You can, of course,
use the column names and go (C2-C4)**2/C4. The **2 is Minitab's way of
squaring the results.
- Click OK
- Find the test statistic by adding up the residuals. Note that these are not the
standardized residuals that Minitab gave us in question 2, but they are the
squares of the standardized residuals.
Choose Calc / Column Statistics
- The input variable is the residuals
- You want to find the sum
- Click OK
The test statistic is where Minitab says "Sum of residuals = ".
The degrees of freedom is one less than the number of categories and you can
use Table X to find the critical value.
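The calculator steps above amount to just a few lines of arithmetic. Here is a minimal Python sketch with three made-up categories (substitute your own six colors, observed counts, and website proportions); the numbers are purely illustrative.

```python
# Hypothetical counts and claimed proportions for three made-up categories.
observed = [50, 30, 20]
proportion = [0.4, 0.4, 0.2]

# Same role as sum(observed) in the Minitab expression
total = sum(observed)

# Expected frequency: proportion * TOTAL
expected = [p * total for p in proportion]

# Each category's contribution: (observed - expected)**2 / expected
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

# Test statistic is the sum of the contributions
test_statistic = sum(contributions)

# Degrees of freedom: one less than the number of categories
df = len(observed) - 1

print("test statistic:", test_statistic, "df:", df)
```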
Finding the p-value
Since we didn't use the built-in routines of Minitab, it didn't give us a p-value.
The p-value is the probability of being more extreme than the test statistic.
Since this is a right tail test, it is the probability of being to the right
of the test statistic.
- Choose Calc / Probability Distributions / Chi-Square
- Choose Cumulative Probability
- Enter the proper number of degrees of freedom (one less than the number of categories)
- Click on Input Constant. Enter your test statistic as the constant
- Click OK
- Minitab returns the area to the left of the test statistic, but we need the area
to the right. Subtract the probability Minitab gives you from 1 to find the
p-value.
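To check the "1 minus the cumulative probability" step outside Minitab, here is a sketch in plain Python. The helper chi_square_cdf is a hypothetical function written for this example (it is not a Minitab routine); it uses the standard closed-form series for whole-number degrees of freedom. The test statistic and degrees of freedom shown are placeholders.

```python
from math import erfc, exp, pi, sqrt


def chi_square_cdf(x, df):
    """Area to the left of x under a chi-square curve (integer df >= 1)."""
    if x <= 0:
        return 0.0
    if df % 2 == 0:
        # Even df: right tail = exp(-x/2) * sum of (x/2)**i / i!
        # for i = 0 .. df/2 - 1
        term = 1.0
        total = 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        right_tail = exp(-x / 2) * total
    else:
        # Odd df: right tail = erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2)
        # * sum of x**(r - 0.5) / (1*3*...*(2r-1)) for r = 1 .. (df-1)/2
        term = sqrt(x)
        total = 0.0
        for r in range(1, (df - 1) // 2 + 1):
            if r > 1:
                term *= x / (2 * r - 1)
            total += term
        right_tail = erfc(sqrt(x / 2)) + sqrt(2 / pi) * exp(-x / 2) * total
    return 1.0 - right_tail


# Placeholder values; use your own test statistic and degrees of freedom.
test_statistic = 5.0
df = 2

left_area = chi_square_cdf(test_statistic, df)  # what Minitab reports
p_value = 1 - left_area                         # area to the right
print(f"p-value = {p_value:.4f}")
```

As a sanity check, plugging in a 5% critical value from a chi-square table (for example 11.070 with 5 degrees of freedom) should give a p-value very close to 0.05.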
Seat Belt and Traffic Fatalities (Question 4)
Entering the Data into Minitab
Start a new worksheet for this problem.
- Label columns as year, seatbelt, and fatality
- Gather the information for 1985 through 2001 from the June 2003 Safety Belt Usage
in Illinois. This information is from the Illinois Department of Transportation and is at
http://www.dot.state.il.us/safetybeltjune2003.pdf. The data is in a chart, so you'll have to read the percents from the top of
the bars. The data is from 1985 to 2003, but the fatality data in the next
part only goes to 2001.
- Gather the information for 1985 through 2001 from the Illinois 2001 Toll of Motor
Vehicles page from the National Highway Traffic Safety Administration at http://www.nhtsa.dot.gov/STSI/State_Info.cfm?Year=2001&State=IL. There is a table toward the bottom of the page that is titled "Fatalities and
Fatality Rate per 100 Million VMT". You want the Total Fatality Rate column. There is information for 1982 through
2001, but the seatbelt site only contains information from 1985 on.
Making the Fitted Line Plot
- Choose Stat / Regression / Fitted Line Plot
- The response variable is fatality
- The predictor variable is seatbelt
- Go into Storage and turn on the residuals
- Click OK
The Fitted Line plot also contains the regression equation and the value of r², the percent of the variation that can be explained by the regression model.
The output in the session window on Minitab gives much of the same information
including an ANOVA table that contains the F test statistic and the p-value
that can be used for checking correlation.
Checking for Significant Linear Correlation
While the p-value can be found from the ANOVA table, it doesn't give the value
of r, the correlation coefficient.
- Choose Stat / Basic Statistics / Correlation
- The two variables are seatbelt and fatality (order doesn't matter)
- Click OK
The output gives you the correlation coefficient first and the p-value second.
The null hypothesis is that there is no significant linear correlation.
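For reference, the regression equation, r, r², and the residuals can all be computed from the same few sums. The sketch below uses made-up seatbelt and fatality values (not the Illinois data) and variable names chosen for illustration. The t statistic shown is the usual one for testing linear correlation, with n - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean

# Hypothetical values; NOT the Illinois data.
seatbelt = [10.0, 20.0, 30.0, 40.0, 50.0]   # predictor (percent usage)
fatality = [3.0, 2.6, 2.5, 2.1, 1.8]        # response (fatality rate)

n = len(seatbelt)
x_bar = mean(seatbelt)
y_bar = mean(fatality)

# Sums of squares and cross products about the means
sxx = sum((x - x_bar) ** 2 for x in seatbelt)
syy = sum((y - y_bar) ** 2 for y in fatality)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(seatbelt, fatality))

# Least-squares line: fatality = intercept + slope * seatbelt
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Correlation coefficient and coefficient of determination
r = sxy / sqrt(sxx * syy)
r_squared = r ** 2

# Test statistic for the null hypothesis of no linear correlation
t_stat = r * sqrt((n - 2) / (1 - r_squared))

# Residuals: actual value minus the value estimated by the line
residuals = [y - (intercept + slope * x) for x, y in zip(seatbelt, fatality)]

print(f"fatality = {intercept:.4f} + ({slope:.4f}) seatbelt")
print(f"r = {r:.4f}, r-squared = {r_squared:.4f}, t = {t_stat:.2f}")
```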
Checking Residuals for Normality
One of the assumptions in regression is that the residuals (the differences between
the estimated value and the actual value) have a normal distribution. If you
turned on the storage of residuals during the fitted line plot part, then you
should have a new variable called RESI1 that contains the residuals.
- Choose Stat / Basic Statistics / Normality Test
- The variable is RESI1
- Click OK
Give the Regression Equation
If you determined that there was significant linear correlation (positive or
negative) by rejecting the null hypothesis of no significant linear correlation,
then you should use the regression equation given by the computer. This was
found when you did the fitted line plot. Your equation should look something
like "fatality rate = 3.03814 - 0.0230872 seatbelt" (probably not that exactly).
If, however, you decided that there was no significant linear correlation because
you retained the null hypothesis of the correlation test, then you should use
the mean of y (y-bar) for the estimated equation. Your equation should be something
like "fatality rate = ####" where #### is the numerical value of the mean of
the fatality variable. You'll have to do descriptive statistics to find out
what that is.
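The decision rule above can be summarized in a short Python sketch. The function name, its parameters, and the 0.05 significance level are all assumptions made for this illustration; use whatever significance level your problem calls for.

```python
from statistics import mean


def predict_fatality(seatbelt_value, p_value, intercept, slope,
                     fatality_values, alpha=0.05):
    """Pick the estimate based on the correlation test (hypothetical helper)."""
    if p_value < alpha:
        # Significant linear correlation: use the regression equation
        return intercept + slope * seatbelt_value
    # No significant linear correlation: use the mean of y (y-bar)
    return mean(fatality_values)


# Made-up numbers, only to show both branches.
with_correlation = predict_fatality(50.0, 0.01, 3.0, -0.02, [2.0, 3.0])
without_correlation = predict_fatality(50.0, 0.50, 3.0, -0.02, [2.0, 3.0])
print(with_correlation, without_correlation)
```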
Removing the Outlier
There was one year that had a really low seatbelt usage rate. Copy your data
into another worksheet and remove that row (that way you'll still have the
original data should you need to come back to it).
- Highlight all of the data including the variable names. Choose Edit
/ Copy or hit control-C.
- Choose File / New / New Worksheet
- Click in the label cell for C1 and choose Edit / Paste or press control-V.
- Highlight the row to be deleted by clicking on the row number to the left of
the data.
- Press del (or right click and choose delete row).
Then repeat the analysis on the new worksheet, starting at the part that says "Making the Fitted Line Plot".
There are a couple of ways to tell which is the better model. You can look at
the p-values (smaller means more significant results), the correlation coefficients
(further from zero means stronger correlation), or the value of r². Some of these might be the same, in which case you'll have to look at something
else.
Comment on which model is a better predictor of fatality rates in Illinois.