Technology Exercise 1: Exploring and Understanding Data
Collecting the data
Go to the Box Office reviews at Yahoo Movies at
http://movies.yahoo.com/boxoffice/latest/rank.html.
You may want to right-click that link and choose "Open
in New Window" so that you can continue to have these instructions available
to you. Depending on the time of the week, you will either get the top 10
movies for the weekend or the full 100 from the last week. If you get the
one that has the top 10 movies for the weekend, don't use it as the data
is incomplete for that weekend. You may gather previous weekends' worth of
data, though.
Creating the Worksheet
You may use either Excel or Minitab to collect the data. If you use Excel,
you can collect the data at home and then bring it into school on a floppy
disk to import into Minitab. If you enter the data directly into Minitab, you
won't have to do the import, but you'll have to be at school.
Read this entire section before doing anything on the computer.
- Label columns as "title", "weekend", "gross", "theaters", "critic", "yahoo",
and "mpaa".
- For each movie on the web page, look for those that have been in release
for one week, ranked in the top 10 movies for that week, and were showing
in at least 2000 theaters.
- Enter the title, beginning date of the weekend (for example: November
21, 2003, would be entered as 11/21/03), the weekend gross in millions
of dollars ($37,062,535 would get entered as 37.062535), and the number
of theaters in the columns below the headings. Do not enter commas or
dollar signs into any of the values.
- Click on the title of the movie and it will bring you to another page
that contains a critic's rating, yahoo users' rating, and the MPAA rating.
Enter those values in the next three columns. Make sure you are consistent
in your data entry (always use the same case for the ratings and always
use either "pg-13" or "pg13", but don't mix them).
Use the back button to get back to the data.
- When you are done collecting information about this week, click on the
pull down menu under "Archived Charts", select the previous week,
and then click "GO". Enter the information for the new movies for
that week and then repeat the process, choosing previous weeks until you
can choose no more.
- The menu of archived weeks only goes back 12 weeks, but earlier
weeks are available. The instructor has gone back and collected information
for some of the earlier months so that you can still use it. That table
appears below these instructions.
Begin with that data and then gather the more recent data.
- Save your spreadsheet to the R: drive. Use the R:\xx\tech2 folder,
where xx is your section number (01, 02, or 03). Save it under a name
that is unique to your group.
If you are collecting this data at home, then save it onto a floppy disk
to bring into school. If you feel comfortable making a folder to save your
information into, then go ahead and save all your files in that folder.
From now on, whenever you need to work with this data, open up your project
by going to File and choosing Open Project.
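The data-entry rules above (no commas or dollar signs, gross in millions, short dates, one consistent spelling for ratings) can be sketched in code. This is only an illustration in Python, not part of the assignment, and the function names are made up for this example.

```python
from datetime import datetime


def gross_in_millions(text):
    """Turn a string such as '$37,062,535' into millions of dollars."""
    digits = text.replace("$", "").replace(",", "").strip()
    return int(digits) / 1_000_000


def short_date(text):
    """Turn 'November 21, 2003' into the m/d/yy form, '11/21/03'."""
    day = datetime.strptime(text, "%B %d, %Y")
    return f"{day.month}/{day.day}/{day:%y}"


def clean_mpaa(text):
    """Lower-case an MPAA rating and drop the hyphen: 'PG-13' -> 'pg13'."""
    return text.strip().lower().replace("-", "")


print(gross_in_millions("$37,062,535"))  # 37.062535
print(short_date("November 21, 2003"))   # 11/21/03
print(clean_mpaa("PG-13"))               # pg13
```

Only the MPAA rating loses its hyphen here. The critic and user grades need to keep theirs, since "b-" and "b" are different grades.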
Data from May 21 - July 16, 2004
title | weekend | gross | theaters | critic | yahoo | mpaa
I, Robot | 7/16/2004 | 52.179887 | 3420 | b- | b+ | pg13
A Cinderella Story | 7/16/2004 | 13.623350 | 2625 | c- | b | pg
Anchorman | 7/9/2004 | 28.416365 | 3091 | b | b- | pg13
King Arthur | 7/9/2004 | 15.193907 | 3086 | c+ | b | pg13
Sleepover | 7/9/2004 | 4.171226 | 2207 | c | c | pg
Spiderman 2 | 7/2/2004 | 115.817364 | 4152 | a- | a- | pg13
White Chicks | 6/25/2004 | 19.676748 | 2726 | c | b | pg13
The Notebook | 6/25/2004 | 13.464745 | 3020 | c+ | b+ | pg13
Two Brothers | 6/25/2004 | 6.144160 | 2175 | b- | b- | pg
Dodgeball | 6/18/2004 | 30.070196 | 2694 | b- | b | pg13
The Terminal | 6/18/2004 | 19.053199 | 2811 | b | b | pg13
Around the World | 6/18/2004 | 7.576132 | 2801 | c | c+ | pg
Chronicles of Riddick | 6/11/2004 | 24.289165 | 2757 | c | b- | pg13
Garfield | 6/11/2004 | 21.727611 | 3094 | c- | b- | pg
The Stepford Wives | 6/11/2004 | 21.406781 | 3057 | c+ | c+ | pg13
Harry Potter | 6/4/2004 | 93.687367 | 3855 | b+ | b+ | pg
The Day After Tomorrow | 5/28/2004 | 85.807341 | 3425 | c+ | b | pg13
Raising Helen | 5/28/2004 | 14.239252 | 2717 | c- | b- | pg13
Shrek 2 | 5/21/2004 | 108.037878 | 4163 | b | a- | pg
You cannot copy and paste
this table directly into Minitab. Follow these instructions to save typing.
If you have already gathered all of your own data, then don't copy the headings
in the steps below, and make sure the data I've entered exactly matches
the way you've entered yours.
- Highlight the table on the web page including the headings.
- Copy the data from the web page
- Open up Microsoft Excel and paste the data
- Copy the data from Excel (It should still be highlighted)
- Switch to Minitab and paste the data. Make sure you've clicked in the grey
cell above row 1 where the labels go. If you didn't copy the labels, then
start in row 1.
Specifying the order in Minitab
We want to make sure the data is displayed in the proper order. Normally,
Minitab will display data in alphabetical order, but that will cause problems
with what we have here. Do the following for the "critic", "yahoo",
and "mpaa" columns.
The order only affects the output. The data will look the
same as it did before this step.
- Click anywhere in a column that you want to specify the order for.
- Right-click and choose Column and then Value Order.
- Check the User-specified order box
- In the Define an order window on the right, type the order you want the
values to appear in. You can choose whether you want A+ first or last and
the order you want the ratings in.
- Click OK and then repeat for the other two variables.
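To see why alphabetical order causes problems for letter grades, here is a small Python sketch (not something you need for the assignment). The grade list and the helper name are my own choices for illustration. You may prefer a different order, such as worst to best.

```python
# Letter grades from best to worst. This particular order is an assumption;
# put the list in whatever order you want the output to use.
GRADE_ORDER = ["a+", "a", "a-", "b+", "b", "b-",
               "c+", "c", "c-", "d+", "d", "d-", "f"]


def in_grade_order(values):
    """Sort grades by their position in GRADE_ORDER, not alphabetically."""
    rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}
    return sorted(values, key=lambda grade: rank[grade])


critic = ["b-", "c-", "b", "c+", "c", "a-", "b+"]
print(sorted(critic))          # alphabetical: "b" lands ahead of "b+"
print(in_grade_order(critic))  # user-specified: "b+" comes before "b"
```

As with Value Order in Minitab, only the order of the output changes. The list called critic is left exactly as it was entered.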
Displaying Categorical Data (Question 2)
There are several ways to explore the data. You will need to pick which methods
to use with which variables. Below you will find the Minitab instructions for
several types. There is nothing that says you have to use the same variables
I do, just change the name of the variable where appropriate in the instructions.
Frequency Table
Frequency tables are appropriate for categorical data or quantitative data
with limited numbers of responses. All you can do with the frequency table
is count information where each data point is counted only once. If you have
frequency data then use the contingency table (Cross tabulation).
- Choose Stat / Tables / Tally Individual Values
- Click in the Variables box and then double click on the variables you want
to tally.
- (Optional) Check the type of tallies you want. The default is counts and
that is usually good enough. You may want percents. Cumulative counts or
percents tell you what part of the sample data is either that category or
below. This only makes sense when you have ordered data.
- Click OK
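The tally described above (counts, percents, and their cumulative versions) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Minitab's own code, and the function name tally is hypothetical. The data is the mpaa column from the instructor's table.

```python
from collections import Counter


def tally(values, order):
    """Return (category, count, percent, cum_count, cum_percent) rows.

    Categories are listed in the given order, which is what makes the
    cumulative columns meaningful.
    """
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for category in order:
        count = counts.get(category, 0)
        running += count
        rows.append((category, count, 100 * count / n,
                     running, 100 * running / n))
    return rows


mpaa = ["pg13", "pg", "pg13", "pg13", "pg", "pg13", "pg13", "pg13", "pg",
        "pg13", "pg13", "pg", "pg13", "pg", "pg13", "pg", "pg13", "pg13", "pg"]
for row in tally(mpaa, ["pg", "pg13"]):
    print(row)
```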
Bar Chart
Bar charts are appropriate for categorical data. They allow you to graph a
function (count, mean, st. dev) of a quantitative (measurement) level variable
that is categorized by another variable.
For example, let's say that you want to know how many theaters are showing
each rating of movie.
- Choose Graph / Bar Chart
- Decide on the type of bar chart
- If each case is to have equal weight and you just want to count the
number of times values appear, then let each bar represent counts of
unique values. This would be useful to count the number of movies for
each rating. If you choose this option, you will need to supply the categorical
variable. There will be a bar for each value of the categorical variable.
- If you want to find some statistical function for a variable, then
choose to let each bar represent a function of a variable. If you want
to know the total or average gross for movies by their rating, then you
would choose this. If you choose this, then you get to pick a function
like mean (average), count, or sum (total). The graph variable is the
column that you want to apply the function to and the categorical variable
is the column that contains which category the item falls into.
- (Optional - Recommended) Choose labels and add a title to the
graph that describes what we're looking at.
- Choose labels and switch to the data labels tab.
Check the use y-value labels radio button.
- Click OK
After the graph is generated, you can right click the mouse button on parts
of it to make changes. For example, you could right click over the bars and
change the formatting so that the background has slashes instead of a solid
color.
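The "function of a variable" kind of bar chart boils down to grouping one column by another and applying a function to each group. Here is a rough Python sketch of that idea using the theaters and mpaa columns from the instructor's table. The helper name summarize_by is made up for this example, and it only computes the bar heights. It does not draw anything.

```python
from collections import defaultdict
from statistics import mean


def summarize_by(categories, values, func):
    """Apply func (sum, mean, len, ...) to the values in each category."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    return {category: func(group) for category, group in groups.items()}


mpaa = ["pg13", "pg", "pg13", "pg13", "pg", "pg13", "pg13", "pg13", "pg",
        "pg13", "pg13", "pg", "pg13", "pg", "pg13", "pg", "pg13", "pg13", "pg"]
theaters = [3420, 2625, 3091, 3086, 2207, 4152, 2726, 3020, 2175, 2694,
            2811, 2801, 2757, 3094, 3057, 3855, 3425, 2717, 4163]

print(summarize_by(mpaa, theaters, sum))   # total theaters for each rating
print(summarize_by(mpaa, theaters, mean))  # average theaters for each rating
print(summarize_by(mpaa, theaters, len))   # number of movies for each rating
```

Passing len gives the same counts as the "counts of unique values" chart, which shows that the two chart types differ only in the function applied.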
Pie Charts
Pie charts are useful when you want to summarize a categorical variable graphically.
You may either have just one variable or another variable that specifies frequencies
(like the number of theaters).
The categorical variable represents the individual slices of the pie.
- Choose Graph / Pie Chart
- Choose the type of data you have.
- If you have frequency data, then check the Chart values from a table box
and enter the Categorical variable and make your frequency variable the
summary variable.
- If you just have raw data and not frequencies, then select the Chart
raw data box. The Categorical variables should contain the variable you
wish to graph.
- (Optional) If you have several categories that have very small percentages,
then click on Pie Chart Options and enter a value into the "Combine
slices less than ___ % into one group" line. If you entered
2 here, then any category with less than 2% of the values would be lumped
into a group called "Other".
- (Optional - Recommended) Enter a title for the graph. Click on Labels and
add a title. Otherwise it will say "Pie Chart of ____" where the
blank is the name of the variable you graphed.
- Click on Labels and change to the Slice Labels tab. Click either frequency
to show the count in each slice or percent to show the percent of the area
in each slice. You can also check the Category Name box to have the name
of the category displayed next to the slice. This is more readable than
using just the legend.
- Click OK
Under the Data Options menu, you can change to the Group Options tab and tell
it not to include missing as a group. This would be good if you only want the
pie chart to represent the data you know.
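The "combine slices less than ___ %" option is easy to picture with a little arithmetic. The Python sketch below works out the count and percent for each slice from raw data and lumps the small ones into "Other". The function name and the 6% cutoff in the example are my own choices, and this is only a guess at the general idea, not Minitab's actual procedure.

```python
from collections import Counter


def pie_slices(values, min_percent=0.0):
    """Return {category: (count, percent)}, lumping small slices together."""
    counts = Counter(values)
    n = sum(counts.values())
    slices, other = {}, 0
    for category, count in counts.items():
        if 100 * count / n < min_percent:
            other += count          # too small: goes into "Other"
        else:
            slices[category] = count
    if other:
        slices["Other"] = other
    return {category: (count, 100 * count / n)
            for category, count in slices.items()}


critic = ["b-", "c-", "b", "c+", "c", "a-", "c", "c+", "b-", "b-",
          "b", "c", "c", "c-", "c+", "b+", "c+", "c-", "b"]
for category, (count, percent) in pie_slices(critic, min_percent=6).items():
    print(category, count, round(percent, 1))
```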
Box Plots (Question 3)
For this example, I'll display the opening weekend gross by the MPAA rating.
Note that this is not what I asked you to create. You will need to change the instructions
to match what I've asked for.
- Choose Graph / Boxplot
- Choose the type of boxplot. Use simple if you only have one variable to
graph and choose With Groups if you would like side-by-side box plots where
the data is broken down by some classification variable. In this example,
I'll use groups.
- For the graph variables, choose the data you want to make the boxplot
for. I would use gross for this example.
- For the categorical variables, choose the way you want to categorize the
data. I would use MPAA for this example.
- (Optional) Add a point for the mean. Click on Data View and check the box
for Mean Symbol.
- (Optional - Recommended) Click on Labels and add a Title
- (Optional) Display which points are outliers. In this case, I could label
the outlier points with the title of the movie so I could see which movies
were outliers.
- Choose Labels / Data Labels
- Click on the pull down box for Labels to say Outliers.
- Tell it to "Use labels from ____" and put the title variable
in the blank. If you leave this step off, it will put the numerical value
of the outlier, which may be okay depending on what you want.
- Click OK
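If you want to check which points a boxplot would flag as outliers, the usual rule is that anything more than 1.5 times the interquartile range beyond the quartiles is an outlier. The Python sketch below applies that rule to the pg13 movies from the instructor's table and keeps the titles so the outliers can be labeled. The function name is hypothetical, and I am assuming the quartile method here (Python's default "exclusive" method) is close enough to Minitab's. Different software can give slightly different quartiles.

```python
from statistics import quantiles


def box_summary(labeled_values):
    """Summarize (label, value) pairs the way a boxplot draws them."""
    values = sorted(value for _, value in labeled_values)
    q1, median, q3 = quantiles(values, n=4)  # default "exclusive" method
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [(label, v) for label, v in labeled_values
                if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whiskers": (min(inside), max(inside)), "outliers": outliers}


pg13 = [("I, Robot", 52.179887), ("Anchorman", 28.416365),
        ("King Arthur", 15.193907), ("Spiderman 2", 115.817364),
        ("White Chicks", 19.676748), ("The Notebook", 13.464745),
        ("Dodgeball", 30.070196), ("The Terminal", 19.053199),
        ("Chronicles of Riddick", 24.289165), ("The Stepford Wives", 21.406781),
        ("The Day After Tomorrow", 85.807341), ("Raising Helen", 14.239252)]
print(box_summary(pg13))
```

For side-by-side boxplots you would call the function once per group, for example once for the pg movies and once for the pg13 movies.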
Numerical Descriptions (Question 4)
You will use this a lot in this course. Minitab knows that and so it is the
first choice under the Stat menu. It gives you the sample size, mean, median,
trimmed mean, standard deviation, standard error of the mean, minimum, maximum,
first quartile, and third quartile.
- Choose Stat / Basic Statistics / Display Descriptive Statistics
- You may describe several variables. Each variable will get its own row
in the output.
- (Optional) You may describe the data by another variable. This is useful
if you want to compare movies by another variable like their MPAA rating.
You would specify "By variable ____" and then
put the classification variable in that spot.
- (Optional) You can control the statistics that are displayed by clicking
on Statistics and checking or unchecking statistics. In particular, we almost
never use N*, the number of missing values.
- Click OK
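For anyone curious where the numbers in that output come from, here is a rough Python sketch that computes most of them for the gross column, overall and then by MPAA rating. The function name describe is made up, the trimmed mean is left out, and the quartiles use Python's default method, which I am assuming is close to Minitab's. Treat it as an illustration only.

```python
from math import sqrt
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Sample size, center, spread, and quartiles for one column."""
    n = len(values)
    s = stdev(values)                  # sample standard deviation (n - 1)
    q1, _, q3 = quantiles(values, n=4)
    return {"N": n, "Mean": mean(values), "SE Mean": s / sqrt(n),
            "StDev": s, "Minimum": min(values), "Q1": q1,
            "Median": median(values), "Q3": q3, "Maximum": max(values)}


gross = [52.179887, 13.623350, 28.416365, 15.193907, 4.171226, 115.817364,
         19.676748, 13.464745, 6.144160, 30.070196, 19.053199, 7.576132,
         24.289165, 21.727611, 21.406781, 93.687367, 85.807341, 14.239252,
         108.037878]
mpaa = ["pg13", "pg", "pg13", "pg13", "pg", "pg13", "pg13", "pg13", "pg",
        "pg13", "pg13", "pg", "pg13", "pg", "pg13", "pg", "pg13", "pg13", "pg"]

print(describe(gross))
# "By variable": describe the gross separately for each rating.
for rating in ["pg", "pg13"]:
    group = [g for g, m in zip(gross, mpaa) if m == rating]
    print(rating, describe(group))
```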
Normal Probability Plots (Question 5)
There are two places to generate a normal probability plot. You can either
go to Graph / Probability Plots or to Stat / Basic Statistics / Normality Test.
The first does everything the second does and also includes a confidence interval
band (which doesn't mean anything to you right now, but will later).
- Choose Graph / Probability Plots / Single
- Double click on the name of the variable you want to test for normality.
- (Optional) Click on Labels and add a title. The default title is pretty
good.
- Click OK
The explanation for interpreting the normal probability plot is given in your
book.
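A normal probability plot pairs each sorted data value with the z-score you would expect for that position if the data were normal. The Python sketch below computes those pairs. The plotting position (i - 0.3) / (n + 0.4) is an assumption on my part, since packages differ on the exact formula, and the function name is hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(values):
    """Return (expected z-score, data value) pairs for a probability plot."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [(standard_normal.inv_cdf((i - 0.3) / (n + 0.4)), value)
            for i, value in enumerate(ordered, start=1)]


theaters = [3420, 2625, 3091, 3086, 2207, 4152, 2726, 3020, 2175, 2694,
            2811, 2801, 2757, 3094, 3057, 3855, 3425, 2717, 4163]
for z, value in normal_plot_points(theaters):
    print(round(z, 2), value)
```

If the data is close to normal, plotting the values against the z-scores gives points that fall near a straight line.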
Some of the more observant students will notice that the standard deviation
here does not agree with the standard deviation from the descriptive statistics.
The rest of you will be asking, "How long does this project go on for?" The
standard deviation here assumes that your data is the entire population, whereas
the standard deviation from the descriptive statistics assumes that the data
is from a sample. The formulas are slightly different, but they are both a
measure of spread.
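The difference between the two standard deviations is just the divisor: n for a population and n - 1 for a sample. Here is a short Python sketch that works both out from the same sum of squared deviations and compares them with the functions in Python's statistics module. The function name is made up for the example.

```python
from math import sqrt
from statistics import pstdev, stdev


def spread_both_ways(values):
    """Return (population st. dev., sample st. dev.) for the same data."""
    n = len(values)
    center = sum(values) / n
    sum_sq = sum((x - center) ** 2 for x in values)
    return sqrt(sum_sq / n), sqrt(sum_sq / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]
population, sample = spread_both_ways(data)
print(population, pstdev(data))  # divide by n
print(sample, stdev(data))       # divide by n - 1
```

The sample version is always a little larger, and the gap shrinks as the number of data values grows.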
I want to explain the Anderson-Darling (AD) Normality Test values on the right
side. The p-value is the probability of getting results at least as unusual
as the ones we did if the data is normally distributed. If the p-value is
small (say less than 5% or 0.05), then there is a very small chance that we
would get these results if the data were normal. If our p-value was 0.015,
which is less than 5%, we would say that our data is unusual for a normally
distributed population. Since our results are unusual, we'll reject the
assumption that our data is normally distributed. This should agree with the
book's explanation about the data falling along the line, but gives a slightly
more definite approach (an actual number instead of just "close").
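For the curious, here is a rough Python sketch of how an Anderson-Darling statistic and an approximate p-value can be computed for the gross column. It uses the textbook formula for the statistic and a commonly quoted small-sample approximation for the p-value. Whether Minitab uses exactly these steps is an assumption on my part, so expect the numbers to be close to the Minitab output and not necessarily identical. The function name is hypothetical.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(values):
    """Return (A-squared, approximate p-value) for a test of normality."""
    x = sorted(values)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))  # normal curve fitted to the data
    cdf = [fitted.cdf(v) for v in x]
    total = sum((2 * i + 1) * (log(cdf[i]) + log(1 - cdf[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    a = a2 * (1 + 0.75 / n + 2.25 / n ** 2)  # small-sample adjustment
    if a >= 0.6:
        p = exp(1.2937 - 5.709 * a + 0.0186 * a * a)
    elif a > 0.34:
        p = exp(0.9177 - 4.279 * a - 1.38 * a * a)
    elif a > 0.2:
        p = 1 - exp(-8.318 + 42.796 * a - 59.938 * a * a)
    else:
        p = 1 - exp(-13.436 + 101.14 * a - 223.73 * a * a)
    return a2, p


gross = [52.179887, 13.623350, 28.416365, 15.193907, 4.171226, 115.817364,
         19.676748, 13.464745, 6.144160, 30.070196, 19.053199, 7.576132,
         24.289165, 21.727611, 21.406781, 93.687367, 85.807341, 14.239252,
         108.037878]
a2, p = anderson_darling_normal(gross)
print("AD =", round(a2, 3), "p-value =", round(p, 4))
print("Reject normality" if p < 0.05 else "Do not reject normality")
```

The last line is the decision rule from the paragraph above: a p-value below 0.05 means the data would be unusual for a normal population.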
Making a scatter plot (Question 6)
A scatter plot is appropriate when you have two quantitative (measurement level)
variables and you want to see if they're correlated with each other.
The y variable will be "gross" and the x variable will be "theaters".
- Choose Graph / Scatterplot / Simple
- Click in the Y column for graph 1. Double click on the response variable
(gross)
- Click in the X column for graph 1. Double click on the predictor variable
(theaters)
- (Optional - Recommended) Add a title to the graph by choosing Labels
/ Title
- Click OK
You may have more than one scatter plot on the same graph. If you do this,
you probably want to use the same x variable for both, otherwise things can
get really confusing.
You can also change the color of the dots. To do this, click on the dots with
the left mouse button and then click the right mouse button and choose Edit
Symbols. If it doesn't say Edit Symbols, you don't have the dots selected.
Choose custom and then you can edit the way they look. This technique works
for editing all parts of the graphs -- highlight the part you want to edit
and then right click and choose Edit (or control-T is the keyboard shortcut).
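A scatter plot shows the relationship, and the correlation coefficient puts a number on it. Here is a small Python sketch that computes Pearson's r for theaters and gross from the instructor's table. The function name is my own, and this is offered only as an illustration of the formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


theaters = [3420, 2625, 3091, 3086, 2207, 4152, 2726, 3020, 2175, 2694,
            2811, 2801, 2757, 3094, 3057, 3855, 3425, 2717, 4163]
gross = [52.179887, 13.623350, 28.416365, 15.193907, 4.171226, 115.817364,
         19.676748, 13.464745, 6.144160, 30.070196, 19.053199, 7.576132,
         24.289165, 21.727611, 21.406781, 93.687367, 85.807341, 14.239252,
         108.037878]
print(round(pearson_r(theaters, gross), 3))
```

A value near 1 or -1 means the dots fall close to a line. A value near 0 means there is little linear pattern.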
Regression Analysis (Question 7)
- Choose Stat / Regression / Regression
- The response variable is "gross" and the predictor variable is "theaters".
- Click OK
There is no graphical output here. Just copy the text into Word and explain
what you're looking at.
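To help you recognize the main pieces of the regression output, here is a minimal Python sketch of a least-squares line with "gross" as the response and "theaters" as the predictor. It only computes the intercept, slope, and R-squared. Minitab's output contains more than this, and the function name is made up for the example.

```python
def least_squares(xs, ys):
    """Return (intercept, slope, r_squared) for the line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy ** 2 / (sxx * syy)


theaters = [3420, 2625, 3091, 3086, 2207, 4152, 2726, 3020, 2175, 2694,
            2811, 2801, 2757, 3094, 3057, 3855, 3425, 2717, 4163]
gross = [52.179887, 13.623350, 28.416365, 15.193907, 4.171226, 115.817364,
         19.676748, 13.464745, 6.144160, 30.070196, 19.053199, 7.576132,
         24.289165, 21.727611, 21.406781, 93.687367, 85.807341, 14.239252,
         108.037878]
intercept, slope, r_squared = least_squares(theaters, gross)
print(f"gross = {intercept:.3f} + {slope:.5f} theaters")
print(f"R-Sq = {100 * r_squared:.1f}%")
```

The slope is the predicted change in gross (in millions of dollars) for each additional theater.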