Technology Exercise 1: Exploring and Understanding Data

Collecting the data

Go to the Box Office reviews at Yahoo Movies at http://movies.yahoo.com/boxoffice/latest/rank.html. You may want to right-click that link and choose "Open in New Window" so that these instructions stay available to you. Depending on the time of the week, you will get either the top 10 movies for the weekend or the full 100 from the last week. If you get the list with only the top 10 movies for the weekend, don't use it, as the data is incomplete for that weekend. You may gather previous weekends' worth of data, though.

Creating the Worksheet

You may use either Excel or Minitab to collect the data. If you use Excel, you can collect the data at home and then bring it into school on a floppy disk to import into Minitab. If you enter the data directly into Minitab, you won't have to do the import, but you'll have to be at school.

Read this entire section before doing anything on the computer.

  1. Label columns as "title", "weekend", "gross", "theaters", "critic", "yahoo", and "mpaa".
  2. For each movie on the web page, look for those that have been in release for one week, ranked in the top 10 movies for that week, and were showing in at least 2000 theaters.
    1. Enter the title, beginning date of the weekend (for example: November 21, 2003, would be entered as 11/21/03), the weekend gross in millions of dollars ($37,062,535 would get entered as 37.062535), and the number of theaters in the columns below the headings. Do not enter commas or dollar signs into any of the values.
    2. Click on the title of the movie and it will bring you to another page that contains a critic's rating, yahoo users' rating, and the MPAA rating. Enter those values in the next three columns. Make sure you are consistent in your data entry (always use the same case for the ratings and always use either "pg-13" or "pg13", but don't mix them). Use the back button to get back to the data.
  3. When you are done collecting information about this week, click on the pull down menu under "Archived Charts", select the previous week, and then click "GO". Enter the information for the new movies for that week and then repeat the process, choosing previous weeks until you can choose no more.
  4. The menu of archived weeks only allows for the last 12 weeks, but the previous weeks are available. The instructor has gone back and collected information for some of the earlier months, so that you can still use that information. That table of information appears below these instructions. Begin with this data and then gather the more recent data.
  5. Save your spreadsheet into the R: drive. Use the R:\xx\tech2 folder where xx is your section number (01, 02, or 03). Save it as a name that is unique for your group. If you are collecting this data at home, then save it onto a floppy disk to bring into school. If you feel comfortable making a folder to save your information into, then go ahead and save all your files in that folder.
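
If you collect the figures somewhere other than Excel or Minitab, the data entry rules from step 2 can be sketched in Python. This is only an illustration and not part of the assignment; the function names `gross_in_millions`, `weekend_date`, and `normalize_mpaa` are made up for this example.

```python
from datetime import datetime


def gross_in_millions(raw: str) -> float:
    """Turn a figure like '$37,062,535' into millions of dollars."""
    digits = raw.replace("$", "").replace(",", "").strip()
    return int(digits) / 1_000_000


def weekend_date(text: str) -> str:
    """Turn 'November 21, 2003' into the 11/21/03 style used in the worksheet."""
    return datetime.strptime(text, "%B %d, %Y").strftime("%m/%d/%y")


def normalize_mpaa(raw: str) -> str:
    """Lower-case an MPAA rating and drop the hyphen so 'PG-13' and 'pg13' match."""
    return raw.strip().lower().replace("-", "")


print(gross_in_millions("$37,062,535"))   # 37.062535
print(weekend_date("November 21, 2003"))  # 11/21/03
print(normalize_mpaa("PG-13"))            # pg13
```

Only use the hyphen-stripping helper on the MPAA column. The critic and yahoo grades need their minus signs.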

From now on, whenever you need to work with this data, open up your project by going to File and choosing Open Project.

Data from May 21 - July 16, 2004

title                   weekend    gross       theaters  critic  yahoo  mpaa
I, Robot                7/16/2004  52.179887   3420      b-      b+     pg13
A Cinderella Story      7/16/2004  13.623350   2625      c-      b      pg
Anchorman               7/9/2004   28.416365   3091      b       b-     pg13
King Arthur             7/9/2004   15.193907   3086      c+      b      pg13
Sleepover               7/9/2004   4.171226    2207      c       c      pg
Spiderman 2             7/2/2004   115.817364  4152      a-      a-     pg13
White Chicks            6/25/2004  19.676748   2726      c       b      pg13
The Notebook            6/25/2004  13.464745   3020      c+      b+     pg13
Two Brothers            6/25/2004  6.144160    2175      b-      b-     pg
Dodgeball               6/18/2004  30.070196   2694      b-      b      pg13
The Terminal            6/18/2004  19.053199   2811      b       b      pg13
Around the World        6/18/2004  7.576132    2801      c       c+     pg
Chronicles of Riddick   6/11/2004  24.289165   2757      c       b-     pg13
Garfield                6/11/2004  21.727611   3094      c-      b-     pg
The Stepford Wives      6/11/2004  21.406781   3057      c+      c+     pg13
Harry Potter            6/4/2004   93.687367   3855      b+      b+     pg
The Day After Tomorrow  5/28/2004  85.807341   3425      c+      b      pg13
Raising Helen           5/28/2004  14.239252   2717      c-      b-     pg13
Shrek 2                 5/21/2004  108.037878  4163      b       a-     pg

You cannot copy and paste this table directly into Minitab. Follow these instructions to save typing. If you have already gathered your own data, don't copy the headings in the steps below, and make sure the data I've entered is formatted exactly the same way as what you've entered.

  1. Highlight the table on the web page including the headings.
  2. Copy the data from the web page.
  3. Open up Microsoft Excel and paste the data.
  4. Copy the data from Excel. (It should still be highlighted.)
  5. Switch to Minitab and paste the data. Make sure you've clicked in the grey cell above row 1 where the labels go. If you didn't copy the labels, then start in row 1.

Specifying the order in Minitab

We want to make sure the data is displayed in the proper order. Normally, Minitab will display data in alphabetical order, but that will cause problems with what we have here. Do the following for the "critic", "yahoo", and "mpaa" columns.

The order only affects the output. The data will look the same as it did before this step.

  1. Click anywhere in a column that you want to specify the order for.
  2. Right click the mouse button and choose Column and then Value Order.
  3. Check the User-specified order box
  4. In the Define an order window on the right, type the order you want the values to appear in. You can choose whether you want A+ first or last and the order you want the ratings in.
  5. Click OK and then repeat for the other two variables.
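
To see why the order matters, here is a small Python sketch (not Minitab) that compares a plain text sort with a user-specified order for the critic ratings. The `GRADE_ORDER` list is my assumption about a sensible best-to-worst order; you may prefer it reversed.

```python
# Letter grades from best to worst. Putting A+ first is a choice;
# reverse the list if you want the worst grade first.
GRADE_ORDER = ["a+", "a", "a-", "b+", "b", "b-", "c+", "c", "c-", "d+", "d", "d-", "f"]
RANK = {grade: position for position, grade in enumerate(GRADE_ORDER)}

# Distinct critic ratings that appear in the instructor's table.
critic_values = {"b-", "c-", "b", "c+", "c", "a-", "b+"}

alphabetical = sorted(critic_values)                    # plain text sort
by_grade = sorted(critic_values, key=RANK.__getitem__)  # user-specified order

print(alphabetical)  # ['a-', 'b', 'b+', 'b-', 'c', 'c+', 'c-']
print(by_grade)      # ['a-', 'b+', 'b', 'b-', 'c+', 'c', 'c-']
```

In the plain text sort, "b" lands ahead of "b+" and "c" ahead of "c+", which is not the order you want on the axis of a graph.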

Displaying Categorical Data (Question 2)

There are several ways to explore the data. You will need to pick which methods to use with which variables. Below you will find the Minitab instructions for several types. There is nothing that says you have to use the same variables I do, just change the name of the variable where appropriate in the instructions.

Frequency Table

Frequency tables are appropriate for categorical data or quantitative data with limited numbers of responses. All you can do with the frequency table is count information where each data point is counted only once. If you have frequency data then use the contingency table (Cross tabulation).

  1. Choose Stat / Tables / Tally Individual Values
  2. Click in the Variables box and then double click on the variables you want to tally.
  3. (Optional) Check the type of tallies you want. The default is counts and that is usually good enough. You may want percents. Cumulative counts or percents tell you what part of the sample data is either that category or below. This only makes sense when you have ordered data.
  4. Click OK
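
If you want to check a tally by hand, the same counts, percents, and cumulative values can be sketched in Python. This uses the yahoo column from the instructor's table; the `ORDER` list is an assumed best-first value order, and the output layout is mine, not Minitab's.

```python
from collections import Counter

# Yahoo users' ratings for the 19 movies in the instructor's table, in table order.
yahoo = ["b+", "b", "b-", "b", "c", "a-", "b", "b+", "b-", "b",
         "b", "c+", "b-", "b-", "c+", "b+", "b", "b-", "a-"]
ORDER = ["a-", "b+", "b", "b-", "c+", "c"]  # assumed user-specified order, best first

counts = Counter(yahoo)
total = len(yahoo)

rows = []
running = 0
for grade in ORDER:
    running += counts[grade]  # cumulative count up to and including this grade
    rows.append((grade, counts[grade], 100 * counts[grade] / total,
                 running, 100 * running / total))

print("yahoo  count  percent  cum.count  cum.percent")
for grade, count, pct, cum, cum_pct in rows:
    print(f"{grade:<6} {count:>5} {pct:>8.2f} {cum:>10} {cum_pct:>12.2f}")
```

The cumulative columns only mean something because the grades are listed in a sensible order. For the mpaa column you would normally stop at counts and percents.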

Bar Chart

Bar charts are appropriate for categorical data. They allow you to graph a function (count, mean, st. dev) of a quantitative (measurement) level variable that is categorized by another variable.

For example, let's say that you want to know how many theaters are showing each rating of movie.

  1. Choose Graph / Bar Chart
  2. Decide on the type of bar chart
    1. If each case is to have equal weight and you just want to count the number of times values appear, then let each bar represent counts of unique values. This would be useful to count the number of movies for each rating. If you choose this option, you will need to supply the categorical variable. There will be a bar for each value of the categorical variable.
    2. If you want to find some statistical function for a variable, then choose to let each bar represent a function of a variable. If you want to know the total or average gross for movies by their rating, then you would choose this. If you choose this, then you get to pick a function like mean (average), count, or sum (total). The graph variable is the column that you want to apply the function to and the categorical variable is the column that contains which category the item falls into.
  3. (Optional - Recommended) Choose labels and add a title to the graph that describes what we're looking at.
  4. Choose labels and switch to the data labels tab. Check the use y-value labels radio button.
  5. Click OK

After the graph is generated, you can right click the mouse button on parts of it to make changes. For example, you could right click over the bars and change the formatting so that the background has slashes instead of a solid color.
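
The difference between the two kinds of bar chart is easier to see with numbers. The Python sketch below (an illustration only, not Minitab output) computes the bar heights you would get for the theaters column grouped by MPAA rating: a count of unique values, and then the sum and mean as functions of a variable.

```python
from collections import defaultdict

# (mpaa, theaters) for the 19 movies in the instructor's table, in table order.
movies = [
    ("pg13", 3420), ("pg", 2625), ("pg13", 3091), ("pg13", 3086), ("pg", 2207),
    ("pg13", 4152), ("pg13", 2726), ("pg13", 3020), ("pg", 2175), ("pg13", 2694),
    ("pg13", 2811), ("pg", 2801), ("pg13", 2757), ("pg", 3094), ("pg13", 3057),
    ("pg", 3855), ("pg13", 3425), ("pg13", 2717), ("pg", 4163),
]

# Collect the graph variable (theaters) under each category (mpaa).
groups = defaultdict(list)
for mpaa, theaters in movies:
    groups[mpaa].append(theaters)

# One bar per category; the bar height depends on the function you pick.
summary = {
    rating: {"count": len(values), "sum": sum(values), "mean": sum(values) / len(values)}
    for rating, values in groups.items()
}

for rating in sorted(summary):
    s = summary[rating]
    print(f"{rating:<5} count={s['count']:>2}  sum={s['sum']:>6}  mean={s['mean']:.1f}")
```

The count bars answer "how many movies have each rating", while the sum bars answer "how many theaters were showing each rating of movie".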

Pie Charts

Pie charts are useful when you want to summarize a categorical variable graphically. You may either have just one variable or another variable that specifies frequencies (like the number of theaters).

The categorical variable represents the individual slices of the pie.

  1. Choose Graph / Pie Chart
  2. Choose the type of data you have.
    1. If you have frequency data, then check the Chart values from a table box and enter the Categorical variable and make your frequency variable the summary variable.
    2. If you just have raw data and not frequencies, then select the Chart raw data box. The Categorical variables should contain the variable you wish to graph.
  3. (Optional) If you have several categories that have very small percentages, then click on Pie Chart Options and enter a value into the "Combine slices less than ___ % into one group" line. If you entered 2 here, then any category with less than 2% of the values would be lumped into a group called "Other".
  4. (Optional - Recommended) Enter a title for the graph. Click on Labels and add a title. Otherwise it will say "Pie Chart of ____" where the blank is the name of the variable you graphed.
  5. Click on Labels and change to the Slice Labels tab. Click either frequency to show the count in each slice or percent to show the percent of the area in each slice. You can also check the Category Name box to have the name of the category displayed next to the slice. This is more readable than using just the legend.
  6. Click OK
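
Here is a rough Python sketch of what the slice percentages and the "combine slices" option work out to for the critic column. The function name `pie_slices` is made up, and I am assuming that "less than" means strictly less than the cutoff; Minitab's exact handling may differ.

```python
from collections import Counter

# Critic ratings for the 19 movies in the instructor's table, in table order.
critic = ["b-", "c-", "b", "c+", "c", "a-", "c", "c+", "b-", "b-",
          "b", "c", "c", "c-", "c+", "b+", "c+", "c-", "b"]


def pie_slices(values, combine_below_pct=0.0):
    """Return {category: (count, percent)}, lumping small slices into 'Other'."""
    counts = Counter(values)
    total = len(values)
    kept = {}
    other = 0
    for category, count in counts.items():
        if 100 * count / total < combine_below_pct:
            other += count          # too small: goes into the combined slice
        else:
            kept[category] = count
    if other:
        kept["Other"] = other
    return {category: (count, 100 * count / total) for category, count in kept.items()}


slices = pie_slices(critic, combine_below_pct=6)
for category, (count, pct) in slices.items():
    print(f"{category:<6} {count:>2} {pct:6.2f}%")
```

With a cutoff of 6, the single a- and the single b+ (each about 5.3% of the 19 movies) end up together in "Other".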

Under the Data Options menu, you can change to the Group Options tab and tell it not to include missing as a group. This would be good if you only want the pie chart to represent the data you know.

Box Plots (Question 3)

For this example, I'll display the opening weekend gross by the MPAA rating. Note that this is not what I asked you to create. You will need to change the instructions to match what I've asked for.

  1. Choose Graph / Boxplot
  2. Choose the type of boxplot. Use simple if you only have one variable to graph and choose With Groups if you would like side-by-side box plots where the data is broken down by some classification variable. In this example, I'll use groups.
  3. For the graph variables, choose the data you want to make the boxplot for. I would use gross for this example.
  4. For the categorical variables, choose the way you want to categorize the data. I would use MPAA for this example.
  5. (Optional) Add a point for the mean. Click on Data View and check the box for Mean Symbol.
  6. (Optional - Recommended) Click on Labels and add a Title
  7. (Optional) Display which points are outliers. In this case, I could label the outlier points with the title of the movie so I could see which movies were outliers.
    1. Choose Labels / Data Labels
    2. Click on the pull down box for Labels to say Outliers.
    3. Tell it to "Use labels from ____" and put the title variable in the blank. If you leave this step off, it will put the numerical value of the outlier, which may be okay depending on what you want.
  8. Click OK
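
To get a feel for what the boxplot is drawing, the sketch below computes the quartiles and flags outliers for gross by MPAA rating in Python. It assumes the usual 1.5 x IQR rule for outliers and the (n+1) position rule for quartiles (the default in Python's `statistics.quantiles`); Minitab's numbers may differ slightly if it uses different conventions. The function name `boxplot_summary` is made up.

```python
import statistics

# (title, gross in $ millions, mpaa) from the instructor's table.
movies = [
    ("I, Robot", 52.179887, "pg13"), ("A Cinderella Story", 13.623350, "pg"),
    ("Anchorman", 28.416365, "pg13"), ("King Arthur", 15.193907, "pg13"),
    ("Sleepover", 4.171226, "pg"), ("Spiderman 2", 115.817364, "pg13"),
    ("White Chicks", 19.676748, "pg13"), ("The Notebook", 13.464745, "pg13"),
    ("Two Brothers", 6.144160, "pg"), ("Dodgeball", 30.070196, "pg13"),
    ("The Terminal", 19.053199, "pg13"), ("Around the World", 7.576132, "pg"),
    ("Chronicles of Riddick", 24.289165, "pg13"), ("Garfield", 21.727611, "pg"),
    ("The Stepford Wives", 21.406781, "pg13"), ("Harry Potter", 93.687367, "pg"),
    ("The Day After Tomorrow", 85.807341, "pg13"), ("Raising Helen", 14.239252, "pg13"),
    ("Shrek 2", 108.037878, "pg"),
]


def boxplot_summary(rows):
    """rows is a list of (title, gross). Returns quartiles, fences, and outlier titles."""
    values = [gross for _, gross in rows]
    q1, median, q3 = statistics.quantiles(values, n=4)  # default method="exclusive"
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [title for title, gross in rows if gross < low_fence or gross > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "low_fence": low_fence, "high_fence": high_fence, "outliers": outliers}


by_rating = {}
for rating in ("pg", "pg13"):
    rows = [(title, gross) for title, gross, mpaa in movies if mpaa == rating]
    by_rating[rating] = boxplot_summary(rows)
    print(rating, by_rating[rating])
```

Under these assumptions the only labeled point would be Spiderman 2 in the pg13 group, which is the kind of thing the "Use labels from" option in step 7 lets you see on the graph itself.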

Numerical Descriptions (Question 4)

You will use this a lot in this course. Minitab knows that and so it is the first choice under the Stat menu. It gives you the sample size, mean, median, trimmed mean, standard deviation, standard error of the mean, minimum, maximum, first quartile, and third quartile.

  1. Choose Stat / Basic Statistics / Display Descriptive Statistics
  2. You may describe several variables. Each variable will get its own row in the output.
  3. (Optional) You may describe the data by another variable. This is useful if you want to compare movies by another variable like their MPAA rating. You would specify "By variable ____" and then put the classification variable in that spot.
  4. (Optional) You can control the statistics that are displayed by clicking on Statistics and checking or unchecking the ones you want. In particular, we almost never use N*, the number of missing values.
  5. Click OK
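
As a cross-check on the Minitab output, most of these statistics can be sketched with Python's standard library. The example uses the theaters column from the instructor's table. The function name `describe` is made up, the trimmed mean is left out, and the quartiles assume the (n+1) position rule, so small differences from Minitab are possible.

```python
import math
import statistics

# Theaters for the 19 movies in the instructor's table, in table order.
theaters = [3420, 2625, 3091, 3086, 2207, 4152, 2726, 3020, 2175, 2694,
            2811, 2801, 2757, 3094, 3057, 3855, 3425, 2717, 4163]


def describe(values):
    """Basic descriptive statistics, treating the values as a sample."""
    n = len(values)
    q1, median, q3 = statistics.quantiles(values, n=4)  # (n+1) position rule
    stdev = statistics.stdev(values)                    # sample standard deviation
    return {
        "N": n,
        "Mean": statistics.mean(values),
        "SE Mean": stdev / math.sqrt(n),  # standard error of the mean
        "StDev": stdev,
        "Minimum": min(values),
        "Q1": q1,
        "Median": median,
        "Q3": q3,
        "Maximum": max(values),
    }


stats = describe(theaters)
for name, value in stats.items():
    print(f"{name:<8} {value:10.2f}")
```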

Normal Probability Plots (Question 5)

There are two places to generate a normal probability plot. You can either go to Graph / Probability Plots or to Stat / Basic Statistics / Normality Test. The first does everything the second does and also includes a confidence interval band (which doesn't mean anything to you right now, but will later).

  1. Choose Graph / Probability Plots / Single
  2. Double click on the name of the variable you want to test for normality.
  3. (Optional) Click on Labels and add a title. The default title is pretty good.
  4. Click OK

The explanation for interpreting the normal probability plot is given in your book.

Some of the more observant students will notice that the standard deviation here does not agree with the standard deviation from the descriptive statistics. The rest of you will be asking, "How long does this project go on for?" The standard deviation here assumes that your data is the entire population, whereas the standard deviation from the descriptive statistics assumes that the data is from a sample. The formulas are slightly different, but they are both a measure of spread.
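
The two formulas differ only in what you divide by: n for a population and n - 1 for a sample. A small Python sketch, using the theaters values for the seven PG movies in the instructor's table (the function name `std_dev` is made up for this example):

```python
import math


def std_dev(values, population=False):
    """Standard deviation: divide by n for a population, n - 1 for a sample."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    divisor = n if population else n - 1
    return math.sqrt(sum_of_squares / divisor)


# Theaters for the seven PG movies in the instructor's table.
pg_theaters = [2625, 2207, 2175, 2801, 3094, 3855, 4163]

sample_sd = std_dev(pg_theaters)                       # divides by n - 1
population_sd = std_dev(pg_theaters, population=True)  # divides by n

print(f"sample:     {sample_sd:.2f}")
print(f"population: {population_sd:.2f}")
```

The population value is always a little smaller, and the gap shrinks as the number of data points grows.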

I want to explain the Anderson-Darling (AD) Normality Test values on the right side. The p-value is the probability of getting results at least as unusual as ours if the data really come from a normal distribution. If the p-value is small (say less than 5% or 0.05), then there is a very small chance that we would get these results if the data were normal. If our p-value was 0.015, which is less than 5%, we would say that our data is unusual for a normally distributed population. Since our results are unusual, we'll reject the assumption that our data is normally distributed. This should agree with the book's explanation about the data falling along the line, but gives a slightly more definite approach (an actual number instead of just "close").

Making a scatter plot (Question 6)

[Figure: scatter plot of opening weekend sales vs. number of theaters]

A scatter plot is appropriate when you have two quantitative (measurement level) variables and you want to see if they're correlated with each other.

The y variable will be "gross" and the x variable will be "theaters".

  1. Choose Graph / Scatterplot / Simple
  2. Click in the Y column for graph 1. Double click on the response variable (gross)
  3. Click in the X column for graph 1. Double click on the predictor variable (theaters)
  4. (Optional - Recommended) Add a title to the graph by choosing Labels / Title
  5. Click OK
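
The scatter plot shows the relationship visually. If you also want a number for how strongly the two variables are correlated, the Pearson correlation coefficient can be sketched in Python like this. The function name `pearson_r` is made up; the data are theaters and gross from the instructor's table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Theaters (x) and opening weekend gross in $ millions (y), in table order.
theaters = [3420, 2625, 3091, 3086, 2207, 4152, 2726, 3020, 2175, 2694,
            2811, 2801, 2757, 3094, 3057, 3855, 3425, 2717, 4163]
gross = [52.179887, 13.623350, 28.416365, 15.193907, 4.171226, 115.817364,
         19.676748, 13.464745, 6.144160, 30.070196, 19.053199, 7.576132,
         24.289165, 21.727611, 21.406781, 93.687367, 85.807341, 14.239252,
         108.037878]

r = pearson_r(theaters, gross)
print(f"r = {r:.3f}")
```

A value near +1 or -1 means the dots fall close to a straight line; a value near 0 means there is little linear pattern.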

You may have more than one scatter plot on the same graph. If you do this, you probably want to use the same x variable for both, otherwise things can get really confusing.

You can also change the color of the dots. To do this, click on the dots with the left mouse button and then click the right mouse button and choose Edit Symbols. If it doesn't say Edit Symbols, you don't have the dots selected. Choose custom and then you can edit the way they look. This technique works for editing all parts of the graphs -- highlight the part you want to edit and then right click and choose Edit (or control-T is the keyboard shortcut).

Regression Analysis (Question 7)

  1. Choose Stat / Regression / Regression
  2. The response variable is "gross" and the predictor variable is "theaters".
  3. Click OK

There is no graphical output here. Just copy the text into Word and explain what you're looking at.
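
To help you recognize the main numbers in that text output, here is a minimal Python sketch of a simple least-squares fit of gross on theaters. It only computes the slope, intercept, and R-squared; Minitab's output contains more than this. The function name `least_squares` is made up for this example.

```python
def least_squares(xs, ys):
    """Fit y = intercept + slope * x by least squares; also return R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Predictor: theaters. Response: gross in $ millions. Both in table order.
theaters = [3420, 2625, 3091, 3086, 2207, 4152, 2726, 3020, 2175, 2694,
            2811, 2801, 2757, 3094, 3057, 3855, 3425, 2717, 4163]
gross = [52.179887, 13.623350, 28.416365, 15.193907, 4.171226, 115.817364,
         19.676748, 13.464745, 6.144160, 30.070196, 19.053199, 7.576132,
         24.289165, 21.727611, 21.406781, 93.687367, 85.807341, 14.239252,
         108.037878]

slope, intercept, r_squared = least_squares(theaters, gross)
print(f"gross = {intercept:.3f} + {slope:.5f} * theaters")
print(f"R-squared = {100 * r_squared:.1f}%")
```

The slope is the predicted change in gross (in millions of dollars) for each additional theater, which is one of the things you should explain in your write-up.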