Technology Exercise 1: Exploring and Understanding Data

Collecting the data

Go to the Box Office reviews at Yahoo Movies at http://movies.yahoo.com/boxoffice/latest/rank.html. You may want to right click that link and choose "Open in New Window" so that you can continue to have these instructions available to you. Depending on the time of the week, you will either get the top 10 movies for the weekend or the full 100 from the last week. If you get the page that has only the top 10 movies for the weekend, don't use it, because the data is incomplete for that weekend. You may gather previous weekends' worth of data, though.

Creating the Worksheet

You may use either Excel or Minitab to collect the data. If you use Excel, you can collect the data at home and then bring it into school on a floppy disk to import into Minitab. If you enter the data directly into Minitab, you won't have to do the import, but you'll have to be at school.

  1. Label columns as "title", "weekend", "gross", "theaters", "critic", "yahoo", and "mpaa".
  2. For each movie on the web page, look for those that have been in release for one week, ranked in the top 10 movies for that week, and were showing in at least 2000 theaters.
    1. Enter the title, beginning date of the weekend (for example: November 21, 2003, would be entered as 11/21/03), the weekend gross in millions of dollars ($37,062,535 would get entered as 37.062535), and the number of theaters in the columns below the headings. Do not enter commas or dollar signs into any of the values.
    2. Click on the title of the movie and it will bring you to another page that contains a critic's rating, yahoo users' rating, and the MPAA rating. Enter those values in the next three columns. Make sure you are consistent in your data entry (always use the same case for the ratings and always use either "pg-13" or "pg13", but don't mix them). Use the back button to get back to the data.
  3. When you are done collecting information about this week, click on the pull down menu under "Archived Charts", select the previous week, and then click "GO". Enter the information for the new movies for that week and then repeat the process, choosing previous weeks until you can choose no more.
  4. The menu of archived weeks only allows for the last 12 weeks, but the previous weeks are available. The instructor has gone back and collected information for October through December, 2003, so that you can still use that information. That table of information appears below these instructions.
  5. Save your spreadsheet into the R: drive. Use the R:\xx\tech2 folder where xx is your section number (01, 02, or 03). Save it as a name that is unique for your group. If you are collecting this data at home, then save it onto a floppy disk to bring into school. If you feel comfortable making a folder to save your information into, then go ahead and save all your files in that folder.
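The data-entry rules in step 2 can be sketched in a few lines of Python, for anyone who wants to double-check values before typing them in. This is only an illustration and not part of the Excel or Minitab steps; the function names (gross_in_millions, weekend_date, normalize_mpaa, qualifies) are made up for this example.

```python
from datetime import datetime

def gross_in_millions(raw):
    """Convert a gross such as '$37,062,535' to millions of dollars."""
    digits = raw.replace("$", "").replace(",", "")
    return int(digits) / 1_000_000

def weekend_date(raw):
    """Convert 'November 21, 2003' to the m/d/yy form used in the worksheet."""
    d = datetime.strptime(raw, "%B %d, %Y")
    return f"{d.month}/{d.day}/{d:%y}"

def normalize_mpaa(raw):
    """Pick one spelling per rating: lower case, no hyphen (assumed choice)."""
    return raw.strip().lower().replace("-", "")

def qualifies(weeks_in_release, rank, theaters):
    """Selection rule from step 2: first week, top 10, at least 2000 theaters."""
    return weeks_in_release == 1 and rank <= 10 and theaters >= 2000

print(gross_in_millions("$37,062,535"))
print(weekend_date("November 21, 2003"))
print(normalize_mpaa("PG-13"))
print(qualifies(1, 3, 2677))
```

Whichever spelling you pick for the ratings, the point is the same as in the instructions: use it for every row.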

From now on, whenever you need to work with this data, open up your project by going to File and choosing Open Project.

Data from October 3 - December 12, 2003.

title                    date        gross      theaters  critic  yahoo  mpaa
Something's Gotta Give   12/12/2003  16.064723  2677      b       b      pg13
Stuck on You             12/12/2003  9.411055   3003      b-      b-     pg13
The Last Samurai         12/5/2003   24.271354  2908      b       a-     r
Haunted Mansion          11/28/2003  24.278410  3122      c       b      pg
Bad Santa                11/28/2003  12.292952  2005      b       b-     r
The Missing              11/28/2003  10.833633  2756      c+      b-     r
Timeline                 11/28/2003  8.440629   2787      c-      b-     pg13
Cat in the Hat           11/21/2003  38.329160  3464      d+      c+     pg
Gothika                  11/21/2003  19.288438  2382      c-      b      r
Master & Commander       11/14/2003  25.105990  3101      a-      b+     pg13
Matrix Revolutions       11/7/2003   48.475154  3502      c+      b      r
Elf                      11/7/2003   31.113501  3337      b       b+     pg
Scary Movie 3            10/24/2003  48.113770  3505      c       b      pg13
Radio                    10/24/2003  13.303724  3074      c+      a-     pg
Texas Chainsaw Massacre  10/17/2003  28.094014  3016      c       b+     r
Runaway Jury             10/17/2003  11.836705  2815      b-      b      pg13
Kill Bill                10/10/2003  22.089322  3102      b       b+     r
Good Boy                 10/10/2003  13.107022  3225      c+      b-     pg
Intolerable Cruelty      10/10/2003  12.525075  2564      b       c+     pg13
School of Rock           10/3/2003   19.622714  2614      b+      a-     pg13
Out of Time              10/3/2003   16.185316  3076      b-      b+     pg13
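
If you copy rows from this table as plain text, note that the titles contain spaces but the last six values never do. A minimal Python sketch of reading such rows, assuming whitespace-separated text like the table above (the names FIELDS and parse_row are made up for this example, and only three rows are shown):

```python
rows = """\
Something's Gotta Give 12/12/2003 16.064723 2677 b b pg13
Scary Movie 3 10/24/2003 48.113770 3505 c b pg13
Elf 11/7/2003 31.113501 3337 b b+ pg
"""

FIELDS = ("title", "date", "gross", "theaters", "critic", "yahoo", "mpaa")

def parse_row(line):
    # Split off the six right-most values; whatever is left is the title.
    parts = line.strip().rsplit(maxsplit=6)
    record = dict(zip(FIELDS, parts))
    record["gross"] = float(record["gross"])        # millions of dollars
    record["theaters"] = int(record["theaters"])
    return record

movies = [parse_row(line) for line in rows.splitlines() if line.strip()]
for movie in movies:
    print(movie["title"], movie["gross"], movie["theaters"], movie["mpaa"])
```

Splitting from the right is what keeps "Scary Movie 3" together as one title.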

Specifying the order in Minitab

We want to make sure the data is displayed in the proper order. Normally, Minitab will display data in alphabetical order, but that will cause problems with what we have here. Do the following for the "critic", "yahoo", and "mpaa" columns.

  1. Right click the mouse button over the column and choose Column and then Value Order.
  2. Check the User-specified order box.
  3. In the Define an order window on the right, type the order you want the values to appear in. You can choose whether you want A+ first or last and the order you want the ratings in.
  4. Click OK and then repeat for the other two variables.
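
To see why alphabetical order causes problems, here is a small Python sketch that sorts letter grades both ways. The lists GRADE_ORDER and MPAA_ORDER are one possible ordering chosen for this example (best grade first); you may prefer the reverse.

```python
# One possible user-specified order, best to worst (an assumption, not a rule).
GRADE_ORDER = ["a+", "a", "a-", "b+", "b", "b-", "c+", "c", "c-",
               "d+", "d", "d-", "f"]
MPAA_ORDER = ["g", "pg", "pg13", "r"]

critic = ["b", "b-", "c", "c+", "c-", "d+", "a-", "b+"]

# Alphabetical order puts "b" ahead of "b+", which is wrong for grades.
alphabetical = sorted(set(critic))

# Sorting by position in GRADE_ORDER gives the order we actually want.
by_grade = sorted(set(critic), key=GRADE_ORDER.index)

print(alphabetical)
print(by_grade)
print(sorted(["r", "pg13", "pg"], key=MPAA_ORDER.index))
```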

Displaying Categorical Data (Question 2)

There are several ways to explore the data. You will need to pick which methods to use with which variables. Below you will find the Minitab instructions for several types. There is nothing that says you have to use the same variables I do; just change the name of the variable where appropriate in the instructions.

Frequency Table

Frequency tables are appropriate for categorical data or quantitative data with limited numbers of responses. All you can do with the frequency table is count information where each data point is counted only once. If you have frequency data then use the contingency table (Cross tabulation).

  1. Choose Stat / Tables / Tally Individual Values
  2. Click in the Variables box and then double click on the variables you want to tally.
  3. (Optional) Check the type of tallies you want. The default is counts and that is usually good enough. You may want percents. Cumulative counts or percents tell you what part of the sample data is either that category or below. This only makes sense when you have ordered data.
  4. Click OK
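
The four kinds of tallies can be sketched in Python as follows, using the mpaa column from the October through December table. The function name tally and the choice of pg, pg13, r as the category order are assumptions for this example.

```python
from collections import Counter

# mpaa column from the instructor's table, in table order
mpaa = ["pg13", "pg13", "r", "pg", "r", "r", "pg13", "pg", "r", "pg13", "r",
        "pg", "pg13", "pg", "r", "pg13", "r", "pg", "pg13", "pg13", "pg13"]

def tally(values, order):
    """Return (category, count, percent, cumulative count, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    table, running = [], 0
    for category in order:
        count = counts[category]
        running += count
        table.append((category, count, 100 * count / n,
                      running, 100 * running / n))
    return table

table = tally(mpaa, order=["pg", "pg13", "r"])
for category, count, pct, cum, cum_pct in table:
    print(f"{category:5} {count:3} {pct:6.2f} {cum:3} {cum_pct:6.2f}")
```

The cumulative columns depend on the order you pass in, which is why they only make sense for ordered data.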

Bar Chart

Bar charts are appropriate for categorical data. They allow you to graph a function (count, mean, st. dev) of a quantitative (measurement) level variable that is categorized by another variable.

For example, let's say that you want to know how many theaters are showing each rating of movie.

  1. Choose Graph / Bar Chart
  2. Decide on the type of bar chart
    1. If each case is to have equal weight and you just want to count the number of times values appear, then let each bar represent counts of unique values. This would be useful to count the number of movies for each rating. If you choose this option, you will need to supply the categorical variable. There will be a bar for each value of the categorical variable.
    2. If you want to find some statistical function for a variable, then choose to let each bar represent a function of a variable. If you want to know the total or average gross for movies by their rating, then you would choose this. If you choose this, then you get to pick a function like mean (average), count, or sum (total). The graph variable is the column that you want to apply the function to and the categorical variable is the column that contains which category the item falls into.
  3. (Optional - Recommended) Choose labels and add a title to the graph that describes what we're looking at.
  4. Choose labels and switch to the data labels tab. Check the use y-value labels radio button.
  5. Click OK

After the graph is generated, you can right click the mouse button on parts of it to make changes. For example, you could right click over the bars and change the formatting so that the background has slashes instead of a solid color.
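
The two kinds of bar chart differ only in what each bar's height is computed from. A minimal Python sketch, using the mpaa and theaters columns from the instructor's table (the function name bar_values is made up, and the bars are drawn as text rather than as a Minitab graph):

```python
from statistics import mean

# (mpaa, theaters) pairs from the instructor's table
data = [("pg13", 2677), ("pg13", 3003), ("r", 2908), ("pg", 3122),
        ("r", 2005), ("r", 2756), ("pg13", 2787), ("pg", 3464),
        ("r", 2382), ("pg13", 3101), ("r", 3502), ("pg", 3337),
        ("pg13", 3505), ("pg", 3074), ("r", 3016), ("pg13", 2815),
        ("r", 3102), ("pg", 3225), ("pg13", 2564), ("pg13", 2614),
        ("pg13", 3076)]

def bar_values(pairs, function):
    """Apply `function` to the graph variable within each category."""
    groups = {}
    for category, value in pairs:
        groups.setdefault(category, []).append(value)
    return {category: function(values) for category, values in groups.items()}

counts = bar_values(data, len)    # counts of unique values
totals = bar_values(data, sum)    # a function (sum) of a variable
means = bar_values(data, mean)    # a function (mean) of a variable

for rating in ("pg", "pg13", "r"):
    bar = "#" * round(totals[rating] / 1000)   # one # per 1000 theaters
    print(f"{rating:5} {bar} {totals[rating]}")
```

The number printed after each bar plays the role of the y-value data label.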

Pie Charts

Pie charts are useful when you want to summarize a categorical variable graphically. You may either have just one variable or another variable that specifies frequencies (like the number of theaters).

  1. Choose Graph / Pie Chart
  2. If you have frequency data, then check the Chart Table box and enter the categories and variables. Otherwise, just click in the Chart Data In box and then double click on the categorical variable you wish to graph.
  3. (Optional) If you have several categories that have very small percentages, then click on Pie Chart Options and enter a value into the "Combine slices less than ___ % into one group" line. If you entered 2 here, then any category with less than 2% of the values would be lumped into a group called "Other".
  4. (Optional - Recommended) Enter a title for the graph. Click on Labels and add a title. Otherwise it will say "Pie Chart of ____" where the blank is the name of the variable you graphed.
  5. Click on Labels and change to the Slice Labels tab. Click either frequency to show the count in each slice or percent to show the percent of the area in each slice. You can also check the Category Name box to have the name of the category displayed next to the slice. This is more readable than using just the legend.
  6. Click OK

Under the Data Options menu, you can change to the Group Options tab and tell it not to include missing as a group. This would be good if you only want the pie chart to represent the data you know.
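
Here is a rough Python sketch of what the slice options do, using the critic column from the instructor's table. The function name pie_slices and its parameter are made up for this example, and treating None as a missing value is an assumption about how the data was stored.

```python
from collections import Counter

# critic column from the instructor's table
critic = ["b", "b-", "b", "c", "b", "c+", "c-", "d+", "c-", "a-", "c+",
          "b", "c", "c+", "c", "b-", "b", "c+", "b", "b+", "b-"]

def pie_slices(values, combine_below_percent=0.0):
    """Return {slice name: (frequency, percent)}, lumping small slices."""
    known = [v for v in values if v is not None]   # leave out missing values
    counts = Counter(known)
    total = len(known)
    slices, other = {}, 0
    for category, count in counts.items():
        if 100 * count / total < combine_below_percent:
            other += count
        else:
            slices[category] = count
    if other:
        slices["Other"] = other
    return {name: (count, 100 * count / total)
            for name, count in slices.items()}

slices = pie_slices(critic, combine_below_percent=5)
for name, (count, percent) in slices.items():
    print(f"{name:6} {count:2} {percent:5.1f}%")
```

With a cutoff of 5, the three grades that appear only once each are combined into a single "Other" slice.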

Displaying Quantitative Data (Question 3)

When the variable you want to display is quantitative (measurement is Minitab's word for it), then you want to choose a dot plot, stem and leaf plot, or histogram.

Dot Plots

  1. Choose Graph / Dotplot
  2. Choose the type of dot plot you want. Most of the time, it will be the simple case, but if you want to look at your data by category (for example, comparing men and women), then you might choose with groups.
  3. Choose the variable(s) you want dot plots for. You may use more than one variable; each dot plot will generate in its own window.
  4. (Optional - Recommended) Click on Labels and enter a title.
  5. Click OK

If you would like to generate dot plots by a classification variable, then choose the "with groups" dot plot at the beginning. This will give you an extra box called "Categorical variable" which is where you enter the variable that contains how your groups are divided.
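
The idea behind a dot plot with groups can be sketched as text in Python. This is only a rough picture of the idea, not Minitab's layout: the values run down the page instead of across, and rounding each gross to the nearest 5 million is a choice made for this example. The function name dotplot is made up.

```python
from collections import Counter

def dotplot(values, step):
    """Return text lines: one line per position, one dot per data value."""
    bins = Counter(round(v / step) * step for v in values)
    lines = []
    position = min(bins)
    while position <= max(bins):
        lines.append(f"{position:>4} {'.' * bins[position]}")
        position += step
    return lines

# weekend gross (millions) for two of the mpaa groups in the table
gross_by_rating = {
    "pg": [24.278410, 38.329160, 31.113501, 13.303724, 13.107022],
    "r": [24.271354, 12.292952, 10.833633, 19.288438, 48.475154,
          28.094014, 22.089322],
}

for rating, values in gross_by_rating.items():
    print(rating)
    print("\n".join(dotplot(values, step=5)))
```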

Histogram

Histograms are like bar charts except that the vertical axis is always frequency (or percent) and the horizontal axis shows not categories but intervals determined by the values of your variable. These intervals are the bins that your textbook talked about.

Let's generate a histogram of the number of theaters that each movie is playing at.

  1. Choose Graph / Histogram.
  2. Choose either simple or with fit. The "with fit" option will attempt to fit a normal curve to the data so that you can see how well the data follows a normal distribution.
  3. Click in the graph variables box and then double click the variable you want to graph. For our example, that would be theaters.
  4. (Optional - Recommended) You can click on Labels and add a title to the graph.
  5. Click OK

If you don't like the way the histogram looks, you can position your mouse over the bars in the graph and click the right mouse button and choose Edit Bars.

  1. Under the Attributes tab, you can change the background from solid to slashed and change the color of the graphs.
  2. Under the Binning tab, you can change how many bars there are and how they are determined.
    1. The interval type determines whether the values given are the midpoints of the intervals or the cut points (boundaries) between two intervals.
    2. Automatic interval definition allows Minitab to make a best guess attempt at drawing the histogram. You can specify the number of intervals. You can also specify the cut points or midpoints with a space separated list of numbers. For example, if you would like the cut points to be 100, 120, 140, 160, 180, and 200, you would enter "100 120 140 160 180 200" and it would give you five bars that were 20 units wide each. Another way to specify that same interval without listing each number is to enter "100:200/20", which means start at 100, go to 200, and make each bar 20 wide.
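
The cut point shorthand can be sketched in Python like this. The function names cut_points and histogram_counts are made up, and counting a value that lands exactly on a cut point in the interval to its right is an assumption for this example; check how Minitab treats boundary values before relying on it.

```python
def cut_points(spec):
    """Expand 'start:stop/step' or a space separated list into cut points."""
    if ":" in spec:
        bounds, step = spec.split("/")
        start, stop = bounds.split(":")
        return list(range(int(start), int(stop) + 1, int(step)))
    return [int(token) for token in spec.split()]

def histogram_counts(values, cuts):
    """Count values in each interval from cuts[i] up to (not including) cuts[i+1]."""
    counts = [0] * (len(cuts) - 1)
    for v in values:
        for i in range(len(cuts) - 1):
            if cuts[i] <= v < cuts[i + 1]:
                counts[i] += 1
                break   # values outside all intervals are simply not counted
    return counts

# theaters column from the instructor's table
theaters = [2677, 3003, 2908, 3122, 2005, 2756, 2787, 3464, 2382, 3101, 3502,
            3337, 3505, 3074, 3016, 2815, 3102, 3225, 2564, 2614, 3076]

print(cut_points("100:200/20"))
cuts = cut_points("2000:3600/400")
print(cuts, histogram_counts(theaters, cuts))
```

Six cut points give five bars, because each bar sits between two neighboring cut points.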

Box Plots (Question 4)

For this example, I'll display the opening weekend gross by the MPAA rating. Note, this is not what I asked you to create.

  1. Choose Graph / Boxplot
  2. Choose the type of boxplot. Use simple if you only have one variable to graph and choose With Groups if you would like side-by-side box plots where the data is broken down by some classification variable. In this example, I'll use groups.
  3. For the graph variables, choose the data you want to make the boxplot for. I would use gross for this example.
  4. For the categorical variables, choose the way you want to categorize the data. I would use MPAA for this example.
  5. (Optional) Add a point for the mean. Click on Data View and check the box for Mean Symbol.
  6. (Optional - Recommended) Click on Labels and add a Title
  7. (Optional) Display which points are outliers. In this case, I could label the outlier points with the title of the movie so I could see which movies were outliers.
    1. Choose Labels / Data Labels
    2. Click on the pull down box for Labels to say Outliers.
    3. Tell it to "Use labels from ____" and put the title variable in the blank. If you leave this step off, it will put the numerical value of the outlier, which may be okay depending on what you want.
  8. Click OK
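
For the curious, here is a Python sketch of the numbers behind side-by-side box plots of gross by MPAA rating, including which titles get labeled as outliers. Two assumptions are built in: quartiles are found at positions (n+1)/4 and 3(n+1)/4, and a point is called an outlier when it is more than 1.5 times the interquartile range beyond the box. Minitab's output may differ slightly if it uses other rules. The function name box_summary is made up.

```python
from statistics import mean, quantiles

# (title, gross in millions, mpaa) from the instructor's table
movies = [
    ("Something's Gotta Give", 16.064723, "pg13"),
    ("Stuck on You", 9.411055, "pg13"),
    ("The Last Samurai", 24.271354, "r"),
    ("Haunted Mansion", 24.278410, "pg"),
    ("Bad Santa", 12.292952, "r"),
    ("The Missing", 10.833633, "r"),
    ("Timeline", 8.440629, "pg13"),
    ("Cat in the Hat", 38.329160, "pg"),
    ("Gothika", 19.288438, "r"),
    ("Master & Commander", 25.105990, "pg13"),
    ("Matrix Revolutions", 48.475154, "r"),
    ("Elf", 31.113501, "pg"),
    ("Scary Movie 3", 48.113770, "pg13"),
    ("Radio", 13.303724, "pg"),
    ("Texas Chainsaw Massacre", 28.094014, "r"),
    ("Runaway Jury", 11.836705, "pg13"),
    ("Kill Bill", 22.089322, "r"),
    ("Good Boy", 13.107022, "pg"),
    ("Intolerable Cruelty", 12.525075, "pg13"),
    ("School of Rock", 19.622714, "pg13"),
    ("Out of Time", 16.185316, "pg13"),
]

def box_summary(rows):
    """Five-number summary, mean, and outlier titles for one group."""
    values = sorted(gross for _, gross in rows)
    q1, median, q3 = quantiles(values, n=4)   # default "exclusive" method
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # assumed rule
    outliers = [title for title, gross in rows
                if gross < low_fence or gross > high_fence]
    return {"min": values[0], "q1": q1, "median": median, "q3": q3,
            "max": values[-1], "mean": mean(values), "outliers": outliers}

groups = {}
for title, gross, mpaa in movies:
    groups.setdefault(mpaa, []).append((title, gross))

summaries = {mpaa: box_summary(rows) for mpaa, rows in groups.items()}
for mpaa in ("pg", "pg13", "r"):
    print(mpaa, summaries[mpaa])
```

Labeling outliers with the title variable corresponds to returning titles here instead of the gross values.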

Numerical descriptions (Question 5)

You will use this a lot in this course. Minitab knows that and so it is the first choice under the Stat menu. It gives you the sample size, mean, median, trimmed mean, standard deviation, standard error of the mean, minimum, maximum, first quartile, and third quartile.

  1. Choose Stat / Basic Statistics / Display Descriptive Statistics
  2. You may describe several variables. Each variable will get its own row in the output.
  3. (Optional) You may describe the data by another variable. This is useful if you want to compare movies by another variable like their MPAA rating. You would specify "By variable ____" and then put the classification variable in that spot.
  4. (Optional) You can control the statistics that are displayed by clicking on Statistics and checking or unchecking the ones you want. In particular, we almost never use N*, the number of missing values.
  5. Click OK
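
The statistics in that list can be reproduced with a short Python sketch, shown here for the gross column of the instructor's table. Two details are assumptions rather than something these instructions state: the trimmed mean drops 5% of the values (rounded to a whole number) from each end, and the quartiles sit at positions (n+1)/4 and 3(n+1)/4. The function name describe is made up.

```python
from math import sqrt
from statistics import mean, median, quantiles, stdev

# gross column (millions of dollars) from the instructor's table
gross = [16.064723, 9.411055, 24.271354, 24.278410, 12.292952, 10.833633,
         8.440629, 38.329160, 19.288438, 25.105990, 48.475154, 31.113501,
         48.113770, 13.303724, 28.094014, 11.836705, 22.089322, 13.107022,
         12.525075, 19.622714, 16.185316]

def describe(values, trim_fraction=0.05):
    data = sorted(values)
    n = len(data)
    k = round(n * trim_fraction)            # values dropped from each end
    trimmed = data[k:n - k] if k else data
    q1, _, q3 = quantiles(data, n=4)        # default "exclusive" method
    s = stdev(data)                         # sample standard deviation
    return {
        "N": n, "Mean": mean(data), "Median": median(data),
        "TrMean": mean(trimmed), "StDev": s, "SE Mean": s / sqrt(n),
        "Minimum": data[0], "Maximum": data[-1], "Q1": q1, "Q3": q3,
    }

stats = describe(gross)
for name, value in stats.items():
    print(f"{name:8} {value:.4f}" if isinstance(value, float)
          else f"{name:8} {value}")
```

The standard error of the mean is just the standard deviation divided by the square root of the sample size.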

Normal Probability Plots (Question 6)

There are two places to generate a normal probability plot. You can either go to Graph / Probability Plots or to Stat / Basic Statistics / Normality Test. The first does everything the second does and also includes a confidence interval band (which doesn't mean anything to you right now, but will later).

  1. Choose Graph / Probability Plots / Single
  2. Double click on the name of the variable you want to test for normality.
  3. (Optional) Click on Labels and add a title. The default title is pretty good.
  4. Click OK

The explanation for interpreting the normal probability plot is given in your book.
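
If you want to see where the points on the plot come from, here is a minimal Python sketch. Each sorted data value is paired with the z-score a normal distribution would put at that rank. The plotting position (i - 0.3)/(n + 0.4) is an assumption for this example; Minitab offers more than one method and may scale the axes differently. The function name normal_plot_points is made up.

```python
from statistics import NormalDist

# theaters column from the instructor's table
theaters = [2677, 3003, 2908, 3122, 2005, 2756, 2787, 3464, 2382, 3101, 3502,
            3337, 3505, 3074, 3016, 2815, 3102, 3225, 2564, 2614, 3076]

def normal_plot_points(values):
    """Pair each sorted value with its expected standard normal score."""
    data = sorted(values)
    n = len(data)
    standard_normal = NormalDist()
    points = []
    for i, x in enumerate(data, start=1):
        p = (i - 0.3) / (n + 0.4)     # assumed plotting position for rank i
        points.append((x, standard_normal.inv_cdf(p)))
    return points

points = normal_plot_points(theaters)
for x, z in points:
    print(f"{x:5} {z:6.2f}")
```

If the data is close to normal, plotting x against z gives points that fall near a straight line.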

Some of the more observant students will notice that the standard deviation here does not agree with the standard deviation from the descriptive statistics. The rest of you will be asking, "How long does this project go on for?" The standard deviation here assumes that your data is the entire population, whereas the standard deviation from the descriptive statistics assumes that the data is from a sample. The formulas are slightly different, but they are both a measure of spread.
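
The difference between the two formulas is only the divisor: n for a population and n - 1 for a sample. A short Python sketch using the gross values of the five PG movies in the table (the variable names are made up for this example):

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# weekend gross (millions) for the five PG movies in the table
pg_gross = [24.278410, 38.329160, 31.113501, 13.303724, 13.107022]

n = len(pg_gross)
xbar = mean(pg_gross)
sum_of_squares = sum((x - xbar) ** 2 for x in pg_gross)

population_sd = sqrt(sum_of_squares / n)        # data is the whole population
sample_sd = sqrt(sum_of_squares / (n - 1))      # data is a sample

print(population_sd, pstdev(pg_gross))   # the two values in each line agree
print(sample_sd, stdev(pg_gross))
```

The sample version is always a little larger, and the gap shrinks as n grows.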

I want to explain the Anderson-Darling (AD) Normality Test values on the right side. The p-value is the probability of getting the results we did if the data is normally distributed. If the p-value is small (say less than 5% or 0.05), then there is a very small chance that we would get these results if the data were normal. If our p-value was 0.015, which is less than 5%, we would say that our data is unusual for a normally distributed population. Since our results are unusual, we'll reject the assumption that our data is normally distributed. This should agree with the book's explanation about the data falling along the line, but gives a slightly more definite approach (an actual number instead of just "close").
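
For those who want to see what is behind the AD value and its p-value, here is a rough Python sketch. It uses the usual textbook form of the Anderson-Darling statistic with the sample mean and sample standard deviation, plus a commonly quoted approximation (attributed to D'Agostino and Stephens) for turning the statistic into a p-value. These instructions do not say exactly how Minitab does its calculation, so treat both choices as assumptions; the numbers may not match Minitab's to the last digit. The function name anderson_darling_normal is made up.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(values):
    """Return (AD statistic, approximate p-value) for a normality test."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))   # normal curve fit to data
    cdf = [fitted.cdf(x) for x in data]
    total = sum((2 * i - 1) * (log(cdf[i - 1]) + log(1 - cdf[n - i]))
                for i in range(1, n + 1))
    a2 = -n - total / n
    # Small-sample adjustment, then the piecewise p-value approximation.
    a_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    if a_star >= 0.6:
        p = exp(1.2937 - 5.709 * a_star + 0.0186 * a_star ** 2)
    elif a_star > 0.34:
        p = exp(0.9177 - 4.279 * a_star - 1.38 * a_star ** 2)
    elif a_star > 0.2:
        p = 1 - exp(-8.318 + 42.796 * a_star - 59.938 * a_star ** 2)
    else:
        p = 1 - exp(-13.436 + 101.14 * a_star - 223.73 * a_star ** 2)
    return a2, p

# gross column (millions of dollars) from the instructor's table
gross = [16.064723, 9.411055, 24.271354, 24.278410, 12.292952, 10.833633,
         8.440629, 38.329160, 19.288438, 25.105990, 48.475154, 31.113501,
         48.113770, 13.303724, 28.094014, 11.836705, 22.089322, 13.107022,
         12.525075, 19.622714, 16.185316]

ad, p_value = anderson_darling_normal(gross)
print(f"AD = {ad:.3f}, p-value = {p_value:.3f}")
print("reject normality" if p_value < 0.05 else "no evidence against normality")
```

The last line is the decision rule from the paragraph above: a p-value below 0.05 means the data would be unusual for a normal population.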