# Minitab questions with statistics #4

I have attached the information here.

I can’t find the PDF for it, but here is a PowerPoint.

**MBA 501A – [STATISTICS]**

**ASSIGNMENT 4**

**INSTRUCTIONS**: The total number of points possible is 50. Please note that point allocation varies per question. Use the Help feature in MINITAB 16 to read descriptions for the data sets so that you can make meaningful comments.

[10 pts] 1. Use the data set MBASURVEY.MTW in the Student14 folder.

a) Perform the chi-square test for independence to determine whether Gender and Highest Degree are related. Use α = 0.05. Explain your results.

[10 pts] 2. Use the data set PULSEA.MTW in the Student14 folder.

a) Perform the chi-square test for independence to determine whether smoking status and usual level of activity are related. Use α = 0.05. Explain your results.

[30 pts] 3. Use the data set SCHOOLSDATA.MTW in the Student14 folder.

a) Is average SAT verbal score (SATV) significant in explaining the percent of students going to college (%College) at α = 0.05? Explain your conclusion. [10 pts]

b) What is the 95% confidence interval for the percent of students going to college if the average SAT verbal score is 500? Interpret. [5 pts]

c) What is the 95% prediction interval for the percent of students going to college if the average SAT verbal score is 500? Interpret. [5 pts]

d) Is cost per pupil significant in explaining the percent of students going to college (%College) at α = 0.05? Explain your conclusion. [10 pts]

Chapter 13

Inference for Counts: Chi-Square Tests

© 2011 Pearson Education, Inc.


Business Statistics:

A First Course

QTM1310/ Sharpe


13.1 Chi-Square Tests

Given the following…

1) Counts of items in each of several categories

2) A model that predicts the distribution of the relative frequencies

…the basic idea is to ask:

“Does the actual distribution differ from the model because of random error or do the differences mean that the model does not fit the data?”

In other words, “How good is the fit between what we observe and what we expect to observe?”


Example: Stock Market “Up” Days

Sample of 1000 “up” days

“Up” days appear to be more common than expected on Fridays (we expect them to be equally likely across trading days).

Null Hypothesis: The distribution of “up” days is no different from what we expect (equally likely across days).

Test the hypothesis with a chi-square goodness-of-fit test.


The Chi-Square Distribution

Note that χ² “accumulates” the relative squared deviation of each cell from its expected value.

So, χ² gets “big” when the model is a poor fit.
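As a minimal sketch of how the statistic accumulates, the snippet below sums (Obs − Exp)² / Exp over the cells of a goodness-of-fit test. The counts are hypothetical (they are not the stock market data from the slides), and the function name `chi_square_statistic` is just an illustrative choice.

```python
def chi_square_statistic(observed, expected):
    """Sum of (Obs - Exp)^2 / Exp over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# HYPOTHETICAL counts for five categories (not the slide's data).
observed = [18, 22, 20, 25, 15]

# Null model: all five categories are equally likely.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1  # goodness-of-fit: number of categories minus 1

print(f"chi-square = {chi2:.2f} with {df} degrees of freedom")
```

Cells whose observed count sits far from the expected count contribute most of the total, which is why a poor fit produces a large statistic.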


Assumptions and Conditions

Counted Data Condition – The data must be counts for the categories of a categorical variable.

Independence Assumption – The counts should be independent of each other.

Randomization Condition – The counted individuals should be a random sample of the population.

Expected Cell Frequency Condition – Expect at least 5 individuals per cell.


The Chi-Square Calculation


The Chi-Square Calculation: Stock Market “Up” Days

Using a chi-square table at a significance level of 0.05 and with 4 degrees of freedom:

Do not reject the null hypothesis. (The fit is “good”.)
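The decision can also be checked without a table. The sketch below uses the closed-form upper-tail probability that holds for a chi-square distribution with an even number of degrees of freedom; the statistic 2.62 and the critical value 9.488 are the values this deck reports for the example, while the helper name `chi2_sf_even_df` is an assumption made for illustration.

```python
import math


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even, positive df.

    For df = 2k the tail probability is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


chi2_observed = 2.62  # statistic reported for the "up" days example
critical = 9.488      # table value for df = 4 at alpha = 0.05
alpha = 0.05

p_value = chi2_sf_even_df(chi2_observed, 4)
reject = chi2_observed > critical

print(f"p-value = {p_value:.3f}, reject H0: {reject}")
```

Comparing the statistic with the critical value and comparing the p-value with α are two views of the same decision rule, so they agree here.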


13.2 Interpreting Chi-Square Values

The Chi-Square Distribution

The distribution is right-skewed and becomes broader with increasing degrees of freedom:

The test is a one-sided test.


13.3 Examining the Residuals

When we reject a null hypothesis, we can examine the residuals in each cell to discover which values are extraordinary.

Because we might compare residuals for cells with very different counts, we should examine standardized residuals:

Note that standardized residuals from goodness-of-fit tests are actually z-scores (which we already know how to interpret and analyze).
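A standardized residual for one cell is (Obs − Exp) / √Exp. The sketch below applies that to the two observed/expected pairs that survive in this deck's chi-square calculation (192 vs. 193.4 and 218 vs. 199.7). The second value comes out near 1.29, presumably the Friday cell, with any small gap from a slide value explained by rounding in the expected count.

```python
import math


def standardized_residual(observed, expected):
    """(Obs - Exp) / sqrt(Exp) for a single cell."""
    return (observed - expected) / math.sqrt(expected)


# (observed, expected) pairs taken from the deck's chi-square calculation.
cells = [(192, 193.4), (218, 199.7)]
residuals = [standardized_residual(o, e) for o, e in cells]

for (o, e), z in zip(cells, residuals):
    print(f"observed={o}, expected={e}, standardized residual={z:.3f}")
```

Because these behave like z-scores, values well inside ±2 are unremarkable.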


Standardized residuals for the trading days data:


None of these values is remarkable.

The largest, Friday, at 1.292, is not impressive when viewed as a z-score.

The deviations are in the direction of a “weekend effect”, but they aren’t quite large enough for us to conclude they are real.


13.6 Chi-Square Test of Independence

The table below shows the importance of personal appearance for several age groups.

Are Age and Appearance independent, or is there a relationship?


A stacked barchart suggests a relationship:

Test for independence using a chi-square test of independence.


The test requires finding expected counts under the assumption that the null hypothesis is true (that the two variables are independent). Find the expected count for each cell by multiplying the appropriate row and column totals and divide by the table total:

$$\text{Exp}_{ij} = \frac{\text{Total Row}_i \times \text{Total Column}_j}{\text{Table Total}}$$
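The expected-count rule can be sketched in a few lines. The 2×3 table below is hypothetical (it is not the Age and Appearance data), and the function names are illustrative; the degrees of freedom use the usual (rows − 1) × (columns − 1) rule for a test of independence.

```python
def expected_counts(table):
    """Expected count per cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    expected = expected_counts(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected


# HYPOTHETICAL 2x3 contingency table of counts.
table = [
    [30, 20, 10],
    [20, 20, 20],
]

chi2, df, expected = chi_square_independence(table)
print("expected counts:", expected)
print(f"chi-square = {chi2:.3f} with {df} degrees of freedom")
```

Before trusting the result, each expected count should be checked against the Expected Cell Frequency Condition (at least 5 per cell).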


For the Appearance and Age example, we reject the null hypothesis that the variables are independent.

So, it may be of interest to know how differently two age groups (teens and 30-something adults) select the “very important” category (Appearance response 6 or 7).

You can construct a confidence interval for the true difference in these proportions…


From the table, the relevant percentages of responses (6 or 7) on Appearance for teens and 30-something adults are:

Teens: 45.17%

30-39: 39.91%

The 95% confidence interval is found below:
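The interval depends on the two group sample sizes, which are not shown here. As a rough sketch, the snippet below computes a two-proportion z-interval from the percentages above using HYPOTHETICAL sample sizes of 1000 per group, so its numeric result should not be read as the slide's answer.

```python
import math


def diff_proportion_ci(p1, n1, p2, n2, z_star=1.96):
    """Two-proportion z-interval for p1 - p2 (z_star = 1.96 for 95%)."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z_star * se
    return diff - margin, diff + margin


p_teens, p_thirties = 0.4517, 0.3991  # proportions from the slide
n_teens, n_thirties = 1000, 1000      # HYPOTHETICAL sample sizes

lower, upper = diff_proportion_ci(p_teens, n_teens, p_thirties, n_thirties)
print(f"95% CI for the difference: ({lower:.4f}, {upper:.4f})")
```

Substituting the real group sizes from the table changes only the width of the interval, not its center.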


What Can Go Wrong?

Don’t use chi-square methods unless you have counts.

Beware of large samples! With a sufficiently large sample size, a chi-square test will almost always reject the null hypothesis, even when the departure from the model is too small to matter in practice.

Don’t say that one variable “depends” on the other just because they’re not independent.


What Have We Learned?

Goodness-of-fit tests compare the observed distribution of a single categorical variable to an expected distribution based on a theory or model.

Tests of independence examine counts from a single group for evidence of an association between two categorical variables.


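Chi-square statistic for the stock market “up” days example:

$$\chi^2 = \frac{(192 - 193.4)^2}{193.4} + \dots + \frac{(218 - 199.7)^2}{199.7} = 2.62$$

Critical value at a significance level of 0.05 with 4 degrees of freedom:

$$\chi^2_4 = 9.488 > 2.62$$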

Chapter 14

Inference for Regression


14.1 The Population and the Sample

We already know that we can model the relationship between two quantitative variables by fitting a straight line to a sample of ordered pairs.

But, observations differ from sample to sample:


We can imagine a line that summarizes the true relationship between x and y for the entire population,

where µy is the population mean of y at a given value of x.

NOTE: We are assuming an idealized case in which the points (x, µy) are in fact exactly linear.


For a given value x:

The value of ŷ for a specific value of x obtained from a particular sample may not lie on the line µy.

These values of ŷ will be distributed about µy.

We can account for the error between ŷ and µy by adding an error term (ε) to the model:


Regression Inference

Collect a sample and estimate the population β’s by finding a regression line:

The residuals e = y – ŷ are the sample-based versions of ε.

Account for the uncertainties in β₀ and β₁ by making confidence intervals, as we’ve done for means and proportions.
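As a minimal sketch of the estimation step, the snippet below fits a least-squares line to a small HYPOTHETICAL sample and computes the residuals e = y − ŷ. The function name `fit_line` and the data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# HYPOTHETICAL sample of ordered pairs.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)

# Residuals: observed y minus the value predicted by the fitted line.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```

A different sample from the same population would give slightly different b0 and b1, which is the uncertainty the confidence intervals are meant to capture.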


14.2 Assumptions and Conditions

Inference in regression is based on these assumptions (check them in this order):

Linearity Assumption

Independence Assumption

Equal Variance Assumption

Normal Population Assumption


Testing the Assumptions

1. Make a scatterplot of the data to check for linearity. (Linearity Assumption)

2. Fit a regression and find the residuals, e, and predicted values, ŷ.

3. Plot the residuals against time (if appropriate) and check for evidence of patterns. (Independence Assumption)

4. Make a scatterplot of the residuals against x or the predicted values. This plot should not exhibit a “fan” or “cone” shape. (Equal Variance Assumption)


5. Make a histogram and/or Normal probability plot of the residuals (Normal Population Assumption)


Graphical Summary of Assumptions and Conditions


14.3 Regression Inference

For a sample, we expect b₁ to be close to the model slope β₁. For similar samples, the standard error of the slope is a measure of the variability of b₁ about the true slope β₁.


Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?

Hint: Compare the sₑ values.


Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?

Hint: Compare the sₓ values.


Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?

Hint: Compare n’s.
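The three hints point at the three ingredients of the standard error of the slope, which for simple regression is SE(b₁) = sₑ / (sₓ √(n − 1)): less scatter around the line, more spread in x, and a larger sample each make the slope estimate more consistent. The sketch below computes these quantities and the t-ratio for a HYPOTHETICAL sample; the function name is an illustrative assumption.

```python
import math


def slope_standard_error(xs, ys):
    """Return (b1, s_e, s_x, SE(b1)) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # spread around the line
    s_x = math.sqrt(sxx / (n - 1))                             # spread of the x-values
    se_b1 = s_e / (s_x * math.sqrt(n - 1))
    return b1, s_e, s_x, se_b1


# HYPOTHETICAL sample of ordered pairs.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, s_e, s_x, se_b1 = slope_standard_error(xs, ys)
t_ratio = b1 / se_b1  # test statistic for H0: slope = 0, with n - 2 df

print(f"b1 = {b1:.3f}, s_e = {s_e:.4f}, s_x = {s_x:.4f}")
print(f"SE(b1) = {se_b1:.4f}, t = {t_ratio:.2f}")
```

The t-ratio would then be compared with a Student's t critical value on n − 2 degrees of freedom.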


14.4 Standard Errors for Predicted Values

SE becomes larger the farther x gets from x̄.

That is, the confidence interval broadens as you move away from x̄. (See figure at right.)


SE, and the confidence interval, become smaller with increasing n.

SE, and the confidence interval, are larger for samples with more spread around the line (when sₑ is larger).


Because of the extra term sₑ², the prediction interval for individual values is broader than the confidence interval for predicted mean values.
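To make the comparison concrete, the sketch below computes both intervals from regression summary values. All inputs are HYPOTHETICAL, and the critical value `t_star` is assumed to be looked up in a t-table for n − 2 degrees of freedom (2.048 is the 95% value for 28 df).

```python
import math


def mean_and_prediction_intervals(b0, b1, se_b1, s_e, n, x_bar, x_new, t_star):
    """Return (y_hat, confidence interval for the mean, prediction interval)."""
    y_hat = b0 + b1 * x_new
    # Variance of the estimated mean response at x_new.
    var_mean = se_b1 ** 2 * (x_new - x_bar) ** 2 + s_e ** 2 / n
    se_mean = math.sqrt(var_mean)
    # An individual value adds the residual variance s_e^2 on top.
    se_pred = math.sqrt(var_mean + s_e ** 2)
    ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
    pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
    return y_hat, ci, pi


# HYPOTHETICAL regression summary values.
y_hat, ci, pi = mean_and_prediction_intervals(
    b0=1.0, b1=0.35, se_b1=0.05, s_e=0.3,
    n=30, x_bar=10.0, x_new=10.1, t_star=2.048,
)

print(f"y-hat = {y_hat:.3f}")
print(f"95% CI for the mean:       ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI for a single value: ({pi[0]:.3f}, {pi[1]:.3f})")
```

Both intervals share the same center; only the standard error differs, so the prediction interval is always the wider of the two.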


14.5 Using Confidence and Prediction Intervals

Confidence interval for a mean:

The result at 95% means

“We are 95% confident that the mean value of y is between 4.40 and 4.70 when x = 10.1.”


Prediction interval for an individual value:

The result at 95% means

“We are 95% confident that a single particular value of y will be between 3.95 and 5.15 when x = 10.1.”


14.6 Extrapolation and Prediction

Extrapolating – predicting a y value by extending the regression model to regions outside the range of the x-values of the data.


Why is extrapolation dangerous?

It introduces the questionable and untested assumption that the relationship between x and y does not change.


Cautionary Example: Oil Prices in Constant Dollars

Model Prediction (Extrapolation):

On average, the price of a barrel of oil will increase $7.39 per year from 1983 to 1998.


Cautionary Example: Oil Prices in Constant Dollars

Actual Price Behavior

Extrapolating the 1971-1982 model to the ’80s and ’90s led to grossly erroneous forecasts.


Remember: Linear models ought not be trusted beyond the span of the x-values of the data.

If you extrapolate far into the future, be prepared for the actual values to be (possibly quite) different from your predictions.


14.7 Unusual and Extraordinary Observations

In regression, an outlier can stand out in two ways. It can have…

1) a large residual:


2) a large distance from x̄ (a “high-leverage point”):

A high-leverage point is influential if omitting it gives a regression model with a very different slope.


Tell whether the point is a high-leverage point, if it has a large residual, and if it is influential.


Not high-leverage

Large residual

Not very influential


Tell whether the point is a high-leverage point, if it has a large residual, and if it is influential.


High-leverage

Small residual

Not very influential


Tell whether the point is a high-leverage point, if it has a large residual, and if it is influential.


High-leverage

Medium residual

Very influential (omitting the red point will change the slope dramatically!)


What should you do with a high-leverage point?

Sometimes, these points are important. They can indicate that the underlying relationship is in fact nonlinear.

Other times, they simply do not belong with the rest of the data and ought to be omitted.

When in doubt, create and report two models: one with the outlier and one without.
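One way to follow that advice is to fit the line twice and compare the slopes. In the sketch below, the data are HYPOTHETICAL: five points lie exactly on a line and a sixth sits far out in x and well off the trend, so it is both high-leverage and influential.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# HYPOTHETICAL data; the last point is far from the mean of x and off the trend.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 6, 8, 10, 5]

slope_with = slope(xs, ys)
slope_without = slope(xs[:-1], ys[:-1])  # refit with the suspect point omitted

print(f"slope with the point:    {slope_with:.3f}")
print(f"slope without the point: {slope_without:.3f}")
```

A large gap between the two slopes is the signal that the point is influential and that both models are worth reporting.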


What Have We Learned?

Do not fit a linear regression to data that are not straight.

Watch out for changing spread.

Watch out for non-Normal errors.

Beware of extrapolating, especially far into the future.

Look for unusual points. Consider setting aside outliers and re-running the regression.

Treat unusual points honestly.


Under certain conditions, the sampling distribution for the slope of a regression line can be modeled by a Student’s t-model with n – 2 degrees of freedom.

Check four conditions – in order – before proceeding to inference.

Linearity

Independence

Equal Variance

Normality


Use the appropriate t-model to test a hypothesis (H₀: β₁ = 0) about the slope.

Create and interpret a confidence interval for the slope.


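Fitted regression line:

$$\hat{y} = b_0 + b_1 x$$

Lines fitted to two different samples are not necessarily the same:

$$\hat{y}_1 = b_{0,1} + b_{1,1}\,x \quad \text{(sample 1)}, \qquad \hat{y}_2 = b_{0,2} + b_{1,2}\,x \quad \text{(sample 2)}$$

Population line:

$$\mu_y = \beta_0 + \beta_1 x$$

Model with the error term:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Sample estimates of the population parameters:

$$\hat{y} = b_0 + b_1 x, \qquad b_0 \approx \beta_0, \quad b_1 \approx \beta_1$$

Confidence interval for the mean response at $x_\nu$, where $\bar{x}$ is the mean of the x-values and $s_e^2$ is the residual variance:

$$\hat{y}_\nu \pm t^*_{n-2} \times \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \frac{s_e^2}{n}}$$

At $x_\nu = 10.1$:

$$\hat{\mu}_\nu = 4.55 \pm 0.15$$

Prediction interval for an individual value at $x_\nu$:

$$\hat{y}_\nu \pm t^*_{n-2} \times \sqrt{SE^2(b_1)\,(x_\nu - \bar{x})^2 + \frac{s_e^2}{n} + s_e^2}$$

At $x_\nu = 10.1$:

$$\hat{y}_\nu = 4.55 \pm 0.60$$

Confidence interval for the slope $\beta_1$:

$$b_1 \pm t^*_{n-2} \times SE(b_1)$$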