# DAV Public School Statistics Limits Question

**Show your work to receive credit wherever there is work to show.**

**13.1-13.2**: 13.1, 13.2, 13.3, 13.5, 13.7 (page 521)

**P370**: 9.38, 9.39, 9.40

Correction: in 9.40(a), 31.33 should be 31.53.

ftoc.qxd

10/15/09

12:38 PM

Page xviii


Statistics

Principles and Methods

SIXTH EDITION

Richard A. Johnson

University of Wisconsin at Madison

Gouri K. Bhattacharyya

John Wiley & Sons, Inc.


| Role | Name |
| --- | --- |
| Vice President & Executive Publisher | Laurie Rosatone |
| Project Editor | Ellen Keohane |
| Senior Development Editor | Anne Scanlan-Rohrer |
| Production Manager | Dorothy Sinclair |
| Senior Production Editor | Valerie A. Vargas |
| Marketing Manager | Sarah Davis |
| Creative Director | Harry Nolan |
| Design Director | Jeof Vita |
| Production Management Services | mb editorial services |
| Photo Editor | Sheena Goldstein |
| Editorial Assistant | Beth Pearson |
| Media Editor | Melissa Edwards |
| Cover Photo Credit | Gallo Images-Hein von Horsten/Getty Images, Inc. |
| Cover Designer | Celia Wiley |

This book was set in 10/12 Berling by Laserwords Private Limited, India and printed and bound by RR Donnelley-Crawfordsville. The cover was printed by RR Donnelley-Crawfordsville.

Copyright © 2010, 2006 John Wiley & Sons, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, website www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, (201)748-6011, fax (201)748-6008, website http://www.wiley.com/go/permissions.

Evaluation copies are provided to qualified academics and professionals for review purposes only, for use in their courses during the next academic year. These copies are licensed and may not be sold or transferred to a third party. Upon completion of the review period, please return the evaluation copy to Wiley. Return instructions and a free of charge return shipping label are available at www.wiley.com/go/returnlabel. Outside of the United States, please contact your local representative.

ISBN-13 978-0-470-40927-5

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1


Preface

THE NATURE OF THE BOOK

Conclusions, decisions, and actions that are data driven predominate in today’s world. Statistics, the subject of data analysis and data-based reasoning, is necessarily playing a vital role in virtually all professions. Some familiarity with this subject is now an essential component of any college education. Yet, pressures to accommodate a growing list of academic requirements often necessitate that this exposure be brief. Keeping these conditions in mind, we have written this book to provide students with a first exposure to the powerful ideas of modern statistics. It presents the key statistical concepts and the most commonly applied methods of statistical analysis. Moreover, to keep it accessible to freshmen and sophomores from a wide range of disciplines, we have avoided mathematical derivations. They usually pose a stumbling block to learning the essentials in a short period of time.

This book is intended for students who do not have a strong background in mathematics but seek to learn the basic ideas of statistics and their application in a variety of practical settings. The core material of this book is common to almost all first courses in statistics and is designed to be covered well within a one-semester course in introductory statistics for freshmen through seniors. It is supplemented with some additional special-topics chapters.

ORIENTATION

The topics treated in this text are, by and large, the ones typically covered in an introductory statistics course. They span three major areas: (i) descriptive statistics, which deals with summarization and description of data; (ii) ideas of probability and an understanding of the manner in which sample-to-sample variation influences our conclusions; and (iii) a collection of statistical methods for analyzing the types of data that are of common occurrence. However, it is the treatment of these topics that makes the text distinctive. Throughout, we have endeavored to give clear and concise explanations of the concepts and important statistical terminology and methods. By means of good motivation, sound explanations, and an abundance of illustrations given in a real-world context, the text emphasizes more than just a superficial understanding.

Each statistical concept or method is motivated by setting out its goal and then focusing on an example to further elaborate important aspects and to illustrate its application. The subsequent discussion is not limited to showing how a method works but also explains why it works. Even without recourse to mathematics, we are able to make the reader aware of possible pitfalls in the statistical analysis. Students can gain a proper appreciation of statistics only when they are provided with a careful explanation of the underlying logic. Without this understanding, the learning of elementary statistics is bound to be rote and transient.

Throughout the description of the various methods of statistical analysis, the reader is continually reminded that the validity of a statistical inference is contingent upon certain model assumptions. Misleading conclusions may result when these assumptions are violated. We feel that the teaching of statistics, even at an introductory level, should not be limited to the prescription of methods. Students should be encouraged to develop a critical attitude in applying the methods and to be cautious when interpreting the results. This attitude is especially important in the study of relationships among variables, which is perhaps the most widely used (and also abused) area of statistics. In addition to discussing inference procedures in this context, we have particularly stressed critical examination of the model assumptions and careful interpretation of the conclusions.

SPECIAL FEATURES

1. Crucial elements are boxed to highlight important concepts and methods. These boxes provide an ongoing summary of the important items essential for learning statistics. At the end of each chapter, all of its key ideas and formulas are summarized.

2. A rich collection of examples and exercises is included. These are drawn from a large variety of real-life settings. In fact, many data sets stem from genuine experiments, surveys, or reports.

3. Exercises are provided at the end of each major section. These provide the reader with the opportunity to practice the ideas just learned. Occasionally, they supplement some points raised in the text. A larger collection of exercises appears at the end of a chapter. The starred problems are relatively difficult and suited to the more mathematically competent student.

4. Using Statistics Wisely, a feature at the end of each chapter, provides important guidelines for the appropriate use of the statistical procedures presented in the chapter.

5. Statistics in Context sections, in four of the beginning chapters, each describe an important statistical application where a statistical approach to understanding variation is vital. These extended examples reveal, early on in the course, the value of understanding the subject of statistics.

6. P-values are emphasized in examples concerning tests of hypotheses. Graphs giving the relevant normal or t density curve, rejection region, and P-value are presented.

7. Regression analysis is a primary statistical technique, so we provide a more thorough coverage of the topic than is usual at this level. The basics of regression are introduced in Chapter 11, whereas Chapter 12 stretches the discussion to several issues of practical importance. These include methods of model checking, handling nonlinear relations, and multiple regression analysis. Complex formulas and calculations are judiciously replaced by computer output so the main ideas can be learned and appreciated with a minimum of stress.

8. Integrated Technology, at the end of most chapters, details the steps for using MINITAB, EXCEL,¹ and the TI-84 calculator. With this presentation available, with few exceptions, only computer output is needed in the text. Software packages remove much of the drudgery of hand calculation and they allow students to work with larger data sets where patterns are more pronounced. Some computer exercises are included in all chapters where relevant.

9. Convenient Electronic Data Bank at the end of the book contains a substantial collection of data. These data sets, together with numerous others throughout the book, allow for considerable flexibility in the choice between concept-orientated and applications-orientated exercises. The Data Bank and the other larger data sets are available for download on the accompanying Web site located at www.wiley.com/college/johnson.

10. Technical Appendix A presents a few statistical facts of a mathematical nature. These are separated from the main text so that they can be left out if the instructor so desires.

ABOUT THE SIXTH EDITION

The sixth edition of STATISTICS: Principles and Methods maintains the objectives and level of presentation of the earlier editions. The goals are to develop (i) an understanding of the reasoning by which findings from sample data can be extended to general conclusions and (ii) a familiarity with some basic statistical methods. There are numerous data sets and computer outputs, which give an appreciation of the role of the computer in modern data analysis.

Clear and concise explanations introduce the concepts and important statistical terminology and methods. Real-life settings are used to motivate the statistical ideas, and well-organized discussions proceed to cover statistical methods with heavy emphasis on examples. The sixth edition enhances these special features. The major improvements are:

Bayes’ Theorem. A new section is added to Chapter 4 to highlight the reasoning underlying Bayes’ theorem and to present applications.

Approximate t. A new subsection is added to Chapter 10, which describes the approximate two-sample t statistic that is now pervasive in statistical software programs. For normal distributions with unequal variances, this has become the preferred approach.

¹ Commands and the worksheets with data sets pertain to EXCEL 2003.


New Examples. A substantial number of new examples are included, especially in the core chapters, Chapter 11 on regression, and Chapter 13 on contingency tables.

More Data-Based Exercises. Most of the new exercises are keyed to new data-based examples in the text. New data are also presented in the exercises. Other new exercises are based on the credit card use and opinion data that are added to the data bank.

New Exercises. Numerous new exercises provide practice on understanding the concepts and others address computations. These new exercises, which augment the already rich collection, are placed in real-life settings to help promote a greater appreciation of the wide span of applicability of statistical methods.

ORGANIZATION

This book is organized into fifteen chapters, an optional technical appendix (Appendix A), and a collection of tables (Appendix B). Although designed for a one-semester or a two-quarter course, it is enriched with ample additional material to allow the instructor some choices of topics. Beyond Chapter 1, which sets the theme of statistics and distinguishes population and sample, the subject matter could be classified as follows:

| Topic | Chapter |
| --- | --- |
| Descriptive study of data | 2, 3 |
| Probability and distributions | 4, 5, 6 |
| Sampling variability | 7 |
| Core ideas and methods of statistical inference | 8, 9, 10 |
| Special topics of statistical inference | 11, 12, 13, 14, 15 |

We regard Chapters 1 to 10 as constituting the core material of an introductory statistics course, with the exception of the starred sections in Chapter 6. Although this material is just about enough for a one-semester course, many instructors may wish to eliminate some sections in order to cover the basics of regression analysis in Chapter 11. This is most conveniently done by initially skipping Chapter 3 and then taking up only those portions that are linked to Chapter 11. Also, instead of a thorough coverage of probability that is provided in Chapter 4, the later sections of that chapter may receive a lighter coverage.

SUPPLEMENTS

Instructor’s Solution Manual. (ISBN 978-0-470-53519-6) This manual contains complete solutions to all exercises.

Test Bank. (Available on the accompanying website: www.wiley.com/college/johnson) Contains a large number of additional questions for each chapter.

Student Solutions Manual. (ISBN 978-0-470-53521-9) This manual contains complete solutions to all odd-numbered exercises.

Electronic Data Bank. (Available on the accompanying website: www.wiley.com/college/johnson) Contains interesting data sets that are used in the text and can also be used to perform additional analyses with statistical software packages.

WileyPLUS. This powerful online tool provides a completely integrated suite of teaching and learning resources in one easy-to-use website. WileyPLUS offers an online assessment system with full gradebook capabilities and algorithmically generated skill building questions. This online teaching and learning environment also integrates the entire digital textbook. To view a demo of WileyPLUS, contact your local Wiley Sales Representative or visit: www.wiley.com/college/wileyplus.

ACKNOWLEDGMENTS

We thank Minitab (State College, Pa.) and the SAS Institute (Cary, N.C.) for permission to include commands and output from their software packages. A special thanks to K. T. Wu and Kam Tsui for many helpful suggestions and comments on earlier editions. We also thank all those who have contributed the data sets which enrich the presentation and all those who reviewed the previous editions. The following people gave their careful attention to this edition:

Hongshik Ahn, Stony Brook University

Prasanta Basak, Penn State University Altoona

Andrea Boito, Penn State University Altoona

Patricia M. Buchanan, Penn State University

Nural Chowdhury, University of Saskatchewan

S. Abdul Fazal, California State University Stanislaus

Christian K. Hansen, Eastern Washington University

Susan Kay Herring, Sonoma State University

Hui-Kuang Hsieh, University of Massachusetts Amherst

Hira L. Koul, Michigan State University

Melanie Martin, California State University Stanislaus

Mark McKibben, Goucher College

Charles H. Morgan, Jr., Lock Haven University of Pennsylvania

Perpetua Lynne Nielsen, Brigham Young University

Ashish Kumar Srivastava, St. Louis University

James Stamey, Baylor University

Masoud Tabatabai, Penn State University Harrisburg

Jed W. Utsinger, Ohio University

R. Patrick Vernon, Rhodes College


Roumen Vesselinov, University of South Carolina

Vladimir Vinogradov, Ohio University

A. G. Warrack, North Carolina A&T State University

Richard A. Johnson

Gouri K. Bhattacharyya


Contents

1 INTRODUCTION

1. What Is Statistics?
2. Statistics in Our Everyday Life
3. Statistics in Aid of Scientific Inquiry
4. Two Basic Concepts: Population and Sample
5. The Purposeful Collection of Data
6. Statistics in Context
7. Objectives of Statistics
8. Using Statistics Wisely
9. Key Ideas
10. Review Exercises

2 ORGANIZATION AND DESCRIPTION OF DATA

1. Introduction
2. Main Types of Data
3. Describing Data by Tables and Graphs
   - 3.1 Categorical Data
   - 3.2 Discrete Data
   - 3.3 Data on a Continuous Variable
4. Measures of Center
5. Measures of Variation
6. Checking the Stability of the Observations over Time
7. More on Graphics
8. Statistics in Context
9. Using Statistics Wisely
10. Key Ideas and Formulas
11. Technology
12. Review Exercises

3 DESCRIPTIVE STUDY OF BIVARIATE DATA

1. Introduction
2. Summarization of Bivariate Categorical Data
3. A Designed Experiment for Making a Comparison
4. Scatter Diagram of Bivariate Measurement Data
5. The Correlation Coefficient: A Measure of Linear Relation
6. Prediction of One Variable from Another (Linear Regression)
7. Using Statistics Wisely
8. Key Ideas and Formulas
9. Technology
10. Review Exercises

4 PROBABILITY

1. Introduction
2. Probability of an Event
3. Methods of Assigning Probability
   - 3.1 Equally Likely Elementary Outcomes: The Uniform Probability Model
   - 3.2 Probability As the Long-Run Relative Frequency
4. Event Relations and Two Laws of Probability
5. Conditional Probability and Independence
6. Bayes’ Theorem
7. Random Sampling from a Finite Population
8. Using Statistics Wisely
9. Key Ideas and Formulas
10. Technology
11. Review Exercises

5 PROBABILITY DISTRIBUTIONS

1. Introduction
2. Random Variables
3. Probability Distribution of a Discrete Random Variable
4. Expectation (Mean) and Standard Deviation of a Probability Distribution
5. Successes and Failures: Bernoulli Trials
6. The Binomial Distribution
7. The Binomial Distribution in Context
8. Using Statistics Wisely
9. Key Ideas and Formulas
10. Technology
11. Review Exercises

6 THE NORMAL DISTRIBUTION

1. Probability Model for a Continuous Random Variable
2. The Normal Distribution: Its General Features
3. The Standard Normal Distribution
4. Probability Calculations with Normal Distributions
5. The Normal Approximation to the Binomial
6. Checking the Plausibility of a Normal Model *
7. Transforming Observations to Attain Near Normality *
8. Using Statistics Wisely
9. Key Ideas and Formulas
10. Technology
11. Review Exercises

7 VARIATION IN REPEATED SAMPLES: SAMPLING DISTRIBUTIONS

1. Introduction
2. The Sampling Distribution of a Statistic
3. Distribution of the Sample Mean and the Central Limit Theorem
4. Statistics in Context
5. Using Statistics Wisely
6. Key Ideas and Formulas
7. Review Exercises
8. Class Projects
9. Computer Project

8 DRAWING INFERENCES FROM LARGE SAMPLES

1. Introduction
2. Point Estimation of a Population Mean
3. Confidence Interval for a Population Mean
4. Testing Hypotheses about a Population Mean
5. Inferences about a Population Proportion
6. Using Statistics Wisely
7. Key Ideas and Formulas
8. Technology
9. Review Exercises

9 SMALL-SAMPLE INFERENCES FOR NORMAL POPULATIONS

1. Introduction
2. Student’s t Distribution
3. Inferences about μ: Small Sample Size
   - 3.1 Confidence Interval for μ
   - 3.2 Hypotheses Tests for μ
4. Relationship between Tests and Confidence Intervals
5. Inferences about the Standard Deviation σ (The Chi-Square Distribution)
6. Robustness of Inference Procedures
7. Using Statistics Wisely
8. Key Ideas and Formulas
9. Technology
10. Review Exercises

10 COMPARING TWO TREATMENTS

1. Introduction
2. Independent Random Samples from Two Populations
3. Large Samples Inference about Difference of Two Means
4. Inferences from Small Samples: Normal Populations with Equal Variances
5. Inferences from Small Samples: Normal Populations with Unequal Variances
   - 5.1 A Conservative t Test
   - 5.2 An Approximate t Test: Satterthwaite Correction
6. Randomization and Its Role in Inference
7. Matched Pairs Comparisons
   - 7.1 Inferences from a Large Number of Matched Pairs
   - 7.2 Inferences from a Small Number of Matched Pairs
   - 7.3 Randomization with Matched Pairs
8. Choosing between Independent Samples and a Matched Pairs Sample
9. Comparing Two Population Proportions
10. Using Statistics Wisely
11. Key Ideas and Formulas
12. Technology
13. Review Exercises

11 REGRESSION ANALYSIS I: Simple Linear Regression

1. Introduction
2. Regression with a Single Predictor
3. A Straight-Line Regression Model
4. The Method of Least Squares
5. The Sampling Variability of the Least Squares Estimators: Tools for Inference
6. Important Inference Problems
   - 6.1 Inference Concerning the Slope β₁
   - 6.2 Inference about the Intercept β₀
   - 6.3 Estimation of the Mean Response for a Specified x Value
   - 6.4 Prediction of a Single Response for a Specified x Value
7. The Strength of a Linear Relation
8. Remarks about the Straight Line Model Assumptions
9. Using Statistics Wisely
10. Key Ideas and Formulas
11. Technology
12. Review Exercises

12 REGRESSION ANALYSIS II: Multiple Linear Regression and Other Topics

1. Introduction
2. Nonlinear Relations and Linearizing Transformations
3. Multiple Linear Regression
4. Residual Plots to Check the Adequacy of a Statistical Model
5. Using Statistics Wisely
6. Key Ideas and Formulas
7. Technology
8. Review Exercises

13 ANALYSIS OF CATEGORICAL DATA

1. Introduction
2. Pearson’s χ² Test for Goodness of Fit
3. Contingency Table with One Margin Fixed (Test of Homogeneity)
4. Contingency Table with Neither Margin Fixed (Test of Independence)
5. Using Statistics Wisely
6. Key Ideas and Formulas
7. Technology
8. Review Exercises

14 ANALYSIS OF VARIANCE (ANOVA)

1. Introduction
2. Comparison of Several Treatments: The Completely Randomized Design
3. Population Model and Inferences for a Completely Randomized Design
4. Simultaneous Confidence Intervals
5. Graphical Diagnostics and Displays to Supplement ANOVA
6. Randomized Block Experiments for Comparing k Treatments
7. Using Statistics Wisely
8. Key Ideas and Formulas
9. Technology
10. Review Exercises

15 NONPARAMETRIC INFERENCE

1. Introduction
2. The Wilcoxon Rank-Sum Test for Comparing Two Treatments
3. Matched Pairs Comparisons
4. Measure of Correlation Based on Ranks
5. Concluding Remarks
6. Using Statistics Wisely
7. Key Ideas and Formulas
8. Technology
9. Review Exercises

APPENDIX A1: SUMMATION NOTATION

APPENDIX A2: RULES FOR COUNTING

APPENDIX A3: EXPECTATION AND STANDARD DEVIATION: PROPERTIES

APPENDIX A4: THE EXPECTED VALUE AND STANDARD DEVIATION OF X̄

APPENDIX B: TABLES

- Table 1 Random Digits
- Table 2 Cumulative Binomial Probabilities
- Table 3 Standard Normal Probabilities
- Table 4 Percentage Points of t Distributions
- Table 5 Percentage Points of χ² Distributions
- Table 6 Percentage Points of F(ν₁, ν₂) Distributions
- Table 7 Selected Tail Probabilities for the Null Distribution of Wilcoxon’s Rank-Sum Statistic
- Table 8 Selected Tail Probabilities for the Null Distribution of Wilcoxon’s Signed-Rank Statistic

DATA BANK

ANSWERS TO SELECTED ODD-NUMBERED EXERCISES

INDEX


Chapter 1: Introduction

1. What Is Statistics?
2. Statistics in Our Everyday Life
3. Statistics in Aid of Scientific Inquiry
4. Two Basic Concepts: Population and Sample
5. The Purposeful Collection of Data
6. Statistics in Context
7. Objectives of Statistics
8. Review Exercises


Surveys Provide Information About the Population

What is your favorite spectator sport?

| Sport | Percentage |
| --- | --- |
| Football | 36.4% |
| Baseball | 12.7% |
| Basketball | 12.5% |
| Other | 38.4% |

College and professional sports are combined in our summary.¹ Clearly, football is the most popular spectator sport. Actually, the National Football League by itself is more popular than baseball.

Until the mid-1960s, baseball was most popular according to similar surveys. Surveys, repeated at different times, can detect trends in opinion.

Hometown fans attending today’s game are but a sample of the population of all local football fans. A self-selected sample may not be entirely representative of the population on issues such as ticket price increases. Kiichiro Sato/© AP/Wide World Photos

¹ These percentages are similar to those obtained by the ESPN Sports Poll, a service of TNS, in a 2007 poll of over 27,000 fans.


1. WHAT IS STATISTICS?

The word statistics originated from the Latin word “status,” meaning “state.” For a long time, it was identified solely with the displays of data and charts pertaining to the economic, demographic, and political situations prevailing in a country. Even today, a major segment of the general public thinks of statistics as synonymous with forbidding arrays of numbers and myriad graphs. This image is enhanced by numerous government reports that contain a massive compilation of numbers and carry the word statistics in their titles: “Statistics of Farm Production,” “Statistics of Trade and Shipping,” “Labor Statistics,” to name a few. However, gigantic advances during the twentieth century have enabled statistics to grow and assume its present importance as a discipline of data-based reasoning. Passive display of numbers and charts is now a minor aspect of statistics, and few, if any, of today’s statisticians are engaged in the routine activities of tabulation and charting.

What, then, are the role and principal objectives of statistics as a scientific discipline? Stretching well beyond the confines of data display, statistics deals with collecting informative data, interpreting these data, and drawing conclusions about a phenomenon under study. The scope of this subject naturally extends to all processes of acquiring knowledge that involve fact finding through collection and examination of data. Opinion polls (surveys of households to study sociological, economic, or health-related issues), agricultural field experiments (with new seeds, pesticides, or farming equipment), clinical studies of vaccines, and cloud seeding for artificial rain production are just a few examples. The principles and methodology of statistics are useful in answering questions such as, What kind and how much data need to be collected? How should we organize and interpret the data? How can we analyze the data and draw conclusions? How do we assess the strength of the conclusions and gauge their uncertainty?

Statistics as a subject provides a body of principles and methodology for designing the process of data collection, summarizing and interpreting the data, and drawing conclusions or generalities.

2. STATISTICS IN OUR EVERYDAY LIFE

Fact finding through the collection and interpretation of data is not confined to professional researchers. In our attempts to understand issues of environmental protection, the state of unemployment, or the performance of competing football teams, numerical facts and figures need to be reviewed and interpreted. In our day-to-day life, learning takes place through an often implicit analysis of factual information. We are all familiar to some extent with reports in the news media on important statistics.


Employment. Monthly, as part of the Current Population Survey, the Bureau of the Census collects information about employment status from a sample of about 65,000 households. Households are contacted on a rotating basis, with three-fourths of the sample remaining the same for any two consecutive months. The survey data are analyzed by the Bureau of Labor Statistics, which reports monthly unemployment rates.
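The three-fourths overlap between consecutive months can be reproduced from a rotation pattern. The sketch below assumes the 4-8-4 scheme commonly described for the Current Population Survey (a rotation group is interviewed for 4 consecutive months, rested for 8, then interviewed for 4 more) and equal-sized rotation groups. Neither assumption is stated in the text above, and the names `IN_SAMPLE_OFFSETS`, `groups_in_sample`, and `overlap_fraction` are illustrative only.

```python
# Months, counted from a group's first interview, in which that group is in the
# sample under the ASSUMED 4-8-4 pattern: 4 months in, 8 months out, 4 months in.
IN_SAMPLE_OFFSETS = (0, 1, 2, 3, 12, 13, 14, 15)


def groups_in_sample(month):
    """Return the entry months of the rotation groups interviewed in `month`."""
    return {month - offset for offset in IN_SAMPLE_OFFSETS}


def overlap_fraction(month_a, month_b):
    """Fraction of month_a's rotation groups that are also interviewed in month_b."""
    a = groups_in_sample(month_a)
    b = groups_in_sample(month_b)
    return len(a & b) / len(a)


# Month labels are arbitrary integers; only the gap between them matters.
print(overlap_fraction(20, 21))  # consecutive months -> 0.75
print(overlap_fraction(20, 32))  # same month one year later -> 0.5
```

Under these assumptions, six of the eight groups interviewed in any month return the following month, which matches the three-fourths figure quoted above.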

Cost of Living. The consumer price index (CPI) measures the cost of a fixed market basket of over 400 goods and services. Each month, prices are obtained from a sample of over 18,000 retail stores that are distributed over 85 metropolitan areas. These prices are then combined, taking into account the relative quantity of goods and services required by a hypothetical “1967 urban wage earner.” Let us not be concerned with the details of the sampling method and calculations, as these are quite intricate. They are, however, under close scrutiny because of their importance to the hundreds of thousands of Americans whose earnings or retirement benefits are tied to the CPI.
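The idea of pricing a fixed market basket can be sketched in a few lines. This is a minimal illustration and not the Bureau of Labor Statistics procedure: the three items, their quantities, and their prices are hypothetical, and the function name `fixed_basket_index` is ours. It prices the same basket at base-period and current prices and expresses the ratio on a base of 100.

```python
def fixed_basket_index(quantities, base_prices, current_prices):
    """Cost of a fixed basket at current prices relative to base prices (base = 100)."""
    base_cost = sum(quantities[item] * base_prices[item] for item in quantities)
    current_cost = sum(quantities[item] * current_prices[item] for item in quantities)
    return 100.0 * current_cost / base_cost


# Hypothetical basket: quantities stay fixed, only prices change.
quantities = {"bread": 10, "milk": 8, "bus fare": 20}
base_prices = {"bread": 2.00, "milk": 1.50, "bus fare": 1.00}
current_prices = {"bread": 2.20, "milk": 1.80, "bus fare": 1.25}

index = fixed_basket_index(quantities, base_prices, current_prices)
print(f"Index relative to base period: {index:.1f}")  # basket cost rose from 52.00 to 61.40
```

Because the quantities are held fixed, any change in the index reflects price changes alone, which is the sense in which the basket is "fixed".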

Election time brings the pollsters into the limelight.

Gallup Poll. This, the best known of the national polls, produces estimates of the percentage of popular vote for each candidate based on interviews with a minimum of 1500 adults. Beginning several months before the presidential election, results are regularly published. These reports help predict winners and track changes in voter preferences.
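To get a feel for how precise an estimate from about 1500 interviews can be, one can compute an approximate 95% margin of error for a sample proportion. The sketch below assumes a simple random sample and the large-sample normal approximation. A real national poll uses a more complex design, so this is only a rough guide, and the function name `margin_of_error` is illustrative.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


n = 1500  # minimum number of adults interviewed, as stated in the text
worst_case = margin_of_error(0.5, n)  # p_hat = 0.5 gives the widest margin
print(f"Margin of error: about {100 * worst_case:.1f} percentage points")
```

Under these assumptions the margin shrinks only with the square root of the sample size, so quadrupling the number of interviews would halve it.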

Our sources of factual information range from individual experience to reports

in news media, government records, and articles in professional journals. As consumers of these reports, citizens need some idea of statistical reasoning to properly

interpret the data and evaluate the conclusions. Statistical reasoning provides criteria for determining which conclusions are supported by the data and which are not.

The credibility of conclusions also depends greatly on the use of statistical methods

at the data collection stage. Statistics provides a key ingredient for any systematic

approach to improve any type of process from manufacturing to service.

Quality and Productivity Improvement. In the past 30 years, the

United States has faced increasing competition in the world marketplace. An international revolution in quality and productivity improvement has heightened

the pressure on the U.S. economy. The ideas and teaching of W. Edwards Deming helped rejuvenate Japan’s industry in the late 1940s and 1950s. In the 1980s

and 1990s, Deming stressed to American executives that, in order to survive,

they must mobilize their work force to make a continuing commitment to quality improvement. His ideas have also been applied to government. The city of

Madison, WI, has implemented quality improvement projects in the police department and in bus repair and scheduling. In each case, the project goal was

better service at less cost. With citizens treated as the customers of government services, the first step was to collect information from them in order to identify situations that needed improvement. One end result was the strategic placement

of a new police substation and a subsequent increase in the number of foot patrol officers to interact with the community.

Statistical reasoning can guide the purposeful collection and analysis of data toward the

continuous improvement of any process. © Andrew Sacks/Stone/Getty Images

Once a candidate project is selected for improvement, data must be collected to assess the current status and then more data collected on the effects of

possible changes. At this stage, statistical skills in the collection and presentation

of summaries are not only valuable but necessary for all participants.

In an industrial setting, statistical training for all employees — production

line and office workers, supervisors, and managers — is vital to the quality transformation of American industry.

3. STATISTICS IN AID OF SCIENTIFIC INQUIRY

The phrase scientific inquiry refers to a systematic process of learning. A scientist sets the goal of an investigation, collects relevant factual information (or

data), analyzes the data, draws conclusions, and decides further courses of action. We briefly outline a few illustrative scenarios.

Training Programs. Training or teaching programs in many fields designed

for a specific type of clientele (college students, industrial workers, minority groups,

people with physical disabilities, children with intellectual disabilities, etc.) are continually monitored,

evaluated, and modified to improve their usefulness to society. To learn about the

comparative effectiveness of different programs, it is essential to collect data on the

achievement or growth of skill of the trainees at the completion of each program.

Monitoring Advertising Claims. The public is constantly bombarded

with commercials that claim the superiority of one product brand in comparison to

others. When such comparisons are founded on sound experimental evidence, they

serve to educate the consumer. Not infrequently, however, misleading advertising

claims are made due to insufficient experimentation, faulty analysis of data, or even

blatant manipulation of experimental results. Government agencies and consumer

groups must be prepared to verify the comparative quality of products by using adequate data collection procedures and proper methods of statistical analysis.

Plant Breeding. To increase food production, agricultural scientists

develop new hybrids by cross-fertilizing different plant species. Promising new

strains need to be compared with the current best ones. Their relative productivity is assessed by planting some of each variety at a number of sites. Yields are

recorded and then analyzed for apparent differences. The strains may also be

compared on the basis of disease resistance or fertilizer requirements.

Genomics. This century’s most exciting scientific advances are occurring

in biology and genetics. Scientists can now study the genome, or sum total of all

of a living organism’s genes. The human DNA sequence is now known along

with the DNA sequences of hundreds of other organisms.

A primary goal of many studies is to identify the specific genes and related genetic states that give rise to complex traits (e.g., diabetes, heart disease, cancer).

New instruments for measuring genes and their products are continually being

developed. One popular technology is the microarray, a rectangular array of tens of

thousands of genes. The power of microarray technologies derives from the ability

to compare, for instance, healthy and diseased tissue. Two-color microarrays have

two kinds of DNA material deposited at each site in the array. Due to the impact of the disease and the availability of human tumor specimens, many early microarray studies focused on human cancer. Significant advances have been made in cancer classification, knowledge of cancer biology, and prognostic prediction. A hallmark example of the power of microarrays used in prognostic prediction is MammaPrint, approved by the FDA in 2007. This, the first approved microarray-based test, classifies a breast cancer patient as low or high risk for recurrence.

Statistically designed experiments are needed to document the advantages of the new hybrid versus the old species. © Mitch Wojnarowicz/The Image Works

This is clearly only the beginning, as numerous groups are employing microarrays and other high-throughput technologies in their research studies. Typically, genomics experiments feature the simultaneous measurement of a great

number of responses. As more and more data are collected, there is a growing

need for novel statistical methods for analyzing data and thereby addressing critical scientific questions. Statisticians and other computational scientists are playing a major role in these efforts to better human health.

Factual information is crucial to any investigation. The branch of statistics

called experimental design can guide the investigator in planning the manner

and extent of data collection.

The Conjecture-Experiment-Analysis Learning Cycle

Invention of the Sandwich by the Earl of Sandwich (According to Woody Allen, Humorist)*

| Experiment | Analysis |
|---|---|
| First completed work: a slice of bread, a slice of bread and a slice of turkey on top of both | fails miserably |
| two slices of turkey with a slice of bread in the middle | rejected |
| three consecutive slices of ham stacked on one another | improved reputation |
| three slices of bread | some interest, mostly in intellectual circles |
| several strips of ham, enclosed top and bottom by two slices of bread | immediate success |

A conjecture (C) links each analysis to the next experiment.

*Copyright © 1966 by Woody Allen. Adapted by permission of Random House, Inc. from Getting Even, by Woody Allen.

After the data are collected, statistical methods are available that summarize and describe the prominent features of data. These are commonly known as

descriptive statistics. Today, a major thrust of the subject is the evaluation of information present in data and the assessment of the new learning gained from

this information. This is the area of inferential statistics and its associated methods are known as the methods of statistical inference.

It must be realized that a scientific investigation is typically a process of trial

and error. Rarely, if ever, can a phenomenon be completely understood or a theory perfected by means of a single, definitive experiment. It is too much to expect to get it all right in one shot. Even after his first success with the electric

light bulb, Thomas Edison had to continue to experiment with numerous materials for the filament before it was perfected. Data obtained from an experiment

provide new knowledge. This knowledge often suggests a revision of an existing

theory, and this itself may require further investigation through more experiments and analysis of data. Humorous as it may appear, the excerpt boxed

above from a Woody Allen writing captures the vital point that a scientific

process of learning is essentially iterative in nature.

4. TWO BASIC CONCEPTS — POPULATION AND SAMPLE

In the preceding sections, we cited a few examples of situations where evaluation of factual information is essential for acquiring new knowledge. Although

these examples are drawn from widely differing fields and only sketchy descriptions of the scope and objectives of the studies are provided, a few common

characteristics are readily discernible.

First, in order to acquire new knowledge, relevant data must be collected.

Second, some amount of variability in the data is unavoidable even though observations are made under the same or closely similar conditions. For instance,

the treatment for an allergy may provide long-lasting relief for some individuals

whereas it may bring only transient relief or even none at all to others. Likewise, it is unrealistic to expect that college freshmen whose high school records

were alike would perform equally well in college. Nature does not follow such

a rigid law.

A third notable feature is that access to a complete set of data is either

physically impossible or from a practical standpoint not feasible. When data are

obtained from laboratory experiments or field trials, no matter how much experimentation has been performed, more can always be done. In public opinion

or consumer expenditure studies, a complete body of information would

emerge only if data were gathered from every individual in the nation — undoubtedly a monumental if not an impossible task. To collect an exhaustive set

of data related to the damage sustained by all cars of a particular model under

collision at a specified speed, every car of that model coming off the production

lines would have to be subjected to a collision! Thus, the limitations of time, resources, and facilities, and sometimes the destructive nature of the testing, mean

that we must work with incomplete information — the data that are actually

collected in the course of an experimental study.

The preceding discussions highlight a distinction between the data set that

is actually acquired through the process of observation and the vast collection of

all potential observations that can be conceived in a given context. The statistical name for the former is sample; for the latter, it is population, or statistical

population. To further elucidate these concepts, we observe that each measurement in a data set originates from a distinct source which may be a patient, tree,

farm, household, or some other entity depending on the object of a study. The

source of each measurement is called a sampling unit, or simply, a unit.

To emphasize population as the entire collection of units, we term it the

population of units.

A unit is a single entity, usually a person or an object, whose characteristics are of interest.

The population of units is the complete collection of units about

which information is sought.

There is another aspect to any population and that is the value, for each unit, of

a characteristic or variable of interest. There can be several characteristics of interest for a given population of units, as indicated in Table 1.

TABLE 1 Populations, Units, and Variables

| Population | Unit | Variables/Characteristics |
|---|---|---|
| Registered voters in your state | Voter | Political party; Voted or not in last election; Age; Sex; Conservative/liberal |
| All rental apartments near campus | Apartment | Rent; Size in square feet; Number of bedrooms; Number of bathrooms; TV and Internet connections |
| All campus fast food restaurants | Restaurant | Number of employees; Seating capacity; Hiring/not hiring |
| All computers owned by students at your school | Computer | Speed of processor; Size of hard disk; Speed of Internet connection; Screen size |

For a given variable or characteristic of interest, we call the collection of values, evaluated for every unit in the population, the statistical population or just the population. We refer to the collection of units as the population of units when there is a need to differentiate it from the collection of values.

A statistical population is the set of measurements (or record of some

qualitative trait) corresponding to the entire collection of units about

which information is sought.

The population represents the target of an investigation. We learn about the

population by taking a sample from the population. A sample or sample data

set then consists of measurements recorded for those units that are actually observed. It constitutes a part of a far larger collection about which we wish to

make inferences — the set of measurements that would result if all the units in

the population could be observed.

A sample from a statistical population is the subset of measurements that

are actually collected in the course of an investigation.

Example 1

Identifying the Population and Sample

Questions concerning the effect on health of two or fewer cups of coffee a

day are still largely unresolved. Current studies seek to find physiological

changes that could prove harmful. An article carried the headline CAFFEINE

DECREASES CEREBRAL BLOOD FLOW. It describes a study² which establishes a physiological side effect: a substantial decrease in cerebral blood

flow for persons drinking two to three cups of coffee daily.

The cerebral blood flow was measured twice on each of 20 subjects. It was

measured once after taking an oral dose of caffeine equivalent to two to three

cups of coffee and then, on another day, after taking a look-alike dose but without caffeine. The order of the two tests was random and subjects were not told

which dose they received. The measured decrease in cerebral blood flow was

significant.

Identify the population and sample.

SOLUTION

As the article implies, the conclusion should apply to you and me. The population could well be the potential decreases in cerebral blood flow for all adults living in the United States. It might even be the decreases in blood flow for all caffeine users in the world, although cultural customs may vary the type of caffeine consumption from coffee breaks to tea time to kola nut chewing.

The sample consists of the decreases in blood flow for the 20 subjects who agreed to participate in the study.

² A. Field et al. “Dietary Caffeine Consumption and Withdrawal: Confounding Variables in Quantitative Cerebral Perfusion Studies?” Radiology 227 (2003), pp. 129–135.

Example 2

A Misleading Sample

A host of a radio music show announced that she wants to know which

singer is the favorite among city residents. Listeners were then asked to call in

and name their favorite singer.

Identify the population and sample. Comment on how to get a sample

that is more representative of the city’s population.

SOLUTION

The population is the collection of singer preferences of all city residents and

the purported goal was to learn who was the favorite singer. Because it would

be nearly impossible to question all the residents in a large city, one must

necessarily settle for taking a sample.

Having residents make a local call is certainly a low-cost method of getting a sample. The sample would then consist of the singers named by each

person who calls the radio station. Unfortunately, with this selection procedure,

the sample is not very representative of the responses from all city residents.

Those who listen to the particular radio station are already a special subgroup

with similar listening tastes. Furthermore, those listeners who take the time

and effort to call are usually those who feel strongest about their opinions.

The resulting responses could well be much stronger in favor of a particular

country western or rock singer than is the case for preference among the total

population of city residents or even those who listen to the station.

If the purpose of asking the question is really to determine the favorite

singer of the city’s residents, we have to proceed otherwise. One procedure

commonly employed is a phone survey where the phone numbers are chosen

at random. For instance, one can imagine that the numbers 0, 1, 2, 3, 4, 5, 6,

7, 8, and 9 are written on separate pieces of paper and placed in a hat. Slips

are then drawn one at a time and replaced between drawings. Later, we will

see that computers can mimic this selection quickly and easily. Four draws

will produce a random telephone number within a three-digit exchange.

Telephone numbers chosen in this manner will certainly produce a much

more representative sample than the self-selected sample of persons who call

the station.

Self-selected samples consisting of responses to call-in or write-in requests

will, in general, not be representative of the population. They arise primarily

from subjects who feel strongly about the issue in question. To their credit,

many TV news and entertainment programs now state that their call-in polls are

nonscientific and merely reflect the opinions of those persons who responded.
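To see how far a self-selected sample can drift from the population, we can run a small simulation. The sketch below is illustrative only: the city size, the 30% true share, and the call-in probabilities are all made-up assumptions, not figures from any study.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical city: each resident either prefers singer A (True) or not.
# Assumption for illustration: about 30% of residents prefer singer A.
N = 100_000
prefers_a = [rng.random() < 0.30 for _ in range(N)]
true_share = sum(prefers_a) / N

# Self-selected sample: fans of singer A are assumed to be five times
# as likely to call the station as everyone else (invented rates).
call_in = [p for p in prefers_a if rng.random() < (0.05 if p else 0.01)]
call_in_share = sum(call_in) / len(call_in)

# Random sample of the same size, drawn without regard to preference.
random_sample = rng.sample(prefers_a, len(call_in))
random_share = sum(random_sample) / len(random_sample)

print(f"true share of residents preferring A: {true_share:.3f}")
print(f"call-in poll estimate:                {call_in_share:.3f}")
print(f"random sample estimate:               {random_share:.3f}")
```

Under these assumptions the call-in estimate lands far above the true share, while the random sample of the same size stays close to it.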

c01.qxd

10/15/09

12

11:59 AM

Page 12

CHAPTER 1/INTRODUCTION

USING A RANDOM NUMBER TABLE TO SELECT A SAMPLE

The choice of which population units to include in a sample must be impartial

and objective. When the total number of units is finite, the name or number of

each population unit could be written on a separate slip of paper and the slips

placed in a box. Slips could be drawn one at a time without replacement and

the corresponding units selected as the sample of units. Unfortunately, this simple and intuitive procedure is cumbersome to implement. Also, it is difficult to

mix the slips well enough to ensure impartiality.

Alternatively, a better method is to take 10 identical marbles, number them

0 through 9, and place them in an urn. After shuffling, select 1 marble. After replacing the marble, shuffle and draw again. Continuing in this way, we create a

sequence of random digits. Each digit has an equal chance of appearing in any

given position, all pairs have the same chance of appearing in any two given positions, and so on. Further, any digit or collection of digits is unrelated to any

other disjoint subset of digits. For convenience of use, these digits can be placed

in a table called a random number table.

The digits in Table 1 of Appendix B were actually generated using computer

software that closely mimics the drawing of marbles. A portion of this table is

shown here as Table 2.
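As a rough sketch of how software can mimic drawing marbles with replacement, the following Python code builds a small table of random digits in groups of four. It uses Python's standard pseudorandom generator, which is our choice for illustration and not necessarily the software used to produce Table 1 of Appendix B. The function name and the seed are hypothetical.

```python
import random

def random_digit_table(n_rows, groups_per_row, seed=None):
    """Return rows of four-digit groups; each digit is an independent draw from 0-9."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_rows):
        row = [
            "".join(str(rng.randrange(10)) for _ in range(4))  # one draw per digit
            for _ in range(groups_per_row)
        ]
        table.append(row)
    return table

# Five rows of ten groups, i.e. 40 digit columns per row.
table = random_digit_table(n_rows=5, groups_per_row=10, seed=2009)
for row_number, row in enumerate(table, start=1):
    print(f"{row_number:>3}  " + " ".join(row))
```

Each call to `rng.randrange(10)` plays the role of replacing the marble, shuffling, and drawing again.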

To obtain a random sample of units from a population of size N, we first

number the units from 1 to N. Then numbers are read from the table of random

digits until enough different numbers in the appropriate range are selected.

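TABLE 2 Random Digits: A Portion of Table 1, Appendix B

| Row | 1–4 | 5–8 | 9–12 | 13–16 | 17–20 | 21–24 | 25–28 | 29–32 | 33–36 | 37–40 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0695 | 7741 | 8254 | 4297 | 0000 | 5277 | 6563 | 9265 | 1023 | 5925 |
| 2 | 0437 | 5434 | 8503 | 3928 | 6979 | 9393 | 8936 | 9088 | 5744 | 4790 |
| 3 | 6242 | 2998 | 0205 | 5469 | 3365 | 7950 | 7256 | 3716 | 8385 | 0253 |
| 4 | 7090 | 4074 | 1257 | 7175 | 3310 | 0712 | 4748 | 4226 | 0604 | 3804 |
| 5 | 0683 | 6999 | 4828 | 7888 | 0087 | 9288 | 7855 | 2678 | 3315 | 6718 |
| 6 | 7013 | 4300 | 3768 | 2572 | 6473 | 2411 | 6285 | 0069 | 5422 | 6175 |
| 7 | 8808 | 2786 | 5369 | 9571 | 3412 | 2465 | 6419 | 3990 | 0294 | 0896 |
| 8 | 9876 | 3602 | 5812 | 0124 | 1997 | 6445 | 3176 | 2682 | 1259 | 1728 |
| 9 | 1873 | 1065 | 8976 | 1295 | 9434 | 3178 | 0602 | 0732 | 6616 | 7972 |
| 10 | 2581 | 3075 | 4622 | 2974 | 7069 | 5605 | 0420 | 2949 | 4387 | 7679 |
| 11 | 3785 | 6401 | 0540 | 5077 | 7132 | 4135 | 4646 | 3834 | 6753 | 1593 |
| 12 | 8626 | 4017 | 1544 | 4202 | 8986 | 1432 | 2810 | 2418 | 8052 | 2710 |
| 13 | 6253 | 0726 | 9483 | 6753 | 4732 | 2284 | 0421 | 3010 | 7885 | 8436 |
| 14 | 0113 | 4546 | 2212 | 9829 | 2351 | 1370 | 2707 | 3329 | 6574 | 7002 |
| 15 | 4646 | 6474 | 9983 | 8738 | 1603 | 8671 | 0489 | 9588 | 3309 | 5860 |

The column headings give the digit positions (columns 1 through 40) in each row.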

Example 3

Using the Table of Random Digits to Select Items for a Price Check

One week, the advertisement for a large grocery store contains 72 special sale

items. Five items will be selected with the intention of comparing the sales

price with the scan price at the checkout counter. Select the five items at random to avoid partiality.

SOLUTION

The 72 sale items are first numbered from 1 to 72. Since the population size N = 72 has two digits, we will select random digits two at a time from Table 2. Arbitrarily, we decide to start in row 7 and columns 19 and 20. Starting with the two digits in columns 19 and 20 and reading down, we obtain

12 97 34 69 32 86 32 51

We ignore 97 and 86 because they are larger than the population size 72. We also ignore any number when it appears a second time as 32 does here. Consequently, the sale items numbered

12 34 69 32 51

are selected for the price check.

For large sample size situations or frequent applications, it is often more

convenient to use computer software to choose the random numbers.
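The selection rule of Example 3 can be sketched in a few lines of Python. The function name `select_units` is our own, introduced only for illustration. The first part replays the two-digit numbers read from Table 2. The second part shows one way software might choose the items directly, using the standard library's `random.sample`.

```python
import random

def select_units(numbers_read, population_size, sample_size):
    """Read numbers in order, skipping out-of-range values and repeats."""
    chosen = []
    for number in numbers_read:
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# Two-digit numbers read down columns 19 and 20 of Table 2, starting in row 7.
table_reading = [12, 97, 34, 69, 32, 86, 32, 51]
selected = select_units(table_reading, population_size=72, sample_size=5)
print(selected)  # [12, 34, 69, 32, 51]

# Letting the software choose: five distinct item numbers from 1 to 72.
software_choice = random.sample(range(1, 73), k=5)
print(sorted(software_choice))
```

The function skips 97 and 86 because they exceed 72 and skips the second 32 because it repeats, which reproduces the hand selection above.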

Example 4

Selecting a Sample by Random Digit Dialing

A major Internet service provider wants to learn about the proportion of

people in one target area who are aware of its latest product. Suppose there

is a single three-digit telephone exchange that covers the target area. Use

Table 1, in Appendix B, to select six telephone numbers for a phone survey.

SOLUTION

We arbitrarily decide to start at row 31 and columns 25 to 28. Proceeding

upward, we obtain

7566 0766 1619 9320 1307 6435

Together with the three-digit exchange, these six numbers form the phone

numbers called in the survey. Every phone number, listed or unlisted, has the

same chance of being selected. The same holds for every pair, every triplet,

and so on. Commercial phones may have to be discarded and another four

digits selected. If there are two exchanges in the area, separate selections

could be done for each exchange.

For large sample sizes, it is better to use computer-generated random digits or even computer-dialed random phone numbers.
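A minimal sketch of computer-generated random digit dialing follows. The exchange "555" and the function name are hypothetical, and skipping repeated numbers is our assumption about how a survey would treat duplicates.

```python
import random

def random_phone_numbers(exchange, how_many, seed=None):
    """Attach distinct random four-digit suffixes to a three-digit exchange."""
    rng = random.Random(seed)
    suffixes = []
    while len(suffixes) < how_many:
        # Four independent draws from the digits 0-9, as with the hat or urn.
        suffix = "".join(str(rng.randrange(10)) for _ in range(4))
        if suffix not in suffixes:  # assumption: do not dial the same number twice
            suffixes.append(suffix)
    return [f"{exchange}-{suffix}" for suffix in suffixes]

numbers = random_phone_numbers("555", how_many=6, seed=1)
for number in numbers:
    print(number)
```

Commercial numbers would still have to be discarded by hand or by a lookup and replaced with further draws, as described in the example.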

Data collected with a clear-cut purpose in mind are very different from anecdotal data. Most of us have heard people say they won money at a casino, but

certainly most people cannot win most of the time as casinos are not in the business of giving away money. People tend to tell good things about themselves. In a

similar vein, some drivers’ lives are saved when they are thrown free of car

wrecks because they were not wearing seat belts. Although such stories are told

and retold, you must remember that there is really no opportunity to hear from

those who would have lived if they had worn their seat belts. Anecdotal information is usually repeated because it has some striking feature that may not be representative of the mass of cases in the population. Consequently, it is not apt to

provide reliable answers to questions.

5. THE PURPOSEFUL COLLECTION OF DATA

Many poor decisions are made, in both business and everyday activities, because

of the failure to understand and account for variability. Certainly, the purchasing

habits of one person may not represent those of the population, or the reaction

of one mouse, on exposure to a potentially toxic chemical compound, may not

represent that of a large population of mice. However, despite diversity among

the purchasing habits of individuals, we can obtain accurate information about

the purchasing habits of the population by collecting data on a large number of

persons. By the same token, much can be learned about the toxicity of a chemical if many mice are exposed.

Just making the decision to collect data to answer a question, to provide the

basis for taking action, or to improve a process is a key step. Once that decision

has been made, an important next step is to develop a statement of purpose that

is both specific and unambiguous. If the subject of the study is public transportation being behind schedule, you must carefully specify what is meant by

late. Is it 1 minute, 5 minutes, or more than 10 minutes behind scheduled times

that should result in calling a bus or commuter train late? Words like soft or uncomfortable in a statement are even harder to quantify. One common approach,

for a quality like comfort, is to ask passengers to rate the ride on public transportation on the five-point scale

| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Very uncomfortable | | Neutral | | Very comfortable |

where the numbers 1 through 5 are attached to the scale, with 1 for very uncomfortable and so on through 5 for very comfortable.

We might conclude that the ride is comfortable if the majority of persons in

the sample check either of the top two boxes.
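The rule just stated is easy to apply to a set of ratings. In the sketch below the fifteen ratings are invented for illustration, and we read "majority" as more than half of the sample.

```python
from collections import Counter

# Hypothetical ratings from 15 passengers on the five-point scale.
ratings = [4, 5, 3, 2, 4, 4, 5, 1, 3, 4, 5, 2, 4, 3, 5]

counts = Counter(ratings)
top_two = counts[4] + counts[5]           # passengers who checked 4 or 5
share_top_two = top_two / len(ratings)
ride_is_comfortable = share_top_two > 0.5  # assumed reading of "majority"

print(f"{top_two} of {len(ratings)} ratings are 4 or 5 ({share_top_two:.0%})")
print("Conclude the ride is comfortable:", ride_is_comfortable)
```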

Example 5

A Clear Statement of Purpose Concerning Water Quality

Each day, a city must sample the lake water in and around a swimming beach to

determine if the water is safe for swimming. During late summer, the primary

difficulty is algae growth and the safe limit has been set in terms of water clarity.

SOLUTION

The problem is already well defined so the statement of purpose is straightforward.

PURPOSE: Determine whether or not the water clarity at the beach is

below the safe limit.

The city has already decided to take measurements of clarity at three separated locations. In Chapter 8, we will learn how to decide if the water is safe

despite the variation in the three sample values.

The overall purpose can be quite general but a specific statement of purpose is

required at each step to guide the collection of data. For instance:

GENERAL PURPOSE: Design a data collection and monitoring program

at a completely automated plant that handles radioactive materials.

One issue is to ensure that the production plant will shut down quickly if materials start accumulating anywhere along the production line. More specifically,

the weight of materials could be measured at critical positions. A quick shutdown will be implemented if any of these exceed a safe limit. For this step, a

statement of purpose could be:

PURPOSE: Implement a fast shutdown if the weight at any critical position exceeds 1.2 kilograms.

The safe limit of 1.2 kilograms should be obtained from experts; preferably it would be a consensus of expert opinion.

There still remain statistical issues of how many critical positions to choose

and how often to measure the weight. These are followed with questions on

how to analyze data and specify a rule for implementing a fast shutdown.
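The simplest form of such a rule compares each measured weight with the safe limit. The sketch below is only that simplest form: the position names and readings are hypothetical, and a real rule would also have to settle the statistical issues just mentioned.

```python
SAFE_LIMIT_KG = 1.2  # safe limit from the statement of purpose

def needs_shutdown(weights_kg, limit_kg=SAFE_LIMIT_KG):
    """Return True if the weight at any critical position exceeds the limit."""
    return any(weight > limit_kg for weight in weights_kg.values())

# Hypothetical readings, in kilograms, at three critical positions.
readings = {"position_A": 0.4, "position_B": 1.25, "position_C": 0.9}

over_limit = [name for name, weight in readings.items() if weight > SAFE_LIMIT_KG]
if needs_shutdown(readings):
    print("Fast shutdown: limit exceeded at", ", ".join(over_limit))
else:
    print("All critical positions are within the safe limit")
```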

A clearly specified statement of purpose will guide the choice of what data to collect and help ensure that the data will be relevant to the purpose. Without a clearly specified purpose and unambiguously defined terms, much effort can be wasted in collecting data that will not answer the question of interest.

6. STATISTICS IN CONTEXT

A primary health facility became aware that sometimes it was taking too long to

return patients’ phone calls. That is, patients would phone in with requests for

information. These requests, in turn, had to be turned over to doctors or nurses

who would collect the information and return the call. The overall objective was

to understand the current procedure and then improve on it. As a good first

step, it was decided to find how long it was taking to return calls under the current procedure. Variation in times from call to call is expected, so the purpose of

the initial investigation is to benchmark the variability with the current procedure by collecting a sample of times.

PURPOSE: Obtain a reference or benchmark for the current procedure

by collecting a sample of times to return a patient’s call under the current

procedure.

For a sample of incoming calls collected during the week, the time received was

noted along with the request. When the return call was completed, the elapsed

time, in minutes, was recorded. Each of these times is represented as a dot in

Figure 1. Notice that over one-third of the calls took over 120 minutes, or over

two hours, to return. This could be a long time to wait for information if it concerns a child with a high fever or an adult with acute symptoms. If the purpose

was to determine what proportion of calls took too long to return, we would

need to agree on a more precise definition of “too long” in terms of number of

minutes. Instead, these data clearly indicate that the process needs improvement

and the next step is to proceed in that direction.

Figure 1 Time in minutes to return call. [Dot diagram; horizontal axis: Time (min), marked from 0 to 240 in steps of 40.]
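Once "too long" is given a precise definition in minutes, the proportion of slow calls is a simple count. The times below are hypothetical and are not the data plotted in Figure 1. The two cutoffs are likewise chosen only to show how the answer depends on the definition.

```python
def share_longer_than(times_min, cutoff_min):
    """Proportion of return times that exceed the cutoff, in minutes."""
    return sum(t > cutoff_min for t in times_min) / len(times_min)

# Hypothetical elapsed times, in minutes, for 12 returned calls.
return_times = [12, 25, 31, 48, 55, 70, 95, 110, 130, 150, 185, 230]

for cutoff in (60, 120):
    share = share_longer_than(return_times, cutoff)
    print(f"calls taking more than {cutoff} minutes: {share:.0%}")
```

With these made-up times, 7 of the 12 calls exceed 60 minutes but only 4 of the 12 exceed 120 minutes.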

In any context, to pursue potential improvements of a process, one needs to

focus more closely on particulars. Three questions

When Where Who

should always be asked before gathering further data. More specifically, data

should be sought that will answer the following questions.

When do the difficulties arise? Is it during certain hours, certain days of the

week or month, or in coincidence with some other activities?

Where do the difficulties arise? Try to identify the locations of bottlenecks

and unnecessary delays.

Who was performing the activity and who was supervising? The idea is not

to pin blame, but to understand the roles of participants with the goal of making improvements.

It is often helpful to construct a cause-and-effect diagram or fishbone diagram. The main centerline represents the problem or the effect. A somewhat

simplified fishbone chart is shown in Figure 2 for the where question regarding

the location of delays when returning patients’ phone calls. The main centerline

represents the problem: Where are delays occurring? Calls come to the reception desk, but when these lines are busy, the calls go directly to nurses on the

third or fourth floor. The main diagonal arms in Figure 2 represent the floors

and the smaller horizontal lines more specific locations on the floor where the

delay could occur. For instance, the horizontal line representing a delay in retrieving a patient’s medical record connects to the second floor diagonal line.

The resulting figure resembles the skeleton of a fish. Consideration of the diagram can help guide the choice of what new data to collect.

Fortunately, the quality team conducting this study had already given preliminary consideration to the When, Where, and Who questions and recorded not

only the time of day but also the day and person receiving the call. That is, their

current data gave them a start on determining if the time to return calls depends on when or where the call is received.

Although we go no further with this application here, the quality team next

developed more detailed diagrams to study the flow of paper between the time

the call is received and when it is returned. They then identified and removed bottlenecks in the flow of information, and the process was improved. In later chapters, you will learn how to compare and display data from two locations or old and new processes, but the key idea emphasized here is the purposeful collection of relevant data.

Figure 2 A cause-and-effect diagram for the location of delays. [Fishbone diagram. Centerline: WHERE ARE THE DELAYS? Diagonal arms: 1st Floor, 2nd Floor, 3rd Floor, 4th Floor. Locations shown: Lab, Receptionist, X-ray, Records.]

7. OBJECTIVES OF STATISTICS

The subject of statistics provides the methodology to make inferences about the

population from the collection and analysis of sample data. These methods enable one to derive plausible generalizations and then assess the extent of uncertainty underlying these generalizations. Statistical concepts are also essential

during the planning stage of an investigation when decisions must be made as to

the mode and extent of the sampling process.

The major objectives of statistics are:

1. To make inferences about a population from an analysis of information contained in sample data. This includes assessments of the extent

of uncertainty involved in these inferences.

2. To design the process and the extent of sampling so that the observations form a basis for drawing valid inferences.

The design of the sampling process is an important step. A good design for

the process of data collection permits efficient inferences to be made, often with

a straightforward analysis. Unfortunately, even the most sophisticated methods

of data analysis cannot, in themselves, salvage much information from data that

are produced by a poorly planned experiment or survey.

The early use of statistics in the compilation and passive presentation of

data has been largely superseded by the modern role of providing analytical

tools with which data can be efficiently gathered, understood, and interpreted.

Statistical concepts and methods make it possible to draw valid conclusions

about the population on the basis of a sample. Given its extended goal, the subject of statistics has penetrated all fields of human endeavor in which the evaluation of information must be grounded in data-based evidence.

The basic statistical concepts and methods described in this book form the

core in all areas of application. We present examples drawn from a wide range

of applications to help develop an appreciation of various statistical methods,

their potential uses, and their vulnerabilities to misuse.

USING STATISTICS WISELY

1. Compose a clear statement of purpose and use it to help decide upon which

variables to observe.

2. Carefully define the population of interest.

3. Whenever possible, select samples using a random device or random number table.

4. Do not unquestioningly accept conclusions based on self-selected samples.

5. Remember that conclusions reached in TV, magazine, or newspaper reports

might not be as obvious as reported. When reading or listening to reports,

you must be aware that the advocate, often a politician or advertiser, may

only be presenting statistics that emphasize positive features.

KEY IDEAS

Before gathering data on a characteristic of interest, identify a unit or sampling

unit. This is usually a person or object. The population of units is the complete

collection of units. In statistics we concentrate on the collection of values of the

characteristic, or record of a qualitative trait, evaluated for each unit in the population. We call this the statistical population or just the population.

A sample or sample data set from the population is the subset of measurements that are actually collected.

Statistics is a body of principles that helps to first design the process and extent of sampling and then guides the making of inferences about the population (inferential statistics). Descriptive statistics help summarize the sample.

Procedures for statistical inference allow us to make generalizations about the

population from the information in the sample.

A statement of purpose is a key step in designing the data collection process.

8. REVIEW EXERCISES

1.1 A newspaper headline reads,

U.S. TEENS TRUST, FEAR THEIR PEERS

and the article explains that a telephone poll was conducted of 1055 persons 13 to 17 years old. Identify a statistical population and the sample.

1.2 Consider the population of all students at your college. You want to learn about total monthly entertainment expenses for a student.

(a) Specify the population unit.

(b) Specify the variable of interest.

(c) Specify the statistical population.

1.3 Consider the population of persons living in Chicago. You want to learn about the proportion which are illegal aliens.

(a) Specify the population unit.

(b) Specify the variable of interest.

(c) Specify the statistical population.

1.4 A student is asked to estimate the mean height of all male students on campus. She decides to use the heights of members of the basketball team because they are conveniently printed in the game program.

(a) Identify the statistical population and the sample.

(b) Comment on the selection of the sample.

(c) How should a sample of males be selected?

1.5 Psychologists³ asked 46 golfers, after they played a round, to estimate the diameter of the hole on the green by visually selecting one of nine holes cut in a board.

(a) Specify the population unit.

(b) Specify the statistical population and sample.

1.6 A phone survey in 2008⁴ of 1010 adults included a response to the number of leisure hours per week. Identify the population unit, statistical population, and sample.

1.7 It is often easy to put off doing an unpleasant task. At a Web site,⁵ persons can take a test and receive a score that determines if they have a serious problem with procrastination. Should the scores from people who take this test on-line be considered a random sample? Explain your reasoning.

1.8 A magazine that features the latest electronics and computer software for homes enclosed a short questionnaire on a postcard. Readers were asked to answer questions concerning their use and ownership of various software and hardware products, and to then send the card to the publisher. A summary of the results appeared in a later issue of the magazine that used the data to make statements such as 40% of readers have purchased program X. Identify a population and sample and comment on the representativeness of the sample. Are readers who have not purchased any new products mentioned in the questionnaire as likely to respond as those who have purchased?

1.9 Each year a local weekly newspaper gives out “Best of the City” awards in categories such as restaurant, deli, pastry shop, and so on. Readers are asked to fill in their favorites on a form enclosed in this free weekly paper and then send it to the publisher. The establishment receiving the most votes is declared the winner in its category. Identify the population and sample and comment on the representativeness of the sample.

1.10 Which of the following are anecdotal and which are based on a sample?

(a) Out of 200 students questioned, 40 admitted they lied regularly.

(b) Bobbie says the produce at Market W is the freshest in the city.

(c) Out of 50 persons interviewed at a shopping mall, 18 had made a purchase that day.

1.11 Which of the following are anecdotal and which are based on a sample?

(a) Tom says he gets the best prices on electronics at the www.bestelc.com Internet site.

(b) Out of 22 students, 6 had multiple credit cards.

(c) Among 55 people checking in at the airport, 12 were going to destinations outside of the continental United States.

1.12 What is wrong with this statement of purpose?

PURPOSE: Determine if a newly designed rollerball pen is comfortable to hold when writing.

Give an improved statement of purpose.

1.13 What is wrong with this statement of purpose?

PURPOSE: Determine if it takes too long to get cash from the automated teller machine during the lunch hour.

Give an improved statement of purpose.

1.14 Give a statement of purpose for determining the amount of time it takes to make hotel reservations in San Francisco using the Internet.

1.15 Thirty-five classrooms on campus are equipped for multimedia instruction. Use Table 1, Appendix B, to select 4 of these classrooms to visit and check whether or not the instructor is using the equipment during that day’s first-hour lecture.

1.16 Fifty band members would like to ride the band bus to an out-of-town game. However, there is room for only 44. Use Table 1, Appendix B, to select the 44 persons who will go. Determine how to make your selection by taking only a few two-digit selections.

1.17 Eight young students need mentors. Of these, there are three whom you enjoy being with while you are indifferent about the others. Two of the students will be randomly assigned to you. Label the students you like by 0, 1, and 2 and the others by 3, 4, 5, 6, and 7. Then, the process of assigning two students at random is equivalent to choosing two different digits from the table of random digits and ignoring any 8 or 9. Repeat the experiment of assigning two students 20 times by using the table of random digits. Record the pairs of digits you draw for each experiment.

(a) What is the proportion of the 20 experiments that give two students that you like?

(b) What is the proportion of the 20 experiments that give one of the students you like and one other?

(c) What is the proportion of the 20 experiments that give none of the students you like?

1.18 According to the cause-and-effect diagram on page 17, where are the possible delays on the first floor?

1.19 Refer to the cause-and-effect diagram on page 17. The workers have now noticed that a delay could occur:

(i) On the fourth floor at the pharmacy

(ii) On the third floor at the practitioners’ station

Redraw the diagram and include this added information.

1.20 The United States Environmental Protection Agency⁶ reports that in 2006, each American generated 4.6 pounds of solid waste a day.

(a) Does this mean every single American produces the same amount of garbage? What do you think this statement means?

(b) Was the number 4.6 obtained from a sample? Explain.

(c) How would you select a sample?

1.21 As a very extreme case of self-selection, imagine a five-foot-high solid wood fence surrounding a collection of Great Danes and Miniature Poodles. You want to estimate the proportion of Great Danes inside and decide to collect your sample by observing the first seven dogs to jump high enough to be seen above the fence.

(a) Explain how this is a self-selected sample that is, of course, very misleading.

(b) How is this sample selection procedure like a call-in election poll?

³ J. Witt et al. “Putting to a bigger hole: Golf performance relates to perceived size,” Psychonomic Bulletin and Review 15(3) (2008), pp. 581–586.

⁴ Harris Interactive telephone survey (October 16–19, 2008).

⁵ http://psychologytoday.psychtests.com/tests/procrastination_access.html

⁶ http://www.epa.gov/epawaste/nonhaz/index.htm

Chapter 2: Organization and Description of Data

1. Introduction

2. Main Types of Data

3. Describing Data by Tables and Graphs

4. Measures of Center

5. Measures of Variation

6. Checking the Stability of the Observations over Time

7. More on Graphics

8. Statistics in Context

9. Review Exercises

Acid Rain Is Killing Our Lakes

© SuperStock, Inc.

Acid precipitation is linked to the disappearance of sport fish and other organisms from lakes. Sources of air pollution, including automobile emissions and

the burning of fossil fuels, add to the natural acidity of precipitation. The Wisconsin Department of Natural Resources initiated a precipitation monitoring

program with the goal of developing appropriate air pollution controls to reduce the problem. The acidity of the first 50 rains monitored, measured on a pH

scale from 1 (very acidic) to 7 (neutral), is summarized by the histogram.

Histogram of acid rain data. (Horizontal axis: pH, marked from 3.0 to 6.0 in steps of 0.5. Frequencies labeled on the bars: 25, 10, 9, 4, 2.)

Notice that all the rains are more acidic than normal rain, which has a pH of

5.6. (As a comparison, apples are about pH 3 and milk is about pH 6.)

Researchers in Canada have established that lake water with a pH below

5.6 may severely affect the reproduction of game fish. More research will undoubtedly improve our understanding of the acid rain problem and lead, it is

hoped, to an improved environment.

1. INTRODUCTION

In Chapter 1, we cited several examples of situations where the collection of

data by appropriate processes of experimentation or observation is essential to

acquire new knowledge. A data set may range in complexity from a few entries

to hundreds or even thousands of them. Each entry corresponds to the observation of a specified characteristic of a sampling unit. For example, a nutritionist

may provide an experimental diet to 30 undernourished children and record

their weight gains after two months. Here, children are the sampling units, and

the data set would consist of 30 measurements of weight gains. Once the data

are collected, a primary step is to organize the information and extract a descriptive summary that highlights its salient features. In this chapter, we learn

how to organize and describe a set of data by means of tables, graphs, and calculation of some numerical summary measures.

2. MAIN TYPES OF DATA

In discussing the methods for providing summary descriptions of data, it helps

to distinguish between the two basic types:

1. Qualitative or categorical data

2. Numerical or measurement data

When the characteristic under study concerns a qualitative trait that is only

classified in categories and not numerically measured, the resulting data are

called categorical data. Hair color (blond, brown, red, black), employment status (employed, unemployed), and blood type (O, A, B, AB) are but some examples. If, on the other hand, the characteristic is measured on a numerical scale,

the resulting data consist of a set of numbers and are called measurement data.

We will use the term numerical-valued variable or just variable to refer to a

characteristic that is measured on a numerical scale. The word “variable” signifies

that the measurements vary over different sampling units. In this terminology,

observations of a numerical-valued variable yield measurement data. A few examples of numerical-valued variables are the shoe size of an adult male, daily

number of traffic fatalities in a state, intensity of an earthquake, height of a 1-year-old pine seedling, the time in line at an automated teller, and the number

of offspring in an animal litter.

Although in all these examples the stated characteristic can be numerically measured, a close scrutiny reveals two distinct types of underlying scale

of measurement. Shoe sizes are numbers such as $6, 6\tfrac{1}{2}, 7, 7\tfrac{1}{2}, \ldots,$ which

proceed in steps of $\tfrac{1}{2}$. The count of traffic fatalities can only be an integer and

so is the number of offspring in an animal litter. These are examples of discrete variables. The name discrete draws from the fact that the scale is made

up of distinct numbers with gaps in between. On the other hand, some variables such as height, weight, and survival time can ideally take any value in an

interval. Since the measurement scale does not have gaps, such variables are

called continuous.

We must admit that a truly continuous scale of measurement is an idealization. Measurements actually recorded in a data set are always rounded either for

the sake of simplicity or because the measuring device has a limited accuracy.

Still, even though weights may be recorded to the nearest pound or time

recorded in whole hours, their actual values occur on a continuous scale so

the data are referred to as continuous. Counts are inherently discrete and

treated as such, provided that they take relatively few distinct values (e.g., the

number of children in a family or the number of traffic violations of a driver).

But when a count spans a wide range of values, it is often treated as a continuous variable. For example, the count of white blood cells, number of insects in a

colony, and number of shares of stock traded per day are strictly discrete, but for

practical purposes, they are viewed as continuous.

A summary description of categorical data is discussed in Section 3.1. The

remainder of this chapter is devoted to a descriptive study of measurement

data, both discrete and continuous. As in the case of summarization and commentary on a long, wordy document, it is difficult to prescribe concrete steps for

summary descriptions that work well for all types of measurement data. However, a few important aspects that deserve special attention are outlined here to

provide general guidelines for this process.

Describing a Data Set of Measurements

1. Summarization and description of the overall pattern.

(a) Presentation of tables and graphs.

(b) Noting important features of the graphed data including symmetry or departures from it.

(c) Scanning the graphed data to detect any observations that seem

to stick far out from the major mass of the data—the outliers.

2. Computation of numerical measures.

(a) A typical or representative value that indicates the center of the

data.

(b) The amount of spread or variation present in the data.

3. DESCRIBING DATA BY TABLES AND GRAPHS

3.1 CATEGORICAL DATA

When a qualitative trait is observed for a sample of units, each observation is

recorded as a member of one of several categories. Such data are readily organized in the form of a frequency table that shows the counts (frequencies) of

the individual categories. Our understanding of the data is further enhanced by

calculation of the proportion (also called relative frequency) of observations in

each category.

$$\text{Relative frequency of a category} = \frac{\text{Frequency in the category}}{\text{Total number of observations}}$$

Example 1

Calculating Relative Frequencies to Summarize an Opinion Poll

A campus press polled a sample of 280 undergraduate students in order to study student attitude toward a proposed change in the dormitory regulations. Each student was to respond as support, oppose, or neutral in regard to the issue. The numbers were 152 support, 77 neutral, and 51 opposed. Tabulate the results and calculate the relative frequencies for the three response categories.

SOLUTION

Table 1 records the frequencies in the second column, and the relative frequencies are calculated in the third column. The relative frequencies show that about 54% of the polled students supported the change, 18% opposed, and 28% were neutral.

TABLE 1 Summary Results of an Opinion Poll

| Responses | Frequency | Relative Frequency |
|-----------|-----------|--------------------|
| Support   | 152       | 152/280 = .543     |
| Neutral   | 77        | 77/280 = .275      |
| Oppose    | 51        | 51/280 = .182      |
| Total     | 280       | 1.000              |

Remark: The relative frequencies provide the most relevant information

as to the pattern of the data. One should also state the sample size, which

serves as an indicator of the credibility of the relative frequencies. (More on

this in Chapter 8.)
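The arithmetic behind Table 1 can be checked with a few lines of code. The following is a minimal Python sketch rather than part of the original example; the names `counts` and `relative` are illustrative choices, and the counts are those reported in Example 1.

```python
# Counts reported in Example 1 (280 students polled).
counts = {"Support": 152, "Neutral": 77, "Oppose": 51}

total = sum(counts.values())  # total number of observations

# Relative frequency = frequency in the category / total number of observations
relative = {response: n / total for response, n in counts.items()}

# Print a small table: response, frequency, relative frequency.
for response, n in counts.items():
    print(f"{response:<8}{n:>5}{relative[response]:>8.3f}")
print(f"{'Total':<8}{total:>5}{sum(relative.values()):>8.3f}")
```

Rounded to three decimals, the computed values agree with the third column of Table 1, and the relative frequencies sum to 1.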

Categorical data are often presented graphically as a pie chart in which the

segments of a circle exhibit the relative frequencies of the categories. To obtain

the angle for any category, we multiply the relative frequency by 360 degrees,

which corresponds to the complete circle. Although laying out the angles by

hand can be tedious, many software packages generate the chart with a single

command. Figure 1 presents a pie chart for the data in Example 1.

Figure 1 Pie chart of student opinion on change in dormitory regulations. (Segments: Support 54%, Neutral 28%, Oppose 18%.)
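As an illustration of the angle rule, the sketch below multiplies each relative frequency from Example 1 by 360 degrees. It is a hedged example only: the variable names are hypothetical, and no particular plotting package is assumed.

```python
# Counts from Example 1; each pie segment gets an angle proportional
# to the relative frequency of its category.
counts = {"Support": 152, "Neutral": 77, "Oppose": 51}
total = sum(counts.values())

# Angle in degrees = relative frequency * 360
angles = {response: 360 * n / total for response, n in counts.items()}

for response, angle in angles.items():
    print(f"{response:<8}{angle:>7.1f} degrees")
```

The three angles add to 360 degrees, the complete circle.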

When questions arise that need answering but the decision makers lack precise knowledge of the state of nature or the full ramifications of their decisions,

the best procedure is often to collect more data. In the context of quality improvement, if a problem is recognized, the first step is to collect data on the

magnitude and possible causes. This information is most effectively communicated through graphical presentations.

A Pareto diagram is a powerful graphical technique for displaying events

according to their frequency. According to Pareto’s empirical law, any collection

of events consists of only a few that are major in that they are the ones that occur most of the time.

Figure 2 gives a Pareto diagram for the type of defects found in a day’s production of facial tissues. The cumulative frequency is 22 for the first cause and $22 + 15 = 37$ for the first and second causes combined. This illustrates Pareto’s rule, with two of the causes being responsible for 37 out of 50, or 74%, of the defects.

Figure 2 Pareto diagram of facial tissue defects. (Vertical axis: Frequency. Categories: Tears, Holes, Folds, Other.)

Example 2

A Pareto Diagram Clarifies Circumstances Needing Improvement

Graduate students in a counseling course were asked to choose one of their personal habits that needed improvement. In order to reduce the effect of this habit, they were asked to first gather data on the frequency of the occurrence and the circumstances. One student collected the following frequency data on fingernail biting over a two-week period.

| Frequency | Activity            |
|-----------|---------------------|
| 58        | Watching television |
| 21        | Reading newspaper   |
| 14        | Talking on phone    |
| 7         | Driving a car       |
| 3         | Grocery shopping    |
| 12        | Other               |

Make a Pareto diagram showing the relationship between nail biting and type of activity.

SOLUTION

The cumulative frequencies are 58, $58 + 21 = 79$, and so on, out of 115. The Pareto diagram is shown in Figure 3, where watching TV accounts for 50.4% of the instances.

Figure 3 Pareto diagram for nail biting example. (Vertical axis: Frequency, 0 to 60. Categories in order: TV, Paper, Phone, Driving, Shopping, Other.)

The next step for this person would be to try and find a substitute for

nail biting while watching television.
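The ordering and cumulative frequencies behind a Pareto diagram such as Figure 3 can be sketched in code. This is an illustrative Python sketch under one assumption taken from the figure: the catch-all "Other" category is placed last even though its count exceeds some named categories. The variable names are hypothetical.

```python
from itertools import accumulate

# Nail-biting counts from Example 2 (two-week period).
counts = {
    "Watching television": 58,
    "Reading newspaper": 21,
    "Talking on phone": 14,
    "Driving a car": 7,
    "Grocery shopping": 3,
    "Other": 12,
}

# Order the named categories from most to least frequent and,
# following Figure 3, place the catch-all "Other" category last.
named = sorted(
    (item for item in counts.items() if item[0] != "Other"),
    key=lambda item: item[1],
    reverse=True,
)
ordered = named + [("Other", counts["Other"])]

total = sum(counts.values())
cumulative = list(accumulate(n for _, n in ordered))

# Print activity, frequency, cumulative frequency, cumulative percent.
for (activity, n), running in zip(ordered, cumulative):
    print(f"{activity:<20}{n:>4}{running:>5}{100 * running / total:>7.1f}%")
```

The first two cumulative frequencies are 58 and 79, as in the solution, and the final one equals the total of 115.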

3.2 DISCRETE DATA

We next consider summary descriptions of measurement data and begin our discussion with discrete measurement scales. As explained in Section 2, a data set

is identified as discrete when the underlying scale is discrete and the distinct values observed are not too numerous.

Similar to our description of categorical data, the information in a discrete

data set can be summarized in a frequency table, or frequency distribution

that includes a calculation of the relative frequencies. In place of the qualitative

categories, we now list the distinct numerical measurements that appear in the

data set and then count their frequencies.

Example 3

Creating a Frequency Distribution

Retail stores experience their heaviest returns on December 26 and December 27 each year. Most are gifts that, for some reason, did not please the recipient. The numbers of items returned by a sample of 30 persons at a large discount department store are observed, and the data of Table 2 are obtained. Determine the frequency distribution.

TABLE 2 Number of Items Returned

    1  4  3  2  3  4  5  1  2  1
    2  5  1  4  2  1  3  2  4  1
    2  3  2  3  2  1  4  3  2  5

SOLUTION

The frequency distribution of these data is presented in Table 3. The values are paired with the frequency and the calculated relative frequency.

TABLE 3 Frequency Distribution for Number (x) of Items Returned

| Value x | Frequency | Relative Frequency |
|---------|-----------|--------------------|
| 1       | 7         | .233               |
| 2       | 9         | .300               |
| 3       | 6         | .200               |
| 4       | 5         | .167               |
| 5       | 3         | .100               |
| Total   | 30        | 1.000              |
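The counting that produces Table 3 can be reproduced with a short Python sketch. It is offered as an illustration, not as the authors' procedure. The list `returns` holds the 30 values of Table 2; their order in the list is an assumption about the table layout and does not affect the counts.

```python
from collections import Counter

# The 30 observations of Table 2 (number of items returned per person).
returns = [
    1, 4, 3, 2, 3, 4, 5, 1, 2, 1,
    2, 5, 1, 4, 2, 1, 3, 2, 4, 1,
    2, 3, 2, 3, 2, 1, 4, 3, 2, 5,
]

n = len(returns)
freq = Counter(returns)  # maps each distinct value to its frequency

# One row per distinct value: (value, frequency, relative frequency).
table = [(x, freq[x], freq[x] / n) for x in sorted(freq)]

for x, f, rel in table:
    print(f"{x:>5}{f:>6}{rel:>8.3f}")
print(f"{'Total':>5}{n:>6}{sum(rel for _, _, rel in table):>8.3f}")
```

`Counter` tallies each distinct value in one pass, which mirrors listing the distinct measurements and counting their frequencies by hand.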

The frequency distribution of a discrete variable can be presented pictorially by drawing either lines or rectangles to represent the relative frequencies.

First, the distinct values of the variable are located on the horizontal axis. For a

line diagram, we draw a vertical line at each value and make the height of the

line equal to the relative frequency. A histogram employs vertical rectangles

instead of lines. These rectangles are centered at the values and their areas represent relative frequencies. Typically, the values proceed in equal steps so the

rectangles are all of the same width and their heights are proportional to the relative frequencies as well as frequencies. Figure 4(a) shows the line diagram and

4(b) the histogram of the frequency distribution of Table 3.

Figure 4 Graphic display of the frequency distribution of data in Table 3: (a) line diagram, (b) histogram. (Vertical axes: Relative frequency, 0 to 0.3. Horizontal axes: x = 1, 2, 3, 4, 5.)

3.3 DATA ON A CONTINUOUS VARIABLE

We now consider tabular and graphical presentations of data sets that contain

numerical measurements on a virtually continuous scale. Of course, the

recorded measurements are always rounded. In contrast with the discrete case, a

data set of measurements on a continuous variable may contain many distinct

values. Then, a table or plot of all distinct values and their frequencies will not

provide a condensed or informative summary of the data.

The two main graphical methods used to display a data set of measurements are the dot diagram and the histogram. Dot diagrams are employed

when there are relatively few observations (say, fewer than 20 or 25); histograms

are used with a larger number of observations.

Dot Diagram

When the data consist of a small set of numbers, they can be graphically represented by drawing a line with a scale covering the range of values of the measurements. Individual measurements are plotted above this line as prominent

dots. The resulting diagram is called a dot diagram.

Example 4

A Dot Diagram Reveals an Unusual Observation

The numbers of days that the first six heart transplant patients at Stanford survived after their operations were 15, 3, 46, 623, 126, 64. Make a dot diagram.

SOLUTION

These survival times extended from 3 to 623 days. Drawing a line segment

from 0 to 700, we can plot the data as shown in Figure 5. This dot diagram

shows a cluster of small survival times and a single, rather large value.

Figure 5 Dot diagram for the heart transplant data. (Horizontal axis: Survival time (days), 0 to 700.)

Frequency Distribution on Intervals

When the data consist of a large number of measurements, a dot diagram may

be quite tedious to construct. More seriously, overcrowding of the dots will

cause them to smear and mar the clarity of the diagram. In such cases, it is convenient to condense the data by grouping the observations according to intervals

and recording the frequencies of the intervals. Unlike a discrete frequency distribution, where grouping naturally takes place on points, here we use intervals of

values. The main steps in this process are outlined as follows.

Constructing a Frequency Distribution

for a Continuous Variable

1. Find the minimum and the maximum values in the data set.

2. Choose intervals or cells of equal length that cover the range between

the minimum and the maximum without overlapping. These are

called class intervals, and their endpoints class boundaries.

3. Count the number of observations in the data that belong to each

class interval. The count in each class is the class frequency or cell frequency.

4. Calculate the relative frequency of each class by dividing the class frequency by the total number of observations in the data:

$$\text{Relative frequency} = \frac{\text{Class frequency}}{\text{Total number of observations}}$$

Paying Attention

© Britt Erlanson/The Image Bank/Getty Images

Paying attention in class. Observations on 24 first-grade students.

Figure 6 Time not concentrating on the mathematics assignment (out of 20 minutes). (Dot diagram; horizontal axis: Minutes, 0 to 13.)

First-grade teachers allot a portion of each day to mathematics. An educator, concerned about how students utilize this time, selected 24 students and observed them for a total of 20 minutes spread over several days. The number of minutes, out of 20, that the student was not on task was recorded (courtesy of T. Romberg). These lack-of-attention times are graphically portrayed in the dot diagram in Figure 6. The student with 13 out of 20 minutes off-task stands out enough to merit further consideration. Is this a student who finds the subject too difficult or might it be a very bright child who is bored?

The choice of the number and position of the class intervals is primarily a matter of judgment guided by the following considerations. The number of classes usually ranges from 5 to 15, depending on the number of observations in the data. Grouping the observations sacrifices information concerning how the observations are distributed within each cell. With too few cells, the loss of information is serious. On the other hand, if one chooses too many cells and the

data set is relatively small, the frequencies from one cell to the next would

jump up and down in a chaotic manner and no overall pattern would emerge.

As an initial step, frequencies may be determined with a large number of intervals that can later be combined as desired in order to obtain a smooth pattern of

the distribution.

Computers conveniently order data from smallest to largest so that the observations in any cell can easily be counted. The construction of a frequency distribution is illustrated in Example 5.

Example 5

Creating a Frequency Distribution for Hours of Sleep

Students require different amounts of sleep. A sample of 59 students at a large midwest university reported the following hours of sleep the previous night.

TABLE 4 Hours of Sleep for Fifty-nine Students

    4.5  4.7  5.0  5.0  5.3  5.5  5.5  5.7  5.7  5.7
    6.0  6.0  6.0  6.0  6.3  6.3  6.3  6.5  6.5  6.5
    6.7  6.7  6.7  6.7  7.0  7.0  7.0  7.0  7.3  7.3
    7.3  7.3  7.5  7.5  7.5  7.5  7.7  7.7  7.7  7.7
    8.0  8.0  8.0  8.0  8.3  8.3  8.3  8.5  8.5  8.5
    8.5  8.7  8.7  9.0  9.0  9.0  9.3  9.3  10.0

Construct a frequency distribution of the sleep data.

SOLUTION

To construct a frequency distribution, we first notice that the minimum

hours of sleep is 4.5 and the maximum is 10.0. We choose class intervals of

length 1.2 hours as a matter of convenience.

The selection of class boundaries is a bit of fussy work. Because the data

have one decimal place, we could add a second decimal to avoid the possibility of any observation falling exactly on the boundary. For example, we could

end the first class interval at 5.45. Alternatively, and more neatly, we could

write 4.3–5.5 and make the endpoint convention that the left-hand end

point is included but not the right.

The first interval contains 5 observations so its frequency is 5 and its relative

frequency is $5/59 = .085$. Table 5 gives the frequency distribution. The

relative frequencies add to 1, as they should (up to rounding error) for any

frequency distribution. We see, for instance, that just about one-third of the

students .271 + .051 = .322 got 7.9 hours or more of sleep.

Remark: The rule requiring equal class intervals is inconvenient when

the data are spread over a wide range but are highly concentrated in a

small part of the range with relatively few numbers elsewhere. Using

smaller intervals where the data are highly concentrated and larger intervals where the data are sparse helps to reduce the loss of information due

to grouping.

TABLE 5 Frequency Distribution for Hours of Sleep Data (left endpoints included but right endpoints excluded)

| Class Interval | Frequency | Relative Frequency |
|---|---|---|
| 4.3–5.5 | 5 | 5/59 = .085 |
| 5.5–6.7 | 15 | 15/59 = .254 |
| 6.7–7.9 | 20 | 20/59 = .339 |
| 7.9–9.1 | 16 | 16/59 = .271 |
| 9.1–10.3 | 3 | 3/59 = .051 |
| Total | 59 | 1.000 |

In every application involving an endpoint convention, it is important that you

clearly state which endpoint is included and which is excluded. This information

should be presented in the title or in a footnote of any frequency distribution.

Histogram

A frequency distribution can be graphically presented as a histogram. To draw a

histogram, we first mark the class intervals on the horizontal axis. On each interval,

we then draw a vertical rectangle whose area represents the relative frequency—

that is, the proportion of the observations occurring in that class interval.

To create rectangles whose area is equal to relative frequency, use the rule

$$\text{Height} = \frac{\text{Relative frequency}}{\text{Width of interval}}$$

The total area of all rectangles equals 1, the sum of the relative frequencies.

The total area of a histogram is 1.

The histogram for Table 5 is shown in Figure 7. For example, the rectangle drawn on the class interval 4.3–5.5 has area .071 × 1.2 = .085, which is the relative frequency of this class. Actually, we determined the height .071 as

$$\text{Height} = \frac{\text{Relative frequency}}{\text{Width of interval}} = \frac{.085}{1.2} = .071$$

The units on the vertical axis can be viewed as relative frequencies per unit

of the horizontal scale. For instance, .071 is the relative frequency per hour for

the interval 4.3– 5.5.
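As a check on the height rule, the following Python sketch computes the height of every rectangle from the frequencies of Table 5 and confirms that the areas add to 1. The function name `histogram_heights` is illustrative only and does not come from the text.

```python
# Class boundaries and frequencies taken from Table 5.
boundaries = [4.3, 5.5, 6.7, 7.9, 9.1, 10.3]
frequencies = [5, 15, 20, 16, 3]


def histogram_heights(boundaries, frequencies):
    """Return height = relative frequency / width for each class interval."""
    n = sum(frequencies)
    heights = []
    for left, right, f in zip(boundaries, boundaries[1:], frequencies):
        heights.append((f / n) / (right - left))
    return heights


heights = histogram_heights(boundaries, frequencies)

# Area of each rectangle is height * width; the areas should sum to 1.
widths = [right - left for left, right in zip(boundaries, boundaries[1:])]
total_area = sum(h * w for h, w in zip(heights, widths))

for left, right, h in zip(boundaries, boundaries[1:], heights):
    print(f"{left:4.1f}-{right:4.1f}  height {h:.3f}")
print(f"total area {total_area:.3f}")
```

The first height rounds to .071, the value used in the text, and each height is read as relative frequency per hour.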

[Histogram: horizontal axis "Hours sleep" with class boundaries 4.3, 5.5, 6.7, 7.9, 9.1, 10.3; vertical axis "Relative frequency per hour"; rectangles labeled .085, .254, .339, .271, .051.]

Figure 7 Histogram of the sleep data of Tables 4 and 5. Sample size 59.

Visually, we note that the rectangle having the largest area, or most frequent class interval, is 6.7–7.9. Also, a proportion .085 + .254 = .339 of the students slept less than 6.7 hours.

Remark: When all class intervals have equal widths, the heights of the rectangles are proportional to the relative frequencies that the areas represent. The

formal calculation of height, as area divided by the width, is then redundant. Instead, one can mark the vertical scale according to the relative frequencies—

that is, make the heights of the rectangles equal to the relative frequencies. The

resulting picture also makes the areas represent the relative frequencies if we

read the vertical scale as if it is in units of the class interval. This leeway when

plotting the histogram is not permitted in the case of unequal class intervals.
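To see why the shortcut fails for unequal class intervals, consider a hypothetical regrouping of the sleep data in which the first two classes of Table 5 are merged into a single class 4.3–6.7. This regrouping is an assumption made here for illustration and does not appear in the text. The merged class and the class 6.7–7.9 then have the same relative frequency, yet their rectangles must have different heights.

```python
# Hypothetical regrouping: classes 4.3-5.5 and 5.5-6.7 of Table 5 merged.
boundaries = [4.3, 6.7, 7.9, 9.1, 10.3]
frequencies = [5 + 15, 20, 16, 3]

n = sum(frequencies)
rel_freq = [f / n for f in frequencies]
widths = [right - left for left, right in zip(boundaries, boundaries[1:])]

# Height must be area / width so that area still equals relative frequency.
heights = [r / w for r, w in zip(rel_freq, widths)]

for left, right, r, h in zip(boundaries, boundaries[1:], rel_freq, heights):
    print(f"{left:4.1f}-{right:4.1f}  rel. freq. {r:.3f}  height {h:.3f}")
```

The first two classes each hold 20 of the 59 students, but the wider class is drawn half as tall. Plotting the relative frequencies themselves as heights would make the wide class look twice as large as it should.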

Figure 8 shows one ingenious way of displaying two histograms for comparison.

In spite of their complicated shapes, their back-to-back plot as a “tree” allows for

easy visual comparison. Females are the clear majority in the last age groups of

the male and female age distributions.

Stem-and-Leaf Display

A stem-and-leaf display provides a more efficient variant of the histogram for

displaying data, especially when the observations are two-digit numbers. This

plot is obtained by sorting the observations into rows according to their leading

digit. The stem-and-leaf display for the data of Table 6 is shown in Table 7. To

make this display:

1. List the digits 0 through 9 in a column and draw a vertical line. These

correspond to the leading digit.

2. For each observation, record its second digit to the right of this vertical

line in the row where the first digit appears.

3. Finally, arrange the second digits in each row so they are in increasing order.
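The three steps above can be sketched in Python for the examination scores of Table 6. The function name `stem_and_leaf` is illustrative, and the sketch assumes two-digit whole-number scores so that `divmod(score, 10)` splits each one into its leading digit and second digit.

```python
# Examination scores of 50 students (Table 6).
SCORES = [
    75, 86, 68, 49, 93, 84, 98, 78, 57, 92,
    85, 64, 42, 37, 95, 83, 70, 73, 75, 99,
    55, 71, 62, 48, 84, 66, 79, 78, 80, 72,
    87, 90, 88, 53, 74, 65, 79, 76, 81, 69,
    59, 80, 60, 77, 90, 63, 89, 77, 58, 62,
]


def stem_and_leaf(scores):
    """Map each stem 0-9 to a string of its leaves in increasing order."""
    rows = {stem: [] for stem in range(10)}      # step 1: stems 0 through 9
    for score in scores:
        stem, leaf = divmod(score, 10)           # step 2: record second digit
        rows[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in sorted(leaves))  # step 3: sort
            for stem, leaves in rows.items()}


display = stem_and_leaf(SCORES)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```

Each printed row shows a stem, the vertical line, and the ordered leaves, so the row for stem 4 reads `4 | 289`.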

[Back-to-back histograms: Male (N = 148.7 million) and Female (N = 153.0 million); age axis from 0 to "100 and over" in steps of 10.]

Figure 8 Population tree (histograms) of the male and female age distributions in the United States in 2007. (Source: U.S. Bureau of the Census.)

TABLE 6 Examination Scores of 50 Students

75  86  68  49  93  84  98  78  57  92
85  64  42  37  95  83  70  73  75  99
55  71  62  48  84  66  79  78  80  72
87  90  88  53  74  65  79  76  81  69
59  80  60  77  90  63  89  77  58  62

TABLE 7 Stem-and-Leaf Display for the Examination Scores

| Stem | Leaves |
|---|---|
| 0 | |
| 1 | |
| 2 | |
| 3 | 7 |
| 4 | 289 |
| 5 | 35789 |
| 6 | 022345689 |
| 7 | 01234556778899 |
| 8 | 00134456789 |
| 9 | 0023589 |


In the stem-and-leaf display, the column of first digits to the left of the vertical line is viewed as the stem, and the second digits as the leaves. Viewed sidewise, it looks like a histogram with a cell width equal to 10. However, it is more

informative than a histogram because the actual data points are retained. In fact,

every observation can be recovered exactly from this stem-and-leaf display.

A stem-and-leaf display retains all the information in the leading digits of the data. When the leaf unit = .01, the row 3.5 | 0 2 3 7 8 presents the data 3.50, 3.52, 3.53, 3.57, and 3.58. Leaves may also be two-digit at times. When the first leaf digit = .01, the row .4 | 07 13 82 90 presents the data 0.407, 0.413, 0.482, and 0.490.
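Reading a row back into data amounts to attaching each leaf to the end of its stem. The Python sketch below does this for the two rows quoted above. The helper `decode_row` is hypothetical, and it assumes the stem is written with its decimal point (as in `3.5` and `.4`), so that plain concatenation places the leaf digits in the right positions.

```python
from decimal import Decimal


def decode_row(stem, leaves):
    """Recover the observations in one row by appending each leaf to the stem.

    `stem` and every entry of `leaves` are digit strings; Decimal keeps
    trailing zeros so that 3.50 is not shortened to 3.5.
    """
    return [Decimal(stem + leaf) for leaf in leaves]


row_one = decode_row("3.5", ["0", "2", "3", "7", "8"])    # leaf unit .01
row_two = decode_row(".4", ["07", "13", "82", "90"])      # two-digit leaves

print([str(x) for x in row_one])
print([str(x) for x in row_two])
```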

Further variants of the stem-and-leaf display are described in Exercises 2.25

and 2.26. This versatile display is one of the most applicable techniques of exploratory data analysis.

When…