DAV Public School Statistics Limits Question

Show your work to receive credit, wherever there is work to show.

13.1-13.2: 13.1, 13.2, 13.3, 13.5, 13.7 (page 521)

P370: 9.38, 9.39, 9.40

Correction: In 9.40(a), 31.33 should be 31.53.

Statistics
Principles and Methods
SIXTH EDITION
Richard A. Johnson
University of Wisconsin at Madison
Gouri K. Bhattacharyya
John Wiley & Sons, Inc.
Vice President & Executive Publisher    Laurie Rosatone
Project Editor                          Ellen Keohane
Senior Development Editor               Anne Scanlan-Rohrer
Production Manager                      Dorothy Sinclair
Senior Production Editor                Valerie A. Vargas
Marketing Manager                       Sarah Davis
Creative Director                       Harry Nolan
Design Director                         Jeof Vita
Production Management Services          mb editorial services
Photo Editor                            Sheena Goldstein
Editorial Assistant                     Beth Pearson
Media Editor                            Melissa Edwards
Cover Photo Credit                      Gallo Images-Hein von Horsten/Getty Images, Inc.
Cover Designer                          Celia Wiley
This book was set in 10/12 Berling by Laserwords Private Limited, India and printed and bound by
RR Donnelley-Crawsfordville. The cover was printed by RR Donnelley-Crawsfordville.
Copyright © 2010, 2006 John Wiley & Sons, Inc. All rights reserved. No part of this publication
may be reproduced, stored in a retrieval system or transmitted in any form or by any means,
electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted
under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy
fee to the Copyright Clearance Center, Inc. 222 Rosewood Drive, Danvers, MA 01923, website
www.copyright.com. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774,
(201)748-6011, fax (201)748-6008, website http://www.wiley.com/go/permissions.
Evaluation copies are provided to qualified academics and professionals for review purposes only,
for use in their courses during the next academic year. These copies are licensed and may not
be sold or transferred to a third party. Upon completion of the review period, please return the
evaluation copy to Wiley. Return instructions and a free of charge return shipping label are
available at www.wiley.com/go/returnlabel. Outside of the United States, please contact your
local representative.
ISBN-13 978-0-470-40927-5
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Preface
THE NATURE OF THE BOOK
Conclusions, decisions, and actions that are data driven predominate in today’s
world. Statistics — the subject of data analysis and data-based reasoning — is necessarily playing a vital role in virtually all professions. Some familiarity with this subject is now an essential component of any college education. Yet, pressures to accommodate a growing list of academic requirements often necessitate that this
exposure be brief. Keeping these conditions in mind, we have written this book to
provide students with a first exposure to the powerful ideas of modern statistics. It
presents the key statistical concepts and the most commonly applied methods of
statistical analysis. Moreover, to keep it accessible to freshmen and sophomores
from a wide range of disciplines, we have avoided mathematical derivations. They
usually pose a stumbling block to learning the essentials in a short period of time.
This book is intended for students who do not have a strong background in
mathematics but seek to learn the basic ideas of statistics and their application
in a variety of practical settings. The core material of this book is common to almost all first courses in statistics and is designed to be covered well within a
one-semester course in introductory statistics for freshmen through seniors. It is supplemented with some additional special-topics chapters.
ORIENTATION
The topics treated in this text are, by and large, the ones typically covered in an
introductory statistics course. They span three major areas: (i) descriptive statistics, which deals with summarization and description of data; (ii) ideas of probability and an understanding of the manner in which sample-to-sample variation
influences our conclusions; and (iii) a collection of statistical methods for analyzing the types of data that are of common occurrence. However, it is the treatment
of these topics that makes the text distinctive. Throughout, we have endeavored
to give clear and concise explanations of the concepts and important statistical
terminology and methods. By means of good motivation, sound explanations, and
an abundance of illustrations given in a real-world context, it emphasizes more
than just a superficial understanding.
Each statistical concept or method is motivated by setting out its goal and then
focusing on an example to further elaborate important aspects and to illustrate its
application. The subsequent discussion is not limited to showing how a
method works but also explains why it works. Even without recourse to
mathematics, we are able to make the reader aware of possible pitfalls in the statistical analysis. Students can gain a proper appreciation of statistics only when they
are provided with a careful explanation of the underlying logic. Without this understanding, a learning of elementary statistics is bound to be rote and transient.
When describing the various methods of statistical analysis, we continually
remind the reader that the validity of a statistical inference is contingent
upon certain model assumptions. Misleading conclusions may result when these
assumptions are violated. We feel that the teaching of statistics, even at an introductory level, should not be limited to the prescription of methods. Students
should be encouraged to develop a critical attitude in applying the methods and
to be cautious when interpreting the results. This attitude is especially important in the study of relationships among variables, which is perhaps the most
widely used (and also abused) area of statistics. In addition to discussing inference procedures in this context, we have particularly stressed critical examination of the model assumptions and careful interpretation of the conclusions.
SPECIAL FEATURES
1. Crucial elements are boxed to highlight important concepts and methods. These boxes provide an ongoing summary of the important items
essential for learning statistics. At the end of each chapter, all of its key
ideas and formulas are summarized.
2. A rich collection of examples and exercises is included. These are
drawn from a large variety of real-life settings. In fact, many data sets
stem from genuine experiments, surveys, or reports.
3. Exercises are provided at the end of each major section. These provide the
reader with the opportunity to practice the ideas just learned. Occasionally, they supplement some points raised in the text. A larger collection of
exercises appears at the end of a chapter. The starred problems are relatively difficult and suited to the more mathematically competent student.
4. Using Statistics Wisely, a feature at the end of each chapter, provides
important guidelines for the appropriate use of the statistical procedures presented in the chapter.
5. Statistics in Context sections, in four of the beginning chapters, each
describe an important statistical application where a statistical approach
to understanding variation is vital. These extended examples reveal, early
on in the course, the value of understanding the subject of statistics.
6. P-values are emphasized in examples concerning tests of hypotheses.
Graphs giving the relevant normal or t density curve, rejection region,
and P-value are presented.
7. Regression analysis is a primary statistical technique so we provide a
more thorough coverage of the topic than is usual at this level. The basics of regression are introduced in Chapter 11, whereas Chapter 12
stretches the discussion to several issues of practical importance. These
include methods of model checking, handling nonlinear relations, and
multiple regression analysis. Complex formulas and calculations are judiciously replaced by computer output so the main ideas can be learned
and appreciated with a minimum of stress.
8. Integrated Technology, at the end of most chapters, details the steps for using MINITAB, EXCEL (footnote 1), and the TI-84 calculator. With this presentation available, only computer output is needed in the text, with few exceptions.
Software packages remove much of the drudgery of hand calculation
and they allow students to work with larger data sets where patterns are
more pronounced. Some computer exercises are included in all chapters where relevant.
9. Convenient Electronic Data Bank at the end of the book contains a substantial collection of data. These data sets, together with numerous others throughout the book, allow for considerable flexibility in the choice
between concept-orientated and applications-orientated exercises. The
Data Bank and the other larger data sets are available for download on
the accompanying Web site located at www.wiley.com/college/johnson.
10. Technical Appendix A presents a few statistical facts of a mathematical
nature. These are separated from the main text so that they can be left
out if the instructor so desires.
ABOUT THE SIXTH EDITION
The sixth edition of STATISTICS: Principles and Methods maintains the objectives and level of presentation of the earlier editions. The goals are to develop (i) an understanding of the reasoning by which findings from sample
data can be extended to general conclusions and (ii) a familiarity with some
basic statistical methods. There are numerous data sets and computer outputs
which give an appreciation of the role of the computer in modern data analysis.
Clear and concise explanations introduce the concepts and important statistical terminology and methods. Real-life settings are used to motivate the
statistical ideas and well organized discussions proceed to cover statistical
methods with heavy emphasis on examples. The sixth edition enhances these
special features. The major improvements are:
Bayes’ Theorem. A new section is added to Chapter 4 to highlight the reasoning underlying Bayes’ theorem and to present applications.
Approximate t. A new subsection is added to Chapter 7, which describes
the approximate two sample t statistic that is now pervasive in statistical software programs. For normal distributions, with unequal variances, this has become the preferred approach.
1. Commands and the worksheets with data sets pertain to EXCEL 2003.
New Examples. A substantial number of new examples are included, especially in the core chapters, Chapter 11 on regression, and Chapter 13 on contingency tables.
More Data-Based Exercises. Most of the new exercises are keyed to new
data-based examples in the text. New data are also presented in the exercises.
Other new exercises are based on the credit card use and opinion data that are
added to the data bank.
New Exercises. Numerous new exercises provide practice on understanding
the concepts and others address computations. These new exercises, which augment the already rich collection, are placed in real-life settings to help promote
a greater appreciation of the wide span of applicability of statistical methods.
ORGANIZATION
This book is organized into fifteen chapters, an optional technical appendix
(Appendix A), and a collection of tables (Appendix B). Although designed for a
one-semester or a two-quarter course, it is enriched with ample additional material to allow the instructor some choices of topics. Beyond Chapter 1, which sets
the theme of statistics and distinguishes population and sample, the subject
matter could be classified as follows:
Topic                                               Chapter
Descriptive study of data                           2, 3
Probability and distributions                       4, 5, 6
Sampling variability                                7
Core ideas and methods of statistical inference     8, 9, 10
Special topics of statistical inference             11, 12, 13, 14, 15
We regard Chapters 1 to 10 as constituting the core material of an introductory statistics course, with the exception of the starred sections in Chapter 6. Although this material is just about enough for a one-semester course, many
instructors may wish to eliminate some sections in order to cover the basics of regression analysis in Chapter 11. This is most conveniently done by initially skipping
Chapter 3 and then taking up only those portions that are linked to Chapter 11.
Also, instead of a thorough coverage of probability that is provided in Chapter 4,
the later sections of that chapter may receive a lighter coverage.
SUPPLEMENTS
Instructor’s Solution Manual. (ISBN 978-0-470-53519-6) This manual contains complete solutions to all exercises.
Test Bank. (Available on the accompanying website: www.wiley.com/college/johnson)
Contains a large number of additional questions for each chapter.
Student Solutions Manual. (ISBN 978-0-470-53521-9) This manual contains complete solutions to all odd-numbered exercises.
Electronic Data Bank. (Available on the accompanying website: www.wiley.com/college/johnson)
Contains interesting data sets that are used in the text and can also be used
to perform additional analyses with statistical software packages.
WileyPLUS. This powerful online tool provides a completely integrated suite
of teaching and learning resources in one easy-to-use website. WileyPLUS offers
an online assessment system with full gradebook capabilities and algorithmically
generated skill building questions. This online teaching and learning environment
also integrates the entire digital textbook. To view a demo of WileyPLUS, contact
your local Wiley Sales Representative or visit: www.wiley.com/college/wileyplus.
ACKNOWLEDGMENTS
We thank Minitab (State College, Pa.) and the SAS Institute (Cary, N.C.) for permission to include commands and output from their software packages. A special
thanks to K. T. Wu and Kam Tsui for many helpful suggestions and comments on
earlier editions. We also thank all those who have contributed the data sets which
enrich the presentation and all those who reviewed the previous editions. The
following people gave their careful attention to this edition:
Hongshik Ahn, Stony Brook University
Prasanta Basak, Penn State University Altoona
Andrea Boito, Penn State University Altoona
Patricia M. Buchanan, Penn State University
Nural Chowdhury, University of Saskatchewan
S. Abdul Fazal, California State University Stanislaus
Christian K. Hansen, Eastern Washington University
Susan Kay Herring, Sonoma State University
Hui-Kuang Hsieh, University of Massachusetts Amherst
Hira L. Koul, Michigan State University
Melanie Martin, California State University Stanislaus
Mark McKibben, Goucher College
Charles H. Morgan, Jr., Lock Haven University of Pennsylvania
Perpetua Lynne Nielsen, Brigham Young University
Ashish Kumar Srivastava, St. Louis University
James Stamey, Baylor University
Masoud Tabatabai, Penn State University Harrisburg
Jed W. Utsinger, Ohio University
R. Patrick Vernon, Rhodes College
Roumen Vesselinov, University of South Carolina
Vladimir Vinogradov, Ohio University
A. G. Warrack, North Carolina A&T State University
Richard A. Johnson
Gouri K. Bhattacharyya
Contents
1
INTRODUCTION
1
1 What Is Statistics? 3
2 Statistics in Our Everyday Life 3
3 Statistics in Aid of Scientific Inquiry 5
4 Two Basic Concepts: Population and Sample 8
5 The Purposeful Collection of Data 14
6 Statistics in Context 15
7 Objectives of Statistics 17
8 Using Statistics Wisely 18
9 Key Ideas 18
10 Review Exercises 19
2
ORGANIZATION AND DESCRIPTION OF DATA
21
1 Introduction 23
2 Main Types of Data 23
3 Describing Data by Tables and Graphs 24
3.1 Categorical Data 24
3.2 Discrete Data 28
3.3 Data on a Continuous Variable 29
4 Measures of Center 40
5 Measures of Variation 48
6 Checking the Stability of the Observations over Time 60
7 More on Graphics 64
8 Statistics in Context 66
9 Using Statistics Wisely 68
10 Key Ideas and Formulas 68
11 Technology 70
12 Review Exercises 73
3
DESCRIPTIVE STUDY OF BIVARIATE DATA
81
1 Introduction 83
2 Summarization of Bivariate Categorical Data 83
3 A Designed Experiment for Making a Comparison 88
4 Scatter Diagram of Bivariate Measurement Data 90
5 The Correlation Coefficient: A Measure of Linear Relation 93
6 Prediction of One Variable from Another (Linear Regression) 104
7 Using Statistics Wisely 109
8 Key Ideas and Formulas 109
9 Technology 110
10 Review Exercises 111
4
PROBABILITY
115
1 Introduction 117
2 Probability of an Event 118
3 Methods of Assigning Probability 124
3.1 Equally Likely Elementary Outcomes —
The Uniform Probability Model 124
3.2 Probability As the Long-Run Relative Frequency 126
4 Event Relations and Two Laws of Probability 132
5 Conditional Probability and Independence 141
6 Bayes’ Theorem 140
7 Random Sampling from a Finite Population 155
8 Using Statistics Wisely 162
9 Key Ideas and Formulas 162
10 Technology 164
11 Review Exercises 165
5
PROBABILITY DISTRIBUTIONS
171
1 Introduction 173
2 Random Variables 173
3 Probability Distribution of a Discrete Random Variable 176
4 Expectation (Mean) and Standard Deviation
of a Probability Distribution 185
5 Successes and Failures: Bernoulli Trials 193
6 The Binomial Distribution 198
7 The Binomial Distribution in Context 208
8 Using Statistics Wisely 211
9 Key Ideas and Formulas 212
10 Technology 213
11 Review Exercises 215
6
THE NORMAL DISTRIBUTION
221
1 Probability Model for a Continuous
Random Variable 223
2 The Normal Distribution — Its General Features 230
3 The Standard Normal Distribution 233
4 Probability Calculations with Normal Distributions 238
5 The Normal Approximation to the Binomial 242
*6 Checking the Plausibility of a Normal Model 248
*7 Transforming Observations to Attain
Near Normality 251
8 Using Statistics Wisely 254
9 Key Ideas and Formulas 255
10 Technology 256
11 Review Exercises 257
7
VARIATION IN REPEATED SAMPLES —
SAMPLING DISTRIBUTIONS
263
1 Introduction 265
2 The Sampling Distribution of a Statistic 266
3 Distribution of the Sample Mean and
the Central Limit Theorem 273
4 Statistics in Context 285
5 Using Statistics Wisely 289
6 Key Ideas and Formulas 289
7 Review Exercises 290
8 Class Projects 292
9 Computer Project 293
8
DRAWING INFERENCES FROM LARGE SAMPLES
295
1 Introduction 297
2 Point Estimation of a Population Mean 299
3 Confidence Interval for a Population Mean 305
4 Testing Hypotheses about a Population Mean 314
5 Inferences about a Population Proportion 329
6 Using Statistics Wisely 337
7 Key Ideas and Formulas 338
8 Technology 340
9 Review Exercises 343
9
SMALL-SAMPLE INFERENCES
FOR NORMAL POPULATIONS
349
1 Introduction 351
2 Student’s t Distribution 351
3 Inferences about μ: Small Sample Size 355
3.1 Confidence Interval for μ 355
3.2 Hypotheses Tests for μ 358
4 Relationship between Tests and Confidence Intervals 363
5 Inferences about the Standard Deviation σ
(The Chi-Square Distribution) 366
6 Robustness of Inference Procedures 371
7 Using Statistics Wisely 372
8 Key Ideas and Formulas 373
9 Technology 375
10 Review Exercises 376
10
COMPARING TWO TREATMENTS
381
1 Introduction 383
2 Independent Random Samples from Two Populations 386
3 Large Samples Inference about Difference of Two Means 388
4 Inferences from Small Samples: Normal Populations with Equal Variances 394
5 Inferences from Small Samples: Normal Populations with Unequal Variances 400
5.1 A Conservative t Test 400
5.2 An Approximate t Test: Satterthwaite Correction 402
6 Randomization and Its Role in Inference 407
7 Matched Pairs Comparisons 409
7.1 Inferences from a Large Number of Matched Pairs 412
7.2 Inferences from a Small Number of Matched Pairs 413
7.3 Randomization with Matched Pairs 416
8 Choosing between Independent Samples and a Matched Pairs Sample 418
9 Comparing Two Population Proportions 420
10 Using Statistics Wisely 426
11 Key Ideas and Formulas 427
12 Technology 431
13 Review Exercises 434
11
REGRESSION ANALYSIS I
Simple Linear Regression
439
1 Introduction 441
2 Regression with a Single Predictor 443
3 A Straight-Line Regression Model 446
4 The Method of Least Squares 448
5 The Sampling Variability of the Least Squares Estimators —
Tools for Inference 456
6 Important Inference Problems 458
6.1. Inference Concerning the Slope β1 458
6.2. Inference about the Intercept β0 460
6.3. Estimation of the Mean Response for a Specified x Value 460
6.4. Prediction of a Single Response for a Specified x Value 463
7 The Strength of a Linear Relation 471
8 Remarks about the Straight Line Model Assumptions 476
9 Using Statistics Wisely 476
10 Key Ideas and Formulas 477
11 Technology 480
12 Review Exercises 481
12
REGRESSION ANALYSIS II
Multiple Linear Regression and Other Topics
485
1 Introduction 487
2 Nonlinear Relations and Linearizing Transformations 487
3 Multiple Linear Regression 491
4 Residual Plots to Check the Adequacy of a Statistical Model 503
5 Using Statistics Wisely 507
6 Key Ideas and Formulas 507
7 Technology 508
8 Review Exercises 509
13
ANALYSIS OF CATEGORICAL DATA
513
1 Introduction 515
2 Pearson’s χ² Test for Goodness of Fit 518
3 Contingency Table with One Margin Fixed
(Test of Homogeneity) 522
4 Contingency Table with Neither Margin Fixed (Test of Independence) 531
5 Using Statistics Wisely 537
6 Key Ideas and Formulas 537
7 Technology 539
8 Review Exercises 540
14
ANALYSIS OF VARIANCE (ANOVA)
543
1 Introduction 545
2 Comparison of Several Treatments: The Completely Randomized Design 545
3 Population Model and Inferences
for a Completely Randomized Design 553
4 Simultaneous Confidence Intervals 557
5 Graphical Diagnostics and Displays
to Supplement ANOVA 561
6 Randomized Block Experiments
for Comparing k Treatments 563
7 Using Statistics Wisely 571
8 Key Ideas and Formulas 572
9 Technology 573
10 Review Exercises 574
15
NONPARAMETRIC INFERENCE
577
1 Introduction 579
2 The Wilcoxon Rank-Sum Test for Comparing
Two Treatments 579
3 Matched Pairs Comparisons 590
4 Measure of Correlation Based on Ranks 599
5 Concluding Remarks 603
6 Using Statistics Wisely 604
7 Key Ideas and Formulas 604
8 Technology 605
9 Review Exercises 605
APPENDIX A1 SUMMATION NOTATION 609
APPENDIX A2 RULES FOR COUNTING 614
APPENDIX A3 EXPECTATION AND STANDARD DEVIATION: PROPERTIES 617
APPENDIX A4 THE EXPECTED VALUE AND STANDARD DEVIATION OF X̄ 622
APPENDIX B TABLES 624
Table 1 Random Digits 624
Table 2 Cumulative Binomial Probabilities 627
Table 3 Standard Normal Probabilities 634
Table 4 Percentage Points of t Distributions 636
Table 5 Percentage Points of χ² Distributions 637
Table 6 Percentage Points of F(v1, v2) Distributions 638
Table 7 Selected Tail Probabilities for the Null Distribution of
Wilcoxon’s Rank-Sum Statistic 640
Table 8 Selected Tail Probabilities for the Null Distribution
of Wilcoxon’s Signed-Rank Statistic 645
DATA BANK 647
ANSWERS TO SELECTED ODD-NUMBERED EXERCISES 665
INDEX 681
1
Introduction
1. What Is Statistics?
2. Statistics in Our Everyday Life
3. Statistics in Aid of Scientific Inquiry
4. Two Basic Concepts: Population and Sample
5. The Purposeful Collection of Data
6. Statistics in Context
7. Objectives of Statistics
8. Review Exercises
Surveys Provide Information
About the Population
What is your favorite spectator sport?
Football      36.4%
Baseball      12.7%
Basketball    12.5%
Other         38.4%
College and professional sports are combined in our summary (footnote 1). Clearly, football
is the most popular spectator sport. Actually, the National Football League by
itself is more popular than baseball.
Until the mid 1960s, baseball was most popular according to similar surveys.
Surveys, repeated at different times, can detect trends in opinion.
Hometown fans attending today’s game are but a sample of the population of all local
football fans. A self-selected sample may not be entirely representative of the population
on issues such as ticket price increases. Kiichiro Sato/ © AP/Wide World Photos
1. These percentages are similar to those obtained by the ESPN Sports Poll, a service of TNS, in a
2007 poll of over 27,000 fans.
1. WHAT IS STATISTICS?
The word statistics originated from the Latin word “status,” meaning “state.” For a
long time, it was identified solely with the displays of data and charts pertaining
to the economic, demographic, and political situations prevailing in a country.
Even today, a major segment of the general public thinks of statistics as synonymous with forbidding arrays of numbers and myriad graphs. This image is enhanced by numerous government reports that contain a massive compilation of
numbers and carry the word statistics in their titles: “Statistics of Farm Production,” “Statistics of Trade and Shipping,” “Labor Statistics,” to name a few. However, gigantic advances during the twentieth century have enabled statistics to
grow and assume its present importance as a discipline of data-based reasoning.
Passive display of numbers and charts is now a minor aspect of statistics, and
few, if any, of today’s statisticians are engaged in the routine activities of tabulation and charting.
What, then, are the role and principal objectives of statistics as a scientific
discipline? Stretching well beyond the confines of data display, statistics deals
with collecting informative data, interpreting these data, and drawing conclusions
about a phenomenon under study. The scope of this subject naturally extends to
all processes of acquiring knowledge that involve fact finding through collection
and examination of data. Opinion polls (surveys of households to study sociological, economic, or health-related issues), agricultural field experiments (with new
seeds, pesticides, or farming equipment), clinical studies of vaccines, and cloud
seeding for artificial rain production are just a few examples. The principles and
methodology of statistics are useful in answering questions such as, What kind
and how much data need to be collected? How should we organize and interpret
the data? How can we analyze the data and draw conclusions? How do we assess
the strength of the conclusions and gauge their uncertainty?
Statistics as a subject provides a body of principles and methodology for
designing the process of data collection, summarizing and interpreting
the data, and drawing conclusions or generalities.
2. STATISTICS IN OUR EVERYDAY LIFE
Fact finding through the collection and interpretation of data is not confined to professional researchers. In our attempts to understand issues of environmental protection, the state of unemployment, or the performance of competing football teams,
numerical facts and figures need to be reviewed and interpreted. In our day-to-day
life, learning takes place through an often implicit analysis of factual information.
We are all familiar to some extent with reports in the news media on important statistics.
Employment. Monthly, as part of the Current Population Survey, the
Bureau of Census collects information about employment status from a sample of
about 65,000 households. Households are contacted on a rotating basis with three-fourths of the sample remaining the same for any two consecutive months.
The survey data are analyzed by the Bureau of Labor Statistics, which reports monthly unemployment rates.
Cost of Living. The consumer price index (CPI) measures the cost of a
fixed market basket of over 400 goods and services. Each month, prices are obtained from a sample of over 18,000 retail stores that are distributed over 85
metropolitan areas. These prices are then combined taking into account the relative quantity of goods and services required by a hypothetical “1967 urban wage
earner.” Let us not be concerned with the details of the sampling method and
calculations as these are quite intricate. They are, however, under close scrutiny
because of their importance to the hundreds of thousands of Americans whose
earnings or retirement benefits are tied to the CPI.
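The idea of pricing a fixed market basket can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three items, their quantities, and their prices are invented, the function name fixed_basket_index is hypothetical, and the actual CPI calculation is, as noted above, far more intricate.

```python
# Hypothetical fixed market basket: item -> (quantity, base-period price, current price).
# All numbers are invented for illustration; they are not CPI data.
basket = {
    "bread (loaf)": (50, 2.00, 2.20),
    "gasoline (gallon)": (400, 3.00, 3.30),
    "movie ticket": (12, 10.00, 10.50),
}


def fixed_basket_index(basket):
    """Cost of the fixed basket at current prices, relative to its
    cost at base-period prices, scaled so the base period equals 100."""
    base_cost = sum(qty * p0 for qty, p0, _ in basket.values())
    current_cost = sum(qty * p1 for qty, _, p1 in basket.values())
    return 100.0 * current_cost / base_cost


index = fixed_basket_index(basket)
print(f"Index relative to the base period: {index:.1f}")
```

Because the quantities are held fixed, any change in the index reflects price changes only. Items bought in larger quantities (gasoline in this made-up basket) carry more weight.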
Election time brings the pollsters into the limelight.
Gallup Poll. This, the best known of the national polls, produces estimates of the percentage of popular vote for each candidate based on interviews
with a minimum of 1500 adults. Beginning several months before the presidential election, results are regularly published. These reports help predict winners
and track changes in voter preferences.
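A rough sense of how precise an estimate from about 1500 interviews can be is given by the sketch below. It assumes a simple random sample and uses the normal approximation for a proportion, which is a simplification of how national polls are actually designed. The counts are hypothetical, and the function name estimate_share is introduced only for this illustration.

```python
import math


def estimate_share(votes_for, n):
    """Sample proportion and an approximate 95% margin of error,
    assuming a simple random sample (normal approximation)."""
    p_hat = votes_for / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin


# Hypothetical poll: 780 of 1500 respondents favor candidate A.
p_hat, margin = estimate_share(780, 1500)
print(f"Estimated share: {100 * p_hat:.1f}% +/- {100 * margin:.1f} points")
```

Under these assumptions, a sample of 1500 gives a margin of error of roughly two and a half percentage points. That is why a national poll can be informative without interviewing a very large number of people.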

Our sources of factual information range from individual experience to reports
in news media, government records, and articles in professional journals. As consumers of these reports, citizens need some idea of statistical reasoning to properly
interpret the data and evaluate the conclusions. Statistical reasoning provides criteria for determining which conclusions are supported by the data and which are not.
The credibility of conclusions also depends greatly on the use of statistical methods
at the data collection stage. Statistics provides a key ingredient for any systematic
approach to improve any type of process from manufacturing to service.
Quality and Productivity Improvement. In the past 30 years, the
United States has faced increasing competition in the world marketplace. An international revolution in quality and productivity improvement has heightened
the pressure on the U.S. economy. The ideas and teaching of W. Edwards Deming helped rejuvenate Japan’s industry in the late 1940s and 1950s. In the 1980s
and 1990s, Deming stressed to American executives that, in order to survive,
they must mobilize their work force to make a continuing commitment to quality improvement. His ideas have also been applied to government. The city of
Madison, WI, has implemented quality improvement projects in the police department and in bus repair and scheduling. In each case, the project goal was
better service at less cost. Treating citizens as the customers of government services, the first step was to collect information from them in order to identify situations that needed improvement. One end result was the strategic placement
of a new police substation and a subsequent increase in the number of foot patrol persons to interact with the community.
Statistical reasoning can guide the purposeful collection and analysis of data toward the
continuous improvement of any process. © Andrew Sacks/Stone/Getty Images
Once a candidate project is selected for improvement, data must be collected to assess the current status and then more data collected on the effects of
possible changes. At this stage, statistical skills in the collection and presentation
of summaries are not only valuable but necessary for all participants.
In an industrial setting, statistical training for all employees — production
line and office workers, supervisors, and managers — is vital to the quality transformation of American industry.

3. STATISTICS IN AID OF SCIENTIFIC INQUIRY
The phrase scientific inquiry refers to a systematic process of learning. A scientist sets the goal of an investigation, collects relevant factual information (or
data), analyzes the data, draws conclusions, and decides further courses of action. We briefly outline a few illustrative scenarios.
Training Programs. Training or teaching programs in many fields designed
for a specific type of clientele (college students, industrial workers, minority groups,
physically handicapped people, retarded children, etc.) are continually monitored,
evaluated, and modified to improve their usefulness to society. To learn about the
comparative effectiveness of different programs, it is essential to collect data on the
achievement or growth of skill of the trainees at the completion of each program. 
Monitoring Advertising Claims. The public is constantly bombarded
with commercials that claim the superiority of one product brand in comparison to
others. When such comparisons are founded on sound experimental evidence, they
serve to educate the consumer. Not infrequently, however, misleading advertising
claims are made due to insufficient experimentation, faulty analysis of data, or even
blatant manipulation of experimental results. Government agencies and consumer
groups must be prepared to verify the comparative quality of products by using adequate data collection procedures and proper methods of statistical analysis.

Plant Breeding. To increase food production, agricultural scientists
develop new hybrids by cross-fertilizing different plant species. Promising new
strains need to be compared with the current best ones. Their relative productivity is assessed by planting some of each variety at a number of sites. Yields are
recorded and then analyzed for apparent differences. The strains may also be
compared on the basis of disease resistance or fertilizer requirements.

Genomics. This century’s most exciting scientific advances are occurring
in biology and genetics. Scientists can now study the genome, or sum total of all
of a living organism’s genes. The human DNA sequence is now known along
with the DNA sequences of hundreds of other organisms.
A primary goal of many studies is to identify the specific genes and related genetic states that give rise to complex traits (e.g., diabetes, heart disease, cancer).
New instruments for measuring genes and their products are continually being
developed. One popular technology is the microarray, a rectangular array of tens of
thousands of genes. The power of microarray technologies derives from the ability
to compare, for instance, healthy and diseased tissue. Two-color microarrays have
two kinds of DNA material deposited at each site in the array. Due to the impact
of the disease and the availability of human tumor specimens, many early microarray
studies focused on human cancer. Significant advances have been made in cancer
classification, knowledge of cancer biology, and prognostic prediction. A hallmark example of the power of microarrays used in prognostic prediction is Mammaprint,
approved by the FDA in 2007. This, the first approved microarray-based test, classifies a breast cancer patient as low or high risk for recurrence.
Statistically designed experiments are needed to document the advantages of the new
hybrid versus the old species. © Mitch Wojnarowicz/The Image Works
This is clearly only the beginning, as numerous groups are employing microarrays and other high-throughput technologies in their research studies. Typically, genomics experiments feature the simultaneous measurement of a great
number of responses. As more and more data are collected, there is a growing
need for novel statistical methods for analyzing data and thereby addressing critical scientific questions. Statisticians and other computational scientists are playing
a major role in these efforts to better human health.
Factual information is crucial to any investigation. The branch of statistics
called experimental design can guide the investigator in planning the manner
and extent of data collection.
The Conjecture-Experiment-Analysis Learning Cycle
Invention of the Sandwich by the Earl of Sandwich
(According to Woody Allen, Humorist)*
Steps of the cycle: Conjecture (C), Experiment, Analysis
Experiment: First completed work: a slice of bread, a slice of bread and a slice of turkey on top of both
Analysis: fails miserably
Experiment: two slices of turkey with a slice of bread in the middle
Analysis: rejected
Experiment: three consecutive slices of ham stacked on one another
Analysis: some interest, mostly in intellectual circles
Experiment: three slices of bread
Analysis: improved reputation
Experiment: several strips of ham, enclosed top and bottom by two slices of bread
Analysis: immediate success
*Copyright © 1966 by Woody Allen. Adapted by permission of Random House, Inc. from Getting Even, by Woody Allen.
After the data are collected, statistical methods are available that summarize and describe the prominent features of data. These are commonly known as
descriptive statistics. Today, a major thrust of the subject is the evaluation of information present in data and the assessment of the new learning gained from
this information. This is the area of inferential statistics and its associated methods are known as the methods of statistical inference.
It must be realized that a scientific investigation is typically a process of trial
and error. Rarely, if ever, can a phenomenon be completely understood or a theory perfected by means of a single, definitive experiment. It is too much to expect to get it all right in one shot. Even after his first success with the electric
light bulb, Thomas Edison had to continue to experiment with numerous materials for the filament before it was perfected. Data obtained from an experiment
provide new knowledge. This knowledge often suggests a revision of an existing
theory, and this itself may require further investigation through more experiments and analysis of data. Humorous as it may appear, the excerpt boxed
above from a Woody Allen writing captures the vital point that a scientific
process of learning is essentially iterative in nature.
4. TWO BASIC CONCEPTS — POPULATION AND SAMPLE
In the preceding sections, we cited a few examples of situations where evaluation of factual information is essential for acquiring new knowledge. Although
these examples are drawn from widely differing fields and only sketchy descriptions of the scope and objectives of the studies are provided, a few common
characteristics are readily discernible.
First, in order to acquire new knowledge, relevant data must be collected.
Second, some amount of variability in the data is unavoidable even though observations are made under the same or closely similar conditions. For instance,
the treatment for an allergy may provide long-lasting relief for some individuals
whereas it may bring only transient relief or even none at all to others. Likewise, it is unrealistic to expect that college freshmen whose high school records
were alike would perform equally well in college. Nature does not follow such
a rigid law.
A third notable feature is that access to a complete set of data is either
physically impossible or from a practical standpoint not feasible. When data are
obtained from laboratory experiments or field trials, no matter how much experimentation has been performed, more can always be done. In public opinion
or consumer expenditure studies, a complete body of information would
emerge only if data were gathered from every individual in the nation — undoubtedly a monumental if not an impossible task. To collect an exhaustive set
of data related to the damage sustained by all cars of a particular model under
collision at a specified speed, every car of that model coming off the production
lines would have to be subjected to a collision! Thus, the limitations of time, resources, and facilities, and sometimes the destructive nature of the testing, mean
that we must work with incomplete information — the data that are actually
collected in the course of an experimental study.
The preceding discussions highlight a distinction between the data set that
is actually acquired through the process of observation and the vast collection of
all potential observations that can be conceived in a given context. The statistical name for the former is sample; for the latter, it is population, or statistical
population. To further elucidate these concepts, we observe that each measurement in a data set originates from a distinct source which may be a patient, tree,
farm, household, or some other entity depending on the object of a study. The
source of each measurement is called a sampling unit, or simply, a unit.
To emphasize population as the entire collection of units, we term it the
population of units.
A unit is a single entity, usually a person or an object, whose characteristics are of interest.
The population of units is the complete collection of units about
which information is sought.
There is another aspect to any population and that is the value, for each unit, of
a characteristic or variable of interest. There can be several characteristics of interest for a given population of units, as indicated in Table 1.
TABLE 1 Populations, Units, and Variables
Population                                      Unit        Variables/Characteristics
Registered voters in your state                 Voter       Political party
                                                            Voted or not in last election
                                                            Age
                                                            Sex
                                                            Conservative/liberal
All rental apartments near campus               Apartment   Rent
                                                            Size in square feet
                                                            Number of bedrooms
                                                            Number of bathrooms
                                                            TV and Internet connections
All campus fast food restaurants                Restaurant  Number of employees
                                                            Seating capacity
                                                            Hiring/not hiring
All computers owned by students at your school  Computer    Speed of processor
                                                            Size of hard disk
                                                            Speed of Internet connection
                                                            Screen size
For a given variable or characteristic of interest, we call the collection of values, evaluated for every unit in the population, the statistical population or just
the population. We refer to the collection of units as the population of units
when there is a need to differentiate it from the collection of values.
A statistical population is the set of measurements (or record of some
qualitative trait) corresponding to the entire collection of units about
which information is sought.
The population represents the target of an investigation. We learn about the
population by taking a sample from the population. A sample or sample data
set then consists of measurements recorded for those units that are actually observed. It constitutes a part of a far larger collection about which we wish to
make inferences — the set of measurements that would result if all the units in
the population could be observed.
A sample from a statistical population is the subset of measurements that
are actually collected in the course of an investigation.
Example 1
Identifying the Population and Sample
Questions concerning the effect on health of two or fewer cups of coffee a
day are still largely unresolved. Current studies seek to find physiological
changes that could prove harmful. An article carried the headline CAFFEINE
DECREASES CEREBRAL BLOOD FLOW. It describes a study² that establishes a physiological side effect: a substantial decrease in cerebral blood
flow for persons drinking two to three cups of coffee daily.
The cerebral blood flow was measured twice on each of 20 subjects. It was
measured once after taking an oral dose of caffeine equivalent to two to three
cups of coffee and then, on another day, after taking a look-alike dose but without caffeine. The order of the two tests was random and subjects were not told
which dose they received. The measured decrease in cerebral blood flow was
significant.
Identify the population and sample.
SOLUTION
As the article implies, the conclusion should apply to you and me. The population could well be the potential decreases in cerebral blood flow for all
adults living in the United States. It might even extend to the decreases in
blood flow for all caffeine users in the world, although the cultural customs
may vary the type of caffeine consumption from coffee breaks to tea time to
kola nut chewing.
The sample consists of the decreases in blood flow for the 20 subjects who
agreed to participate in the study.
² A. Field et al. “Dietary Caffeine Consumption and Withdrawal: Confounding Variables in Quantitative Cerebral Perfusion Studies?” Radiology 227 (2003), pp. 129–135.
Example 2
A Misleading Sample
A host of a radio music show announced that she wants to know which
singer is the favorite among city residents. Listeners were then asked to call in
and name their favorite singer.
Identify the population and sample. Comment on how to get a sample
that is more representative of the city’s population.
SOLUTION
The population is the collection of singer preferences of all city residents and
the purported goal was to learn who was the favorite singer. Because it would
be nearly impossible to question all the residents in a large city, one must
necessarily settle for taking a sample.
Having residents make a local call is certainly a low-cost method of getting a sample. The sample would then consist of the singers named by each
person who calls the radio station. Unfortunately, with this selection procedure,
the sample is not very representative of the responses from all city residents.
Those who listen to the particular radio station are already a special subgroup
with similar listening tastes. Furthermore, those listeners who take the time
and effort to call are usually those who feel strongest about their opinions.
The resulting responses could well be much stronger in favor of a particular
country western or rock singer than is the case for preference among the total
population of city residents or even those who listen to the station.
If the purpose of asking the question is really to determine the favorite
singer of the city’s residents, we have to proceed otherwise. One procedure
commonly employed is a phone survey where the phone numbers are chosen
at random. For instance, one can imagine that the numbers 0, 1, 2, 3, 4, 5, 6,
7, 8, and 9 are written on separate pieces of paper and placed in a hat. Slips
are then drawn one at a time and replaced between drawings. Later, we will
see that computers can mimic this selection quickly and easily. Four draws
will produce a random telephone number within a three-digit exchange.
Telephone numbers chosen in this manner will certainly produce a much
more representative sample than the self-selected sample of persons who call
the station.
Self-selected samples consisting of responses to call-in or write-in requests
will, in general, not be representative of the population. They arise primarily
from subjects who feel strongly about the issue in question. To their credit,
many TV news and entertainment programs now state that their call-in polls are
nonscientific and merely reflect the opinions of those persons who responded.
USING A RANDOM NUMBER TABLE TO SELECT A SAMPLE
The choice of which population units to include in a sample must be impartial
and objective. When the total number of units is finite, the name or number of
each population unit could be written on a separate slip of paper and the slips
placed in a box. Slips could be drawn one at a time without replacement and
the corresponding units selected as the sample of units. Unfortunately, this simple and intuitive procedure is cumbersome to implement. Also, it is difficult to
mix the slips well enough to ensure impartiality.
Alternatively, a better method is to take 10 identical marbles, number them
0 through 9, and place them in an urn. After shuffling, select 1 marble. After replacing the marble, shuffle and draw again. Continuing in this way, we create a
sequence of random digits. Each digit has an equal chance of appearing in any
given position, all pairs have the same chance of appearing in any two given positions, and so on. Further, any digit or collection of digits is unrelated to any
other disjoint subset of digits. For convenience of use, these digits can be placed
in a table called a random number table.
The digits in Table 1 of Appendix B were actually generated using computer
software that closely mimics the drawing of marbles. A portion of this table is
shown here as Table 2.
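As a rough illustration of how software can mimic the drawing of marbles, the following Python sketch builds a small table of random digits arranged in blocks of four. The function name random_digit_table, its parameters, and the seed value are assumptions made for this illustration; this is not the program that produced Table 1 of Appendix B.

```python
import random


def random_digit_table(n_rows, blocks_per_row, block_size=4, seed=None):
    """Return rows of digit blocks; every digit is an independent draw from 0-9."""
    rng = random.Random(seed)  # a fixed seed makes the table reproducible
    table = []
    for _ in range(n_rows):
        row = [
            # Each digit is one "marble" drawn with replacement.
            "".join(str(rng.randrange(10)) for _ in range(block_size))
            for _ in range(blocks_per_row)
        ]
        table.append(row)
    return table


if __name__ == "__main__":
    # Print 5 rows of ten 4-digit blocks, labeled with row numbers.
    for number, row in enumerate(random_digit_table(5, 10, seed=1), start=1):
        print(f"{number:>3}  " + " ".join(row))
```

Software of this kind produces pseudo-random digits, which behave like marble draws for the purposes of this chapter.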
To obtain a random sample of units from a population of size N, we first
number the units from 1 to N. Then numbers are read from the table of random
digits until enough different numbers in the appropriate range are selected.
TABLE 2 Random Digits: A Portion of Table 1, Appendix B
Row
  1   0695 7741 8254 4297 0000 5277 6563 9265 1023 5925
  2   0437 5434 8503 3928 6979 9393 8936 9088 5744 4790
  3   6242 2998 0205 5469 3365 7950 7256 3716 8385 0253
  4   7090 4074 1257 7175 3310 0712 4748 4226 0604 3804
  5   0683 6999 4828 7888 0087 9288 7855 2678 3315 6718

  6   7013 4300 3768 2572 6473 2411 6285 0069 5422 6175
  7   8808 2786 5369 9571 3412 2465 6419 3990 0294 0896
  8   9876 3602 5812 0124 1997 6445 3176 2682 1259 1728
  9   1873 1065 8976 1295 9434 3178 0602 0732 6616 7972
 10   2581 3075 4622 2974 7069 5605 0420 2949 4387 7679

 11   3785 6401 0540 5077 7132 4135 4646 3834 6753 1593
 12   8626 4017 1544 4202 8986 1432 2810 2418 8052 2710
 13   6253 0726 9483 6753 4732 2284 0421 3010 7885 8436
 14   0113 4546 2212 9829 2351 1370 2707 3329 6574 7002
 15   4646 6474 9983 8738 1603 8671 0489 9588 3309 5860
Example 3
Using the Table of Random Digits to Select Items for a Price Check
One week, the advertisement for a large grocery store contains 72 special sale
items. Five items will be selected with the intention of comparing the sales
price with the scan price at the checkout counter. Select the five items at random to avoid partiality.
SOLUTION
The 72 sale items are first numbered from 1 to 72. Since the population size
N = 72 has two digits, we will select random digits two at a time from
Table 2. Arbitrarily, we decide to start in row 7 and columns 19 and 20. Starting with the two digits in columns 19 and 20 and reading down, we obtain
12 97 34 69 32 86 32 51
We ignore 97 and 86 because they are larger than the population size 72. We
also ignore any number when it appears a second time as 32 does here. Consequently, the sale items numbered
12 34 69 32 51
are selected for the price check.
For large sample size situations or frequent applications, it is often more
convenient to use computer software to choose the random numbers.
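The selection rule of Example 3 can be sketched in a few lines of Python. The function name select_from_digits and the seed are assumptions made for this illustration. The first part applies the rule to the two-digit readings quoted in Example 3; the second part shows one way software might choose five distinct item numbers directly, without a table.

```python
import random


def select_from_digits(two_digit_numbers, population_size, sample_size):
    """Keep numbers in 1..population_size, skipping repeats, until the sample is full."""
    chosen = []
    for number in two_digit_numbers:
        # Ignore numbers outside the range and numbers already selected.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen


# Readings from columns 19 and 20 of Table 2, starting in row 7 (Example 3).
readings = [12, 97, 34, 69, 32, 86, 32, 51]
print(select_from_digits(readings, population_size=72, sample_size=5))
# → [12, 34, 69, 32, 51]

# Software alternative: draw 5 distinct item numbers from 1 to 72.
rng = random.Random(2024)  # arbitrary seed, chosen only for repeatability
print(sorted(rng.sample(range(1, 73), 5)))
```

The direct draw will generally give different items from the table reading, but both procedures give every item the same chance of selection.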
Example 4
Selecting a Sample by Random Digit Dialing
A major Internet service provider wants to learn about the proportion of
people in one target area who are aware of its latest product. Suppose there
is a single three-digit telephone exchange that covers the target area. Use
Table 1, in Appendix B, to select six telephone numbers for a phone survey.
SOLUTION
We arbitrarily decide to start at row 31 and columns 25 to 28. Proceeding
upward, we obtain
7566 0766 1619 9320 1307 6435
Together with the three-digit exchange, these six numbers form the phone
numbers called in the survey. Every phone number, listed or unlisted, has the
same chance of being selected. The same holds for every pair, every triplet,
and so on. Commercial phones may have to be discarded and another four
digits selected. If there are two exchanges in the area, separate selections
could be done for each exchange.
For large sample sizes, it is better to use computer-generated random digits or even computer-dialed random phone numbers.
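A minimal sketch of computer-generated random digit dialing follows, assuming a single three-digit exchange. The exchange "555", the function name random_phone_numbers, and the seed are hypothetical and serve only to illustrate the idea; screening out commercial phones is not shown.

```python
import random


def random_phone_numbers(exchange, count, seed=None):
    """Return `count` distinct numbers in one exchange with random last four digits."""
    rng = random.Random(seed)
    numbers = []
    while len(numbers) < count:
        # Four independent draws from the digits 0-9, as in drawing slips with replacement.
        last_four = "".join(str(rng.randrange(10)) for _ in range(4))
        candidate = f"{exchange}-{last_four}"
        if candidate not in numbers:  # skip a number that was already drawn
            numbers.append(candidate)
    return numbers


if __name__ == "__main__":
    # "555" is a placeholder exchange, not a real target area.
    for phone in random_phone_numbers("555", 6, seed=7):
        print(phone)
```

With two exchanges in the area, the function could simply be called once for each exchange.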
Data collected with a clear-cut purpose in mind are very different from anecdotal data. Most of us have heard people say they won money at a casino, but
certainly most people cannot win most of the time as casinos are not in the business of giving away money. People tend to tell good things about themselves. In a
similar vein, some drivers’ lives are saved when they are thrown free of car
wrecks because they were not wearing seat belts. Although such stories are told
and retold, you must remember that there is really no opportunity to hear from
those who would have lived if they had worn their seat belts. Anecdotal information is usually repeated because it has some striking feature that may not be representative of the mass of cases in the population. Consequently, it is not apt to
provide reliable answers to questions.
5. THE PURPOSEFUL COLLECTION OF DATA
Many poor decisions are made, in both business and everyday activities, because
of the failure to understand and account for variability. Certainly, the purchasing
habits of one person may not represent those of the population, or the reaction
of one mouse, on exposure to a potentially toxic chemical compound, may not
represent that of a large population of mice. However, despite diversity among
the purchasing habits of individuals, we can obtain accurate information about
the purchasing habits of the population by collecting data on a large number of
persons. By the same token, much can be learned about the toxicity of a chemical if many mice are exposed.
Just making the decision to collect data to answer a question, to provide the
basis for taking action, or to improve a process is a key step. Once that decision
has been made, an important next step is to develop a statement of purpose that
is both specific and unambiguous. If the subject of the study is public transportation being behind schedule, you must carefully specify what is meant by
late. Is it 1 minute, 5 minutes, or more than 10 minutes behind scheduled times
that should result in calling a bus or commuter train late? Words like soft or uncomfortable in a statement are even harder to quantify. One common approach,
for a quality like comfort, is to ask passengers to rate the ride on public transportation on the five-point scale
1 = Very uncomfortable, 2, 3 = Neutral, 4, 5 = Very comfortable
where the numbers 1 through 5 are attached to the scale, with 1 for very uncomfortable and so on through 5 for very comfortable.
We might conclude that the ride is comfortable if the majority of persons in
the sample check either of the top two boxes.
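The majority rule just described can be written out as a short Python sketch. The function names and the ten ratings are made up for this illustration; they are not data from any survey.

```python
def top_two_share(ratings):
    """Fraction of ratings on the 1-5 scale that are 4 or 5."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(1 for rating in ratings if rating >= 4) / len(ratings)


def ride_is_comfortable(ratings):
    """Majority rule: more than half of the sample checked one of the top two boxes."""
    return top_two_share(ratings) > 0.5


# Hypothetical sample of 10 passenger ratings, invented for illustration.
sample_ratings = [5, 4, 3, 4, 2, 5, 4, 1, 4, 3]
print(top_two_share(sample_ratings))        # → 0.6
print(ride_is_comfortable(sample_ratings))  # → True
```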
Example 5
A Clear Statement of Purpose Concerning Water Quality
Each day, a city must sample the lake water in and around a swimming beach to
determine if the water is safe for swimming. During late summer, the primary
difficulty is algae growth and the safe limit has been set in terms of water clarity.
SOLUTION
The problem is already well defined so the statement of purpose is straightforward.
PURPOSE: Determine whether or not the water clarity at the beach is
below the safe limit.
The city has already decided to take measurements of clarity at three separated locations. In Chapter 8, we will learn how to decide if the water is safe
despite the variation in the three sample values.
The overall purpose can be quite general but a specific statement of purpose is
required at each step to guide the collection of data. For instance:
GENERAL PURPOSE: Design a data collection and monitoring program
at a completely automated plant that handles radioactive materials.
One issue is to ensure that the production plant will shut down quickly if materials start accumulating anywhere along the production line. More specifically,
the weight of materials could be measured at critical positions. A quick shutdown will be implemented if any of these exceed a safe limit. For this step, a
statement of purpose could be:
PURPOSE: Implement a fast shutdown if the weight at any critical position exceeds 1.2 kilograms.
The safe limit 1.2 kilograms should be obtained from experts; preferably it
would be a consensus of expert opinion.
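Because this statement of purpose is specific, it translates almost directly into a rule that can be checked. The Python sketch below is one possible rendering; the position names and weights are hypothetical, and only the 1.2 kilogram limit comes from the statement of purpose.

```python
SAFE_LIMIT_KG = 1.2  # limit quoted in the statement of purpose


def positions_over_limit(weights_kg, limit=SAFE_LIMIT_KG):
    """Return the critical positions whose measured weight exceeds the limit."""
    return [name for name, weight in weights_kg.items() if weight > limit]


def should_shut_down(weights_kg, limit=SAFE_LIMIT_KG):
    """A shutdown is called for if any critical position exceeds the limit."""
    return len(positions_over_limit(weights_kg, limit)) > 0


# Hypothetical readings in kilograms at three made-up critical positions.
readings = {"inlet": 0.4, "mixer": 1.25, "outlet": 0.9}
print(positions_over_limit(readings))  # → ['mixer']
print(should_shut_down(readings))      # → True
```

The sketch treats a weight of exactly 1.2 kilograms as safe, since the purpose speaks of weights that exceed the limit.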
There still remain statistical issues of how many critical positions to choose
and how often to measure the weight. These are followed with questions on
how to analyze data and specify a rule for implementing a fast shutdown.
A clearly specified statement of purpose will guide the choice of what data
to collect and help ensure that it will be relevant to the purpose. Without a
clearly specified purpose, or terms unambiguously defined, much effort can be
wasted in collecting data that will not answer the question of interest.
6. STATISTICS IN CONTEXT
A primary health facility became aware that sometimes it was taking too long to
return patients’ phone calls. That is, patients would phone in with requests for
information. These requests, in turn, had to be turned over to doctors or nurses
who would collect the information and return the call. The overall objective was
to understand the current procedure and then improve on it. As a good first
step, it was decided to find how long it was taking to return calls under the current procedure. Variation in times from call to call is expected, so the purpose of
the initial investigation is to benchmark the variability with the current procedure by collecting a sample of times.
PURPOSE: Obtain a reference or benchmark for the current procedure
by collecting a sample of times to return a patient’s call under the current
procedure.
For a sample of incoming calls collected during the week, the time received was
noted along with the request. When the return call was completed, the elapsed
time, in minutes, was recorded. Each of these times is represented as a dot in
Figure 1. Notice that over one-third of the calls took over 120 minutes, or over
two hours, to return. This could be a long time to wait for information if it concerns a child with a high fever or an adult with acute symptoms. If the purpose
was to determine what proportion of calls took too long to return, we would
need to agree on a more precise definition of “too long” in terms of number of
minutes. Instead, these data clearly indicate that the process needs improvement
and the next step is to proceed in that direction.
[Dot diagram of the return times. Horizontal axis: Time (min), marked at 0, 40, 80, 120, 160, 200, and 240.]
Figure 1 Time in minutes to return call.
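To see why a precise definition of "too long" matters, the following Python sketch computes the proportion of calls exceeding several possible cutoffs. The times listed are hypothetical and are NOT the data plotted in Figure 1; the function name share_over is likewise an assumption of this illustration.

```python
def share_over(times_min, cutoff_min):
    """Fraction of return-call times strictly greater than cutoff_min."""
    if not times_min:
        raise ValueError("need at least one time")
    return sum(1 for t in times_min if t > cutoff_min) / len(times_min)


# Hypothetical elapsed times in minutes, invented for illustration only.
times = [12, 25, 33, 48, 60, 75, 95, 130, 150, 185, 210, 240]

# The reported proportion changes with the cutoff chosen for "too long".
for cutoff in (60, 120, 180):
    print(cutoff, round(share_over(times, cutoff), 2))
```

The same set of times yields quite different proportions as the cutoff moves, which is why the definition should be agreed on before the data are summarized.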
In any context, to pursue potential improvements of a process, one needs to
focus more closely on particulars. Three questions
When Where Who
should always be asked before gathering further data. More specifically, data
should be sought that will answer the following questions.
When do the difficulties arise? Is it during certain hours, certain days of the
week or month, or in coincidence with some other activities?
Where do the difficulties arise? Try to identify the locations of bottlenecks
and unnecessary delays.
Who was performing the activity and who was supervising? The idea is not
to pin blame, but to understand the roles of participants with the goal of making improvements.
It is often helpful to construct a cause-and-effect diagram or fishbone diagram. The main centerline represents the problem or the effect. A somewhat
simplified fishbone chart is shown in Figure 2 for the where question regarding
the location of delays when returning patients’ phone calls. The main centerline
represents the problem: Where are delays occurring? Calls come to the reception desk, but when these lines are busy, the calls go directly to nurses on the
third or fourth floor. The main diagonal arms in Figure 2 represent the floors
and the smaller horizontal lines more specific locations on the floor where the
delay could occur. For instance, the horizontal line representing a delay in retrieving a patient’s medical record connects to the second floor diagonal line.
The resulting figure resembles the skeleton of a fish. Consideration of the diagram can help guide the choice of what new data to collect.
Fortunately, the quality team conducting this study had already given preliminary consideration to the When, Where, and Who questions and recorded not
only the time of day but also the day and person receiving the call. That is, their
current data gave them a start on determining if the time to return calls depends on when or where the call is received.
Although we go no further with this application here, the quality team next
developed more detailed diagrams to study the flow of paper between the time
the call is received and when it is returned. They then identified bottlenecks in
the flow of information that were removed and the process was improved. In
later chapters, you will learn how to compare and display data from two locations or old and new processes, but the key idea emphasized here is the purposeful collection of relevant data.
[Fishbone diagram. Centerline: WHERE ARE THE DELAYS? Diagonal arms: 1st Floor, 2nd Floor, 3rd Floor, 4th Floor. Horizontal branches: Lab, Receptionist, X-ray, Records.]
Figure 2 A cause-and-effect diagram for the location of delays.
7. OBJECTIVES OF STATISTICS
The subject of statistics provides the methodology to make inferences about the
population from the collection and analysis of sample data. These methods enable one to derive plausible generalizations and then assess the extent of uncertainty underlying these generalizations. Statistical concepts are also essential
during the planning stage of an investigation when decisions must be made as to
the mode and extent of the sampling process.
The major objectives of statistics are:
1. To make inferences about a population from an analysis of information contained in sample data. This includes assessments of the extent
of uncertainty involved in these inferences.
2. To design the process and the extent of sampling so that the observations form a basis for drawing valid inferences.
The design of the sampling process is an important step. A good design for
the process of data collection permits efficient inferences to be made, often with
a straightforward analysis. Unfortunately, even the most sophisticated methods
of data analysis cannot, in themselves, salvage much information from data that
are produced by a poorly planned experiment or survey.
The early use of statistics in the compilation and passive presentation of
data has been largely superseded by the modern role of providing analytical
tools with which data can be efficiently gathered, understood, and interpreted.
Statistical concepts and methods make it possible to draw valid conclusions
about the population on the basis of a sample. Given its extended goal, the subject of statistics has penetrated all fields of human endeavor in which the evaluation of information must be grounded in data-based evidence.
The basic statistical concepts and methods described in this book form the
core in all areas of application. We present examples drawn from a wide range
of applications to help develop an appreciation of various statistical methods,
their potential uses, and their vulnerabilities to misuse.
USING STATISTICS WISELY
1. Compose a clear statement of purpose and use it to help decide upon which
variables to observe.
2. Carefully define the population of interest.
3. Whenever possible, select samples using a random device or random number table.
4. Do not unquestioningly accept conclusions based on self-selected samples.
5. Remember that conclusions reached in TV, magazine, or newspaper reports
might not be as obvious as reported. When reading or listening to reports,
you must be aware that the advocate, often a politician or advertiser, may
only be presenting statistics that emphasize positive features.
KEY IDEAS
Before gathering data on a characteristic of interest, identify a unit or sampling
unit. This is usually a person or object. The population of units is the complete
collection of units. In statistics we concentrate on the collection of values of the
characteristic, or record of a qualitative trait, evaluated for each unit in the population. We call this the statistical population or just the population.
A sample or sample data set from the population is the subset of measurements that are actually collected.
Statistics is a body of principles that helps to first design the process and extent of sampling and then guides the making of inferences about the population (inferential statistics). Descriptive statistics help summarize the sample.
Procedures for statistical inference allow us to make generalizations about the
population from the information in the sample.
A statement of purpose is a key step in designing the data collection process.
8. REVIEW EXERCISES
1.1 A newspaper headline reads,
U.S. TEENS TRUST, FEAR THEIR PEERS
and the article explains that a telephone poll was conducted of 1055 persons 13 to 17 years old. Identify a statistical population and the sample.
1.2 Consider the population of all students at your college. You want to learn about total monthly entertainment expenses for a student.
(a) Specify the population unit.
(b) Specify the variable of interest.
(c) Specify the statistical population.
1.3 Consider the population of persons living in Chicago. You want to learn about the proportion which are illegal aliens.
(a) Specify the population unit.
(b) Specify the variable of interest.
(c) Specify the statistical population.
1.4 A student is asked to estimate the mean height of all male students on campus. She decides to use the heights of members of the basketball team because they are conveniently printed in the game program.
(a) Identify the statistical population and the sample.
(b) Comment on the selection of the sample.
(c) How should a sample of males be selected?
1.5 Psychologists³ asked 46 golfers, after they played a round, to estimate the diameter of the hole on the green by visually selecting one of nine holes cut in a board.
(a) Specify the population unit.
(b) Specify the statistical population and sample.
1.6 A phone survey in 2008⁴ of 1010 adults included a response to the number of leisure hours per week. Identify the population unit, statistical population, and sample.
1.7 It is often easy to put off doing an unpleasant task. At a Web site,⁵ persons can take a test and receive a score that determines if they have a serious problem with procrastination. Should the scores from people who take this test on-line be considered a random sample? Explain your reasoning.
1.8 A magazine that features the latest electronics and computer software for homes enclosed a short questionnaire on a postcard. Readers were asked to answer questions concerning their use and ownership of various software and hardware products, and to then send the card to the publisher. A summary of the results appeared in a later issue of the magazine that used the data to make statements such as 40% of readers have purchased program X. Identify a population and sample and comment on the representativeness of the sample. Are readers who have not purchased any new products mentioned in the questionnaire as likely to respond as those who have purchased?
1.9 Each year a local weekly newspaper gives out “Best of the City” awards in categories such as restaurant, deli, pastry shop, and so on. Readers are asked to fill in their favorites on a form enclosed in this free weekly paper and then send it to the publisher. The establishment receiving the most votes is declared the winner in its category. Identify the population and sample and comment on the representativeness of the sample.
1.10 Which of the following are anecdotal and which are based on a sample?
(a) Out of 200 students questioned, 40 admitted they lied regularly.
(b) Bobbie says the produce at Market W is the freshest in the city.
(c) Out of 50 persons interviewed at a shopping mall, 18 had made a purchase that day.
1.11 Which of the following are anecdotal and which are based on a sample?
(a) Tom says he gets the best prices on electronics at the www.bestelc.com Internet site.
(b) Out of 22 students, 6 had multiple credit cards.
(c) Among 55 people checking in at the airport, 12 were going to destinations outside of the continental United States.
1.12 What is wrong with this statement of purpose?
PURPOSE: Determine if a newly designed rollerball pen is comfortable to hold when writing.
Give an improved statement of purpose.
1.13 What is wrong with this statement of purpose?
PURPOSE: Determine if it takes too long to get cash from the automated teller machine during the lunch hour.
Give an improved statement of purpose.
1.14 Give a statement of purpose for determining the amount of time it takes to make hotel reservations in San Francisco using the Internet.
1.15 Thirty-five classrooms on campus are equipped for multimedia instruction. Use Table 1, Appendix B, to select 4 of these classrooms to visit and check whether or not the instructor is using the equipment during that day’s first hour lecture.
1.16 Fifty band members would like to ride the band bus to an out-of-town game. However, there is room for only 44. Use Table 1, Appendix B, to select the 44 persons who will go. Determine how to make your selection by taking only a few two-digit selections.
1.17 Eight young students need mentors. Of these, there are three whom you enjoy being with while you are indifferent about the others. Two of the students will be randomly assigned to you. Label the students you like by 0, 1, and 2 and the others by 3, 4, 5, 6, and 7. Then, the process of assigning two students at random is equivalent to choosing two different digits from the table of random digits and ignoring any 8 or 9. Repeat the experiment of assigning two students 20 times by using the table of random digits. Record the pairs of digits you draw for each experiment.
(a) What is the proportion of the 20 experiments that give two students that you like?
(b) What is the proportion of the 20 experiments that give one of the students you like and one other?
(c) What is the proportion of the 20 experiments that give none of the students you like?
1.18 According to the cause-and-effect diagram on page 17, where are the possible delays on the first floor?
1.19 Refer to the cause-and-effect diagram on page 17. The workers have now noticed that a delay could occur:
(i) On the fourth floor at the pharmacy
(ii) On the third floor at the practitioners’ station
Redraw the diagram and include this added information.
1.20 The United States Environmental Protection Agency⁶ reports that in 2006, each American generated 4.6 pounds of solid waste a day.
(a) Does this mean every single American produces the same amount of garbage? What do you think this statement means?
(b) Was the number 4.6 obtained from a sample? Explain.
(c) How would you select a sample?
1.21 As a very extreme case of self-selection, imagine a five-foot-high solid wood fence surrounding a collection of Great Danes and Miniature Poodles. You want to estimate the proportion of Great Danes inside and decide to collect your sample by observing the first seven dogs to jump high enough to be seen above the fence.
(a) Explain how this is a self-selected sample that is, of course, very misleading.
(b) How is this sample selection procedure like a call-in election poll?
³ J. Witt et al. “Putting to a bigger hole: Golf performance relates to perceived size,” Psychonomic Bulletin and Review 15(3) (2008), pp. 581–586.
⁴ Harris Interactive telephone survey (October 16–19, 2008).
⁵ http://psychologytoday.psychtests.com/tests/procrastination_access.html
⁶ http://www.epa.gov/epawaste/nonhaz/index.htm
2 Organization and Description of Data
1. Introduction
2. Main Types of Data
3. Describing Data by Tables and Graphs
4. Measures of Center
5. Measures of Variation
6. Checking the Stability of the Observations over Time
7. More on Graphics
8. Statistics in Context
9. Review Exercises
Acid Rain Is Killing Our Lakes
© SuperStock, Inc.
Acid precipitation is linked to the disappearance of sport fish and other organisms from lakes. Sources of air pollution, including automobile emissions and
the burning of fossil fuels, add to the natural acidity of precipitation. The Wisconsin Department of Natural Resources initiated a precipitation monitoring
program with the goal of developing appropriate air pollution controls to reduce the problem. The acidity values of the first 50 rains monitored, measured on a pH
scale from 1 (very acidic) to 7 (neutral), are summarized by the histogram.
Histogram of acid rain data. (Horizontal axis: pH, marked from 3.0 to 6.0 in steps of 0.5; frequencies printed on the bars: 25, 10, 9, 4, 2.)
Notice that all the rains are more acidic than normal rain, which has a pH of
5.6. (As a comparison, apples are about pH 3 and milk is about pH 6.)
Researchers in Canada have established that lake water with a pH below
5.6 may severely affect the reproduction of game fish. More research will undoubtedly improve our understanding of the acid rain problem and lead, it is
hoped, to an improved environment.
1. INTRODUCTION
In Chapter 1, we cited several examples of situations where the collection of
data by appropriate processes of experimentation or observation is essential to
acquire new knowledge. A data set may range in complexity from a few entries
to hundreds or even thousands of them. Each entry corresponds to the observation of a specified characteristic of a sampling unit. For example, a nutritionist
may provide an experimental diet to 30 undernourished children and record
their weight gains after two months. Here, children are the sampling units, and
the data set would consist of 30 measurements of weight gains. Once the data
are collected, a primary step is to organize the information and extract a descriptive summary that highlights its salient features. In this chapter, we learn
how to organize and describe a set of data by means of tables, graphs, and calculation of some numerical summary measures.
2. MAIN TYPES OF DATA
In discussing the methods for providing summary descriptions of data, it helps
to distinguish between the two basic types:
1. Qualitative or categorical data
2. Numerical or measurement data
When the characteristic under study concerns a qualitative trait that is only
classified in categories and not numerically measured, the resulting data are
called categorical data. Hair color (blond, brown, red, black), employment status (employed, unemployed), and blood type (O, A, B, AB) are but some examples. If, on the other hand, the characteristic is measured on a numerical scale,
the resulting data consist of a set of numbers and are called measurement data.
We will use the term numerical-valued variable or just variable to refer to a
characteristic that is measured on a numerical scale. The word “variable” signifies
that the measurements vary over different sampling units. In this terminology,
observations of a numerical-valued variable yield measurement data. A few examples of numerical-valued variables are the shoe size of an adult male, daily
number of traffic fatalities in a state, intensity of an earthquake, height of a 1-year-old pine seedling, the time in line at an automated teller, and the number
of offspring in an animal litter.
Although in all these examples the stated characteristic can be numerically measured, a close scrutiny reveals two distinct types of underlying scale
of measurement. Shoe sizes are numbers such as 6, 6½, 7, 7½, . . . , which
proceed in steps of ½. The count of traffic fatalities can only be an integer and
so is the number of offspring in an animal litter. These are examples of discrete variables. The name discrete draws from the fact that the scale is made
up of distinct numbers with gaps in between. On the other hand, some variables such as height, weight, and survival time can ideally take any value in an
interval. Since the measurement scale does not have gaps, such variables are
called continuous.
We must admit that a truly continuous scale of measurement is an idealization. Measurements actually recorded in a data set are always rounded either for
the sake of simplicity or because the measuring device has a limited accuracy.
Still, even though weights may be recorded to the nearest pound or time
recorded in whole hours, their actual values occur on a continuous scale so
the data are referred to as continuous. Counts are inherently discrete and
treated as such, provided that they take relatively few distinct values (e.g., the
number of children in a family or the number of traffic violations of a driver).
But when a count spans a wide range of values, it is often treated as a continuous variable. For example, the count of white blood cells, number of insects in a
colony, and number of shares of stock traded per day are strictly discrete, but for
practical purposes, they are viewed as continuous.
A summary description of categorical data is discussed in Section 3.1. The
remainder of this chapter is devoted to a descriptive study of measurement
data, both discrete and continuous. As in the case of summarization and commentary on a long, wordy document, it is difficult to prescribe concrete steps for
summary descriptions that work well for all types of measurement data. However, a few important aspects that deserve special attention are outlined here to
provide general guidelines for this process.
Describing a Data Set of Measurements
1. Summarization and description of the overall pattern.
(a) Presentation of tables and graphs.
(b) Noting important features of the graphed data including symmetry or departures from it.
(c) Scanning the graphed data to detect any observations that seem
to stick far out from the major mass of the data—the outliers.
2. Computation of numerical measures.
(a) A typical or representative value that indicates the center of the
data.
(b) The amount of spread or variation present in the data.
3. DESCRIBING DATA BY TABLES AND GRAPHS
3.1 CATEGORICAL DATA
When a qualitative trait is observed for a sample of units, each observation is
recorded as a member of one of several categories. Such data are readily organized in the form of a frequency table that shows the counts (frequencies) of
the individual categories. Our understanding of the data is further enhanced by
calculation of the proportion (also called relative frequency) of observations in
each category.
$$ \text{Relative frequency of a category} = \frac{\text{Frequency in the category}}{\text{Total number of observations}} $$
Example 1
Calculating Relative Frequencies to Summarize an Opinion Poll
A campus press polled a sample of 280 undergraduate students in order to study student attitude toward a proposed change in the dormitory regulations. Each student was to respond as support, oppose, or neutral in regard to the issue. The numbers were 152 support, 77 neutral, and 51 opposed. Tabulate the results and calculate the relative frequencies for the three response categories.
SOLUTION
Table 1 records the frequencies in the second column, and the relative frequencies are calculated in the third column. The relative frequencies show that about 54% of the polled students supported the change, 18% opposed, and 28% were neutral.
TABLE 1 Summary Results of an Opinion Poll
Responses    Frequency    Relative Frequency
Support         152       152/280 = .543
Neutral          77        77/280 = .275
Oppose           51        51/280 = .182
Total           280                 1.000
Remark: The relative frequencies provide the most relevant information
as to the pattern of the data. One should also state the sample size, which
serves as an indicator of the credibility of the relative frequencies. (More on
this in Chapter 8.)
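The calculation in Table 1 can be sketched in a few lines of Python. This is only an illustration of the relative frequency formula applied to the counts of Example 1; the names counts and relative_frequencies are chosen here for convenience and do not come from the text.

```python
# Opinion-poll counts from Example 1.
counts = {"Support": 152, "Neutral": 77, "Oppose": 51}


def relative_frequencies(counts):
    """Divide each category frequency by the total number of observations."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


rel_freq = relative_frequencies(counts)

# Print a small frequency table, including the sample size in the Total row.
for category, p in rel_freq.items():
    print(f"{category:8s} {counts[category]:4d} {p:.3f}")
print(f"{'Total':8s} {sum(counts.values()):4d} {sum(rel_freq.values()):.3f}")
```

Rounded to three decimals, the proportions agree with the third column of Table 1, and they add to 1 up to rounding error.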
Categorical data are often presented graphically as a pie chart in which the
segments of a circle exhibit the relative frequencies of the categories. To obtain
the angle for any category, we multiply the relative frequency by 360 degrees,
which corresponds to the complete circle. Although laying out the angles by
hand can be tedious, many software packages generate the chart with a single
command. Figure 1 presents a pie chart for the data in Example 1.
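As a rough sketch of the angle rule just described, the following Python lines multiply each relative frequency from Example 1 by 360 degrees. The variable names are illustrative assumptions, and no plotting package is used; the code only computes the angles one would lay out by hand.

```python
# Opinion-poll counts from Example 1.
counts = {"Support": 152, "Neutral": 77, "Oppose": 51}
total = sum(counts.values())

# Angle of each pie segment = relative frequency x 360 degrees.
angles = {category: 360 * n / total for category, n in counts.items()}

for category, angle in angles.items():
    print(f"{category:8s} {angle:6.1f} degrees")
```

The three angles necessarily add to 360 degrees, the complete circle.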
Figure 1 Pie chart of student opinion on change in dormitory regulations. (Segments: Support 54%, Neutral 28%, Oppose 18%.)
When questions arise that need answering but the decision makers lack precise knowledge of the state of nature or the full ramifications of their decisions,
the best procedure is often to collect more data. In the context of quality improvement, if a problem is recognized, the first step is to collect data on the
magnitude and possible causes. This information is most effectively communicated through graphical presentations.
A Pareto diagram is a powerful graphical technique for displaying events
according to their frequency. According to Pareto’s empirical law, any collection
of events consists of only a few that are major in that they are the ones that occur most of the time.
Figure 2 gives a Pareto diagram for the type of defects found in a day’s production of facial tissues. The cumulative frequency is 22 for the first cause and 22 + 15 = 37 for the first and second causes combined. This illustrates Pareto’s rule, with two of the causes being responsible for 37 out of 50, or 74%, of the defects.
Figure 2 Pareto diagram of facial tissue defects. (Vertical axis: Frequency, 0 to 20; categories: Tears, Holes, Folds, Other.)
Example 2
A Pareto Diagram Clarifies Circumstances Needing Improvement
Graduate students in a counseling course were asked to choose one of their personal habits that needed improvement. In order to reduce the effect of this habit, they were asked to first gather data on the frequency of the occurrence and the circumstances. One student collected the following frequency data on fingernail biting over a two-week period.
Frequency    Activity
   58        Watching television
   21        Reading newspaper
   14        Talking on phone
    7        Driving a car
    3        Grocery shopping
   12        Other
Make a Pareto diagram showing the relationship between nail biting and type of activity.
SOLUTION
The cumulative frequencies are 58, 58 + 21 = 79, and so on, out of 115. The Pareto diagram is shown in Figure 3, where watching TV accounts for 50.4% of the instances.
Figure 3 Pareto diagram for nail biting example. (Vertical axis: Frequency, 0 to 60; categories: TV, Paper, Phone, Driving, Shopping, Other.)
The next step for this person would be to try and find a substitute for
nail biting while watching television.
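The cumulative frequencies behind a Pareto diagram are easy to tabulate. The sketch below uses the nail-biting data of Example 2 and assumes, as in Figure 3, that the categories are already listed from most to least frequent with the catch-all category Other kept last; the function name pareto_table is hypothetical.

```python
# Nail-biting frequencies from Example 2, ordered as in Figure 3:
# most frequent first, with the catch-all "Other" category kept last.
data = [
    ("Watching television", 58),
    ("Reading newspaper", 21),
    ("Talking on phone", 14),
    ("Driving a car", 7),
    ("Grocery shopping", 3),
    ("Other", 12),
]


def pareto_table(data):
    """Return rows of (activity, frequency, cumulative frequency, cumulative %)."""
    total = sum(freq for _, freq in data)
    rows, running = [], 0
    for activity, freq in data:
        running += freq
        rows.append((activity, freq, running, 100 * running / total))
    return rows


rows = pareto_table(data)
for activity, freq, cumulative, percent in rows:
    print(f"{activity:20s} {freq:3d} {cumulative:4d} {percent:6.1f}%")
```

The first two rows reproduce the cumulative frequencies 58 and 79 quoted in the solution, and the last cumulative frequency is the total, 115.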
3.2 DISCRETE DATA
We next consider summary descriptions of measurement data and begin our discussion with discrete measurement scales. As explained in Section 2, a data set
is identified as discrete when the underlying scale is discrete and the distinct values observed are not too numerous.
Similar to our description of categorical data, the information in a discrete
data set can be summarized in a frequency table, or frequency distribution
that includes a calculation of the relative frequencies. In place of the qualitative
categories, we now list the distinct numerical measurements that appear in the
data set and then count their frequencies.
Example 3
Creating a Frequency Distribution
Retail stores experience their heaviest returns on December 26 and December 27 each year. Most are gifts that, for some reason, did not please the recipient. The numbers of items returned by a sample of 30 persons at a large discount department store are observed, and the data of Table 2 are obtained. Determine the frequency distribution.
TABLE 2 Number of items returned
1  4  3  2  3  4  5  1  2  1
2  5  1  4  2  1  3  2  4  1
2  3  2  3  2  1  4  3  2  5
SOLUTION
The frequency distribution of these data is presented in Table 3. The values are paired with the frequency and the calculated relative frequency.
TABLE 3 Frequency Distribution for Number (x) of Items Returned
Value x    Frequency    Relative Frequency
   1           7              .233
   2           9              .300
   3           6              .200
   4           5              .167
   5           3              .100
Total         30             1.000
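For a discrete data set, counting the distinct values is a mechanical step. The following sketch, which assumes the 30 observations of Table 2 typed in row by row, uses Python's standard collections.Counter to reproduce Table 3; the variable names are illustrative only.

```python
from collections import Counter

# The 30 observations of Table 2 (number of items returned).
returns = [1, 4, 3, 2, 3, 4, 5, 1, 2, 1,
           2, 5, 1, 4, 2, 1, 3, 2, 4, 1,
           2, 3, 2, 3, 2, 1, 4, 3, 2, 5]

freq = Counter(returns)   # frequency of each distinct value
n = len(returns)          # total number of observations

# List the distinct values in increasing order with frequency and relative frequency.
for value in sorted(freq):
    print(f"{value}  {freq[value]:2d}  {freq[value] / n:.3f}")
print(f"Total {n}")
```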
The frequency distribution of a discrete variable can be presented pictorially by drawing either lines or rectangles to represent the relative frequencies.
First, the distinct values of the variable are located on the horizontal axis. For a
line diagram, we draw a vertical line at each value and make the height of the
line equal to the relative frequency. A histogram employs vertical rectangles
instead of lines. These rectangles are centered at the values and their areas represent relative frequencies. Typically, the values proceed in equal steps so the
rectangles are all of the same width and their heights are proportional to the relative frequencies as well as frequencies. Figure 4(a) shows the line diagram and
4(b) the histogram of the frequency distribution of Table 3.
Figure 4 Graphic display of the frequency distribution of data in Table 3: (a) line diagram, (b) histogram. (Both panels: horizontal axis x = 1, 2, 3, 4, 5; vertical axis: relative frequency, 0 to 0.3.)
3.3 DATA ON A CONTINUOUS VARIABLE
We now consider tabular and graphical presentations of data sets that contain
numerical measurements on a virtually continuous scale. Of course, the
recorded measurements are always rounded. In contrast with the discrete case, a
data set of measurements on a continuous variable may contain many distinct
values. Then, a table or plot of all distinct values and their frequencies will not
provide a condensed or informative summary of the data.
The two main graphical methods used to display a data set of measurements are the dot diagram and the histogram. Dot diagrams are employed
when there are relatively few observations (say, fewer than 20 or 25); histograms
are used with a larger number of observations.
Dot Diagram
When the data consist of a small set of numbers, they can be graphically represented by drawing a line with a scale covering the range of values of the measurements. Individual measurements are plotted above this line as prominent
dots. The resulting diagram is called a dot diagram.
Example 4
A Dot Diagram Reveals an Unusual Observation
The number of days the first six heart transplant patients at Stanford survived after their operations were 15, 3, 46, 623, 126, 64. Make a dot diagram.
SOLUTION
These survival times extended from 3 to 623 days. Drawing a line segment
from 0 to 700, we can plot the data as shown in Figure 5. This dot diagram
shows a cluster of small survival times and a single, rather large value.
Figure 5 Dot diagram for the heart transplant data. (Horizontal axis: Survival time (days), 0 to 700.)
Frequency Distribution on Intervals
When the data consist of a large number of measurements, a dot diagram may
be quite tedious to construct. More seriously, overcrowding of the dots will
cause them to smear and mar the clarity of the diagram. In such cases, it is convenient to condense the data by grouping the observations according to intervals
and recording the frequencies of the intervals. Unlike a discrete frequency distribution, where grouping naturally takes place on points, here we use intervals of
values. The main steps in this process are outlined as follows.
Constructing a Frequency Distribution
for a Continuous Variable
1. Find the minimum and the maximum values in the data set.
2. Choose intervals or cells of equal length that cover the range between
the minimum and the maximum without overlapping. These are
called class intervals, and their endpoints class boundaries.
3. Count the number of observations in the data that belong to each
class interval. The count in each class is the class frequency or cell frequency.
4. Calculate the relative frequency of each class by dividing the class frequency by the total number of observations in the data:
$$ \text{Relative frequency} = \frac{\text{Class frequency}}{\text{Total number of observations}} $$
The choice of the number and position of the class intervals is primarily a matter of judgment guided by the following considerations. The number of classes usually ranges from 5 to 15, depending on the number of observations in the data. Grouping the observations sacrifices information concerning how the observations are distributed within each cell. With too few cells, the loss of information is serious. On the other hand, if one chooses too many cells and the data set is relatively small, the frequencies from one cell to the next would jump up and down in a chaotic manner and no overall pattern would emerge. As an initial step, frequencies may be determined with a large number of intervals that can later be combined as desired in order to obtain a smooth pattern of the distribution.
Paying Attention
© Britt Erlanson/The Image Bank/Getty Images
Paying attention in class. Observations on 24 first-grade students.
Figure 6 Time not concentrating on the mathematics assignment (out of 20 minutes). (Dot diagram; horizontal axis: Minutes, 0 to 13.)
First-grade teachers allot a portion of each day to mathematics. An educator, concerned about how students utilize this time, selected 24 students and observed them for a total of 20 minutes spread over several days. The number of minutes, out of 20, that the student was not on task was recorded (courtesy of T. Romberg). These lack-of-attention times are graphically portrayed in the dot diagram in Figure 6. The student with 13 out of 20 minutes off-task stands out enough to merit further consideration. Is this a student who finds the subject too difficult or might it be a very bright child who is bored?
Computers conveniently order data from smallest to largest so that the observations in any cell can easily be counted. The construction of a frequency distribution is illustrated in Example 5.
Example 5
Creating a Frequency Distribution for Hours of Sleep
Students require different amounts of sleep. A sample of 59 students at a large midwest university reported the following hours of sleep the previous night.
TABLE 4 Hours of Sleep for Fifty-nine Students
4.5  4.7  5.0  5.0  5.3  5.5  5.5  5.7  5.7  5.7
6.0  6.0  6.0  6.0  6.3  6.3  6.3  6.5  6.5  6.5
6.7  6.7  6.7  6.7  7.0  7.0  7.0  7.0  7.3  7.3
7.3  7.3  7.5  7.5  7.5  7.5  7.7  7.7  7.7  7.7
8.0  8.0  8.0  8.0  8.3  8.3  8.3  8.5  8.5  8.5
8.5  8.7  8.7  9.0  9.0  9.0  9.3  9.3  10.0
Construct a frequency distribution of the sleep data.
SOLUTION
To construct a frequency distribution, we first notice that the minimum
hours of sleep is 4.5 and the maximum is 10.0. We choose class intervals of
length 1.2 hours as a matter of convenience.
The selection of class boundaries is a bit of fussy work. Because the data
have one decimal place, we could add a second decimal to avoid the possibility of any observation falling exactly on the boundary. For example, we could
end the first class interval at 5.45. Alternatively, and more neatly, we could
write 4.3–5.5 and make the endpoint convention that the left-hand end
point is included but not the right.
The first interval contains 5 observations so its frequency is 5 and its relative frequency is 5/59 = .085. Table 5 gives the frequency distribution. The relative frequencies add to 1, as they should (up to rounding error) for any frequency distribution. We see, for instance, that just about one-third of the students, .271 + .051 = .322, got 7.9 hours or more of sleep.
Remark: The rule requiring equal class intervals is inconvenient when
the data are spread over a wide range but are highly concentrated in a
small part of the range with relatively few numbers elsewhere. Using
smaller intervals where the data are highly concentrated and larger intervals where the data are sparse helps to reduce the loss of information due
to grouping.
TABLE 5 Frequency Distribution for Hours of Sleep Data (left endpoints included but right endpoints excluded)
Class Interval    Frequency    Relative Frequency
 4.3–5.5              5         5/59 = .085
 5.5–6.7             15        15/59 = .254
 6.7–7.9             20        20/59 = .339
 7.9–9.1             16        16/59 = .271
 9.1–10.3             3         3/59 = .051
Total                59                1.000
In every application involving an endpoint convention, it is important that you
clearly state which endpoint is included and which is excluded. This information
should be presented in the title or in a footnote of any frequency distribution.
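The counting step, together with the endpoint convention, can be sketched as follows. The code assumes the sleep data of Table 4 and the class boundaries of Table 5, and it uses the standard bisect module so that a value equal to a boundary, such as 5.5, falls in the class to its right (left endpoint included, right endpoint excluded). The function name class_frequencies is hypothetical.

```python
from bisect import bisect_right

# Hours of sleep for the 59 students of Table 4.
sleep = [4.5, 4.7, 5.0, 5.0, 5.3, 5.5, 5.5, 5.7, 5.7, 5.7,
         6.0, 6.0, 6.0, 6.0, 6.3, 6.3, 6.3, 6.5, 6.5, 6.5,
         6.7, 6.7, 6.7, 6.7, 7.0, 7.0, 7.0, 7.0, 7.3, 7.3,
         7.3, 7.3, 7.5, 7.5, 7.5, 7.5, 7.7, 7.7, 7.7, 7.7,
         8.0, 8.0, 8.0, 8.0, 8.3, 8.3, 8.3, 8.5, 8.5, 8.5,
         8.5, 8.7, 8.7, 9.0, 9.0, 9.0, 9.3, 9.3, 10.0]

# Class boundaries are typed as literals rather than built by repeated
# addition of 1.2, so a data value such as 5.5 compares equal to its boundary.
boundaries = [4.3, 5.5, 6.7, 7.9, 9.1, 10.3]


def class_frequencies(values, boundaries):
    """Count values per class; left endpoint included, right endpoint excluded."""
    counts = [0] * (len(boundaries) - 1)
    for x in values:
        if not boundaries[0] <= x < boundaries[-1]:
            raise ValueError(f"{x} lies outside the class intervals")
        counts[bisect_right(boundaries, x) - 1] += 1
    return counts


frequencies = class_frequencies(sleep, boundaries)
n = len(sleep)
for left, right, f in zip(boundaries, boundaries[1:], frequencies):
    print(f"{left:4.1f}-{right:4.1f}  {f:2d}  {f / n:.3f}")
```

Switching to the opposite convention (right endpoint included) would change the counts for any class with observations exactly on a boundary, which is why the convention must be stated with the table.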
Histogram
A frequency distribution can be graphically presented as a histogram. To draw a
histogram, we first mark the class intervals on the horizontal axis. On each interval,
we then draw a vertical rectangle whose area represents the relative frequency—
that is, the proportion of the observations occurring in that class interval.
To create rectangles whose area is equal to relative frequency, use the rule
$$ \text{Height} = \frac{\text{Relative frequency}}{\text{Width of interval}} $$
The total area of all rectangles equals 1, the sum of the relative frequencies.
The total area of a histogram is 1.
The histogram for Table 5 is shown in Figure 7. For example, the rectangle drawn on the class interval 4.3–5.5 has area = .071 × 1.2 = .085, which is the relative frequency of this class. Actually, we determined the height .071 as
$$ \text{Height} = \frac{\text{Relative frequency}}{\text{Width of interval}} = \frac{.085}{1.2} = .071 $$
The units on the vertical axis can be viewed as relative frequencies per unit
of the horizontal scale. For instance, .071 is the relative frequency per hour for
the interval 4.3– 5.5.
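A short sketch of the height rule, assuming the class boundaries and frequencies of Table 5, is given below. It computes each height as relative frequency divided by interval width and then checks that the rectangle areas add to 1; the function name histogram_heights is an illustrative choice.

```python
# Class boundaries and class frequencies from Table 5.
boundaries = [4.3, 5.5, 6.7, 7.9, 9.1, 10.3]
frequencies = [5, 15, 20, 16, 3]


def histogram_heights(boundaries, frequencies):
    """Height = relative frequency / width, so each area is a relative frequency."""
    n = sum(frequencies)
    heights = []
    for left, right, f in zip(boundaries, boundaries[1:], frequencies):
        heights.append((f / n) / (right - left))
    return heights


heights = histogram_heights(boundaries, frequencies)

# Total area of the rectangles: sum of height x width over all classes.
total_area = sum(h * (right - left)
                 for h, left, right in zip(heights, boundaries, boundaries[1:]))

for left, right, h in zip(boundaries, boundaries[1:], heights):
    print(f"{left:4.1f}-{right:4.1f}  height {h:.3f}")
print(f"total area {total_area:.3f}")
```

Because the widths are passed in explicitly, the same calculation applies unchanged when the class intervals have unequal lengths.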
Figure 7 Histogram of the sleep data of Tables 4 and 5. Sample size = 59. (Horizontal axis: Hours sleep, with class boundaries 4.3, 5.5, 6.7, 7.9, 9.1, 10.3; vertical axis: Relative frequency per hour; values marked on the rectangles: .085, .254, .339, .271, .051.)
Visually, we note that the rectangle having largest area, or most frequent class interval, is 6.7–7.9. Also, proportion .085 + .254 = .339 of the students slept less than 6.7 hours.
Remark: When all class intervals have equal widths, the heights of the rectangles are proportional to the relative frequencies that the areas represent. The
formal calculation of height, as area divided by the width, is then redundant. Instead, one can mark the vertical scale according to the relative frequencies—
that is, make the heights of the rectangles equal to the relative frequencies. The
resulting picture also makes the areas represent the relative frequencies if we
read the vertical scale as if it is in units of the class interval. This leeway when
plotting the histogram is not permitted in the case of unequal class intervals.
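To see why the shortcut fails for unequal widths, consider a hypothetical regrouping that is not taken from the text: the last two classes of Figure 7 are merged into a single interval 7.9–10.3 of width 2.4. The Python sketch below, with made-up variable names, compares the merged class with the class 6.7–7.9 under both scalings.

```python
# Hypothetical regrouping: the last two classes of Figure 7 are merged,
# so the final interval is twice as wide as the others.
boundaries = [4.3, 5.5, 6.7, 7.9, 10.3]
rel_freq = [0.085, 0.254, 0.339, 0.271 + 0.051]

widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]

# Correct heights: relative frequency per unit of the horizontal scale.
heights = [rf / w for rf, w in zip(rel_freq, widths)]

# Areas under the correct heights still total 1.
correct_area = sum(h * w for h, w in zip(heights, widths))

# Ratio of the merged class to the 6.7-7.9 class under each scaling.
naive_ratio = rel_freq[3] / rel_freq[2]    # heights set to relative frequencies
correct_ratio = heights[3] / heights[2]    # heights set by the area rule

print(f"correct total area = {correct_area:.3f}")
print(f"merged class vs 6.7-7.9, naive heights   = {naive_ratio:.2f}")
print(f"merged class vs 6.7-7.9, correct heights = {correct_ratio:.2f}")
```

With relative frequencies used as heights, the wide class would be drawn almost as tall as the class 6.7–7.9, and its area would then be about twice what it should be. Dividing by the width restores the area interpretation.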
Figure 8 shows one ingenious way of displaying two histograms for comparison.
In spite of their complicated shapes, their back-to-back plot as a “tree” allows for
easy visual comparison. Females are the clear majority in the last age groups of
the male and female age distributions.
Stem-and-Leaf Display
A stem-and-leaf display provides a more efficient variant of the histogram for
displaying data, especially when the observations are two-digit numbers. This
plot is obtained by sorting the observations into rows according to their leading
digit. The stem-and-leaf display for the data of Table 6 is shown in Table 7. To
make this display:
1. List the digits 0 through 9 in a column and draw a vertical line. These
correspond to the leading digit.
2. For each observation, record its second digit to the right of this vertical
line in the row where the first digit appears.
3. Finally, arrange the second digits in each row so they are in increasing order.
Figure 8 Population tree (histograms) of the male and female age distributions in the United States in 2007. Male: N = 148.7 million; Female: N = 153.0 million. Vertical axis: age, from 0 to 100 and over. (Source: U.S. Bureau of the Census.)
TABLE 6 Examination Scores of 50 Students
75  86  68  49  93  84  98  78  57  92
85  64  42  37  95  83  70  73  75  99
55  71  62  48  84  66  79  78  80  72
87  90  88  53  74  65  79  76  81  69
59  80  60  77  90  63  89  77  58  62

TABLE 7 Stem-and-Leaf Display for the Examination Scores
0 |
1 |
2 |
3 | 7
4 | 289
5 | 35789
6 | 022345689
7 | 01234556778899
8 | 00134456789
9 | 0023589
In the stem-and-leaf display, the column of first digits to the left of the vertical line is viewed as the stem, and the second digits as the leaves. Viewed sidewise, it looks like a histogram with a cell width equal to 10. However, it is more
informative than a histogram because the actual data points are retained. In fact,
every observation can be recovered exactly from this stem-and-leaf display.
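The three steps for making the display can be sketched in Python. This is a minimal illustration, assuming integer observations from 0 to 99. The function name `stem_and_leaf` is hypothetical and the scores are those of Table 6. The last lines check that the sorted observations are recovered from the display.

```python
from collections import defaultdict

# Examination scores of 50 students (Table 6).
scores = [
    75, 86, 68, 49, 93, 84, 98, 78, 57, 92,
    85, 64, 42, 37, 95, 83, 70, 73, 75, 99,
    55, 71, 62, 48, 84, 66, 79, 78, 80, 72,
    87, 90, 88, 53, 74, 65, 79, 76, 81, 69,
    59, 80, 60, 77, 90, 63, 89, 77, 58, 62,
]


def stem_and_leaf(data):
    """Return {stem: sorted list of leaves} for integers from 0 to 99."""
    rows = defaultdict(list)
    for x in data:
        stem, leaf = divmod(x, 10)  # leading digit, second digit
        rows[stem].append(leaf)     # step 2: record the leaf in its row
    # Step 1 lists every stem 0-9; step 3 sorts the leaves in each row.
    return {stem: sorted(rows[stem]) for stem in range(10)}


display = stem_and_leaf(scores)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")

# Every observation can be recovered from the display.
recovered = [10 * stem + leaf for stem, leaves in display.items() for leaf in leaves]
assert recovered == sorted(scores)
```

The original order of the scores is not retained, only their values, so the check compares against the sorted list.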
A stem-and-leaf display retains all the information in the leading digits of
the data. When the leaf unit = .01, 3.5 | 0 2 3 7 8 presents the data 3.50, 3.52,
3.53, 3.57, and 3.58. Leaves may also be two-digit at times. When the first leaf digit
= .01, .4 | 07 13 82 90 presents the data 0.407, 0.413, 0.482, and 0.490.
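Reading a row with a stated leaf unit amounts to appending each leaf to the stem. The short Python sketch below illustrates this for the two rows just described. The helper `expand_row` is a hypothetical name, and `Decimal` is used only so that trailing zeros such as 3.50 are kept.

```python
from decimal import Decimal


def expand_row(stem, leaves):
    """Rebuild the observations in one row by appending each leaf to the stem."""
    return [Decimal(stem + leaf) for leaf in leaves]


# Leaf unit = .01: stem 3.5 with single-digit leaves 0 2 3 7 8.
row_a = expand_row("3.5", ["0", "2", "3", "7", "8"])

# Two-digit leaves: stem .4 with leaves 07 13 82 90.
row_b = expand_row(".4", ["07", "13", "82", "90"])

print([str(x) for x in row_a])
print([str(x) for x in row_b])
```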
Further variants of the stem-and-leaf display are described in Exercises 2.25
and 2.26. This versatile display is one of the most applicable techniques of exploratory data analysis.
When…
