# STAT0024 UCL Social Statistics Assessment Task

STAT0024 Social Statistics: In-course Assessment 2021–2022

The survey company you work for has been invited by a supermarket chain to bid to

design a questionnaire and carry out a sample survey to investigate the views of

their current customers about some ideas the chain has for trying to reduce the

amount of packaging, and especially plastic packaging, used in their stores. For

example they might stop selling pre-packaged fruit and veg, or offer customers the

option of dispensing things like flour or sugar or washing powder into their own

containers. There are clear environmental benefits from such ideas, but they will

only work if there is take-up from the customers. The aim of the survey is to estimate

what the take-up might be in one particular store where they are thinking of running a

trial project. Your task is to produce a first draft of a plan for internal discussion

within your company. This should include the following elements.

1. Sampling

Explain how you propose to carry out the sampling. Topics you should discuss here

include, but are not necessarily limited to, how you will select your sample, how you

will administer the questionnaire, and the advantages of and possible problems with

your preferred approach.

2. Questions

Discuss the design of the questionnaire, both in general terms – broad areas of

questioning, numbers and types of questions – and by giving 3 specific examples of

questions you would include. These examples should be carefully worded and

should include the format of the response.

3. Analysis and presentation

Discuss how you propose to present the results to the supermarket’s managers. As

well as giving an overview, take one of your 3 example questions and explain in

detail how you would present the results for that question. Make up some data if that

helps. Your discussion should include some comments on quantifying the

uncertainty in the results.
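As one illustration of the kind of uncertainty quantification part 3 asks for, the following sketch computes a normal-approximation 95% confidence interval for an estimated take-up proportion. The response categories and counts are entirely made up, as the brief itself invites.

```python
# Illustrative only: made-up counts for a hypothetical take-up question.
import math

responses = {"Definitely would": 112, "Probably would": 95,
             "Probably not": 48, "Definitely not": 25}
n = sum(responses.values())  # total respondents (made-up data)

# Estimated proportion giving one of the two positive answers
p_hat = (responses["Definitely would"] + responses["Probably would"]) / n

# Approximate 95% confidence interval (normal approximation to the binomial)
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"Estimated take-up: {p_hat:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

If the sample were a non-negligible fraction of the store's customer base, a finite population correction would tighten this interval.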

General instructions

Your work should be typed with a minimum font size of 11pt and should not exceed

three A4 pages in total length. The preferred file format is pdf. The three parts of

your answer carry equal weight in the marking scheme. Overall this ICA is worth

15% of the final mark for the module.

Your work should be submitted via MOODLE no later than 4pm UK time on

Tuesday 8th March 2022. Your submitted work should contain only your

student number and not your name or other identifiable information. By

submitting your work you will be deemed to have agreed to the plagiarism and

collusion declaration that appears when you open the submission link.

STAT0024 Social Statistics

Tom Fearn

Department of Statistical Science, University College London.

Term 2: 2021–2022

Contents

1 Introduction
  1.1 Preliminaries, basic literature
  1.2 Social sample surveys — Basic concepts
  1.3 Module Outline

2 Planning a Social Survey
  2.1 Some basic distinctions
    2.1.1 Two areas of application of statistical design theory
    2.1.2 Types of subject matters of social surveys
    2.1.3 Types of objectives
    2.1.4 Possible sources of data
    2.1.5 Types of questioning
  2.2 Basics about sampling
    2.2.1 Why sample?
    2.2.2 Basic sampling terminology
    2.2.3 Types of error
    2.2.4 A note on non-response bias
    2.2.5 When not to sample?

3 Questionnaire Design and Data Visualisation
  3.1 An introductory remark
  3.2 Answer formats
    3.2.1 Open vs. closed questions
    3.2.2 Types of closed questions
    3.2.3 “Do not know”
    3.2.4 Rating scales
  3.3 Further aspects of questionnaire design
    3.3.1 Question wording
    3.3.2 Even more
  3.4 Data Visualisation

4 Measurement
  4.1 Types of measurement scale
    4.1.1 Basic concepts
    4.1.2 The properties of the standard types of scales
  4.2 Attitude measurement with the Likert technique
    4.2.1 Basic principle
    4.2.2 Item selection and check of polarity
    4.2.3 Discussion
    4.2.4 Desirable properties of measurement instruments
    4.2.5 Reliability
    4.2.6 Validity
    4.2.7 Test theory

5 Introduction to Sampling Schemes
  5.1 Types of sampling scheme
  5.2 Some history of opinion polls
  5.3 Simple random sampling: introduction
    5.3.1 How to draw a random sample

6 Sampling Theory: Mathematical Concepts and Notation
  6.1 Sampling Theory Notation
    6.1.1 Population Values
    6.1.2 Sample values
    6.1.3 Binary variables
    6.1.4 Probability
  6.2 Estimates and their standard errors
    6.2.1 A small example
    6.2.2 Subjective samples
    6.2.3 Some additional comments
    6.2.4 Relation with standard model-based statistical theory
  6.3 Sampling Distributions for Simple Random Sampling
    6.3.1 Expectation and variance of a sample value
    6.3.2 Covariance of two sample values
    6.3.3 Expectation and variance of the sample total
    6.3.4 Expectation and variance of the sample mean

7 Estimators for Population-level Parameters and Sample Size Calculation for a Simple Random Sample
  7.1 Estimation of a population mean
  7.2 Expectation of the sample variance
  7.3 Estimation of a population total
  7.4 Estimation of a population proportion
  7.5 Sample size calculations for simple random samples
  7.6 Allowing for drop out

8 Stratified random sampling
  8.1 General Idea
  8.2 How to draw a stratified random sample
  8.3 Notation
  8.4 Estimating the population total or mean
  8.5 Allocation of a stratified random sample
    8.5.1 Proportional allocation
    8.5.2 Comparison of proportional allocation with simple random sampling
    8.5.3 Optimal allocation
    8.5.4 Minimise variance for fixed total cost
    8.5.5 Minimise cost for a given variance
    8.5.6 Neyman allocation
    8.5.7 Comparing Neyman allocation with proportional allocation
    8.5.8 Choice of total sample size
    8.5.9 Proportions
    8.5.10 Example

9 Cluster Sampling
  9.1 Types of cluster sample
  9.2 Relationship with stratified sampling
  9.3 Notation
  9.4 SRS: Estimation of the population mean
  9.5 Equal cluster sizes
  9.6 PPS sampling: Estimation of the population mean
  9.7 Sample size calculation for cluster sampling

10 An Introduction to Missing data
  10.1 Missing data mechanisms
    10.1.1 MCAR
    10.1.2 MAR
    10.1.3 MNAR
    10.1.4 Some more formal definitions
  10.2 Checking MCAR
  10.3 Handling Missing Data
    10.3.1 Complete case analysis
    10.3.2 Inverse probability weighting
    10.3.3 Imputing missing values
    10.3.4 Mean imputation
    10.3.5 Model-based imputation
    10.3.6 Single stochastic imputation
    10.3.7 Multiple stochastic imputation
    10.3.8 Bayesian modelling

1 Introduction

1.1 Preliminaries, basic literature

These notes cover the essentials of the module. They are based on earlier notes by Rex

and Jane Galbraith, Christian Hennig, Gianluca Baio and Aidan O’Keefe.

The notes cover what you need to know to pass the exam. If you want to read more,

either because you want to know more or because you feel an alternative presentation of

some of the topics might help you to understand them, the following references might be

useful. The first one is certainly available as an e-book in the UCL library. Some of the

others may be, though I have not checked.

• Survey planning (mostly fairly light on maths)

Kalton, G., Introduction to Survey Sampling. Sage, 2nd ed., 2021. Short and

practical, recently updated.

Converse, J. M., Presser, S., Survey Questions. Sage, 1986. A good reference for

more details about questionnaire design.

Fink, A., How to Conduct Surveys. Sage, 4th ed., 2009. Practical text mainly for

social scientists without much maths, but covering some interesting practical

issues. There are further relevant books on surveys by A. Fink, for example on

questionnaire design.

Fowler, Floyd J. Jr., Survey Research Methods. Sage, 3rd Edition, 2002. Similar

to the Fink book, and Fowler also wrote on “Improving Survey Questions”.

Hoinville, G., Jowell, R. & associates, Survey Research Practice. Gower, 1985.

Background reading on practical issues.

Moser, C.A. & Kalton, G., Survey Methods in Social Investigation. Gower, 2nd

Edition, 1985. Classic text on survey design, background reading.

• Measurement and Scaling

Allen, M. J., Yen, W. M., Introduction to Measurement Theory. Wadsworth 1979.

Classical psychometrical text on measurement theory.

Crocker, L., Algina, J., Introduction to Classical and Modern Test Theory. Wadsworth,

2006. Covers most of the measurement and scaling chapter, despite its title.

Like Allen and Yen, which covers similar material, it is driven by psychological

applications but useful for social statistics as well.

DeVellis, R. F., Scale Development. Theory and Applications. Sage, 2nd ed., 2003.

Interesting, not very mathematical.

Hand, D. J., Measurement Theory and Practice. Wiley, 2004. Very thoughtful

and interesting interdisciplinary book on measurement theory by a leading

statistician though overlap with this module is limited.

• Sampling theory


Barnett, V., Sample Survey: Principles and Methods. Arnold, 1991. Overlaps with

Scheaffer et al, but covers more statistical theory.

Cochran, W.G., Sampling Techniques. Wiley, 3rd Edition, 1977. A classic text on

sampling.

Scheaffer, R.L., Mendenhall, W & Ott, L. Elementary Survey Sampling. Wadsworth

(Duxbury Press), 5th Edition, 1996. Covers much of the module, though not

measurement and scaling, with a focus on sampling theory.

• Further reading on potentially interesting issues either not covered or only partially discussed in the module

Conrad, F. G., Schober, M. F. (eds.), Envisioning The Survey Interview Of The

Future. Wiley, 2007. A collection of papers discussing the impact of new

technologies and developments on surveys.

Everitt, B. S., Dunn, G., Applied Multivariate Data Analysis. Wiley, 2nd ed., 2001.

On visualisation and analysis of multivariate data occurring in surveys and social

science. Pretty much no overlap with this module!

Little, R. and Rubin, D., Statistical Analysis with Missing Data. Wiley, 1987.

Some of these books contain a lot of examples and interesting discussions. You may also

find more information on the internet.

1.2 Social sample surveys — Basic concepts

A fundamental part of statistical science is an attempt to make inferences about real-world

behaviour using data. We might consider using data to answer questions such as:

• What proportion of the population support the current government policy on healthcare, education etc.?

• What proportion of UK households own a car?

• What is the attitude of the UK population to the introduction of new legislation on

school holidays during term time, the minimum wage, taxation etc.?

How would we attempt to answer such questions?

Sometimes, it is possible to answer questions of interest using data from an entire population. For example:

• Routine birth registration records would allow us to assess the birth rate within the

UK in a particular year.

In other scenarios, it would not be feasible to use data from the entire population. For

example:

• What proportion of the UK population regularly shop at a particular supermarket?


We don’t routinely record such information, and would asking this question of the entire

UK population be feasible or worthwhile?

The answer is almost certainly no. However, this does not imply that we would not

be able to produce an appropriate estimate of the proportion of people who shop regularly at a particular store. We need to think carefully about how we would go about

answering such a question. In short, we would need to consider surveying a suitable

sample of the population of interest. In other words, we would conduct a sample

survey.

Much of STAT0024 concerns how we approach the design of sample surveys and how

we would analyse the data resulting from such surveys, to produce accurate and robust

evidence when attempting to answer questions of interest in social research.

1.3 Module Outline

In this module, we shall concentrate on the following aspects of social statistics:

• An introduction to planning and practical aspects of social surveys, including questionnaire design;

• Methods for social measurement and scaling, particularly the measurement of attitudes using answers from questionnaires;

• Basic presentation and visualisation of data collected in social surveys;

• An introduction to statistical sampling theory for finite populations;

• An introduction to methods for dealing with missing data.

Finite population sampling theory is the core statistical theory for sample surveys. It

deals with the statistical properties of estimators calculated from samples that have been

drawn from finite populations using probability schemes for the sampling.

A major difference between the statistical theory taught in other modules and that encountered in STAT0024 is that in this module we will be drawing samples without

replacement from a finite population. In other scenarios, statistical models permit

potentially infinite repetitions of outcomes of interest. Here the number of possible samples, though it may be large, is finite. We are used to thinking of statistical parameters

such as the expected value of a normally distributed random variable, as being hypothetical quantities. In social statistics, some, though not all, of the numbers to be estimated

by surveys really do exist. We note that traditional statistical models for infinite repetition can sometimes be used approximately for sampling surveys, if the population under

study is very large, the sample size is much smaller and the distribution of the quantity

of interest matches the model assumptions approximately. However, large parts of finite

population sampling theory do not necessarily require distributional model assumptions.
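The finite-population point can be made concrete with a small sketch (ours, not from the notes): the population proportion below really does exist, and a simple random sample drawn without replacement is used to estimate it. Population size, proportion and seed are arbitrary choices for illustration.

```python
import random

random.seed(42)  # reproducibility of this sketch only

N = 10_000
population = [1] * 3_500 + [0] * 6_500  # the true proportion P = 0.35 really exists

n = 200
sample = random.sample(population, n)   # simple random sampling WITHOUT replacement
p_hat = sum(sample) / n                 # sample estimate of the real P

print(f"True P = 0.350, estimate from n = {n}: {p_hat:.3f}")
```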


One important problem when working with surveys and samples is the calculation of

a suitable sample size in order to obtain a certain precision. In addition, we might wish

to compare parameter estimates under different possible sampling schemes, such as simple

random sampling and stratified sampling. Much of the module will be devoted to finite

population sampling theory and to questions of this type.
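The sample size calculation mentioned above can be sketched with the standard formula for estimating a proportion to within a margin of error d at roughly 95% confidence, including the finite population correction. The function name and defaults are ours; p = 0.5 is the conservative worst case for the variance.

```python
import math

def sample_size_proportion(d, N, p=0.5, z=1.96):
    """Smallest n so that the ~95% margin of error is at most d,
    for a simple random sample from a finite population of size N."""
    n0 = z**2 * p * (1 - p) / d**2     # infinite-population sample size
    n = n0 / (1 + (n0 - 1) / N)        # finite population correction
    return math.ceil(n)

# e.g. a (hypothetical) store with 5,000 regular customers, margin of error 5%
print(sample_size_proportion(d=0.05, N=5_000))   # prints 357
```

For very large N the correction is negligible and the familiar figure of roughly 385 respondents for a 5% margin of error is recovered.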


2 Planning a Social Survey

It is desirable that a statistician should be involved at the planning stage of any social

survey. This is so that any statistical problems with the survey design can be avoided.

Such problems could make the survey and its results unreliable or worthless, resulting in

a waste of valuable time and/or resources.

In this chapter, we discuss some basic statistical considerations relevant to planning a

sampling exercise. By sampling, we mean the collection of data from a subset of a population of interest in an effort to make inference about population-level outcomes of interest.

We shall consider types of error that can occur when sampling.

2.1 Some basic distinctions

It is generally useful to think about the following distinctions when planning a social

investigation.

2.1.1 Two areas of application of statistical design theory

Comparative experiments. Subjects or units are chosen in some convenient way and

are then allocated to receive different treatments according to some rule, often by

randomisation, with the aim of comparing the responses to the different treatments.

Example: clinical trial.

Sample surveys. Individuals or units are chosen in some way from a population with the

aim of making some inference representative of the population. Example: opinion

poll.

The more commonly made distinction here would be between comparative experiments,

where the experimenter makes some intervention by applying treatments, and observational studies, where the researcher simply observes the state of things. A sample survey

is one example of an observational study. The distinction is important because an observational study can establish associations, for example you might find that people who

express positive attitudes towards caring for the environment tend to use less energy in

their homes, but can say very little about causation. There may be underlying unobserved factors that explain the association, i.e. both may be caused by something we

haven’t measured. On the other hand one might imagine an experiment in which 100

subjects were randomised to two groups, with one group getting information about the

environmental damage due to energy production and the other group getting information

about something unrelated. By monitoring the energy consumption of the two groups

before and after the intervention, it would be possible to make a much stronger inference

than we could from the survey. Of course such an experiment would be much more costly

and time consuming to carry out than the survey, and there are many areas of social

investigation where experiments are not possible for either practical or ethical reasons,


but it is always worth considering whether a comparative experiment might be preferable

before opting for observation.
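The randomisation step in the hypothetical experiment above might look like this sketch (subject IDs and group sizes are illustrative, not prescribed by the notes):

```python
import random

random.seed(1)  # reproducibility of this sketch only

subjects = list(range(100))      # 100 hypothetical subject IDs
random.shuffle(subjects)         # random order removes allocation bias
treatment, control = subjects[:50], subjects[50:]   # two equal groups

print(len(treatment), len(control))   # 50 subjects per group
```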

2.1.2 Types of subject matters of social surveys

According to Moser and Kalton (1985), the subject matters of social surveys usually fall

into one of the following categories.

Demographic characteristics. This means matters such as family and household composition, gender, marital status, age and so on. Some basic demographic factors are

acquired in almost every social survey so that their association with the variables

of primary interest can be explored.

Social environment of people. Social, economic and ecological factors to which people

are subject, including occupation and income as well as housing conditions and social

amenities.

Activities of people. This refers to activities like leisure habits, consumerism or travelling rather than occupation, which forms part of the social environment.

Opinions and attitudes. A particular problem with opinions and attitudes is that the

more complex ones, e.g. attitude towards the welfare state, cannot be established

with a single question but need to be inferred from the responses to multiple questions. More about this in Chapter 4.

2.1.3 Types of objectives

Often the choice of the adequate methodology depends on the objective of a study. Here

is one possible typology of objectives:

Description. Descriptive studies are about the collection of as precise as possible quantitative information about a population, for example for planning purposes. Official

statistics are descriptive in most cases. If sampling is used in a descriptive study,

statistical sampling theory is very important because it deals with the precision of

estimators.

Exploration. Explorative studies aim to find previously unknown patterns and to give

the researchers new ideas and information about a population or topic that has

not necessarily been investigated adequately. Such studies could deal with, for

example, cultural changes or with the determination of potential reasons for newly

observed phenomena. An example might be to investigate people’s reasons for

wearing or not wearing face masks during an epidemic. Explorative studies often

have a tentative character. They often work with non-probabilistic convenience

samples and qualitative open questions. Sampling theory is rarely important here.

Often, explorative surveys lead to clearer scientific hypotheses and theories which

can then be examined by a more focused survey.


Examination of scientific hypotheses/theories. Theories to be examined by such

studies should be formulated in a testable way, for example “energy consumption

depends much more strongly on the energy price than on attitudes towards the environment”. These studies are usually more focused than descriptive or explorative

studies. Statistical theory is almost always needed, because the results should be reliable. As noted above, comparative experiments may sometimes be a better choice

for this objective than sample surveys. A “scientific hypothesis” is formulated in

terms of the subject matter. A “statistical hypothesis” to be tested by a statistical

test is not exactly the same, but is often derived from a scientific hypothesis.

Evaluation and quality control. Example: UCL course evaluation questionnaires, but

also analysis of the effects of new laws, monitoring of costs.

Decision support. Example: product planning of a company, but also opinion polls for

political parties.

2.1.4 Possible sources of data

Carrying out a survey. How to design and carry out a survey is the main theme of

this module, so there will be much more on this later. Before you embark on this

approach, however, it is worth considering if the information you seek could be

obtained more easily from other sources.

Obtaining data from documentary sources. For example, data about marital status and occupation is available from population censuses or other official sources.

Data about numbers of visitors can be obtained from cultural institutions. Even individual data such as health and income tax records are held by hospitals and other

institutions. Such data, however, are sometimes not available because of data protection rules; this is becoming increasingly true. Also, data that have been collected

for a different purpose may not really fit the problem at hand.

Direct observations. Often valuable data can be directly observed instead of being

elicited by a questionnaire. Examples are numbers of people using a particular tube

line, some visible housing conditions, the analysis of shopping baskets in marketing

research or the number of recyclable items in waste bins.

When questionnaires ask survey participants for facts that can be verified by direct

observation, the answers in the questionnaire often turn out to be unreliable. Therefore, direct observation of such facts is better whenever possible. Sometimes it is

possible but expensive or difficult to observe some facts directly, for example daily

time of watching TV in a household. In such a situation, answers to the questionnaire could be “validated”, i.e. checked, for a small subsample of the respondents

by direct observation.

Using data from other surveys. An elementary step in the planning of every survey

should be the search for already published related surveys in the literature. Sometimes, there has already been a survey with the same objective. In some other

situations, relevant data may have been collected in a survey with a different aim


or in the framework of a more general data collection. Even though such data may

not be perfectly appropriate for a problem different from the original purpose, it

can often save a large amount of money and effort to use existing data. Even if

you decide to run your own survey after all, the results of the other surveys may

help you with the design of yours. As we will see later, it helps at the design stage

to know things like variances of the quantities you will be measuring, and the data

from other surveys asking similar questions are a good source of such information.

2.1.5 Types of questioning

Personal interview with interviewer.

Advantages:

• Comparatively small proportion of non-responses.

• Conditions under which the answers are given are known and can be controlled

to some extent. For example it might be possible to experiment with giving

respondents different amounts of background information before asking a question.

• Problems with the interviewee’s understanding of the questions can be resolved.

Disadvantages:

• Expensive.

• Interviewer effects. Answers may depend on the interviewer’s precise wording,

the way they look or speak or treat the interviewee. Not all interviewers are

reliable and may cause bias by asking neighbours or friends instead of finding

the interviewees they were supposed to meet. Good training of the interviewers

is necessary to minimise interviewer effects.

Telephone interview.

Advantages:

• Same as for personal interview but cheaper. In particular it is much easier to

make repeated attempts to contact the interviewee.

Disadvantages:

• Interviewer effects, though interviewers can be monitored more easily.

• Telephone numbers change a lot more often than addresses.

• Many people are nowadays annoyed by commercial telephone calls. Some companies disguise their advertisement calls as survey calls. Therefore many interviewees may refuse to answer. Personal interviews are treated more seriously

by the interviewees.

Mail questionnaire (postal survey).

Advantages:


• No interviewer effects; all interviewees are treated in the same way.

• The interviewee does not have to be at home or be willing to answer their

phone at the moment they are contacted.

• Directories of postal addresses are usually more reliable than lists of telephone

numbers or email addresses.

Disadvantages:

• Usually high non-response rate; interviewees often have to be reminded more

than once to send back the questionnaire.

• Problems with the interviewee’s understanding of the questions are usually not

resolved.

Email questionnaire Often this will be a web-based questionnaire in which people are

invited by email to participate.

Advantages:

• Cheap.

• No interviewer effects.

• The interviewee does not have to be at home when contacted.

Disadvantages:

• Usually very high non-response rate. Emails are much more easily ignored or

deleted than postal mail.

• Reliable email address directories do not exist. The population accessible by

email is often very severely biased compared with the target population the

researcher has in mind. Accessible here means that not only does a person

have an email address, but it is also possible for the researcher to find it.

• Problems with the interviewee’s understanding of the questions are usually not

resolved.

• People may drop out because of technical problems such as unstable internet connections, or because they are distracted by a text or call coming in whilst they deal with the questionnaire on their phone.

• Many web-based questionnaires have design problems, such as refusing to allow the user to continue without having responded to a question to which they do not want to respond.

Asking the audience

Questions may be asked on TV or radio programmes, on websites or via social media,

and the audience is asked to answer by calling, filling in a form on the website or

some other immediate method. Here, the relation of the respondents to any well-defined population of interest is unclear and statistical inference is rarely justified.

One exception is questionnaires on commercial websites. If the target population

is customers of the website, this is the most efficient and economical way to access


this population. What you won’t find out, of course, is why people don’t use your

website.

2.2 Basics about sampling

2.2.1 Why sample?

The short answer is ‘cost’ but one must be careful about this.

• The cost per element is usually much higher in a sample survey than in a complete

enumeration.

• The total cost, however, is usually lower. This is because the size of sample needed

to give a satisfactory answer is often very much lower than the population size.

• Sometimes it is the cost to the respondent that is important. For example, to keep

the total interview time manageable, we may ask respondents a subset of the total

questions asked.

2.2.2 Basic sampling terminology

A population is a collection of elements on which a measurement is taken, e.g. in an opinion poll:

element → a voter
population → all U.K. voters

Furthermore, there is a distinction between the target population about which we seek

information and the study population which is the population we actually study. They

may or may not be the same.

e.g. in market research conducted via email with a link to a web questionnaire,

target population: all individuals in a particular age group; study population: all individuals who have email addresses available to the researchers and who can access the linked questionnaire.

More precisely, in case of questionnaires the study population is actually restricted to only those individuals who would respond to the questionnaire if

asked because there is no way to gather information about potential non-respondents.

Usually the study population is a much more limited accessible population whose properties we hope we can extrapolate to the target population. The difference between study

and target population is very basic; it can lead to severe bias, sometimes called “coverage

bias”, in generalizing study results to the target population if the study population differs

significantly from it, something which happens very often but is not easy to observe.

Sampling units are non-overlapping collections of elements that cover the entire population. A sampling unit might be an element (e.g., an individual voter) or a group of


elements, e.g. a household.

A sampling frame is a (typically imperfect) list of sampling units, e.g., a list of addresses,

an electoral roll, or a telephone directory. A sample is a collection of sampling units drawn

from a sampling frame.

A complete example: suppose we want to determine the proportion of unsafe car tyres.

Target Population: All tyres of all UK licensed cars

Study population: This will depend on the sampling scheme adopted, but it is hard to

imagine a scheme that could possibly access all of the target population, and easy

to imagine the biases that will arise if participation is voluntary

Sampling units: UK licensed cars

Sampling frame: Many possibilities, e.g. a list of all UK licensed cars, all cars passing a

particular point on a day, etc.

Sample: The cars on which we take measurements.

2.2.3 Types of error

Sampling error. Random variation due to sampling scheme. For probability sampling,

the statistical properties of the sampling error can be estimated. This is a core topic

of the sampling theory introduced in Chapter 5.

Non-sampling error. This mainly takes the form of bias, and arises either from problems with the design or implementation of the sampling scheme, or from non-response. Some common sources of bias are the following:

Selection bias:

• frame ≠ population

• probability sampling scheme not used (e.g. convenience sampling), or used but not followed properly

• non-response bias

Response bias:

• measurements are problematic in some way, e.g. poor question wording, misinterpretation of questions, sensitivity of information, lack of memory or improper observation

• interviewer bias: interaction between interviewer and interviewee influences the response

Non-sampling errors are difficult to estimate. If they can be estimated at all, then it

will usually only be by a separate investigation. Particularly in the case of attitude

and opinion measurement, but also elsewhere, it may sometimes be doubtful whether

there is any objective truth — which means that bias may be hard to define, let

alone estimate.
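By contrast, the sampling error of a simple estimate can be quantified directly. As a minimal sketch (in Python, with made-up numbers), here is how one might compute the standard error and an approximate 95% confidence interval for a proportion estimated from a simple random sample:

```python
import math

def proportion_se(p_hat: float, n: int) -> float:
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical survey result: 312 of 600 sampled customers answer "yes".
n = 600
p_hat = 312 / n
se = proportion_se(p_hat, n)

# Approximate 95% confidence interval via the normal approximation.
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"estimate {p_hat:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

Note that such an interval only reflects sampling error; non-sampling error does not shrink as n grows and is not captured by it.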


2.2.4 A note on non-response bias

Non-responses are a very important problem in practice (we will give a brief introduction

to the main statistical issues related with non-response and missing data in general in

Chapter 10). Many surveys have response rates well below 50% for the questionnaire as

a whole; non-response rates for certain questions may be even higher.

Not giving an answer to a question is essentially different behaviour from giving one. Therefore it is never valid to assume that the respondents are representative of the non-respondents, or that the “true” answers of the non-respondents would be distributed similarly. Reasons for non-response can be . . .

• . . . that the interviewee didn’t find that the suggested categories for answers cover

their point of view. This can be because the interviewee holds a neutral or “do not

know” position and this category is not offered; but there are other possibilities for

a missing category,

• . . . that the interviewee didn’t understand the question,

• . . . that the interviewee didn’t want to admit their position honestly, but they also

didn’t want to lie, in which case a non-response is certainly better than a lie,

• . . . that the interviewee didn’t like the question for one reason or another, for example, they may have felt that their own viewpoint was formulated in a discrediting

way,

• . . . distraction or lack of time of the interviewee, especially if the questionnaire is a

long one.

All of these reasons could explain missing responses to single questions, but in some

situations the whole response is missing because of

• inability to contact the interviewee: unsuccessful telephone call, interviewer didn’t

meet interviewee at home, interviewee didn’t send back questionnaire.

It is usually a good strategy to contact the interviewee again to try to obtain the missing

response. Also it is worthwhile to have a look at the demographic characteristics of the

non-responding part of the population to see whether non-responses are mainly caused

by particular parts of the population, about which the survey then must be interpreted

as less informative.
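One simple way to examine this is to compare the demographic composition of the respondents with the known composition of the sampling frame. The following sketch (with made-up numbers; the age bands and frame proportions are purely illustrative) computes a chi-square goodness-of-fit statistic for such a comparison:

```python
# Hypothetical check: do respondents' age bands match the sampling frame?
frame_props = {"18-34": 0.30, "35-59": 0.45, "60+": 0.25}  # frame composition
respondents = {"18-34": 52, "35-59": 148, "60+": 100}      # observed responses

n = sum(respondents.values())  # total respondents (here 300)
chi2 = sum(
    (respondents[band] - n * p) ** 2 / (n * p)
    for band, p in frame_props.items()
)
# The 5% critical value for df = 2 is 5.991; a larger statistic suggests
# the respondents are not representative of the frame by age.
print(f"chi-square = {chi2:.2f}")
```

A large statistic here would suggest that the survey is less informative about the under-represented age bands.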

Even with follow up, there will be many cases where the interviewee neither wants to give

an answer nor wants to explain why. There is always a proportion of non-responses that

is essentially irreducible, and nothing can be concluded about these people. It is therefore

important always to report the proportion of non-responses, and their characteristics, as

far as possible.

Generally, the study population to which a survey actually allows generalisation can only

include those individuals who would respond if they were asked, because we cannot study

those who do not or would not give us information. Note the use of “would” here; we

do not get the information from the non-respondents, and we cannot generalise the data


from the respondents to those who would not respond if asked. Most of these “potential

non-respondents” may not actually have been asked.

Experience shows that this construction is somewhat confusing, and the issue is therefore

ignored in most textbooks. In those books, the potential non-respondents are not excluded

from the study population by definition, but this results in an over-optimistic idea of the

possibility of generalisation.

2.2.5 When not to sample?

In some situations the full population is easily available; thus we do not have to sample, although sampling may still be reasonable to keep the cost of the survey down. In these cases, probability models and statistical inference are not necessary; we just need descriptive statistics to say something sensible about the quantities under study.

Example: If every participant of a course rates the course on the UCL questionnaire

on a scale between -2 and 2, it is meaningless to ask what the precision of the average

rating is and whether it would be significantly larger than 0, because there is no larger

population in which there would be an unknown “true” value. If not all students fill in the

questionnaire, or not all are present, it is still not a “sample” but a problem with missing

responses. It is not clear whether the respondents are in any sense representative of the

non-respondents.

Sometimes it may be possible to obtain data for a full study population, but one that deviates in some way from the target population that the researcher had in mind originally.

In such situations it may be preferable to draw a sample from the real target population,

assuming that this is possible, rather than to use the full data from a biased population.


3 Questionnaire Design and Data Visualisation

3.1 An introductory remark

Before discussing detailed aspects of questionnaire design it is important to note that

there may be deep philosophical issues with the attempt to measure something objective

by questionnaires. For example, you will encounter significantly different distributions of

responses depending on the chosen wording of what seems to be more or less the same

question in terms of content. Particularly when it comes to opinions and attitudes, it is

highly problematic to assume that there is a true or correct answer for an individual or a

best way to ask the question.

3.2 Answer formats

3.2.1 Open vs. closed questions

An open question is a question to which the respondent can answer with a self-formulated

text, while a closed question asks either for a number or for a choice of one or more

categories from a pre-defined list of categories.

The advantage of open questions is that the respondent is not restricted by the pre-defined

format of the answer. In many situations, it is impossible or very difficult for the designers

of the questionnaire to predict all possible answers. Example: “How do you think the

work in your department can be improved?” The main aim of a study using this question

may be exploratory, i.e., to find new ideas for improving the work. Obviously, this does

not work with pre-defined categories for the answers.

The disadvantage of open questions compared to closed questions is that they are not

suitable for statistical evaluation without complicated and labour intensive pre-processing,

and much information is typically lost in such pre-processing. For this reason the focus

of this module will be on closed questions.

Closed questions have another advantage, namely that the offered categories for the answers support the aim that every interviewee understands the question in more or less

the same way, and they may remind the interviewee of events that they otherwise do not

recall. For example, it is easier to select the magazines and newspapers you have read

in the last year from a list than to remember all of them when confronted with an open

question.

Mixed formats are possible: “What is the most important issue for the government in the

next year? crime/environment/education/other, namely:. . . ”.

Sometimes it is useful to use open questions in a small preliminary survey and to switch to

pre-defined alternatives in the main survey, because the open question in the preliminary

survey can be used to find the alternatives that the respondents have in mind.


3.2.2 Types of closed questions

There are different types of closed questions:

Questions asking for a number such as “how old are you?”

Binary questions for which there are essentially only two answers. For example yes/no

or smoker/non-smoker.

Questions asking for one out of a list of non-ordered categories which is not necessarily exhaustive such as “What is the most important issue for the government

in the next year? crime/environment/education/other”

Questions asking for a position on an ordered scale such as “strongly agree / agree

/ neither agree nor disagree / disagree / strongly disagree” or “If the government

takes action against inflation, it may happen that unemployment goes up. Where

would you place yourself on a scale about what you think the government should

give priority to:

Reduce inflation 1 2 3 4 5 6 7 Reduce unemployment.”

Further types of questions can be seen as versions of the main types above.

Questions asking for more than one out of a list of categories such as “On which

of these activities did you spend more than one hour in the last three weeks? Reading/Sports/Listening to music/Playing a musical instrument/Painting/Hiking/Meeting

friends in a pub”. Such questions, though not asked in a binary fashion, are typically evaluated as a series of binary questions: a yes/no variable is defined for each

of the categories, e.g., “reading: yes/no” and so on.
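This expansion into indicator variables can be sketched as follows (in Python; the responses are made up for illustration):

```python
# The activity categories from the example question above.
categories = ["Reading", "Sports", "Listening to music",
              "Playing a musical instrument", "Painting", "Hiking",
              "Meeting friends in a pub"]

def expand(selected):
    """Turn one respondent's set of ticked categories into 0/1 indicators."""
    return {c: int(c in selected) for c in categories}

# Three hypothetical respondents (the last ticked nothing).
responses = [
    {"Reading", "Hiking"},
    {"Sports", "Reading", "Meeting friends in a pub"},
    set(),
]
indicator_rows = [expand(r) for r in responses]

# Column totals give the frequency of each activity across respondents.
totals = {c: sum(row[c] for row in indicator_rows) for c in categories}
print(totals["Reading"])  # 2
```

Each category is then analysed exactly like a separate binary question.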

Categorized questions asking for a number such as “When did you last play a board

game? Less than three months ago/three months to one year ago/one to five years

ago/longer ago/never”. Usually this is not as suitable for statistical evaluation as asking directly for the number: statistics such as the mean, median or standard deviation cannot be computed. It can nevertheless be preferable if the question refers to events or numbers that are usually not exactly memorized.

Generally, there are situations where each of these possible formats is the most useful one.

A piece of general advice is not to mix questions with different formats too much, as this

can confuse the respondents and also make the evaluation more complex.

3.2.3 “Do not know”

It is possible to add a “do not know” or “cannot choose” option to all of the formats given

above.

If no such option is offered, people still can choose not to answer the question. The

problem with this is that there may be different reasons not to give an answer of which

“do not know” is only one. See also 2.2.4.


Empirical evidence suggests that the number of people ticking “do not know” if such an

option is offered is larger than the number of people who refuse to answer in the case that

there is no “do not know” option. This implies that some people allow themselves to be

forced into a decision if “do not know” is not offered. This can be seen as a reason for

or against a “do not know” option. Pressuring people to answer the question may result

in more responses, but the value of the extra responses may be doubtful.

Not offering a “do not know” option suggests to some people that everybody should have an

opinion about this question, which may be seen as ethically problematic.

Note that, in terms of data analysis, “do not know” is essentially different from a neutral

or middle position in a symmetric scale. You can have a neutral position if you have

thought quite a lot about the topic, but then “do not know” does not capture your opinion

adequately. Therefore, the proportion of “do not know” answers can be interesting, but

the actual occurrence of a “do not know” should not be treated as a number in the middle

of a scale.
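In practice this means keeping “do not know” out of the numeric analysis while still reporting its frequency. A minimal sketch (in Python, with made-up responses):

```python
# Hypothetical Likert responses on a 1-5 scale; None marks "do not know".
responses = [4, 5, None, 3, 2, None, 4, 5, 1, 3]

# "Do not know" is excluded from numeric summaries, not coded as 3.
substantive = [r for r in responses if r is not None]
mean_score = sum(substantive) / len(substantive)

# Its proportion is reported separately, as a quantity of interest in itself.
dk_rate = responses.count(None) / len(responses)

print(f"mean of substantive answers: {mean_score:.2f}")  # 3.38
print(f"'do not know' proportion:    {dk_rate:.0%}")     # 20%
```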

3.2.4 Rating scales

Often, ordered rating scales are used to evaluate the strength of a respondent’s view on

a topic of interest. There are some decisions to be made about such scales, and there is

generally no agreement in the literature about these choices.

Descriptions or just numbers (and which numbers)? Here are two rating scales

that have appeared in examples before: “strongly agree/agree/neither agree nor

disagree/disagree/strongly disagree” and “Reduce inflation 1 2 3 4 5 6 7 Reduce unemployment.” One scale only describes the extremes and gives numbers in between,

the other one gives descriptions for all categories. I haven’t seen any empirical

evidence that suggests that this choice makes a big difference.

Many people are not familiar with rating numbers, so describing the categories may

lead to a better understanding. For more than five points on the scale, however,

finding adequate descriptions will be quite difficult. If these scales are eventually

analysed as numbers, which as discussed in Chapter 4 often happens, it is more

honest to use numbers. There are also questionnaires in which just a sequence

of boxes is offered with a description of the extremes and neither numbers nor

descriptions for the categories in between.

It is possible to use 0 for the middle category and positive and negative numbers (“-3

-2 -1 0 1 2 3” instead of “1 2 3 4 5 6 7”). This leads to mathematically equivalent,

though perhaps more intuitive, analyses. Again, this choice probably does not make

a big difference.

Rating scales are not always symmetric. An example of an asymmetric scale is

“How do you rate your general health? Very good/good/fair/poor”, but note that

this suggests that general health is, or should be, good rather than poor. Asymmetric

scales should therefore only be used if it is generally accepted that one direction is


preferred or seen as more normal than the other. If in doubt it is better to use a

symmetric scale.

Should a middle (neutral) category be offered? If an odd number of points is offered in a scale, the middle one is often a neutral category such as “neither agree

nor disagree”. Some authors advocate the choice of a scale with an even number

of points with the argument that people should not be encouraged to choose the

middle category as an easy way out. As with the argument over whether or not to

include “don’t know” options it is not obvious whether this is a good idea or not.

Generally, if it is reasonable to expect that some respondents would like to choose

a middle category, then a middle category should be offered, because the reliability

of a forced choice is doubtful.

Some authors (e.g., Converse and Presser, 1986) suggest that one should omit the

middle category and offer an additional question like “how strongly do you feel

about the issue?” instead. This assumes that the choice of the neutral category is

usually motivated by a weak intensity of emotion about the issue, which might not

be the case.

How many categories? Many different numbers of categories are used for ordered scales,

most often three, five, seven, or, if no middle ground is offered, two, four, six or ten.

Again, there is no generally accepted recipe here. Considerations are the degree

of differentiation a typical respondent will feel able to make, though this may vary

between respondents, and how precisely the researcher would like to discriminate

responses. The sample size is also relevant. If the sample is not large and the analysis is by category rather than via a numerical scale, then the use of a large number

of categories may spread the data too thinly, so that we end up combining adjacent

categories for the analysis anyway.

3.3 Further aspects of questionnaire design

There are many further important aspects of questionnaire design, most of which lie in

the domain of psychology rather than statistics. Here is a short overview.

3.3.1 Question wording

The impact of the wording of a question on the result can be huge and it is not always

predictable. In a study, some subjects were confronted with question A

“Do you think the United States should forbid public speeches against democracy?”

and others were asked question B

“Do you think the United States should allow public speeches against democracy?”

Question A got 21.4 % yes responses, which suggests that about the same percentage

should answer “no” to question B, but actually question B yielded 47.8% “no”s. This

example is taken from p. 67 of Scheaffer, Mendenhall and Ott, 1996.
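To see how far apart the two wordings are, one can compare the two percentages as independent proportions. The sample sizes are not reported in this excerpt, so the sketch below simply assumes 500 respondents per wording for illustration:

```python
import math

# Percentages from the text; the sample sizes of 500 are an assumption
# made purely for illustration (the actual sizes are not reported here).
n_a, p_forbid_yes = 500, 0.214   # "forbid" wording: 21.4% yes
n_b, p_allow_no = 500, 0.478     # "allow" wording: 47.8% no

diff = p_allow_no - p_forbid_yes
se = math.sqrt(p_forbid_yes * (1 - p_forbid_yes) / n_a
               + p_allow_no * (1 - p_allow_no) / n_b)
lower, upper = diff - 1.96 * se, diff + 1.96 * se

# If the two wordings were equivalent, this interval should cover 0.
print(f"wording effect: {diff:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

Even under these assumed sample sizes, the interval is far from zero, so the gap cannot plausibly be explained by sampling error alone.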


Many such examples are around in the literature, and it is perhaps not surprising that

many people doubt the reliability of the results of any questionnaire survey. Certainly

some media interpretations of poll results such as “75% of the people are against the

proposed EU constitution” depend heavily on the exact question wording, which is usually

not reported, and are therefore not to be taken too seriously.

Note that, bearing Section 3.1 in mind, it cannot be said that one of questions A and B

is better and the other one is worse in the sense that responses to one question would be

likely to be “less biased”. There is a rather subtle difference between saying “yes” to A

and saying “no” to B, because explicitly allowing something, and thereby encouraging it to some degree, is not the same as merely not forbidding it. This difference could

explain the different results, at least partly. Therefore, the researcher has to decide what

they really want to know, and to take care that the question expresses this as precisely

as possible.

Here are some guidelines:

Be specific and precise The more specific and precise you are, the more likely it is

that all respondents will understand a question in more or less the same way.

Example: “Are you satisfied or dissatisfied with your canteen?” is not specific,

because this could refer to the prices, the service and/or the quality of meals. Better:

“Are you satisfied or dissatisfied with the prices of meals in your canteen?”.

“Do you favour or oppose gun control legislation?” is not precise; the answer may

depend on the precise content of a proposed gun control law.

Do not suggest a particular answer Some people are easily influenced if a question

is formulated in a non-neutral way. “Do you favour the use of capital punishment?”

can be expected to yield more answers in favour than “Do you favour or oppose the

use of capital punishment?” Another example: “Would you like to pay more tax?”

– certainly not vs. “Would you like to pay more tax to enable higher spending on

education?” – possibly.

Note that questions asking for agreement or disagreement with a statement are, on

average, more likely to receive agreement. Therefore, such questions should only be

used in a context where multiple questions of the same type about the same topic

are used, but asked so that agreement corresponds to opposite directions of opinion

in different questions.

Example from the British Social Attitudes Survey: “It is all right for a couple to live

together without intending to get married” and “People who want children ought to

get married”. Conservative people will tend to agree with the second question but

not with the first one. Having both questions in the same questionnaire will tend

to reduce the bias associated with a tendency to agree. Chapter 4 has more on the

combining of such questions in what is called a Likert scale.

Avoid multiple questions “Would you like to be rich and famous?” is not good, because the respondent might want to be rich but not to be famous.

Avoid ambivalent questions “A teenager can be just as good a parent as somebody

who is older – agree/disagree?” is not good, because “disagree” could mean that


the respondents see teenagers as particularly bad or particularly good parents.

Keep question wording simple A lot of examples appear in the literature, dealing

with issues such as asking “Do you think. . . ?” instead of “Is it your opinion. . . ?”,

avoiding double negatives, and so on. Moser and Kalton, 1985, cite “Has it happened

to you that over a long period of time, when you neither practised abstinence, nor

used birth control, you did not conceive?” as a particularly bad example.

3.3.2 Even more

• The question order can play a role. People tend to try to be consistent in their

responses to questions, and because they think about earlier questions first, later

answers can be influenced.

Example from United States, 1980, taken from Scheaffer, Mendenhall and Ott,

1996, p. 63.

A “Do you think the United States should let Communist newspaper reporters

from other countries come in here and send back to their papers the news as

they see it?”

B “Do you think a Communist country like Russia should let American newspaper

reporters come in and send back to America the news as they see it?”

“yes”-answers when asked A then B: 54.7% for A, 63.7 % for B.

“yes”-answers when asked B then A: 74.6 % for A, 81.9 % for B.

One way to deal with this effect is to give different respondents questionnaires with the questions in different orders.

• Do not make the questionnaire longer than necessary. Skip any question

which is not clearly useful with respect to the aim of the study. You should neither waste the respondent’s time nor your own time needed for the analysis. Long

questionnaires also lead to more respondents failing to get to the end of them.

3.4 Data Visualisation

It is often desirable to summarise results from questionnaires (as well as data collected from other sources) using visual displays. By visual displays we usually mean charts, graphs and plots, although some might consider a poster showing details of the background of the survey/questionnaire, its motivation, data collection methods etc. to be a ‘visual display’.

Visual displays can be an excellent method of communicating questionnaire and other

survey results. However, it is important to ensure that any visual display is effective and

we outline some guidelines with regard to visual display design:

• Objective: Consider the objective of your survey/questionnaire and decide which

data are the most important with regard to meeting this objective. It is unlikely


that any reader wants to see reported results on lots of outcomes, so one should focus

on what is important.

• Intuition and understanding: A plot or graph should be designed to summarise data in a way that is easy to understand. A reader should not have to spend a long

time looking at a graph to understand the results or data that are presented. A

plot or graphical display should make it possible to view important results or data

features easily and intuitively.

• Consistency: Don’t keep changing the type of plot used just for the fun of it. If

one style works for many of your plots, stick to it and make it easier for the reader

to appreciate them.

• Use of colour: Good or bad? Generally, colour can make plots/graphs eye-catching

and allow important features of data to stand out. However, too much colour can be

a mistake. It might confuse the reader and possibly detract from the main features

displayed within a plot. Some readers may be colour blind, and so distinguishing

lines on a plot by red and green colouring is inadvisable. Finally, whatever beautiful colours you use, some readers will end up looking at a monochrome printout, and so if at all possible the graph should still be interpretable when seen in greyscale.

• Axes: When presenting multiple plots of numerical data, eg for different subgroups,

axes should always be on exactly the same scale for different plots, so that groups

can be easily compared. If natural ranges for x and y axes exist, then these should

be used, where appropriate.

• Titles, Labels and Legends: Titles and axes labels should always be included on

plots. They should be informative, showing units of measurement where necessary.

Generally, it is a good idea that titles, labels and legends are not too lengthy.
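As a sketch of these guidelines in practice, here is a minimal bar chart in Python with matplotlib. The question text and counts are made up; note the informative title with sample size, the labelled axis, a fixed y-range for comparability across plots, and a single grey tone that survives both colour-blindness and monochrome printing:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical response counts for one closed question.
answers = ["Strongly agree", "Agree", "Neither", "Disagree",
           "Strongly disagree"]
counts = [34, 51, 22, 18, 9]

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.bar(answers, counts, color="0.4")   # one grey tone prints safely
ax.set_title("Would use own containers for loose goods (n = 134)")
ax.set_ylabel("Number of respondents")
ax.set_ylim(0, 60)                     # fixed scale for comparability
plt.xticks(rotation=30, ha="right")
fig.tight_layout()
fig.savefig("q1_responses.png", dpi=150)
```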

Edward Tufte has written a number of very nice books on graphical presentation, with

both beautiful and horrid examples.


4 Measurement

The questionnaire has been called a “measurement instrument”. The term “measurement”

is used in a general way here. A measurement is an assignment of a number or category to

some attribute of a real world object or event. This is usually done to allow comparisons

between the objects or events, often using statistical methods.

The problems of measurement are often more complicated in the social sciences than in

physics, chemistry or engineering. What is being measured is not always well defined, and

even where it is, it is not always clear that the instrument, a questionnaire for example, is

measuring it in a valid way. A particular problem concerns the measurement of attitudes

or psychological constructs such as intelligence (from tests in which certain problems are

to be solved) but it also applies to constructs which refer to more directly observable

quantities such as criminality, deprivation, living quality of a certain area or inflation in

economics. Criminality, for example, is made up of events that are directly observable in

principle, but the term “criminality” itself is more abstract. To define it in an observable

way, a lot of decisions have to be made. Criminality is composed of different things, and

there are many possibilities to aggregate them. For example, a definition of criminality

as a one-dimensional measure requires a decision about the mathematical aggregation of

burglaries, murders and tax fraud, and in particular about the relative weight of these

events.

This section introduces some general ideas from the theory of measurement in the social

sciences.

4.1 Types of measurement scale

4.1.1 Basic concepts

In 1946 S. S. Stevens published a very influential, though quite controversial, classification

of different types of scales of measurement (On the Theory of Scales of Measurement,

Science, 103, 677-680). He was motivated by his impression that in psychology and the

social sciences in general many arithmetical and statistical operations were being carried

out that were not valid and thus meaningless. For example, many researchers compute

and compare arithmetic means of Intelligence Quotients (IQs) for certain groups of people.

This operation implicitly assumes that one person with IQ 100 and another with IQ 140

can, as a group, be considered as exhibiting similar intelligence to two persons both with

IQ 120, or, in other words, that the IQ difference between IQ 100 and IQ 120 is as large

as the difference between IQ 120 and IQ 140.

Such an assumption is certainly meaningful for some measurements. An example is length

– the difference between 100 feet and 120 feet is exactly the same as the difference between

120 feet and 140 feet. This is because there is a natural operation of adding lengths: one

distance can be put behind the other, and together they give a distance with both lengths

added. In other words, the addition of numbers can be said to represent the addition of

lengths. But there is no such natural addition operation for IQs of different people.


Stevens aimed to define which arithmetic operations could be carried out on measurements

in a meaningful way. By this he meant that “meaningful” operations should only make

use of those features of a measurement which carry information. Stevens’ classification of

scale types was based on the type of transformation that preserves the information in the

measurement. Mathematical and statistical operations on a particular measurement type

are only valid if they are invariant under the associated type of transformation. First a

formal definition:

Definition: The scale type of a measurement operation is defined by the family Φ

of transformations φ by which the measurements can be transformed without losing

their original information content. Statistical statements about sets of measurements

{x1 , . . . , xn } are meaningful if they are invariant, i.e., for all φ ∈ Φ, S(x1 , . . . , xn ) holds

if and only if S(φ(x1 ), . . . , φ(xn )) holds, where S denotes the statement in question.

This definition is very abstract, but should become clearer after a few examples. We begin

by listing the types of scales, which are similar to the types of closed questions discussed

in Chapter 3.

Nominal scales which distinguish two or more classes. Examples are male/female, tennis/football/basketball/swimming/athletics/other.

Ordinal scales for which the measurements carry an order, but there is no information

about the distance between the categories. Example: strongly agree/agree/neither

agree nor disagree/disagree/strongly disagree.

Interval scales for which comparisons of the distances between the values are meaningful, but there is no clearly defined zero point. Example: dates. The time between

1940 and 1970 is the same as between 1970 and 2000, but the zero point has no

arithmetical meaning; in most contexts the year 2000 is not in any sense twice as

large as the year 1000.

Ratio scales which additionally have a unique zero, so that ratios are meaningful. Example: lengths and distances; 200 miles is twice as far as 100 miles.

4.1.2 The properties of the standard types of scales

Here is how these scale types are characterized by the definition above (this forms a

hierarchy of scales from the least to the most informative):

(i) Nominal scales

Information: Two values are either equal or different.

Transformations preserving information: all one-to-one transformations, for example

the transformation of “tennis/football/basketball/swimming/athletics/other” to

“1/2/3/4/5/6”, but also to “4/5/1/3/6/2”.

Meaningful statements: statements including the frequencies of categories, such as

most frequent category, relative frequency (empirical probability). For example,

the statement “out of tennis/football/basketball/swimming/athletics/other, tennis

was chosen most often, in 44% of the cases, as favourite sport in our data” can be

transformed to “out of the categories no. 1,2,3,4,5,6, category 1 was chosen most

often, in 44% of the cases” (or category 4, if the other transformation above had

been used).

(ii) Ordinal scales

All nominal information is still valid and all meaningful statements for nominal scales are

still meaningful for ordinal ones.

Information: Two values are either equal or one of them is larger than the other.

Transformations preserving information: all order preserving transformations, e.g. the

transformation of “strongly agree/agree/neither agree nor disagree/disagree/strongly

disagree” (sa/a/n/d/sd) to “1/2/3/4/5”, or to “–2/–1/0/1/2”, or to “1/17/120/122/2644”.

Meaningful statements: statements involving quantiles (such as the median) and rank

statistics (such as the Kruskal-Wallis or Wilcoxon tests). For example, “category d

is the median of our data” (which means that there were not more than 50% sd

and not more than 50% sa/a/n) can be transformed to “1 is the median of our data

transformed to -2/-1/0/1/2” (or 122, if the second transformation has been used).
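This invariance can be checked numerically. A minimal Python sketch (the five responses are made up for illustration):

```python
from statistics import median

# Two codings of the ordinal scale sa/a/n/d/sd; the second is an arbitrary
# order-preserving transformation of the first.
codes_1 = {"sa": 1, "a": 2, "n": 3, "d": 4, "sd": 5}
codes_2 = {"sa": 1, "a": 17, "n": 120, "d": 122, "sd": 2644}

data = ["sa", "a", "d", "d", "sd"]  # made-up responses

m1 = median(codes_1[r] for r in data)
m2 = median(codes_2[r] for r in data)

# Both medians point to the same category, "d".
print(m1, m2)  # 4 122
```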

(iii) Interval scales

All ordinal information is still valid and all meaningful statements for ordinal scales are

still meaningful for interval scales.

Information: Comparisons between differences of values are informative.

Transformations preserving information: positive linear: φ(x) = ax + b with a > 0.

Meaningful statements: comparative statements involving sums, means and variances.

Example: “the mean of (3, 4, 4, 5) is larger than that of (2, 2, 5, 5)” becomes “the

mean of (−4, −2, −2, 0) is larger than that of (−6, −6, 0, 0)” under φ(x) = 2x − 10.

Note that the mean of (120, 122, 122, 2644) is not larger than that of (17, 17, 2644, 2644),

so that mean comparisons are not invariant under all the monotone transformations

allowed for the ordinal scale.
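The same numbers can be replayed in a short Python sketch, showing both the invariance under a positive linear map and the failure under a merely order-preserving one:

```python
from statistics import mean

a = [3, 4, 4, 5]  # mean 4.0
b = [2, 2, 5, 5]  # mean 3.5

# Positive linear map phi(x) = 2x - 10, allowed on an interval scale:
# the comparison of the two means is preserved.
def phi(x):
    return 2 * x - 10

preserved = (mean(a) > mean(b)) == (mean(phi(x) for x in a) > mean(phi(x) for x in b))

# An order-preserving recoding 1..5 -> 1/17/120/122/2644, allowed on an
# ordinal scale, can reverse the comparison of means.
mono = {1: 1, 2: 17, 3: 120, 4: 122, 5: 2644}
reversed_by_mono = mean(mono[x] for x in a) < mean(mono[x] for x in b)

print(preserved, reversed_by_mono)  # True True
```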

(iv) Ratio scales

All interval information is still valid and all meaningful statements for interval scales are

still meaningful for ratio scales.

Information: Ratios between values are informative.

Transformations preserving information: positive proportional: φ(x) = ax with a > 0.

Meaningful statements: Statements involving products and ratios of values. Example:

“a flat in London is, on average, twice as expensive as a flat of the same size in

Stockholm” – whatever currency (proportional transformation) is used to measure

this.

This is all very elegant as a theory, but what happens in practice is that data that are

only on an ordinal scale are often analysed as though they were on an interval scale. For

example, responses on a five point ordinal scale are often coded to the integers 1 to 5 and

then means are compared across groups. The Likert scales discussed in the next section

do exactly this. Maybe it matters (it certainly does sometimes), maybe we can get away

with it so long as Stevens isn’t looking.

4.2 Attitude measurement with the Likert technique

4.2.1 Basic principle

Measuring attitudes, for example the attitude of someone towards the welfare state, is

not straightforward when the context is not a simple one. A person might feel positive

about some aspects and less positive about others. One popular technique for tackling

this problem is due to the American psychologist Rensis Likert (1903-1981). The most

basic construction of a Likert scale works as follows:

1. Construct several items (questions) that ask for several aspects of the attitude

you want to measure. The questions should all have the same answer format,

namely an ordered scale with a constant number of categories such as “strongly

agree/agree/neither agree nor disagree/disagree/strongly disagree”. A five-point

scale is used most often, but the method also works with other numbers of categories.

Example: Imagine that the questionnaire consists of three questions:

(a) People receiving social security are made to feel like second class citizens.

(b) The government should spend more money on welfare benefits for the poor,

even if it leads to higher taxes.

(c) If welfare benefits weren’t so generous, people would learn to stand on their

own feet.

with answers on a five point strongly agree/. . . /strongly disagree scale.

Normally there would be quite a few more questions, but three is enough to illustrate

the idea.

2. Now code the responses as 1/2/3/4/5 (or you could use -2,-1,0,1,2). We need to be

careful here as these questions don’t all point in the same direction. Agreement with

(a) and (b) indicates a positive attitude towards the welfare state, but agreement with

(c) suggests a negative attitude. This is called the polarity of the question. We could

say that (a) and (b) have positive polarity and (c) has negative polarity. To make

it so that averaging over questions makes sense we code (a) and (b) as 1/2/3/4/5

and (c) as 5/4/3/2/1. Thus strongly agree codes to 1 for (a) and (b), while strongly

disagree codes to 1 for (c). We shouldn’t try to make life simpler by making all

our questions have the same polarity but rather try to balance them, thus avoiding

bias caused by people having a preference for one end or other of the verbal scale.

Note that which direction is labelled positive and which negative is a matter

of choice, as is the direction of the numbering. All that matters is that we are

consistent and that we remember what we have done when it comes to interpreting

the results.

3. To compute the Likert score Li for respondent i, just compute the mean of the codes

of their answers.

If a person answers (a) with “strongly agree”, (b) with “disagree” and (c) with

“disagree”, their Likert score would be Li = (1 + 4 + 2)/3 = 2.33. Remember that

(c) has negative polarity and “disagree” is therefore coded as 2.

Note that in the literature Likert scales are often defined as sums instead of means.

However, means have two advantages:

1. means lie in the same value range as the original codes, which makes them easier to

interpret,

2. in case of missing values, or “do not know” if such a category has been offered, the

mean is more sensible. It is natural to ignore missing values for the computation

of the sum, which means that they are effectively coded as 0. If the code numbers

are all positive (1/2/3/4/5), this makes the sum score smaller than it should be,

and even in case of codes -2/1/0/1/2, treating missing values as zeros may not be

appropriate either. But if the mean is computed, the sum can be divided by the

number of questions answered, and this means that the missing values and do not

knows really do not have an influence on the final score. Had the person in the

example above not answered (c), their mean score would be Li = (1 + 4)/2 = 2.5.
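The scoring rule above can be sketched in a few lines of Python; the function name and the use of None for a missing or “do not know” answer are illustrative choices, not part of the method itself:

```python
POSITIVE = {"sa": 1, "a": 2, "n": 3, "d": 4, "sd": 5}
NEGATIVE = {"sa": 5, "a": 4, "n": 3, "d": 2, "sd": 1}  # reversed coding

def likert_score(answers, polarities):
    """Mean of the coded answers; None (missing / do not know) is skipped."""
    codes = [(POSITIVE if p == "+" else NEGATIVE)[ans]
             for ans, p in zip(answers, polarities) if ans is not None]
    return sum(codes) / len(codes)

# The respondent from the text: (a) strongly agree, (b) disagree, (c) disagree;
# items (a) and (b) have positive polarity, item (c) negative polarity.
print(likert_score(["sa", "d", "d"], "++-"))   # (1 + 4 + 2) / 3 = 2.33...

# Had they not answered (c), the mean is taken over the answered items only.
print(likert_score(["sa", "d", None], "++-"))  # (1 + 4) / 2 = 2.5
```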

4.2.2 Item selection and check of polarity

Once we have administered the questionnaire to some subjects, possibly in a modestly

sized pre-test before running the full survey, we can try to find out whether the items are

really suitable to measure a common concept, i.e. whether they are consistent with the

general scale. This includes a check whether a wrong polarity has been chosen for some

of the items.

The standard way to do this is to compute correlation coefficients between the individual items and the overall Likert score. For two paired samples x = (x1 , . . . , xn ), y =

(y1 , . . . , yn ) (here the pair (xi , yi ) belongs to person i and n is the number of participants),

the (sample) correlation coefficient is

r(x, y) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² ).

We compute this correlation with x the data for one question and y the overall Likert

score, repeating the calculation for each question in turn. If the polarity of all items is

correct, all their correlation coefficients with the overall score should be positive. If one

is not, it is easy enough to fix the problem by reversing the coding for that item.

The correlations, after any mistakes with polarity have been fixed and the correlations

recomputed, can also be used to weed out questions that have very low correlation with

the overall score. The idea is to end up with a set of questions that are all measuring

roughly the same thing. The threshold for removing questions is pretty arbitrary, but a

correlation of less than 0.4 is sometimes used as a reason for removal.

As an improvement, the correlations could be computed not between an individual item

and the overall score, but between the item and the score which would result from aggregating only the other items. These correlations would be a bit smaller.
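As an illustration, this item-rest check might be sketched as follows in Python; the data are made up and the helper function simply implements the correlation formula above:

```python
from statistics import mean

def pearson(x, y):
    """Sample correlation coefficient r(x, y) as defined above."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up coded answers: rows are respondents, columns are three items.
data = [
    [1, 2, 1],
    [2, 2, 3],
    [4, 5, 2],
    [5, 4, 5],
    [3, 3, 1],
]

for j in range(len(data[0])):
    item = [row[j] for row in data]
    rest = [sum(row) - row[j] for row in data]  # score from the other items
    # A clearly negative value here would suggest a wrongly chosen polarity.
    print(f"item {j + 1}: r = {pearson(item, rest):.2f}")
```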

4.2.3 Discussion

Obviously, several subjective decisions are needed to construct a Likert scale, and although

there is a certain test of internal consistency, the quality of this measurement instrument

is not entirely convincing. It is at least easy to apply, and more complicated alternatives

cannot solve all the problems.

Note that in Section 4.2.2 it is only tested to what extent the individual items are consistent with the major tendency of all items. Thus we are not testing whether the items

measure what they are supposed to measure, just that they appear to be measuring more

or less the same thing.

Statistical methods assuming interval scale level are often applied to Likert scales, and in

fact the assumption of an interval scale is implicit in the averaging over questions. This

may or may not be appropriate.

4.2.4 Desirable properties of measurement instruments

A measurement instrument should ideally satisfy the following two properties:

Reliability. If the same situation is measured more than once, the measurement always

gives the same, or at least a very similar, result.

Validity. The measurement measures what it is supposed to measure. For example, the

welfare scale discussed above is valid if it properly represents the attitude of the

individuals toward social services and the welfare state.

Unfortunately, neither property can be measured directly when the instrument is a

questionnaire.

4.2.5 Reliability

The intuitive approach to measuring reliability would be to apply the same measurement

instruments twice to the same individuals. Reliability would then be quantified by the

correlation between the pairs of measurements.

In our context the measurement instrument is a questionnaire. Obviously there are problems with this intuitive approach. If an individual has to fill in the same questionnaire

twice, it can be expected that the results of the second measurement are influenced

strongly by the first measurement. Individuals may remember their first answers and

try to appear consistent, or they may start to think more about the issues as a result of

reading the first questionnaire. Another effect could be that they do not take a second

interview with an identical questionnaire seriously.

Many of these problems have a weaker impact if the second measurement is taken much

later, for example half a year later. But then the individuals can no longer be treated

as “the same”. Whatever may have happened in the meantime may have influenced

their attitude and therefore a different measurement result may be obtained even with a

perfectly reliable measurement instrument.

Therefore, pure reliability is essentially not observable.

It is possible, though, to estimate the reliability of a measurement instrument which is

defined by the addition of several “sub-measurements” (such as the items in the Likert

scale) under some strong assumptions. The idea is that the variability of a measurement

instrument can be estimated from the variability between its parts. This will be discussed

in Section 4.2.7.

4.2.6 Validity

There is an obvious philosophical problem with the validity concept. Validity is about

the relation of our measurements to the “underlying truth” (e.g., the underlying real

attitude) that we would like to measure. But as long as we do not know how to measure

this underlying truth, we cannot observe anything objective about this relation. Often

it is not even clear whether such underlying truth exists. Therefore, pure validity is

essentially not observable either.

Some scientists in this situation choose the easy way out and define an attitude or ability

by the result of the measurement. This leads to statements like “intelligence is what

intelligence tests measure”. An intelligence test would then be perfectly valid by definition. But if intelligence were nothing more than the result of an intelligence test, there

would be no reason at all why anybody in society should be interested in measuring intelligence. Therefore, a measurement has at least to be related to some aspects of the idea

of intelligence that are relevant in the real world.

Validity can be assessed in a theoretical way by exploring how the measurement instrument is actually related to the theoretical concept of what it is supposed to measure. In the

literature this is split up into several parts, for example “construct validity” and “content

validity”. This is in the realm of philosophy rather than statistics and we will not pursue

these ideas.

A more practical way is the comparison of the measurements with other observable criteria

that can be expected to be related to the underlying concept of interest. For example, the

results of intelligence tests could be correlated with school grades. A scale purporting to

measure the attitude toward the protection of the environment could be correlated with

observable behaviour such as use of cars, bikes or public transport and energy bills.

In the literature, this is called “predictive validity” if the observable criterion is measured

in the future, or “concurrent validity” if it is measured at the same time.

The essential philosophical problem with validity remains unsolved. If school grades do

not agree well with the results of intelligence tests, the intelligence test may be invalid,

but alternatively school grades could have less to do with intelligence than expected. If

the results agree, it may still be that both concepts measured the wrong thing.

4.2.7 Test theory

The term “test theory” refers to psychological tests, for which this theory was first developed. This statistical theory can be used to derive some interesting results about reliability

and validity, though the assumptions of the classical test theory are quite dubious. We

look only at some results on reliability here.

Suppose X1 and X2 are two measurements of the same property, e.g. an attitude, on the

same person, taken by the same measurement instrument. Assume

X1 = T + E1 ,   X2 = T + E2 ,   ρ(E1 , E2 ) = ρ(E1 , T ) = ρ(E2 , T ) = 0,   (1)

where ρ(X, Y ) = Cov(X, Y ) / √(Var(X) Var(Y )) is the (theoretical) correlation between the random variables X and Y .

Here T is interpreted as the (non-observable) true value and E1 and E2 are the measurement errors, which are assumed to be uncorrelated with the true value and to satisfy E(Ei ) =

0, i = 1, 2, which is the usual assumption for measurement errors.

Then, ρ(X1 , X2 ) is called the reliability of the measurement instrument yielding the

measurements X1 and X2 .

Now let Y1 = Σ_{i=1}^k Xi1 and Y2 = Σ_{i=1}^k Xi2 be two measurements that are constructed by

summing up k items (“sub-measurements”) X11 , . . . , Xk1 , X12 , . . . , Xk2 , respectively. The

Y s could be Likert sum scores or indeed mean scores, because dividing all values by the

same constant does not change anything. Assume about the X11 , . . . , Xk1 , X12 , . . . , Xk2

that they are all of the form (1) with measurement errors uncorrelated to each other.

Assume that all items have the same correlation with each other, so for some constant ρ0

ρ(Xij , Xhl ) = ρ0   (2)

for i, h = 1, . . . , k and j, l = 1, 2, except when i = h and j = l and the two Xs are the

same. This is, in most situations, a quite unrealistic assumption.

To make the formulae in the proof below easier, assume further that

E(Xij ) = 0,   Var(Xij ) = E(Xij²) = 1   (3)

for all i, j, implying that E(Yi ) = 0 and ρ0 = E(Xij Xhl ). The theorem below can be

proved without these simplifying assumptions, but the algebra is much messier.

Theorem Under the assumptions above,

ρ(Y1 , Y2 ) = kρ0 / (1 + (k − 1)ρ0 ).

Proof: Because of (3),

Cov(Y1 , Y2 ) = E(Y1 Y2 ) = E( Σ_{h=1}^k Σ_{i=1}^k Xh1 Xi2 ) = Σ_{h=1}^k Σ_{i=1}^k E(Xh1 Xi2 ) = k²ρ0

and

Var(Y1 ) = Var(Y2 ) = E(Y1²) = E( Σ_{h=1}^k Σ_{i=1}^k Xh1 Xi1 ).

The double sum has k terms with h = i, all equal to 1, and k(k − 1) terms with h ≠ i, all

equal to ρ0 , so that

Var(Y1 ) = Var(Y2 ) = k + k(k − 1)ρ0

and

ρ(Y1 , Y2 ) = Cov(Y1 , Y2 ) / √(Var(Y1 ) Var(Y2 )) = k²ρ0 / √((k + k(k − 1)ρ0 )²) = kρ0 / (1 + (k − 1)ρ0 ).

This means that the reliability of a measurement defined as a sum (or mean) of items such as the Likert scale Y = (1/k) Σ_{i=1}^k Xi can be estimated by the so-called Cronbach’s α

α = k r̄ / (1 + (k − 1) r̄),

where r̄ is the mean of all (sample) correlation coefficients between the items Xi , Xj , i ≠ j. A general rule-of-thumb is that Cronbach’s α should be larger than 0.8 for reliable

measurement instruments.

Example: To estimate the reliability of a welfare scale, the correlation matrix of its seven

items is computed:

     X1     X2     X3     X4     X5     X6     X7
X1   1.000  0.260  0.543  0.460  0.541  0.294  0.235
X2   0.260  1.000  0.263  0.468  0.403  0.153  0.149
X3   0.543  0.263  1.000  0.352  0.415  0.335  0.091
X4   0.460  0.468  0.352  1.000  0.519  0.224  0.198
X5   0.541  0.403  0.415  0.519  1.000  0.286  0.272
X6   0.294  0.153  0.335  0.224  0.286  1.000  0.171
X7   0.235  0.149  0.091  0.198  0.272  0.171  1.000

The mean of the off-diagonal entries is 0.316, which yields α = (7 × 0.316)/(1 + 6 × 0.316) = 0.764. This is not very high.

Another implication of the theorem above is that the larger the number of items k, the

larger the reliability (assuming that the correlation between the items is always the same).

For example, if we had 14 items with a mean correlation of 0.316 instead of just seven,

the reliability would have been 0.866.
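Both numbers can be reproduced with a one-line Python function implementing α = kr̄/(1 + (k − 1)r̄):

```python
def cronbach_alpha_from_rbar(k, rbar):
    """Cronbach's alpha from the number of items k and the mean
    inter-item correlation rbar."""
    return k * rbar / (1 + (k - 1) * rbar)

print(round(cronbach_alpha_from_rbar(7, 0.316), 3))   # 0.764
print(round(cronbach_alpha_from_rbar(14, 0.316), 3))  # 0.866
```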

The increase with k can be shown by computing, for fixed r̄,

α′(k) = (r̄(1 + (k − 1)r̄) − k r̄²) / (1 + (k − 1)r̄)² = (r̄ − r̄²) / (1 + (k − 1)r̄)² > 0

because 0 < r̄ < 1.
This suggests that all we have to do is have lots of items. However, if you look back at
the assumptions that went into this calculation, the measurement errors on the individual
items are assumed independent between items. If two items are too similar, i.e. the two
questions are virtually the same, this will clearly not be true. So we need lots of essentially
different items for the theory to work. This is a lot harder to achieve.
Just to confuse matters, there is an alternative formula for Cronbach’s α, given by:

α = (k/(k − 1)) (1 − (Σ_{j=1}^k s²_j) / s²_Y ).

Here:
s²_j = sample variance of the jth item
s²_Y = sample variance of the calculated scores
with the variances calculated across n subjects in each case.
The previous formula is unaffected by whether Y is a sum or a mean, but this one assumes
the scores Y are sums rather than means. The s²_Y in the denominator would need to be multiplied by k² if Y is a mean of k items rather than a sum.
Under the same unreasonable assumptions this formula also estimates the correlation
between replicate measurements of Y. To see this, note that s²_j estimates the variance of the jth item, which is assumed to be 1, and s²_Y estimates the variance of Y, which is
shown above to be k + k(k − 1)ρ0 . Thus α estimates
(k/(k − 1)) (1 − k/(k + k(k − 1)ρ0 )) = kρ0 / (1 + (k − 1)ρ0 ).
Which of the two formulas you choose to use probably depends more on which is easier
to compute in a given context than on any theoretical advantage of one or the other.
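A minimal Python sketch of the variance-based formula, applied to made-up coded answers and using sums as scores:

```python
from statistics import variance  # sample variance, dividing by n - 1

def cronbach_alpha(items):
    """items[j] holds the n coded answers to item j; scores are sums."""
    k = len(items)
    scores = [sum(answers) for answers in zip(*items)]  # per-respondent sum
    return (k / (k - 1)) * (1 - sum(variance(it) for it in items) / variance(scores))

# Made-up data: k = 3 items, n = 4 respondents.
items = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 5],
]
print(round(cronbach_alpha(items), 3))  # 0.98
```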
5 Introduction to Sampling Schemes
Much of social statistics concerns the use of a sample to provide inference about variables of interest in a population. Often we think of theoretical ‘true’ parameters of
interest in the population, and we aim to estimate such parameters using data collected
on a sample from the population. For example, we might be interested in the population
mean (µ) for a variable of interest.
If possible, an unbiased and precise estimate of the parameter of interest is desirable. We
should aim to ensure that any sampling mechanism allows us to achieve this aim. We
now briefly outline some common sampling mechanisms before further discussion of the
simplest method of probability sampling known as simple random sampling and some
of its more complex variants.
5.1 Types of sampling scheme
Availability/convenience sampling. This is when we sample only those individuals/units that are immediately available at the point in time when sampling occurs.
Clearly, this is easy to perform, though not necessarily desirable as the resulting
sample may not be representative of the population of interest.
Quota sampling. This occurs when we select a sample to ensure balance over some
pre-defined characteristics. Example: if the target population of a questionnaire is
the population of British adults, quota sampling tries to collect a sample in which
the proportions of the different age groups, males and females etc. are the same as
among all British adults.
This method sounds intuitively sensible, although if the quotas are met simply on
the basis of the availability of units, as they often are, this method would represent
availability sampling according to a fixed quota.
Snowball sampling. This is a sampling technique in which a sample is constructed
by existing study subjects who recruit further study subjects based upon their
acquaintances. The sample group appears to grow like a rolling snowball, building
in size until enough data are gathered for research. This method is often used in
surveys of hidden or hard-to-reach populations (e.g. injecting drug users, illegal
immigrants). However, sample members are not selected from a sampling frame,
and so snowball samples are subject to numerous biases (e.g. people who have a lot
of friends/acquaintances in the population of interest might be more likely to be
recruited).
Probability sampling. Select a sample with some pre-defined probability. Common
methods:
Simple random sampling: Each sample has equal probability of being chosen.
Example: a random number generator can be used to draw a simple random
sample of employees of a company from a complete list.
Stratified random sampling: Draw a simple random sample within each of several “strata”. Example: for the target population of British adults, several
strata are pre-defined, such as all possible combinations of several age-groups
and male/female. The proportions of elements from these strata are fixed in
advance, possibly, but not necessarily, so that they coincide with the proportions among all British adults, and simple random sampling is applied within the strata. This is different from quota sampling, where the quotas are fulfilled
without applying random sampling.
Cluster sampling: Draw a sample of units each of which contains multiple elements, then either take all the elements in the sampled clusters or sample from
each of them. The key difference between this and stratified sampling is that
in the latter we sample from all strata whereas in cluster sampling we only
investigate a sample of clusters. Example: If our target population is the
inhabitants of English cities, it will be much cheaper to sample some cities
rather than having to visit them all.
Systematic sampling: Take every kth element of an ordered population. Example: call every 100th number from a phone directory, possibly with a random
starting point.
5.2 Some history of opinion polls
In the USA in 1936, “Literary Digest”, a popular magazine, mailed questionnaires to a list
of car and phone owners to predict the outcome of the presidential election between
Roosevelt (Democrat, incumbent) and Landon (Republican). 10.3 million questionnaires
were sent out, of which 2.3 million were returned.
Prediction: 40% Roosevelt v 60% Landon
Result:     62% Roosevelt v 38% Landon
Why? Possible reasons:
• List not representative (car owner ≠ voter)
• Non-response bias (Roosevelt voters did not answer the questionnaire, a common
phenomenon for incumbent supporters).
The second reason seems more important, and is typical in “voluntary response” situations. People who feel strongly about issues tend to respond, and this may (or may not)
introduce bias.
Consequence: Self-selected samples were superseded by “quota” samples. These are
chosen so that they are representative of different population characteristics, such as
age, gender, occupation, etc. This was partly because Gallup, based on a quota sample,
predicted the election result quite well:
56% Roosevelt v 44% Landon.
Quota sampling was used in opinion polls (and is still widely used in surveys) until
1948 when in the election between Truman (Democrat) and Dewey (Republican), opinion
polls predicted a Dewey victory by 5–15%, but Truman won by 4%.
Why? Possible reason:
• quota sampling, if used in the context of availability sampling, does not solve all
the problems with availability sampling: although interviewers did obtain a sample
representative in terms of age, gender etc, other characteristics are important and
uncontrollable (approachability, accessibility).
In quota sampling the choice of respondents is not random but depends on who passes
by or who is accessible to the interviewer.
Nowadays, probability sampling is used when practicable. A rough general description
of the process, for drawing a sample of size n, is as follows:
For a population of size N and a fixed sample size n there are a finite number
k of possible samples, where
k = N Cn = N! / (n!(N − n)!).

The number k represents the possible combinations of n elements randomly chosen from a set of N units. Label these samples S1 , S2 , . . . , Sk . Assign probability πi to Si , where Σi πi = 1, but the πi are not necessarily equal. Choose sample Si with probability πi .
This has the following features:
• it does not depend on subjective decisions,
• randomisation prevents systematic deviations from representativeness,
• the sampling method allows us to calculate a measure of uncertainty.
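For concreteness, k is just a binomial coefficient, which Python computes directly (the values of N and n here are illustrative):

```python
from math import comb

N, n = 10, 3
k = comb(N, n)  # N! / (n! (N - n)!)
print(k)  # 120

# Under simple random sampling every one of these k samples would get
# the same probability 1/k; in general the probabilities just sum to 1.
probs = [1 / k] * k
assert abs(sum(probs) - 1) < 1e-9
```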
Having said that, the 1992 British election results were not well predicted by the opinion
polls. Although most used quota sampling, some did use probability sampling. Why?
Possible reasons:
• there was a genuine last minute change,
• systematic bias in responses (“do not knows” were mostly Conservatives)
• other possible biases?
There is an interesting discussion on opinion polls and general elections in the Journal of
the Royal Statistical Society Series A, volume 159 (1996) pages 1–39.
5.3 Simple random sampling: introduction
We will present the main characteristics of simple random sampling later on, but for
now we just consider a very basic set of features, that we use to develop the theory of
probability sampling.
Suppose we have a population of N individuals and we draw a sample of size n. There
are N Cn possible samples.
Definition If every possible sample of size n is equally likely to be chosen, the procedure
is called simple random sampling. The sample chosen according to this scheme is
called a simple random sample (abbreviation: SRS, where the final S can stand for
Sample or Sampling).
Notes:
• We sample without replacement.
• Convenience sampling (“just pick the first 10 individuals you meet”) is not SRS.
Quota sampling, which is basically convenience sampling with some balance imposed, is not SRS. If you want to be sure of balance in your random sample with
respect to some important demographic property then you should stratify by that
property.
• Simple random sampling is an equal probability selection method (EPSEM). That
is, each individual is equally likely to be chosen. But note that there are other
sampling methods in which each individual has the same chance of being chosen
(i.e. SRS is an EPSEM, but an EPSEM is not necessarily SRS).
• SRS is aiming to achieve representativeness by the use of randomisation. Note
that the word “random” is often wrongly used to mean “representative”. Basically
randomisation is a method used to produce a sample whose statistical properties
are known.
5.3.1 How to draw a random sample
The simplest way to draw a random sample is a physical experiment where
we have an urn in which we have placed a set of N objects. For example, we can consider
N indistinguishable balls that for all purposes look and feel the same; each ball contains
a small piece of paper with a unique number printed. We start by shuffling the balls in
the urn and then selecting one ball. We can check and see what number is associated
with the ball, continuing to draw without replacement n times.
Nowadays, we do not use this actual set up (except for trivial or glamorous circumstances,
e.g. the draws of teams in groups for the football World Cup) and computer pseudorandom mechanisms are used instead. Most statistical packages have built-in algorithms
that can draw a random sample.
For example, in R we can draw a simple random sample using the command
sample(x=1:N, size=n, replace=FALSE)
This will take a “population” consisting of the integers 1 to N and draw n
instances from that population. Obviously the values of N and n need to be set before
calling this function. The option replace=FALSE instructs R to operate according to the
urn scheme described above: every time that a unit is selected, it is set aside and it cannot
be selected again.
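For comparison, an equivalent draw can be made in Python, assuming Python rather than R is at hand (random.sample also draws without replacement, mimicking the urn scheme; the seed is fixed only to make the illustration reproducible):

```python
import random

N, n = 100, 10
random.seed(1)  # only for reproducibility of this illustration

srs = random.sample(range(1, N + 1), n)
print(sorted(srs))

# As with the urn, no unit can appear twice.
assert len(set(srs)) == n
```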
6 Sampling Theory: Mathematical Concepts and Notation

6.1 Sampling Theory Notation
Firstly, we introduce some useful notation. Different authors and books use different
notation. We will use the convention that population values are denoted by capital letters,
and sample values are lower case. This is a standard convention in finite population
sampling theory, but has the consequence that it is now the lower case letters that denote
the random variables, in contrast to what you will have seen elsewhere.
6.1.1 Population Values
Let Y1 , Y2 , . . . , YN denote the values of some variable Y for each individual in the
population. Sometimes we may need to refer to the individuals themselves (the population
units), rather than their values of Y ; these will be denoted by U1 , U2 , . . . , UN . The
following notation for population quantities will be used:
Population size:                     N
Population total:                    T = \sum_{j=1}^{N} Y_j
Population mean:                     \bar{Y} = \frac{1}{N} \sum_{j=1}^{N} Y_j
Population variance:                 S^2 = \frac{1}{N-1} \sum_{j=1}^{N} (Y_j - \bar{Y})^2
Population coefficient of variation: C = S / \bar{Y}
Some of the literature divides by N when defining the population variance, in which case
it is usually denoted by σ 2 , so
\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} (Y_j - \bar{Y})^2
and sometimes the population mean is denoted by µ rather than by Ȳ . The choice between
µ and Ȳ is simply one of notation as they represent the same quantity. The two variances
are different though, with
S^2 = \frac{N}{N-1}\,\sigma^2
\qquad \text{and} \qquad
\sigma^2 = \frac{N-1}{N}\,S^2.
Some formulas are more conveniently expressed in terms of S 2 , others are easier if written
in terms of σ 2 . We will try to stick with Ȳ and S 2 but reserve the right to use the
alternative versions where it makes the algebra simpler.
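The relationship between the two variance conventions is easy to verify numerically. A minimal sketch (using the made-up money amounts from the toy example of Section 6.2.1):

```python
# Small made-up population: the five amounts from the toy example later in the notes.
Y = [8, 8, 14, 6, 4]
N = len(Y)
Ybar = sum(Y) / N  # population mean = 8

# sigma^2 divides by N; S^2 divides by N - 1.
sigma2 = sum((y - Ybar) ** 2 for y in Y) / N
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)

# The two definitions differ only by the factor N / (N - 1).
print(S2, sigma2)       # 14.0 and 11.2
print(S2 - N / (N - 1) * sigma2)  # 0 up to rounding
```

Here the sum of squared deviations is 56, giving S^2 = 56/4 = 14 and \sigma^2 = 56/5 = 11.2, and indeed 14 = (5/4) × 11.2.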
6.1.2 Sample values
We generally use lower case y1 , y2 , . . . , yn to denote the values of Y for those individuals
selected in the sample. The sample individuals (or units) may be denoted by u1 , u2 , . . . , un .
Sample size:     n
Sample total:    t = \sum_{i=1}^{n} y_i
Sample mean:     \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
Sample variance: s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2
Make sure you distinguish clearly between upper case N, Ȳ , S 2 and lower case n, ȳ, s2 .
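For concreteness, the sample quantities can be computed directly. A short sketch with made-up sample values (the vector y below is invented for illustration):

```python
y = [3, 5, 4, 8]  # made-up sample values
n = len(y)

t = sum(y)                                      # sample total
ybar = t / n                                    # sample mean
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)  # sample variance, divisor n - 1

print(t, ybar, s2)
```

Note the divisor n − 1 in s^2, mirroring the divisor N − 1 in the population variance S^2.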
6.1.3 Binary variables
Some formulas simplify when the variable Y can take one of only two values, coded
as 0 or 1, where Yj = 1 means that individual j in the population has some attribute and
Yj = 0 means they do not. Variables that are defined in this way are called binary, or
dichotomous.
The population mean Ȳ is then the proportion of individuals in the population who do
possess the attribute. This is denoted by P and can be formally written as
P = \frac{1}{N} \sum_{j=1}^{N} Y_j = \bar{Y}.
Also the complementary proportion, representing those who do not possess the attribute,
is denoted by Q = 1 − P .
The corresponding sample proportions are denoted by p and q, where
p = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
\qquad \text{and} \qquad
q = 1 - p.
The population variance then becomes
S^2 = \frac{1}{N-1} \sum_{j=1}^{N} (Y_j - \bar{Y})^2
    = \frac{1}{N-1} \left( \sum_{j=1}^{N} Y_j^2 - N\bar{Y}^2 \right)
    = \frac{1}{N-1} \left( NP - NP^2 \right)
    = \frac{N}{N-1}\, PQ
In passing from line 2 to line 3 we have used the fact that each Yj is either 0 or 1, so
squaring each Yj again gives either 0 or 1; hence \sum Y_j^2 = \sum Y_j = NP, while
N\bar{Y}^2 = NP^2. By an identical argument, the sample variance is

s^2 = \frac{n}{n-1}\, pq.
Note that the population variance result is an example of a formula that is simpler if
we use the alternative definition of variance, since σ 2 = P Q. However we then lose the
correspondence between the population and sample results, so one can’t win either way.
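These binary-variable identities are also easy to verify numerically. A minimal sketch with a made-up 0/1 population (the vector Y below is invented for illustration):

```python
# Made-up binary population: 1 = has the attribute, 0 = does not.
Y = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]
N = len(Y)

P = sum(Y) / N  # population proportion (here 0.5)
Q = 1 - P

# S^2 with divisor N - 1 equals N/(N-1) * P*Q ...
S2 = sum((y - P) ** 2 for y in Y) / (N - 1)
print(S2 - N / (N - 1) * P * Q)  # 0 up to rounding

# ... while sigma^2 with divisor N equals P*Q exactly.
sigma2 = sum((y - P) ** 2 for y in Y) / N
print(sigma2 - P * Q)  # 0 up to rounding
```

This illustrates the remark above: the alternative definition gives the cleaner result \sigma^2 = PQ, while S^2 carries the extra factor N/(N − 1).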
6.1.4 Probability
When we discuss probability sampling, we will refer to the probability that a particular
element is chosen in the sample. More specifically, we may write Pr(ui = Uj ) to denote
the probability that the ith element in the sample is the jth element in the population.
Here probability refers to the method or mechanism of drawing the sample. So we think
of ui being randomly determined and U1 , U2 , . . . , UN being possible values.
It is common in books to write Pr(yi = Yj ) to denote this same probability, and to think
of it as the probability that the ith sample value equals the jth population value. Of course,
if two individuals in the population have the same value of Y it is possible that yi = Yj
but ui ≠ Uj ; by convention in sampling theory, however, Pr(yi = Yj ) is often used to mean
Pr(ui = Uj ).
Note that in ordinary statistics it is standard to write Pr(Y = y), where Y is
a random variable and y is a possible value. In sampling theory, however, it is well
established notation to use lower case y for the sample and upper case Y for the
population. Thus in probability sampling of finite populations, yi is a random
quantity and Yj is a possible value.
6.2 Estimates and their standard errors
We typically wish to estimate population parameters and to quantify the precision of
the estimates, usually by calculating standard errors or confidence intervals.
6.2.1 A small example
Suppose we have a population of five people: Adam, Ben, Clive, Donna and Eve. This
is obviously a toy example because one would not normally need to sample when the
population size is only 5, but small examples are useful because we can easily enumerate
all the possible samples.
We want to find out the total amount of money they have on them. In fact, Adam has
£8, Ben £8, Clive £14, Donna £6, and Eve £4, so in total they have £40. Instead of just
asking them all, we will try to estimate t...