STAT 0024 UCL Social Statistics Assessment Task

STAT0024 Social Statistics: In-course Assessment 2021–2022

The survey company you work for has been invited by a supermarket chain to bid to
design a questionnaire and carry out a sample survey to investigate the views of
their current customers about some ideas the chain has for trying to reduce the
amount of packaging, and especially plastic packaging, used in their stores. For
example, they might stop selling pre-packaged fruit and veg, or offer customers the
option of dispensing things like flour or sugar or washing powder into their own
containers. There are clear environmental benefits from such ideas, but they will
only work if there is take-up from the customers. The aim of the survey is to estimate
what the take-up might be in one particular store where they are thinking of running a
trial project. Your task is to produce a first draft of a plan for internal discussion
within your company. This should include the following elements.
1. Sampling
Explain how you propose to carry out the sampling. Topics you should discuss here
include, but are not necessarily limited to, how you will select your sample, how you
will administer the questionnaire, and the advantages of and possible problems with
your preferred approach.
2. Questions
Discuss the design of the questionnaire, both in general terms – broad areas of
questioning, numbers and types of questions – and by giving 3 specific examples of
questions you would include. These examples should be carefully worded and
should include the format of the response.
3. Analysis and presentation
Discuss how you propose to present the results to the supermarket’s managers. As
well as giving an overview, take one of your 3 example questions and explain in
detail how you would present the results for that question. Make up some data if that
helps. Your discussion should include some comments on quantifying the
uncertainty in the results.
General instructions
Your work should be typed with a minimum font size of 11pt and should not exceed
three A4 pages in total length. The preferred file format is pdf. The three parts of
your answer carry equal weight in the marking scheme. Overall this ICA is worth
15% of the final mark for the module.
Your work should be submitted via MOODLE no later than 4pm UK time on
Tuesday 8th March 2022. Your submitted work should contain only your
student number and not your name or other identifiable information. By
submitting your work you will be deemed to have agreed to the plagiarism and
collusion declaration that appears when you open the submission link.
STAT0024 Social Statistics
Tom Fearn
Department of Statistical Science, University College London.
Term 2: 2021–2022
Contents

1 Introduction
  1.1 Preliminaries, basic literature
  1.2 Social sample surveys — Basic concepts
  1.3 Module Outline

2 Planning a Social Survey
  2.1 Some basic distinctions
    2.1.1 Two areas of application of statistical design theory
    2.1.2 Types of subject matters of social surveys
    2.1.3 Types of objectives
    2.1.4 Possible sources of data
    2.1.5 Types of questioning
  2.2 Basics about sampling
    2.2.1 Why sample?
    2.2.2 Basic sampling terminology
    2.2.3 Types of error
    2.2.4 A note on non-response bias
    2.2.5 When not to sample?

3 Questionnaire Design and Data Visualisation
  3.1 An introductory remark
  3.2 Answer formats
    3.2.1 Open vs. closed questions
    3.2.2 Types of closed questions
    3.2.3 “Do not know”
    3.2.4 Rating scales
  3.3 Further aspects of questionnaire design
    3.3.1 Question wording
    3.3.2 Even more
  3.4 Data Visualisation

4 Measurement
  4.1 Types of measurement scale
    4.1.1 Basic concepts
    4.1.2 The properties of the standard types of scales
  4.2 Attitude measurement with the Likert technique
    4.2.1 Basic principle
    4.2.2 Item selection and check of polarity
    4.2.3 Discussion
    4.2.4 Desirable properties of measurement instruments
    4.2.5 Reliability
    4.2.6 Validity
    4.2.7 Test theory

5 Introduction to Sampling Schemes
  5.1 Types of sampling scheme
  5.2 Some history of opinion polls
  5.3 Simple random sampling: introduction
    5.3.1 How to draw a random sample

6 Sampling Theory: Mathematical Concepts and Notation
  6.1 Sampling Theory Notation
    6.1.1 Population Values
    6.1.2 Sample values
    6.1.3 Binary variables
    6.1.4 Probability
  6.2 Estimates and their standard errors
    6.2.1 A small example
    6.2.2 Subjective samples
    6.2.3 Some additional comments
    6.2.4 Relation with standard model-based statistical theory
  6.3 Sampling Distributions for Simple Random Sampling
    6.3.1 Expectation and variance of a sample value
    6.3.2 Covariance of two sample values
    6.3.3 Expectation and variance of the sample total
    6.3.4 Expectation and variance of the sample mean

7 Estimators for Population-level Parameters and Sample Size Calculation for a Simple Random Sample
  7.1 Estimation of a population mean
  7.2 Expectation of the sample variance
  7.3 Estimation of a population total
  7.4 Estimation of a population proportion
  7.5 Sample size calculations for simple random samples
  7.6 Allowing for drop out

8 Stratified random sampling
  8.1 General Idea
  8.2 How to draw a stratified random sample
  8.3 Notation
  8.4 Estimating the population total or mean
  8.5 Allocation of a stratified random sample
    8.5.1 Proportional allocation
    8.5.2 Comparison of proportional allocation with simple random sampling
    8.5.3 Optimal allocation
    8.5.4 Minimise variance for fixed total cost
    8.5.5 Minimise cost for a given variance
    8.5.6 Neyman allocation
    8.5.7 Comparing Neyman allocation with proportional allocation
    8.5.8 Choice of total sample size
    8.5.9 Proportions
    8.5.10 Example

9 Cluster Sampling
  9.1 Types of cluster sample
  9.2 Relationship with stratified sampling
  9.3 Notation
  9.4 SRS: Estimation of the population mean
  9.5 Equal cluster sizes
  9.6 PPS sampling: Estimation of the population mean
  9.7 Sample size calculation for cluster sampling

10 An Introduction to Missing data
  10.1 Missing data mechanisms
    10.1.1 MCAR
    10.1.2 MAR
    10.1.3 MNAR
    10.1.4 Some more formal definitions
  10.2 Checking MCAR
  10.3 Handling Missing Data
    10.3.1 Complete case analysis
    10.3.2 Inverse probability weighting
    10.3.3 Imputing missing values
    10.3.4 Mean imputation
    10.3.5 Model-based imputation
    10.3.6 Single stochastic imputation
    10.3.7 Multiple stochastic imputation
    10.3.8 Bayesian modelling
1 Introduction

1.1 Preliminaries, basic literature
These notes cover the essentials of the module. They are based on earlier notes by Rex
and Jane Galbraith, Christian Hennig, Gianluca Baio and Aidan O’Keefe.
The notes cover what you need to know to pass the exam. If you want to read more,
either because you want to know more or because you feel an alternative presentation of
some of the topics might help you to understand them, the following references might be
useful. The first one is certainly available as an e-book in the UCL library. Some of the
others may be too, though I have not checked.
• Survey planning (mostly fairly light on maths)
Kalton, G., Introduction to Survey Sampling. Sage, 2nd ed., 2021. Short and
practical, recently updated.
Converse, J. M., Presser, S., Survey Questions. Sage, 1986. A good reference for
more details about questionnaire design.
Fink, A., How to Conduct Surveys. Sage, 4th ed., 2009. Practical text mainly for
social scientists without much maths, but covering some interesting practical
issues. There are further relevant books on surveys by A. Fink, for example on
questionnaire design.
Fowler, Floyd J. Jr., Survey Research Methods. Sage, 3rd Edition, 2002. Similar
to the Fink book, and Fowler also wrote on “Improving Survey Questions”.
Hoinville, G., Jowell, R. & associates, Survey Research Practice. Gower, 1985.
Background reading on practical issues.
Moser, C.A. & Kalton, G., Survey Methods in Social Investigation. Gower, 2nd
Edition, 1985. Classic text on survey design, background reading.
• Measurement and Scaling
Allen, M. J., Yen, W. M., Introduction to Measurement Theory. Wadsworth 1979.
Classical psychometrical text on measurement theory.
Crocker, L., Algina, J., Introduction to Classical and Modern Test Theory. Wadsworth,
2006. Covers most of the measurement and scaling chapter, despite its title.
Like Allen and Yen, which covers similar material, it is driven by psychological
applications but useful for social statistics as well.
DeVellis, R. F., Scale Development. Theory and Applications. Sage, 2nd ed., 2003.
Interesting, not very mathematical.
Hand, D. J., Measurement Theory and Practice. Wiley, 2004. Very thoughtful
and interesting interdisciplinary book on measurement theory by a leading
statistician though overlap with this module is limited.
• Sampling theory
Barnett, V., Sample Survey: Principles and Methods. Arnold, 1991. Overlaps with
Scheaffer et al, but covers more statistical theory.
Cochran, W.G., Sampling Techniques. Wiley, 3rd Edition, 1977. A classic text on
sampling.
Scheaffer, R.L., Mendenhall, W & Ott, L. Elementary Survey Sampling. Wadsworth
(Duxbury Press), 5th Edition, 1996. Covers much of the module, though not
measurement and scaling, with a focus on sampling theory.
• Further reading on potentially interesting issues either not covered or only partially discussed in the module
Conrad, F. G., Schober, M. F. (eds.), Envisioning The Survey Interview Of The
Future. Wiley, 2007. A collection of papers discussing the impact of new
technologies and developments on surveys.
Everitt, B. S., Dunn, G., Applied Multivariate Data Analysis. Wiley, 2nd ed., 2001.
On visualisation and analysis of multivariate data occurring in surveys and social
science. Pretty much no overlap with this module!
Little, R. and Rubin, D., Statistical Analysis with Missing Data. Wiley, 1987.
Some of these books contain a lot of examples and interesting discussions. You may also
find more information on the internet.
1.2 Social sample surveys — Basic concepts
A fundamental part of statistical science is an attempt to make inferences about real-world
behaviour using data. We might consider using data to answer questions such as:
• What proportion of the population support the current government policy on healthcare, education etc.?
• What proportion of UK households own a car?
• What is the attitude of the UK population to the introduction of new legislation on
school holidays during term time, the minimum wage, taxation etc.?
How would we attempt to answer such questions?
Sometimes, it is possible to answer questions of interest using data from an entire population. For example:
• Routine birth registration records would allow us to assess the birth rate within the
UK in a particular year
In other scenarios, it would not be feasible to use data from the entire population. For
example:
• What proportion of the UK population regularly shop at a particular supermarket?
We don’t routinely record such information, and would asking this question of the entire
UK population be feasible or worthwhile?
The answer is almost certainly no. However, this does not imply that we would not
be able to produce an appropriate estimate of the proportion of people who shop regularly at a particular store. We need to think carefully about how we would go about
answering such a question. In short, we would need to consider surveying a suitable
sample of the population of interest. In other words, we would conduct a sample
survey.
Much of STAT0024 concerns how we approach the design of sample surveys and how
we would analyse the data resulting from such surveys, to produce accurate and robust
evidence when attempting to answer questions of interest in social research.
1.3 Module Outline
In this module, we shall concentrate on the following aspects of social statistics:
• An introduction to planning and practical aspects of social surveys, including questionnaire design;
• Methods for social measurement and scaling, particularly the measurement of attitudes using answers from questionnaires;
• Basic presentation and visualisation of data collected in social surveys;
• An introduction to statistical sampling theory for finite populations;
• An introduction to methods for dealing with missing data.
Finite population sampling theory is the core statistical theory for sample surveys. It
deals with the statistical properties of estimators calculated from samples that have been
drawn from finite populations using probability schemes for the sampling.
A major difference between the statistical theory taught in other modules and that encountered in STAT0024 is that in this module we will be drawing samples without
replacement from a finite population. In other scenarios, statistical models permit
potentially infinite repetitions of outcomes of interest. Here the number of possible samples, though it may be large, is finite. We are used to thinking of statistical parameters
such as the expected value of a normally distributed random variable, as being hypothetical quantities. In social statistics, some, though not all, of the numbers to be estimated
by surveys really do exist. We note that traditional statistical models for infinite repetition can sometimes be used approximately for sample surveys, if the population under
study is very large, the sample size is much smaller and the distribution of the quantity
of interest matches the model assumptions approximately. However, large parts of finite
population sampling theory do not necessarily require distributional model assumptions.
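The key points of the paragraph above — drawing without replacement from a finite population, and estimating a quantity that really exists — can be sketched in a few lines of code. This is only an illustration: the population values below are made up, and the standard error uses the finite population correction factor (1 − n/N) that the sampling theory chapters derive.

```python
import random
import statistics

# Hypothetical finite population: weekly shopping spend (in pounds)
# for N = 1000 households. Values are simulated purely for illustration.
random.seed(42)
population = [round(random.gauss(60, 15), 2) for _ in range(1000)]
N = len(population)

# Draw a simple random sample WITHOUT replacement, as in finite
# population sampling theory.
n = 50
sample = random.sample(population, n)

# Estimate the population mean, with a standard error that includes
# the finite population correction (1 - n/N).
y_bar = statistics.mean(sample)
s2 = statistics.variance(sample)  # sample variance with divisor n - 1
se = ((1 - n / N) * s2 / n) ** 0.5

print(f"estimated mean: {y_bar:.2f}, standard error: {se:.3f}")
```

Note that because n/N = 0.05 here, the correction shrinks the standard error only slightly; it matters more when the sample is a large fraction of the population.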
One important problem when working with surveys and samples is the calculation of
a suitable sample size in order to obtain a certain precision. In addition, we might wish
to compare parameter estimates under different possible sampling schemes, such as simple
random sampling and stratified sampling. Much of the module will be devoted to finite
population sampling theory and to questions of this type.
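As a foretaste of the sample size calculations mentioned above (treated properly in Chapter 7), here is a minimal sketch for one common case: choosing n so that a 95% confidence interval for a proportion has a given half-width, with a finite population correction. The margin of error, population size, and the conservative choice p = 0.5 are all illustrative assumptions.

```python
import math

def sample_size_for_proportion(e, N, p=0.5, z=1.96):
    """Sample size so a 95% CI for a proportion has half-width e,
    corrected for a finite population of size N."""
    n0 = z**2 * p * (1 - p) / e**2       # infinite-population size
    return math.ceil(n0 / (1 + n0 / N))  # finite population correction

# e.g. a margin of 3 percentage points in a population of 10,000
n = sample_size_for_proportion(0.03, 10_000)  # → 965
```

Using p = 0.5 maximises p(1 − p), so the resulting n is safe whatever the true proportion turns out to be.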
2 Planning a Social Survey
It is desirable that a statistician should be involved at the planning stage of any social
survey. This is so that any statistical problems with the survey design can be avoided.
Such problems could make the survey and its results unreliable or worthless, resulting in
a waste of valuable time and/or resources.
In this chapter, we discuss some basic statistical considerations relevant to planning a
sampling exercise. By sampling, we mean the collection of data from a subset of a population of interest in an effort to make inference about population-level outcomes of interest.
We shall consider types of error that can occur when sampling.
2.1 Some basic distinctions
It is generally useful to think about the following distinctions when planning a social
investigation.
2.1.1 Two areas of application of statistical design theory
Comparative experiments. Subjects or units are chosen in some convenient way and
are then allocated to receive different treatments according to some rule, often by
randomisation, with the aim of comparing the responses to the different treatments.
Example: clinical trial.
Sample surveys. Individuals or units are chosen in some way from a population with the
aim of making some inference representative of the population. Example: opinion
poll.
The more commonly made distinction here would be between comparative experiments,
where the experimenter makes some intervention by applying treatments, and observational studies, where the researcher simply observes the state of things. A sample survey
is one example of an observational study. The distinction is important because an observational study can establish associations; for example, you might find that people who
express positive attitudes towards caring for the environment tend to use less energy in
their homes, but can say very little about causation. There may be underlying unobserved factors that explain the association, i.e. both may be caused by something we
haven’t measured. On the other hand one might imagine an experiment in which 100
subjects were randomised to two groups, with one group getting information about the
environmental damage due to energy production and the other group getting information
about something unrelated. By monitoring the energy consumption of the two groups
before and after the intervention, it would be possible to make a much stronger inference
than we could from the survey. Of course such an experiment would be much more costly
and time consuming to carry out than the survey, and there are many areas of social
investigation where experiments are not possible for either practical or ethical reasons,
but it is always worth considering whether a comparative experiment might be preferable
before opting for observation.
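The randomisation step in the hypothetical energy-information experiment described above is simple to carry out in practice. A minimal sketch (subject identifiers and group sizes are illustrative assumptions):

```python
import random

# 100 hypothetical subject IDs for the energy-information experiment.
subjects = list(range(100))

# Randomise: shuffle, then split into two equal arms. One arm receives
# information about environmental damage from energy production; the
# other receives unrelated information.
random.seed(1)
random.shuffle(subjects)
treatment, control = subjects[:50], subjects[50:]
```

Because allocation is random, any systematic difference in energy consumption between the arms after the intervention can be attributed to the information given, which is exactly what an observational survey cannot do.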
2.1.2 Types of subject matters of social surveys
According to Moser and Kalton (1985), the subject matters of social surveys usually fall
into one of the following categories.
Demographic characteristics. This means matters such as family and household composition, gender, marital status, age and so on. Some basic demographic factors are
acquired in almost every social survey so that their association with the variables
of primary interest can be explored.
Social environment of people. Social, economic and ecological factors to which people
are subject, including occupation and income as well as housing conditions and social
amenities.
Activities of people. This refers to activities like leisure habits, consumerism or travelling rather than occupation, which forms part of the social environment.
Opinions and attitudes. A particular problem with opinions and attitudes is that the
more complex ones, e.g. attitude towards the welfare state, cannot be established
with a single question but need to be inferred from the responses to multiple questions. More about this in Chapter 4.
2.1.3 Types of objectives
Often the choice of the adequate methodology depends on the objective of a study. Here
is one possible typology of objectives:
Description. Descriptive studies are about collecting quantitative information about a population that is as precise as possible, for example for planning purposes. Official
statistics are descriptive in most cases. If sampling is used in a descriptive study,
statistical sampling theory is very important because it deals with the precision of
estimators.
Exploration. Explorative studies aim to find previously unknown patterns and to give
the researchers new ideas and information about a population or topic that has
not necessarily been investigated adequately. Such studies could deal with, for
example, cultural changes or with the determination of potential reasons for newly
observed phenomena. An example might be to investigate people’s reasons for
wearing or not wearing face masks during an epidemic. Explorative studies often
have a tentative character. They often work with non-probabilistic convenience
samples and qualitative open questions. Sampling theory is rarely important here.
Often, explorative surveys lead to clearer scientific hypotheses and theories which
can then be examined by a more focused survey.
Examination of scientific hypotheses/theories. Theories to be examined by such
studies should be formulated in a testable way, for example “energy consumption
depends much more strongly on the energy price than on attitudes towards the environment”. These studies are usually more focused than descriptive or explorative
studies. Statistical theory is almost always needed, because the results should be reliable. As noted above, comparative experiments may sometimes be a better choice
for this objective than sample surveys. A “scientific hypothesis” is formulated in
terms of the subject matter. A “statistical hypothesis” to be tested by a statistical
test is not exactly the same, but is often derived from a scientific hypothesis.
Evaluation and quality control. Example: UCL course evaluation questionnaires, but
also analysis of the effects of new laws, monitoring of costs.
Decision support. Example: product planning of a company, but also opinion polls for
political parties.
2.1.4 Possible sources of data
Carrying out a survey. How to design and carry out a survey is the main theme of
this module, so there will be much more on this later. Before you embark on this
approach, however, it is worth considering if the information you seek could be
obtained more easily from other sources.
Obtaining data from documentary sources. For example, data about marital status and occupation is available from population censuses or other official sources.
Data about numbers of visitors can be obtained from cultural institutions. Even individual data such as health and income tax records are held by hospitals and other
institutions. Such data, however, are sometimes not available because of data protection rules; this is becoming increasingly true. Also, data that have been collected
for a different purpose may not really fit the problem at hand.
Direct observations. Often valuable data can be directly observed instead of being
asked by a questionnaire. Examples are numbers of people using a particular tube
line, some visible housing conditions, the analysis of shopping baskets in marketing
research or the number of recyclable items in waste bins.
When questionnaires ask survey participants for facts that can be verified by direct
observation, the answers in the questionnaire often turn out to be unreliable. Therefore, direct observation of such facts is better whenever possible. Sometimes it is
possible but expensive or difficult to observe some facts directly, for example daily
time of watching TV in a household. In such a situation, answers to the questionnaire could be “validated”, i.e. checked, for a small subsample of the respondents
by direct observation.
Using data from other surveys. An elementary step in the planning of every survey
should be the search for already published related surveys in the literature. Sometimes, there has already been a survey with the same objective. In some other
situations, relevant data may have been collected in a survey with a different aim
or in the framework of a more general data collection. Even though such data may
not be perfectly appropriate for a problem different from the original purpose, it
can often save a large amount of money and effort to use existing data. Even if
you decide to run your own survey after all, the results of the other surveys may
help you with the design of yours. As we will see later, it helps at the design stage
to know things like variances of the quantities you will be measuring, and the data
from other surveys asking similar questions are a good source of such information.
2.1.5 Types of questioning
Personal interview with interviewer.
Advantages:
• Comparatively small proportion of non-responses.
• Conditions under which the answers are given are known and can be controlled
to some extent. For example it might be possible to experiment with giving
respondents different amounts of background information before asking a question.
• Problems with the interviewee’s understanding of the questions can be resolved.
Disadvantages:
• Expensive.
• Interviewer effects. Answers may depend on the interviewer’s precise wording,
the way they look or speak or treat the interviewee. Not all interviewers are
reliable and may cause bias by asking neighbours or friends instead of finding
the interviewees they were supposed to meet. Good training of the interviewers
is necessary to minimise interviewer effects.
Telephone interview.
Advantages:
• Same as for personal interview but cheaper. In particular it is much easier to
make repeated attempts to contact the interviewee.
Disadvantages:
• Interviewer effects, though interviewers can be monitored more easily.
• Telephone numbers change a lot more often than addresses.
• Many people are nowadays annoyed by commercial telephone calls, and some companies disguise advertising calls as survey calls, so many potential interviewees refuse to answer. Personal interviews are taken more seriously
by the interviewees.
Mail questionnaire (postal survey).
Advantages:
• No interviewer effects; all interviewees are treated in the same way.
• The interviewee does not have to be at home or be willing to answer their
phone at the moment they are contacted.
• Directories of postal addresses are usually more reliable than lists of telephone
numbers or email addresses.
Disadvantages:
• Usually high non-response rate; interviewees often have to be reminded more
than once to send back the questionnaire.
• Problems with the interviewee’s understanding of the questions are usually not
resolved.
Email questionnaire. Often this will be a web-based questionnaire in which people are
invited by email to participate.
Advantages:
• Cheap.
• No interviewer effects.
• The interviewee does not have to be at home when contacted.
Disadvantages:
• Usually very high non-response rate. Emails are much more easily ignored or
deleted than postal mail.
• Reliable email address directories do not exist. The population accessible by
email is often very severely biased compared with the target population the
researcher has in mind. Accessible here means that not only does a person
have an email address, but it is also possible for the researcher to find it.
• Problems with the interviewee’s understanding of the questions are usually not
resolved.
• People may drop out because of technical problems such as unstable internet
connections or because they are distracted by a text or call coming in whilst
they deal with the questionnaire on their phone. Many web-based questionnaires have design problems such as refusing to allow the user to continue
without having responded to a question to which they do not want to respond.
Asking the audience.
Questions may be asked on TV or radio programmes, on websites or via social media,
and the audience is asked to answer by calling, filling in a form on the website or
some other immediate method. Here, the relation of the respondents to any well
defined population of interest is unclear and statistical inference is rarely justified.
One exception is questionnaires on commercial websites. If the target population
is customers of the website, this is the most efficient and economical way to access
this population. What you won’t find out, of course, is why people don’t use your
website.
2.2 Basics about sampling

2.2.1 Why sample?
The short answer is ‘cost’, but one must be careful about this.
• The cost per element is usually much higher in a sample survey than in a complete
enumeration.
• The total cost, however, is usually lower. This is because the size of sample needed
to give a satisfactory answer is often very much lower than the population size.
• Sometimes it is the cost to the respondent that is important. For example, to keep
the total interview time manageable, we may ask respondents a subset of the total
questions asked.
2.2.2 Basic sampling terminology
A population is a collection of elements on which a measurement is taken, e.g. in an
opinion poll:

    element    → a voter
    population → all U.K. voters
Furthermore, there is a distinction between the target population about which we seek
information and the study population which is the population we actually study. They
may or may not be the same.
e.g. in market research conducted via email with a link to a web questionnaire,
target population: all individuals in a particular age group; study population: all
individuals who have email addresses available to the researchers and who can
access the linked questionnaire.
More precisely, in the case of questionnaires the study population is actually restricted
to those individuals who would respond to the questionnaire if asked, because there is
no way to gather information about potential non-respondents.
Usually the study population is a much more limited, accessible population whose
properties we hope we can extrapolate to the target population. The difference between
study and target population is fundamental: if the study population differs significantly
from the target population, generalising study results to it can lead to severe bias,
sometimes called “coverage bias”. This happens very often but is not easy to observe.
Sampling units are non-overlapping collections of elements that cover the entire population. A sampling unit might be an element (e.g., an individual voter) or a group of
elements, e.g. a household.
A sampling frame is a (typically imperfect) list of sampling units, e.g., a list of addresses,
an electoral roll, or a telephone directory. A sample is a collection of sampling units drawn
from a sampling frame.
A complete example: suppose we want to determine the proportion of unsafe car tyres.
Target Population: All tyres of all UK licensed cars
Study population: This will depend on the sampling scheme adopted, but it is hard to
imagine a scheme that could possibly access all of the target population, and easy
to imagine the biases that will arise if participation is voluntary
Sampling units: UK licensed cars
Sampling frame: Many possibilities, e.g. a list of all UK licensed cars, all cars passing a
particular point on a given day, etc.
Sample: The cars on which we take measurements.
2.2.3 Types of error
Sampling error. Random variation due to sampling scheme. For probability sampling,
the statistical properties of the sampling error can be estimated. This is a core topic
of the sampling theory introduced in Chapter 5.
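For probability sampling, the simplest illustration of quantifying sampling error is a proportion estimated from a simple random sample. A minimal sketch in Python (the function name is ours, and the normal approximation assumes a reasonably large sample; Chapter 5 develops this properly):

```python
import math

def proportion_with_ci(successes, n, z=1.96):
    """Estimate a population proportion from a simple random sample,
    with an approximate 95% confidence interval (normal approximation)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # estimated standard error
    return p, (p - z * se, p + z * se)

# e.g. 440 of 1000 sampled respondents answer "yes"
p_hat, (lo, hi) = proportion_with_ci(440, 1000)
```

The width of the interval shrinks with the square root of the sample size, which is why a sample far smaller than the population can still give a satisfactory answer.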
Non-sampling error. This mainly takes the form of bias, and arises either from problems with the design or implementation of the sampling scheme, or from non-response. Some common sources of bias are the following:
Selection bias
• frame ≠ population
• probability sampling scheme not used (e.g. convenience sampling),
or used but not followed properly
• non-response bias

Response bias
• measurements are problematic in some way, e.g. poor
question wording, misinterpretation of questions, sensitivity of
information, lack of memory or improper observation
• interviewer bias: interaction between interviewer
and interviewee influences the response
Non-sampling errors are difficult to estimate. If they can be estimated at all, then it
will usually only be by a separate investigation. Particularly in the case of attitude
and opinion measurement, but also elsewhere, it may sometimes be doubtful whether
there is any objective truth — which means that bias may be hard to define, let
alone estimate.
2.2.4 A note on non-response bias
Non-response is a very important problem in practice (we give a brief introduction
to the main statistical issues related to non-response, and missing data in general, in
Chapter 10). Many surveys have response rates well below 50% for the questionnaire as
a whole; non-response rates for certain questions may be even higher.
Not giving an answer to a question is essentially different behaviour from giving one.
Therefore it is never valid to assume that the respondents are representative of the
non-respondents and that the “true” answers of the non-respondents would be distributed
similarly. Reasons for non-response can be . . .
• . . . that the interviewee didn’t find that the suggested answer categories covered
their point of view. This can be because the interviewee holds a neutral or “do not
know” position and this category is not offered; but there are other possibilities for
a missing category,
• . . . that the interviewee didn’t understand the question,
• . . . that the interviewee didn’t want to admit their position honestly, but they also
didn’t want to lie, in which case a non-response is certainly better than a lie,
• . . . that the interviewee didn’t like the question for one reason or another, for example, he/she may have felt that his/her own viewpoint was formulated in a discrediting
way,
• . . . distraction or lack of time of the interviewee, especially if the questionnaire is a
long one.
All of these reasons could explain missing responses to single questions, but in some
situations the whole response is missing because of
• inability to contact the interviewee: unsuccessful telephone call, interviewer didn’t
meet interviewee at home, interviewee didn’t send back questionnaire.
It is usually a good strategy to contact the interviewee again to try to obtain the missing
response. It is also worthwhile to look at the demographic characteristics of the
non-responding part of the population, to see whether non-responses come mainly from
particular subgroups, for which the survey must then be interpreted as less informative.
Even with follow up, there will be many cases where the interviewee neither wants to give
an answer nor wants to explain why. There is always a proportion of non-responses that
is essentially irreducible, and nothing can be concluded about these people. It is therefore
important always to report the proportion of non-responses, and their characteristics, as
far as possible.
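Reporting the proportion of non-responses, and their breakdown by known characteristics, is easy to automate. A minimal sketch in Python (the function name, groups and records are invented for illustration; they are not from any survey package):

```python
from collections import Counter

def response_rates(records):
    """Response rate overall and per demographic group.
    `records` is a list of (group, responded) pairs."""
    asked = Counter(group for group, _ in records)
    answered = Counter(group for group, responded in records if responded)
    by_group = {g: answered[g] / asked[g] for g in asked}
    overall = sum(answered.values()) / len(records)
    return overall, by_group

# Invented example data: age group and whether the person responded
overall, by_group = response_rates([
    ("18-34", True), ("18-34", False),
    ("35-64", True), ("35-64", True),
    ("65+", False), ("65+", False),
])
```

A table like `by_group` shows at a glance whether non-response is concentrated in particular subgroups, for which the survey is then less informative.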
Generally, the study population to which a survey actually allows generalisation can only
include those individuals who would respond if they were asked, because we cannot study
those who do not or would not give us information. Note the use of “would” here; we
do not get the information from the non-respondents, and we cannot generalise the data
from the respondents to those who would not respond if asked. Most of these “potential
non-respondents” may not actually have been asked.
Experience shows that this construction is somewhat confusing, and the issue is therefore
ignored in most textbooks. In those books, the potential non-respondents are not excluded
from the study population by definition, but this results in an over-optimistic idea of the
possibility of generalisation.
2.2.5 When not to sample?
In some situations the full population is easily available, so we do not have to sample,
although sampling may still be reasonable to keep the cost of the survey down. In these
cases, probability models and statistical inference are not necessary; we just need
descriptive statistics to say something sensible about the quantities under study.
Example: If every participant of a course rates the course on the UCL questionnaire
on a scale between -2 and 2, it is meaningless to ask what the precision of the average
rating is and whether it would be significantly larger than 0, because there is no larger
population in which there would be an unknown “true” value. If not all students fill in the
questionnaire or not all are present it is still not a “sample” but a problem with missing
responses. It is not clear whether the respondents are in any sense representative of the
non-respondents.
Sometimes it may be possible to obtain data for a full study population, but one that deviates in some way from the target population that the researcher had in mind originally.
In such situations it may be preferable to draw a sample from the real target population,
assuming that this is possible, rather than to use the full data from a biased population.
3 Questionnaire Design and Data Visualisation
3.1 An introductory remark
Before discussing detailed aspects of questionnaire design it is important to note that
there may be deep philosophical issues with the attempt to measure something objective
by questionnaires. For example, you will encounter significantly different distributions of
responses depending on the chosen wording of what seems to be more or less the same
question in terms of content. Particularly when it comes to opinions and attitudes, it is
highly problematic to assume that there is a true or correct answer for an individual or a
best way to ask the question.
3.2 Answer formats
3.2.1 Open vs. closed questions
An open question is a question to which the respondent can answer with a self-formulated
text, while a closed question asks either for a number or for a choice of one or more
categories from a pre-defined list of categories.
The advantage of open questions is that the respondent is not restricted by the pre-defined
format of the answer. In many situations, it is impossible or very difficult for the designers
of the questionnaire to predict all possible answers. Example: “How do you think the
work in your department can be improved?” The main aim of a study using this question
may be exploratory, i.e., to find new ideas for improving the work. Obviously, this does
not work with pre-defined categories for the answers.
The disadvantage of open questions compared to closed questions is that they are not
suitable for statistical evaluation without complicated and labour intensive pre-processing,
and much information is typically lost in such pre-processing. For this reason the focus
of this module will be on closed questions.
Closed questions have another advantage: the offered answer categories help ensure that
every interviewee understands the question in more or less the same way, and they may
remind the interviewee of events that they would otherwise not recall. For example, it is
easier to select the magazines and newspapers you have read in the last year from a list
than to remember all of them when confronted with an open question.
Mixed formats are possible: “What is the most important issue for the government in the
next year? crime/environment/education/other, namely:. . . ”.
Sometimes it is useful to use open questions in a small preliminary survey and to switch to
pre-defined alternatives in the main survey, because the open question in the preliminary
survey can be used to find the alternatives that the respondents have in mind.
3.2.2 Types of closed questions
There are different types of closed questions:
Questions asking for a number such as “how old are you?”
Binary questions for which there are essentially only two answers. For example yes/no
or smoker/non-smoker.
Questions asking for one out of a list of non-ordered categories, not necessarily exhaustive, such as “What is the most important issue for the government
in the next year? crime/environment/education/other”
Questions asking for a position on an ordered scale such as “strongly agree / agree
/ neither agree nor disagree / disagree / strongly disagree” or “If the government
takes action against inflation, it may happen that unemployment goes up. Where
would you place yourself on a scale about what you think the government should
give priority to:
Reduce inflation 1 2 3 4 5 6 7 Reduce unemployment.”
Further types of questions can be seen as versions of the main types above.
Questions asking for more than one out of a list of categories such as “On which
of these activities did you spend more than one hour in the last three weeks? Reading/Sports/Listening to music/ Playing a musical instrument/Painting/Hiking/Meeting
friends in a pub”. Such questions, though not asked in a binary fashion, are typically evaluated as a series of binary questions: a yes/no variable is defined for each
of the categories, e.g., “reading: yes/no” and so on.
Categorized questions asking for a number such as “When did you last play a board
game? Less than three months ago/three months to one year ago/one to five years
ago/longer ago/never”. Usually this is not as suitable for statistical evaluation as
asking directly for the number; for example, statistics such as the mean, median or
standard deviation cannot be computed. It can, however, be preferable if the question
refers to events or numbers that are usually not remembered exactly.
Generally, there are situations where each of these possible formats is the most useful one.
A piece of general advice is not to mix questions with different formats too much, as this
can confuse the respondents and also make the evaluation more complex.
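Evaluating a multiple-answer question as a series of binary questions, as described above, can be sketched in a few lines of Python (the function and category names are illustrative, not from any survey package):

```python
def to_indicators(selected, categories):
    """Expand one multiple-answer response into one yes/no
    variable per offered category."""
    chosen = set(selected)
    return {category: category in chosen for category in categories}

categories = ["Reading", "Sports", "Listening to music", "Hiking"]
# A respondent who ticked "Reading" and "Hiking" becomes four binary variables
row = to_indicators(["Reading", "Hiking"], categories)
```

Each resulting yes/no variable can then be analysed like any other binary question.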
3.2.3 “Do not know”
It is possible to add a “do not know” or “cannot choose” option to all of the formats given
above.
If no such option is offered, people still can choose not to answer the question. The
problem with this is that there may be different reasons not to give an answer of which
“do not know” is only one. See also 2.2.4.
Empirical evidence suggests that the number of people ticking “do not know” when such
an option is offered is larger than the number who refuse to answer when there is no “do
not know” option. This implies that some people allow themselves to be forced into a
decision if “do not know” is not offered. This can be seen as a reason either for or against
a “do not know” option: pressuring people to answer may produce more responses, but
the value of those extra responses may be doubtful.
Not offering a “do not know” option suggests to some people that everybody should have
an opinion about the question, which may be seen as ethically problematic.
Note that, in terms of data analysis, “do not know” is essentially different from a neutral
or middle position in a symmetric scale. You can have a neutral position if you have
thought quite a lot about the topic, but then “do not know” does not capture your opinion
adequately. Therefore, the proportion of “do not know” answers can be interesting, but
the actual occurrence of a “do not know” should not be treated as a number in the middle
of a scale.
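In practice this means filtering out “do not know” before computing scale statistics and reporting its share separately, never recoding it as the scale midpoint. A minimal sketch (the function name and data are ours):

```python
def summarise_scale(responses, dk="do not know"):
    """Mean of the numeric scale answers, with the 'do not know'
    share reported separately (never coded as the midpoint)."""
    numeric = [r for r in responses if r != dk]
    dk_share = (len(responses) - len(numeric)) / len(responses)
    mean = sum(numeric) / len(numeric)
    return mean, dk_share

# Invented responses on a 1..5 scale, one "do not know"
mean, dk_share = summarise_scale([1, 2, 3, "do not know", 5])
```

Recoding the “do not know” as 3 here would pull the mean towards the midpoint and silently mix two different kinds of answer.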
3.2.4 Rating scales
Often, ordered rating scales are used to evaluate the strength of a respondent’s view on
a topic of interest. There are some decisions to be made about such scales, and there is
generally no agreement in the literature about these choices.
Descriptions or just numbers (and which numbers)? Here are two rating scales
that have appeared in examples before: “strongly agree/agree/neither agree nor
disagree/disagree/strongly disagree” and “Reduce inflation 1 2 3 4 5 6 7 Reduce unemployment.” One scale only describes the extremes and gives numbers in between,
the other one gives descriptions for all categories. I haven’t seen any empirical
evidence that suggests that this choice makes a big difference.
Many people are not familiar with rating numbers, so describing the categories may
lead to a better understanding. For more than five points on the scale, however,
finding adequate descriptions will be quite difficult. If these scales are eventually
analysed as numbers, which as discussed in Chapter 4 often happens, it is more
honest to use numbers. There are also questionnaires in which just a sequence
of boxes is offered with a description of the extremes and neither numbers nor
descriptions for the categories in between.
It is possible to use 0 for the middle category and positive and negative numbers (“-3
-2 -1 0 1 2 3” instead of “1 2 3 4 5 6 7”). This leads to mathematically equivalent,
though perhaps more intuitive, analyses. Again, this choice probably does not make
a big difference.
Rating scales are not always symmetric. An example of an asymmetric scale is
“How do you rate your general health? Very good/good/fair/poor”, but note that
this wording suggests that general health is, or should be, good rather than poor.
Asymmetric scales should therefore only be used if it is generally accepted that one direction is
preferred or seen as more normal than the other. If in doubt it is better to use a
symmetric scale.
Should a middle (neutral) category be offered? If an odd number of points is offered in a scale, the middle one is often a neutral category such as “neither agree
nor disagree”. Some authors advocate the choice of a scale with an even number
of points with the argument that people should not be encouraged to choose the
middle category as an easy way out. As with the argument over whether or not to
include “don’t know” options it is not obvious whether this is a good idea or not.
Generally, if it is reasonable to expect that some respondents would like to choose
a middle category, then a middle category should be offered, because the reliability
of a forced choice is doubtful.
Some authors (e.g., Converse and Presser, 1986) suggest omitting the middle
category and offering an additional question like “how strongly do you feel
about the issue?” instead. This assumes that the choice of the neutral category is
usually motivated by a weak intensity of emotion about the issue, which might not
be the case.
How many categories? Many different numbers of categories are used for ordered scales,
most often three, five, seven, or, if no middle ground is offered, two, four, six or ten.
Again, there is no generally accepted recipe here. Considerations are the degree
of differentiation a typical respondent will feel able to make, though this may vary
between respondents, and how precisely the researcher would like to discriminate
responses. The sample size is also relevant. If the sample is not large and the analysis is by category rather than via a numerical scale, then the use of large number
of categories may spread the data too thinly, so that we end up combining adjacent
categories for the analysis anyway.
3.3 Further aspects of questionnaire design
There are many further important aspects of questionnaire design, most of which lie in
the domain of psychology rather than statistics. Here is a short overview.
3.3.1 Question wording
The impact of the wording of a question on the result can be huge and it is not always
predictable. In a study, some subjects were confronted with question A
“Do you think the United States should forbid public speeches against democracy?”
and others were asked question B
“Do you think the United States should allow public speeches against democracy?”
Question A got 21.4% “yes” responses, which suggests that about the same percentage
should answer “no” to question B, but actually question B yielded 47.8% “no”s. This
example is taken from p. 67 of Scheaffer, Mendenhall and Ott (1996).
Many such examples are around in the literature, and it is perhaps not surprising that
many people doubt the reliability of the results of any questionnaire survey. Certainly
some media interpretations of poll results such as “75% of the people are against the
proposed EU constitution” depend heavily on the exact question wording, which is usually
not reported, and are therefore not to be taken too seriously.
Note that, having Section 3.1 in mind, it cannot be said that one of questions A and B
is better and the other one is worse in the sense that responses to one question would be
likely to be “less biased”. There is a rather subtle difference between saying “yes” to A
and saying “no” to B, because it is actually not the same to allow something explicitly,
and therefore encourage it to some degree, as it is not to forbid it. This difference could
explain the different results, at least partly. Therefore, the researcher has to decide what
they really want to know, and to take care that the question expresses this as precisely
as possible.
Here are some guidelines:
Be specific and precise The more specific and precise you are, the more likely it is
that all respondents will understand a question in more or less the same way.
Example: “Are you satisfied or dissatisfied with your canteen?” is not specific,
because this could refer to the prices, the service and/or the quality of meals. Better:
“Are you satisfied or dissatisfied with the prices of meals in your canteen?”.
“Do you favour or oppose gun control legislation?” is not precise; the answer may
depend on the precise content of a proposed gun control law.
Do not suggest a particular answer Some people are easily influenced if a question
is formulated in a non-neutral way. “Do you favour the use of capital punishment?”
can be expected to yield more answers in favour than “Do you favour or oppose the
use of capital punishment?” Another example: “Would you like to pay more tax?”
– certainly not vs. “Would you like to pay more tax to enable higher spending on
education?” – possibly.
Note that questions asking for agreement or disagreement with a statement are, on
average, more likely to receive agreement. Therefore, such questions should only be
used in a context where multiple questions of the same type about the same topic
are used, but asked so that agreement corresponds to opposite directions of opinion
in different questions.
Example from the British Social Attitudes Survey: “It is all right for a couple to live
together without intending to get married” and “People who want children ought to
get married”. Conservative people will tend to agree with the second question but
not with the first one. Having both questions in the same questionnaire will tend
to reduce the bias associated with a tendency to agree. Chapter 4 has more on the
combining of such questions in what is called a Likert scale.
Avoid multiple questions “Would you like to be rich and famous?” is not good, because the respondent might want to be rich but not to be famous.
Avoid ambivalent questions “A teenager can be just as good a parent as somebody
who is older – agree/disagree?” is not good, because “disagree” could mean that
the respondents see teenagers as particularly bad or particularly good parents.
Keep question wording simple A lot of examples appear in the literature, dealing
with issues such as asking “Do you think. . . ?” instead of “Is it your opinion. . . ?”,
avoiding double negatives, and so on. Moser and Kalton, 1985, cite “Has it happened
to you that over a long period of time, when you neither practised abstinence, nor
used birth control, you did not conceive?” as a particularly bad example.
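The earlier remark about pairing agree/disagree items so that agreement points in opposite directions can be sketched in code: opposite-direction items are reverse-coded before averaging. A minimal sketch (the item names and the 1..5 coding are our assumptions; Chapter 4 treats Likert scales properly):

```python
def likert_score(responses, reverse_coded, points=5):
    """Average a respondent's agree/disagree items (coded 1..points),
    reverse-coding items where agreement expresses the opposite opinion."""
    total = 0
    for item, value in responses.items():
        total += (points + 1 - value) if item in reverse_coded else value
    return total / len(responses)

# Strongly agreeing (5) with the first item and strongly disagreeing (1)
# with the second express the same attitude once the second is reverse-coded.
score = likert_score({"live_together_ok": 5, "ought_to_marry": 1},
                     reverse_coded={"ought_to_marry"})
```

Because half the items are reverse-coded, a blanket tendency to agree inflates some items and deflates others, so its effect on the combined score is reduced.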
3.3.2 Even more
• The question order can play a role. People tend to try to be consistent in their
responses to questions, and because they think about earlier questions first, later
answers can be influenced.
Example from United States, 1980, taken from Scheaffer, Mendenhall and Ott,
1996, p. 63.
A “ Do you think the United States should let Communist newspaper reporters
from other countries come in here and send back to their papers the news as
they see it?”
B “Do you think a Communist country like Russia should let American newspaper
reporters come in and send back to America the news as they see it?”
“yes”-answers when asked A then B: 54.7% for A, 63.7 % for B.
“yes”-answers when asked B then A: 74.6 % for A, 81.9 % for B.
One way to deal with this effect is to give different respondents differently ordered
questionnaires.
• Do not make the questionnaire longer than necessary. Skip any question
which is not clearly useful with respect to the aim of the study. You should waste
neither the respondent’s time nor your own time needed for the analysis. Long
questionnaires also lead to more respondents failing to reach the end.
3.4 Data Visualisation
It is often desirable to summarise results from questionnaires, as well as data collected
from other sources, using visual displays. By visual displays we usually mean charts,
graphs and plots, although some might consider a poster, showing details of the background of the survey/questionnaire, its motivation and data collection methods etc., to
be a ‘visual display’.
Visual displays can be an excellent method of communicating questionnaire and other
survey results. However, it is important to ensure that any visual display is effective and
we outline some guidelines with regard to visual display design:
• Objective: Consider the objective of your survey/questionnaire and decide which
data are the most important with regard to meeting this objective. It is unlikely
that any reader wants to see reported results on lots of outcomes so one should focus
on what is important.
• Intuition and understanding: A plot or graph should be designed to summarise
data in a way that is easy to understand. A reader should not have to spend a long
time looking at a graph to understand the results or data that are presented. A
plot or graphical display should make it possible to view important results or data
features easily and intuitively.
• Consistency: Don’t keep changing the type of plot used just for the fun of it. If
one style works for many of your plots, stick to it and make it easier for the reader
to appreciate them.
• Use of colour: Good or bad? Generally, colour can make plots/graphs eye-catching
and allow important features of data to stand out. However, too much colour can be
a mistake: it might confuse the reader and possibly detract from the main features
displayed within a plot. Some readers may be colour blind, so distinguishing
lines on a plot by red and green colouring is inadvisable. Finally, whatever beautiful
colours you use, some readers will end up looking at a monochrome printout, so
if at all possible the graph should still be interpretable when seen in greyscale.
• Axes: When presenting multiple plots of numerical data, eg for different subgroups,
axes should always be on exactly the same scale for different plots, so that groups
can be easily compared. If natural ranges for x and y axes exist, then these should
be used, where appropriate.
• Titles, Labels and Legends: Titles and axes labels should always be included on
plots. They should be informative, showing units of measurement where necessary.
Generally, it is a good idea that titles, labels and legends are not too lengthy.
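The axes guideline in particular is easy to enforce in code: compute one common range across all subgroups before plotting, so every panel uses the same scale. A small sketch (the function name and data are ours):

```python
def shared_axis_limits(groups):
    """One (min, max) pair covering every subgroup, so that plots
    for different subgroups can be drawn on identical axis scales."""
    lo = min(min(values) for values in groups.values())
    hi = max(max(values) for values in groups.values())
    return lo, hi

# Invented ratings for two subgroups; both panels would use the range (1, 5)
limits = shared_axis_limits({"under 35": [2, 4, 5], "35+": [1, 3, 3]})
```

The returned pair would then be passed to the plotting library's axis-limit setting for each subgroup's plot.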
Edward Tufte has written a number of very nice books on graphical presentation, with
both beautiful and horrid examples.
4 Measurement
The questionnaire has been called a “measurement instrument”. The term “measurement”
is used in a general way here. A measurement is an assignment of a number or category to
some attribute of a real world object or event. This is usually done to allow comparisons
between the objects or events, often using statistical methods.
The problems of measurement are often more complicated in the social sciences than in
physics, chemistry or engineering. What is being measured is not always well defined, and
even where it is, it is not always clear that the instrument, a questionnaire for example, is
measuring it in a valid way. A particular problem concerns the measurement of attitudes
or psychological constructs such as intelligence (from tests in which certain problems are
to be solved), but it also applies to constructs which refer to more directly observable
quantities such as criminality, deprivation, living quality of a certain area or inflation in
economics. Criminality, for example, is made up of events that are in principle directly
observable, but the term “criminality” itself is more abstract. To define it in an observable
way, a lot of decisions have to be made. Criminality is composed of different things, and
there are many possible ways to aggregate them. For example, a definition of criminality
as a one-dimensional measure requires a decision about the mathematical aggregation of
burglaries, murders and tax fraud, and in particular about the relative weight of these
events.
This section introduces some general ideas from the theory of measurement in the social
sciences.
4.1 Types of measurement scale
4.1.1 Basic concepts
In 1946 S. S. Stevens published a very influential, though quite controversial, classification
of different types of scales of measurement (“On the Theory of Scales of Measurement”,
Science, 103, 677–680). He was motivated by his impression that in psychology and the
social sciences in general many arithmetical and statistical operations were being carried
out that were not valid and thus meaningless. For example, many researchers compute
and compare arithmetic means of Intelligence Quotients (IQs) for certain groups of people.
This operation implicitly assumes that one person with IQ 100 and another with IQ 140
can, as a group, be considered as exhibiting similar intelligence to two persons both with
IQ 120, or, in other words, that the IQ difference between IQ 100 and IQ 120 is as large
as the difference between IQ 120 and IQ 140.
Such an assumption is certainly meaningful for some measurements. An example is length
– the difference between 100 feet and 120 feet is exactly the same as the difference between
120 feet and 140 feet. This is because there is a natural operation of adding lengths: one
distance can be put behind the other, and together they give a distance with both lengths
added. In other words, the addition of numbers can be said to represent the addition of
lengths. But there is no such natural addition operation for IQs of different people.
Stevens aimed to define which arithmetic operations could be carried out on measurements
in a meaningful way. By this he meant that “meaningful” operations should make use
only of those features of a measurement which carry information. Stevens’ classification of
scale types was based on the type of transformation that preserves the information in the
measurement. Mathematical and statistical operations on a particular measurement type
are only valid if they are invariant under the associated type of transformation. First a
formal definition:
Definition: The scale type of a measurement operation is defined by the family Φ
of transformations φ by which the measurements can be transformed without losing
their original information content. A statistical statement S about a set of measurements
{x1 , . . . , xn } is meaningful if it is invariant under Φ, i.e., for all φ ∈ Φ, S(x1 , . . . , xn ) holds
if and only if S(φ(x1 ), . . . , φ(xn )) holds.
This definition is very abstract, but should become clearer after a few examples. We begin
by listing the types of scales, which are similar to the types of closed questions discussed
in Chapter 3.
Nominal scales which distinguish two or more classes. Examples are male/female, tennis/football/basketball/swimming/athletics/other.
Ordinal scales for which the measurements carry an order, but there is no information
about the distance between the categories. Example: strongly agree/agree/neither
agree nor disagree/disagree/strongly disagree.
Interval scales for which comparisons of the distances between the values are meaningful, but there is no clearly defined zero point. Example: dates. The time between
1940 and 1970 is the same as between 1970 and 2000, but the zero point has no
arithmetical meaning; in most contexts the year 2000 is not in any sense twice as
large as the year 1000.
Ratio scales which additionally have a unique zero, so that ratios are meaningful. Example: lengths and distances; 200 miles is twice as far as 100 miles.
4.1.2 The properties of the standard types of scales
Here is how these scale types are characterized by the definition above (this forms a
hierarchy of scales from the least to the most informative):
(i) Nominal scales
Information: Two values are either equal or different.
Transformations preserving information: all one-to-one transformations, for example
the transformation of “tennis/football/basketball/swimming/athletics/other” to
“1/2/3/4/5/6”, but also to “4/5/1/3/6/2”.
Meaningful statements: statements including the frequencies of categories, such as
most frequent category, relative frequency (empirical probability). For example,
the statement “out of tennis/football/basketball/swimming/athletics/other, tennis
was chosen most often, in 44% of the cases, as favourite sport in our data” can be
transformed to “out of the categories no. 1,2,3,4,5,6, category 1 was chosen most
often, in 44% of the cases” (or category 4, if the other transformation above had
been used).
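The invariance of this nominal-scale statement can be checked directly. A small Python sketch (illustrative only; the 44% figure and the 4/5/1/3/6/2 recoding are taken from the example above, while the individual counts are made up so that tennis accounts for 44%):

```python
# The modal category and its relative frequency are invariant under any
# one-to-one recoding of nominal labels.
from collections import Counter

responses = ["tennis"] * 11 + ["football"] * 6 + ["basketball"] * 3 + \
            ["swimming"] * 2 + ["athletics"] * 2 + ["other"] * 1
# The second transformation from the text: tennis/football/... -> 4/5/1/3/6/2.
recode = {"tennis": 4, "football": 5, "basketball": 1,
          "swimming": 3, "athletics": 6, "other": 2}

def mode_share(xs):
    label, count = Counter(xs).most_common(1)[0]
    return label, count / len(xs)

label, share = mode_share(responses)
rlabel, rshare = mode_share([recode[x] for x in responses])
print(label, share)    # tennis 0.44
print(rlabel, rshare)  # 4 0.44  (the same category, relabelled)
```

The modal category changes name under the recoding, but it is the same category with the same relative frequency, which is exactly what "meaningful" means here.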
(ii) Ordinal scales
All nominal information is still valid and all meaningful statements for nominal scales are
still meaningful for ordinal ones.
Information: Two values are either equal or one of them is larger than the other.
Transformations preserving information: all order-preserving transformations, e.g. the
transformation of “strongly agree/agree/neither agree nor disagree/disagree/strongly
disagree” (sa/a/n/d/sd) to “1/2/3/4/5”, to “–2/–1/0/1/2” or to “1/17/120/122/2644”.
Meaningful statements: statements involving quantiles (such as the median) and rank
statistics (such as the Kruskal-Wallis or Wilcoxon-test). For example, “category d
is the median of our data” (which means that there were not more than 50% sd
and not more than 50% sa/a/n) can be transformed to “1 is the median of our data
transformed to -2/-1/0/1/2” (or 122, if the second transformation has been used).
(iii) Interval scales
All ordinal information is still valid and all meaningful statements for ordinal scales are
still meaningful for interval scales.
Information: Comparisons between differences of values are informative.
Transformations preserving information: positive linear: φ(x) = ax + b with a > 0.
Meaningful statements: comparative statements involving sums, means and variances.
Example: “the mean of (3, 4, 4, 5) is larger than that of (2, 2, 5, 5)” becomes “the
mean of (−4, −2, −2, 0) is larger than that of (−6, −6, 0, 0)” under φ(x) = 2x − 10.
Note that the mean of (120, 122, 122, 2644) is not larger than that of (17, 17, 2644, 2644),
so that mean comparisons are not invariant under all the monotone transformations
allowed for the ordinal scale.
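This non-invariance is easy to check numerically. The following Python sketch (an illustration, not part of the notes) applies the monotone recoding 1/2/3/4/5 → 1/17/120/122/2644 from above to the two samples in the example. Note that for ordinal data the median should be reported as one of the categories, so the lower median (`median_low`) is used rather than the interpolated median:

```python
# Mean comparisons are not invariant under monotone (order-preserving)
# recodings of an ordinal scale, while (category) medians are.
from statistics import mean, median_low

recode = {1: 1, 2: 17, 3: 120, 4: 122, 5: 2644}  # monotone recoding from the text

a = [3, 4, 4, 5]
b = [2, 2, 5, 5]
ta = [recode[x] for x in a]   # [120, 122, 122, 2644]
tb = [recode[x] for x in b]   # [17, 17, 2644, 2644]

print(mean(a) > mean(b))      # True: 4.0 > 3.5
print(mean(ta) > mean(tb))    # False: the mean comparison is reversed
print(median_low(a) > median_low(b))    # True before recoding...
print(median_low(ta) > median_low(tb))  # ...and still True afterwards
```

The lower median commutes with any order-preserving recoding, which is why the median comparison survives while the mean comparison does not.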
(iv) Ratio scales
All interval information is still valid and all meaningful statements for interval scales are
still meaningful for ratio scales.
Information: Ratios between values are informative.
Transformations preserving information: positive proportional: φ(x) = ax with a > 0.
Meaningful statements: Statements involving products and ratios of values. Example:
“a flat in London is, on average, twice as expensive as a flat of the same size in
Stockholm” – whatever currency (proportional transformation) is used to measure
this.
This is all very elegant as a theory, but what happens in practice is that data that are
only on an ordinal scale are often analysed as though they were on an interval scale. For
example, responses on a five point ordinal scale are often coded to the integers 1 to 5 and
then means are compared across groups. The Likert scales discussed in the next section
do exactly this. Maybe it matters (it certainly does sometimes), maybe we can get away
with it so long as Stevens isn’t looking.
4.2 Attitude measurement with the Likert technique
4.2.1 Basic principle
Measuring attitudes, for example the attitude of someone towards the welfare state, is
not straightforward when the context is not a simple one. A person might feel positive
about some aspects and less positive about others. One popular technique for tackling
this problem is due to the American psychologist Rensis Likert (1903-1981). The most
basic construction of a Likert scale works as follows:
1. Construct several items (questions) that ask for several aspects of the attitude
you want to measure. The questions should all have the same answer format,
namely an ordered scale with a constant number of categories such as “strongly
agree/agree/neither agree nor disagree/disagree/strongly disagree”. A five-point
scale is used most often, but the method also works with other numbers of categories.
Example: Imagine that the questionnaire consists of three questions:
(a) People receiving social security are made to feel like second class citizens.
(b) The government should spend more money on welfare benefits for the poor,
even if it leads to higher taxes.
(c) If welfare benefits weren’t so generous, people would learn to stand on their
own feet.
with answers on a five point strongly agree/. . . /strongly disagree scale.
Normally there would be quite a few more questions, but three is enough to illustrate
the idea.
2. Now code the responses as 1/2/3/4/5 (or you could use -2,-1,0,1,2). We need to be
careful here as these questions don’t all point in the same direction. Agreement with
(a) and (b) indicates a positive attitude towards the welfare state, but agreement with
(c) suggests a negative attitude. This is called the polarity of the question. We could
say that (a) and (b) have positive polarity and (c) has negative polarity. To make
it so that averaging over questions makes sense we code (a) and (b) as 1/2/3/4/5
and (c) as 5/4/3/2/1. Thus strongly agree codes to 1 for (a) and (b), while strongly
disagree codes to 1 for (c). We shouldn’t try to make life simpler by making all
our questions have the same polarity but rather try to balance them, thus avoiding
bias caused by people having a preference for one end or other of the verbal scale.
Note that which direction is labelled as positive and which as negative is a matter
of choice, as is the direction of the numbering. All that matters is that we are
consistent and that we remember what we have done when it comes to interpreting
the results.
3. To compute the Likert score Li for respondent i, just compute the mean of the codes
of their answers.
If a person answers (a) with “strongly agree”, (b) with “disagree” and (c) with
“disagree”, their Likert score would be Li = (1 + 4 + 2)/3 = 2.33. Remember that
(c) has negative polarity and “disagree” is therefore coded as 2.
Note that in the literature Likert scales are often defined as sums instead of means.
However, means have two advantages:
1. means lie in the same value range as the original codes, which makes them easier to
interpret,
2. in case of missing values, or “do not know” if such a category has been offered, the
mean is more sensible. It is natural to ignore missing values for the computation
of the sum, which means that they are effectively coded as 0. If the code numbers
are all positive (1/2/3/4/5), this makes the sum score smaller than it should be,
and even in the case of codes -2/-1/0/1/2, treating missing values as zeros may not be
appropriate either. But if the mean is computed, the sum can be divided by the
number of questions answered, and this means that the missing values and do not
knows really do not have an influence on the final score. Had the person in the
example above not answered (c), their mean score would be Li = (1 + 4)/2 = 2.5.
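The scoring steps above can be sketched in code. This Python snippet is only an illustration (the item names and the sa/a/n/d/sd coding follow the running example; nothing about the implementation is prescribed by the notes): reverse-code negative-polarity items and average over the answered items only.

```python
# Likert scoring sketch: code strongly agree..strongly disagree as 1..5,
# reverse the codes for negative-polarity items, and average over the
# items actually answered, so missing values do not distort the score.
CODES = {"sa": 1, "a": 2, "n": 3, "d": 4, "sd": 5}

def likert_score(answers, polarity):
    """answers: dict item -> response code or None (missing / do not know);
    polarity: dict item -> +1 (positive) or -1 (negative)."""
    coded = []
    for item, resp in answers.items():
        if resp is None:          # ignore missing values
            continue
        c = CODES[resp]
        if polarity[item] < 0:    # negative polarity: code as 5/4/3/2/1
            c = 6 - c
        coded.append(c)
    return sum(coded) / len(coded)

polarity = {"a": +1, "b": +1, "c": -1}

# The respondent from the text: (a) strongly agree, (b) disagree, (c) disagree.
print(likert_score({"a": "sa", "b": "d", "c": "d"}, polarity))  # 7/3, approx. 2.33

# The same respondent with (c) missing: the mean over the two answered items.
print(likert_score({"a": "sa", "b": "d", "c": None}, polarity))  # 2.5
```

Dividing by the number of answered items, rather than a fixed k, is exactly the mean-score advantage discussed above.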
4.2.2 Item selection and check of polarity
Once we have administered the questionnaire to some subjects, possibly in a modestly
sized pre-test before running the full survey, we can try to find out whether the items are
really suitable to measure a common concept, i.e. whether they are consistent with the
general scale. This includes a check whether a wrong polarity has been chosen for some
of the items.
The standard way to do this is to compute correlation coefficients between the individual items and the overall Likert score. For two paired samples x = (x_1, ..., x_n), y = (y_1, ..., y_n) (here the pair (x_i, y_i) belongs to person i and n is the number of participants), the (sample) correlation coefficient is

$$ r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}. $$
We compute this correlation with x the data for one question and y the overall Likert
score, repeating the calculation for each question in turn. If the polarity of all items is
correct, all their correlation coefficients with the overall score should be positive. If one
is not, it is easy enough to fix the problem by reversing the coding for that item.
The correlations, after any mistakes with polarity have been fixed and the correlations
recomputed, can also be used to weed out questions that have very low correlation with
the overall score. The idea is to end up with a set of questions that are all measuring
roughly the same thing. The threshold for removing questions is pretty arbitrary, but a
correlation of less than 0.4 is sometimes used as a reason for removal.
As an improvement, the correlations could be computed not between an individual item
and the overall score, but between the item and the score which would result from aggregating only the other items. These correlations would be a bit smaller.
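This screening procedure can be sketched in Python (an illustration, not part of the notes; the coded responses are made up, and item 3 has deliberately been entered without reversing its negative polarity): compute each item's correlation with the overall score, flag negative correlations as polarity mistakes, and flag correlations below the 0.4 threshold as candidates for removal.

```python
from math import sqrt

def corr(x, y):
    # Sample correlation coefficient as defined in the text.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up coded responses: rows are respondents, columns are three items.
# Item 3 was coded without reversing its negative polarity (a mistake).
data = [
    [1, 2, 5],
    [2, 2, 4],
    [4, 5, 1],
    [5, 4, 2],
    [3, 3, 3],
]
k = len(data[0])
scores = [sum(row) / k for row in data]   # overall Likert scores

for j in range(k):
    item = [row[j] for row in data]
    r = corr(item, scores)
    flag = " (check polarity!)" if r < 0 else " (consider removing)" if r < 0.4 else ""
    print(f"item {j + 1}: r = {r:.3f}{flag}")
```

Running this flags item 3 as negatively correlated with the overall score; reversing its coding and recomputing would fix the polarity mistake, as described above.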
4.2.3 Discussion
Obviously, several subjective decisions are needed to construct a Likert scale, and although
there is a certain test of internal consistency, the quality of this measurement instrument
is not entirely convincing. It is at least easy to apply, and more complicated alternatives
cannot solve all the problems.
Note that in Section 4.2.2 it is only tested to what extent the individual items are consistent with the major tendency of all items. Thus we are not testing whether the items
measure what they are supposed to measure, just that they appear to be measuring more
or less the same thing.
Statistical methods assuming interval scale level are often applied to Likert scales, and in
fact the assumption of an interval scale is implicit in the averaging over questions. This
may or may not be appropriate.
4.2.4 Desirable properties of measurement instruments
A measurement instrument should ideally satisfy the following two properties:
Reliability. If the same situation is measured more than once, the measurement always
gives the same, or at least a very similar, result.
Validity. The measurement measures what it is supposed to measure. For example, the
welfare scale discussed above is valid if it properly represents the attitude of the
individuals toward social services and the welfare state.
Unfortunately, neither property can be measured directly when the instrument is a
questionnaire.
4.2.5 Reliability
The intuitive approach to measuring reliability would be to apply the same measurement
instruments twice to the same individuals. Reliability would then be quantified by the
correlation between the pairs of measurements.
In our context the measurement instrument is a questionnaire. Obviously there are problems with this intuitive approach. If an individual has to fill in the same questionnaire
twice, it can be expected that the results of the second measurement are influenced
strongly by the first measurement. Individuals may remember their first answers and
try to appear consistent, or they may start to think more about the issues as a result of
reading the first questionnaire. Another effect could be that they do not take a second
interview with an identical questionnaire seriously.
Many of these problems have a weaker impact if the second measurement is taken much
later, for example half a year later. But then the individuals can no longer be treated
as “the same”. Whatever may have happened in the meantime may have influenced
their attitude and therefore a different measurement result may be obtained even with a
perfectly reliable measurement instrument.
Therefore, pure reliability is essentially not observable.
It is possible, though, to estimate the reliability of a measurement instrument which is
defined by the addition of several “sub-measurements” (such as the items in the Likert
scale) under some strong assumptions. The idea is that the variability of a measurement
instrument can be estimated from the variability between its parts. This will be discussed
in Section 4.2.7.
4.2.6 Validity
There is an obvious philosophical problem with the validity concept. Validity is about
the relation of our measurements to the “underlying truth” (e.g., the underlying real
attitude) that we would like to measure. But as long as we do not know how to measure
this underlying truth, we cannot observe anything objective about this relation. Often
it is not even clear whether such underlying truth exists. Therefore, pure validity is
essentially not observable either.
Some scientists in this situation choose the easy way out and define an attitude or ability
by the result of the measurement. This leads to statements like “intelligence is what
intelligence tests measure”. An intelligence test would then be perfectly valid by definition. But if intelligence were nothing more than the result of an intelligence test, there
would be no reason at all why anybody in society should be interested in measuring intelligence. Therefore, a measurement has at least to be related to some aspects of the idea
of intelligence that are relevant in the real world.
Validity can be assessed in a theoretical way by exploring how the measurement instrument is actually related to the theoretical concept of what it is supposed to measure. In the
literature this is split up into several parts, for example “construct validity” and “content
validity”. This is in the realm of philosophy rather than statistics and we will not pursue
these ideas.
A more practical way is the comparison of the measurements with other observable criteria
that can be expected to be related to the underlying concept of interest. For example, the
results of intelligence tests could be correlated with school grades. A scale purporting to
measure the attitude toward the protection of the environment could be correlated with
observable behaviour such as use of cars, bikes or public transport and energy bills.
In the literature, this is called “predictive validity” if the observable criterion is measured
in the future, or “concurrent validity” if it is measured at the same time.
The essential philosophical problem with validity remains unsolved. If school grades do
not agree well with the results of intelligence tests, the intelligence test may be invalid,
but alternatively school grades could have less to do with intelligence than expected. If
the results agree, it may still be that both concepts measured the wrong thing.
4.2.7 Test theory
The term “test theory” refers to psychological tests, for which this theory was first developed. This statistical theory can be used to derive some interesting results about reliability
and validity, though the assumptions of the classical test theory are quite dubious. We
look only at some results on reliability here.
Suppose X1 and X2 are two measurements of the same property, e.g. an attitude, on the
same person, taken by the same measurement instrument. Assume
$$ X_1 = T + E_1, \quad X_2 = T + E_2, \quad \rho(E_1, E_2) = \rho(E_1, T) = \rho(E_2, T) = 0, \tag{1} $$

where $\rho(X, Y) = \mathrm{Cov}(X, Y)/\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$ is the (theoretical) correlation between the random variables $X$ and $Y$.
Here $T$ is interpreted as the (non-observable) true value and $E_1$ and $E_2$ are the measurement errors, which are assumed to be uncorrelated with the true value and to satisfy $E(E_i) = 0$, $i = 1, 2$, the usual assumption for measurement errors.
Then, ρ(X1 , X2 ) is called the reliability of the measurement instrument yielding the
measurements X1 and X2 .
Now let $Y_1 = \sum_{i=1}^{k} X_{i1}$ and $Y_2 = \sum_{i=1}^{k} X_{i2}$ be two measurements that are constructed by summing up $k$ items (“sub-measurements”) $X_{11}, \ldots, X_{k1}$ and $X_{12}, \ldots, X_{k2}$, respectively. The $Y$s could be Likert sum scores or indeed mean scores, because dividing all values by the same constant does not change anything. Assume that the $X_{11}, \ldots, X_{k1}, X_{12}, \ldots, X_{k2}$ are all of the form (1) with measurement errors uncorrelated with each other. Assume that all items have the same correlation with each other, so for some constant $\rho_0$

$$ \rho(X_{ij}, X_{hl}) = \rho_0 \tag{2} $$

for $i, h = 1, \ldots, k$ and $j, l = 1, 2$, except when $i = h$ and $j = l$, i.e. when the two $X$s are the same. This is, in most situations, a quite unrealistic assumption.
To make the formulae in the proof below easier, assume further that

$$ E(X_{ij}) = 0, \quad \mathrm{Var}(X_{ij}) = E(X_{ij}^2) = 1 \tag{3} $$

for all $i, j$, implying that $E(Y_i) = 0$ and $\rho_0 = E(X_{ij} X_{hl})$. The theorem below can be proved without these simplifying assumptions, but the algebra is much messier.
Theorem Under the assumptions above,

$$ \rho(Y_1, Y_2) = \frac{k \rho_0}{1 + (k - 1)\rho_0}. $$
Proof: Because of (3),

$$ \mathrm{Cov}(Y_1, Y_2) = E(Y_1 Y_2) = E\left( \sum_{h=1}^{k} \sum_{i=1}^{k} X_{h1} X_{i2} \right) = \sum_{h=1}^{k} \sum_{i=1}^{k} E(X_{h1} X_{i2}) = k^2 \rho_0 $$

and

$$ \mathrm{Var}(Y_1) = \mathrm{Var}(Y_2) = E(Y_1^2) = E\left( \sum_{h=1}^{k} \sum_{i=1}^{k} X_{h1} X_{i1} \right). $$

The double sum has $k$ terms with $h = i$, all equal to 1, and $k(k - 1)$ terms with $h \neq i$, all equal to $\rho_0$, so that

$$ \mathrm{Var}(Y_1) = \mathrm{Var}(Y_2) = k + k(k - 1)\rho_0 $$

and

$$ \rho(Y_1, Y_2) = \frac{\mathrm{Cov}(Y_1, Y_2)}{\sqrt{\mathrm{Var}(Y_1)\mathrm{Var}(Y_2)}} = \frac{k^2 \rho_0}{\sqrt{(k + k(k - 1)\rho_0)^2}} = \frac{k \rho_0}{1 + (k - 1)\rho_0}. $$
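The theorem can also be checked by simulation. This is a Python sketch, not part of the notes: the construction $X_{ij} = \sqrt{\rho_0}\,T + \sqrt{1-\rho_0}\,E_{ij}$ with standard normal $T$ and $E_{ij}$ is just one convenient way to produce unit-variance items satisfying assumptions (1)-(3) with pairwise correlation $\rho_0$.

```python
import random
from math import sqrt

random.seed(1)
k, rho0, n = 5, 0.3, 20000   # items per score, item correlation, subjects

def corr(x, y):
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

y1, y2 = [], []
for _ in range(n):
    t = random.gauss(0, 1)   # shared "true value" component for this subject
    items1 = [sqrt(rho0) * t + sqrt(1 - rho0) * random.gauss(0, 1) for _ in range(k)]
    items2 = [sqrt(rho0) * t + sqrt(1 - rho0) * random.gauss(0, 1) for _ in range(k)]
    y1.append(sum(items1))
    y2.append(sum(items2))

theory = k * rho0 / (1 + (k - 1) * rho0)   # about 0.68 for these values
print(round(theory, 3), round(corr(y1, y2), 3))  # simulated value should be close
```

With 20000 simulated subjects the sample correlation of the two sum scores should sit very close to the theoretical value $k\rho_0/(1 + (k-1)\rho_0)$.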
This means that the reliability of a measurement defined as a sum (or mean) of items such as the Likert scale $Y = \frac{1}{k} \sum_{i=1}^{k} X_i$ can be estimated by the so-called Cronbach’s $\alpha$,

$$ \alpha = \frac{k \bar{r}}{1 + (k - 1)\bar{r}}, $$

where $\bar{r}$ is the mean of all (sample) correlation coefficients between the items $X_i, X_j$, $i \neq j$. A general rule of thumb is that Cronbach’s $\alpha$ should be larger than 0.8 for reliable measurement instruments.
Example: To estimate the reliability of a welfare scale, the correlation matrix of its seven items is computed:

        X1     X2     X3     X4     X5     X6     X7
X1   1.000  0.260  0.543  0.460  0.541  0.294  0.235
X2   0.260  1.000  0.263  0.468  0.403  0.153  0.149
X3   0.543  0.263  1.000  0.352  0.415  0.335  0.091
X4   0.460  0.468  0.352  1.000  0.519  0.224  0.198
X5   0.541  0.403  0.415  0.519  1.000  0.286  0.272
X6   0.294  0.153  0.335  0.224  0.286  1.000  0.171
X7   0.235  0.149  0.091  0.198  0.272  0.171  1.000

The mean of the off-diagonal entries is 0.316, which yields

$$ \alpha = \frac{7 \times 0.316}{1 + 6 \times 0.316} = 0.764. $$

This is not very high.
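The calculation for this example is easy to reproduce. A minimal Python sketch (illustrative only; the entries are the correlation matrix above):

```python
# Cronbach's alpha from the mean off-diagonal correlation:
# alpha = k * rbar / (1 + (k - 1) * rbar).
R = [
    [1.000, 0.260, 0.543, 0.460, 0.541, 0.294, 0.235],
    [0.260, 1.000, 0.263, 0.468, 0.403, 0.153, 0.149],
    [0.543, 0.263, 1.000, 0.352, 0.415, 0.335, 0.091],
    [0.460, 0.468, 0.352, 1.000, 0.519, 0.224, 0.198],
    [0.541, 0.403, 0.415, 0.519, 1.000, 0.286, 0.272],
    [0.294, 0.153, 0.335, 0.224, 0.286, 1.000, 0.171],
    [0.235, 0.149, 0.091, 0.198, 0.272, 0.171, 1.000],
]
k = len(R)
off = [R[i][j] for i in range(k) for j in range(k) if i != j]
rbar = sum(off) / len(off)                  # mean off-diagonal correlation
alpha = k * rbar / (1 + (k - 1) * rbar)
print(round(rbar, 3), round(alpha, 3))      # 0.316 0.764
```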
Another implication of the theorem above is that the larger the number of items k, the
larger the reliability (assuming that the correlation between the items is always the same).
For example, if we had 14 items with a mean correlation of 0.316 instead of just seven,
the reliability would have been 0.866.
The increase with $k$ can be shown by computing, for fixed $\bar{r}$,

$$ \alpha'(k) = \frac{\bar{r}(1 + (k - 1)\bar{r}) - k\bar{r}^2}{(1 + (k - 1)\bar{r})^2} = \frac{\bar{r} - \bar{r}^2}{(1 + (k - 1)\bar{r})^2} > 0 $$
because |r̄| < 1. This suggests that all we have to do is have lots of items. However, if you look back at the assumptions that went into this calculation, the measurement errors on the individual items are assumed independent between items. If two items are too similar, i.e. the two questions are virtually the same, this will clearly not be true. So we need lots of essentially different items for the theory to work. This is a lot harder to achieve.

Just to confuse matters, there is an alternative formula for Cronbach’s $\alpha$, given by

$$ \alpha = \frac{k}{k - 1} \left( 1 - \frac{\sum_{j=1}^{k} s_j^2}{s_Y^2} \right). $$

Here:

$s_j^2$ = sample variance of the $j$th item
$s_Y^2$ = sample variance of the calculated scores

with the variances calculated across $n$ subjects in each case. The previous formula is unaffected by whether $Y$ is a sum or a mean, but this one assumes the scores $Y$ are sums rather than means. The $s_Y^2$ in the denominator would need to be multiplied by $k^2$ if $Y$ is a mean of $k$ items rather than a sum.

Under the same unreasonable assumptions, this formula also estimates the correlation between replicate measurements of $Y$. To see this, note that $s_j^2$ estimates the variance of the $j$th item, which is assumed to be 1, and $s_Y^2$ estimates the variance of $Y$, which is shown above to be $k + k(k - 1)\rho_0$. Thus $\alpha$ estimates

$$ \frac{k}{k - 1} \left( 1 - \frac{k}{k + k(k - 1)\rho_0} \right) = \frac{k \rho_0}{1 + (k - 1)\rho_0}. $$

Which of the two formulas you choose to use probably depends more on which is easier to compute in a given context than on any theoretical advantage of one or the other.

5 Introduction to Sampling Schemes

Much of social statistics concerns the use of a sample to provide inference about variables of interest in a population. Often we think of theoretical ‘true’ parameters of interest in the population, and we aim to estimate such parameters using data collected on a sample from the population. For example, we might be interested in the population mean (µ) for a variable of interest. If possible, an unbiased and precise estimate of the parameter of interest is desirable.
We should aim to ensure that any sampling mechanism allows us to achieve this aim. We now briefly outline some common sampling mechanisms before further discussion of the simplest method of probability sampling, known as simple random sampling, and some of its more complex variants.

5.1 Types of sampling scheme

Availability/convenience sampling. This is when we sample only those individuals/units that are immediately available at the point in time when sampling occurs. Clearly, this is easy to perform, though not necessarily desirable as the resulting sample may not be representative of the population of interest.

Quota sampling. This occurs when we select a sample to ensure balance over some pre-defined characteristics. Example: if the target population of a questionnaire is the population of British adults, quota sampling tries to collect a sample in which the proportions of the different age groups, males and females etc. are the same as among all British adults. This method sounds intuitively sensible, although if the quotas are met simply on the basis of the availability of units, as they often are, this method would represent availability sampling according to a fixed quota.

Snowball sampling. This is a sampling technique in which a sample is constructed by existing study subjects who recruit further study subjects from among their acquaintances. The sample appears to grow like a rolling snowball, building in size until enough data are gathered for the research. This method is often used in surveys of hidden or hard-to-reach populations (e.g. injecting drug users, illegal immigrants). However, sample members are not selected from a sampling frame, and so snowball samples are subject to numerous biases (e.g. people who have a lot of friends/acquaintances in the population of interest might be more likely to be recruited).

Probability sampling. Select a sample with some pre-defined probability.
Common methods:

Simple random sampling: Each sample has equal probability of being chosen. Example: a random number generator can be used to draw a simple random sample of employees of a company from a complete list.

Stratified random sampling: Draw a simple random sample within each of several “strata”. Example: for the target population of British adults, several strata are pre-defined, such as all possible combinations of several age groups and male/female. The proportions of elements from these strata are fixed in advance, possibly, but not necessarily, so that they coincide with the proportions among all British adults, and simple random sampling is applied within the strata. This is different from quota sampling, where the quotas are fulfilled without applying random sampling.

Cluster sampling: Draw a sample of units each of which contains multiple elements, then either take all the elements in the sampled clusters or sample from each of them. The key difference between this and stratified sampling is that in the latter we sample from all strata, whereas in cluster sampling we only investigate a sample of clusters. Example: if our target population is the inhabitants of English cities, it will be much cheaper to sample some cities rather than having to visit them all.

Systematic sampling: Take every kth element of an ordered population. Example: call every 100th number from a phone directory, possibly with a random starting point.

5.2 Some history of opinion polls

In the USA in 1936, “Literary Digest”, a popular magazine, mailed questionnaires to a list of car and phone owners to predict the outcome of the presidential election between Roosevelt (Democrat, incumbent) and Landon (Republican). 10.3 million questionnaires were sent out, of which 2.3 million were returned.

Prediction: 40% Roosevelt v 60% Landon
Result: 62% Roosevelt v 38% Landon

Why?
Possible reasons:

• List not representative (car owner ≠ voter)
• Non-response bias (Roosevelt voters did not answer the questionnaire, a common phenomenon for incumbent supporters).

The second reason seems more important, and is typical in “voluntary response” situations. People who feel strongly about issues tend to respond, and this may (or may not) introduce bias.

Consequence: Self-selected samples were superseded by “quota” samples. These are chosen so that they are representative of different population characteristics, such as age, gender, occupation, etc. This was partly because Gallup, based on a quota sample, predicted the election result quite well: 56% Roosevelt v 44% Landon.

Quota sampling was used in opinion polls (and is still widely used in surveys) until 1948, when in the election between Truman (Democrat) and Dewey (Republican), opinion polls predicted a Dewey victory by 5–15%, but Truman won by 4%. Why?

Possible reason:

• Quota sampling, if used in the context of availability sampling, does not solve all the problems with availability sampling: although interviewers did obtain a sample representative in terms of age, gender etc., other characteristics are important and uncontrollable (approachability, accessibility). In quota sampling the choice of respondents is not random but depends on who passes by or who is accessible to the interviewer.

Nowadays, probability sampling is used when practicable. A rough general description of the process, for drawing a sample of size n, is as follows. For a population of size $N$ and a fixed sample size $n$ there are a finite number $k$ of possible samples, where

$$ k = {}^{N}C_{n} = \frac{N!}{n!(N - n)!}. $$

The number $k$ represents the possible combinations of $n$ elements randomly chosen from a set of $N$ units. Label these samples $S_1, S_2, \ldots, S_k$. Assign probability $\pi_i$ to $S_i$, where $\sum_i \pi_i = 1$, but the $\pi_i$ are not necessarily equal. Choose sample $S_i$ with probability $\pi_i$.
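The general process just described can be sketched in code. This is a toy Python illustration (the five-person population and the equal probabilities are made-up choices, not part of the notes): enumerate all $k = {}^N C_n$ possible samples and draw one according to the probabilities $\pi_i$.

```python
import random
from itertools import combinations
from math import comb

random.seed(0)
population = ["Adam", "Ben", "Clive", "Donna", "Eve"]  # N = 5 (made up)
N, n = len(population), 2

samples = list(combinations(population, n))  # all k possible samples of size n
k = comb(N, n)                               # k = N! / (n! (N - n)!) = 10 here
assert len(samples) == k

pi = [1 / k] * k   # equal probabilities: this choice gives simple random sampling
chosen = random.choices(samples, weights=pi, k=1)[0]
print(k, chosen)
```

With equal $\pi_i$ this reduces to simple random sampling; unequal weights give a general probability sample.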
This has the following features:

• it does not depend on subjective decisions,
• randomisation prevents systematic deviations from representativeness,
• the sampling method allows us to calculate a measure of uncertainty.

Having said that, the 1992 British election results were not well predicted by the opinion polls. Although most used quota sampling, some did use probability sampling. Why? Possible reasons:

• there was a genuine last-minute change,
• systematic bias in responses (“do not knows” were mostly Conservatives),
• other possible biases?

There is an interesting discussion on opinion polls and general elections in the Journal of the Royal Statistical Society Series A, volume 159 (1996), pages 1–39.

5.3 Simple random sampling: introduction

We will present the main characteristics of simple random sampling later on, but for now we just consider a very basic set of features, which we use to develop the theory of probability sampling. Suppose we have a population of $N$ individuals and we draw a sample of size $n$. There are $^N C_n$ possible samples.

Definition If every possible sample of size $n$ is equally likely to be chosen, the procedure is called simple random sampling. The sample chosen according to this scheme is called a simple random sample (abbreviation: SRS, where the final S can stand for Sample or Sampling).

Notes:

• We sample without replacement.
• Convenience sampling (“just pick the first 10 individuals you meet”) is not SRS. Quota sampling, which is basically convenience sampling with some balance imposed, is not SRS. If you want to be sure of balance in your random sample with respect to some important demographic property then you should stratify by that property.
• Simple random sampling is an equal probability selection method (EPSEM). That is, each individual is equally likely to be chosen. But note that there are other sampling methods in which each individual has the same chance of being chosen (i.e. SRS is an EPSEM, but an EPSEM is not necessarily SRS).
• SRS aims to achieve representativeness by the use of randomisation. Note that the word “random” is often wrongly used to mean “representative”. Basically, randomisation is a method used to produce a sample whose statistical properties are known.

5.3.1 How to draw a random sample

The simplest method of drawing a random sample is an experiment where we have an urn in which we have placed a set of $N$ objects. For example, we can consider $N$ indistinguishable balls that for all purposes look and feel the same; each ball contains a small piece of paper with a unique number printed on it. We start by shuffling the balls in the urn and then selecting one ball. We can check and see what number is associated with the ball, continuing to draw without replacement $n$ times.

Nowadays, we do not use this actual set-up (except for trivial or glamorous circumstances, e.g. the draws of teams into groups for the football World Cup) and computer pseudo-random mechanisms are used instead. Most statistical packages have built-in algorithms that can draw a random sample. For example, in R we can draw a simple random sample using the command

sample(x=1:N, size=n, replace=FALSE)

This will take a “population” made up of the numbers in the interval [1, N] and draw n instances from that population. Obviously the values of N and n need to be set before calling this function. The option replace=FALSE instructs R to operate according to the urn scheme described above: every time a unit is selected, it is set aside and cannot be selected again.

6 Sampling Theory: Mathematical Concepts and Notation

6.1 Sampling Theory Notation

Firstly, we introduce some useful notation. Different authors and books use different notation. We will use the convention that population values are denoted by capital letters, and sample values by lower case.
This is a standard convention in finite population sampling theory, but has the consequence that it is now the lower case letters that denote the random variables, in contrast to what you will have seen elsewhere.

6.1.1 Population Values

Let $Y_1, Y_2, \ldots, Y_N$ denote the values of some variable $Y$ for each individual in the population. Sometimes we may need to refer to the individuals themselves (the population units), rather than their values of $Y$; these will be denoted by $U_1, U_2, \ldots, U_N$. The following notation for population quantities will be used:

Population size: $N$
Population total: $T = \sum_{j=1}^{N} Y_j$
Population mean: $\bar{Y} = \frac{1}{N} \sum_{j=1}^{N} Y_j$
Population variance: $S^2 = \frac{1}{N-1} \sum_{j=1}^{N} (Y_j - \bar{Y})^2$
Population coefficient of variation: $C = S/\bar{Y}$

Some of the literature divides by $N$ when defining the population variance, in which case it is usually denoted by $\sigma^2$, so

$$ \sigma^2 = \frac{1}{N} \sum_{j=1}^{N} (Y_j - \bar{Y})^2 $$

and sometimes the population mean is denoted by $\mu$ rather than by $\bar{Y}$. The choice between $\mu$ and $\bar{Y}$ is simply one of notation as they represent the same quantity. The two variances are different though, with

$$ S^2 = \frac{N}{N-1} \sigma^2 \quad \text{and} \quad \sigma^2 = \frac{N-1}{N} S^2. $$

Some formulas are more conveniently expressed in terms of $S^2$, others are easier if written in terms of $\sigma^2$. We will try to stick with $\bar{Y}$ and $S^2$ but reserve the right to use the alternative versions where it makes the algebra simpler.

6.1.2 Sample values

We generally use lower case $y_1, y_2, \ldots, y_n$ to denote the values of $Y$ for those individuals selected in the sample. The sample individuals (or units) may be denoted by $u_1, u_2, \ldots, u_n$.

Sample size: $n$
Sample total: $t = \sum_{i=1}^{n} y_i$
Sample mean: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$

Make sure you distinguish clearly between upper case $N, \bar{Y}, S^2$ and lower case $n, \bar{y}, s^2$.
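The relationship between the two population variances can be checked numerically. A small Python sketch (the population values are made up):

```python
# Check that S^2 = N * sigma^2 / (N - 1) on a small made-up population.
Y = [8, 8, 14, 6, 4]     # population values (made up)
N = len(Y)
Ybar = sum(Y) / N
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)   # divides by N - 1
sigma2 = sum((y - Ybar) ** 2 for y in Y) / N     # divides by N
print(Ybar, S2, sigma2)  # 8.0 14.0 11.2
```

For this population $S^2 = 5 \times 11.2 / 4 = 14$, in line with the formula above.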
6.1.3 Binary variables

Some formulas simplify when the variable $Y$ can take one of only two values, coded as 0 or 1, where $Y_j = 1$ means that individual $j$ in the population has some attribute and $Y_j = 0$ means they do not. Variables that are defined in this way are called binary, or dichotomous. The population mean $\bar{Y}$ is then the proportion of individuals in the population who do possess the attribute. This is denoted by $P$ and can be formally written as

$$ P = \frac{1}{N} \sum_{j=1}^{N} Y_j = \bar{Y}. $$

Also the complementary proportion, representing those who do not possess the attribute, is denoted by $Q = 1 - P$. The corresponding sample proportions are denoted by $p$ and $q$, where

$$ p = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} $$

and $q = 1 - p$. The population variance then becomes

$$ S^2 = \frac{1}{N-1} \sum_{j=1}^{N} (Y_j - \bar{Y})^2 = \frac{1}{N-1} \left( \sum_{j=1}^{N} Y_j^2 - N \bar{Y}^2 \right) = \frac{1}{N-1} (NP - NP^2) = \frac{N}{N-1} PQ. $$

In passing from the second to the third expression we have used the fact that each $Y_j$ is either 0 or 1, so when we square each $Y_j$ we get again either 0 or 1. By an identical argument, the sample variance is

$$ s^2 = \frac{n}{n-1} pq. $$

Note that the population variance result is an example of a formula that is simpler if we use the alternative definition of variance, since $\sigma^2 = PQ$. However we then lose the correspondence between the population and sample results, so one can’t win either way.

6.1.4 Probability

When we discuss probability sampling, we will refer to the probability that a particular element is chosen in the sample. More specifically, we may write $\Pr(u_i = U_j)$ to denote the probability that the $i$th element in the sample is the $j$th element in the population. Here probability refers to the method or mechanism of drawing the sample. So we think of $u_i$ as being randomly determined and $U_1, U_2, \ldots, U_N$ as being its possible values. It is common in books to write $\Pr(y_i = Y_j)$ to denote this same probability, and to think of it as the probability that the $i$th sample value equals the $j$th population value.
Of course, if two individuals in the population have the same value of $Y$, it is possible that $y_i = Y_j$ but $u_i \neq U_j$; by convention in sampling theory, however, $\Pr(y_i = Y_j)$ is often used to mean $\Pr(u_i = U_j)$. Note that in ordinary statistics it is standard terminology to write $\Pr(Y = y)$, where $Y$ is a random variable and $y$ is a possible value. But in sampling theory it is well-established notation to use $y$ to refer to the sample and $Y$ to the population. Thus in probability sampling of finite populations, $y_i$ is a random quantity and $Y_j$ is a possible value.

6.2 Estimates and their standard errors

We typically wish to estimate population parameters and to determine the precision of the estimates, usually by calculating standard errors or confidence intervals.

6.2.1 A small example

Suppose we have a population of five people: Adam, Ben, Clive, Donna and Eve. This is obviously a toy example, because one would not normally need to sample when the population size is only 5, but small examples are useful because we can easily enumerate all the possible samples. We want to find out the total amount of money they have on them. In fact, Adam has £8, Ben £8, Clive £14, Donna £6, and Eve £4, so in total they have £40. Instead of just asking them all, we will try to estimate t...
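Enumerating every possible sample for the five-person population can be sketched in a few lines of Python. The notes break off before specifying the design, so the sample size $n = 2$ and the estimator $N\bar{y}$ for the total are assumptions made here for illustration:

```python
from itertools import combinations

# Enumerate all samples of size n = 2 (an assumed design, for illustration)
# from the five-person population, estimating the total as N * ybar.
money = {"Adam": 8, "Ben": 8, "Clive": 14, "Donna": 6, "Eve": 4}
N = len(money)
n = 2
true_total = sum(money.values())  # 40

estimates = []
for sample in combinations(money, n):
    ybar = sum(money[u] for u in sample) / n
    estimates.append(N * ybar)  # estimate of the total: N * sample mean

# Under simple random sampling every sample is equally likely, so the
# average of the estimates over all samples is the expectation of the
# estimator; it equals the true total, i.e. the estimator is unbiased.
assert abs(sum(estimates) / len(estimates) - true_total) < 1e-12
```

With 5 people there are only $\binom{5}{2} = 10$ possible samples, which is what makes the full enumeration feasible; the individual estimates range from 25 (Donna and Eve) up to 55 (Adam and Clive), even though their average is exactly 40.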
