All maps are inaccurate but some have very useful applications: Thoughts on Complex Social Surveys
This blog post provides some thoughts on analysing data from complex social surveys, but I will begin with an extended analogy about maps.
All maps are inaccurate. Orienteering is a sport that requires navigational skills to move (usually running) from point to point in diverse and often unfamiliar terrain. It would be ridiculous to attempt to compete in an orienteering event using a road map drawn on a scale of 1:250,000, because 1cm of the map represents 2.5 kilometres. Similarly, it would be inappropriate to drive from Edinburgh to London using orienteering maps, which are commonly drawn on a scale of 1:15,000. On an orienteering map 1cm represents 150 metres of land.
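The scale arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch; the function name ground_distance_m is hypothetical and is not taken from any mapping library.

```python
def ground_distance_m(map_cm, scale_denominator):
    """Ground distance in metres represented by map_cm centimetres
    on a map drawn at a scale of 1:scale_denominator."""
    ground_cm = map_cm * scale_denominator
    return ground_cm / 100  # there are 100 centimetres in a metre


# 1cm on the road map and 1cm on the orienteering map, in metres of land.
road_map_m = ground_distance_m(1, 250_000)
orienteering_m = ground_distance_m(1, 15_000)
print(road_map_m / 1000, "km on the road map")
print(orienteering_m, "m on the orienteering map")
```

The same function applies to any other scale, for example the 1:50,000 and 1:25,000 maps used by hillwalkers.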
Hillwalking is a popular pastime in Scotland. Despite having similar aims, many hillwalkers use the standard Ordnance Survey (OS) 1:50,000 map (the Landranger Series) but others prefer the 1:25,000 OS map. These maps are not completely accurate but they have useful applications for the hillwalker. For some hillwalking excursions the extra detail offered by the 1:25,000 map is useful. For other journeys the extra detail is superfluous and having coverage of a larger geographical area is more useful. When possible I prefer to use the Harvey’s 1:25,000 Superwalker maps. This is because they are printed on waterproof paper and they tend to cover whole geographic areas so walks are usually contained on a single map. I also find the colour scheme helpful in distinguishing features (especially forests and farmland), and the enlargements (for example the 1:12,500 chart of the Aonach Eagach Ridge on the reverse of the Glen Coe map) aid navigation in difficult terrain.
The London Underground (or Tube) map is probably one of the best known schematic maps. It was designed by Harry Beck in 1931. Beck realised that because the network ran underground, the physical locations of the stations were largely irrelevant to a passenger who simply wanted to know how to get from one station to another. Therefore only the topology of the train route mattered. It would be unusual to use the Tube map as a general navigational aid but it has useful applications for travel on the London Underground.
The Tube map has undergone various evolutions; however, the 1931 edition would still be an adequate guide for a journey on the Piccadilly Line from Turnpike Lane to Earl’s Court. By contrast, a journey from Turnpike Lane station to Southwark station using the 1931 map will prove confusing since the map does not include the Jubilee Line, and Southwark station was not opened until the 1990s. A traveller using the 1931 map will not be aware that Strand station on the Northern Line was closed in the early 1970s.
Contemporary versions of the Tube map include the fare zones, which is a useful addition for journey planning. More recent editions include the Docklands Light Railway and Overground trains, which extend the applications of the Tube map for journeys in the capital.
Here are two further thoughts on the accuracy of the Tube map and its applications. First, when I was a schoolboy growing up in London I was amused that what appeared to me the shortest journey on the Tube map from Euston Square station to Warren Street station involved three stops and one change. I knew that in reality the stations were less than 400 metres apart (my father was a London taxi driver). Walking rather than taking the Tube would save both time and money.
Second, more recently I have become aware of the journey from Finchley Road Tube station to Hampstead Tube station which involves travelling on the Jubilee Line and making changes onto the Victoria Line and then the Northern Line. The estimated journey on the Transport for London website is about 30 minutes. Consulting a London street map reveals that the stations are less than a mile apart. A moderately fit traveller could easily walk that distance in less than half an hour. The street map (like the Tube map) is, however, unlikely to warn the traveller that the journey is uphill. Finchley Road underground station is 217 feet above sea level and Hampstead station is 346 feet above sea level (see http://en-gb.topographic-map.com/).
This preamble hopefully reinforces my opening point that all maps are inaccurate, but sometimes they have very useful applications. Some readers will know the statement made by the statistician George Box that all models are wrong but some are useful. This statement is especially helpful in reminding us that models are representations of the social world and not accurate depictions of the social world. Similarly a map is not the territory. When thinking about samples of social science data I find the analogy with maps useful as a heuristic device.
All samples of social science data are inaccurate, especially those that are either small or have been selected unsystematically. Some samples are both small and unsystematically selected. Small samples and unsystematic samples may prove useful in some circumstances but their design places limitations on how accurately the data represent the population being studied. Large-scale samples that are selected systematically will tend to be more accurate and better represent target populations. The usefulness of any sample of social science data, much like a map, will depend on its use (e.g. the research question that is being addressed).
Some large-scale social surveys use simple statistical techniques to select participants. The data within these surveys can be analysed relatively straightforwardly. Many more contemporary large-scale social surveys have complex designs and use more sophisticated statistical techniques to select participants. The motivation is usually to better represent the target population, to minimise the costs of data collection, and to allow meaningful analyses of sub-populations (or smaller groups). These are positive features but they come at the cost of making the data from complex surveys more difficult to analyse.
It is possible to approach the analysis of data from complex social surveys naïvely and treat them as if they were produced by a simple design and selection strategy. For some analyses this will be an adequate approach. This is analogous to using a sub-optimal map but still being able to arrive close enough to your desired destination.
For other studies a naïve approach to analysis will be inappropriate. Comparing naïve results with results from more sophisticated analysis can help us to assess the appropriateness of naïve approaches. The difficulty is that reliable statements cannot easily be made a priori on the appropriateness of naïve approaches. To draw further on the map analogy, when using an inadequate map it is difficult to assess how close you get to the correct destination unless you have previously visited that location.
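The comparison of naïve and more sophisticated results described above can be sketched in Python. The records and design weights below are invented for illustration, and the sketch compares only point estimates. Correct standard errors for a complex survey also depend on the stratification and clustering in the design, and usually call for specialist survey software.

```python
# Hypothetical survey records: (outcome, design_weight).
# A design weight is taken here to be the inverse of a respondent's
# probability of selection (an assumption made for this illustration).
sample = [
    (10.0, 1.0), (12.0, 1.0), (11.0, 1.0), (13.0, 1.0),  # oversampled group
    (20.0, 4.0), (22.0, 4.0),                            # undersampled group
]


def naive_mean(records):
    """Mean that ignores the survey design entirely."""
    return sum(y for y, _ in records) / len(records)


def weighted_mean(records):
    """Mean in which each respondent counts in proportion to their weight."""
    total_weight = sum(w for _, w in records)
    return sum(y * w for y, w in records) / total_weight


naive = naive_mean(sample)
weighted = weighted_mean(sample)
print(f"naive={naive:.2f} weighted={weighted:.2f} gap={weighted - naive:.2f}")
```

A small gap suggests that the naïve approach may be adequate for that particular estimate; a large gap suggests that the design needs to be taken into account.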
The benefit of social surveys with complex designs is that they have complex designs. The drawback of social surveys with complex designs is that they have complex designs. All maps are inaccurate but some have very useful applications. All samples of social science data are inaccurate but some have very useful applications. The consideration of the usefulness of a set of social science data requires serious methodological thought and this will most probably be best supported by exploratory investigations and sensitivity analyses.
To learn more about analysing data from both non-complex and complex social surveys come to grad school at the University of Edinburgh (http://www.sps.ed.ac.uk/gradschool).
A Note on Synthetic Administrative Social Science Data
Third parties do not have the same level of access to administrative social science data as they do to the social survey datasets, which are made available through national data services and archives. Social science survey datasets often have restrictions placed on how they are shared, which are agreed through an End User Licence. The social science survey datasets supplied by the UK Data Service are available to UK social scientists after a simple registration process, and datasets can be instantly downloaded. It is therefore possible for third parties to duplicate analyses that are undertaken using survey datasets, and to check the validity and accuracy of previously produced results. In the absence of shared research code (e.g. syntax files and supporting documentation) a great deal of work will usually be required for even very simple analyses to be duplicated. Therefore, in practice results even from accessible survey datasets might still be hard to duplicate.
Restrictions upon access to micro-level administrative social science data preclude the instant download of datasets. Progress has recently been made in developing methods to produce synthetic versions of administrative social science data. For example, synthpop is a tool for producing synthetic versions of microdata containing confidential information that are safe and suitable for more general release (see Nowok et al., 2015). The key objective of generating synthetic data is to replace confidential original values with synthetic ones causing minimal distortion to the statistical information contained in the dataset. We envisage that synthetic micro-level administrative social science data will be especially beneficial for undertaking exploratory data analyses, and for training researchers.
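The general idea of replacing original values with draws from distributions estimated from the original data can be illustrated with a toy Python sketch. This is not the synthpop implementation (synthpop is an R package and its methods are more sophisticated). The records, the variable names and the synthesise function are all hypothetical, and a toy of this kind offers no disclosure-control guarantees.

```python
import random
from collections import Counter, defaultdict

# Hypothetical confidential microdata: (sex, qualification) pairs.
original = [
    ("F", "degree"), ("F", "degree"), ("F", "school"),
    ("M", "school"), ("M", "school"), ("M", "degree"),
    ("F", "degree"), ("M", "school"),
]


def synthesise(records, n, seed=0):
    """Draw n synthetic records. The first variable is sampled from its
    observed marginal distribution and the second from its observed
    distribution conditional on the sampled value of the first."""
    rng = random.Random(seed)
    first_values = [a for a, _ in records]
    second_given_first = defaultdict(list)
    for a, b in records:
        second_given_first[a].append(b)
    synthetic = []
    for _ in range(n):
        a = rng.choice(first_values)           # draw from the marginal
        b = rng.choice(second_given_first[a])  # draw from the conditional
        synthetic.append((a, b))
    return synthetic


synthetic = synthesise(original, n=1000)
print(Counter(original))
print(Counter(synthetic))
```

Comparing the two sets of counts shows how closely the synthetic records reproduce the joint distribution of the original records without any single synthetic row being copied from a particular individual.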
Making synthetic versions of micro-level administrative social science datasets available is desirable because it would enable researchers to investigate the format and structure of these datasets. When deposited with research code, synthetic data will help a third party to better understand the original researchers’ workflow (e.g. the structure of the analyses and the decisions that were taken). It will also enable a third party to trial the deposited research code. The availability of synthetic data does not, however, offer a comprehensive solution to the problem of restricted access to administrative social science datasets. This is because it cannot be used to check the validity and accuracy of previously produced results. There is no clear, or agreed upon, formal methodology to help third parties adjudicate whether a difference between original results and duplicated results is a consequence of the perturbations in the synthetic data, or is due to something more serious (e.g. calculation errors in the original results).
Vernon Gayle, Christopher Playford, Roxanne Connelly and Alasdair Gray.
References and Notes
Nowok, B., Raab, G.M. and Dibben, C., 2015. synthpop: Bespoke creation of synthetic data in R. Package vignette http://cran.r-project.org/web/packages/synthpop/vignettes/synthpop.pdf.
Between the NEET and the tidy - Exploring ‘middle’ outcomes in Scottish school qualifications
A new research approach has uncovered four ‘latent’ groups within Scotland’s secondary schools. This typology will be highly relevant to measuring the effectiveness of educational policy reforms.
School qualifications play an important role in determining the transitions that young people make, and the educational and employment pathways that they follow. This paper is an element of a wider on-going programme of theoretically informed empirical analyses which examine young people’s educational outcomes. In this phase of the work we focus on outcomes in School Standard Grade qualifications in Scotland.
The results reported in this work clearly indicate that there are two distinctive groups of Scottish pupils with ‘middle’ or ‘moderate’ school Standard Grade outcomes. On the topic of ‘ordinary’ pupils and unspectacular educational outcomes, we are reminded of Phil Brown’s pithy statement in the 1980s, that there is an invisible majority of ordinary pupils who neither leave their names engraved on the school honours board nor gouged into the tops of their desks. Paraphrasing Brown’s statement, and adding a more contemporary and geographical slant, during the school day the pupils in the middle groups are unlikely to be found drinking Iron Brew WKD in their local parks or sitting at home playing on their Xbox; however, they are also unlikely to appear at university open days.
Scottish pupils could study for more than thirty different Standard Grade subjects. There were no compulsory or specified sets of Standard Grades, and pupils and parents were given a large degree of choice over which subjects a pupil studied. Each Standard Grade subject studied was awarded an individual grade on a seven-point scale, the highest being grade 1, and the lowest grade 7. Because Standard Grades were ungrouped, subject-based, and graded on a seven-point scale there was no single obvious, or agreed upon, method of summarising a pupil’s overall school Standard Grade outcomes.
A central challenge of our programme of work is developing a methodological strategy to handle the messiness and complexity of individual pupils’ outcomes in school qualifications. In this paper we employ a latent variable modelling approach. This paper is original because it uses newly available administrative data from the Scottish Qualifications Authority linked to individual and parental information from the Scottish Longitudinal Study.
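For readers unfamiliar with latent variable models, the following Python sketch fits a very simple latent class model to binary indicators using the EM algorithm. It is a toy under stated assumptions and not the model or software used in the paper. The pupils are invented, each subject is reduced to a hypothetical yes/no indicator of gaining a Credit pass in English, Mathematics and Science, only two classes are fitted, and the function name fit_latent_classes is hypothetical.

```python
def fit_latent_classes(data, n_classes, n_iter=200):
    """Fit a latent class model to binary indicators with the EM algorithm.

    Returns (shares, probs, posteriors): shares[k] is the estimated share of
    class k, probs[k][j] is the estimated probability that indicator j equals
    1 in class k, and posteriors[i][k] is the probability that row i belongs
    to class k (taken from the final E-step).
    """
    n, n_items = len(data), len(data[0])
    shares = [1.0 / n_classes] * n_classes
    # Deterministic starting values that differ between classes.
    probs = [[(k + 1) / (n_classes + 1)] * n_items for k in range(n_classes)]
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities for each row.
        posteriors = []
        for row in data:
            joint = []
            for k in range(n_classes):
                likelihood = shares[k]
                for j, x in enumerate(row):
                    likelihood *= probs[k][j] if x else 1.0 - probs[k][j]
                joint.append(likelihood)
            total = sum(joint)
            posteriors.append([value / total for value in joint])
        # M-step: update class shares and indicator probabilities.
        for k in range(n_classes):
            weight = max(sum(p[k] for p in posteriors), 1e-12)
            shares[k] = weight / n
            for j in range(n_items):
                raw = sum(p[k] * row[j] for p, row in zip(posteriors, data)) / weight
                probs[k][j] = min(max(raw, 1e-6), 1.0 - 1e-6)  # avoid exact 0 or 1
    return shares, probs, posteriors


# Invented indicators: (English, Mathematics, Science) Credit pass yes/no.
pupils = ([(1, 1, 1)] * 6 + [(1, 1, 0)] * 2 + [(0, 0, 0)] * 8
          + [(0, 0, 1)] * 2 + [(1, 0, 0)] * 2)
shares, probs, posteriors = fit_latent_classes(pupils, n_classes=2)
for k in range(2):
    print(f"class {k}: share={shares[k]:.2f}",
          "P(Credit)=", [round(p, 2) for p in probs[k]])
```

Each pupil receives a probability of belonging to each latent group rather than a single overall score, which is what allows groups with similar overall outcomes but different subject profiles to be distinguished when more classes and more subjects are modelled.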
The analyses uncovered four main latent educational groups.
Latent group 1 (Low Outcomes): 46% of pupils were in this group and they had very poor Standard Grade outcomes. Pupils in this group were from generally more socially disadvantaged families.
Latent group 2 (Middle Non-Science): 14% of pupils were in this group and they had moderate overall Standard Grade outcomes. They were more likely to gain a Credit pass (grade 1 or 2) in English, but were relatively less likely to gain Credit passes in Mathematics and Sciences.
Latent group 3 (Middle Science): 14% of pupils were in this group. This group also had moderate overall Standard Grade outcomes. They were unlikely to gain Credit passes (grade 1 or 2) in English and Mathematics, but were more likely to gain Credit passes in the Sciences.
Latent group 4 (High Outcomes): 27% of pupils were in this group and they had very positive overall Standard Grade outcomes. Pupils in this group were from generally more socially advantaged families.
An important aspect of these findings is that the degree of detail that was uncovered using the latent variable approach was hidden in overall analyses of Standard Grade outcomes. The approach provides an informative set of typologies that are likely to be impactful because they can be used to better understand patterns of educational outcomes. These typologies are important because they can directly inform current debates on raising standards in Scottish schools, improving pupils’ knowledge, and developing their skills.
In particular the evidence that there are hidden groups of ‘ordinary’ young people with different patterns of educational outcomes, and that these pupils may require assistance and encouragement in different areas of the school curriculum is important. This finding appeals to ‘Getting it Right for Every Child’ (GIRFEC), which is the national approach to improving the wellbeing of children and young people in Scotland, as well as the aims of the Curriculum for Excellence reforms, the strategy for developing Scotland’s young workforce, and the Westminster Government’s strengthened approach to tracking the life chances of Britain’s most disadvantaged children.
Standard Grades have now been replaced by the new ‘National’ Qualifications Framework. Scottish pupils will now study these new qualifications in the final year of compulsory schooling. The new National Qualifications are also ungrouped and awarded at the individual subject level. Schools have made different decisions regarding the number of Nationals that a pupil will study, but it is likely to be approximately six courses. The new Nationals will be available at different levels. National 5 is the higher level qualification and is roughly equivalent to a Credit pass at Standard Grade. National 4 is a lower level qualification, roughly equivalent to a General pass at Standard Grade, and involves only continuous assessment and no formal examination. National 4 qualifications will be graded as pass or fail, whereas the National 5 qualifications will be graded from A to D (with A being the highest grade). Therefore the latent variable modelling approach demonstrated in this paper is important because it provides a method for analysing the messy and complicated data which will emerge from the new Scottish National Qualifications.
Gayle, V., Playford, C.J., Connelly, R. and Murray, S. (2016) Between the NEET and the tidy - Exploring 'middle' outcomes in Scottish school qualifications. CPC Working Paper 76, ESRC Centre for Population Change, UK.
School examination results were historically a private matter, and the awareness of results day was usually confined to pupils, teachers and parents. School exam results are now an annual newsworthy item in Britain and every summer the British media transmit live broadcasts of groups of young people receiving their grades. This recurrent event illustrates, and reinforces, the importance of school-level qualifications in Britain.
The General Certificate of Secondary Education (GCSE) is the standard qualification undertaken by pupils in England and Wales at the end of year 11 (age 15-16). School GCSE outcomes are worthy of sociological examination because in the state education system they mark the first major branching point in a young person’s educational career and play a critical role in determining pathways in education and employment.
In our newly published paper (http://www.tandfonline.com/doi/pdf/10.1080/13676261.2015.1052049), we turned our attention to exploring school GCSE attainment at the subject-area level, rather than looking at overall outcomes or outcomes in individual GCSE subjects. This is an innovative approach to studying school GCSE outcomes. The initial theoretical motivation was to explore if there were substantively interesting combinations or patterns of GCSE outcomes, which might be masked when the focus is either overall outcomes or outcomes in individual subjects. Within the sociology of youth there has been a growing interest in the experiences of ordinary pupils who have outcomes somewhere between the obviously successful and unsuccessful levels, and this group have been referred to as the ‘missing middle’.
The data used in the paper are from the Youth Cohort Study of England and Wales (YCS) which is a major longitudinal study that began in the mid-1980s. It is a large-scale nationally representative survey funded by the government and is designed to monitor the behaviour of young people as they reach the minimum school leaving age and either remain in education or enter the labour market. School GCSE outcomes are challenging to analyse because there are many GCSEs available, there is an element of pupil choice in the diet of GCSEs that a pupil undertakes, some pupils study more GCSEs than others, each GCSE subject is awarded an individual grade on an alphabetical scale (A* being the highest and G being the lowest), and subject GCSE outcomes are highly correlated. We employ a latent variable approach as a practicable methodological solution to address the messy and complex nature of school GCSE outcomes.
In the paper we identify substantively interesting subject-level patterns of school-level GCSE outcomes that would be concealed in analyses of overall measures, or analyses of outcomes within individual GCSE subjects (see Table 1). The modelling process uncovers four distinctive latent educational groups. The first latent group is characterised by good GCSE outcomes, and another latent group is characterised by poor GCSE outcomes. There are two further latent groups with ‘middle’ or ‘moderate’ GCSE outcomes. These two latent groups have similar levels of overall (or agglomerate) outcomes, but one group has better outcomes in science GCSEs and the other has better outcomes in arts GCSEs.
Membership of the latent educational groups is highly stratified. Socially advantaged pupils are more likely to be assigned to group 1 ‘Good Grades’. In contrast, the pupils assigned to group 4 ‘Poor Grades’ are more likely to be from manual and routine socioeconomic backgrounds. The analyses uncovered two latent educational groups with similar levels of moderate overall school GCSE outcomes, but different overall patterns of subject level outcomes. A notable new finding is that pupils in latent educational group 2 ‘Science’, had a different gender profile to pupils in group 3 ‘Arts’, but both groups of pupils were from the same socioeconomic backgrounds.
Our paper is innovative because it documents a first attempt to explore patterns of school GCSE attainment at the subject area level in order to investigate whether there are distinct groups of pupils with ‘middle’ levels of attainment. The sociologist Phil Brown made the pithy statement that there is an invisible majority of ordinary young people who neither leave their names engraved on the school honours board nor gouged into the top of their desks. We conclude that such pupils are found in the two ‘middle’ latent educational groups. We see no obvious reasons why school exam results will not continue to be an annual newsworthy item and we suspect that the media focus is most likely to remain on pupils with exceptional outcomes rather than those with the more modest results that characterise the two ‘middle’ latent educational groups.
A new GCSE grading scheme is likely to be introduced from August 2017. A new set of grades ranging from 1 to 9 (with 9 being the highest) will replace the A*–G scheme. Early indications suggest that the older eight alphabetical grades (A*–G) will not map directly onto the new 1–9 grades, but there will be some general equivalence. Despite the potential reorganisation of GCSEs, and the proposed changes in the grading system, school level GCSEs will continue to be complicated and messy and the methodological approach used in this paper will be equally appealing for the analysis of more contemporaneous educational cohorts.
Playford, Christopher J., and Vernon Gayle. "The concealed middle? An exploration of ordinary young people and school GCSE subject area attainment." Journal of Youth Studies 19.2 (2016): 149-168. DOI: 10.1080/13676261.2015.1052049