This paper critically examines the affordances and limitations of big data for the study of crime and disorder. We hypothesize that disorder-related posts on Twitter are associated with actual police crime rates. Our results provide evidence that naturally occurring social media data may provide an alternative information source on the crime problem. This paper adds to the emerging field of computational criminology and big data in four ways: (1) it estimates the utility of social media data to explain variance in offline crime patterns; (2) it provides the first evidence of the estimation of offline crime patterns using a measure of broken windows found in the textual content of social media communications; (3) it tests if the bias present in offline perceptions of disorder is present in online communications; and (4) it uses the results of experiments to critically engage with debates on big data and crime prediction.
This paper reports on a methodological experiment with ‘big data’ in the field of criminology. In particular, it provides a data-driven critical examination of the affordances and limitations of open-source communications gathered from social media interactions for the study of crime and disorder. The experiment conducted was exploratory in nature, and utilized nascent ‘computational criminological’ methods (Williams and Burnap 2015) to ethically harvest, transform, link and analyse ‘big social data’ to address the classic problem of crime pattern estimation (Braga et al. 2012). The results presented form a preliminary basis for the critical discussion of these ‘new forms of data’ and for subsequent confirmatory analysis to be conducted. The aim of the experiment was to build big data statistical models that develop previous predictive work using social media. For example, Tumasjan et al. (2010) measured Twitter sentiment in relation to candidates in the German general election concluding that this source of data was as accurate at predicting voting patterns as polls. Asur and Huberman (2010) correlated frequency of posts and sentiment related to movies on Twitter with their revenue, claiming that this method of prediction was more accurate than the Hollywood Stock Market. Sakaki et al. (2010) found that the analysis of Twitter data produced estimates of the epicentres of earthquakes more accurately than conventional geological sensor methods. These studies illustrate how social media generates ‘naturally occurring’ socially relevant data that can be used to complement and augment conventional curated data to estimate the occurrence of offline phenomena. In our experiment, we conduct an ecological analysis of crime in London using Twitter data as a predictor to test the hypothesis that crime- and disorder-related tweets are associated with actual police crime rates. 
Our results provide tentative evidence that statistical models based on social media data may provide an alternative source of information on the crime pattern estimation problem. This paper adds to the evidence base and debate in the emerging field of computational criminology in four ways: (1) it estimates the utility of social media data to explain variance in offline crime patterns and compares results with conventional indicators (census variables); (2) it provides the first evidence of the estimation of offline crime patterns using a measure of broken windows found in the textual content of social media communications; (3) it specifically tests if the bias present in offline perceptions and reports of crime and disorder (found between low- and high-crime areas) is present in social media; and (4) it uses the results of these experiments to critically engage with debates on big data and crime estimation.
Social media communications as source of data for criminology
The majority of individuals aged below 20 in the Western world were ‘born digital’1 and will not recall a time without access to the Internet. Combined with the migration of the ‘born analogue’ generation onto the Internet, fuelled by the rise of social media, we have seen the exponential growth of online spaces for the mass sharing of opinions and sentiments. The digital revolution is generating high-volume data through multiple forms of online behaviour. The global adoption of social media over the past half a decade has seen ‘digital publics’ expand to an unprecedented level. Estimates put social media membership at approximately 2.5 billion non-unique users, with Facebook, Google+ and Twitter accounting for over half of these. These online populations produce hundreds of petabytes of information, with Facebook users alone uploading 500 terabytes of data daily. No study of contemporary society can ignore this dimension of social life. The potential value added by social media data for criminological research is that it is user-generated in real-time in voluminous amounts, and as such it can provide insight into the behaviour of specific populations on the move. This is in contrast to the necessarily retrospective snapshots provided by conventional methods such as household surveys and officially recorded data. New forms of online social data, handled by computational methods, allow criminologists to gain meaningful insights into contemporary social processes at unprecedented scale and speed, but how we marshal these new forms of data presents a key challenge (see Williams et al. 2013, Williams and Burnap 2015).
In our exploratory study with big data, we make the assumption that each Twitter user is a sensor of offline phenomena. In the vein of Raudenbush and Sampson (1999), we consider these sensors, or nodes for systematic social observation, as part of a wide sensor-net covering ecological zones (in our case London boroughs). These sensors observe natural phenomena—the sights, sounds and feel of the streets (Abbott 1997). As in the case of ‘broken windows’ (Wilson and Kelling 1982), these can include minor public incivilities—drinking in the street, graffiti, litter—that serve as signals of the unwillingness of residents to confront strangers, intervene in a crime or call the police; cues that entice potential predators (Skogan 1990: 75). Sensors can publish information about local social and physical disorder in four ways: as victims; as first-hand witnesses; as second-hand observers (e.g. via media reports or the spread of rumour) and as perpetrators. We consider these four modes of Twitter publishing as signatures of crime and disorder. These social-actors-as-disorder-sensors have various characteristics. Some are activated (i.e. publish tweets) based on specific signs, while others are not (based on variation in perceptions of disorder).2 Data from these sensors also include temporal and spatial information. Sensors are not always switched ‘on’, as they may be offline, working, sleeping etc. They may also act in ways that make the data difficult to interpret and validate (e.g. using sarcasm and spreading rumours). This means they produce data that are noisier than curated data. However, the number of sensors is prodigious; over 500 million tweets are broadcast daily from over 500 million accounts; 15+ million of these emanate from the United Kingdom (Library of Congress 2013; Smith 2012).
The Challenges of Big Social Data for Criminology: The 6 Vs
Criminology faces the challenge of how increasingly ubiquitous digital devices and the data they produce are reassembling its research methods apparatus. The exponential growth of social media uptake and the availability of vast amounts of information from these networks have created fundamental methodological and technical challenges. However, aside from recent papers by Chan and Bennett-Moses (2015) and Williams and Burnap (2015), big ‘social’ data have received little attention amongst criminologists, leaving the question of how as a discipline we respond to it largely unexplored. The challenges (and affordances) can be summarized as the 6 Vs: volume, variety, velocity, veracity, virtue and value.
‘Volume’ refers to the vast amount of socially relevant information uploaded on computer networks globally every second. Ninety per cent of the world’s data were created in the two years prior to 2013 (BIS 2013). This is partly due to the global adoption of social media over the past half a decade. Of the online social interactions produced on these networks, a sizable portion is relevant to criminology. For example, Williams and Burnap (2015) have examined the spread of cyberhate on Twitter following the Woolwich terror attack. A comparison with curated and administrative sources on crime reveals the scale of these new data. The most recent Crime Survey for England and Wales (CSEW, 2012–13) data file measures 113.4 megabytes in size. Since its inception in 1982 all CSEW data would not amount to more than 2 gigabytes. In terms of administrative data, the Police National Computer contains circa 9.2 million nominal records (NPIA 2009). The whole UK Data Archive currently holds between 2.2 and 15 terabytes of data. These sizes are dwarfed by the volume of social media data being produced daily that are relevant to criminology.
‘Velocity’ refers to the speed at which these new forms of data are generated and propagated. Recent social unrest illustrates how social media information can spread over large distances in very short periods of time. For example, the HMIC (2011) report Policing Public Order highlighted how the disorder in 2011 had taken on a new dimension, which involved the use of social media. In particular, its use was implicated in the UK Uncut and university tuition fees protests in London in late 2011. At the extreme end of the spectrum, social media use was also associated with the Tunisian and Egyptian Revolutions (Lotan et al. 2011; Choudhary et al. 2012).
‘Variety’ relates to the heterogeneous nature of these data, with users able to upload text, images, audio and video. This multimodal mixed dataset can be harnessed by researchers. However, unlike qualitative and quantitative data that are often labelled, coded and structured within matrices and ordered transcripts, big ‘social’ data are messy, noisy and unstructured.
‘Veracity’ relates to the quality, authenticity and accuracy of these messy data. Triangulating social media communications with more conventional sources, such as curated data, can mitigate these problems. Instead of social media acting as a surrogate for established sources, it should instead augment them, adding a hitherto unrealized longitudinal extensive dimension to existing research strategies and designs. For the first time, this allows criminologists to study social processes as they unfold in real time at the level of populations while drawing upon gold-standard static qualitative and quantitative metrics to inform interpretations. Furthermore, Williams et al. (2013) show that the near ubiquitous adoption of smartphone technology and social media amongst groups that are underrepresented in official survey collection exercises means these new data sources may provide better coverage of such populations.
‘Virtue’ relates to the ethics of using this new form of data in social research. A recent survey found that 74 per cent of social media users knew that when accepting Terms of Service they were giving permission for their information to be accessed by third parties. Eighty-two per cent of respondents were ‘not at all concerned’ or only ‘slightly concerned’ about university researchers using their social media information (however, this dropped to 56 per cent for police access) (Williams 2015). We may therefore argue that researchers in this field can assume consent has been provided, as long as they adhere to basic principles of social science ethics and ensure results are presented at an aggregate level. Additional individual-level consent should be sought if researchers wish to directly quote online communications.
Finally, ‘value’ links the preceding five Vs—only when the volume, velocity and variety of these data can be computationally handled, and the veracity and virtue established, can criminologists begin to marshal them and extract meaningful information. However, to date, few academic criminological studies have collected and analysed social media data. In order to make sense of this rich material, Burnap et al. (2014a) advocate the establishment of interdisciplinary teams of computer and social scientists using parallel computing infrastructure. Dubbed ‘computational criminology’, this interdisciplinary methodology has its roots in computational social science (Lazer et al. 2009). In their pioneering article in Science, Lazer et al. argue that corporate giants such as Facebook, Google and Twitter have been using social data with advanced computing to mine and interpret it for half a decade. Until recently, academic social scientists have been left in an ‘empirical crisis’, lacking the access, infrastructure and skills to marshal these data (Savage and Burrows 2007). In this study, computer scientists and criminologists collaborated to address the 6 Vs for the purposes of offline crime estimation using Twitter data.
Big Data and Crime Estimation
Recent studies have attempted to integrate social media data into statistical models for crime estimation. Bendler et al. (2014) examined the relationship between mobile populations as recorded by Twitter’s geotagging functionality and the co-location of different crime types. They found the absence of tweets was predictive of assaults, theft, and disturbing the peace. Similarly, Malleson and Andresen (2015) used Twitter data to measure mobile populations at risk from violent crime in Leeds. They used a variety of geographic analysis methods to model crime risk using tweets as signatures for mobile populations, noting that conventional estimation methods rely on outdated static data on residential populations (such as the census). They found alternative violent crime hotspots outside of Leeds city centre, not identifiable with conventional crime data sources, concluding Twitter data represent mobile populations at higher spatial and temporal resolutions than sources used by police.
The key limitation to these studies is their dismissal of tweet text, instead focussing purely on geolocation data. The content of tweets may be relevant to the estimation of crime patterns, and simple geolocation data fail to relate to any possible theoretical explanation aside from routine activities. In order to address the utility of tweet text in estimating crime patterns, Gerber (2014) used latent Dirichlet allocation (LDA)3 on content. Tweet text was shown to improve upon models containing conventional non-social media crime predictors for stalking, criminal damage and gambling, but decrease performance for arson, kidnapping and intimidation. Although it is the first study to examine tweet content, Gerber’s use of LDA is problematic given that it is an unsupervised method, meaning correlations between word clusters and crimes are not driven by prior theoretical insight (Chan and Bennett-Moses 2015). This resulted in correlations that appear relatively meaningless (e.g. prostitution was correlated with the words ‘studios’, ‘continental’, ‘village’ and ‘Ukrainian’). It is unclear how terms relate to crimes, and it is not easy to understand how such work can inform criminological theory or policing practice. An improved approach would involve the classification of tweet text based on a predetermined theoretical framework. We adopted such an approach in this study, using ideas from the ‘broken windows’ thesis to guide the classification of social media content that indicated forms of neighbourhood degeneration.
Broken Windows and Big Data
‘Broken windows’ is a well-known theory in criminology. The most basic formulation of this theory is that visible signs of neighbourhood degeneration are causally linked to crime (Wilson and Kelling 1982). The broken windows thesis has received considerable attention over the past three and a half decades, resulting in empirical findings that largely support its core supposition (see Skogan 2015; Welsh et al. 2015).4 Most prominent in the thesis is the hypothesized relationship between visible forms of disorder, their deleterious impact upon residents and their additional effect of drawing offenders from outside of the neighbourhood. In particular, measures of physical disorder have included reports from residents of litter, graffiti and vandalism (Sampson and Raudenbush 2004) that are taken as signatures of the breakdown of the local social order (Skogan 2015). Such measures have conventionally been developed via community-based surveys, interviews and neighbourhood audits, but these instruments capture data in a cross-sectional fashion, often precluding longitudinal analysis at smaller temporal scales. Recently, big administrative data that exhibit longitudinal features have been mined to generate measures of broken windows. Building on the ecometrics approach developed by Raudenbush and Sampson (1999), O’Brien and Sampson (2015) and O’Brien et al. (2015) constructed and validated a measure of physical disorder using a large database from Boston’s constituent relationship management (CRM) system (311 hotline) used by local residents to request city services, many of which reference physical incivilities (e.g. graffiti removal). This approach generated a large (n = 200,000+) geospatially structured dataset that could be repurposed for the estimation of crime and disorder patterns using broken windows measures at very small temporal and spatial scales.
Their findings revealed that (1) administrative records, collected for purposes other than research, could be used to reliably construct measures of broken windows, and (2) these measures were significantly associated with levels of crime and disorder. These represent the first studies of broken windows using administrative ‘big data’, and the authors conclude: ‘Going further, there are private databases, such as Twitter, cell phone records, and Flickr photo collections that are also geocoded and might be equally informative in building innovative measures of urban social processes. These various resources could be used to develop new versions of traditionally popular measures, like we have done here, or to explore new ones that have not been previously accessible’ (O’Brien et al. 2015: 35). This paper takes on this task by testing three hypotheses.
H1: Estimation models including social media variables will increase the amount of crime variance explained compared to models that include ‘offline’ variables alone.
Previous work on using social media and mobile phone data as predictors of offline phenomena, including crime, has shown that they increase the amount of variance explained in statistical models over models using conventional offline variables alone (Asur and Huberman 2010; Gerber 2014). This hypothesis tests whether this holds true for the estimation of crime patterns in the United Kingdom while accounting for temporal variation.
H2: Twitter mentions of ‘broken windows’ indicators will be positively associated with police-recorded crime rates in low-crime areas.
H3: Twitter mentions of ‘broken windows’ indicators will be negatively or not associated with crime rates in high-crime areas.
These hypotheses are based on previous research that finds offline discussions of neighbourhood degeneration and local crime issues in Partners and Communities Together meetings are not representative of local crime problems (e.g. Brunger 2011; Sagar and Jones 2013). This is in part due to patterns of low attendance in high-crime areas, and the non-representativeness of regular meeting attendees. This can result in (1) regular reporting of criminal and sub-criminal issues at such meetings in low-crime areas, due to socially engaged attendees who are sensitive to degeneration, and (2) systematic under-reporting of criminal and sub-criminal issues at such meetings in high-crime areas, due to lack of attendance because of a reduced sensitivity in residents to degeneration—the idea that degeneration has gone too far resulting in ‘lost neighbourhoods’ occupied by residents who have naturalized to their surroundings (Sampson 2012). Therefore, these hypotheses explicitly test whether the bias found in offline reports of crime and disorder is also present in Twitter communications.
Data and Methods
Variables were derived from three sources and were combined at the borough level for modelling: (1) a database of police-recorded crime provided by the Metropolitan Police Service; (2) the UK Census 2011; and (3) the social media network, Twitter. The Metropolitan Police Service provided circa 600,000 police-recorded crime records covering all London boroughs over a 12-month period between August 2013 and August 2014. The UK Census 2011 was accessed via the nomis web portal.5 All UK tweets were collected via the Twitter streaming Application Programming Interface using the Cardiff Online Social Media Observatory (COSMOS) software platform6 (Burnap et al. 2014a), resulting in circa 200 million tweets with location information covering a 12-month period. Borough level was selected as the unit of spatial analysis to maximize the number of geolocated tweets in the dataset7 (see section Limitations for a discussion on spatial scale).
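The borough-by-month aggregation described above can be sketched in a few lines; the borough names, months and tweet texts below are invented for illustration, and the sketch assumes each tweet's geolocation has already been resolved to one of the 28 boroughs:

```python
from collections import Counter

# Hypothetical records: each geotagged tweet already resolved to a borough.
tweets = [
    {"borough": "Hackney", "month": "2013-09", "text": "More graffiti on the high street"},
    {"borough": "Hackney", "month": "2013-09", "text": "Lovely morning run"},
    {"borough": "Camden",  "month": "2013-09", "text": "Fly-tipping again near the canal"},
    {"borough": "Camden",  "month": "2013-10", "text": "Great gig last night"},
]

# Sum tweets by (borough, month), mirroring how the Twitter regressors
# were combined with crime and census data at borough level.
counts = Counter((t["borough"], t["month"]) for t in tweets)

print(counts[("Hackney", "2013-09")])  # 2
```

The same grouping key is what allows the time-variant Twitter counts to be joined to the monthly police-recorded crime totals for each borough.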
Nine crime categories were selected for modelling from the police recorded crime database and each was summed by 28 London boroughs and by month over the study window.8 Estimating crime at the borough level allows for an ecological analysis. Raudenbush and Sampson (1999) show how observations collected at the level of ecological units (in our case London boroughs) can yield reliable relationships between perceptions of disorder, fear of crime and crime patterns.
Social media regressors
Two regressors were derived from Twitter communications. Frequency of Twitter Posts—the 200 million geocoded tweets collected in the United Kingdom over the 12-month period were reduced to those geolocated in the 28 London boroughs over the study window (n = 8,417,438) and were summed by borough and month. Twitter Mentions of ‘Broken Windows’—tweets were classified as containing ‘broken windows’ indicators (e.g. mentions of neighbourhood degeneration) and were summed by borough and month. Our approach recognized that Twitter users act as sensors of their environment, much like a large distributed ‘social sensor-net’. Some of these sensors may publish content about the changing condition of their neighbourhood, such as directly witnessing crime, disorder and decay. They may also sense degeneration as second-order witnesses (via news reports), as victims or as perpetrators of crime. Unlike O’Brien et al.’s (2015) ecometric measurement approach, ours was a task of ‘text classification’ (van Rijsbergen 1979). This was due to the unstructured nature of Twitter communications, in contrast to the structured administrative9 data used by O’Brien et al. The process followed established automatic text classification procedures adopted in our previous work with social media data (see Burnap and Williams 2015; Burnap et al. 2014b; Sloan et al. 2015; Williams and Burnap 2015). First, mentions pertaining to ‘broken windows’ were extracted by the authors from offline interviews with victims and non-victims in local neighbourhoods.10 A coding frame for text extraction was informed by Quinton and Tuffin’s (2007) evaluation of UK National Reassurance Policing Priorities that identified common concerns from local residents in six sites, which relate to ‘broken windows’ measures (alcohol and/or drug use; litter and dog fouling; criminal damage; speeding; parking and nuisance vehicles; anti-social behaviour and juvenile nuisance). 
O’Brien et al.’s (2015) recent work was also used to inform text extraction. They developed validated measures of ‘broken windows’ using large-scale administrative records and identified that reports of housing issues (e.g. poor maintenance), trash and graffiti held the strongest reliability. Second, to validate that the coding was related to ‘broken windows’ indicators, extracts were independently rated using a crowdsourcing approach involving 700 human annotators sampled from the CrowdFlower11 crowdsourcing service. We required at least four human annotations per interview extract and only retained annotated text for which at least three human annotators (75%) agreed that extracts related to signatures of neighbourhood degeneration. Finally, the key terms contained within the verified classified text extracts were used to mine the Twitter dataset, resulting in a social media measure of ‘broken windows’.12 Figurative examples of tweet content containing ‘broken windows’ indicators included: ‘New graffiti at the end of my street. How did they reach that high!?’; ‘Community allotment was vandalized today. Why would someone do this?’; ‘More illegal dumping in Shoreditch. When will @hackneycouncil sort this out?!’; and ‘RT if you think we should use discarded card receipts to identify litterers!’ [Includes a photo of discarded McDonalds bag with card receipt].13 Both Twitter measures—frequency and measure of broken windows—were entered as time-variant regressors.
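The two-step procedure above (crowdsourced validation of extracts, then keyword mining of the Twitter dataset) can be sketched as follows; the extracts, annotation votes and key terms are invented for illustration and are not the study's actual coding frame:

```python
# Step 1: retain only extracts where at least 3 of 4 annotators (75%)
# agreed the text signals neighbourhood degeneration.
annotations = {
    "new graffiti at the end of my street": [1, 1, 1, 0],   # 75% agreement: keep
    "community allotment was vandalized today": [1, 1, 1, 1],
    "great coffee at the corner cafe": [0, 0, 1, 0],        # below threshold: drop
}
validated = [text for text, votes in annotations.items()
             if sum(votes) / len(votes) >= 0.75]

# Step 2: mine tweets for key terms drawn from the validated extracts.
broken_windows_terms = {"graffiti", "vandalized", "litter", "dumping"}

def is_broken_windows(tweet: str) -> bool:
    """Flag a tweet if it contains any validated 'broken windows' term."""
    words = set(tweet.lower().split())
    return bool(words & broken_windows_terms)

print(is_broken_windows("More illegal dumping in Shoreditch"))  # True
print(is_broken_windows("Great gig last night"))                # False
```

A production classifier would handle stemming, multi-word phrases and sarcasm; this sketch only illustrates the logic of filtering on validated terms before summing matches by borough and month.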
Census regressors
Measures were selected based on previous literature on crime correlates (e.g. Young 2002; Chainey 2008) and included proportions of the borough populations that were black, minority ethnic, unemployed, aged 15–21 and who had no qualifications. These were entered as time-invariant regressors.
Methods of estimation
Given the requirement to incorporate the temporal variability of police-recorded crime and Twitter data with the static regressors from the census, we used linear14 random- and fixed-effects regression.15 This meant that we could explore correlations between independent regressors including tweets that have high temporal granularity and variability and census regressors that have very low temporal granularity with the dependent measures of police-recorded crime. We took measurements at each consecutive month (variable for Twitter regressors and static for census regressors) within each borough (variable for both Twitter and census regressors).16 We were therefore able to conduct an ecological analysis of London police-recorded crime using Twitter data as a predictor. Random-effects (RE) models assume that the borough-level error term is not correlated with the regressors, which allows for time-invariant variables to play a role as explanatory regressors (census measures). However, violation of this assumption renders RE inconsistent because of selection bias resulting from time-invariant unobservables. Fixed-effects (FE) models are based solely on within-borough variation, allowing for the elimination of potential sources of bias by controlling for stable (observed and unobserved) ecological characteristics. However, one side effect of FE models is that they cannot be used to investigate time-invariant causes of the dependent variables. We determined whether RE or FE was more appropriate using the Hausman test. Robust standard errors were used to account for heteroskedasticity.
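The within (fixed-effects) transformation that underlies the FE models can be illustrated on synthetic panel data (28 units observed over 12 periods, loosely mirroring the borough-month structure); this is a sketch of the estimator's logic, not the authors' estimation code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic panel: 28 "boroughs" over 12 "months", a true slope of 2.0
# on a time-varying regressor (e.g. tweet counts), plus borough-specific
# intercepts standing in for stable unobserved characteristics.
n_units, n_periods, true_beta = 28, 12, 2.0
unit_effects = rng.normal(0, 5, n_units)  # time-invariant heterogeneity
x = rng.normal(0, 1, (n_units, n_periods))
y = true_beta * x + unit_effects[:, None] + rng.normal(0, 0.1, (n_units, n_periods))

# Fixed-effects (within) estimator: demeaning x and y within each unit
# eliminates the time-invariant effects, after which OLS is consistent.
x_w = x - x.mean(axis=1, keepdims=True)
y_w = y - y.mean(axis=1, keepdims=True)
beta_fe = (x_w * y_w).sum() / (x_w ** 2).sum()

print(beta_fe)  # expected to be close to the true slope of 2.0
```

Because the borough intercepts are differenced away, any time-invariant regressor (such as the census measures) drops out of the FE model, which is precisely the trade-off against RE discussed above.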
Table 1 reports on the results of the RE and FE models (coefficients in bold indicate those that are favoured (FE or RE) based on the Hausman tests). Model A includes only conventional census predictive regressors that have been established as correlates of certain types of criminal activity in previous research (Young 2002; Chainey 2008). Model B introduces the Twitter regressors and differences in the adjusted R2 statistics17 illustrate the change in variance explained by their inclusion.18 Some of the conventional census regressors emerged as predictive in the RE models, and associations are in the direction expected based on previous research. Twitter regressors emerged as significantly associated with prevalence of crime in seven of the nine crime types. The addition of Twitter data increases the amount of variance explained in all models, corroborating hypothesis H1 and adding further evidence in support of the argument that social media communications can add explanatory value in estimating offline phenomena (Asur and Huberman 2010; Gerber 2014). Tweet frequency was positively associated with burglary in a dwelling, criminal damage, violence against the person and theft from shops, corroborating previous work that argues geolocation markers in Twitter data are useful in estimating crime patterns (Bendler et al. 2014; Malleson and Andresen 2015). Like Malleson and Andresen, this study finds the positive relationship between frequency of Twitter posts and violence against the person holds when eliminating potential sources of bias by controlling for stable (observed and unobserved) ecological characteristics. These results contradict the work of Bendler et al. (2014) who found a negative relationship existed between frequency of tweets and violence. However, in our models that take month and not hour as the temporal scale, it is likely that tweet frequency is acting as an indicator of population density, and not mobile population.
This would account for the positive relationship with crimes that tend to occur in the absence of bystanders (burglary and criminal damage). As tweet frequency is not a key variable of interest in this paper, this is not a fundamental shortcoming of this exploratory study.19
Table 1 Random- and fixed-effects models for all boroughs (Models A, census regressors only, and B, census plus Twitter regressors, for each of the nine crime types, including burglary in a dwelling, burglary in a business property, criminal damage, theft of personal property, theft from a motor vehicle and theft of a motor vehicle)