Introduction

This note discusses how I approach data exploration using tools in R. In my professional experience, I was frequently handed a new data set or database from a client and had to quickly become an expert on the contents, structure, problems, and nuances of the data before my team and I started considering how we might use the information in the data to address client concerns. Many of the data sets I worked with were flat files or relational structured data sets, and those will be the focus here.

I hope to provide a demonstration of the type of approach I often took at the outset. This typically included reading in the data, looking at high-level characteristics, determining how the various data sets were generated and how they fit together, exploring distributions and tabulations of variables, investigating missingness, comparing the actual contents of the data to what was expected, and looking into simple relationships between variables, among many other steps more specific to the data and goals.
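
To make that concrete before diving in, the first pass usually amounts to a handful of one-liners. A minimal sketch, with df standing in for whatever data set was just loaded:

# a minimal first look, with `df` standing in for any freshly loaded data frame
dim(df)                 # rows and columns
str(df, list.len = 10)  # structure and column types (first 10 columns)
head(df)                # first few rows
colSums(is.na(df))      # missing values per column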

We’ll be working with replication data for Knox, Lowe, and Mummolo (2020) (also see Knox (2019)), a paper that explores problems that arise from non-randomly selected data on police interactions.

Reading in the Data

First, we read in the raw csv data files; we will explore them in detail below. Note that for simplicity this note focuses on the 2003 and 2004 data, though data for more recent years are also available.

# packages used throughout this note
library(readxl)
library(dplyr)
library(tidyr)
library(tibble)

data2003 = read.csv(file = "./data/raw/2003.csv", header = TRUE)
data2004 = read.csv(file = "./data/raw/2004.csv", header = TRUE)
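
As an aside: these files load quickly enough with read.csv(), but for much larger extracts, readr::read_csv() or data.table::fread() are common faster alternatives (not used in this note; both return objects that are not plain data frames, which downstream code may need to accommodate):

# faster readers for large csv files (not run here)
# data2003 = readr::read_csv("./data/raw/2003.csv")
# data2003 = data.table::fread("./data/raw/2003.csv")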

Next, we read in the documentation files.

variables2003 = read_excel(path = "./data/SQF-File-Documentation/2003 SQF File Spec.xlsx",
                           sheet = "2003 SQF File Spec", range = "A4:H115")
variables2004 = read_excel(path = "./data/SQF-File-Documentation/2004 SQF File Spec.xlsx",
                           sheet = "2004 SQF File Spec", range = "A4:H115")

values2003 = read_excel(path = "./data/SQF-File-Documentation/2003 SQF File Spec.xlsx",
                        sheet = "Value Labels", range = "A5:C381")
values2004 = read_excel(path = "./data/SQF-File-Documentation/2004 SQF File Spec.xlsx",
                        sheet = "Value Labels", range = "A5:C381")
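
The sheet names and cell ranges above came from opening the workbooks and inspecting them manually. If you are reproducing this, readxl::excel_sheets() will list a workbook’s sheets so you can confirm them programmatically:

# list the sheets in a workbook before committing to a sheet name and range
excel_sheets("./data/SQF-File-Documentation/2003 SQF File Spec.xlsx")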

#fill down variable names in values tables 
#and create values column without NAs so we can compare 2003 and 2004
values2003 = values2003 %>% fill("Field Name") %>% mutate(Values_noNA = ifelse(is.na(Value),"",Value))
values2004 = values2004 %>% fill("Field Name") %>% mutate(Values_noNA = ifelse(is.na(Value),"",Value))
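
If fill() is unfamiliar: it carries the last non-missing value downward, which is exactly what the merged-cell layout of the spec sheets requires. A toy example:

# toy illustration: fill() carries the last non-NA value down the column
tibble(x = c("a", NA, NA, "b", NA)) %>% fill(x)
# x becomes "a", "a", "a", "b", "b"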

Let’s check that the documentation files indicate the 2003 and 2004 data should have the same layout. We see that the key fields in the variable lists match: the list of variables, the position in the file, and the description or label.

variables_compare = rbind(
  sum(variables2003$Variable != variables2004$Variable),
  sum(variables2003$Position != variables2004$Position),
  sum(variables2003$Label != variables2004$Label))
colnames(variables_compare) =  c("Number of Differences")
row.names(variables_compare) =  c("Variable","Position","Label")
knitr::kable(variables_compare)
|         | Number of Differences|
|:--------|---------------------:|
|Variable |                     0|
|Position |                     0|
|Label    |                     0|

We also see that all three columns of the values documentation are the same for 2003 and 2004.

values_compare = rbind(
  sum(values2003$`Field Name` != values2004$`Field Name`),
  sum(values2003$Values_noNA != values2004$Values_noNA),
  sum(values2003$Label != values2004$Label))
colnames(values_compare) =  c("Number of Differences")
row.names(values_compare) =  c("Field Name","Values_noNA","Label")
knitr::kable(values_compare)
|            | Number of Differences|
|:-----------|---------------------:|
|Field Name  |                     0|
|Values_noNA |                     0|
|Label       |                     0|

We also note that not all variables appear in the values documentation, presumably because only encoded categorical fields need value labels. This makes sense but is worth noting.

length(unique(values2003$`Field Name`))
## [1] 72
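
If we wanted the specific list of undocumented variables, a setdiff() against the variables sheet would produce it (assuming the two sheets spell variable names identically):

# variables in the file spec with no entries in the value labels sheet
setdiff(variables2003$Variable, unique(values2003$`Field Name`))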

Now let’s take a peek at the documentation. We’ll just look at the first 10 variables in the variables documentation for now.

knitr::kable(variables2003[1:3] %>% head(10))
|Variable | Position|Label                                       |
|:--------|--------:|:-------------------------------------------|
|year     |        1|YEAR OF STOP (CCYY)                         |
|pct      |        2|PRECINCT OF STOP (FROM 1 TO 123)            |
|ser_num  |        3|UF250 SERIAL NUMBER                         |
|datestop |        4|DATE OF STOP (MM-DD-YYYY)                   |
|timestop |        5|TIME OF STOP (HH:MM)                        |
|recstat  |        6|RECORD STATUS                               |
|inout    |        7|WAS STOP INSIDE OR OUTSIDE ?                |
|trhsloc  |        8|WAS LOCATION HOUSING OR TRANSIT AUTHORITY ? |
|perobs   |        9|PERIOD OF OBSERVATION (MMM)                 |
|crimsusp |       10|CRIME SUSPECTED                             |

Let’s take a look at the values documentation for the key variable race.

knitr::kable(values2003[1:3] %>% filter(`Field Name` == "race"))
|Field Name |Value |Label                          |
|:----------|:-----|:------------------------------|
|race       |NA    |NOT LISTED                     |
|race       |A     |ASIAN/PACIFIC ISLANDER         |
|race       |B     |BLACK                          |
|race       |I     |AMERICAN INDIAN/ALASKAN NATIVE |
|race       |P     |BLACK-HISPANIC                 |
|race       |Q     |WHITE-HISPANIC                 |
|race       |W     |WHITE                          |
|race       |X     |UNKNOWN                        |
|race       |Z     |OTHER                          |

High-Level Data Set Information

Let’s take a look at the data set dimensions.

dims = cbind(dim(data2003),dim(data2004))
colnames(dims) =  c("2003","2004")
row.names(dims) =  c("observations","variables")
knitr::kable(dims)
|             |   2003|   2004|
|:------------|------:|------:|
|observations | 160851| 313523|
|variables    |    111|    111|

We see that both data sets have 111 variables / columns. Let’s check if they are the same set.

names(data2003)
##   [1] "year"     "pct"      "ser_num"  "datestop" "timestop" "recstat" 
##   [7] "inout"    "trhsloc"  "perobs"   "crimsusp" "perstop"  "typeofid"
##  [13] "explnstp" "othpers"  "arstmade" "arstoffn" "sumissue" "sumoffen"
##  [19] "compyear" "comppct"  "offunif"  "officrid" "frisked"  "searched"
##  [25] "contrabn" "adtlrept" "pistol"   "riflshot" "asltweap" "knifcuti"
##  [31] "machgun"  "othrweap" "pf_hands" "pf_wall"  "pf_grnd"  "pf_drwep"
##  [37] "pf_ptwep" "pf_baton" "pf_hcuff" "pf_pepsp" "pf_other" "radio"   
##  [43] "ac_rept"  "ac_inves" "rf_vcrim" "rf_othsw" "ac_proxm" "rf_attir"
##  [49] "cs_objcs" "cs_descr" "cs_casng" "cs_lkout" "rf_vcact" "cs_cloth"
##  [55] "cs_drgtr" "ac_evasv" "ac_assoc" "cs_furtv" "rf_rfcmp" "ac_cgdir"
##  [61] "rf_verbl" "cs_vcrim" "cs_bulge" "cs_other" "ac_incid" "ac_time" 
##  [67] "rf_knowl" "ac_stsnd" "ac_other" "sb_hdobj" "sb_outln" "sb_admis"
##  [73] "sb_other" "repcmd"   "revcmd"   "rf_furt"  "rf_bulg"  "offverb" 
##  [79] "offshld"  "sex"      "race"     "dob"      "age"      "ht_feet" 
##  [85] "ht_inch"  "weight"   "haircolr" "eyecolor" "build"    "othfeatr"
##  [91] "addrtyp"  "rescode"  "premtype" "premname" "addrnum"  "stname"  
##  [97] "stinter"  "crossst"  "aptnum"   "city"     "state"    "zip"     
## [103] "addrpct"  "sector"   "beat"     "post"     "xcoord"   "ycoord"  
## [109] "dettypcm" "linecm"   "detailcm"
table(names(data2003)%in%names(data2004))
## 
## TRUE 
##  111
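
An equivalent, more direct check is setequal(), which confirms the two name sets are identical in a single call:

# TRUE only if the 2003 and 2004 data have exactly the same column names
setequal(names(data2003), names(data2004))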

So the two data sets have the same 111 variables. Do these variables all have the same data type?

types2003 = as.matrix(sapply(colnames(data2003), function(x) class(data2003[[x]])))
types2004 = as.matrix(sapply(colnames(data2004), function(x) class(data2004[[x]])))
types = cbind(names(data2003)[types2003 != types2004],
              types2003[types2003 != types2004],
              types2004[types2003 != types2004])
colnames(types) =  c("variable","2003","2004")
knitr::kable(types)
|variable |2003    |2004      |
|:--------|:-------|:---------|
|perobs   |integer |numeric   |
|rescode  |logical |integer   |
|city     |logical |character |
|addrpct  |logical |integer   |
|sector   |logical |character |
|beat     |logical |integer   |
|post     |logical |character |
|xcoord   |logical |integer   |
|ycoord   |logical |integer   |
|dettypcm |logical |character |
|linecm   |logical |integer   |
|detailcm |logical |integer   |

Most variables have the same type, but not all. However, we see that in the 2003 data, all of the differing variables except “perobs” (defined in the documentation as “PERIOD OF OBSERVATION (MMM)”) consist only of NA values, which is why read.csv typed them as logical. This is not the case for 2004, though I do not show that here. “perobs” is numeric in the 2004 data but an integer in the 2003 data, so we can simply defer to the 2004 type of numeric. The loop below prints each differing 2003 column and then coerces it to its 2004 type.

names_dif_types = names(data2003)[types2003 != types2004]
for(i in seq_along(names_dif_types)){
  name = names_dif_types[i]
  print(name)
  # show a sample of the 2003 values (all-NA columns print as empty tables)
  if(length(table(data2003[name])) == 0){
    print(table(data2003[name]))
  } else {
    print("top 10 values")
    print(sort(table(data2003[name]), decreasing = TRUE)[1:10])
  }
  # coerce the 2003 column to the type it has in the 2004 data
  type2004 = types2004[types2003 != types2004][i]
  if(type2004 == "numeric"){
    data2003[name] = as.numeric(as.character(data2003[[name]]))
  } else if(type2004 == "integer"){
    data2003[name] = as.integer(as.character(data2003[[name]]))
  } else if(type2004 == "character"){
    data2003[name] = as.character(data2003[[name]])
  }
}
## [1] "perobs"
## [1] "top 10 values"
## 
##     1     2     5     3    10    30     0    15     4    20 
## 86961 28603 21188  8559  5869  2294  2017  1443  1106  1045 
## [1] "rescode"
## < table of extent 0 >
## [1] "city"
## < table of extent 0 >
## [1] "addrpct"
## < table of extent 0 >
## [1] "sector"
## < table of extent 0 >
## [1] "beat"
## < table of extent 0 >
## [1] "post"
## < table of extent 0 >
## [1] "xcoord"
## < table of extent 0 >
## [1] "ycoord"
## < table of extent 0 >
## [1] "dettypcm"
## < table of extent 0 >
## [1] "linecm"
## < table of extent 0 >
## [1] "detailcm"
## < table of extent 0 >
print(sort(table(data2004["perobs"]),decreasing = T)[1:10])
## 
##      1      2      5      3     10      4     15     20     30      7 
## 168525  62803  42736  19201   9390   2667   2239   1433   1332    620

Now that the variable types have been standardized, let’s stack the data sets.

data = bind_rows(data2003, data2004)
dim(data)
## [1] 474374    111
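
Before moving on, a quick sanity check (base R only) confirms that nothing was dropped or duplicated in the stack and that the year column separates the two sources:

# row counts should add up exactly, and `year` should split into the two sources
stopifnot(nrow(data) == nrow(data2003) + nrow(data2004))
table(data$year)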

Variable Contents and Distributions

Now let’s look at the values these columns actually take. We’ll look at the distributions of the variables, but we’ll be careful to show only the top values for non-numeric variables with many unique values. Note that for such variables we will need to explore further the kinds of values they take: their top values might look as expected, but there is no guarantee that this will be the case for all values.

We create a function that will do all of this for us. We could run it on the entire data set, but that would produce very long output, so we’ll just run it on a subset of selected variables for demonstration purposes.

Note that some of the variables, like “age”, have values such as 9999 that must be errors or an alternative code for a missing value. We should be careful with such issues, as well as with the missing values we find below. These should be investigated and their causes understood, not simply excluded. In the best-case scenario, we would be able to discuss them with data staff at the institution that generated and stores the data.
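
For example, once the causes of such sentinel codes are understood, a cleaning pass might recode them to NA. A minimal sketch, where the 0 and 100 cutoffs are illustrative assumptions rather than documented rules:

# illustrative only: treat apparent sentinel codes and implausible ages as missing
# (the cutoffs below are assumptions, not documented rules)
data_clean = data %>%
  mutate(age = ifelse(age <= 0 | age > 100, NA, age))

With that caveat noted, here is the function.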

Distributions = function(dataset){
  variables = names(dataset)
  types = as.matrix(sapply(colnames(dataset), function(x) class(dataset[[x]])))
  for(i in seq_along(variables)){
    print(variables[i])
    if(types[i] == "character"){
      # attach the documented labels to each encoded value
      vals = values2003 %>%
        filter(`Field Name` == variables[i]) %>%
        select(Value, Label) %>%
        rename(Encoded_Value = Value)
      # tabulate the passed-in data set (not the global `data`)
      tab = as.data.frame(table(dataset[variables[i]])) %>%
        rename(Encoded_Value = Var1) %>%
        left_join(vals, by = "Encoded_Value") %>%
        arrange(desc(Freq))
      if(dim(unique(dataset[variables[i]]))[1] <= 50){
        print(tab)
      } else {
        print("More than 50 unique values; Top 10:")
        print(tab %>% head(10))
      }
    }
    else if(types[i] == "integer"){
      # tabulate low-cardinality integers; summarize the rest
      if(dim(unique(dataset[variables[i]]))[1] <= 50){
        print(t(table(dataset[variables[i]])))
      } else {
        print(summary(dataset[variables[i]]))
      }
    }
    else if(types[i] == "logical"){
      print(table(dataset[variables[i]]))
    }
    else if(types[i] == "numeric"){
      print(summary(dataset[variables[i]]))
    }
  }
}
# avoid masking base::subset() with a variable of the same name
subset_vars = data %>% select(race, age, datestop, build, pistol, frisked, searched, arstmade, crimsusp, inout)
Distributions(subset_vars)
## [1] "race"
##    Encoded_Value   Freq                          Label
## 1              B 232737                          BLACK
## 2              Q 109881                 WHITE-HISPANIC
## 3              W  46536                          WHITE
## 4              Z  44032                          OTHER
## 5              P  24637                 BLACK-HISPANIC
## 6              A  12664         ASIAN/PACIFIC ISLANDER
## 7              X   2136                        UNKNOWN
## 8              I   1577 AMERICAN INDIAN/ALASKAN NATIVE
## 9                   162                           <NA>
## 10             U     12                           <NA>
## [1] "age"
##       age         
##  Min.   :   0.00  
##  1st Qu.:  18.00  
##  Median :  23.00  
##  Mean   :  28.14  
##  3rd Qu.:  33.00  
##  Max.   :9999.00  
##  NA's   :162      
## [1] "datestop"
##     datestop       
##  Min.   : 1012003  
##  1st Qu.: 3132003  
##  Median : 6122004  
##  Mean   : 6267057  
##  3rd Qu.: 9222004  
##  Max.   :12312004  
## [1] "build"
##   Encoded_Value   Freq    Label
## 1             M 256774   MEDIUM
## 2             T 156030     THIN
## 3             H  41488    HEAVY
## 4             Z  18817  UNKNOWN
## 5             U   1103 MUSCULAR
## 6                  162     <NA>
## [1] "pistol"
##   Encoded_Value   Freq Label
## 1             N 473161    NO
## 2             Y   1213   YES
## [1] "frisked"
##   Encoded_Value   Freq Label
## 1             N 270290    NO
## 2             Y 204084   YES
## [1] "searched"
##   Encoded_Value   Freq Label
## 1             N 437521    NO
## 2             Y  36853   YES
## [1] "arstmade"
##   Encoded_Value   Freq Label
## 1             N 445534    NO
## 2             Y  28839   YES
## 3                    1  <NA>
## [1] "crimsusp"
## [1] "More than 50 unique values; Top 10:"
##                     Encoded_Value  Freq Label
## 1                             CPW 90432  <NA>
## 2                        BURGLARY 41752  <NA>
## 3                         ROBBERY 40863  <NA>
## 4               CRIMINAL TRESPASS 38645  <NA>
## 5              GRAND LARCENY AUTO 25744  <NA>
## 6                             GLA 20629  <NA>
## 7                         ASSAULT  9119  <NA>
## 8                      DRUG SALES  7997  <NA>
## 9  CRIMINAL POSSESSION OF CONTROL  7443  <NA>
## 10 CRIMINAL SALE OF CONTROLLED SU  6831  <NA>
## [1] "inout"
##   Encoded_Value   Freq   Label
## 1             O 361124 OUTSIDE
## 2             I 100072  INSIDE
## 3                13178    <NA>
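
The crimsusp output above also shows the free-text problem concretely: “GRAND LARCENY AUTO” and “GLA” are presumably the same offense recorded two ways. A grepl() probe is one way to start mapping such variants; the pattern below is an illustrative guess, not a complete taxonomy:

# illustrative probe of free-text variants in crimsusp (GLA vs. GRAND LARCENY AUTO)
gla = grepl("GLA|GRAND LARCENY", data$crimsusp)
sort(table(data$crimsusp[gla]), decreasing = TRUE)[1:5]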

Missing Data

Now, let’s review the number of missing values in each column, displaying only the columns with at least one missing value. Recall that the stacked data has 474374 observations, of which 160851 come from the 2003 data. The pct_2003 and pct_2004 columns below divide the overall missing count by each year’s number of observations, so a pct_2003 of exactly 1.0 flags a variable missing for exactly as many rows as the 2003 data contains, while values above 1 simply mean the missing count exceeds that year’s size. We can see that some variables are missing for all observations in the stacked data, and others for exactly the 2003 observations. This is consistent with what we already found for some of the variables in the 2003 data.

# count NAs in every column
missings = data %>%
  summarise_all(~ sum(is.na(.)))
missings = data.frame(t(missings)) 
missings %>% rownames_to_column(var = "variable") %>% 
  filter(t.missings.>0) %>% 
  mutate(pct_all = round(t.missings. / dim(data)[1],4),
         pct_2003 = round(t.missings. / dim(data2003)[1],4),
         pct_2004 = round(t.missings. / dim(data2004)[1],4))
##    variable t.missings. pct_all pct_2003 pct_2004
## 1  crimsusp         161  0.0003   0.0010   0.0005
## 2  arstoffn          12  0.0000   0.0001   0.0000
## 3  sumoffen          10  0.0000   0.0001   0.0000
## 4    repcmd         162  0.0003   0.0010   0.0005
## 5    revcmd         347  0.0007   0.0022   0.0011
## 6       dob         162  0.0003   0.0010   0.0005
## 7       age         162  0.0003   0.0010   0.0005
## 8   ht_feet         162  0.0003   0.0010   0.0005
## 9   ht_inch         162  0.0003   0.0010   0.0005
## 10   weight         162  0.0003   0.0010   0.0005
## 11 othfeatr      474374  1.0000   2.9492   1.5130
## 12  rescode      474165  0.9996   2.9479   1.5124
## 13 premtype      474374  1.0000   2.9492   1.5130
## 14 premname          50  0.0001   0.0003   0.0002
## 15  addrnum          21  0.0000   0.0001   0.0001
## 16   stname           6  0.0000   0.0000   0.0000
## 17  stinter          14  0.0000   0.0001   0.0000
## 18   aptnum      474374  1.0000   2.9492   1.5130
## 19     city      160851  0.3391   1.0000   0.5130
## 20    state      474374  1.0000   2.9492   1.5130
## 21      zip      474374  1.0000   2.9492   1.5130
## 22  addrpct      445465  0.9391   2.7694   1.4208
## 23   sector      160851  0.3391   1.0000   0.5130
## 24     beat      460573  0.9709   2.8634   1.4690
## 25     post      160851  0.3391   1.0000   0.5130
## 26   xcoord      446785  0.9418   2.7776   1.4250
## 27   ycoord      446785  0.9418   2.7776   1.4250
## 28 dettypcm      160851  0.3391   1.0000   0.5130
## 29   linecm      445475  0.9391   2.7695   1.4209
## 30 detailcm      445475  0.9391   2.7695   1.4209
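
A cleaner follow-up would be to compute the missing counts within each year directly, rather than comparing one overall count against both denominators. A sketch for a few of the flagged variables:

# per-year missing counts for a few of the flagged variables
data %>%
  group_by(year) %>%
  summarise(across(c(city, sector, xcoord, rescode), ~ sum(is.na(.))))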

Next Steps

Next steps could include examining relationships between specific variables, looking more closely at the most important variables, investigating missingness patterns, generating visualizations, creating a more focused summary of the data set’s contents, examining interesting slices and aggregations of the data, generating new variables for use in analysis, and more.

A complete data exploration usually shades into data cleaning and manipulation, with the goal of producing a clean analysis data set in which all relevant data quality issues have been addressed and the necessary manipulations carried out. One key item to note is that missingness, default values, and errors might not appear at random, and we need to carefully consider why such issues arose. Depending on how they relate to the causal model, they could introduce bias into estimates. A quick first check along these lines is sketched below.
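
As one illustrative starting point (the choice of variables here is arbitrary), a missingness indicator can be cross-tabulated against substantively important variables:

# quick check: does missingness in xcoord vary across racial groups?
table(missing_xcoord = is.na(data$xcoord), race = data$race)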

We will stop here for now, but I hope to add to this and explore this data further.

References

Knox, Dean. 2019. “Replication Data for: Administrative Records Mask Racially Biased Policing.” Harvard Dataverse. https://doi.org/10.7910/DVN/KFQOCV.
Knox, Dean, Will Lowe, and Jonathan Mummolo. 2020. “Administrative Records Mask Racially Biased Policing.” American Political Science Review 114 (3): 619–37. https://doi.org/10.1017/S0003055420000039.