Introduction
This note discusses how I approach data exploration using tools in R. In my professional experience, I was frequently handed a new data set or database from a client and had to quickly become an expert on the contents, structure, problems, and nuances of the data before my team and I started considering how we might use the information in the data to address client concerns. Many of the data sets I worked with were flat or relational structured data sets; these will be the focus here.
I hope to demonstrate the type of approach I often took at the outset. This typically included reading in data, looking at high-level characteristics, determining how various data sets were generated and how they fit together, exploring distributions and tabulations of variables, investigating missingness, comparing the actual contents of the data to what was expected, and looking into simple relationships between variables, among many other things more specific to the data and goals.
We’ll be working with replication data for Knox, Lowe, and Mummolo (2020) (also see Knox (2019)), a paper that explores problems that arise from non-randomly selected data on police interactions.
Reading in the Data
First, we read in the raw CSV data files; we will explore these in detail below. Note that for simplicity this note focuses on the 2003 and 2004 data, though data are available for more recent years.
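The chunks below lean on several packages. The original setup chunk is not shown here, so the following is my reconstruction of what needs to be loaded:

library(dplyr)   # data manipulation verbs and the pipe
library(tidyr)   # fill()
library(tibble)  # rownames_to_column()
library(readxl)  # read_excel()
# knitr is called below via knitr::kable(), so it only needs to be installed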
data2003 = read.csv(file = "./data/raw/2003.csv", header = T)
data2004 = read.csv(file = "./data/raw/2004.csv", header = T)
Next, we read in the documentation files.
variables2003 = read_excel(path = "./data/SQF-File-Documentation/2003 SQF File Spec.xlsx", sheet = "2003 SQF File Spec", range = "A4:H115")
variables2004 = read_excel(path = "./data/SQF-File-Documentation/2004 SQF File Spec.xlsx", sheet = "2004 SQF File Spec", range = "A4:H115")

values2003 = read_excel(path = "./data/SQF-File-Documentation/2003 SQF File Spec.xlsx", sheet = "Value Labels", range = "A5:C381")
values2004 = read_excel(path = "./data/SQF-File-Documentation/2004 SQF File Spec.xlsx", sheet = "Value Labels", range = "A5:C381")
# fill down variable names in the values tables
# and create a values column without NAs so we can compare 2003 and 2004
values2003 = values2003 %>% fill(`Field Name`) %>% mutate(Values_noNA = ifelse(is.na(Value), "", Value))
values2004 = values2004 %>% fill(`Field Name`) %>% mutate(Values_noNA = ifelse(is.na(Value), "", Value))
Let’s check whether the documentation files indicate that the 2003 and 2004 data should have the same layout. We see that the key fields in the variable lists are identical: the list of variables, their positions in the file, and their descriptions or labels.
variables_compare = rbind(
  sum(variables2003$Variable != variables2004$Variable),
  sum(variables2003$Position != variables2004$Position),
  sum(variables2003$Label != variables2004$Label))
colnames(variables_compare) = c("Number of Differences")
row.names(variables_compare) = c("Variable","Position","Label")
knitr::kable(variables_compare)
 | Number of Differences |
---|---|
Variable | 0 |
Position | 0 |
Label | 0 |
We also see that all three columns of the values documentation are the same for 2003 and 2004.
values_compare = rbind(
  sum(values2003$`Field Name` != values2004$`Field Name`),
  sum(values2003$Values_noNA != values2004$Values_noNA),
  sum(values2003$Label != values2004$Label))
colnames(values_compare) = c("Number of Differences")
row.names(values_compare) = c("Field Name","Values_noNA","Label")
knitr::kable(values_compare)
 | Number of Differences |
---|---|
Field Name | 0 |
Values_noNA | 0 |
Label | 0 |
We also note that not all variables appear in the values documentation: only 72 of the 111 do. This makes sense, since only coded variables need value labels, but it is worth noting.
# number of distinct variables covered by the values documentation
length(unique(values2003$`Field Name`))
## [1] 72
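To see exactly which variables lack value labels, a minimal sketch is a set difference between the variables list and the values documentation:

# file-spec variables that never appear in the values documentation
setdiff(variables2003$Variable, unique(values2003$`Field Name`))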
Now let’s take a peek at the documentation. We’ll just look at the first 10 variables in the variables documentation for now.
knitr::kable(variables2003[1:3] %>% head(10))
Variable | Position | Label |
---|---|---|
year | 1 | YEAR OF STOP (CCYY) |
pct | 2 | PRECINCT OF STOP (FROM 1 TO 123) |
ser_num | 3 | UF250 SERIAL NUMBER |
datestop | 4 | DATE OF STOP (MM-DD-YYYY) |
timestop | 5 | TIME OF STOP (HH:MM) |
recstat | 6 | RECORD STATUS |
inout | 7 | WAS STOP INSIDE OR OUTSIDE ? |
trhsloc | 8 | WAS LOCATION HOUSING OR TRANSIT AUTHORITY ? |
perobs | 9 | PERIOD OF OBSERVATION (MMM) |
crimsusp | 10 | CRIME SUSPECTED |
Let’s take a look at the values documentation for the key variable, “race”.
knitr::kable(values2003[1:3] %>% filter(`Field Name` == "race"))
Field Name | Value | Label |
---|---|---|
race | NA | NOT LISTED |
race | A | ASIAN/PACIFIC ISLANDER |
race | B | BLACK |
race | I | AMERICAN INDIAN/ALASKAN NATIVE |
race | P | BLACK-HISPANIC |
race | Q | WHITE-HISPANIC |
race | W | WHITE |
race | X | UNKNOWN |
race | Z | OTHER |
High-Level Data Set Information
Let’s take a look at the data set dimensions.
dims = cbind(dim(data2003), dim(data2004))
colnames(dims) = c("2003","2004")
row.names(dims) = c("observations","variables")
knitr::kable(dims)
 | 2003 | 2004 |
---|---|---|
observations | 160851 | 313523 |
variables | 111 | 111 |
We see that both data sets have 111 variables / columns. Let’s check if they are the same set.
names(data2003)
## [1] "year" "pct" "ser_num" "datestop" "timestop" "recstat"
## [7] "inout" "trhsloc" "perobs" "crimsusp" "perstop" "typeofid"
## [13] "explnstp" "othpers" "arstmade" "arstoffn" "sumissue" "sumoffen"
## [19] "compyear" "comppct" "offunif" "officrid" "frisked" "searched"
## [25] "contrabn" "adtlrept" "pistol" "riflshot" "asltweap" "knifcuti"
## [31] "machgun" "othrweap" "pf_hands" "pf_wall" "pf_grnd" "pf_drwep"
## [37] "pf_ptwep" "pf_baton" "pf_hcuff" "pf_pepsp" "pf_other" "radio"
## [43] "ac_rept" "ac_inves" "rf_vcrim" "rf_othsw" "ac_proxm" "rf_attir"
## [49] "cs_objcs" "cs_descr" "cs_casng" "cs_lkout" "rf_vcact" "cs_cloth"
## [55] "cs_drgtr" "ac_evasv" "ac_assoc" "cs_furtv" "rf_rfcmp" "ac_cgdir"
## [61] "rf_verbl" "cs_vcrim" "cs_bulge" "cs_other" "ac_incid" "ac_time"
## [67] "rf_knowl" "ac_stsnd" "ac_other" "sb_hdobj" "sb_outln" "sb_admis"
## [73] "sb_other" "repcmd" "revcmd" "rf_furt" "rf_bulg" "offverb"
## [79] "offshld" "sex" "race" "dob" "age" "ht_feet"
## [85] "ht_inch" "weight" "haircolr" "eyecolor" "build" "othfeatr"
## [91] "addrtyp" "rescode" "premtype" "premname" "addrnum" "stname"
## [97] "stinter" "crossst" "aptnum" "city" "state" "zip"
## [103] "addrpct" "sector" "beat" "post" "xcoord" "ycoord"
## [109] "dettypcm" "linecm" "detailcm"
table(names(data2003) %in% names(data2004))
##
## TRUE
## 111
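An equivalent and more direct check is to take set differences in both directions; given the counts above, both should come back empty:

setdiff(names(data2003), names(data2004))  # expect character(0)
setdiff(names(data2004), names(data2003))  # expect character(0)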
So the two data sets have the same 111 variables. Do these variables all have the same data type?
types2003 = as.matrix(sapply(colnames(data2003), function(x) class(data2003[[x]])))
types2004 = as.matrix(sapply(colnames(data2004), function(x) class(data2004[[x]])))
types = cbind(names(data2003)[types2003 != types2004],
              types2003[types2003 != types2004],
              types2004[types2003 != types2004])
colnames(types) = c("variable","2003","2004")
knitr::kable(types)
variable | 2003 | 2004 |
---|---|---|
perobs | integer | numeric |
rescode | logical | integer |
city | logical | character |
addrpct | logical | integer |
sector | logical | character |
beat | logical | integer |
post | logical | character |
xcoord | logical | integer |
ycoord | logical | integer |
dettypcm | logical | character |
linecm | logical | integer |
detailcm | logical | integer |
Most variables have the same type, but not all. However, we see that in the 2003 data, all of the differing variables except “perobs” (defined in the documentation as “PERIOD OF OBSERVATION (MMM)”) consist only of NA values; this is not the case for 2004, though I do not show that here. “perobs” is numeric in the 2004 data but an integer in the 2003 data, so we can simply defer to the 2004 type of numeric.
# for each variable whose type differs, show its 2003 contents,
# then convert the 2003 column to the 2004 type
names_dif_types = names(data2003)[types2003 != types2004]
for(i in 1:length(names_dif_types)){
  name = names_dif_types[i]
  print(name)
  if(length(table(data2003[name])) == 0){
    print(table(data2003[name]))
  } else {
    print("top 10 values")
    print(sort(table(data2003[name]), decreasing = T)[1:10])
  }
  if(types2004[types2003 != types2004][i] == "numeric"){
    data2003[name] = as.numeric(as.character(data2003[[name]]))
  } else if(types2004[types2003 != types2004][i] == "integer"){
    data2003[name] = as.integer(as.character(data2003[[name]]))
  } else if(types2004[types2003 != types2004][i] == "character"){
    data2003[name] = as.character(data2003[[name]])
  }
}
## [1] "perobs"
## [1] "top 10 values"
##
## 1 2 5 3 10 30 0 15 4 20
## 86961 28603 21188 8559 5869 2294 2017 1443 1106 1045
## [1] "rescode"
## < table of extent 0 >
## [1] "city"
## < table of extent 0 >
## [1] "addrpct"
## < table of extent 0 >
## [1] "sector"
## < table of extent 0 >
## [1] "beat"
## < table of extent 0 >
## [1] "post"
## < table of extent 0 >
## [1] "xcoord"
## < table of extent 0 >
## [1] "ycoord"
## < table of extent 0 >
## [1] "dettypcm"
## < table of extent 0 >
## [1] "linecm"
## < table of extent 0 >
## [1] "detailcm"
## < table of extent 0 >
# for comparison, the top perobs values in the 2004 data
print(sort(table(data2004["perobs"]), decreasing = T)[1:10])
##
## 1 2 5 3 10 4 15 20 30 7
## 168525 62803 42736 19201 9390 2667 2239 1433 1332 620
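It is worth pinning down why “perobs” reads as numeric in 2004 when it is an integer in 2003; presumably some 2004 values are fractional. A quick check along these lines (a sketch; output not shown):

# count non-integer perobs values in the 2004 data
sum(data2004$perobs %% 1 != 0, na.rm = TRUE)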
Now that the variable types have been standardized, let’s stack the data sets.
data = bind_rows(data2003, data2004)
dim(data)
## [1] 474374 111
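As an aside, bind_rows can also tag each row with the data set it came from via its .id argument. This gives a handy cross-check of the recorded “year” variable against the source file; a sketch (the source_file name is my own choice):

# tag rows with their source file, then cross-tabulate against the year column
data_tagged = bind_rows("2003" = data2003, "2004" = data2004, .id = "source_file")
table(data_tagged$source_file, data_tagged$year)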
Variable Contents and Distributions
Now let’s look at some values we see in these columns. We’ll look at the distributions of variables, being careful to show only the top values for non-numeric variables with many unique values. Note that for such variables we will need to further explore the kinds of values they take: while their top values might look as expected, there is no guarantee that this will be the case for all values.
We create a function that will do all of this for us. We could run it on the entire data set, but that would produce very long output, so we’ll just run it on a subset of selected variables for purposes of demonstration.
Note that some variables, like “age”, have values such as 9999 that must be errors or another way of coding a missing value (a quick check for these is sketched below). We should be careful with such issues, as well as with the missing values we find later; these should be investigated and their causes understood, not simply excluded. In a best-case scenario, we would be able to discuss them with data staff at the institution that generated and stores the data.
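Before defining the function, here is one quick way to surface such sentinel values (a sketch; the cutoff of 100 is my arbitrary choice, not something from the documentation):

# count and list implausible ages rather than silently dropping them
sum(data$age > 100, na.rm = TRUE)
sort(unique(data$age[data$age > 100]), decreasing = TRUE)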
# print a suitable summary of each variable: labeled frequency tables for
# character variables (using the values documentation, which is identical
# for 2003 and 2004), tabulations or numeric summaries otherwise
Distributions = function(dataset){
  variables = names(dataset)
  types = as.matrix(sapply(colnames(dataset), function(x) class(dataset[[x]])))
  for(i in 1:length(variables)){
    print(variables[i])
    if(types[i] == "character"){
      # look up the documented value labels for this variable
      vals = values2003 %>%
        filter(`Field Name` == variables[i]) %>%
        select(Value, Label) %>%
        rename(Encoded_Value = Value)
      tab = as.data.frame(table(dataset[variables[i]])) %>%
        rename(Encoded_Value = Var1) %>%
        left_join(vals, by = "Encoded_Value") %>%
        arrange(desc(Freq))
      if(dim(unique(dataset[variables[i]]))[1] <= 50){
        print(tab)
      } else {
        print("More than 50 unique values; Top 10:")
        print(tab %>% head(10))
      }
    } else if(types[i] == "integer"){
      if(dim(unique(dataset[variables[i]]))[1] <= 50){
        print(t(table(dataset[variables[i]])))
      } else {
        print(summary(dataset[variables[i]]))
      }
    } else if(types[i] == "logical"){
      print(table(dataset[variables[i]]))
    } else if(types[i] == "numeric"){
      print(summary(dataset[variables[i]]))
    }
  }
}
subset = data %>% select(c("race","age","datestop","build","pistol","frisked","searched","arstmade","crimsusp","inout"))
Distributions(subset)
## [1] "race"
## Encoded_Value Freq Label
## 1 B 232737 BLACK
## 2 Q 109881 WHITE-HISPANIC
## 3 W 46536 WHITE
## 4 Z 44032 OTHER
## 5 P 24637 BLACK-HISPANIC
## 6 A 12664 ASIAN/PACIFIC ISLANDER
## 7 X 2136 UNKNOWN
## 8 I 1577 AMERICAN INDIAN/ALASKAN NATIVE
## 9 162 <NA>
## 10 U 12 <NA>
## [1] "age"
## age
## Min. : 0.00
## 1st Qu.: 18.00
## Median : 23.00
## Mean : 28.14
## 3rd Qu.: 33.00
## Max. :9999.00
## NA's :162
## [1] "datestop"
## datestop
## Min. : 1012003
## 1st Qu.: 3132003
## Median : 6122004
## Mean : 6267057
## 3rd Qu.: 9222004
## Max. :12312004
## [1] "build"
## Encoded_Value Freq Label
## 1 M 256774 MEDIUM
## 2 T 156030 THIN
## 3 H 41488 HEAVY
## 4 Z 18817 UNKNOWN
## 5 U 1103 MUSCULAR
## 6 162 <NA>
## [1] "pistol"
## Encoded_Value Freq Label
## 1 N 473161 NO
## 2 Y 1213 YES
## [1] "frisked"
## Encoded_Value Freq Label
## 1 N 270290 NO
## 2 Y 204084 YES
## [1] "searched"
## Encoded_Value Freq Label
## 1 N 437521 NO
## 2 Y 36853 YES
## [1] "arstmade"
## Encoded_Value Freq Label
## 1 N 445534 NO
## 2 Y 28839 YES
## 3 1 <NA>
## [1] "crimsusp"
## [1] "More than 50 unique values; Top 10:"
## Encoded_Value Freq Label
## 1 CPW 90432 <NA>
## 2 BURGLARY 41752 <NA>
## 3 ROBBERY 40863 <NA>
## 4 CRIMINAL TRESPASS 38645 <NA>
## 5 GRAND LARCENY AUTO 25744 <NA>
## 6 GLA 20629 <NA>
## 7 ASSAULT 9119 <NA>
## 8 DRUG SALES 7997 <NA>
## 9 CRIMINAL POSSESSION OF CONTROL 7443 <NA>
## 10 CRIMINAL SALE OF CONTROLLED SU 6831 <NA>
## [1] "inout"
## Encoded_Value Freq Label
## 1 O 361124 OUTSIDE
## 2 I 100072 INSIDE
## 3 13178 <NA>
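One thing the “datestop” summary above makes clear is that dates are stored as integers in M(M)DDYYYY form, matching the documentation’s “DATE OF STOP (MM-DD-YYYY)”: the minimum of 1012003 is January 1, 2003 with the leading zero dropped. A sketch of how one might parse these into proper dates (the stop_date name is my own):

# zero-pad datestop to 8 digits, then parse as month-day-year
data$stop_date = as.Date(sprintf("%08d", as.integer(data$datestop)), format = "%m%d%Y")
range(data$stop_date, na.rm = TRUE)  # should run from early 2003 through the end of 2004

Similarly, the “crimsusp” output suggests free text rather than codes: both “GLA” and “GRAND LARCENY AUTO” appear as separate top values, so any analysis by suspected crime would first require normalizing these strings.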
Missing Data
Now, let’s also review the number of missing values in each column. We only display the columns with missing values. Recall that the total number of observations is 474374, of which 160851 come from the 2003 data. Note that the pct_2003 and pct_2004 columns below divide the combined missing count (across both years) by the single-year row counts, so values above 1 are possible; a pct_2003 of exactly 1 means the missing count equals the 2003 sample size, which, given what we saw earlier, corresponds to variables missing for all of 2003 and none of 2004. So we can see that some of these variables are missing for all observations in the data, or for all observations from 2003. This is consistent with what we already found for some of the variables in the 2003 data.
missings = data %>%
  summarise_all(~ sum(is.na(.)))
missings = data.frame(t(missings))
missings %>% rownames_to_column(var = "variable") %>%
  filter(t.missings. > 0) %>%
  mutate(pct_all = round(t.missings. / dim(data)[1], 4),
         pct_2003 = round(t.missings. / dim(data2003)[1], 4),
         pct_2004 = round(t.missings. / dim(data2004)[1], 4))
## variable t.missings. pct_all pct_2003 pct_2004
## 1 crimsusp 161 0.0003 0.0010 0.0005
## 2 arstoffn 12 0.0000 0.0001 0.0000
## 3 sumoffen 10 0.0000 0.0001 0.0000
## 4 repcmd 162 0.0003 0.0010 0.0005
## 5 revcmd 347 0.0007 0.0022 0.0011
## 6 dob 162 0.0003 0.0010 0.0005
## 7 age 162 0.0003 0.0010 0.0005
## 8 ht_feet 162 0.0003 0.0010 0.0005
## 9 ht_inch 162 0.0003 0.0010 0.0005
## 10 weight 162 0.0003 0.0010 0.0005
## 11 othfeatr 474374 1.0000 2.9492 1.5130
## 12 rescode 474165 0.9996 2.9479 1.5124
## 13 premtype 474374 1.0000 2.9492 1.5130
## 14 premname 50 0.0001 0.0003 0.0002
## 15 addrnum 21 0.0000 0.0001 0.0001
## 16 stname 6 0.0000 0.0000 0.0000
## 17 stinter 14 0.0000 0.0001 0.0000
## 18 aptnum 474374 1.0000 2.9492 1.5130
## 19 city 160851 0.3391 1.0000 0.5130
## 20 state 474374 1.0000 2.9492 1.5130
## 21 zip 474374 1.0000 2.9492 1.5130
## 22 addrpct 445465 0.9391 2.7694 1.4208
## 23 sector 160851 0.3391 1.0000 0.5130
## 24 beat 460573 0.9709 2.8634 1.4690
## 25 post 160851 0.3391 1.0000 0.5130
## 26 xcoord 446785 0.9418 2.7776 1.4250
## 27 ycoord 446785 0.9418 2.7776 1.4250
## 28 dettypcm 160851 0.3391 1.0000 0.5130
## 29 linecm 445475 0.9391 2.7695 1.4209
## 30 detailcm 445475 0.9391 2.7695 1.4209
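As noted, the pct_2003 and pct_2004 columns above are crude. If we wanted genuinely per-year missingness rates, bounded by 1, one option is to group on the year column; a sketch, assuming dplyr 1.0+ (for across()) and tidyr:

# compute missing counts within each year so percentages stay in [0, 1]
missings_by_year = data %>%
  group_by(year) %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(-year, names_to = "variable", values_to = "n_missing") %>%
  left_join(count(data, year, name = "n_obs"), by = "year") %>%
  mutate(pct_missing = round(n_missing / n_obs, 4)) %>%
  filter(n_missing > 0)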
Next Steps
Next steps could include examining relationships between specific variables, looking more closely at the most important variables, investigating missingness patterns, generating visualizations, creating a more focused summary of the data set’s contents, looking at interesting slices and aggregations of the data, generating new variables for use in analysis, and more. One small example of such a step is sketched below.
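For instance, here is a sketch of a first bivariate look, not an analysis; as Knox, Lowe, and Mummolo (2020) emphasize, raw differences in this kind of non-randomly selected data should not be read causally (this assumes ggplot2 is available):

# share of stops involving a frisk, by documented race code
library(ggplot2)
data %>%
  filter(race %in% c("A", "B", "I", "P", "Q", "W", "X", "Z")) %>%  # documented codes only
  group_by(race) %>%
  summarise(n_stops = n(), frisk_rate = mean(frisked == "Y")) %>%
  ggplot(aes(x = reorder(race, -frisk_rate), y = frisk_rate)) +
  geom_col() +
  labs(x = "race code", y = "share of stops with a frisk")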
A complete data exploration usually shades into data cleaning and manipulation, with the goal of producing a clean analysis data set in which all relevant data quality issues have been addressed and the necessary manipulations have been carried out. One key point is that missingness, default values, and errors might not appear at random, and we need to consider carefully why such issues arose. Depending on how they relate to the causal model, they could introduce bias into estimates.
We will stop here for now, but I hope to add to this and explore this data further.