Introduction

In late 2017 I did an analysis of crime data in my neighborhood (The Heights) using the online Houston Police Department data.

This was so interesting that I foolishly decided to expand the effort to cover the whole city. After all, how hard could it be to go from analyzing one police beat with about 13,000 records, to analyzing 109 beats, with a corresponding increase in volume? This effort is still ongoing in fits and starts today, but I thought it would be useful to start documenting the journey now before the pain fades away. This is largely a tale of data cleanup - as any real data analysis project is.

As a sidenote, I spent 35 years as a Geophysicist for a major oil company, much of the time my job was involved with processing and interpreting data - seismic data, well data, and others. I also served several times on teams attempting to develop “a database for all our E&P data”. Which failed on every attempt. But one salient point we always found through interviews and literature searches, was that searching for, digitizing, cleaning, verifying, and organizing data was about 90% of what technical folks like geologists and geophysicists did. So this is nothing new to me. It is the curse of trying to interface our clean, simple models to the chaotic, messy real world. And we always have to make the data fit the pristine environment of our models. So we do cleanup.

What did I learn

So that you don’t have to read all the way to the end to get my learnings from this experience, I will place them conveniently up here near the top.

  1. The paradigm sort(unique(variable)) is a quick and easy way to look at a limited set of strings and see anomalies.

  2. Always save a copy of the raw, unmolested data after reading it in. Reading in one district takes about an hour, and I had enough changes in strategy, issues, and just plain screwups that I read them each in 3 or 4 times before I started saving a copy of the raw data.

  3. Before applying a regular expression to a file, test to see what it will do. It is often impossible to undo it.

  4. Don’t try to do things that require thinking close to bedtime. Make notes and do them tomorrow.

  5. Document why you are doing things. Before you forget.

  6. Making a checkpoint save occasionally while developing a data cleanup runstream makes it easier to back up a few steps when you inevitably hopelessly destroy your working dataset.

  7. Use of a mask variable to isolate the rows to be modified can be very readable and easy. For example, the street field may contain “Katy” or “Katy Fwy”, while the type field may or may not contain “Fwy”. I can create

mask <- df$type=="FWY"
mask2 <- df$street=="KATY FWY"

and then change “KATY FWY” to “KATY” wherever mask&mask2 is True. This also makes it easy to show what will be operated on before pulling the trigger (see 3 above).

The data

On the HPD website are html tables, both by district and by beat of crime incidents, each table representing one month. The department is organized into Regions, which contain Districts, which contain Beats. So obviously we want to read the data in by District, as that is the largest division available. There are 21 Districts (if I ignore the weird one that only operates at the airport), each with about 3-6 Beats.

#   Some data

example <- tribble(
  ~Date, ~Hour, ~Offense_Type, ~Beat, ~Premise, ~BlockRange, ~StreetName, ~Type, ~Suffix, ~'#offenses',
"1/25/2017", "17", "Theft", "2A30", "Road, Street, or Sidewalk", "400-499", "24TH", "ST", "W", "1",
"1/25/2017", "17", "Theft", "2A30", "Apartment", "800-899", "USENER", "-", "-", "1",
"1/25/2017", "19", "Theft", "2A30", "Apartment", "1600-1699", "BLOUNT", "ST", "-", "1",
"1/26/2017", "13", "Burglary", "2A30", "Residence or House", "UNK", "2", "ST", "-", "1"
)

example %>% 
  gt() %>% 
  tab_header(title="Example of some raw data") %>% 
  cols_label()
Example of some raw data
Date Hour Offense_Type Beat Premise BlockRange StreetName Type Suffix #offenses
1/25/2017 17 Theft 2A30 Road, Street, or Sidewalk 400-499 24TH ST W 1
1/25/2017 17 Theft 2A30 Apartment 800-899 USENER - - 1
1/25/2017 19 Theft 2A30 Apartment 1600-1699 BLOUNT ST - 1
1/26/2017 13 Burglary 2A30 Residence or House UNK 2 ST - 1

Some issues become immediately apparent, even with this tiny dataset. Sometimes the block range is unknown. The type and suffix fields may contain a dash - signifying no value. The suffix is actually a prefix.

So let’s work through the data.

To read in the files I used rvest to scrape the data. However there were many issues.

Reading issues

The files start at January 2009, and when I did this work, ended in October 2017.

  1. Jan-May 2009 missing the “number of offenses” column and recorded time instead of hour

  2. Months prior to June 2014 used premise codes instead of strings

  3. Tables from Jan-July, Sept and Nov 2009, and January 2010 are encoded differently and require xpath=‘/html/body/table’ instead of xpath=‘/html/body/div/table’

  4. June and July 2014 are missing the Premise column.

  5. Sporadically a table turns up with extra empty columns labeled “Fieldxx” for some number xx.

  6. August 2017 is missing in the district files, but all the beats that form it are available.

All that just to read in the silly files. Suffice it to say that I have some nasty if/else filled code to deal with all the special cases and exceptions. I won’t share that here. I don’t think it is enlightening. If you care, it is on my github site.

So let’s work through the cleanup, field by field, using district 4f as an example

how many nulls?

#################################
# How many nulls everywhere?
#################################
df2 <- readRDS(file="~/Dropbox/CrimeStats/District4fRawData.rds")

df2 %>%
    map_df(function(x) sum(is.na(x))) %>%
    gather(feature, num_nulls) %>%
    print(n = 100)
## # A tibble: 10 x 2
##    feature      num_nulls
##    <chr>            <int>
##  1 Date                 0
##  2 Hour                 0
##  3 Offense_Type         0
##  4 Beat                 0
##  5 Premise            361
##  6 Block_Range          0
##  7 Street               0
##  8 Type                 0
##  9 Suffix               0
## 10 Num_Off            913

Well this is encouraging. We knew about the Premises missing and issues with number of offenses, so that is a little reassuring. So let’s work our way through each column.

Dates

There are scattered incidents (41 out of 22,700) going back to 1916. These are not useful, so anything prior to January 2009 will be dropped, as well as a small number from the future. The whole record will be discarded since there are not very many and without a date, the data is nearly useless.

We also will convert the dates to an actual datetime, and plot up a bar chart of incidents per day as a good way to spot strange behavior.

############
##  convert date string to actual date, and then trim bad dates out
############
df2$Date <- mdy(df2$Date)

#   How many bad dates?

df2 %>% filter(!between(Date,mdy("1/1/2009"),mdy("1/1/2018")))
## # A tibble: 41 x 10
##    Date       Hour  Offense_Type Beat  Premise Block_Range Street Type 
##    <date>     <chr> <chr>        <chr> <chr>   <chr>       <chr>  <chr>
##  1 2008-12-25 22    Robbery      4F20  Conven… 9500-9599   KEMPW… DR   
##  2 2008-01-19 8     "Aggravated… 4F10  Restau… 9400-9499   LONG … RD   
##  3 2008-01-09 00    Burglary     4F10  Reside… 1200-1299   CONFE… RD   
##  4 2008-12-31 10    Burglary     4F10  Reside… 9500-9599   WESTV… ""   
##  5 2008-12-10 16    Burglary     4F10  Reside… 9800-9899   WARWA… ""   
##  6 2008-12-23 17    Burglary     4F10  Reside… 10200-10299 KNOBO… DR   
##  7 2008-01-16 12    Burglary     4F20  Reside… 3300-3399   CREST… DR   
##  8 2008-12-03 13    Burglary     4F30  Rental… 11900-11999 OLD K… RD   
##  9 2008-12-31 21    "Burglary o… 4F10  Apartm… 1500-1599   GESSN… DR   
## 10 2008-01-16 19    "Burglary o… 4F20  Other … 12000-12099 MOSS … RD   
## # … with 31 more rows, and 2 more variables: Suffix <chr>, Num_Off <int>
df2 <- df2 %>%
  filter(between(Date,mdy("1/1/2009"),mdy("1/1/2018")))

df2 %>% group_by(Date) %>%
  summarise(Total_Incidents_per_Day=n()) %>%
ggplot(aes(Total_Incidents_per_Day)) +
  geom_bar()

Hour of day

The next field is the hour of day. It should be an integer from 0 to 23, but we have read it in as a character, because of “issues”. So we will clean it up and convert to an integer. Specifically, occasionally the field contains spurious non-numeric characters, usually a single quote. No idea how that happened.

sort(unique(df2$Hour)) # look for issues
##  [1] "0"   "00"  "'00" "01"  "02"  "03"  "'03" "04"  "05"  "06"  "07" 
## [12] "08"  "09"  "'09" "1"   "10"  "'10" "11"  "12"  "'12" "13"  "14" 
## [23] "15"  "'15" "16"  "'16" "17"  "18"  "19"  "2"   "20"  "21"  "22" 
## [34] "'22" "23"  "'23" "3"   "4"   "5"   "6"   "7"   "8"   "9"
# remove non-numerics and change to integer
df2$Hour <- str_replace_all(df2$Hour,"\\D","")
df2$Hour <- as.integer(df2$Hour)

ggplot(data=df2) +
  geom_bar(aes(x=Hour)) +
  labs(title=paste("Incidents per Hour of Day, District",district))

Offense Type

Whoops! BIG problem. Apparently, prior to June 2009, Theft was recorded as Burglary. Also, after June 2009, DWI and Narcotic offenses were no longer recorded. Which makes all the data prior to June 2009 useless. So, it will be deleted.

Extra carriage return characters appear, Forcible Rape and Rape (consolidate), Murder and Murder & Manslaughter (consolidate). Burglary and Burglary of a motor vehicle – both become Burglary.

One record has an offense type of “1”. We’ll just delete that record.

First some utility functions

# The purpose of these utilities is to allow readable creation
# of regular expressions for changing strings, a way to test
# the expressions before applying, and a tool to apply them.

#     Create dictionary dataframe of pattern/replacement
#     (It sure would be nice if R had a hash table)
Makedict <- function(dictionary) {
  dict <- cbind.data.frame(split(dictionary, rep(1:2, times=length(dictionary)/2)), stringsAsFactors=F)
  names(dict) <- c("From", "To")
  return(dict)
}

#   test the searches first to see what they will find
testregex <- function(dframe, col, pat) { # input data frame and regex
  for(i in 1:length(pat[,1])) {
    print(paste("Pattern: ",pat[i,1]))
    hits <- unique(dframe[[col]][grepl(pat[i,1],dframe[[col]])])
    if (length(hits)>0){
      print(paste("   Result: ", hits))
    }
    else {
      print("No hits")
    }
  }
}

#   apply to input array
applyregex <- function(dframe, col, pat) {
  for(i in 1:length(pat[,1])) {
  dframe[[col]] <- str_replace_all(dframe[[col]],pat[i,1],pat[i,2])
  }
  return(dframe)
}

Now lets look at the offense field

sort(unique(df2$Offense_Type))
##  [1] "1"                                      
##  [2] "Aggravated Assault"                     
##  [3] "Aggravated\n    Assault"                
##  [4] "AutoTheft"                              
##  [5] "Auto Theft"                             
##  [6] "Burglary"                               
##  [7] "Burglary of a Motor Vehicle"            
##  [8] "Burglary of a\n    Motor Vehicle"       
##  [9] "Driving While Intoxicated"              
## [10] "Driving While\n    Intoxicated"         
## [11] "Forcible Rape"                          
## [12] "Murder"                                 
## [13] "Murder & Nonnegligent Manslaughter"     
## [14] "Murder &\n    Nonnegligent Manslaughter"
## [15] "Narcotic Drug Laws"                     
## [16] "Narcotic Drug\n    Laws"                
## [17] "Rape"                                   
## [18] "Robbery"                                
## [19] "Theft"
#   clean up Offense_Type and look again for issues

#   dictionary of changes
dictionary <- c("\n\\s+",                       " ", 
                "Forcible ",                    "", 
                "AutoTheft",                    "Auto Theft",
                " of a Motor Vehicle",          "",
                " & Nonnegligent Manslaughter", ""
                )
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Offense_Type", dict)
## [1] "Pattern:  \n\\s+"
## [1] "   Result:  Murder &\n    Nonnegligent Manslaughter"
## [2] "   Result:  Aggravated\n    Assault"                
## [3] "   Result:  Burglary of a\n    Motor Vehicle"       
## [4] "   Result:  Narcotic Drug\n    Laws"                
## [5] "   Result:  Driving While\n    Intoxicated"         
## [1] "Pattern:  Forcible "
## [1] "   Result:  Forcible Rape"
## [1] "Pattern:  AutoTheft"
## [1] "   Result:  AutoTheft"
## [1] "Pattern:   of a Motor Vehicle"
## [1] "   Result:  Burglary of a Motor Vehicle"
## [1] "Pattern:   & Nonnegligent Manslaughter"
## [1] "   Result:  Murder & Nonnegligent Manslaughter"
#   Apply
df2 <- applyregex(df2, "Offense_Type", dict)
#   One record has an incident type of "1". We'll just delete that record.
df2 <- df2[!df2$Offense_Type=="1",]

sort(unique(df2$Offense_Type))
## [1] "Aggravated Assault"        "Auto Theft"               
## [3] "Burglary"                  "Driving While Intoxicated"
## [5] "Murder"                    "Narcotic Drug Laws"       
## [7] "Rape"                      "Robbery"                  
## [9] "Theft"
#   Bar chart of various incidents
ggplot(data=df2) +
  geom_bar(aes(x=Offense_Type)) +
  labs(title=paste("Incidents per Offense Type, District",district)) +
  coord_flip()

# Plot of incidents per month vs date
per_month = df2 %>% 
  mutate(mon = as.numeric(format(Date, "%m")), yr = as.numeric(format(Date, "%Y"))) %>%
  mutate(YrMon=yr+mon/12) %>%
  group_by(YrMon, Offense_Type) %>%
  summarize(total=n())

ggplot(data=per_month, mapping=aes(x=YrMon, y=total, color=Offense_Type)) +
  geom_point() +
  geom_line() +
  labs(title="Total incidents", y="Total incidents per month", x="Averaged Monthly") 

#   filter out past 2010 to get a better view of what is going on
per_month = df2 %>% 
  mutate(mon = as.numeric(format(Date, "%m")), yr = as.numeric(format(Date, "%Y"))) %>%
  mutate(YrMon=yr+mon/12) %>%
  filter(YrMon<2011) %>%
  group_by(YrMon, Offense_Type) %>%
  summarize(total=n())
ggplot(data=per_month, mapping=aes(x=YrMon, y=total, color=Offense_Type)) +
  geom_point() +
  geom_line() +
  labs(title="Total incidents expanded", y="Total incidents per month", x="Averaged Monthly") 

#     Get rid of all data prior to June 2009, since it cannot be compared with later data

df2 <- df2 %>%
  filter(between(Date,mdy("6/1/2009"),mdy("1/1/2018")))
sort(unique(df2$Offense_Type))
## [1] "Aggravated Assault" "Auto Theft"         "Burglary"          
## [4] "Murder"             "Rape"               "Robbery"           
## [7] "Theft"

Beat field

Should be pretty simple, but for some strange reason has the occasional odd single quote floating around.

#################################
# Beat
#################################

sort(unique(df2$Beat))
## [1] "4F10"  "'4F10" "4F20"  "'4F20" "4F30"  "'4F30"
# remove non-alphanumerics
df2$Beat <- str_replace_all(df2$Beat,"\\W","")

sort(unique(df2$Beat))
## [1] "4F10" "4F20" "4F30"

Premise field

Where did the crime occur - what sort of place? There are 132 premise descriptions! Everything from “Residence or House” to “Vacant Bank”. Not too many incidents in that last one, as you can imagine. So in addition to cleaning up data entry errors, I created a table of only 7 premises - the statistics are completely dominated by the top 10-15. My shortened set is Parking, Residence, Business, Street, Unk, Other, and Garage.

1.6% of the premise fields are blank. Almost all of them are from June and July 2014, but there are a few others scattered about. To remove the NA, to make it easier to use the data, change to the string UNK.

Also make Premise == blank “UNK”, and fix typo errors. Lots of typos.

#################################
# Premise
#################################

per_month = df2 %>% 
  mutate(mon = as.numeric(format(Date, "%m")), yr = as.numeric(format(Date, "%Y"))) %>%
  mutate(YrMon=yr+mon/12) %>%
  group_by(YrMon) %>%
  summarize(count=sum(is.na(Premise)))

ggplot(data=per_month, mapping=aes(x=YrMon, y=count)) +
  geom_point() +
  labs(title="Blank Premise fields per month", y="Total blanks per month", x="Averaged Monthly") 

#   make more manageable by changing NULL to UNK
df2$Premise[is.na(df2$Premise)] <- "Other or Unknown"

#   Clean up obvious issues

dictionary <- c("\n\\s+",    " ",  # carriage returns
                "^$",        "Other or Unknown", 
                "  ",        " ",  #  double space
                " and ",     " & ",
                " And ",     " & ",
                "/",         " or ",
                ", ",        ",",
                "Amuse. Park,Bowl.","Amusement Park,Bowling",
                "ment Rent", "ment or Rent",
                "Saving ",   "Savings "
                )
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Premise", dict)
## [1] "Pattern:  \n\\s+"
##  [1] "   Result:  Grocery Store or Supermarket\n  Parking Lot"  
##  [2] "   Result:  Miscellaneous Business\n  (Non-Specific)"     
##  [3] "   Result:  Bank or Savings Institution Parking\n  Lot"   
##  [4] "   Result:  Vehicle/Auto Sales/Lease/Auto Parts\n  Store" 
##  [5] "   Result:  Parks and Recreation, Zoo, Swimming\n  Pool"  
##  [6] "   Result:  Toys, Arts Craft,Musical, Bike,Pet\n  Stores" 
##  [7] "   Result:  Electronics Store, Electrical\n  Supplies"    
##  [8] "   Result:  Gym, Recreat, Club House, Indoor\n  Pool, Spa"
##  [9] "   Result:  Church, Synagogue, or Temple\n  Parking Lot"  
## [10] "   Result:  Amusement Park, Bowling Alley,\n  Skate Rink" 
## [11] "   Result:  Vacant Single Occ\n  Resd(House,Townhs,Dplex)"
## [12] "   Result:  Utility Company, Electric, Gas,\n  Water"     
## [13] "   Result:  Vacant Other Residential\n  (Apartment,Dorms)"
## [14] "   Result:  Vacant Other Out\n  Build/Monument/UnderConst"
## [15] "   Result:  Vacant Storage Fac\n  (Barn,Garage,Warehouse)"
## [16] "   Result:  Factory, Manufacturing, or\n  Industrial"     
## [17] "   Result:  Daycare, Child Care, or\n  Kindergarten"      
## [18] "   Result:  Gym, Recreat, Club House, Indoor\n  Pool"     
## [19] "   Result:  Furniture, Appliances, Radios, TV\n  Store"   
## [1] "Pattern:  ^$"
## [1] "   Result:  "
## [1] "Pattern:    "
##  [1] "   Result:  Grocery Store or Supermarket\n  Parking Lot"  
##  [2] "   Result:  Miscellaneous Business\n  (Non-Specific)"     
##  [3] "   Result:  Bank or Savings Institution Parking\n  Lot"   
##  [4] "   Result:  Vehicle/Auto Sales/Lease/Auto Parts\n  Store" 
##  [5] "   Result:  Parks and Recreation, Zoo, Swimming\n  Pool"  
##  [6] "   Result:  Toys, Arts Craft,Musical, Bike,Pet\n  Stores" 
##  [7] "   Result:  Electronics Store, Electrical\n  Supplies"    
##  [8] "   Result:  Gym, Recreat, Club House, Indoor\n  Pool, Spa"
##  [9] "   Result:  Church, Synagogue, or Temple\n  Parking Lot"  
## [10] "   Result:  Amusement Park, Bowling Alley,\n  Skate Rink" 
## [11] "   Result:  Vacant Single Occ\n  Resd(House,Townhs,Dplex)"
## [12] "   Result:  Utility Company, Electric, Gas,\n  Water"     
## [13] "   Result:  Vacant Other Residential\n  (Apartment,Dorms)"
## [14] "   Result:  Vacant Other Out\n  Build/Monument/UnderConst"
## [15] "   Result:  Vacant Storage Fac\n  (Barn,Garage,Warehouse)"
## [16] "   Result:  Factory, Manufacturing, or\n  Industrial"     
## [17] "   Result:  Daycare, Child Care, or\n  Kindergarten"      
## [18] "   Result:  Gym, Recreat, Club House, Indoor\n  Pool"     
## [19] "   Result:  Furniture, Appliances, Radios, TV\n  Store"   
## [1] "Pattern:   and "
## [1] "   Result:  Savings and Loan Institutions"              
## [2] "   Result:  Barber and Beauty Shops"                    
## [3] "   Result:  Park and Ride Terminal"                     
## [4] "   Result:  Parks and Recreation, Zoo, Swimming\n  Pool"
## [5] "   Result:  Parks and Recreation, Zoo, Swimming Pool"   
## [1] "Pattern:   And "
## [1] "   Result:  Barber And Beauty Shops"
## [1] "Pattern:  /"
##  [1] "   Result:  Road/Street/Sidewalk"                         
##  [2] "   Result:  Commercial Parking Lot/Garage"                
##  [3] "   Result:  Grocery/Supermarket"                          
##  [4] "   Result:  Service/Gas Station"                          
##  [5] "   Result:  Residence/House"                              
##  [6] "   Result:  Bar/Night Club"                               
##  [7] "   Result:  Restaurant/Cafeteria Parking Lot"             
##  [8] "   Result:  Garage/Carport"                               
##  [9] "   Result:  Other/Unknown"                                
## [10] "   Result:  Apartment/Rental Office"                      
## [11] "   Result:  Restaurant/Cafeteria"                         
## [12] "   Result:  Church/Synagogue/Temple"                      
## [13] "   Result:  Vehicle/Auto Sales/Lease/Auto Parts Store"    
## [14] "   Result:  Grocery/Supermarket Parking Lot"              
## [15] "   Result:  Hotel/Motel Parking Lot"                      
## [16] "   Result:  Pawn/Resale Shop/Flea Market"                 
## [17] "   Result:  Laundry/Dry Cleaners/Washaterias"             
## [18] "   Result:  Hotel/Motel/Etc."                             
## [19] "   Result:  Church/Synagogue/Temple Parking Lot"          
## [20] "   Result:  Factory/Manufacturing/Industrial"             
## [21] "   Result:  Department/Discount Store"                    
## [22] "   Result:  Bank/Saving Institution Parking Lot"          
## [23] "   Result:  Drug Store/Medical Supply"                    
## [24] "   Result:  Bar/Night Club Parking Lot"                   
## [25] "   Result:  Social Services/Public Charities"             
## [26] "   Result:  Laundry/Dry Cleaners Parking Lot"             
## [27] "   Result:  Field/Woods"                                  
## [28] "   Result:  Government/Public Building"                   
## [29] "   Result:  Convention Center/Exhibit Halls"              
## [30] "   Result:  Vacant School/College"                        
## [31] "   Result:  University/College"                           
## [32] "   Result:  Maintenance/Building Services"                
## [33] "   Result:  Highway/Freeway"                              
## [34] "   Result:  Sporting Goods/Gun Shops"                     
## [35] "   Result:  Vacant Industrial/Manufacturing/Industrial"   
## [36] "   Result:  Pool Hall/Game Room"                          
## [37] "   Result:  Daycare/Child Care/Kindergarten"              
## [38] "   Result:  Stadium/Sprts Arena/Race Track"               
## [39] "   Result:  Vehicle/Auto Sales/Lease/Auto Parts\n  Store" 
## [40] "   Result:  Vacant Other Out\n  Build/Monument/UnderConst"
## [1] "Pattern:  , "
##  [1] "   Result:  Electronics Store, Electrical Sup."                                  
##  [2] "   Result:  Parks & Recreation, Zoo, Swim Pool"                                  
##  [3] "   Result:  Vacant Single Occupancy Residence (Houses,Townhouses,Duplexes, Etc.)"
##  [4] "   Result:  Furniture, Appliances, Radios, Tv"                                   
##  [5] "   Result:  Libraries, Museums"                                                  
##  [6] "   Result:  Garden Supply, Nursery, Florist"                                     
##  [7] "   Result:  Road, Street, or Sidewalk"                                           
##  [8] "   Result:  Other, Unknown, or Not Listed"                                       
##  [9] "   Result:  Field, Woods, Forest, Park"                                          
## [10] "   Result:  Parks and Recreation, Zoo, Swimming\n  Pool"                         
## [11] "   Result:  Hotel, Motel, Inn, Etc."                                             
## [12] "   Result:  Church, Synagogue, or Temple Parking Lot"                            
## [13] "   Result:  Vacant Church, Synagogue, or Temple"                                 
## [14] "   Result:  Furniture, Appliances, Radios, TV Store"                             
## [15] "   Result:  Parks and Recreation, Zoo, Swimming Pool"                            
## [16] "   Result:  Physician, Doctor, Dentist's Office"                                 
## [17] "   Result:  Gym, Recreat, Club House, Indoor Pool, Spa"                          
## [18] "   Result:  Stadium, Sports Arena, Race Track"                                   
## [19] "   Result:  Garden Supply, Nursery, or Florist"                                  
## [20] "   Result:  Toys, Arts Craft,Musical, Bike,Pet\n  Stores"                        
## [21] "   Result:  Electronics Store, Electrical\n  Supplies"                           
## [22] "   Result:  Gym, Recreat, Club House, Indoor\n  Pool, Spa"                       
## [23] "   Result:  Laundry, Dry Cleaners, Washaterias"                                  
## [24] "   Result:  Church, Synagogue, or Temple\n  Parking Lot"                         
## [25] "   Result:  Amusement Park, Bowling Alley,\n  Skate Rink"                        
## [26] "   Result:  Pawn, Resale Shop, or Flea Market"                                   
## [27] "   Result:  Electronics Store, Electrical Supplies"                              
## [28] "   Result:  Church, Synagogue, or Temple"                                        
## [29] "   Result:  Amusement Park, Bowling Alley, Skate Rink"                           
## [30] "   Result:  Toys, Arts Craft,Musical, Bike,Pet Stores"                           
## [31] "   Result:  Factory, Manufacturing, or Industrial"                               
## [32] "   Result:  Utility Company, Electric, Gas, Water"                               
## [33] "   Result:  Utility Company, Electric, Gas,\n  Water"                            
## [34] "   Result:  Factory, Manufacturing, or\n  Industrial"                            
## [35] "   Result:  Vacant Hotel, Motel, Etc."                                           
## [36] "   Result:  Daycare, Child Care, or\n  Kindergarten"                             
## [37] "   Result:  Gym, Recreat, Club House, Indoor Pool"                               
## [38] "   Result:  Gym, Recreat, Club House, Indoor\n  Pool"                            
## [39] "   Result:  Furniture, Appliances, Radios, TV\n  Store"                          
## [40] "   Result:  Theatres, Dinner Theaters, Auditoriums"                              
## [41] "   Result:  Daycare, Child Care, or Kindergarten"                                
## [1] "Pattern:  Amuse. Park,Bowl."
## [1] "   Result:  Amuse. Park,Bowl. Alley,Skate Rink"
## [1] "Pattern:  ment Rent"
## [1] "   Result:  Apartment Rental Office"
## [1] "Pattern:  Saving "
## [1] "   Result:  Bank/Saving Institution Parking Lot"
#   Apply
df2 <- applyregex(df2, "Premise", dict)

dictionary <- c(
                "rch,Syn",            "rch or Syn",
                " Or ",               " or ",
                "Sup\\.",             "Supplies",
                "Daycare or Child",   "Daycare,Child",
                "Factory or Manu",    "Factory, Manu",
                "Field or Woods",     "Field,Woods,Forest,Park",
                "Tv",                 "TV Store",
                "Grocery Store",      "Grocery",
                "Pool$",              "Pool,Spa",
                "Hse,Indr",           "House,Indoor"
)
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Premise", dict)
## [1] "Pattern:  rch,Syn"
## [1] "   Result:  Church,Synagogue,or Temple Parking Lot"
## [2] "   Result:  Vacant Church,Synagogue,or Temple"     
## [3] "   Result:  Church,Synagogue,or Temple"            
## [1] "Pattern:   Or "
## [1] "   Result:  Commercial Or Training School"
## [1] "Pattern:  Sup\\."
## [1] "   Result:  Electronics Store,Electrical Sup." 
## [2] "   Result:  Book,Record,Stationary,Office Sup."
## [1] "Pattern:  Daycare or Child"
## [1] "   Result:  Daycare or Child Care or Kindergarten"
## [1] "Pattern:  Factory or Manu"
## [1] "   Result:  Factory or Manufacturing or Industrial"
## [1] "Pattern:  Field or Woods"
## [1] "   Result:  Field or Woods"
## [1] "Pattern:  Tv"
## [1] "   Result:  Furniture,Appliances,Radios,Tv"
## [1] "Pattern:  Grocery Store"
## [1] "   Result:  Grocery Store or Supermarket Parking Lot"
## [2] "   Result:  Grocery Store or Supermarket"            
## [3] "   Result:  Vacant Grocery Store or Supermarket"     
## [1] "Pattern:  Pool$"
## [1] "   Result:  Parks & Recreation,Zoo,Swim Pool"    
## [2] "   Result:  Parks & Recreation,Zoo,Swimming Pool"
## [3] "   Result:  Gym,Recreat,Club House,Indoor Pool"  
## [1] "Pattern:  Hse,Indr"
## [1] "   Result:  Gym,Recreat,Club Hse,Indr Pool,Spa"
#   Apply
df2 <- applyregex(df2, "Premise", dict)


dictionary <- c(
                "Misc\\.",               "Miscellaneous",
                "Hme",                   "Home ",
                "Etc\\.",                "Etc",
                "^Other,Unknown, or Not Listed$","Other or Unknown",
                "^Not Listed$",            "Other or Unknown",
                "Swim ",                 "Swimming ",
                "wn,Re",                 "wn or Re",
                "Physician's Office",    "Physician,Doctor,Dentist's Office",
                " Of ",                  " of ",
                "ad,St",                 "ad or St"
)
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Premise", dict)
## [1] "Pattern:  Misc\\."
## [1] "   Result:  Misc. Business (Non-Specific)"
## [1] "Pattern:  Hme"
## [1] "   Result:  Multi-Plex Hme(Duplex,Triplex Etc)"
## [1] "Pattern:  Etc\\."
## [1] "   Result:  Hotel or Motel or Etc."                                                            
## [2] "   Result:  Vacant Single Occupancy Residence (Houses,Townhouses,Duplexes,Etc.)"               
## [3] "   Result:  Vacant Storage Facility (Barns,Garages,Warehouses,Etc.)"                           
## [4] "   Result:  Vacant Other Structure (Out Buildings,Monuments,Buildings Under Construction,Etc.)"
## [5] "   Result:  Hotel,Motel,Inn,Etc."                                                              
## [6] "   Result:  Multi-Plex Home (Duplex,Triplex Etc.)"                                             
## [7] "   Result:  Vacant Hotel,Motel,Etc."                                                           
## [1] "Pattern:  ^Other,Unknown, or Not Listed$"
## [1] "No hits"
## [1] "Pattern:  ^Not Listed$"
## [1] "   Result:  Not Listed"
## [1] "Pattern:  Swim "
## [1] "   Result:  Parks & Recreation,Zoo,Swim Pool,Spa"
## [1] "Pattern:  wn,Re"
## [1] "   Result:  Pawn,Resale Shop,or Flea Market"
## [1] "Pattern:  Physician's Office"
## [1] "   Result:  Physician's Office"
## [1] "Pattern:   Of "
## [1] "No hits"
## [1] "Pattern:  ad,St"
## [1] "   Result:  Road,Street,or Sidewalk"
#   Apply
df2 <- applyregex(df2, "Premise", dict)

dictionary <- c(
                "Sprts",           "Sports",
                "ts Cra",          "ts & Cra",
                "Apartment,Dorms", "Apartment,Inn,Dorms,Boarding House",
                "Occ Resd\\(House,Townhs,Dplex\\)","Occupancy Residence \\(Houses,Townhouses,Duplexes,Etc\\)",
                "Fac \\(Barn,Garage,Warehouse\\)","Facility \\(Barns,Garages,Warehouses,Etc\\)",
                "cord,Stat",        "cord or Stat",
                "care,Chil",        "care or Chil",
                "tory,Manu",        "tory or Manu",
                ",Pet$",            ",Pet Stores",
                "ium,Spor",         "ium or Spor"
)
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Premise", dict)
## [1] "Pattern:  Sprts"
## [1] "   Result:  Stadium or Sprts Arena or Race Track"
## [1] "Pattern:  ts Cra"
## [1] "   Result:  Toys,Arts Craft,Musical,Bike,Pet Stores"
## [1] "Pattern:  Apartment,Dorms"
## [1] "   Result:  Vacant Other Residential (Apartment,Dorms)"
## [1] "Pattern:  Occ Resd\\(House,Townhs,Dplex\\)"
## [1] "   Result:  Vacant Single Occ Resd(House,Townhs,Dplex)"
## [1] "Pattern:  Fac \\(Barn,Garage,Warehouse\\)"
## [1] "   Result:  Vacant Storage Fac (Barn,Garage,Warehouse)"
## [1] "Pattern:  cord,Stat"
## [1] "   Result:  Book,Record,Stationary,Office Supplies"
## [1] "Pattern:  care,Chil"
## [1] "   Result:  Daycare,Child Care or Kindergarten"
## [2] "   Result:  Daycare,Child Care,or Kindergarten"
## [1] "Pattern:  tory,Manu"
## [1] "   Result:  Factory,Manufacturing,or Industrial"
## [1] "Pattern:  ,Pet$"
## [1] "   Result:  Toys,Arts & Craft,Musical,Bike,Pet"
## [1] "Pattern:  ium,Spor"
## [1] "   Result:  Stadium,Sports Arena,Race Track"
#   Apply
df2 <- applyregex(df2, "Premise", dict)

dictionary <- c(
                "Motel,Inn,Etc",        "Motel,Etc",
                "Out Build or Monument or UnderConst", "Structure (Out Buildings,Monuments,Buildings Under Construction,Etc)",
                "(\\w),or ",            "\\1 or ",
                "Contra-Flow or Managed or HOV Lanes", "Contra-Flow or Hov",
                "Lake or Pond or Waterway or Bayou or River", "Lake or Waterway or Bayou",
                "Veh\\.", "Vehicle",
                "Auditor\\.", "Auditoriums",
                "Blding$", "Building",
                " or University$", "",
                "Factory, Manufacturing or Industrial", "Factory or Manufacturing Building",
                "Factory or Manufacturing or Industrial", "Factory or Manufacturing Building",
                "Vacant Industrial or Manufacturing or Industrial", "Vacant Industrial or Manufacturing Building",
                "Light Rail Vehicle", "Light Rail (Metro Rail) Vehicle",
                "(\\w),(\\w)",          "\\1 or \\2"
)
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Premise", dict)
## [1] "Pattern:  Motel,Inn,Etc"
## [1] "   Result:  Hotel,Motel,Inn,Etc"
## [1] "Pattern:  Out Build or Monument or UnderConst"
## [1] "   Result:  Vacant Other Out Build or Monument or UnderConst"
## [1] "Pattern:  (\\w),or "
## [1] "   Result:  Road or Street,or Sidewalk"               
## [2] "   Result:  Other,Unknown,or Not Listed"              
## [3] "   Result:  Church or Synagogue,or Temple Parking Lot"
## [4] "   Result:  Vacant Church or Synagogue,or Temple"     
## [5] "   Result:  Garden Supply,Nursery,or Florist"         
## [6] "   Result:  Pawn or Resale Shop,or Flea Market"       
## [7] "   Result:  Church or Synagogue,or Temple"            
## [8] "   Result:  Factory or Manufacturing,or Industrial"   
## [9] "   Result:  Daycare or Child Care,or Kindergarten"    
## [1] "Pattern:  Contra-Flow or Managed or HOV Lanes"
## [1] "No hits"
## [1] "Pattern:  Lake or Pond or Waterway or Bayou or River"
## [1] "No hits"
## [1] "Pattern:  Veh\\."
## [1] "No hits"
## [1] "Pattern:  Auditor\\."
## [1] "No hits"
## [1] "Pattern:  Blding$"
## [1] "No hits"
## [1] "Pattern:   or University$"
## [1] "No hits"
## [1] "Pattern:  Factory, Manufacturing or Industrial"
## [1] "   Result:  Factory, Manufacturing or Industrial"
## [1] "Pattern:  Factory or Manufacturing or Industrial"
## [1] "No hits"
## [1] "Pattern:  Vacant Industrial or Manufacturing or Industrial"
## [1] "   Result:  Vacant Industrial or Manufacturing or Industrial"
## [1] "Pattern:  Light Rail Vehicle"
## [1] "No hits"
## [1] "Pattern:  (\\w),(\\w)"
##  [1] "   Result:  Multi-Plex Home (Duplex,Triplex Etc)"                                             
##  [2] "   Result:  Electronics Store,Electrical Supplies"                                            
##  [3] "   Result:  Vacant Other Residential (Apartment,Inn,Dorms,Boarding House)"                    
##  [4] "   Result:  Gym,Recreat,Club House,Indoor Pool,Spa"                                           
##  [5] "   Result:  Parks & Recreation,Zoo,Swimming Pool,Spa"                                         
##  [6] "   Result:  Toys,Arts & Craft,Musical,Bike,Pet Stores"                                        
##  [7] "   Result:  Field,Woods,Forest,Park"                                                          
##  [8] "   Result:  Utility Company,Electric,Gas,Water"                                               
##  [9] "   Result:  Vacant Single Occupancy Residence (Houses,Townhouses,Duplexes,Etc)"               
## [10] "   Result:  Vacant Storage Facility (Barns,Garages,Warehouses,Etc)"                           
## [11] "   Result:  Physician,Doctor,Dentist's Office"                                                
## [12] "   Result:  Furniture,Appliances,Radios,TV Store"                                             
## [13] "   Result:  Libraries,Museums"                                                                
## [14] "   Result:  Amusement Park,Bowling Alley,Skate Rink"                                          
## [15] "   Result:  Vacant Other Structure (Out Buildings,Monuments,Buildings Under Construction,Etc)"
## [16] "   Result:  Garden Supply,Nursery,Florist"                                                    
## [17] "   Result:  Book,Record or Stationary,Office Supplies"                                        
## [18] "   Result:  Road or Street,or Sidewalk"                                                       
## [19] "   Result:  Other,Unknown,or Not Listed"                                                      
## [20] "   Result:  Hotel,Motel,Inn,Etc"                                                              
## [21] "   Result:  Church or Synagogue,or Temple Parking Lot"                                        
## [22] "   Result:  Vacant Church or Synagogue,or Temple"                                             
## [23] "   Result:  Stadium or Sports Arena,Race Track"                                               
## [24] "   Result:  Garden Supply,Nursery,or Florist"                                                 
## [25] "   Result:  Laundry,Dry Cleaners,Washaterias"                                                 
## [26] "   Result:  Pawn or Resale Shop,or Flea Market"                                               
## [27] "   Result:  Church or Synagogue,or Temple"                                                    
## [28] "   Result:  Factory or Manufacturing,or Industrial"                                           
## [29] "   Result:  Vacant Hotel,Motel,Etc"                                                           
## [30] "   Result:  Daycare or Child Care,or Kindergarten"                                            
## [31] "   Result:  Theatres,Dinner Theaters,Auditoriums"                                             
## [32] "   Result:  Package Facility (FedEx,UPS,DHL)"
#   Apply
df2 <- applyregex(df2, "Premise", dict)

dictionary <- c(
                "Lake or Pond or Waterway or Bayou or River", "Lake or Waterway or Bayou",
                "Other or Unknown or Not Listed", "Other or Unknown"
)
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Premise", dict)
## [1] "Pattern:  Lake or Pond or Waterway or Bayou or River"
## [1] "No hits"
## [1] "Pattern:  Other or Unknown or Not Listed"
## [1] "   Result:  Other or Unknown or Not Listed"
#   Apply
df2 <- applyregex(df2, "Premise", dict)

sort(unique(df2$Premise))
##   [1] "Adult Book Store or Newsstand"                                                             
##   [2] "Airport Terminal"                                                                          
##   [3] "Alley"                                                                                     
##   [4] "Amusement Park or Bowling Alley or Skate Rink"                                             
##   [5] "Apartment"                                                                                 
##   [6] "Apartment Laundry"                                                                         
##   [7] "Apartment or Rental Office"                                                                
##   [8] "Apartment Parking Lot"                                                                     
##   [9] "Auto Repair"                                                                               
##  [10] "Bank"                                                                                      
##  [11] "Bank or Savings Institution Parking Lot"                                                   
##  [12] "Barber & Beauty Shops"                                                                     
##  [13] "Bar or Night Club"                                                                         
##  [14] "Bar or Night Club Parking Lot"                                                             
##  [15] "Body Shop"                                                                                 
##  [16] "Book or Record or Stationary or Office Supplies"                                           
##  [17] "Bus Station"                                                                               
##  [18] "Bus Stop"                                                                                  
##  [19] "Car Wash"                                                                                  
##  [20] "Check Cashing Places"                                                                      
##  [21] "Church or Synagogue or Temple"                                                             
##  [22] "Church or Synagogue or Temple Parking Lot"                                                 
##  [23] "Clothing Store"                                                                            
##  [24] "Commercial Building"                                                                       
##  [25] "Commercial or Training School"                                                             
##  [26] "Commercial Parking Lot or Garage"                                                          
##  [27] "Condominium"                                                                               
##  [28] "Construction Site"                                                                         
##  [29] "Convenience Store"                                                                         
##  [30] "Convenience Store Parking Lot"                                                             
##  [31] "Convention Center or Exhibit Halls"                                                        
##  [32] "Daycare or Child Care or Kindergarten"                                                     
##  [33] "Department or Discount Store"                                                              
##  [34] "Driveway"                                                                                  
##  [35] "Drug Store or Medical Supply"                                                              
##  [36] "Electronics Store or Electrical Supplies"                                                  
##  [37] "Elementary School"                                                                         
##  [38] "Factory or Manufacturing Building"                                                         
##  [39] "Field or Woods or Forest or Park"                                                          
##  [40] "Freeway Service Road"                                                                      
##  [41] "Furniture or Appliances or Radios or TV Store"                                             
##  [42] "Garage or Carport"                                                                         
##  [43] "Garden Supply or Nursery or Florist"                                                       
##  [44] "Government or Public Building"                                                             
##  [45] "Grocery or Supermarket"                                                                    
##  [46] "Grocery or Supermarket Parking Lot"                                                        
##  [47] "Gym or Recreat or Club House or Indoor Pool or Spa"                                        
##  [48] "High Rise"                                                                                 
##  [49] "High School"                                                                               
##  [50] "Highway or Freeway"                                                                        
##  [51] "Hospital"                                                                                  
##  [52] "Hospital Parking Lot"                                                                      
##  [53] "Hotel or Motel or Etc"                                                                     
##  [54] "Hotel or Motel Parking Lot"                                                                
##  [55] "Illicit Massage Parlor or Spa"                                                             
##  [56] "Jail or Prison"                                                                            
##  [57] "Jewelry Stores"                                                                            
##  [58] "Laundry or Dry Cleaners or Washaterias"                                                    
##  [59] "Laundry or Dry Cleaners Parking Lot"                                                       
##  [60] "Libraries or Museums"                                                                      
##  [61] "Liquor Store"                                                                              
##  [62] "Liquor Store Parking Lot"                                                                  
##  [63] "Maintenance or Building Services"                                                          
##  [64] "Mall Common Area"                                                                          
##  [65] "Mall Parking Lot"                                                                          
##  [66] "Middle School"                                                                             
##  [67] "Miscellaneous Business (Non-Specific)"                                                     
##  [68] "Mobile Home"                                                                               
##  [69] "Multi-Plex Home (Duplex or Triplex Etc)"                                                   
##  [70] "Nursing Home"                                                                              
##  [71] "Office Building"                                                                           
##  [72] "Other or Unknown"                                                                          
##  [73] "Other Parking Lot"                                                                         
##  [74] "Package Facility (FedEx or UPS or DHL)"                                                    
##  [75] "Park & Ride Terminal"                                                                      
##  [76] "Parks & Recreation or Zoo or Swimming Pool or Spa"                                         
##  [77] "Pawn or Resale Shop or Flea Market"                                                        
##  [78] "Physician or Doctor or Dentist's Office"                                                   
##  [79] "Police Station"                                                                            
##  [80] "Pool Hall or Game Room"                                                                    
##  [81] "Private School"                                                                            
##  [82] "Rehabilitation Center"                                                                     
##  [83] "Rental Storage Facility"                                                                   
##  [84] "Residence or House"                                                                        
##  [85] "Restaurant or Cafeteria"                                                                   
##  [86] "Restaurant or Cafeteria Parking Lot"                                                       
##  [87] "Road or Street or Sidewalk"                                                                
##  [88] "Savings & Loan Institutions"                                                               
##  [89] "Service or Gas Station"                                                                    
##  [90] "Sexually Oriented Club"                                                                    
##  [91] "Social Services or Public Charities"                                                       
##  [92] "Specialty Store (Non-Specific)"                                                            
##  [93] "Sporting Goods or Gun Shops"                                                               
##  [94] "Stadium or Sports Arena or Race Track"                                                     
##  [95] "Strip Business Center Parking Lot"                                                         
##  [96] "Theatres or Dinner Theaters or Auditoriums"                                                
##  [97] "Tollway"                                                                                   
##  [98] "Toys or Arts & Craft or Musical or Bike or Pet Stores"                                     
##  [99] "University or College"                                                                     
## [100] "Utility Company or Electric or Gas or Water"                                               
## [101] "Vacant Building (Commercial)"                                                              
## [102] "Vacant Church or Synagogue or Temple"                                                      
## [103] "Vacant Grocery or Supermarket"                                                             
## [104] "Vacant Hotel or Motel or Etc"                                                              
## [105] "Vacant Industrial or Manufacturing Building"                                               
## [106] "Vacant Other Residential (Apartment or Inn or Dorms or Boarding House)"                    
## [107] "Vacant Other Structure (Out Buildings or Monuments or Buildings Under Construction or Etc)"
## [108] "Vacant Restaurant"                                                                         
## [109] "Vacant School or College"                                                                  
## [110] "Vacant Single Occupancy Residence (Houses or Townhouses or Duplexes or Etc)"               
## [111] "Vacant Storage Facility (Barns or Garages or Warehouses or Etc)"                           
## [112] "Vehicle or Auto Sales or Lease or Auto Parts Store"                                        
## [113] "Video Rental & Sales"                                                                      
## [114] "Warehouse"
ggplot(data=df2) +
  geom_bar(aes(x=Premise)) +
  labs(title=paste("Incidents per premise, District",district)) +
  coord_flip()

reorder_size <- function(x) {
  factor(x, levels = names(sort(table(x))))
}
ggplot(df2, aes(reorder_size(df2$Premise))) + geom_bar() + coord_flip()

# Plot top 20 premises in order of prominence

df2 %>% group_by(Premise) %>% 
  count() %>% 
  arrange(desc(n)) %>%
  head(20) %>% 
  ungroup() %>% 
  mutate(Premise=factor(Premise)) %>% 
  ggplot(aes(fct_reorder(Premise, n))) +
    geom_bar(stat="identity", aes(y=n)) +
    coord_flip()

a <- left_join(df2,ptab,by="Premise")
# make sure we matched everything - answer here should be zero
sum(is.na(a$ShortPremise))
## [1] 0
sort(table(a$ShortPremise),decreasing=TRUE)
## 
##   Parking Residence  Business       Unk    Street     Other    Garage 
##      8629      5330      4990      1258      1043       367       151
df2 <- a
rm(a)

Block range

Sometimes these end up in E notation, so I will need to repair that. Also replace a blank field with “UNK”.

#################################
# Block_Range
#################################
sort(unique(df2$Block_Range))
##   [1] "0-99"        "10000-10099" "1000-1099"   "100-199"     "10100-10199"
##   [6] "10200-10299" "10300-10399" "10400-10499" "10500-10599" "10600-10699"
##  [11] "10700-10799" "10800-10899" "10900-10999" "11000-11099" "1100-1199"  
##  [16] "11100-11199" "11200-11299" "11300-11399" "11400-11499" "11500-11599"
##  [21] "11600-11699" "11700-11799" "11800-11899" "11900-11999" "12000-12099"
##  [26] "1200-1299"   "12100-12199" "12200-12299" "12300-12399" "12400-12499"
##  [31] "12500-12599" "12600-12699" "12800-12899" "12900-12999" "13000-13099"
##  [36] "1300-1399"   "13100-13199" "13200-13299" "13300-13399" "13400-13499"
##  [41] "13500-13599" "13600-13699" "13700-13799" "13800-13899" "13900-13999"
##  [46] "14000-14099" "1400-1499"   "14200-14299" "14300-14399" "14400-14499"
##  [51] "14500-14599" "14600-14699" "14700-14799" "14800-14899" "14900-14999"
##  [56] "15000-15099" "1500-1599"   "15100-15199" "15200-15299" "15300-15399"
##  [61] "15400-15499" "15500-15599" "15700-15799" "15800-15899" "15900-15999"
##  [66] "16000-16099" "1600-1699"   "16100-16199" "16200-16299" "16300-16399"
##  [71] "16400-16499" "16500-16599" "16600-16699" "16700-16799" "16800-16899"
##  [76] "16900-16999" "17000-17099" "1700-1799"   "17100-17199" "17200-17299"
##  [81] "17300-17399" "17400-17499" "17500-17599" "17600-17699" "17700-17799"
##  [86] "17800-17899" "17900-17999" "18000-18099" "1800-1899"   "18100-18199"
##  [91] "18200-18299" "18300-18399" "18400-18499" "18500-18599" "18600-18699"
##  [96] "18700-18799" "18800-18899" "18900-18999" "19000-19099" "1900-1999"  
## [101] "19100-19199" "19200-19299" "19500-19599" "2000-2099"   "200-299"    
## [106] "20200-20299" "20900-20999" "2100-2199"   "21100-21199" "21300-21399"
## [111] "21900-21999" "2200-2299"   "22300-22399" "22900-22999" "2300-2399"  
## [116] "2400-2499"   "24700-24799" "2500-2599"   "2600-2699"   "2700-2799"  
## [121] "2800-2899"   "2900-2999"   "3000-3099"   "300-399"     "3100-3199"  
## [126] "3200-3299"   "3300-3399"   "3400-3499"   "34800-34899" "3500-3599"  
## [131] "3600-3699"   "3700-3799"   "3800-3899"   "3900-3999"   "4000-4099"  
## [136] "400-499"     "4100-4199"   "4200-4299"   "4300-4399"   "4400-4499"  
## [141] "4500-4599"   "4600-4699"   "4700-4799"   "4800-4899"   "4900-4999"  
## [146] "5000-5099"   "500-599"     "5100-5199"   "5200-5299"   "5300-5399"  
## [151] "5400-5499"   "5500-5599"   "5600-5699"   "5700-5799"   "5900-5999"  
## [156] "6000-6099"   "600-699"     "6100-6199"   "6200-6299"   "6300-6399"  
## [161] "6400-6499"   "6500-6599"   "6600-6699"   "6800-6899"   "6900-6999"  
## [166] "700-799"     "7200-7299"   "7400-7499"   "7500-7599"   "7600-7699"  
## [171] "7800-7899"   "8000-8099"   "800-899"     "8100-8199"   "8200-8299"  
## [176] "8300-8399"   "8500-8599"   "8700-8799"   "8800-8899"   "8900-8999"  
## [181] "900-999"     "9300-9399"   "9400-9499"   "9500-9599"   "9600-9699"  
## [186] "9700-9799"   "9800-9899"   "9900-9999"   "UNK"
df2$Block_Range <- str_replace_all(df2$Block_Range, "1\\.1103e\\+006-1\\.1104e\\+006", "")
df2$Block_Range <- sub("^$","UNK", df2$Block_Range)

Type

This is the type of street: ST, DR, RD, FWY, etc. I’ll clean up a few obvious issues here, and get rid of the dash representing nothing.

#################################
# Type
#################################

sort(unique(df2$Type))
##  [1] "-"       "AVE"     "BLVD"    "CIR"     "CT"      "DR"      "FWY"    
##  [8] "FWY SER" "HWY"     "LN"      "PKWY"    "PL"      "RAMP"    "RD"     
## [15] "ST"      "WALK"    "WAY"
df2$Type <- str_replace_all(df2$Type," ","") 
df2$Type <- str_replace_all(df2$Type,"-","") 
df2$Type <- str_replace_all(df2$Type,"FWY SER","FWY") 
df2$Type <- str_replace_all(df2$Type,"FWYSER","FWY") 
df2$Type <- str_replace_all(df2$Type,"^CIR$","CIRCLE") 

Street names

Saving the best for last - fully half the code for reading in and cleaning up a district file is cleaning up the street names. And the most labor intensive. My plan this year is to try to train a neural net to fix spelling errors in street names, since I have a huge, hand-curated dataset to use for training. But that is the future.

Spelling accuracy

How accurately do the street names need to be spelled, and why?

One goal is to have a local file of block, street that I can use to geocode without having to go online. But for that to work, the street names need to be accurate. Which raises the question - what is the authoritative source for the names?

I have 3 sources for local street names. Google maps, the U.S. Census Bureau geocoder - based on the Tiger files, and the City of Houston GIS department.

I choose to make the city GIS data authoritative - it does seem that they would take special care since the city has street naming authority and those names get propagated to the USPS and to 911.

Interestingly, the census usually, but not always, agrees with the city. Google, more often, does not. Road vs. Drive, Beverlyhill vs. Beverly Hill, Martin Luther King Jr vs. Martin Luther King. Later, when geocoding, these differences will be a source of much frustration and gnashing of teeth.

Spelling matters - there are streets that are distiguished only by their suffix or prefix. Gessner Dr is not the same as Gessner St. They are miles apart. So for geocoding, getting it right is crucial.

I will only show a portion of the code here, as it is pretty boring stuff for the most part. Correcting a lot of trivial spelling errors. There are a few more general issues, which I will illustrate.

Sometimes apartment or suite numbers appear, many have issues like ending in “ST, STREET, ST.”, or no ending. In other cases there are errors like “HOGAN ST APT 2”, or really weird ones like “1000 6 1”, “”1102 1“, which are unintelligible, at least to me.

Convert all the freeways to the named designators, e.g., IH 10 becomes Katy Fwy. For Loop 610, conform to “quadrant LOOP”, e.g., N Loop Fwy.

#################################
# Street
#################################

#   copy Street to OldStreet to save it

##---new---##
df2$OldStreet <- df2$Street

keepdf2 <- df2 # just in case I screw up

#####  Remove extra blanks, periods, apartments, etc
dictionary <- c(
                "\\.",      " ",  # remove periods
                "-",       " ", # remove dashes
                "\\s{2,}",  " ", # remove extra spaces
                "^\\s+",    "",  # remove leading spaces
                "\\s+$",    "",  # remove trailing spaces
                "\n\\s+",   " ",  # remove carriage returns
                " APT\\s*#*\\d*$", "", # remove Apartment numbers
                " APT\\s*#*[A-Z0-9]*$", "", # remove Apartment numbers
                " APARTMENT\\s*#*[A-Z0-9]*$", "", # remove Apartment numbers
                " UNIT \\w*",  "", #  remove Unit numbers
                " NO\\s*\\d+", "", #  more Unit numbers 
                " STE\\s*#*[A-Z0-9]*$", "",  #  SUITE number
                " SUITE\\s*#*[A-Z0-9]*$", ""  #  SUITE number
)
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Street", dict)
## [1] "Pattern:  \\."
##  [1] "   Result:  W. PARK VIEW"        "   Result:  W. SAM HOUSTON"     
##  [3] "   Result:  S. PARK VIEW"        "   Result:  W. TRI OAKS LANE"   
##  [5] "   Result:  BRINWOOD DR."        "   Result:  CHAMPION VILLA DR." 
##  [7] "   Result:  ROCKCREST RD."       "   Result:  P.O.BOX 268994"     
##  [9] "   Result:  W.PARK VIEW DR"      "   Result:  H.C.S.O"            
## [11] "   Result:  SAGECROFT DR."       "   Result:  KENROSS ST."        
## [13] "   Result:  CUTTEN RD. APT. 904" "   Result:  EDDYSTONE DR."      
## [15] "   Result:  W. NATIONAL"         "   Result:  MILLER AVE."        
## [17] "   Result:  MISTY VALLEY DR."    "   Result:  P.O. BOX 610027"    
## [1] "Pattern:  -"
## [1] "   Result:  WESTVIEW C-3" "   Result:  B-1 FM1092"  
## [1] "Pattern:  \\s{2,}"
## [1] "   Result:  PARK HARBOR\n    ESTATES"         
## [2] "   Result:  SHERWOOD FOREST\n    GLEN"        
## [3] "   Result:  SPRINGWOOD\n    FOREST"           
## [4] "   Result:  1729  \"B\"\n  BRITTMOORE RD"     
## [5] "   Result:  SAM HOUSTON ENTR  /KATY\n  FWY OB"
## [1] "Pattern:  ^\\s+"
## [1] "No hits"
## [1] "Pattern:  \\s+$"
## [1] "No hits"
## [1] "Pattern:  \n\\s+"
## [1] "   Result:  PARK HARBOR\n    ESTATES"         
## [2] "   Result:  SHERWOOD FOREST\n    GLEN"        
## [3] "   Result:  SPRINGWOOD\n    FOREST"           
## [4] "   Result:  1729  \"B\"\n  BRITTMOORE RD"     
## [5] "   Result:  SAM HOUSTON ENTR  /KATY\n  FWY OB"
## [1] "Pattern:   APT\\s*#*\\d*$"
##  [1] "   Result:  WESTVIEW DR APT 11"    
##  [2] "   Result:  BARKER CYPRESS RD APT" 
##  [3] "   Result:  KATY FREEWAY APT 434"  
##  [4] "   Result:  CRESDALE DR APT"       
##  [5] "   Result:  WIRT RD APT 98"        
##  [6] "   Result:  GESSNER DR APT 170"    
##  [7] "   Result:  GOLD POINT DR APT 312" 
##  [8] "   Result:  CENTREPARK DR APT1633" 
##  [9] "   Result:  BISSONNET ST APT 1118" 
## [10] "   Result:  GRAND HARBOR DR APT"   
## [11] "   Result:  UPLAND DR APT 121"     
## [12] "   Result:  PAGEWOOD LN APT 424"   
## [13] "   Result:  SYCAMORE LANE APT 1116"
## [1] "Pattern:   APT\\s*#*[A-Z0-9]*$"
##  [1] "   Result:  WESTVIEW DR APT 11"    
##  [2] "   Result:  BARKER CYPRESS RD APT" 
##  [3] "   Result:  KATY FREEWAY APT 434"  
##  [4] "   Result:  CRESDALE DR APT"       
##  [5] "   Result:  WIRT RD APT 98"        
##  [6] "   Result:  GESSNER DR APT 170"    
##  [7] "   Result:  GOLD POINT DR APT 312" 
##  [8] "   Result:  CENTREPARK DR APT1633" 
##  [9] "   Result:  BISSONNET ST APT 1118" 
## [10] "   Result:  GRAND HARBOR DR APT"   
## [11] "   Result:  UPLAND DR APT 121"     
## [12] "   Result:  PAGEWOOD LN APT 424"   
## [13] "   Result:  SYCAMORE LANE APT 1116"
## [1] "Pattern:   APARTMENT\\s*#*[A-Z0-9]*$"
## [1] "No hits"
## [1] "Pattern:   UNIT \\w*"
## [1] "No hits"
## [1] "Pattern:   NO\\s*\\d+"
## [1] "No hits"
## [1] "Pattern:   STE\\s*#*[A-Z0-9]*$"
## [1] "No hits"
## [1] "Pattern:   SUITE\\s*#*[A-Z0-9]*$"
## [1] "   Result:  SAWDUST RD SUITE 201" "   Result:  KATY FWY SUITE"
#   Apply
df2 <- applyregex(df2, "Street", dict)


dictionary <- c(
                "^#\\s?[A-Z]? ",  "",  #  more address removal
                " #\\s?[A-Z]?$",   "",  #  more address removal
                " #\\s?[0-9]*$",   "",  #  more address removal
                " [IO]B$",        "",
                " OBIB",          "",
                " UTUR",          "",
                " [IO]B",         "",
                " STREET$",       " ST",
                " STREET",        "",
                "([A-Z ]){1}/([A-Z ]){1}",  "\\1 @ \\2", #  intersections
                "  ",             " ",   #  remove any double spaces
                " EXIT$",         "",
                "^\\d{3,6} ",     "",
                "#\\w+",          ""
)
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Street", dict)
## [1] "Pattern:  ^#\\s?[A-Z]? "
## [1] "No hits"
## [1] "Pattern:   #\\s?[A-Z]?$"
## [1] "No hits"
## [1] "Pattern:   #\\s?[0-9]*$"
## [1] "No hits"
## [1] "Pattern:   [IO]B$"
## [1] "   Result:  KATY FWY OB"                  
## [2] "   Result:  N ELDRIDGE PKWY/KATY FWY OB"  
## [3] "   Result:  N SH 6 SERVICE RD OB"         
## [4] "   Result:  SAM HOUSTON ENTR /KATY FWY OB"
## [1] "Pattern:   OBIB"
## [1] "   Result:  DAIRY ASHFORD OBIB UTUR"     
## [2] "   Result:  KATY FWY OB/WILCREST OBIB"   
## [3] "   Result:  KATY FWY OB/BUNKER HILL OBIB"
## [1] "Pattern:   UTUR"
## [1] "   Result:  DAIRY ASHFORD UTURN"    
## [2] "   Result:  DAIRY ASHFORD OBIB UTUR"
## [1] "Pattern:   [IO]B"
##  [1] "   Result:  DAIRY ASHFORD OBIB UTUR"       
##  [2] "   Result:  KATY FWY OB"                   
##  [3] "   Result:  KATY FWY OB/WILCREST OBIB"     
##  [4] "   Result:  N ELDRIDGE PKWY/KATY FWY OB"   
##  [5] "   Result:  N SH 6 SERVICE RD OB"          
##  [6] "   Result:  KATY FWY OB/SAM HOUSTON PKWY N"
##  [7] "   Result:  KATY FWY OB/WESTGREEN BLVD"    
##  [8] "   Result:  KATY FWY OB/BUNKER HILL OBIB"  
##  [9] "   Result:  SAM HOUSTON ENTR /KATY FWY OB" 
## [10] "   Result:  KATY FWY OB/BARKER CYPRESS RD" 
## [11] "   Result:  KATY FWY OB/WITTE RD"          
## [1] "Pattern:   STREET$"
## [1] "   Result:  EAST 21ST STREET"  "   Result:  NORTH 31ST STREET"
## [3] "   Result:  PERIWINKLE STREET"
## [1] "Pattern:   STREET"
## [1] "   Result:  EAST 21ST STREET"  "   Result:  NORTH 31ST STREET"
## [3] "   Result:  PERIWINKLE STREET"
## [1] "Pattern:  ([A-Z ]){1}/([A-Z ]){1}"
##  [1] "   Result:  KATY FWY OB/WILCREST OBIB"     
##  [2] "   Result:  N ELDRIDGE PKWY/KATY FWY OB"   
##  [3] "   Result:  KATY FWY OB/SAM HOUSTON PKWY N"
##  [4] "   Result:  KATY FWY OB/WESTGREEN BLVD"    
##  [5] "   Result:  GESSNER / WESTVIEW"            
##  [6] "   Result:  KATY FWY OB/BUNKER HILL OBIB"  
##  [7] "   Result:  PINE LAKE / BUNKER HILL"       
##  [8] "   Result:  SAM HOUSTON ENTR /KATY FWY OB" 
##  [9] "   Result:  CLAY RD /SAM HOUSTON PKWY N"   
## [10] "   Result:  KATY FWY OB/BARKER CYPRESS RD" 
## [11] "   Result:  KATY FWY OB/WITTE RD"          
## [1] "Pattern:    "
## [1] "No hits"
## [1] "Pattern:   EXIT$"
## [1] "   Result:  BUNKER HILL EXIT"
## [1] "Pattern:  ^\\d{3,6} "
## [1] "   Result:  1729 \"B\" BRITTMOORE RD"
## [1] "Pattern:  #\\w+"
## [1] "No hits"
#   Apply
df2 <- applyregex(df2, "Street", dict)

#   null out nonsense records, like PO Box
dictionary <- c(
                "P ?O ?BOX\\s?","",
                "^BOX\\s?\\d*$","",
                "POB\\s","",
                "PO BX","",
                "^PO\\s\\d*$","",
                "HOMELESS","",
                "\\sBOX\\s\\d*$",""
)
dict <- Makedict(dictionary)
#   view them first
testregex(df2, "Street", dict)
## [1] "Pattern:  P ?O ?BOX\\s?"
## [1] "   Result:  P O BOX 268994" "   Result:  PO BOX 14261"  
## [3] "   Result:  PO BOX 2007"    "   Result:  PO BOX 2357"   
## [5] "   Result:  P O BOX 610027" "   Result:  PO BOX 3095"   
## [1] "Pattern:  ^BOX\\s?\\d*$"
## [1] "No hits"
## [1] "Pattern:  POB\\s"
## [1] "No hits"
## [1] "Pattern:  PO BX"
## [1] "No hits"
## [1] "Pattern:  ^PO\\s\\d*$"
## [1] "   Result:  PO 645"
## [1] "Pattern:  HOMELESS"
## [1] "   Result:  HOMELESS"
## [1] "Pattern:  \\sBOX\\s\\d*$"
## [1] "   Result:  P O BOX 268994" "   Result:  PO BOX 14261"  
## [3] "   Result:  PO BOX 2007"    "   Result:  PO BOX 2357"   
## [5] "   Result:  P O BOX 610027" "   Result:  PO BOX 3095"
#   just NA those PO BOX addresses, but leave HOMELESS alone
df2$Street[grepl("P ?O ?BOX\\s?",df2$Street)] <- NA
df2$Street[grepl("^BOX\\s?\\d*$",df2$Street)] <- NA
df2$Street[grepl("POB\\s",df2$Street)] <- NA
df2$Street[grepl("PO BX",df2$Street)] <- NA
df2$Street[grepl("^PO\\s\\d*$",df2$Street)] <- NA
df2$Street[grepl("\\sBOX\\s\\d*$",df2$Street)] <- NA

#   clean out numbers after road designations, and shorten
dictionary <- c(
                "FARM TO MARKET", "FM",
                "NASA RD 1", "NASA ROAD 1",
                "ST \\d+$",     "ST",
                "DR \\d+$",     "DR",
                "RD \\d+$",     "RD",
                "AVE \\d+$",    "AVE",
                "BLVD \\d+$",   "BLVD",
                "AVENUE$",      "AVE",
                "ROAD$",        "RD",
                "DRIVE$",       "DR",
                "LANE$",        "LN",
                " CIR$",        " CIRCLE",
                " COURT$",      " CT",
                " PLACE$",   " PL",
                " PVT ",        " ",
                " UNIT$",       "",
                "([A-C,EFG,J-L,N-Q,T-W,Z]) \\d{2,6}$",   "\\1",
                "\\sSPDWY$",    " SPEEDWAY"
)
dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Street", dict)
## [1] "Pattern:  FARM TO MARKET"
## [1] "No hits"
## [1] "Pattern:  NASA RD 1"
## [1] "No hits"
## [1] "Pattern:  ST \\d+$"
## [1] "No hits"
## [1] "Pattern:  DR \\d+$"
## [1] "   Result:  SHADOWDALE DR 25" "   Result:  BINTLIFF DR 200" 
## [1] "Pattern:  RD \\d+$"
## [1] "   Result:  PITNER RD 2005" "   Result:  WEST RD 514"   
## [1] "Pattern:  AVE \\d+$"
## [1] "No hits"
## [1] "Pattern:  BLVD \\d+$"
## [1] "No hits"
## [1] "Pattern:  AVENUE$"
## [1] "   Result:  LAKEPOINTE AVENUE"
## [1] "Pattern:  ROAD$"
## [1] "   Result:  DAIRY ASHFORD ROAD" "   Result:  OLD KATY ROAD"     
## [3] "   Result:  IVYRIDGE ROAD"      "   Result:  CARLEEN ROAD"      
## [1] "Pattern:  DRIVE$"
## [1] "   Result:  YAUPON MIST DRIVE"     "   Result:  ROSEFIELD DRIVE"      
## [3] "   Result:  TRANQUIL COVE DRIVE"   "   Result:  TAPPENBECK DRIVE"     
## [5] "   Result:  CAPRI DRIVE"           "   Result:  BUSINESS CENTER DRIVE"
## [7] "   Result:  LONGRIDGE DRIVE"       "   Result:  CIMARRON DRIVE"       
## [1] "Pattern:  LANE$"
## [1] "   Result:  W TRI OAKS LANE"    "   Result:  MOUNTAIN ROSE LANE"
## [3] "   Result:  CHESHIRE LANE"      "   Result:  GUM GROVE LANE"    
## [5] "   Result:  EVENING SHADE LANE" "   Result:  KIMBERLY LANE"     
## [7] "   Result:  LEE SHORE LANE"     "   Result:  BROOKSHIRE LANE"   
## [9] "   Result:  SYCAMORE LANE"     
## [1] "Pattern:   CIR$"
## [1] "   Result:  CRESTDALE CIR"  "   Result:  LAKEVIEW CIR"  
## [3] "   Result:  JARDIN DES CIR" "   Result:  HARBOR KEY CIR"
## [5] "   Result:  DEER KEY CIR"  
## [1] "Pattern:   COURT$"
## [1] "   Result:  JEFFERS COURT" "   Result:  BETTINA COURT"
## [1] "Pattern:   PLACE$"
## [1] "   Result:  PARK TEN PLACE"    "   Result:  BEEKMAN PLACE"    
## [3] "   Result:  CLARBOROUGH PLACE"
## [1] "Pattern:   PVT "
## [1] "No hits"
## [1] "Pattern:   UNIT$"
## [1] "   Result:  PERTHSHIRE RD UNIT"
## [1] "Pattern:  ([A-C,EFG,J-L,N-Q,T-W,Z]) \\d{2,6}$"
## [1] "   Result:  CRESTDALE 1007"  "   Result:  SOUTHLAKE 4303" 
## [3] "   Result:  BUNKER HILL 100"
## [1] "Pattern:  \\sSPDWY$"
## [1] "No hits"
#   Apply
df2 <- applyregex(df2, "Street", dict)

#   Some common errors that reappear with regularity
dictionary <- c(
                " ENT$",           "",
                " SOUTH$",       "",
                " SUITE? \\w*$",   "",
                "BOULEVARD",       "BLVD",
                "BISSONET", "BISSONNET",
                "^BLK ", "",
                "^ ", "",
                " STRE$", "",
                "SPEEDWAY",        "SPDWY",
                " LANE \\d*$",  " LN",
                "FREEWAY$",      "FWY",
                "PARKWAY$",     "PKWY",
                " \\+ ",     " @ ",
                " AT ",          " @ ",
                " & ",          " @ ",
                "COUNTY ROAD",         "CR",
                " DR \\w*$",     " DR",
                " AVEN$",      " AVE",
                " STE [A-Z0-9 ]*$",   "",
                "\\d+O\\d+","",
                " ENTR?$",         ""
)

dict <- Makedict(dictionary)
#   test them first
testregex(df2, "Street", dict)
## [1] "Pattern:   ENT$"
## [1] "   Result:  BLALOCK ENT"
## [1] "Pattern:   SOUTH$"
## [1] "   Result:  HWY 6 SOUTH"
## [1] "Pattern:   SUITE? \\w*$"
## [1] "No hits"
## [1] "Pattern:  BOULEVARD"
## [1] "No hits"
## [1] "Pattern:  BISSONET"
## [1] "No hits"
## [1] "Pattern:  ^BLK "
## [1] "No hits"
## [1] "Pattern:  ^ "
## [1] "No hits"
## [1] "Pattern:   STRE$"
## [1] "No hits"
## [1] "Pattern:  SPEEDWAY"
## [1] "No hits"
## [1] "Pattern:   LANE \\d*$"
## [1] "No hits"
## [1] "Pattern:  FREEWAY$"
## [1] "   Result:  KATY FREEWAY"     "   Result:  NW FREEWAY"      
## [3] "   Result:  OLD KATY FREEWAY"
## [1] "Pattern:  PARKWAY$"
## [1] "   Result:  SAM HOUSTON PARKWAY"
## [1] "Pattern:   \\+ "
## [1] "No hits"
## [1] "Pattern:   AT "
## [1] "No hits"
## [1] "Pattern:   & "
## [1] "No hits"
## [1] "Pattern:  COUNTY ROAD"
## [1] "   Result:  COUNTY ROAD 4683"
## [1] "Pattern:   DR \\w*$"
## [1] "   Result:  PARKVIEW DR H24"
## [1] "Pattern:   AVEN$"
## [1] "No hits"
## [1] "Pattern:   STE [A-Z0-9 ]*$"
## [1] "No hits"
## [1] "Pattern:  \\d+O\\d+"
## [1] "No hits"
## [1] "Pattern:   ENTR?$"
## [1] "   Result:  BLALOCK ENT"
#   Apply
df2 <- applyregex(df2, "Street", dict)

#   begin going through dataset in blocks of 100 visually inspecting

tail(head(sort(unique(df2$Street)),100),100)
##   [1] "34TH"                "ABERCROMBIE"         "AGAR"               
##   [4] "AIRPORT"             "ALCOTT"              "ALSTEAD"            
##   [7] "ALTA VISTA"          "ANNISTON"            "ASHCLIFT"           
##  [10] "ASH CREEK"           "ASH CREEK DR"        "ASHFORD HOLLOW"     
##  [13] "AUTUMN GROVE"        "AVALON PARK"         "B 1 FM1092"         
##  [16] "BANDELIER"           "BANDELIER ST"        "BARKER CYPRESS"     
##  [19] "BARKER CYPRESS RD"   "BARKER SPRINGS"      "BARKER SPRINGS RD"  
##  [22] "BARRELL"             "BARWOOD"             "BARWOOD DR"         
##  [25] "BASSWOOD FOREST"     "\"B\" BRITTMOORE RD" "BEAR CREEK"         
##  [28] "BEAR CREEK MEADOWS"  "BEEKMAN PL"          "BEEKMAN PLACE DR"   
##  [31] "BELL GARDENS"        "BELL GARDENS DR"     "BERNADETTE"         
##  [34] "BETHKE"              "BETTINA CT"          "BINTLIFF DR"        
##  [37] "BISSONNET ST"        "BLACKHAW"            "BLACK HAW"          
##  [40] "BLALOCK"             "BLALOCK RD"          "BLANCO BEND"        
##  [43] "BLANKENSHIP"         "BLUE HERON"          "BRAYMOORE"          
##  [46] "BRENTLEYWOOD"        "BRENWOOD GLEN TRAIL" "BRIAR FOREST"       
##  [49] "BRIARWILD"           "BRIARWILD LN"        "BRICKYARD"          
##  [52] "BRIDGELAND"          "BRIDGELAND LN"       "BRINWOOD"           
##  [55] "BRINWOOD DR"         "BRITTMOORE"          "BRITTMOORE OAKS"    
##  [58] "BRITTMOORE RD"       "BRITTWAY"            "BROADFIELD"         
##  [61] "BROADFIELD BLVD"     "BROOKGLADE"          "BROOKSHIRE"         
##  [64] "BROOKSHIRE LN"       "BROOKTREE"           "BUCKMANN"           
##  [67] "BUESCHER"            "BULLOCK"             "BUNDY"              
##  [70] "BUNKER HILL"         "BUNKER HILL RD"      "BUSINESS CENTER"    
##  [73] "BUSINESS CENTER DR"  "BUTLERCREST"         "CAMPBELL"           
##  [76] "CAMPBELL RD"         "CANOGA"              "CANOGA LN"          
##  [79] "CAPITAL PARK"        "CAPRI DR"            "CARLEEN RD"         
##  [82] "CAROUSEL"            "CEDARDALE"           "CEDARDALE DR"       
##  [85] "CEDAR POST"          "CEDAR POST LN"       "CENTENNIAL"         
##  [88] "CENTRAL PARK WEST"   "CENTREPARK"          "CENTREPARK DR"      
##  [91] "CHAMPION VILLA DR"   "CHAPARRAL"           "CHAPARRAL DR"       
##  [94] "CHATTERTON"          "CHATTERTON DR"       "CHESAPEAKE BAY"     
##  [97] "CHESHIRE LN"         "CHICORY"             "CHIMNEY ROCK"       
## [100] "CHURCH"
#####    many lines of inspect, correct, and check removed

#   Put ST/RD/DR/CT/BLVD/LN/AVE/CIRCLE/WAY into Type field and remove from Street field
matchstring <- c(" ST$| RD$| DR$| LN$| WAY$| BLVD$| CIRCLE$| CT$| PL$| CIR$| AVE$")
maska <- grepl(matchstring, df2$Street) # does street name have suffix?
masktype <- grepl("-", df2$Type) # is type field blank?
matched_values <- str_extract(df2$Street[maska&masktype],matchstring)
#  put value into Type field
df2$Type[maska&masktype] <- matched_values  
#  remove value from Street field
df2$Street[maska&masktype] <- sub(matchstring, "",
                                  df2$Street[maska&masktype])  

#   clean up extra space from Type field
df2$Type <- str_replace(df2$Type,"^ *","")

#######   much more, especially around freeways and hiways

Save the results

Finally I save the resulting, mostly clean, file. The next big step is geocoding, and it will reveal many more issues that will need to be addressed. But that will be part 2.

#################################
# Save the results
#################################

saveRDS(df2, file=paste("~/Dropbox/CrimeStats/District_",district,"_CleanData.rds",sep=""))