A data project

So I have a job that runs every hour and scrapes the HFD incident page so that I can capture all the HFD 911 calls. The job has been running since May 6, 2022. To be honest, I haven’t seen any obvious patterns, at least geographically. Part of the problem is the resolution. The data is located on the Keymap grid, which consists of 1x1 mile squares covering the whole area.
Geocoding Part 2

Let’s take the address for the Art Car Museum and use that as our example address. The first address is correct; each of the next 5 has a flaw in one of the fields.

# Address for the Art Car Museum
Test_data <- tribble(
  ~ID, ~Street_num, ~Prefix, ~Street_name, ~Street_type, ~Zipcode,
  "1", "140",       "",      "Heights",    "BLVD",       "77007",
  "2", "138",       "",      "Heights",    "BLVD",       "77007",
  "3", "140",       "W",     "Heights",    "BLVD",       "77007",
  "4", "140",       "",      "Hieghts",    "BLVD",       "77007",
  "5", "140",       "",      "Heights",    "LN",         "77007",
  "6", "140",       "",      "Heights",    "BLVD",       "77070"
)

Exact Matches

The expected way to run the code is to first find all exact matches, and then use the additional tools to try to repair any failures that occurred.
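As a minimal sketch of that exact-match-first idea, the test addresses can be compared against a reference table on every address field at once. The `Reference` table and the names `Test_subset`, `Matched` and `Failed` are hypothetical stand-ins for illustration, not the city's file or the functions used in this post; in practice the reference table would also carry the Lat-Long columns to attach.

```r
library(dplyr)
library(tibble)

# Hypothetical one-row reference table (stand-in for the city address file)
Reference <- tribble(
  ~Street_num, ~Prefix, ~Street_name, ~Street_type, ~Zipcode,
  "140",       "",      "Heights",    "BLVD",       "77007"
)

# Two of the test addresses: the correct one and the misspelled street name
Test_subset <- tribble(
  ~ID, ~Street_num, ~Prefix, ~Street_name, ~Street_type, ~Zipcode,
  "1", "140",       "",      "Heights",    "BLVD",       "77007",
  "4", "140",       "",      "Hieghts",    "BLVD",       "77007"
)

# Every address field must agree for an exact match
fields <- c("Street_num", "Prefix", "Street_name", "Street_type", "Zipcode")

# Rows that match the reference exactly
Matched <- semi_join(Test_subset, Reference, by = fields)

# Rows that failed and would go on to the repair tools
Failed <- anti_join(Test_subset, Reference, by = fields)
```

Splitting the input into `Matched` and `Failed` this way keeps the repair step focused on only the rows that actually need it.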
Geocoding

Attaching a Lat-Long to a street address is not an easy task. I have tried a variety of freely available geocoders, and have found all of them to be lacking for various reasons. See one of my earliest posts on this blog for more details.
Finally, I discovered that the city of Houston has made available a file from their GIS group that has most of the addresses and associated Lat-Longs for the city (a total of 1,480,215 records when I downloaded it).
Having bought solar panels myself a couple of years ago, and realizing that the city permit database could be used to find most installations, I decided that it would be interesting to look at the recent history and a few other facets of residential solar panel installations.
The first step is to download the structural permit data as a CSV file from the city open data website. This file is no longer available, so I now download the data from the new site and clean it up.
Harris County Appraisal District data

Let’s start exploring the data. We’ll look at all these exempt properties.

# This takes us from 1.4 million to 74,000 records
Dx <- df %>%
  filter(str_detect(state_class, "^X"))

Dx %>%
  ggplot(aes(x=state_class)) +
  geom_histogram(stat="count") +
  labs(x="Exempt code",
       y="Number of Properties",
       title="Number of properties in each exempt class")

# Same plot but for total square miles
Dx %>%
  group_by(state_class) %>%
  summarize(area=sum(land_ar, na.rm=TRUE)*3.58701e-8) %>%
  ggplot(aes(x=state_class)) +
  geom_col(aes(y=area)) +
  labs(x="Exempt code",
       y="Square Miles",
       title="Area of properties in each exempt class")

# Same plot but for total Market Value
Dx %>%
  group_by(state_class) %>%
  summarize(area=sum(tot_mkt_val, na.
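The factor `3.58701e-8` in the area calculation is consistent with converting square feet to square miles, assuming `land_ar` is recorded in square feet (the excerpt does not say so explicitly). A quick check of the arithmetic, with variable names chosen just for this illustration:

```r
# One mile is 5280 feet, so one square mile is 5280^2 square feet
sqft_per_sqmile <- 5280^2
sqmile_per_sqft <- 1 / sqft_per_sqmile

# Compare with the constant used in the plot code, to within rounding
factor_matches <- isTRUE(all.equal(sqmile_per_sqft, 3.58701e-8, tolerance = 1e-5))
factor_matches
```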
Let’s take a look at the early voting data for Harris County

Since I already have a bunch of data for Harris County precincts and zipcodes, why not make some use of it?

Setup

path <- "/home/ajackson/Dropbox/Rprojects/Voting/"

BBM <- read_csv(paste0(path, "Cumulative_BBM_1120.csv"),
                col_types = "ccccccccccccccccccccccccccccccccccccccccc")
BBM <- BBM %>%
  mutate(ActivityDate=mdy_hms(ActivityDate)) %>%
  mutate(ActivityDate=force_tz(ActivityDate, tzone = "US/Central")) %>%
  select(ElectionCode:ActivityDate) %>%
  mutate(Ballot_Type="Mail")

EV <- list.files(path=path, pattern="Cumulative_EV_1120_1*", full.names=TRUE) %>%
  map_df(~read_csv(., col_types = "ccccccccccccccccccccccccccccccccccccccccc"))
EV <- EV %>%
  mutate(ActivityDate=mdy_hms(ActivityDate)) %>%
  mutate(ActivityDate=force_tz(ActivityDate, tzone = "US/Central")) %>%
  select(ElectionCode:ActivityDate) %>%
  mutate(Ballot_Type="Early")

Votes <- rbind(BBM, EV)

VotesByZipDate <- Votes %>%
  mutate(Date=floor_date(ActivityDate, unit="day")) %>%
  group_by(Date, Ballot_Type, VoterZIP) %>%
  summarise(Votes=n()) %>%
  ungroup() %>%
  rename(Zip=VoterZIP) %>%
  drop_na()

########### registered voters
path <- paste0(path, "HarrisRegisteredVoters/")
files <- dir(path=path, pattern = "*.
Harris County COVID-19 data

I have a very nice (I hope) dataset consisting of the number of positive COVID-19 cases per day in Harris County by zipcode. In this blog entry I would like to study this dataset and compare it with various other data.
Initial look

First off, let’s explore the data for issues, and for ideas about what might be interesting.

# How is the data distributed? Let's look at the most recent day
Harris %>%
  group_by(Zip) %>%
  summarize(Cases_today=last(Cases)) %>%
  ggplot(aes(x=Cases_today)) +
  geom_histogram()

So over 20 zipcodes have no cases (but they may also have no people), and it looks like most zipcodes are in the 250-750 range.
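To put a number on the zero-case zipcodes rather than reading it off the histogram, the same per-zipcode summary can be filtered and counted. This is only a sketch on a tiny made-up stand-in for `Harris`: the `Harris_toy` table, its `Day` column, the zipcode labels and all values are assumptions for illustration, and it assumes `Cases` is a running total per zipcode.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for the Harris data (values are illustrative only)
Harris_toy <- tribble(
  ~Zip,    ~Day, ~Cases,
  "Zip_A", 1,    0,
  "Zip_A", 2,    0,
  "Zip_B", 1,    10,
  "Zip_B", 2,    12
)

Zero_zips <- Harris_toy %>%
  arrange(Zip, Day) %>%                    # so last() really is the most recent day
  group_by(Zip) %>%
  summarize(Cases_today = last(Cases)) %>%
  filter(Cases_today == 0)

# Number of zipcodes with no cases on the most recent day
nrow(Zero_zips)
```

Sorting explicitly before `last()` guards against rows arriving out of date order.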