On my recent spring break to Chicago (really more of a “break from spring”) I became curious about the city’s 77 different “Community Areas”. Here is a map of these areas.
On the City of Chicago’s data website I found a nice dataset (downloaded from here, codebook here) that contains information on 27 different public health and economic factors for each of the 77 areas. It has information on 21 health factors (fertility rates, cancer rates, lead poisoning, STD rates, etc.) as well as six economic factors: “Below.Poverty.Level”, “Crowded.Housing”, “Dependency”, “No.High.School.Diploma”, “Per.Capita.Income”, and “Unemployment”.
My question is: How does this public health data cluster the 77 Community Areas?
I’m interested in separating the primary health data (rates of cancer, fertility, STDs, etc.) from the secondary economic data. So I’ll form my clusters using only the 21 health data columns, and then look at how these clusters occupy economic space.
There are a small number of NAs, which I filled in with median values. Most values in the dataset are in the 0-100 range, but a few columns have much higher values (e.g. gonorrhea rates per 100,000). These columns were dominating my initial clustering analyses, so I decided to center and scale each column. (See RMarkdown file for code.)
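For readers who want to see the idea without opening the RMarkdown file, here is a minimal Python sketch of the preprocessing described above. It is not the original R code: the function name `impute_and_scale` and the numbers are made up for illustration, and I am assuming the median is taken per column and that scaling divides by the sample standard deviation.

```python
import statistics

def impute_and_scale(column):
    """Fill missing entries (None) with the column median, then center and scale."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in column]
    mean = statistics.mean(filled)
    sd = statistics.stdev(filled)  # sample standard deviation (divides by n - 1)
    return [(v - mean) / sd for v in filled]

# Made-up values, NOT taken from the Chicago dataset:
# one column on a "per 100,000" scale and one on a 0-100 scale.
rate_per_100k = [120.0, None, 980.0, 450.0, 300.0]
percent = [12.0, 30.0, None, 18.0, 24.0]

scaled = [impute_and_scale(col) for col in (rate_per_100k, percent)]
for col in scaled:
    print([round(v, 2) for v in col])
```

After this step both columns have mean 0 and standard deviation 1, so the column with the large raw values no longer dominates distance calculations.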
First I’ll do the clustering with k=3 centers, again just using the 21 health data columns. Here is a table of how many areas are assigned to each of the three clusters.
##
## 1 2 3
## 23 28 26
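The post does not name the clustering method, but “k=3 centers” suggests k-means. Assuming that, here is a small Python sketch of plain k-means (Lloyd's algorithm) followed by the same kind of cluster-size table. The data are synthetic 2-D points rather than the 21 health columns, and the names `kmeans` and `sq_dist` are just illustrative.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Synthetic data: 10 noisy points around each of three made-up locations.
data_rng = random.Random(1)
blobs = [(0, 0), (10, 0), (0, 10)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in blobs for _ in range(10)]

labels, centers = kmeans(points, k=3)
sizes = Counter(labels)
print(sorted(sizes.items()))  # cluster index -> number of points
```

K-means can settle into different solutions depending on the starting centers, so the cluster indices (and sometimes the sizes) may change from run to run unless the seed is fixed.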
Let’s look at the smallest cluster, with index 1. We can check the community areas this has clustered together.
Community area codes (see map link above):
## [1] 1 2 14 18 19 20 21 22 24 28 30 31 33 36 52 55 57 58 59 62 63 65 66
Community area names:
## [1] Rogers Park West Ridge Albany Park Montclaire
## [5] Belmont Cragin Hermosa Avondale Logan Square
## [9] West Town Near West Side South Lawndale Lower West Side
## [13] Near South Side Oakland East Side Hegewisch
## [17] Archer Heights Brighton Park McKinley Park West Elsdon
## [21] Gage Park West Lawn Chicago Lawn
## 77 Levels: Albany Park Archer Heights Armour Square ... Woodlawn
Here are the two other clusters:
Codes and names for the middle-sized cluster (index 3):
## [1] 3 4 5 6 7 8 9 10 11 12 13 15 16 17 32 34 39 41 56 60 64 70 72
## [24] 74 76 77
## [1] Uptown Lincoln Square North Center Lake View
## [5] Lincoln Park Near North Side Edison Park Norwood Park
## [9] Jefferson Park Forest Glen North Park Portage Park
## [13] Irving Park Dunning Loop Armour Square
## [17] Kenwood Hyde Park Garfield Ridge Bridgeport
## [21] Clearing Ashburn Beverly Mount Greenwood
## [25] O'Hare Edgewater
## 77 Levels: Albany Park Archer Heights Armour Square ... Woodlawn
Codes and names for the largest cluster (index 2):
## [1] 23 25 26 27 29 35 37 38 40 42 43 44 45 46 47 48 49 50 51 53 54 61 67
## [24] 68 69 71 73 75
## [1] Humboldt Park Austin West Garfield Park
## [4] East Garfield Park North Lawndale Douglas
## [7] Fuller Park Grand Boulevard Washington Park
## [10] Woodlawn South Shore Chatham
## [13] Avalon Park South Chicago Burnside
## [16] Calumet Heights Roseland Pullman
## [19] South Deering West Pullman Riverdale
## [22] New City West Englewood Englewood
## [25] Greater Grand Crossing Auburn Gresham Washington Heights
## [28] Morgan Park
## 77 Levels: Albany Park Archer Heights Armour Square ... Woodlawn
These clusters were formed using health statistics. How do they play out in terms of the economic statistics?
Two important economic factors are Per.Capita.Income and Unemployment. The next plot shows how the three clusters occupy this space. The clusters do seem to stay largely separated here, even though these two columns were not used to form them.
Two other demographic factors that may bear on public health are crowded housing and education. The next plot looks at how the three clusters spread in this space; here the separation is not so clear.
To try to figure out which value of k best captures the data, I’ll try a range of k values. Here are two plots. They’re hard for me to interpret, so maybe there’s no clear winner for the best value of k.
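One common way to compare values of k is to track the total within-cluster sum of squares and look for an “elbow” where it stops dropping quickly. I am not certain this is what the two plots show, so treat the following Python sketch as one possible version of the idea, again assuming k-means and using synthetic data. The names `kmeans_wss` and `wss_by_k` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wss(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm; returns the within-cluster sum of squares."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

def wss_by_k(points, ks, restarts=10, seed=0):
    """Best (lowest) within-cluster sum of squares over several random starts."""
    rng = random.Random(seed)
    return {k: min(kmeans_wss(points, k, rng) for _ in range(restarts))
            for k in ks}

# Synthetic data: three made-up groups of 10 points each.
data_rng = random.Random(1)
blobs = [(0, 0), (10, 0), (0, 10)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in blobs for _ in range(10)]

wss = wss_by_k(points, ks=range(1, 7))
for k, value in wss.items():
    print(k, round(value, 1))
```

On clean synthetic data like this the drop usually flattens out near the true number of groups. On real data the curve is often smooth, which would match the feeling that there is no clear winner.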
On the other hand, maybe topological data analysis can suggest a good k value. We have 77 points in the 21-dimensional space of health statistics. Let’s look at the barcode of 0-dimensional and 1-dimensional persistent homology for this dataset.
This seems to suggest that k=5 might be a useful clustering to look at. (The first long stretch of unchanging H0, around filtration value “time” 4, maintains 5 horizontal black bars.)
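The H0 part of such a barcode can be reproduced without a TDA library, which may help in reading it. In a Vietoris-Rips filtration every point starts as its own component at scale 0, and components merge exactly at the edge lengths of a minimum spanning tree. The Python sketch below computes those merge scales on toy data. It covers H0 only (not H1), the function names are hypothetical, and it takes the scale to be the pairwise distance, whereas some software reports half that value.

```python
import math
from itertools import combinations

def h0_death_times(points):
    """Scales at which connected components merge (Kruskal with union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge joins two components, so one H0 bar ends
            parent[ri] = rj
            deaths.append(length)
    return deaths  # n - 1 finite bars; one further bar never dies

def components_at(deaths, scale):
    """Number of H0 bars still alive at the given scale."""
    return 1 + sum(1 for d in deaths if d > scale)

# Toy data, not the Chicago dataset: three groups of points on a line.
toy = [(0, 0), (0.5, 0), (1, 0), (10, 0), (10.5, 0), (20, 0)]
deaths = h0_death_times(toy)
print(deaths)
for scale in (0.1, 2.0, 20.0):
    print(scale, components_at(deaths, scale))
```

A long range of scales over which `components_at` stays constant corresponds to a long stretch of unchanging H0 in the barcode, which is the feature used above to pick a candidate k.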
So we repeat the above analysis with k=5. Here is a table of how many areas are assigned to each of the five clusters. They are fairly spread out, ranging in size from 9 to 21 areas.
##
## 1 2 3 4 5
## 9 11 21 16 20
Let’s look at the smallest cluster, with index 1. We can check the community areas this has clustered together.
Community area codes (see map link above):
## [1] 25 26 29 37 47 50 67 68 69
Community area names:
## [1] Austin West Garfield Park North Lawndale
## [4] Fuller Park Burnside Pullman
## [7] West Englewood Englewood Greater Grand Crossing
## 77 Levels: Albany Park Archer Heights Armour Square ... Woodlawn
The next plots show how the five clusters spread out in economic space.