On my recent spring break to Chicago (really more of a “break from spring”) I became curious about the city’s 77 different “Community Areas”. Here is a map of these areas.

On the City of Chicago’s data website I found a nice dataset (downloaded from here, codebook here) that contains information on 27 different public health and economic factors for each of the 77 areas. It has information on 21 health factors (fertility rates, cancer rates, lead poisoning, STD rates, etc.), as well as six economic factors: “Below.Poverty.Level”, “Crowded.Housing”, “Dependency”, “No.High.School.Diploma”, “Per.Capita.Income”, and “Unemployment”.

My question is: How does this public health data cluster the 77 Community Areas?

I’m interested in separating the primary health data (rates of cancer, fertility, STDs, etc.) from the secondary economic data. So I’ll form my clusters using only the 21 health data columns, and then look at how these clusters occupy economic space.

Cleaning and Exploratory Data Analysis

There are a small number of NAs, and I used median values to fill these in. Most values in the dataset are in the 0-100 range, but there are a few columns with high values (e.g. gonorrhea rates per 100,000). These factors were dominating my initial clustering analyses, so I decided to center and scale each column. (See RMarkdown file for code.)
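The cleaning step can be sketched in a few lines. The post's actual code is in the RMarkdown file; this is a Python illustration only, assuming that "median values" means each column's own median and that "center and scale" means subtracting the column mean and dividing by the sample standard deviation. The function name `impute_and_scale` and the toy numbers are made up for the example.

```python
import statistics

def impute_and_scale(rows):
    """Fill missing values (None) with the column median, then
    center each column to mean 0 and scale it to standard deviation 1."""
    n_cols = len(rows[0])
    columns = []
    for j in range(n_cols):
        col = [row[j] for row in rows]
        observed = [v for v in col if v is not None]
        med = statistics.median(observed)
        filled = [med if v is None else v for v in col]
        mean = statistics.mean(filled)
        # Sample standard deviation (divides by n - 1); assumes the
        # column is not constant, otherwise this would divide by zero.
        sd = statistics.stdev(filled)
        columns.append([(v - mean) / sd for v in filled])
    # Transpose the list of columns back into a list of rows.
    return [list(r) for r in zip(*columns)]

# Toy data (not from the Chicago dataset): three areas, two columns
# on very different scales, with one missing value.
toy = [
    [10.0, 1200.0],
    [None, 300.0],
    [30.0, 900.0],
]
scaled = impute_and_scale(toy)
```

After this step every column has mean 0 and unit spread, so a column measured per 100,000 no longer outweighs a column measured in percent when distances are computed.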

Clustering with k=3

First I’ll do the clustering with k=3 centers, again just using the 21 health data columns. Here is a table of how many areas are assigned to each of the three clusters.

## 
##  1  2  3 
## 23 28 26
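For readers who want to see what clustering with k centers involves, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not the code behind the table above: the post does not show its clustering call, and I am only assuming from the phrase "k=3 centers" that k-means was used. To keep the example reproducible it uses a deterministic farthest-first initialization instead of the random starts that k-means implementations typically use, and the data are three made-up blobs rather than the 77 community areas.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic initialization: start from the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return [list(c) for c in centers]

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest center and moving each center to the mean of its points."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy data: three well-separated blobs of five points each.
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (0.25, 0.25)]
origins = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(ox + dx, oy + dy) for ox, oy in origins for dx, dy in offsets]

labels, centers = kmeans(points, k=3)
sizes = Counter(labels)  # cluster sizes, the analogue of the table above
```

Note that the cluster indices are arbitrary labels. A different starting configuration can give the same grouping with the numbers permuted.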

Let’s look at the smallest cluster, with index 1. We can check the community areas this has clustered together.

Community area codes (see map link above):

##  [1]  1  2 14 18 19 20 21 22 24 28 30 31 33 36 52 55 57 58 59 62 63 65 66

Community area names:

##  [1] Rogers Park     West Ridge      Albany Park     Montclaire     
##  [5] Belmont Cragin  Hermosa         Avondale        Logan Square   
##  [9] West Town       Near West Side  South Lawndale  Lower West Side
## [13] Near South Side Oakland         East Side       Hegewisch      
## [17] Archer Heights  Brighton Park   McKinley Park   West Elsdon    
## [21] Gage Park       West Lawn       Chicago Lawn   
## 77 Levels: Albany Park Archer Heights Armour Square ... Woodlawn

Here are the two other clusters:

Codes and names for middle cluster (index 3):

##  [1]  3  4  5  6  7  8  9 10 11 12 13 15 16 17 32 34 39 41 56 60 64 70 72
## [24] 74 76 77
##  [1] Uptown          Lincoln Square  North Center    Lake View      
##  [5] Lincoln Park    Near North Side Edison Park     Norwood Park   
##  [9] Jefferson Park  Forest Glen     North Park      Portage Park   
## [13] Irving Park     Dunning         Loop            Armour Square  
## [17] Kenwood         Hyde Park       Garfield Ridge  Bridgeport     
## [21] Clearing        Ashburn         Beverly         Mount Greenwood
## [25] O'Hare          Edgewater      
## 77 Levels: Albany Park Archer Heights Armour Square ... Woodlawn

Codes and names for largest cluster (index 2):

##  [1] 23 25 26 27 29 35 37 38 40 42 43 44 45 46 47 48 49 50 51 53 54 61 67
## [24] 68 69 71 73 75
##  [1] Humboldt Park          Austin                 West Garfield Park    
##  [4] East Garfield Park     North Lawndale         Douglas               
##  [7] Fuller Park            Grand Boulevard        Washington Park       
## [10] Woodlawn               South Shore            Chatham               
## [13] Avalon Park            South Chicago          Burnside              
## [16] Calumet Heights        Roseland               Pullman               
## [19] South Deering          West Pullman           Riverdale             
## [22] New City               West Englewood         Englewood             
## [25] Greater Grand Crossing Auburn Gresham         Washington Heights    
## [28] Morgan Park           
## 77 Levels: Albany Park Archer Heights Armour Square ... Woodlawn

Health clusters and economics

These clusters were formed using health statistics. How do they play out in terms of the economic statistics?

Two important economic factors are Per.Capita.Income and Unemployment. The next plot shows how the three clusters occupy this space. Even though they were built from health data alone, the clusters do seem to stay separated here.

Two other demographic factors for public health might be crowded housing and education. The next plot looks at how the three clusters spread in this space; the clustering is not so clear.

Which k value is best?

To figure out which k value best captures the data, I’ll try a range of k values. Here are two plots. I find them hard to interpret, so there may be no clear winner for the best value of k.
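One common diagnostic when scanning over k is the total within-cluster sum of squares, which is then inspected for an "elbow". I don't claim this is what the two plots above show; it is simply one standard way to compare k values, sketched here in plain Python on made-up data. The function name `kmeans_wss` is mine, and the k-means inside it uses a deterministic farthest-first start so that the numbers are reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wss(points, k, max_iter=100):
    """Run Lloyd's k-means from a farthest-first start and return the
    total within-cluster sum of squared distances to the centers."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Toy data: three well-separated blobs of five points each.
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (0.25, 0.25)]
origins = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(ox + dx, oy + dy) for ox, oy in origins for dx, dy in offsets]

# Within-cluster sum of squares for k = 1, ..., 5.
wss = {k: kmeans_wss(points, k) for k in range(1, 6)}
for k in sorted(wss):
    print(k, round(wss[k], 3))
```

On the toy blobs the value drops sharply up to k=3 and barely moves afterwards, which is the elbow pattern. Real data, like the 77 areas here, often shows a much smoother curve, which is one reason such plots can be hard to read.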

Clustering with topological data analysis

On the other hand, maybe topological data analysis can suggest a good k value. We have 77 points in the 21-dimensional space of health statistics. Let’s look at the barcode of 0-dimensional and 1-dimensional persistent homology for this dataset.

This seems to suggest that k=5 might be a useful clustering to look at. (The first long stretch of unchanging H0, around filtration value “time” 4, maintains 5 horizontal black bars.)
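The H0 part of the barcode can be computed without special software, which may help in reading it. Every point starts as its own bar at filtration value 0, and a bar ends at the distance where its connected component merges into another one, so the finite bar lengths are exactly the merge distances of single-linkage clustering. The sketch below is a Python illustration with a union-find structure, not the code used for the barcode above. It assumes the filtration value is the pairwise distance threshold (some packages use half that distance instead), and the six points are made up.

```python
import math
from itertools import combinations

def h0_death_times(points):
    """Return the filtration values at which H0 bars end, ascending.
    There are n - 1 of them; one bar (the final component) never ends."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the representative of i's component, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, processed from shortest to longest.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # two components merge, so one bar ends
            parent[ri] = rj
            deaths.append(d)
    return deaths

def components_at(deaths, t):
    """Number of H0 bars still alive at filtration value t."""
    return 1 + sum(1 for d in deaths if d > t)

# Toy data: six points on a line forming groups {0, 1, 2}, {10, 11}, {30}.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (30.0,)]
deaths = h0_death_times(points)
n_at_4 = components_at(deaths, 4.0)
```

A long stretch of filtration values over which `components_at` stays constant corresponds to the "long stretch of unchanging H0" read off the barcode, and the constant count is the suggested number of clusters.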

Revisiting with k=5

So we repeat the above analysis with k=5. Here is a table of how many areas are assigned to each of the five clusters. They are pretty evenly spread out.

## 
##  1  2  3  4  5 
##  9 11 21 16 20

Let’s look at the smallest cluster, with index 1. We can check the community areas this has clustered together.

Community area codes (see map link above):

## [1] 25 26 29 37 47 50 67 68 69

Community area names:

## [1] Austin                 West Garfield Park     North Lawndale        
## [4] Fuller Park            Burnside               Pullman               
## [7] West Englewood         Englewood              Greater Grand Crossing
## 77 Levels: Albany Park Archer Heights Armour Square ... Woodlawn

The next plots show how the five clusters spread out in the economic spaces.