Indian household clustering

The question

Is there a natural way to cluster Indian households, based on data about the nature of these households – do they have electricity, a TV, a computer, a cow, an ox-drawn cart? How do these clusters correspond to the 5-level Wealth Index?

The dataset

The National Family Health Survey is conducted in India every 10 years or so. The latest one was in 2015-2016 and the data is yet to be released. In preparation, I’m looking at the third NFHS, conducted in 2005-2006. I registered to get access to this data (which, unfortunately, I can’t upload to GitHub).

This week, I’m looking at how Indian households cluster according to their characteristics. The dataset includes 109,041 separate questionnaire responses, each with 3588 (!!!) variables entered. A codebook can be found in the file IAHR52FL.MAP (open it with a text editor). There’s a copy of the household questionnaire starting on p.636 of FRIND3-Vol1AndVol2.pdf. These files are uploaded to the GitHub repo corresponding to this report.

The variables

I will look specifically at the following variables.

Variable	Does the household have…?	Variable	[other characteristics]
HV206	electricity	HV025	Rural or urban?
HV207	radio	HV013	# in household?
HV208	television	HV014	# of children in HH?
HV209	refrigerator	HV216	# of rooms for sleeping?
HV210	bicycle	HV219	sex of HH head?
HV211	motorcycle/scooter	HV220	age of HH head?
HV212	car	SH30	anyone with tuberculosis?
HV221	non-mobile telephone	SH58	own this house?
HV227	bednets for sleeping	SH62A	have cows/bulls/buffalo?
HV243A	mobile telephone	SH62B	have camels?
HV243B	watch	SH62C	have horses/donkeys/mules?
HV243C	animal-drawn cart	SH62D	have goats?
HV246	livestock	SH62E	have sheep?
SH47B	mattress	SH62F	have chickens?
SH47C	pressure cooker	SHSTRUC	nuclear or non-nuclear family?
SH47D	chair
SH47E	cot/bed
SH47F	table
SH47G	electric fan
SH47I	B&W television
SH47J	color television
SH47K	sewing machine
SH47N	computer
SH47U	water pump
SH47V	thresher
SH47W	tractor
SH56A	any windows
SH56B	windows with glass
SH56C	windows with screens
SH56D	windows with curtains/shutters

TDA to detect clusters

As a first step, we can use topological data analysis (TDA) to see if there is a natural number of clusters in the data. To keep the computation manageable, I’ll sample 1000 of the 109,041 cases and remove 7 with missing data, leaving 993. Running the resulting 993 x 44 dataset with maxscale=10 takes my laptop about 7.5 minutes; with maxscale=5 it took only about 2 minutes. Below we have the persistent homology barcode and the persistence diagram.

(See this report or this report for some details on using TDA.)
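To make the H0 part of the computation concrete, here is a minimal Python sketch; it is not the code behind the output in this report. It relies on the fact that, in a Vietoris-Rips filtration, the connected components merge exactly as in single-linkage clustering, so the H0 death scales are the edge lengths of a minimum spanning tree. The function name `h0_death_times` and the toy points are made up for illustration, and the scale convention (an edge appears when the scale reaches the distance between two points) is an assumption; some libraries use half that distance.

```python
import itertools
import math


def h0_death_times(points):
    """Return the finite H0 death scales of a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0. Processing pairs in
    order of increasing distance, one component dies each time an edge joins
    two previously separate components (Kruskal's algorithm with union-find).
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
    )
    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(dist)
    # n points give n - 1 finite deaths; one component never dies.
    return sorted(deaths, reverse=True)


# Toy data (hypothetical): two tight groups that are far apart.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
deaths = h0_death_times(toy)
print(deaths)
```

In the toy example one death is much larger than the rest, which is the signature of two well-separated groups. This sketch only covers dimension 0; the loops and voids in higher dimensions need a full persistent homology library.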

Because there are 993 black barcode lines, it’s hard to see how many larger clusters we are looking at. For this we can look explicitly at the birth and death of the H0 intervals.

##       dimension Birth     Death
##  [1,]         0     0 10.000000
##  [2,]         0     0  7.724024
##  [3,]         0     0  5.306447
##  [4,]         0     0  5.036054
##  [5,]         0     0  4.868821
##  [6,]         0     0  4.248221
##  [7,]         0     0  3.724490
##  [8,]         0     0  3.722570
##  [9,]         0     0  3.719849
## [10,]         0     0  3.582060
## [11,]         0     0  3.527882
## [12,]         0     0  3.520842
## [13,]         0     0  3.514229
## [14,]         0     0  3.414533
## [15,]         0     0  3.383944

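The first interval’s death value of 10 is just maxscale: that component never dies. We see that before everything gets clumped together into one connected component (at scale 7.72), there is a long stretch with two clusters, starting at about 5.31. There is also a persistent stretch with k=5 clusters, from about 4.25 to 4.87. So, miraculously or by design, it was a good idea to use a Wealth Index with values 1-5.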
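The reading of the death values can be checked with a few lines of Python. This is a sketch using only the 15 values printed above; the helper name `component_ranges` is made up for illustration. Exactly k components exist between the (k+1)-th and the k-th largest death value, so the length of that range measures how persistent "k clusters" is. The range for k=1 is skipped because its upper end is the maxscale cap rather than a real death.

```python
# Death values of the 15 longest H0 intervals, copied from the output above.
deaths = [10.000000, 7.724024, 5.306447, 5.036054, 4.868821, 4.248221,
          3.724490, 3.722570, 3.719849, 3.582060, 3.527882, 3.520842,
          3.514229, 3.414533, 3.383944]


def component_ranges(deaths):
    """Map k to the (low, high) scale range over which exactly k components exist."""
    ordered = sorted(deaths, reverse=True)
    # Exactly k components exist between the (k+1)-th and k-th largest death.
    return {k: (ordered[k], ordered[k - 1]) for k in range(1, len(ordered))}


ranges = component_ranges(deaths)
for k in range(2, 7):
    low, high = ranges[k]
    print(f"k={k}: scale {low:.3f} to {high:.3f} (length {high - low:.3f})")
```

On these numbers k=2 has by far the longest range (about 2.42) and k=5 the next longest (about 0.62), with k=6 not far behind (about 0.52).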

What are the two clusters?

Before we cluster with k=5, let’s see what we can say about the situation with k=2, which was suggested by TDA. It seems the two clusters have similar sizes.

## 
##   1   2 
## 519 474
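The report does not say which algorithm produced these two clusters. As one plausible way to get such a split, here is a minimal Python sketch of k-means (Lloyd's algorithm). Everything in it is an assumption for illustration: the function names, the deterministic farthest-first initialization, and the toy rows, whose columns stand for hypothetical 0/1 answers (electricity, television, refrigerator, livestock).

```python
import math
from collections import Counter


def farthest_first(points, k):
    """Deterministic start: the first point, then repeatedly the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; returns one cluster label (0..k-1) per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical households: (electricity, television, refrigerator, livestock).
toy = [(0, 0, 0, 1), (1, 0, 0, 1), (0, 0, 0, 0),
       (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 1)]
labels = kmeans(toy, k=2)
print(labels)
print(Counter(labels))  # cluster sizes, analogous to the table above
```

Standard k-means implementations usually start from random centers and rerun several times; the fixed start here just keeps the example reproducible.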

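We can tabulate the mean of each variable, grouped by cluster, to look for any large differences. Most of the variables sit on a 0-to-1 scale, so a mean reads as the share of households in that cluster. The last four (HV013, HV014, HV216, HV220) have means near zero with opposite signs, so they appear to have been standardized.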

##   cluster2 HV025 HV206 HV207 HV208  HV209 HV210  HV211 HV212 HV219  HV221
## 1        1 0.254 0.632 0.233 0.210 0.0212 0.407 0.0405 0.000 0.832 0.0173
## 2        2 0.700 0.983 0.496 0.932 0.4599 0.489 0.3734 0.103 0.884 0.3481
##   HV227 HV243A HV243B HV243C HV246   SH30 SH47B SH47C SH47D SH47E SH47F
## 1 0.349 0.0443  0.723 0.0539 0.565 0.0154 0.482 0.214 0.426 0.755 0.308
## 2 0.426 0.4958  0.968 0.0295 0.259 0.0148 0.932 0.899 0.947 0.943 0.903
##   SH47G SH47I  SH47J  SH47K   SH47N  SH47U  SH47V   SH47W SH56A  SH56B
## 1 0.339 0.170 0.0405 0.0944 0.00193 0.0328 0.0116 0.00963 0.599 0.0539
## 2 0.861 0.245 0.7342 0.4705 0.08650 0.1540 0.0148 0.01688 0.964 0.4747
##    SH56C SH56D  SH58 SH62A   SH62B   SH62C  SH62D   SH62E SH62F SHSTRUC
## 1 0.0751 0.187 0.844 0.395 0.00578 0.00771 0.1657 0.03468 0.264   0.699
## 2 0.4030 0.605 0.827 0.169 0.00000 0.00211 0.0506 0.00422 0.120   0.565
##    HV013  HV014  HV216  HV220
## 1 -0.127  0.118 -0.287 -0.200
## 2  0.140 -0.137  0.312  0.222
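A table like the one above is a group-by-cluster average. The following Python sketch shows the computation on made-up data; the function name `cluster_means`, the toy rows, and the toy labels are hypothetical, and only the three variable codes are taken from the variable list.

```python
from collections import defaultdict


def cluster_means(rows, labels, names):
    """Return {cluster: {variable: mean}} for rows grouped by cluster label."""
    sums = defaultdict(lambda: [0.0] * len(names))
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        counts[label] += 1
        for i, value in enumerate(row):
            sums[label][i] += value
    return {
        label: {name: total / counts[label] for name, total in zip(names, sums[label])}
        for label in sorted(counts)
    }


# Hypothetical 0/1 answers for four households.
names = ["HV206", "HV208", "HV209"]  # electricity, television, refrigerator
rows = [(1, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
labels = [1, 1, 2, 2]

means = cluster_means(rows, labels, names)
for cluster, by_variable in means.items():
    print(cluster, by_variable)
```

For a 0/1 variable the mean is simply the fraction of households in the cluster that answered yes, which is what makes the large gaps between the two rows easy to read.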

If we plot histograms of a few of these variables, the picture emerges that cluster 1 tends to be rural households with few amenities: only about 2% have a refrigerator and about 4% a mobile phone. Cluster 2 tends to be urban households, almost all electrified (98%, against 63% in cluster 1), and much more likely to have household amenities.

Comparing clusters by Wealth Index

Now let’s return to the full dataset of 109,041 households, and use the same 44 variables to cluster with k=5.

The supplemental file NFHS3SUP.pdf, p.6, explains how the “Wealth Index” variable HV270 was constructed using PCA to generate weights for these variables. The index is a cumulative score, normalized to a ranking from 1 (poorest) to 5 (wealthiest). How do the five clusters compare with the Wealth Index?

##    
##     Middle Poorer Poorest Richer Richest
##   1   2604   1020     293   3829    3689
##   2   4079   5386    5731   1877     211
##   3   7671   9037    8453   2040      30
##   4   6078    968      45  13512    6715
##   5    304      1       0   4000   20389
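The table above is a cross-tabulation of two label vectors: the cluster assigned to each household and its Wealth Index level. A minimal Python sketch of that counting step follows. The function names and the six toy households are hypothetical, and the columns are printed from poorest to richest rather than alphabetically as in the output above.

```python
from collections import Counter

LEVELS = ["Poorest", "Poorer", "Middle", "Richer", "Richest"]


def crosstab(clusters, wealth):
    """Count households for each (cluster, wealth level) pair."""
    return Counter(zip(clusters, wealth))


def format_table(counts, levels=LEVELS):
    """Render the counts as text, one row per cluster, columns poorest to richest."""
    clusters = sorted({cluster for cluster, _ in counts})
    lines = ["cluster " + " ".join(f"{name:>8}" for name in levels)]
    for cluster in clusters:
        cells = " ".join(f"{counts[(cluster, name)]:>8}" for name in levels)
        lines.append(f"{cluster:>7} {cells}")
    return "\n".join(lines)


# Hypothetical labels for six households, for illustration only.
toy_clusters = [1, 1, 2, 2, 2, 3]
toy_wealth = ["Poorest", "Poorer", "Middle", "Middle", "Richer", "Richest"]
toy_counts = crosstab(toy_clusters, toy_wealth)
print(format_table(toy_counts))
```

Cluster numbers from a clustering algorithm are arbitrary labels, so there is no reason to expect cluster 1 to line up with the poorest level; only the pattern of counts matters.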

Cluster 5 captures about two-thirds of the Richest households (20,389 of 31,034) and consists almost entirely of Richest and Richer households. Cluster 4 captures about half of the Richer (13,512 of 25,258), with some bleeding into Richest and Middle. Cluster 3 is roughly evenly spread between Middle, Poorer, and Poorest. Clusters 1 and 2 do not bin well with the Wealth Index, although cluster 2 has few Richest households and cluster 1 has few Poorest ones.
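Statements like these are easier to check as proportions. The Python sketch below uses the counts from the cross-tabulation above, reordered from poorest to richest; the function names are made up for illustration. Row shares say how a cluster is composed, and column shares say where the households of a wealth level end up.

```python
LEVELS = ["Poorest", "Poorer", "Middle", "Richer", "Richest"]

# Counts copied from the cross-tabulation above, columns in LEVELS order.
TABLE = {
    1: [293, 1020, 2604, 3829, 3689],
    2: [5731, 5386, 4079, 1877, 211],
    3: [8453, 9037, 7671, 2040, 30],
    4: [45, 968, 6078, 13512, 6715],
    5: [0, 1, 304, 4000, 20389],
}


def row_shares(table):
    """Share of each cluster's households that fall in each wealth level."""
    return {c: [n / sum(row) for n in row] for c, row in table.items()}


def column_shares(table):
    """Share of each wealth level's households that land in each cluster."""
    totals = [sum(col) for col in zip(*table.values())]
    return {c: [n / t for n, t in zip(row, totals)] for c, row in table.items()}


rows = row_shares(TABLE)
cols = column_shares(TABLE)
for cluster in TABLE:
    composition = ", ".join(
        f"{name} {share:.0%}" for name, share in zip(LEVELS, rows[cluster])
    )
    print(f"cluster {cluster}: {composition}")

richest = LEVELS.index("Richest")
print(f"Share of Richest households in cluster 5: {cols[5][richest]:.1%}")
```

The two views answer different questions: cluster 5 is mostly Richest by composition, while a noticeable part of the Richest level still sits in clusters 4 and 1.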