The question

Is there a natural way to cluster Indian households, based on data about the nature of these households – do they have electricity, a TV, a computer, a cow, an ox-drawn cart? How do these clusters correspond to the 5-level Wealth Index?

The dataset

The National Family Health Survey (NFHS) is conducted in India every 10 years or so. The latest one was in 2015-2016 and the data is yet to be released. In preparation, I’m looking at the third NFHS (NFHS-3), conducted in 2005-2006. I registered to get access to this data, which unfortunately I can’t upload to GitHub.

This week, I’m looking at how Indian households cluster according to their characteristics. The dataset includes 109,041 separate questionnaire responses, each with 3,588 (!!!) variables entered. A codebook can be found in the file IAHR52FL.MAP (open it with a text editor). There’s a copy of the household questionnaire starting on p.636 of FRIND3-Vol1AndVol2.pdf. The codebook and the questionnaire are uploaded to the GitHub repo corresponding to this report.

The variables

I will look specifically at the following variables.

| Variable | Does the household have…? | Variable | [other characteristics] |
|---|---|---|---|
| HV206 | electricity | HV025 | Rural or urban? |
| HV207 | radio | HV013 | # in household? |
| HV208 | television | HV014 | # of children in HH? |
| HV209 | refrigerator | HV216 | # of rooms for sleeping? |
| HV210 | bicycle | HV219 | sex of HH head? |
| HV211 | motorcycle/scooter | HV220 | age of HH head? |
| HV212 | car | SH30 | anyone with tuberculosis? |
| HV221 | non-mobile telephone | SH58 | own this house? |
| HV227 | bednets for sleeping | SH62A | have cows/bulls/buffalo? |
| HV243A | mobile telephone | SH62B | have camels? |
| HV243B | watch | SH62C | have horses/donkeys/mules? |
| HV243C | animal-drawn cart | SH62D | have goats? |
| HV246 | livestock | SH62E | have sheep? |
| SH47B | mattress | SH62F | have chickens? |
| SH47C | pressure cooker | SHSTRUC | nuclear or non-nuclear family? |
| SH47D | chair | | |
| SH47E | cot/bed | | |
| SH47F | table | | |
| SH47G | electric fan | | |
| SH47I | B&W television | | |
| SH47J | color television | | |
| SH47K | sewing machine | | |
| SH47N | computer | | |
| SH47U | water pump | | |
| SH47V | thresher | | |
| SH47W | tractor | | |
| SH56A | any windows | | |
| SH56B | windows with glass | | |
| SH56C | windows with screens | | |
| SH56D | windows with curtains/shutters | | |

TDA to detect clusters

As a first step, we can use topological data analysis (TDA) to see if there is a natural number of clusters in the data. To keep the computation manageable, I’ll sample 1000 of the 109,041 cases and remove the 7 with missing data. Running the resulting 993 x 44 dataset with maxscale=10 takes my laptop about 7.5 minutes; with maxscale=5 it took only about 2 minutes. Below we have the persistent homology barcode and the persistence diagram.
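The sampling step can be sketched as follows, in plain Python (not the code that produced this report's output). The function name `sample_complete_cases`, the seed, the toy data, and the use of `None` to mark a missing answer are all assumptions made for illustration.

```python
import random

def sample_complete_cases(rows, n, seed=0):
    """Draw n rows at random, then drop any sampled row with a missing value."""
    rng = random.Random(seed)        # fixed seed so the sample is reproducible
    sampled = rng.sample(rows, n)    # sampling without replacement
    return [r for r in sampled if all(v is not None for v in r)]

# Toy stand-in for the survey: every 10th household has one missing answer.
households = [(i % 2, None if i % 10 == 0 else 1) for i in range(100)]
subset = sample_complete_cases(households, 20)
print(len(subset), "complete cases out of 20 sampled")
```

Dropping incomplete rows after sampling, rather than before, matches the order described above: 1000 cases drawn, 7 removed, 993 kept.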

(See this report or this report for some details on using TDA.)

Because there are 993 black barcode lines, it’s hard to see how many larger clusters we are looking at. For this we can look explicitly at the birth and death values of the longest-lived H0 intervals.

##       dimension Birth     Death
##  [1,]         0     0 10.000000
##  [2,]         0     0  7.724024
##  [3,]         0     0  5.306447
##  [4,]         0     0  5.036054
##  [5,]         0     0  4.868821
##  [6,]         0     0  4.248221
##  [7,]         0     0  3.724490
##  [8,]         0     0  3.722570
##  [9,]         0     0  3.719849
## [10,]         0     0  3.582060
## [11,]         0     0  3.527882
## [12,]         0     0  3.520842
## [13,]         0     0  3.514229
## [14,]         0     0  3.414533
## [15,]         0     0  3.383944

We see that before everything gets clumped together into one connected component (the first interval, whose death is capped at maxscale=10), there is a long stretch with two clusters, from a scale of about 5.3 to about 7.7. There is also a shorter stretch with k=5 clusters, from about 4.2 to about 4.9. So, miraculously or by design, it was a good idea to use a Wealth Index with values 1-5.
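To make the reading of the H0 intervals concrete, here is a minimal Python sketch (standard library only, not the code behind the output above). It assumes a Rips filtration in which an edge enters at its Euclidean length, so that every H0 interval is born at 0 and dies at the length of a minimum-spanning-tree edge. The names `h0_deaths` and `clusters_at` are hypothetical helpers, and the toy points are made up.

```python
import math
from itertools import combinations

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def h0_deaths(points, maxscale):
    """Death scales of the H0 intervals (all births are 0), longest first."""
    n = len(points)
    parent = list(range(n))

    def find(i):                      # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((euclid(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        if d > maxscale:              # the filtration stops at maxscale
            break
        ri, rj = find(i), find(j)
        if ri != rj:                  # two components merge: one interval dies
            parent[ri] = rj
            deaths.append(d)
    # Components still alive at maxscale get their death capped there.
    deaths.extend([maxscale] * (n - len(deaths)))
    return sorted(deaths, reverse=True)

def clusters_at(deaths, scale):
    """Number of connected components still alive at the given scale."""
    return sum(1 for d in deaths if d > scale)

# Toy example: two tight pairs and one far-away point.
toy = [(0, 0), (0, 1), (5, 0), (5, 1), (20, 0)]
print(h0_deaths(toy, maxscale=10))

# The seven longest intervals reported above (valid for scales above 3.72).
report_deaths = [10.0, 7.724024, 5.306447, 5.036054, 4.868821, 4.248221, 3.724490]
print(clusters_at(report_deaths, 6.0))   # a scale inside the two-cluster stretch
print(clusters_at(report_deaths, 4.5))   # a scale inside the five-cluster stretch
```

Counting the intervals whose death exceeds a chosen scale gives the number of clusters at that scale, which is how the two-cluster and five-cluster stretches are read off the list.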

What are the two clusters?

Before we cluster with k=5, let’s see what we can say about the situation with k=2, which was suggested by TDA. It seems the two clusters have similar sizes.

## 
##   1   2 
## 519 474
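The report does not say which clustering algorithm produced these two groups. Assuming it is k-means, the idea can be sketched in plain Python as below. The function `kmeans`, its deterministic farthest-first initialization (chosen here so the toy run is reproducible, unlike the random starts most libraries use), and the six made-up households are all illustrative assumptions.

```python
from collections import Counter

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first start."""
    centers = [tuple(rows[0])]
    while len(centers) < k:
        # Next center: the row farthest from all centers chosen so far.
        far = max(rows, key=lambda r: min(sqdist(r, c) for c in centers))
        centers.append(tuple(far))
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center for every row.
        new_labels = [min(range(k), key=lambda c: sqdist(r, centers[c]))
                      for r in rows]
        if new_labels == labels:      # converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers

# Made-up households: 1 = has the asset, 0 = does not.
rows = [(0, 0, 0, 0, 0), (1, 0, 0, 0, 0), (0, 1, 0, 0, 0),
        (1, 1, 1, 1, 1), (0, 1, 1, 1, 1), (1, 0, 1, 1, 1)]
labels, centers = kmeans(rows, k=2)
print(Counter(labels))                # cluster sizes, like the table above
```

With binary asset indicators, each coordinate of a cluster center is simply the share of that cluster's households owning the asset.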

We can tabulate the mean for each variable, grouped by cluster, to look for any large differences.

##   cluster2 HV025 HV206 HV207 HV208  HV209 HV210  HV211 HV212 HV219  HV221
## 1        1 0.254 0.632 0.233 0.210 0.0212 0.407 0.0405 0.000 0.832 0.0173
## 2        2 0.700 0.983 0.496 0.932 0.4599 0.489 0.3734 0.103 0.884 0.3481
##   HV227 HV243A HV243B HV243C HV246   SH30 SH47B SH47C SH47D SH47E SH47F
## 1 0.349 0.0443  0.723 0.0539 0.565 0.0154 0.482 0.214 0.426 0.755 0.308
## 2 0.426 0.4958  0.968 0.0295 0.259 0.0148 0.932 0.899 0.947 0.943 0.903
##   SH47G SH47I  SH47J  SH47K   SH47N  SH47U  SH47V   SH47W SH56A  SH56B
## 1 0.339 0.170 0.0405 0.0944 0.00193 0.0328 0.0116 0.00963 0.599 0.0539
## 2 0.861 0.245 0.7342 0.4705 0.08650 0.1540 0.0148 0.01688 0.964 0.4747
##    SH56C SH56D  SH58 SH62A   SH62B   SH62C  SH62D   SH62E SH62F SHSTRUC
## 1 0.0751 0.187 0.844 0.395 0.00578 0.00771 0.1657 0.03468 0.264   0.699
## 2 0.4030 0.605 0.827 0.169 0.00000 0.00211 0.0506 0.00422 0.120   0.565
##    HV013  HV014  HV216  HV220
## 1 -0.127  0.118 -0.287 -0.200
## 2  0.140 -0.137  0.312  0.222
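A table like this can be built by averaging each variable within each cluster, and the largest differences can then be found by sorting. The sketch below is plain Python with hypothetical helper names (`cluster_means`, `largest_gaps`) and a made-up toy dataset. The `report` dictionary copies a handful of the means printed above so the ranking uses real numbers.

```python
from collections import defaultdict

def cluster_means(rows, labels, names):
    """Mean of every variable, grouped by cluster label."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    return {lab: {name: sum(col) / len(members)
                  for name, col in zip(names, zip(*members))}
            for lab, members in sorted(groups.items())}

def largest_gaps(means_by_var):
    """Variables sorted by absolute difference between the two cluster means."""
    return sorted(((name, round(b - a, 4)) for name, (a, b) in means_by_var.items()),
                  key=lambda item: abs(item[1]), reverse=True)

# Toy data: columns are electricity and refrigerator (values made up).
toy_means = cluster_means([(1, 0), (0, 0), (1, 1), (1, 1)],
                          labels=[1, 1, 2, 2], names=["HV206", "HV209"])
print(toy_means)

# A few (cluster 1, cluster 2) means copied from the table above.
report = {"HV025": (0.254, 0.700), "HV206": (0.632, 0.983),
          "HV208": (0.210, 0.932), "HV209": (0.0212, 0.4599),
          "HV243A": (0.0443, 0.4958), "SH47C": (0.214, 0.899),
          "SH58": (0.844, 0.827)}
for name, gap in largest_gaps(report):
    print(name, gap)
```

Among these seven variables, television (HV208) and pressure cooker (SH47C) separate the clusters most, while home ownership (SH58) barely differs.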

If we plot histograms of a few of these variables, the picture emerges that cluster 1 tends to be rural households with few amenities: only about 2% have a refrigerator and about 4% a mobile telephone. Cluster 2 tends to be urban households, almost all electrified (98%, versus 63% in cluster 1), and much more likely to have household amenities (46% have a refrigerator and about 50% a mobile telephone).

Comparing clusters by Wealth Index

Now let’s return to the full dataset of 109,041 households, and use the same 44 variables to cluster with k=5.

The supplemental file NFHS3SUP.pdf, p.6, explains how the “Wealth Index” variable HV270 was constructed, using PCA to generate weights for these variables. The index is a cumulative score, normalized to a ranking from 1 (poorest) to 5 (wealthiest). How do the five clusters compare with the Wealth Index?
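A cross-tabulation of cluster label against Wealth Index category answers this question directly. Below is a minimal Python sketch with a hypothetical helper `crosstab` and six made-up households. It lists the categories from poorest to richest, which is a presentation choice and not necessarily the order used in the report's own table.

```python
from collections import Counter

WEALTH_ORDER = ["Poorest", "Poorer", "Middle", "Richer", "Richest"]

def crosstab(clusters, wealth):
    """Rows are clusters, columns are wealth categories in WEALTH_ORDER."""
    counts = Counter(zip(clusters, wealth))   # missing pairs count as 0
    return {c: [counts[(c, w)] for w in WEALTH_ORDER]
            for c in sorted(set(clusters))}

# Made-up cluster labels and wealth categories for six households.
clusters = [1, 1, 2, 2, 2, 3]
wealth = ["Richest", "Middle", "Poorest", "Poorest", "Poorer", "Richest"]

print("   ", WEALTH_ORDER)
for c, row in crosstab(clusters, wealth).items():
    print(c, row)
```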

##    
##     Middle Poorer Poorest Richer Richest
##   1   2604   1020     293   3829    3689
##   2   4079   5386    5731   1877     211
##   3   7671   9037    8453   2040      30
##   4   6078    968      45  13512    6715
##   5    304      1       0   4000   20389

Note that the columns are in alphabetical order, not wealth order. Cluster 5 is mostly Richest households (about 83% of its members) and captures about two-thirds of all Richest households, plus a smaller share of the Richer. Cluster 4 captures just over half of the Richer, with some bleeding into Richest and Middle. Cluster 3 is roughly evenly spread between Middle, Poorer, and Poorest. Clusters 2 and 1 do not bin well with the Wealth Index, although cluster 2 has few Richest households and cluster 1 has few Poorest ones.
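To put numbers on this comparison, the counts in the table above can be converted into shares: what fraction of a cluster falls in each wealth category, and what fraction of a wealth category lands in each cluster. The Python sketch below copies the counts from the table. The helper names `share_of_cluster` and `share_of_category` are hypothetical.

```python
COLS = ["Middle", "Poorer", "Poorest", "Richer", "Richest"]   # order as printed
ORDER = ["Poorest", "Poorer", "Middle", "Richer", "Richest"]  # wealth order

table = {
    1: [2604, 1020, 293, 3829, 3689],
    2: [4079, 5386, 5731, 1877, 211],
    3: [7671, 9037, 8453, 2040, 30],
    4: [6078, 968, 45, 13512, 6715],
    5: [304, 1, 0, 4000, 20389],
}
counts = {c: dict(zip(COLS, row)) for c, row in table.items()}
col_totals = {w: sum(counts[c][w] for c in counts) for w in ORDER}

def share_of_cluster(cluster, level):
    """Fraction of the cluster's households that are in this wealth category."""
    return counts[cluster][level] / sum(counts[cluster].values())

def share_of_category(cluster, level):
    """Fraction of this wealth category's households that fall in the cluster."""
    return counts[cluster][level] / col_totals[level]

for c in counts:
    print(c, [round(100 * share_of_cluster(c, w)) for w in ORDER])

print(round(share_of_category(5, "Richest"), 3))  # cluster 5's share of Richest
print(round(share_of_category(4, "Richer"), 3))   # cluster 4's share of Richer
```

Row shares show how pure each cluster is, and column shares show how much of a wealth category a cluster captures. The two can differ considerably, so it helps to look at both.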