Is there a natural way to cluster Indian households, based on data about the nature of these households – do they have electricity, a TV, a computer, a cow, an ox-drawn cart? How do these clusters correspond to the 5-level Wealth Index?
The National Family Health Survey (NFHS) is conducted in India every 10 years or so. The latest one was in 2015-2016, and its data is yet to be released. In preparation, I’m looking at the third NFHS, conducted in 2005-2006. I registered to get access to this data (which, unfortunately, I can’t upload to GitHub).
This week, I’m looking at how Indian households cluster according to their characteristics. The dataset includes 109,041 separate questionnaire responses, each with 3588 (!!!) variables entered. A codebook can be found in the file IAHR52FL.MAP (open it with a text editor). There’s a copy of the household questionnaire starting on p.636 of FRIND3-Vol1AndVol2.pdf. These files are uploaded to the GitHub repo corresponding to this report.
I will look specifically at the following variables.
Variable | Does the household have…? | Variable | [other characteristics] |
---|---|---|---|
HV206 | electricity | HV025 | Rural or urban? |
HV207 | radio | HV013 | # in household? |
HV208 | television | HV014 | # of children in HH? |
HV209 | refrigerator | HV216 | # of rooms for sleeping? |
HV210 | bicycle | HV219 | sex of HH head? |
HV211 | motorcycle/scooter | HV220 | age of HH head? |
HV212 | car | SH30 | anyone with tuberculosis? |
HV221 | non-mobile telephone | SH58 | own this house? |
HV227 | bednets for sleeping | SH62A | have cows/bulls/buffalo? |
HV243A | mobile telephone | SH62B | have camels? |
HV243B | watch | SH62C | have horses/donkeys/mules? |
HV243C | animal-drawn cart | SH62D | have goats? |
HV246 | livestock | SH62E | have sheep? |
SH47B | mattress | SH62F | have chickens? |
SH47C | pressure cooker | SHSTRUC | nuclear or non-nuclear family? |
SH47D | chair | ||
SH47E | cot/bed | ||
SH47F | table | ||
SH47G | electric fan | ||
SH47I | B&W television | ||
SH47J | color television | ||
SH47K | sewing machine | ||
SH47N | computer | ||
SH47U | water pump | ||
SH47V | thresher | ||
SH47W | tractor | ||
SH56A | any windows | ||
SH56B | windows with glass | ||
SH56C | windows with screens | ||
SH56D | windows with curtains/shutters | ||
As a first step, we can use topological data analysis (TDA) to see if there is a natural number of clusters in the data. For the sake of computation, I’ll sample 1000 of the 109,041 cases (and remove 7 with missing data). Running the whole 993 x 44 dataset with maxscale=10 takes my laptop about 7.5 minutes; with maxscale=5 it took only about 2 minutes. Below we have the persistent homology barcode and the persistence diagram.
(See this report or this report for some details on using TDA.)
Because there are 993 black barcode lines, it’s hard to see how many larger clusters we are looking at. For this we can look explicitly at the birth and death values of the H0 intervals, starting with the longest-lived ones.
## dimension Birth Death
## [1,] 0 0 10.000000
## [2,] 0 0 7.724024
## [3,] 0 0 5.306447
## [4,] 0 0 5.036054
## [5,] 0 0 4.868821
## [6,] 0 0 4.248221
## [7,] 0 0 3.724490
## [8,] 0 0 3.722570
## [9,] 0 0 3.719849
## [10,] 0 0 3.582060
## [11,] 0 0 3.527882
## [12,] 0 0 3.520842
## [13,] 0 0 3.514229
## [14,] 0 0 3.414533
## [15,] 0 0 3.383944
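For readers wondering where these death values come from: every point starts as its own connected component at scale 0, and an H0 interval dies when its component merges into another one. The death values are therefore the edge lengths of a minimum spanning tree (the same numbers as single-linkage merge heights). Here is a minimal Python sketch on made-up 2-D points. It is not the code behind this report; `h0_deaths` and the toy data are hypothetical, and packages differ on whether the scale is a radius or a diameter.

```python
import math
from itertools import combinations


def h0_deaths(points):
    """Death scales of H0 classes in a Rips filtration (every birth is 0)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first (Kruskal's algorithm).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj        # two components merge, so one interval dies
            deaths.append(length)
    # The last surviving component never dies, so it is not listed here.
    return sorted(deaths, reverse=True)


# Toy data (hypothetical): two tight groups that are far apart.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
deaths = h0_deaths(toy)
print(deaths)
```

In the toy data one death value is much larger than the rest, which is the signature of two well-separated groups. The table above is read the same way.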
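The first interval’s death value of 10 is just the maxscale cap: that component never dies. We see that before everything gets clumped together into one connected component (at scale 7.72), there is a long stretch with two clusters, starting at about 5.31. There is also a persistent, though shorter, stretch with k=5 clusters, from about 4.25 to 4.87. So, miraculously or by design, it was a good idea to use a Wealth Index with values 1-5.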
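One way to make "long stretch" concrete is to measure, for each k, the range of scales over which exactly k components are alive. The Python sketch below does that arithmetic on the death values printed above. Treating the first interval (death 10) as capped at maxscale rather than truly dying is an assumption.

```python
# H0 death values copied from the output above (largest first).
deaths = [10.000000, 7.724024, 5.306447, 5.036054, 4.868821,
          4.248221, 3.724490, 3.722570, 3.719849, 3.582060]

# Assumption: the first interval is capped at maxscale=10 and never really dies,
# so exactly k components are alive for scales between deaths[k] and deaths[k-1].
stretch = {k: deaths[k - 1] - deaths[k] for k in range(2, len(deaths))}

# Show the three values of k with the widest range of scales.
for k, width in sorted(stretch.items(), key=lambda kv: -kv[1])[:3]:
    print(f"k={k}: scales {deaths[k]:.2f} to {deaths[k - 1]:.2f} (width {width:.2f})")
```

By this measure k=2 has by far the widest stretch (about 2.42), followed by k=5 (about 0.62) and k=6 (about 0.52).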
Before we cluster with k=5, let’s see what we can say about the situation with k=2, which was suggested by TDA. It seems the two clusters have similar sizes.
##
## 1 2
## 519 474
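This report does not say which clustering algorithm produced these labels. Purely as an illustration, here is a bare-bones k-means (Lloyd's algorithm) in Python on made-up data. The function name, the toy rows, and the starting centers are all hypothetical.

```python
import math


def kmeans(rows, centers, n_iter=20):
    """Plain Lloyd's algorithm: assign to nearest center, then recompute centers."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every row.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(row, centers[c]))
            for row in rows
        ]
        # Update step: each center moves to the mean of its assigned rows.
        new_centers = []
        for c in range(len(centers)):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(centers[c]))  # empty cluster keeps its center
        if new_centers == [list(c) for c in centers]:
            break  # centers stopped moving
        centers = new_centers
    return labels, centers


# Hypothetical toy households: (has electricity, has TV, has refrigerator).
rows = [(0, 0, 0), (1, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1)]
labels, centers = kmeans(rows, centers=[rows[0], rows[3]])
print(labels)
```

Real k-means implementations usually pick several random starts and keep the best one; the fixed starting centers here only keep the toy example deterministic.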
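We can tabulate the mean of each variable, grouped by cluster, to look for any large differences. Most of the variables are 0/1 indicators, so their means are proportions. The last four (HV013, HV014, HV216, HV220) have negative means in one cluster, so they appear to have been standardized.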
## cluster2 HV025 HV206 HV207 HV208 HV209 HV210 HV211 HV212 HV219 HV221
## 1 1 0.254 0.632 0.233 0.210 0.0212 0.407 0.0405 0.000 0.832 0.0173
## 2 2 0.700 0.983 0.496 0.932 0.4599 0.489 0.3734 0.103 0.884 0.3481
## HV227 HV243A HV243B HV243C HV246 SH30 SH47B SH47C SH47D SH47E SH47F
## 1 0.349 0.0443 0.723 0.0539 0.565 0.0154 0.482 0.214 0.426 0.755 0.308
## 2 0.426 0.4958 0.968 0.0295 0.259 0.0148 0.932 0.899 0.947 0.943 0.903
## SH47G SH47I SH47J SH47K SH47N SH47U SH47V SH47W SH56A SH56B
## 1 0.339 0.170 0.0405 0.0944 0.00193 0.0328 0.0116 0.00963 0.599 0.0539
## 2 0.861 0.245 0.7342 0.4705 0.08650 0.1540 0.0148 0.01688 0.964 0.4747
## SH56C SH56D SH58 SH62A SH62B SH62C SH62D SH62E SH62F SHSTRUC
## 1 0.0751 0.187 0.844 0.395 0.00578 0.00771 0.1657 0.03468 0.264 0.699
## 2 0.4030 0.605 0.827 0.169 0.00000 0.00211 0.0506 0.00422 0.120 0.565
## HV013 HV014 HV216 HV220
## 1 -0.127 0.118 -0.287 -0.200
## 2 0.140 -0.137 0.312 0.222
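The grouped means above take only a few lines to compute. Here is a minimal Python sketch on made-up rows; the variable subset and the values are hypothetical, and this is not the code behind the table.

```python
from collections import defaultdict

# Hypothetical rows: (cluster label, {variable: 0/1 answer}).
rows = [
    (1, {"HV206": 1, "HV209": 0}),
    (1, {"HV206": 0, "HV209": 0}),
    (2, {"HV206": 1, "HV209": 1}),
    (2, {"HV206": 1, "HV209": 0}),
]

sums = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for cluster, answers in rows:
    counts[cluster] += 1
    for var, value in answers.items():
        sums[cluster][var] += value

# Per-cluster mean of each variable.
means = {c: {v: s / counts[c] for v, s in sums[c].items()} for c in sums}
print(means)
```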
If we plot histograms of a few of these variables, the picture emerges that cluster 1 tends to be rural households, with almost no modern amenities like a refrigerator (about 2%) or cell phone (about 4%). Cluster 2 tends to be urban households, almost all electrified (98%, against 63% in cluster 1), and much more likely to have household amenities.
Now let’s return to the full dataset of 109,041 households, and use the same 44 variables to cluster with k=5.
The supplemental file NFHS3SUP.pdf, p.6, explains how the “Wealth Index” variable HV270 was constructed using PCA to generate weights for these variables. The index is a cumulative score, normalized to a ranking from 1 (poorest) to 5 (wealthiest). How do the five clusters compare with the Wealth Index?
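The score-then-rank idea can be sketched as follows. This only illustrates the general recipe (a weighted sum of indicators, then a cut into five ranked groups). The weights are invented rather than taken from NFHS3SUP.pdf, the households are made up, and cutting into equal-sized groups is an assumption of the sketch.

```python
# Hypothetical weights standing in for PCA loadings (not the NFHS values).
weights = {"HV206": 0.9, "HV208": 0.7, "HV209": 1.2, "HV243C": -0.4}

# Hypothetical households with 0/1 answers for each variable.
households = [
    {"HV206": 0, "HV208": 0, "HV209": 0, "HV243C": 1},
    {"HV206": 1, "HV208": 0, "HV209": 0, "HV243C": 1},
    {"HV206": 1, "HV208": 1, "HV209": 0, "HV243C": 0},
    {"HV206": 1, "HV208": 1, "HV209": 1, "HV243C": 0},
    {"HV206": 1, "HV208": 0, "HV209": 0, "HV243C": 0},
]

# Cumulative score: weighted sum of the indicators.
scores = [sum(weights[v] * h[v] for v in weights) for h in households]

# Rank households by score, then cut the ranks into five equal-sized bins
# (1 = lowest scores, 5 = highest scores).
order = sorted(range(len(scores)), key=lambda i: scores[i])
index = [0] * len(scores)
for rank, i in enumerate(order):
    index[i] = 1 + (5 * rank) // len(scores)

print(index)
```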
##
## Middle Poorer Poorest Richer Richest
## 1 2604 1020 293 3829 3689
## 2 4079 5386 5731 1877 211
## 3 7671 9037 8453 2040 30
## 4 6078 968 45 13512 6715
## 5 304 1 0 4000 20389
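Cluster 5 is dominated by the Richest households: they make up about 83% of the cluster, and the cluster holds about two-thirds of all Richest households. Cluster 4 captures just over half of the Richer households, with some bleeding into Richest and Middle. Cluster 3 is roughly evenly spread between Middle, Poorer, and Poorest, and cluster 2 has a similar profile with very few Richest households. Cluster 1 does not bin well with the Wealth Index: it is mostly Middle, Richer, and Richest, with few Poorest households. (Note that the columns above are in alphabetical order, not wealth order.)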
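To put numbers on this comparison, each row of the table can be converted to shares. The Python sketch below uses the counts printed above and only reorders the columns from poorest to richest.

```python
levels = ["Poorest", "Poorer", "Middle", "Richer", "Richest"]

# Counts copied from the cross-tabulation above, keyed by cluster.
table = {
    1: {"Middle": 2604, "Poorer": 1020, "Poorest": 293, "Richer": 3829, "Richest": 3689},
    2: {"Middle": 4079, "Poorer": 5386, "Poorest": 5731, "Richer": 1877, "Richest": 211},
    3: {"Middle": 7671, "Poorer": 9037, "Poorest": 8453, "Richer": 2040, "Richest": 30},
    4: {"Middle": 6078, "Poorer": 968, "Poorest": 45, "Richer": 13512, "Richest": 6715},
    5: {"Middle": 304, "Poorer": 1, "Poorest": 0, "Richer": 4000, "Richest": 20389},
}

row_share = {}
for cluster, counts in table.items():
    total = sum(counts.values())
    # Share of each wealth level within this cluster.
    row_share[cluster] = {lvl: counts[lvl] / total for lvl in levels}
    cells = "  ".join(f"{lvl} {row_share[cluster][lvl]:5.1%}" for lvl in levels)
    print(f"cluster {cluster} (n={total}): {cells}")
```

Cluster 5 comes out at about 83% Richest and cluster 4 at about 49% Richer, while no single wealth level reaches 40% in clusters 1 to 3.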