Continuing with the Indian NFHS-3 dataset from last week, now I’m interested in the variable HV219, which asks, “Is the head of the household male or female?”
First I look at the difference between male- and female-led households, and then I build an algorithm that predicts the sex of the HH head from household characteristics.
I’m using the same cleaned dataset from last week, based on the 109,041 questionnaire responses about Indian households. In last week’s report I listed the 44 household characteristic variables that I’m looking at.
First, a table of the HV219 variable shows that only 14.37 percent of households (15,516 of 107,962) have female heads.
##
## Male Female
## 92446 15516
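For reference, the tabulation can be reproduced in base R. Here `hh` is a simulated stand-in for the cleaned household data frame (the name and the simulation are illustrative, not the real data):

```r
set.seed(1)
# Stand-in for the cleaned NFHS-3 file: a data frame `hh` with an HV219
# factor drawn at roughly the real proportions
hh <- data.frame(HV219 = factor(sample(c("Male", "Female"), 107962,
                                       replace = TRUE,
                                       prob = c(0.8563, 0.1437))))

head_counts <- table(hh$HV219)
head_counts
round(100 * prop.table(head_counts), 2)  # percent by head's sex
```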
The first piece of good news is that there is not much difference in the distribution of HV270, the “Wealth Index”, between the two types of household.
A quick way to look for differences between male- and female-led households is to compare the group means of the binary “does not have / does have” (0/1) variables.
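A sketch of this comparison in base R, on a toy stand-in for the data (two indicators only; the variable labels follow the standard DHS recode coding):

```r
set.seed(2)
# Toy stand-in: two 0/1 ownership indicators plus the head's sex.
# The real comparison runs over all 44 cleaned characteristic variables.
hh <- data.frame(
  HV219 = factor(sample(c("Male", "Female"), 1000, replace = TRUE,
                        prob = c(0.86, 0.14))),
  HV206 = rbinom(1000, 1, 0.77),   # has electricity
  HV210 = rbinom(1000, 1, 0.45)    # has a bicycle
)

# The mean of a 0/1 variable within each group is the proportion owning it
aggregate(cbind(HV206, HV210) ~ HV219, data = hh, FUN = mean)
```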
## HV219 HV025 HV206 HV207 HV208 HV209 HV210 HV211 HV212 HV221 HV227
## 1 Male 0.536 0.791 0.364 0.551 0.230 0.474 0.218 0.0505 0.192 0.382
## 2 Female 0.558 0.763 0.313 0.494 0.218 0.302 0.141 0.0358 0.185 0.359
## HV243A HV243B HV243C HV246 SH30 SH47B SH47C SH47D SH47E SH47F SH47G
## 1 0.246 0.839 0.0431 0.440 0.0195 0.666 0.526 0.643 0.847 0.562 0.606
## 2 0.203 0.756 0.0132 0.338 0.0151 0.629 0.485 0.623 0.814 0.538 0.550
## SH47I SH47J SH47K SH47N SH47U SH47V SH47W SH56A SH56B SH56C SH56D
## 1 0.220 0.359 0.251 0.0522 0.1033 0.01206 0.01501 0.741 0.257 0.217 0.365
## 2 0.181 0.337 0.222 0.0410 0.0821 0.00554 0.00567 0.740 0.255 0.207 0.366
## SH58 SH62A SH62B SH62C SH62D SH62E SH62F SHSTRUC
## 1 0.841 0.318 0.00433 0.00561 0.1172 0.0143 0.178 0.381
## 2 0.835 0.191 0.00232 0.00258 0.0878 0.0098 0.159 0.430
Many of these means are quite close, although male-led HHs tend to have more stuff. By this measure, male-led HHs are 2-3 times as likely to have threshers (SH47V) and tractors (SH47W), and are more likely to have livestock of various kinds (SH62A - SH62F). Female-led HHs are more common in urban areas (HV025) and in non-nuclear family structures (SHSTRUC).
We can pick out a few of these variables and look at their histograms.
From our cleaned dataset of 107,962 households with 44 characteristic variables, we want to build an algorithm that will predict HV219: whether the head of the household is male or female.
The factor variables have been converted into 0/1s, and the numeric variables (number in household, number of children, number of rooms for sleeping) have been scaled to have mean 0 and standard deviation 1.
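The preprocessing might look like the following sketch, on toy data with illustrative DHS-style column names (the real pipeline covers all 44 variables):

```r
set.seed(3)
# Toy stand-in for a few raw columns
hh <- data.frame(
  HV009 = rpois(200, 5),                       # number of household members
  HV216 = rpois(200, 2),                       # rooms used for sleeping
  HV206 = factor(sample(c("No", "Yes"), 200, replace = TRUE))
)

# Two-level factor -> 0/1 indicator
hh$HV206 <- as.integer(hh$HV206 == "Yes")

# Numeric variables -> mean 0, standard deviation 1
num_vars <- c("HV009", "HV216")
hh[num_vars] <- scale(hh[num_vars])
```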
A good first try would be with a logistic regression.
We can check how the model predicts on the test dataset.
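A minimal sketch of the fit-and-evaluate pipeline, on simulated stand-in data (the column names, predictors, and split are assumptions, not the real ones):

```r
set.seed(4)
# Toy stand-in for the train/test split; the real model uses the 44
# household characteristics as predictors
n <- 2000
d <- data.frame(female = rbinom(n, 1, 0.14),   # 1 = female-headed
                x1 = rnorm(n), x2 = rnorm(n))
train <- d[1:1500, ]
test  <- d[1501:2000, ]

fit   <- glm(female ~ x1 + x2, data = train, family = binomial)
probs <- predict(fit, newdata = test, type = "response")
pred  <- as.integer(probs > 0.5)              # classify at a 0.5 cutoff

# Base-R confusion matrix; caret::confusionMatrix() produces the fuller
# summary (accuracy, kappa, sensitivity, specificity) reported here
table(Prediction = pred, Reference = test$female)
```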
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18383 3091
## 1 48 71
##
## Accuracy : 0.85463
## 95% CI : (0.84986, 0.85931)
## No Information Rate : 0.85356
## P-Value [Acc > NIR] : 0.33314
##
## Kappa : 0.03301
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.997396
## Specificity : 0.022454
## Pos Pred Value : 0.856058
## Neg Pred Value : 0.596639
## Prevalence : 0.853564
## Detection Rate : 0.851341
## Detection Prevalence : 0.994489
## Balanced Accuracy : 0.509925
##
## 'Positive' Class : 0
##
This is pretty bad. Our algorithm almost always predicts “Male” (the detection prevalence shows it assigns the positive class to 99.4 percent of households), and with p = 0.33 against the No Information Rate we can’t be confident it does better than just guessing “Male” every time.
Let’s try with a tree:
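A sketch of a classification tree with `rpart`, again on toy stand-in data:

```r
library(rpart)  # recommended package, ships with R

set.seed(5)
# Toy stand-in; the real model uses the 44 household characteristics
n <- 2000
d <- data.frame(
  HV219 = factor(ifelse(runif(n) < 0.14, "Female", "Male")),
  x1 = rnorm(n), x2 = rnorm(n)
)
train <- d[1:1500, ]
test  <- d[1501:2000, ]

tree <- rpart(HV219 ~ ., data = train, method = "class")
pred <- predict(tree, newdata = test, type = "class")
table(Prediction = pred, Reference = test$HV219)
```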
## Confusion Matrix and Statistics
##
## Reference
## Prediction Male Female
## Male 18176 2666
## Female 255 496
##
## Accuracy : 0.86472
## 95% CI : (0.86009, 0.86926)
## No Information Rate : 0.85356
## P-Value [Acc > NIR] : 1.4531e-06
##
## Kappa : 0.20906
## Mcnemar's Test P-Value : < 2.22e-16
##
## Sensitivity : 0.98616
## Specificity : 0.15686
## Pos Pred Value : 0.87209
## Neg Pred Value : 0.66045
## Prevalence : 0.85356
## Detection Rate : 0.84175
## Detection Prevalence : 0.96522
## Balanced Accuracy : 0.57151
##
## 'Positive' Class : Male
##
Still very bad, but at least it is reliably better than the No Information Rate. Let’s see if a random forest model can do better:
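A random forest sketch with the `randomForest` package, on the same kind of toy data:

```r
library(randomForest)

set.seed(6)
# Toy stand-in for the real 44-predictor data
n <- 2000
d <- data.frame(
  HV219 = factor(ifelse(runif(n) < 0.14, "Female", "Male")),
  x1 = rnorm(n), x2 = rnorm(n)
)
train <- d[1:1500, ]
test  <- d[1501:2000, ]

# Defaults: 500 trees, mtry = floor(sqrt(p)); a serious fit would tune these
rf   <- randomForest(HV219 ~ ., data = train)
pred <- predict(rf, newdata = test)
table(Prediction = pred, Reference = test$HV219)
```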
## Confusion Matrix and Statistics
##
## Reference
## Prediction Male Female
## Male 18268 2725
## Female 163 437
##
## Accuracy : 0.86625
## 95% CI : (0.86164, 0.87077)
## No Information Rate : 0.85356
## P-Value [Acc > NIR] : 4.975e-08
##
## Kappa : 0.19471
## Mcnemar's Test P-Value : < 2.22e-16
##
## Sensitivity : 0.99116
## Specificity : 0.13820
## Pos Pred Value : 0.87019
## Neg Pred Value : 0.72833
## Prevalence : 0.85356
## Detection Rate : 0.84601
## Detection Prevalence : 0.97221
## Balanced Accuracy : 0.56468
##
## 'Positive' Class : Male
##
This is better, but not by much. It seems that we’re going to have to engineer better features if we want to do better. But finally, let’s try some boosting to see if that does something magical:
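A boosting sketch with the `gbm` package; the hyperparameters below are illustrative, not the ones actually used:

```r
library(gbm)  # gradient boosting; expects a numeric 0/1 outcome for bernoulli

set.seed(7)
# Toy stand-in for the real data
n <- 2000
d <- data.frame(female = ifelse(runif(n) < 0.14, 1, 0),
                x1 = rnorm(n), x2 = rnorm(n))
train <- d[1:1500, ]
test  <- d[1501:2000, ]

boost <- gbm(female ~ ., data = train, distribution = "bernoulli",
             n.trees = 500, interaction.depth = 3, shrinkage = 0.05)
probs <- predict(boost, newdata = test, n.trees = 500, type = "response")
pred  <- as.integer(probs > 0.5)
table(Prediction = pred, Reference = test$female)
```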
## Confusion Matrix and Statistics
##
## Reference
## Prediction Male Female
## Male 18227 2673
## Female 204 489
##
## Accuracy : 0.86676
## 95% CI : (0.86216, 0.87127)
## No Information Rate : 0.85356
## P-Value [Acc > NIR] : 1.471e-08
##
## Kappa : 0.21222
## Mcnemar's Test P-Value : < 2.22e-16
##
## Sensitivity : 0.98893
## Specificity : 0.15465
## Pos Pred Value : 0.87211
## Neg Pred Value : 0.70563
## Prevalence : 0.85356
## Detection Rate : 0.84412
## Detection Prevalence : 0.96791
## Balanced Accuracy : 0.57179
##
## 'Positive' Class : Male
##
Also very bad. In every case the algorithm predicts “Male” far too often: most female-headed households are misclassified as male-headed, so specificity is an abysmal 14-16% for the tree-based models (and barely 2% for the logistic regression).
This first pass at a household head prediction algorithm was naively hopeful that simply counting things like beds, windows, bicycles, cows, and tractors in a household might allow me to predict who was in charge. For better or for worse, the model needs more nuance.
A good first place to look would be at the role of the scaled numerical variables (# in household, # of children, # of rooms for sleeping, age of HH head) versus the 0/1 categorical variables. Does the scaling affect the model performance?
There are many categorical variables with more than two levels – what is the house floor made of? what are the walls made of? what state is the house in? where do they get their water? how do they treat their water? what type of cooking fuel do they use? what is the HH Wealth Index? – that I left out of the model but could add in. And there are variables related to health practices in the household that might help too. I’m sure I could improve the algorithm by adding back some of the variables I ignored: I’m using only 44 characteristic variables here, and the original dataset had 3,588!