Last week I looked at the “Pew Research Center 2014 U.S. Religious Landscape Study”[1] and presented some interesting aspects of the survey responses. This week I want to see whether I can use survey responses to predict a respondent’s religion. That’s right, a predictive religious test!
I build a logistic regression model that predicts Christian/non-Christian status with 89% accuracy on a held-out test set, against a 71% baseline from always guessing “Christian”. A random forest model doesn’t do any better. I put this model into a Shiny app, so you can see what it predicts for you:
https://lukewolcott.shinyapps.io/InTheResistance_Week14/
Then I try to predict religious affiliation out of six categories: Christian, Muslim, Jewish, Buddhist, Other, None. For this I use a multinomial logistic regression. As you’ll see below, it is 85% accurate. This is not very good: a “null prediction” of “Christian” would be right 71% of the time, since that is the percentage of Christians in the dataset, and the confusion matrix shows that the model almost never predicts anything other than “Christian” or “None”. I tried a random forest model as well for the 6-way prediction, and it had basically the same performance.
I want to choose survey questions that don’t directly ask about a person’s religion. For example, here are some that I’ll include.
Q.A1: Generally, how would you say things are these days in your life – would you say that you are very happy, pretty happy, or not too happy?
Q.I4: Now, thinking about some different kinds of experiences, how often do you feel a deep sense of spiritual peace and well-being?… feel a deep sense of wonder about the universe?… feel a strong sense of gratitude or thankfulness?… think about the meaning and purpose of life?
Q.M5: As I read a short list of statements about churches and other religious organizations, please tell me if you agree or disagree with each one. First, churches and other religious organizations focus too much on rules?… Play an important role in helping the poor and needy?… Are too involved with politics?… Protect and strengthen morality in society?… Are too concerned with money and power?… Bring people together and strengthen community bonds?
There are three questions that ask about views on homosexuality, but since the survey was taken in 2014 I’m worried (hoping, really) that views have changed and that these questions will no longer be relevant in 2017, so I leave them out.
We also have various demographic questions – about highest level of education, family income, political affiliation, etc. – that will be good to include.
The RELTRAD variable bins the respondents’ religions into 16 broad categories (including a ‘None’). I’m going to bin these further into a variable RELTRAD6 with only six: Christian, Muslim, Jewish, Buddhist, Other, and None. I’ll also create a variable CHRISTIAN that records whether each respondent self-reports as Christian or not.
The survey asks about the religion (if any) of the respondent, and also the religion (if any) they were raised with. I’m going to create a variable KEPTREL that keeps track of whether or not the person kept the religion they were raised with (using the six bins, not the 16).
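To make the binning concrete, here is a minimal sketch of the three derived variables. It is written in Python for illustration and is not the code from the .Rmd; the detailed labels in SIX_BINS and the helper names to_six and derive are hypothetical stand-ins for the real RELTRAD codes.

```python
# Hypothetical mapping from a few detailed labels to the six bins.
# These labels are illustrative; the real codebook values may differ.
SIX_BINS = {
    "Evangelical Protestant": "Christian",
    "Mainline Protestant": "Christian",
    "Catholic": "Christian",
    "Jewish": "Jewish",
    "Muslim": "Muslim",
    "Buddhist": "Buddhist",
    "Hindu": "Other",
    "Unaffiliated": "None",
}


def to_six(label):
    """Collapse a detailed label into one of the six bins (default: Other)."""
    return SIX_BINS.get(label, "Other")


def derive(current, raised):
    """Return the three derived variables for one respondent."""
    reltrad6 = to_six(current)
    return {
        "RELTRAD6": reltrad6,
        "CHRISTIAN": "Christian" if reltrad6 == "Christian" else "non-Christian",
        # Compared on the six bins, so a switch within a bin counts as "kept".
        "KEPTREL": reltrad6 == to_six(raised),
    }


print(derive("Catholic", "Mainline Protestant"))
print(derive("Unaffiliated", "Catholic"))
```

In this sketch a respondent raised Mainline Protestant who is now Catholic still counts as having kept their religion, because both fall in the Christian bin.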
Look at the .Rmd file to see all the code for this.
In case you’re curious, in the 35,071 survey responses here is the percentage breakdown of Christian and non-Christian.
##
##     Christian non-Christian
##         71.42         28.58
First we split the 35,071 × 20 dataset into a training set and a test set.
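As a rough illustration of the split, the sketch below shuffles row indices and cuts them 80/20. The 80% fraction is inferred from the outputs further down (28,056 training rows and 7,015 test rows), and the plain shuffle, the seed, and the function name are assumptions; the .Rmd may well use a stratified split instead.

```python
import random


def train_test_split(n_rows, train_frac=0.8, seed=1):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = int(train_frac * n_rows)    # floor, so 35071 rows give 28056
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split(35071)
print(len(train_idx), len(test_idx))
```

With 35,071 rows this yields the same set sizes that appear in the model output, although the particular rows in each set will differ from the original analysis.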
Then we build the model using logistic regression.
For starters, here’s the ANOVA (analysis of deviance) table. The GitHub repo for this report has the codebook that explains what these questions are, if you can’t tell from what I said above.
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: CHRISTIAN
##
## Terms added sequentially (first to last)
##
##
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                    28055      33544
## qa1      3     56.8     28052      33488 2.839e-12 ***
## qi4a     5   2031.3     28047      31456 < 2.2e-16 ***
## qi4b     5    344.4     28042      31112 < 2.2e-16 ***
## qi4c     5    144.9     28037      30967 < 2.2e-16 ***
## qi4d     5    154.8     28032      30812 < 2.2e-16 ***
## qm5a     2   1182.1     28030      29630 < 2.2e-16 ***
## qm5b     2    266.4     28028      29364 < 2.2e-16 ***
## qm5d     2    640.3     28026      28723 < 2.2e-16 ***
## qm5e     2   1008.2     28024      27715 < 2.2e-16 ***
## qm5f     2      8.9     28022      27706   0.01175 *
## qm5g     2     30.0     28020      27676 3.110e-07 ***
## agerec  15    871.8     28005      26805 < 2.2e-16 ***
## educ     8    334.5     27997      26470 < 2.2e-16 ***
## income   9     35.9     27988      26434 4.191e-05 ***
## party    5    518.7     27983      25915 < 2.2e-16 ***
## ideo     5    550.3     27978      25365 < 2.2e-16 ***
## SEX      1     90.5     27977      25275 < 2.2e-16 ***
## KEPTREL  1   9403.5     27976      15871 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pretty much all the questions are significant. The weakest is qm5f (p ≈ 0.012), which asks if you agree or disagree that churches and other religious organizations “bring people together and strengthen community bonds”. As we saw last week, people agreed with this across the board for the most part, so it does little to separate Christians from non-Christians.
Since there is still residual deviance, I could certainly include more of the 100 questions asked. But I’m keeping it to this small set of questions because I like them.
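One way to read the table is to ask how much of the total drop in deviance each term accounts for. The sketch below does that arithmetic in Python, with the numbers copied from the table; the variable and function names are illustrative. Because the terms are added sequentially, each share depends on the order of entry, so treat the result as descriptive rather than as a ranking of importance.

```python
# Deviance values copied from the analysis of deviance table.
NULL_DEVIANCE = 33544.0
RESIDUAL_DEVIANCE = 15871.0
DEVIANCE_BY_TERM = {
    "qa1": 56.8, "qi4a": 2031.3, "qi4b": 344.4, "qi4c": 144.9,
    "qi4d": 154.8, "qm5a": 1182.1, "qm5b": 266.4, "qm5d": 640.3,
    "qm5e": 1008.2, "qm5f": 8.9, "qm5g": 30.0, "agerec": 871.8,
    "educ": 334.5, "income": 35.9, "party": 518.7, "ideo": 550.3,
    "SEX": 90.5, "KEPTREL": 9403.5,
}


def share_of_reduction(term):
    """Fraction of the total deviance reduction attributed to one term."""
    total_reduction = NULL_DEVIANCE - RESIDUAL_DEVIANCE
    return DEVIANCE_BY_TERM[term] / total_reduction


# Deviance-based pseudo R-squared for the full model.
pseudo_r2 = 1 - RESIDUAL_DEVIANCE / NULL_DEVIANCE

# The three terms with the largest sequential deviance reduction.
top_terms = sorted(DEVIANCE_BY_TERM, key=DEVIANCE_BY_TERM.get, reverse=True)[:3]

for term in top_terms:
    print(f"{term:8s} {share_of_reduction(term):.1%}")
print(f"pseudo R^2 {pseudo_r2:.3f}")
```

KEPTREL alone accounts for a bit over half of the total reduction, followed by qi4a and qm5a, and the deviance-based pseudo R² of the full model comes out at about 0.53.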
We can check how the model predicts on the test dataset.
## Confusion Matrix and Statistics
##
##                 Reference
## Prediction      Christian non-Christian
##   Christian          4747           500
##   non-Christian       247          1521
##
## Accuracy : 0.8935
## 95% CI : (0.8861, 0.9006)
## No Information Rate : 0.7119
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7304
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9505
## Specificity : 0.7526
## Pos Pred Value : 0.9047
## Neg Pred Value : 0.8603
## Prevalence : 0.7119
## Detection Rate : 0.6767
## Detection Prevalence : 0.7480
## Balanced Accuracy : 0.8516
##
## 'Positive' Class : Christian
##
So I have a model that can predict, with 89% accuracy, whether or not you are a Christian, from answers you give to survey questions that don’t directly ask about your religion!
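The headline statistics follow directly from the four counts in the confusion matrix. As a check, this Python sketch recomputes them with “Christian” as the positive class; the function name and dictionary keys are illustrative and not caret's API.

```python
def binary_metrics(tp, fn, fp, tn):
    """Summary statistics for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)  # true Christians predicted Christian
    specificity = tn / (tn + fp)  # true non-Christians predicted non-Christian
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # Accuracy of always guessing the larger reference class.
        "no_information_rate": max(tp + fn, fp + tn) / total,
    }


# Counts from the logistic regression confusion matrix on the test set.
metrics = binary_metrics(tp=4747, fn=247, fp=500, tn=1521)
for name, value in metrics.items():
    print(f"{name:20s} {value:.4f}")
```

The gap between sensitivity (about 0.95) and specificity (about 0.75) shows that the model is noticeably better at recognizing Christians than non-Christians, which the single accuracy figure hides.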
Certainly, this needs to be made into a Shiny app so anyone can answer the questions and see what the model predicts. Here it is: https://lukewolcott.shinyapps.io/InTheResistance_Week14/
I tried a random forest model, but it didn’t perform any better (89.1% accuracy versus 89.4% for the logistic regression).
## Confusion Matrix and Statistics
##
##                 Reference
## Prediction      Christian non-Christian
##   Christian          4734           503
##   non-Christian       260          1518
##
## Accuracy : 0.8912
## 95% CI : (0.8837, 0.8984)
## No Information Rate : 0.7119
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.725
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9479
## Specificity : 0.7511
## Pos Pred Value : 0.9040
## Neg Pred Value : 0.8538
## Prevalence : 0.7119
## Detection Rate : 0.6748
## Detection Prevalence : 0.7465
## Balanced Accuracy : 0.8495
##
## 'Positive' Class : Christian
##
We can go back to the RELTRAD6 variable, with 6 bins for religious affiliation, and fit it with a multinomial logistic regression. Note that the optimizer stops at its 100-iteration limit, so the fit may not have fully converged.
## # weights: 486 (400 variable)
## initial value 50269.603669
## iter 10 value 17541.289581
## iter 20 value 16225.918558
## iter 30 value 14856.361417
## iter 40 value 14214.671262
## iter 50 value 13497.700197
## iter 60 value 13230.129449
## iter 70 value 13124.614897
## iter 80 value 13104.588385
## iter 90 value 13098.882660
## iter 100 value 13095.801330
## final value 13095.801330
## stopped after 100 iterations
## Confusion Matrix and Statistics
##
##             Reference
## Prediction  Buddhist Christian Jewish Muslim None Other
##   Buddhist         0         0      0      0    0     0
##   Christian       15      4784    141     44  296    77
##   Jewish           0         1      5      0    1     0
##   Muslim           0         0      0      0    0     1
##   None            52       207     30      8 1174   176
##   Other            0         2      0      0    0     1
##
## Overall Statistics
##
## Accuracy : 0.8502
## 95% CI : (0.8416, 0.8585)
## No Information Rate : 0.7119
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.632
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: Buddhist Class: Christian Class: Jewish
## Sensitivity                 0.000000           0.9579     0.0284091
## Specificity                 1.000000           0.7165     0.9997076
## Pos Pred Value                   NaN           0.8930     0.7142857
## Neg Pred Value              0.990449           0.8733     0.9755993
## Prevalence                  0.009551           0.7119     0.0250891
## Detection Rate              0.000000           0.6820     0.0007128
## Detection Prevalence        0.000000           0.7636     0.0009979
## Balanced Accuracy           0.500000           0.8372     0.5140583
##                      Class: Muslim Class: None Class: Other
## Sensitivity              0.0000000      0.7981    0.0039216
## Specificity              0.9998564      0.9147    0.9997041
## Pos Pred Value           0.0000000      0.7128    0.3333333
## Neg Pred Value           0.9925863      0.9447    0.9637764
## Prevalence               0.0074127      0.2097    0.0363507
## Detection Rate           0.0000000      0.1674    0.0001426
## Detection Prevalence     0.0001426      0.2348    0.0004277
## Balanced Accuracy        0.4999282      0.8564    0.5018129
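To see why 85% accuracy is less impressive than it sounds, it helps to count how often each class is predicted at all. The Python sketch below works from the six-way confusion matrix of the multinomial model (rows are predictions, columns are reference labels); the helper names are illustrative, and the kappa formula is the standard Cohen's kappa for a square confusion matrix.

```python
LABELS = ["Buddhist", "Christian", "Jewish", "Muslim", "None", "Other"]

# Rows are predictions, columns are reference labels, as in the output.
MATRIX = [
    [0, 0, 0, 0, 0, 0],
    [15, 4784, 141, 44, 296, 77],
    [0, 1, 5, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [52, 207, 30, 8, 1174, 176],
    [0, 2, 0, 0, 0, 1],
]


def row_totals(matrix):
    """How many times each class was predicted."""
    return [sum(row) for row in matrix]


def column_totals(matrix):
    """How many test respondents truly belong to each class."""
    return [sum(col) for col in zip(*matrix)]


def recall_by_class(matrix):
    """Per-class sensitivity: correct predictions over true class size."""
    cols = column_totals(matrix)
    return [matrix[i][i] / cols[i] for i in range(len(matrix))]


def cohen_kappa(matrix):
    """Agreement beyond what the row and column totals alone would give."""
    rows, cols = row_totals(matrix), column_totals(matrix)
    n = sum(rows)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    expected = sum(r * c for r, c in zip(rows, cols)) / n ** 2
    return (observed - expected) / (1 - expected)


for label, count, recall in zip(LABELS, row_totals(MATRIX), recall_by_class(MATRIX)):
    print(f"{label:10s} predicted {count:5d} times, recall {recall:.4f}")
print(f"kappa {cohen_kappa(MATRIX):.3f}")
```

Only 11 of the 7,015 test predictions fall outside “Christian” and “None”, so the six-way model is in practice a two-way model, and the recall for the four small classes is close to zero.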
Finally, I tried a random forest to predict RELTRAD6. Sadly, it doesn’t do any better: accuracy is again 85%, and it predicts a class other than “Christian” or “None” only twice.
## Confusion Matrix and Statistics
##
##             Reference
## Prediction  Buddhist Christian Jewish Muslim None Other
##   Buddhist         0         0      0      0    0     0
##   Christian       13      4792    143     45  303    76
##   Jewish           0         0      1      0    0     0
##   Muslim           0         0      0      0    0     0
##   None            54       202     32      7 1168   178
##   Other            0         0      0      0    0     1
##
## Overall Statistics
##
## Accuracy : 0.8499
## 95% CI : (0.8413, 0.8582)
## No Information Rate : 0.7119
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6301
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: Buddhist Class: Christian Class: Jewish
## Sensitivity                 0.000000           0.9596     0.0056818
## Specificity                 1.000000           0.7130     1.0000000
## Pos Pred Value                   NaN           0.8920     1.0000000
## Neg Pred Value              0.990449           0.8771     0.9750499
## Prevalence                  0.009551           0.7119     0.0250891
## Detection Rate              0.000000           0.6831     0.0001426
## Detection Prevalence        0.000000           0.7658     0.0001426
## Balanced Accuracy           0.500000           0.8363     0.5028409
##                      Class: Muslim Class: None Class: Other
## Sensitivity               0.000000      0.7940    0.0039216
## Specificity               1.000000      0.9147    1.0000000
## Pos Pred Value                 NaN      0.7118    1.0000000
## Neg Pred Value            0.992587      0.9436    0.9637867
## Prevalence                0.007413      0.2097    0.0363507
## Detection Rate            0.000000      0.1665    0.0001426
## Detection Prevalence      0.000000      0.2339    0.0001426
## Balanced Accuracy         0.500000      0.8544    0.5019608
[1] Pew Research Center bears no responsibility for the interpretations presented or conclusions reached based on analysis of the data.