Predicting Religion

Intro

Last week I looked at the “Pew Research Center 2014 U.S. Religious Landscape Study”¹, and presented some interesting aspects of the survey responses. This week I want to see if I can use survey responses to predict a respondent’s religion. That’s right, a predictive religious test!

I make a logistic regression model that predicts Christian/non-Christian status with 89% accuracy. A random forest model doesn’t do any better. I put this model into a Shiny app, so you can see what it predicts for you:

https://lukewolcott.shinyapps.io/InTheResistance_Week14/

Then I try to predict religious affiliation out of six categories: Christian, Muslim, Jewish, Buddhist, Other, None. For this I use a multinomial logistic regression. As you’ll see below, it is 85% accurate. That sounds better than it is: a “null prediction” of “Christian” for everyone would already be right 71% of the time, since that is the percentage of Christians in the dataset, and the confusion matrix shows that the model almost never predicts anything other than Christian or None. I tried a random forest model as well for the 6-way prediction, and it had basically the same performance.

Feature engineering

I want to choose survey questions that don’t directly ask about a person’s religion. For example, here are some that I’ll include.

Q.A1: Generally, how would you say things are these days in your life – would you say that you are very happy, pretty happy, or not too happy?

Q.I4: Now, thinking about some different kinds of experiences, how often do you feel a deep sense of spiritual peace and well-being?… feel a deep sense of wonder about the universe?… feel a strong sense of gratitude or thankfulness?… think about the meaning and purpose of life?

Q.M5: As I read a short list of statements about churches and other religious organizations, please tell me if you agree or disagree with each one. First, churches and other religious organizations focus too much on rules?… Play an important role in helping the poor and needy?… Are too involved with politics?… Protect and strengthen morality in society?… Are too concerned with money and power?… Bring people together and strengthen community bonds?

There are three questions that ask about views on homosexuality, but since the survey was taken in 2014 I’m worried (hoping, really) that views have changed, and this will not be relevant in 2017.

We also have various demographic questions – about highest level of education, family income, political affiliation, etc. – that will be good to include.

The RELTRAD variable bins the respondents’ religions into 16 broad categories (including a ‘None’). But I’m going to bin these further into a variable RELTRAD6 with only six: Christian, Muslim, Jewish, Buddhist, Other, and None. Also, I’ll create a variable CHRISTIAN that bins everyone as self-reporting as either Christian or not.

The survey asks about the religion (if any) of the respondent, and also the religion (if any) they were raised with. I’m going to create a variable KEPTREL that keeps track of whether or not the person kept the religion they were raised with (using the six bins, not the 16).

Look at the .Rmd file to see all the code for this.
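As a rough illustration of the binning described above (in Python, not the R from the .Rmd), here is a minimal sketch. The tradition labels are placeholders I made up for the example; the real RELTRAD codes are in the Pew codebook. I'm also assuming KEPTREL counts someone raised with no religion who still has none as having "kept" it, which the actual code may handle differently.

```python
# Illustrative labels only; the real RELTRAD codes are in the Pew codebook.
CHRISTIAN_LABELS = {"Evangelical Protestant", "Mainline Protestant", "Catholic"}
DIRECT_LABELS = {"Muslim", "Jewish", "Buddhist"}
NONE_LABELS = {"Unaffiliated"}


def to_reltrad6(label):
    """Collapse a detailed tradition label into one of the six bins."""
    if label in CHRISTIAN_LABELS:
        return "Christian"
    if label in DIRECT_LABELS:
        return label
    if label in NONE_LABELS:
        return "None"
    return "Other"


def derive_features(current, raised):
    """Build RELTRAD6, CHRISTIAN and KEPTREL for one respondent."""
    now6 = to_reltrad6(current)
    raised6 = to_reltrad6(raised)
    return {
        "RELTRAD6": now6,
        "CHRISTIAN": "Christian" if now6 == "Christian" else "non-Christian",
        # Compared on the six bins, not the detailed labels.
        "KEPTREL": now6 == raised6,
    }


example = derive_features(current="Unaffiliated", raised="Catholic")
print(example)
```

Because KEPTREL is compared on the six bins, someone raised Catholic who is now Mainline Protestant still counts as having kept their religion in this sketch.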

In case you’re curious, in the 35,071 survey responses here is the percentage breakdown of Christian and non-Christian.

## 
##     Christian non-Christian 
##         71.42         28.58

Christian/non-Christian logistic regression model

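First we split the 35071 x 20 dataset into a training set and a test set. Judging from the output below, this is roughly an 80/20 split: 28,056 rows for training and 7,015 for testing.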
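The row counts in the later output (a null residual df of 28055, and 7,015 rows in the test confusion matrix) are consistent with an 80/20 split. Here is a minimal Python sketch of such a split. The 0.8 fraction is inferred from those counts, and the plain shuffle is my assumption; the actual R code may well use a stratified partition instead.

```python
import random

N_ROWS = 35071          # survey responses, from the text
TRAIN_FRACTION = 0.8    # assumption inferred from the reported row counts


def split_indices(n_rows, train_fraction, seed=0):
    """Shuffle row indices and cut them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(N_ROWS, TRAIN_FRACTION)
print(len(train_idx), len(test_idx))
```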

Then we build the model using logistic regression.

For starters, here’s the analysis of deviance (anova) table. The GitHub repo for this report has the codebook that explains what these questions are, if you can’t tell from what I said above.

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: CHRISTIAN
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    28055      33544              
## qa1      3     56.8     28052      33488 2.839e-12 ***
## qi4a     5   2031.3     28047      31456 < 2.2e-16 ***
## qi4b     5    344.4     28042      31112 < 2.2e-16 ***
## qi4c     5    144.9     28037      30967 < 2.2e-16 ***
## qi4d     5    154.8     28032      30812 < 2.2e-16 ***
## qm5a     2   1182.1     28030      29630 < 2.2e-16 ***
## qm5b     2    266.4     28028      29364 < 2.2e-16 ***
## qm5d     2    640.3     28026      28723 < 2.2e-16 ***
## qm5e     2   1008.2     28024      27715 < 2.2e-16 ***
## qm5f     2      8.9     28022      27706   0.01175 *  
## qm5g     2     30.0     28020      27676 3.110e-07 ***
## agerec  15    871.8     28005      26805 < 2.2e-16 ***
## educ     8    334.5     27997      26470 < 2.2e-16 ***
## income   9     35.9     27988      26434 4.191e-05 ***
## party    5    518.7     27983      25915 < 2.2e-16 ***
## ideo     5    550.3     27978      25365 < 2.2e-16 ***
## SEX      1     90.5     27977      25275 < 2.2e-16 ***
## KEPTREL  1   9403.5     27976      15871 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pretty much all the questions are significant. The weakest is qm5f (p ≈ 0.012), which asks if you agree or disagree that churches and other religious organizations “bring people together and strengthen community bonds”. As we saw last week, people agreed with this across the board for the most part.

Since there is still plenty of residual deviance (15871 on 27976 degrees of freedom), I could certainly include more of the 100 questions asked. But I’m keeping it to this small set of questions because I like them.
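To get a feel for which terms do the most work, the Deviance column can be turned into shares of the total drop from null to residual deviance. This is a small Python sketch using only the numbers printed in the table. Keep in mind that the terms are added sequentially, so each share is what that term adds on top of everything listed before it, and the ranking could change with a different ordering.

```python
# (term, deviance) pairs copied from the analysis of deviance table.
DEVIANCE_BY_TERM = [
    ("qa1", 56.8), ("qi4a", 2031.3), ("qi4b", 344.4), ("qi4c", 144.9),
    ("qi4d", 154.8), ("qm5a", 1182.1), ("qm5b", 266.4), ("qm5d", 640.3),
    ("qm5e", 1008.2), ("qm5f", 8.9), ("qm5g", 30.0), ("agerec", 871.8),
    ("educ", 334.5), ("income", 35.9), ("party", 518.7), ("ideo", 550.3),
    ("SEX", 90.5), ("KEPTREL", 9403.5),
]
NULL_DEVIANCE = 33544
RESIDUAL_DEVIANCE = 15871

# The per-term deviances should add up to the overall drop in deviance.
total_drop = sum(dev for _, dev in DEVIANCE_BY_TERM)
shares = {term: dev / total_drop for term, dev in DEVIANCE_BY_TERM}

# Print the three terms with the largest share of the drop.
top_three = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:3]
for term, share in top_three:
    print(f"{term:8s} {share:6.1%}")
```

KEPTREL alone accounts for a bit over half of the total drop, even though it enters last.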

We can check how the model predicts on the test dataset.

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Christian non-Christian
##   Christian          4747           500
##   non-Christian       247          1521
##                                           
##                Accuracy : 0.8935          
##                  95% CI : (0.8861, 0.9006)
##     No Information Rate : 0.7119          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7304          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9505          
##             Specificity : 0.7526          
##          Pos Pred Value : 0.9047          
##          Neg Pred Value : 0.8603          
##              Prevalence : 0.7119          
##          Detection Rate : 0.6767          
##    Detection Prevalence : 0.7480          
##       Balanced Accuracy : 0.8516          
##                                           
##        'Positive' Class : Christian       
##

So I have a model that can predict, with 89% accuracy, whether or not you are a Christian, from answers you give to survey questions that don’t directly ask about your religion!
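The headline statistics can be recomputed by hand from the four cells of the confusion matrix. Here is a short Python sketch of that arithmetic (the report itself gets these from R); it only uses the counts printed above, with “Christian” as the positive class.

```python
# Test-set counts from the logistic regression confusion matrix.
tp, fp = 4747, 500   # predicted Christian: actually Christian / actually not
fn, tn = 247, 1521   # predicted non-Christian: actually Christian / actually not
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)      # share of Christians correctly identified
specificity = tn / (tn + fp)      # share of non-Christians correctly identified
balanced_accuracy = (sensitivity + specificity) / 2
baseline = (tp + fn) / total      # accuracy of always guessing "Christian"

# Cohen's kappa: agreement beyond what the row/column totals give by chance.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy    {accuracy:.4f}  (baseline {baseline:.4f})")
print(f"sensitivity {sensitivity:.4f}  specificity {specificity:.4f}")
print(f"balanced    {balanced_accuracy:.4f}  kappa {kappa:.4f}")
```

The gap between sensitivity and specificity is worth noticing: the model misses about a quarter of the non-Christians, which the single 89% figure hides.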

Certainly, this needs to be made into a Shiny app so anyone can answer the questions and see what the model predicts. Here it is:

https://lukewolcott.shinyapps.io/InTheResistance_Week14/

Random forest algorithm to predict Christian/non-Christian

I tried a random forest model, but it didn’t perform any better: 89.1% accuracy against 89.4% for the logistic regression, with overlapping confidence intervals.

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Christian non-Christian
##   Christian          4734           503
##   non-Christian       260          1518
##                                           
##                Accuracy : 0.8912          
##                  95% CI : (0.8837, 0.8984)
##     No Information Rate : 0.7119          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.725           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9479          
##             Specificity : 0.7511          
##          Pos Pred Value : 0.9040          
##          Neg Pred Value : 0.8538          
##              Prevalence : 0.7119          
##          Detection Rate : 0.6748          
##    Detection Prevalence : 0.7465          
##       Balanced Accuracy : 0.8495          
##                                           
##        'Positive' Class : Christian       
##

Multinomial logistic regression

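We can go back to the RELTRAD6 variable, with 6 bins for religious affiliation, and fit it with a multinomial logistic regression. Note from the trace below that the optimizer stopped at 100 iterations, the default cap, rather than reporting convergence. The objective was barely moving by then, so a longer run would probably change little.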

## # weights:  486 (400 variable)
## initial  value 50269.603669 
## iter  10 value 17541.289581
## iter  20 value 16225.918558
## iter  30 value 14856.361417
## iter  40 value 14214.671262
## iter  50 value 13497.700197
## iter  60 value 13230.129449
## iter  70 value 13124.614897
## iter  80 value 13104.588385
## iter  90 value 13098.882660
## iter 100 value 13095.801330
## final  value 13095.801330 
## stopped after 100 iterations
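A quick way to read this trace, assuming the “value” is the negative log-likelihood on the training set (which is what this output format suggests): before any fitting, all six classes get probability 1/6, so the starting value should be n·ln(6). A small Python check, where the 28,056 training rows are inferred from the null residual df of 28055 in the earlier table:

```python
import math

n_train = 28056   # training rows: the null residual df of 28055, plus one
n_classes = 6     # RELTRAD6 bins

# Negative log-likelihood when every class is given probability 1/6.
initial_nll = n_train * math.log(n_classes)

# "final value" copied from the trace above.
final_nll = 13095.801330
mean_nll_per_row = final_nll / n_train

print(round(initial_nll, 3))
print(round(mean_nll_per_row, 4))
```

The first number reproduces the “initial value” line, which supports reading the trace this way.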

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Buddhist Christian Jewish Muslim None Other
##   Buddhist         0         0      0      0    0     0
##   Christian       15      4784    141     44  296    77
##   Jewish           0         1      5      0    1     0
##   Muslim           0         0      0      0    0     1
##   None            52       207     30      8 1174   176
##   Other            0         2      0      0    0     1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8502          
##                  95% CI : (0.8416, 0.8585)
##     No Information Rate : 0.7119          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.632           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Buddhist Class: Christian Class: Jewish
## Sensitivity                 0.000000           0.9579     0.0284091
## Specificity                 1.000000           0.7165     0.9997076
## Pos Pred Value                   NaN           0.8930     0.7142857
## Neg Pred Value              0.990449           0.8733     0.9755993
## Prevalence                  0.009551           0.7119     0.0250891
## Detection Rate              0.000000           0.6820     0.0007128
## Detection Prevalence        0.000000           0.7636     0.0009979
## Balanced Accuracy           0.500000           0.8372     0.5140583
##                      Class: Muslim Class: None Class: Other
## Sensitivity              0.0000000      0.7981    0.0039216
## Specificity              0.9998564      0.9147    0.9997041
## Pos Pred Value           0.0000000      0.7128    0.3333333
## Neg Pred Value           0.9925863      0.9447    0.9637764
## Prevalence               0.0074127      0.2097    0.0363507
## Detection Rate           0.0000000      0.1674    0.0001426
## Detection Prevalence     0.0001426      0.2348    0.0004277
## Balanced Accuracy        0.4999282      0.8564    0.5018129
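To see why 85% is less impressive than it sounds, it helps to pull per-class numbers straight out of the confusion matrix. This Python sketch uses only the counts printed above for the multinomial model: it recomputes accuracy, the always-guess-Christian baseline, per-class sensitivity, and how often the model predicts one of the two big classes.

```python
CLASSES = ["Buddhist", "Christian", "Jewish", "Muslim", "None", "Other"]

# Rows are predictions, columns are true labels, as in the printed matrix.
MATRIX = [
    [0, 0, 0, 0, 0, 0],
    [15, 4784, 141, 44, 296, 77],
    [0, 1, 5, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [52, 207, 30, 8, 1174, 176],
    [0, 2, 0, 0, 0, 1],
]

n = len(CLASSES)
total = sum(sum(row) for row in MATRIX)
accuracy = sum(MATRIX[i][i] for i in range(n)) / total

# Column totals are the true class sizes; row totals are the predicted sizes.
true_totals = [sum(row[j] for row in MATRIX) for j in range(n)]
predicted_totals = dict(zip(CLASSES, (sum(row) for row in MATRIX)))

baseline = max(true_totals) / total
sensitivity = {c: MATRIX[j][j] / true_totals[j] for j, c in enumerate(CLASSES)}
big_two_share = (predicted_totals["Christian"] + predicted_totals["None"]) / total

print(f"accuracy {accuracy:.4f}, baseline {baseline:.4f}")
for name in CLASSES:
    print(f"{name:10s} sensitivity {sensitivity[name]:.4f}")
print(f"predictions that are Christian or None: {big_two_share:.4f}")
```

Almost every prediction is Christian or None, so the four smaller groups are essentially never identified.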

Finally, I tried a random forest algorithm to predict RELTRAD6. Sadly, it doesn’t do any better: accuracy is again 85.0%, and it still predicts almost nothing but Christian and None.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Buddhist Christian Jewish Muslim None Other
##   Buddhist         0         0      0      0    0     0
##   Christian       13      4792    143     45  303    76
##   Jewish           0         0      1      0    0     0
##   Muslim           0         0      0      0    0     0
##   None            54       202     32      7 1168   178
##   Other            0         0      0      0    0     1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8499          
##                  95% CI : (0.8413, 0.8582)
##     No Information Rate : 0.7119          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6301          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Buddhist Class: Christian Class: Jewish
## Sensitivity                 0.000000           0.9596     0.0056818
## Specificity                 1.000000           0.7130     1.0000000
## Pos Pred Value                   NaN           0.8920     1.0000000
## Neg Pred Value              0.990449           0.8771     0.9750499
## Prevalence                  0.009551           0.7119     0.0250891
## Detection Rate              0.000000           0.6831     0.0001426
## Detection Prevalence        0.000000           0.7658     0.0001426
## Balanced Accuracy           0.500000           0.8363     0.5028409
##                      Class: Muslim Class: None Class: Other
## Sensitivity               0.000000      0.7940    0.0039216
## Specificity               1.000000      0.9147    1.0000000
## Pos Pred Value                 NaN      0.7118    1.0000000
## Neg Pred Value            0.992587      0.9436    0.9637867
## Prevalence                0.007413      0.2097    0.0363507
## Detection Rate            0.000000      0.1665    0.0001426
## Detection Prevalence      0.000000      0.2339    0.0001426
## Balanced Accuracy         0.500000      0.8544    0.5019608

¹ Pew Research Center bears no responsibility for the interpretations presented or conclusions reached based on analysis of the data.