Kaggle Titanic
I thought I’d start getting into Kaggle to work on some non-finance data to get a feel for the messiness of real-world information. Kaggle’s introductory competition is about predicting which passengers on the Titanic are going to survive using a handful of features, so let’s launch into mucking about. This post follows a “lab book” style and is quite scattered, as I develop ideas about what to do.
# Libraries
library(tidyverse)
library(stringr)
# Load data
train <- read_csv("../../data/Titanic/train.csv")
test <- read_csv("../../data/Titanic/test.csv")
Now let’s take a look at the data.
str(train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
What I want to do first is add a couple of features. DataCamp's excellent tutorial [1] on this data set uses Title and FamilySize, which I'll add now. I also thought it might be cool to separate out family names, to see whether certain families were more likely to survive.
# Do this to both test and train, so we don't have to reconcile them later.
test <- test %>%
  mutate(Surname = as.factor(word(Name, sep = fixed(","))),
         Title = word(Name, start = 1, sep = fixed(".")))
test$Title <- test$Title %>%
  str_replace("[.]", "") %>%
  word(start = -1) %>%
  as.factor()
# Lump uncommon titles (fewer than five passengers) into "Other"
common <- test %>%
  group_by(Title) %>%
  count() %>%
  filter(n >= 5)
levels(test$Title) <- c(levels(test$Title), "Other")
test$Title[!(test$Title %in% common$Title)] <- "Other"
test$Title <- droplevels(test$Title)
# Embarkment location and gender to factors
test$Embarked <- as.factor(test$Embarked)
test$Sex <- as.factor(test$Sex)
test$FamilySize <- test$Parch + test$SibSp + 1
# Same changes for the training set
train <- train %>%
  mutate(Surname = as.factor(word(Name, sep = fixed(","))),
         Title = word(Name, start = 1, sep = fixed(".")))
train$Title <- train$Title %>%
  str_replace("[.]", "") %>%
  word(start = -1) %>%
  as.factor()
# Lump uncommon titles (fewer than five passengers) into "Other"
common <- train %>%
  group_by(Title) %>%
  count() %>%
  filter(n >= 5)
levels(train$Title) <- c(levels(train$Title), "Other")
train$Title[!(train$Title %in% common$Title)] <- "Other"
train$Title <- droplevels(train$Title)
# Embarkment location and gender to factors
train$Embarked <- as.factor(train$Embarked)
train$Sex <- as.factor(train$Sex)
train$FamilySize <- train$Parch + train$SibSp + 1
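The two blocks above are identical apart from which data set they touch, so the feature engineering could be wrapped in a single function and applied to both. A minimal sketch (add_features is a name I'm introducing, not something from the original code):

add_features <- function(df) {
  # Surname is everything before the comma; Title is the last word
  # before the first period (e.g. "Mr", "Miss", "Master")
  df <- df %>%
    mutate(Surname = as.factor(word(Name, sep = fixed(","))),
           Title = as.factor(word(word(Name, start = 1, sep = fixed(".")), start = -1)))
  # Lump titles with fewer than five passengers into "Other"
  common <- df %>% group_by(Title) %>% count() %>% filter(n >= 5)
  levels(df$Title) <- c(levels(df$Title), "Other")
  df$Title[!(df$Title %in% common$Title)] <- "Other"
  df$Title <- droplevels(df$Title)
  # Factors and family size, as above
  df$Embarked <- as.factor(df$Embarked)
  df$Sex <- as.factor(df$Sex)
  df$FamilySize <- df$Parch + df$SibSp + 1
  df
}
# test <- add_features(test); train <- add_features(train)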
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## female:314 Min. : 0.42 Min. :0.000 Min. :0.0000
## male :577 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 C :168
## Class :character 1st Qu.: 7.91 Class :character Q : 77
## Mode :character Median : 14.45 Mode :character S :644
## Mean : 32.20 NA's: 2
## 3rd Qu.: 31.00
## Max. :512.33
##
## Surname Title FamilySize
## Andersson: 9 Dr : 7 Min. : 1.000
## Sage : 7 Master: 40 1st Qu.: 1.000
## Carter : 6 Miss :182 Median : 1.000
## Goodwin : 6 Mr :517 Mean : 1.905
## Johnson : 6 Mrs :125 3rd Qu.: 2.000
## Panula : 6 Rev : 6 Max. :11.000
## (Other) :851 Other : 14
That’s a lot of Anderssons! I wonder if they’re related - let’s check family size by surname.
train %>%
filter(Surname == "Andersson") %>%
select(Name, FamilySize, Survived, SibSp, Parch)
## # A tibble: 9 × 5
## Name FamilySize
## <chr> <dbl>
## 1 Andersson, Mr. Anders Johan 7
## 2 Andersson, Miss. Erna Alexandra 7
## 3 Andersson, Miss. Ellis Anna Maria 7
## 4 Andersson, Mr. August Edvard ("Wennerstrom") 1
## 5 Andersson, Miss. Ingeborg Constanzia 7
## 6 Andersson, Miss. Sigrid Elisabeth 7
## 7 Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren) 7
## 8 Andersson, Miss. Ebba Iris Alfrida 7
## 9 Andersson, Master. Sigvard Harald Elias 7
## # ... with 3 more variables: Survived <int>, SibSp <int>, Parch <int>
They're nearly all one family - only Erna and August aren't - and the entire family died. This is a really sad data set.
NAs are the bane of any good analysis, and I want to deal with as many of them as I can.
clean_age <- function(df) {
  # Replace missing ages with the column mean.
  df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
  return(df)
}
clean_embarkment <- function(df) {
  # Most people embarked at 'S', so just set the two
  # missing values to 'S'.
  df$Embarked[is.na(df$Embarked)] <- 'S'
  return(df)
}
test <- clean_age(test)
train <- clean_age(train)
test <- clean_embarkment(test)
train <- clean_embarkment(train)
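One caveat worth flagging: clean_age() fills each data set with its own mean, so the test ages are imputed with test-set statistics rather than the training mean, which is the more conventional choice. A variant that takes the fill value as an argument would allow that (a sketch; clean_age2 is a hypothetical name):

clean_age2 <- function(df, fill = mean(df$Age, na.rm = TRUE)) {
  # Fill missing ages with a supplied value; defaults to this set's own mean
  df$Age[is.na(df$Age)] <- fill
  df
}
# train <- clean_age2(train)
# test <- clean_age2(test, fill = mean(train$Age, na.rm = TRUE))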
I also want to scale all my features to between 0 and 1, to make processing easier. This also means scrapping the character columns and turning all the factor variables into numbers.
scaler <- function(x, ...) {(x - min(x, ...)) / (max(x, ...) - min(x, ...))}
cleanse <- function(df) {
  # Remove character variables
  df <- subset(df, select = -c(Name, Ticket, Cabin, Surname))
  # Set aside the IDs (and the label, if this is the training set) before scaling
  if ('Survived' %in% colnames(df)) {
    id <- select(df, Survived, PassengerId)
    df <- subset(df, select = -c(Survived, PassengerId))
  } else {
    id <- select(df, PassengerId)
    df <- subset(df, select = -c(PassengerId))
  }
  # Convert factors to integers
  factname <- c('Embarked', 'Title', 'Sex')
  df[, factname] <- lapply(df[, factname], as.integer)
  # Scale everything to [0, 1]
  df <- as_tibble(map(df, scaler, na.rm = TRUE))
  # Reattach the IDs and label
  df$PassengerId <- id$PassengerId
  if ('Survived' %in% colnames(id)) {
    df$Survived <- id$Survived
  }
  return(df)
}
train_scl <- cleanse(train)
test_scl <- cleanse(test)
summary(train_scl)
## Pclass Sex Age SibSp
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.5000 1st Qu.:0.0000 1st Qu.:0.2712 1st Qu.:0.00000
## Median :1.0000 Median :1.0000 Median :0.3679 Median :0.00000
## Mean :0.6543 Mean :0.6476 Mean :0.3679 Mean :0.06538
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.4345 3rd Qu.:0.12500
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Parch Fare Embarked Title
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.01544 1st Qu.:0.5000 1st Qu.:0.3333
## Median :0.0000 Median :0.02821 Median :1.0000 Median :0.5000
## Mean :0.0636 Mean :0.06286 Mean :0.7682 Mean :0.4805
## 3rd Qu.:0.0000 3rd Qu.:0.06051 3rd Qu.:1.0000 3rd Qu.:0.5000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## FamilySize PassengerId Survived
## Min. :0.00000 Min. : 1.0 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:223.5 1st Qu.:0.0000
## Median :0.00000 Median :446.0 Median :0.0000
## Mean :0.09046 Mean :446.0 Mean :0.3838
## 3rd Qu.:0.10000 3rd Qu.:668.5 3rd Qu.:1.0000
## Max. :1.00000 Max. :891.0 Max. :1.0000
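A related caveat to the age imputation: cleanse() rescales each data set with its own minima and maxima, so the same raw Fare can map to different scaled values in train and test. A variant that reuses the training ranges for the test set might look like this (a sketch; scaler_ref is a name I'm introducing):

# Hypothetical helper: scale by reference ranges taken from the training set
scaler_ref <- function(x, lo, hi) {(x - lo) / (hi - lo)}
# e.g. for Fare:
# test_scl$Fare <- scaler_ref(test$Fare,
#                             min(train$Fare, na.rm = TRUE),
#                             max(train$Fare, na.rm = TRUE))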
Let’s start the analysis with a good old-fashioned logistic regression. Throw everything we’ve got into the pot.
logit <- glm(Survived ~ Pclass + Sex + Age + SibSp +
Parch + Fare + Embarked + Title,
family = binomial(),
data = train_scl,
na.action = na.omit)
summary(logit)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked + Title, family = binomial(), data = train_scl,
## na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6460 -0.5874 -0.4168 0.6330 2.4134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.4881 0.5158 8.701 < 2e-16 ***
## Pclass -2.1785 0.2789 -7.811 5.66e-15 ***
## Sex -2.7493 0.1995 -13.783 < 2e-16 ***
## Age -2.7969 0.6686 -4.183 2.87e-05 ***
## SibSp -2.7339 0.8755 -3.123 0.00179 **
## Parch -0.5260 0.7104 -0.740 0.45910
## Fare 0.9325 1.2190 0.765 0.44427
## Embarked -0.4345 0.2300 -1.889 0.05891 .
## Title -0.8603 0.6672 -1.289 0.19724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 783.39 on 882 degrees of freedom
## AIC: 801.39
##
## Number of Fisher Scoring iterations: 5
Basically, what the above tells us is that pretty much everything decreases your chances of living. You start from a high baseline (the intercept is about 4.49) and decrease from there. After encoding and scaling, men have a Sex of 1 and women a Sex of 0, so being a man is a strong predictor of dying. Age has the largest coefficient in magnitude (being older decreases your chances of living), though Sex is by far the most significant predictor.
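Since these coefficients are on the log-odds scale, exponentiating them turns them into odds ratios, which are a bit easier to read. A quick check on the fitted model:

# Odds ratios: e.g. the value for Sex is the multiplicative change in
# odds of survival going from female (0) to male (1)
exp(coef(logit))

Now let's take the testing data set and predict what we think the results are likely to be.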
# Now we predict using the model
threshold <- 0.5
logit_pred <- predict(logit, newdata = test_scl, type = 'response')
hist(logit_pred)
logit_pred <- ifelse(logit_pred > threshold, 1, 0)
# The test set has one missing Fare, which survives scaling as an NA and
# yields an NA prediction; default it to 0 (died).
logit_pred[is.na(logit_pred)] <- 0
summary(logit_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3589 1.0000 1.0000
Cool. The predicted survival rate (roughly 36%) is close to the training set's 38%, which is reassuring. Let's export it and see what results we get!
all <- data.frame(test$PassengerId, logit_pred)
colnames(all) <- c("PassengerID", "Survived")
write_csv(all, "../../data/Titanic/predictions/logit_prediction.csv")
Neural Networks
After submitting to Kaggle, this method gives me an accuracy of 76%, worse than my earlier random forest submission, which gave 79%. Let me see if a neural network does any better.
library(neuralnet)
set.seed(91)
# Model the neural network: three hidden layers of two nodes each.
# neuralnet's `threshold` is a stopping criterion on the error gradient,
# not a classification cutoff.
nnet <- neuralnet(Survived ~ Pclass + Sex + Age + SibSp +
                    Parch + Fare + Embarked + Title,
                  hidden = c(2, 2, 2),
                  threshold = 0.035,
                  stepmax = 400000000,
                  data = train_scl,
                  lifesign = 'full')
## hidden: 2, 2, 2 thresh: 0.035 rep: 1/1 steps: 1000 min thresh: 0.1103394579
## 2000 min thresh: 0.05215387676
## 3000 min thresh: 0.04034919191
## 4000 min thresh: 0.03822729544
## 5000 min thresh: 0.03554090753
## 5122 error: 54.00459 time: 3.79 secs
# Predict the test set
nnet.c <- compute(nnet, test_scl[,1:8])
nnet.c <- nnet.c$net.result
hist(nnet.c)
# Classify with the 0.5 cutoff from earlier (the `threshold` variable,
# not neuralnet's threshold argument)
nnet.c <- ifelse(nnet.c > threshold, 1, 0)
# If we're missing data, predict 0.
nnet.c[is.na(nnet.c)] <- 0
summary(nnet.c)
## V1
## Min. :0.0000000
## 1st Qu.:0.0000000
## Median :0.0000000
## Mean :0.3755981
## 3rd Qu.:1.0000000
## Max. :1.0000000
all <- data.frame(test$PassengerId, nnet.c)
colnames(all) <- c("PassengerID", "Survived")
write_csv(all, "../../data/Titanic/predictions/nnet_prediction.csv")
I've run a lot of other experiments on a variety of neural networks, with up to five layers and a range of node counts. The only configuration that ever matched the random forest's 79% accuracy was a relatively uncomplicated network with three layers of two nodes. I suspect that survival in this data set is better predicted by other algorithms.
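For future experiments, a quick local holdout split would let me compare models without burning Kaggle submissions. A minimal sketch using the logistic specification from earlier (the 80/20 split is an arbitrary choice on my part):

set.seed(91)
idx <- sample(nrow(train_scl), size = floor(0.8 * nrow(train_scl)))
holdout_fit <- glm(Survived ~ Pclass + Sex + Age + SibSp +
                     Parch + Fare + Embarked + Title,
                   family = binomial(), data = train_scl[idx, ])
holdout_pred <- ifelse(predict(holdout_fit, newdata = train_scl[-idx, ],
                               type = 'response') > 0.5, 1, 0)
# Proportion of held-out passengers classified correctly
mean(holdout_pred == train_scl$Survived[-idx])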
[1] I used random forests and decision trees as my first submissions. DataCamp's tutorial does an excellent job explaining the methodology and code, so check out the hyperlink above if you're interested.