Kaggle Titanic
I thought I’d start getting into Kaggle to work on some non-finance data to get a feel for the messiness of real-world information. Kaggle’s introductory competition is about predicting which passengers on the Titanic are going to survive using a handful of features, so let’s launch into mucking about. This post follows a “lab book” style and is quite scattered, as I develop ideas about what to do.
# Libraries
library(tidyverse)
library(stringr)
# Load data
train <- read_csv("../../data/Titanic/train.csv")
test <- read_csv("../../data/Titanic/test.csv")
Now let’s take a look at the data.
str(train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
What I want to do first is add a couple of features. DataCamp's excellent tutorial [1] on this data set uses Title and FamilySize, which I'll add now. I also thought it might be cool to separate out family names, to see whether certain families were more likely to survive.
# Do this to both test and train, so we don't have to reconcile them later.
test <- test %>%
  mutate(Surname = as.factor(word(Name, sep = fixed(","))),
         Title = word(Name, start = 1, sep = fixed(".")))
test$Title <- test$Title %>%
  str_replace("[.]", "") %>%
  word(start = -1) %>%
  as.factor()
# Lump uncommon titles (fewer than five passengers) into "Other"
common <- test %>%
  group_by(Title) %>%
  count() %>%
  filter(n >= 5)
levels(test$Title) <- c(levels(test$Title), "Other")
test$Title[!(test$Title %in% common$Title)] <- "Other"
test$Title <- droplevels(test$Title)
# Embarkment location and gender to factors
test$Embarked <- as.factor(test$Embarked)
test$Sex <- as.factor(test$Sex)
test$FamilySize <- test$Parch + test$SibSp + 1
# Same changes for the training set
train <- train %>%
  mutate(Surname = as.factor(word(Name, sep = fixed(","))),
         Title = word(Name, start = 1, sep = fixed(".")))
train$Title <- train$Title %>%
  str_replace("[.]", "") %>%
  word(start = -1) %>%
  as.factor()
# Lump uncommon titles (fewer than five passengers) into "Other"
common <- train %>%
  group_by(Title) %>%
  count() %>%
  filter(n >= 5)
levels(train$Title) <- c(levels(train$Title), "Other")
train$Title[!(train$Title %in% common$Title)] <- "Other"
train$Title <- droplevels(train$Title)
# Embarkment location and gender to factors
train$Embarked <- as.factor(train$Embarked)
train$Sex <- as.factor(train$Sex)
train$FamilySize <- train$Parch + train$SibSp + 1
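The two blocks above are identical apart from which data set they touch, so the feature engineering could be wrapped in a single function and applied to both. A minimal sketch (add_features is a name I'm introducing, not something from the original code):

add_features <- function(df) {
  # Surname is everything before the comma; Title is the last word
  # before the first period (e.g. "Mr", "Miss", "Master")
  df <- df %>%
    mutate(Surname = as.factor(word(Name, sep = fixed(","))),
           Title = as.factor(word(word(Name, start = 1, sep = fixed(".")), start = -1)))
  # Lump titles with fewer than five passengers into "Other"
  common <- df %>% group_by(Title) %>% count() %>% filter(n >= 5)
  levels(df$Title) <- c(levels(df$Title), "Other")
  df$Title[!(df$Title %in% common$Title)] <- "Other"
  df$Title <- droplevels(df$Title)
  # Factors and family size, as above
  df$Embarked <- as.factor(df$Embarked)
  df$Sex <- as.factor(df$Sex)
  df$FamilySize <- df$Parch + df$SibSp + 1
  df
}
# test <- add_features(test); train <- add_features(train)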
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## female:314 Min. : 0.42 Min. :0.000 Min. :0.0000
## male :577 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 C :168
## Class :character 1st Qu.: 7.91 Class :character Q : 77
## Mode :character Median : 14.45 Mode :character S :644
## Mean : 32.20 NA's: 2
## 3rd Qu.: 31.00
## Max. :512.33
##
## Surname Title FamilySize
## Andersson: 9 Dr : 7 Min. : 1.000
## Sage : 7 Master: 40 1st Qu.: 1.000
## Carter : 6 Miss :182 Median : 1.000
## Goodwin : 6 Mr :517 Mean : 1.905
## Johnson : 6 Mrs :125 3rd Qu.: 2.000
## Panula : 6 Rev : 6 Max. :11.000
## (Other) :851 Other : 14
That’s a lot of Anderssons! I wonder if they’re related - let’s check family size by surname.
train %>%
filter(Surname == "Andersson") %>%
select(Name, FamilySize, Survived, SibSp, Parch)
## # A tibble: 9 × 5
## Name FamilySize
## <chr> <dbl>
## 1 Andersson, Mr. Anders Johan 7
## 2 Andersson, Miss. Erna Alexandra 7
## 3 Andersson, Miss. Ellis Anna Maria 7
## 4 Andersson, Mr. August Edvard ("Wennerstrom") 1
## 5 Andersson, Miss. Ingeborg Constanzia 7
## 6 Andersson, Miss. Sigrid Elisabeth 7
## 7 Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren) 7
## 8 Andersson, Miss. Ebba Iris Alfrida 7
## 9 Andersson, Master. Sigvard Harald Elias 7
## # ... with 3 more variables: Survived <int>, SibSp <int>, Parch <int>
They're nearly all one family - only Erna and August aren't - and the entire family died. This is a really sad data set.
NAs are the bane of any good analysis, and I want to deal with as many of them as I can.
clean_age <- function(df) {
  # Replace missing ages with the column mean.
  df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
  return(df)
}
clean_embarkment <- function(df) {
  # Most people embarked at 'S', so just set the two
  # missing values to 'S'.
  df$Embarked[is.na(df$Embarked)] <- 'S'
  return(df)
}
test <- clean_age(test)
train <- clean_age(train)
test <- clean_embarkment(test)
train <- clean_embarkment(train)
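One caveat worth flagging: clean_age() fills each data set with its own mean, so the test ages are imputed with test-set statistics rather than the training mean, which is the more conventional choice. A variant that takes the fill value as an argument would allow that (a sketch; clean_age2 is a hypothetical name):

clean_age2 <- function(df, fill = mean(df$Age, na.rm = TRUE)) {
  # Fill missing ages with a supplied value; defaults to this set's own mean
  df$Age[is.na(df$Age)] <- fill
  df
}
# train <- clean_age2(train)
# test <- clean_age2(test, fill = mean(train$Age, na.rm = TRUE))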
I also want to scale all my features to between 0 and 1, to make processing easier. This also means scrapping the character columns and turning all the factor variables into numbers.
scaler <- function(x, ...) {(x - min(x, ...)) / (max(x, ...) - min(x, ...))}
cleanse <- function(df) {
  # Remove character variables
  df <- subset(df, select = -c(Name, Ticket, Cabin, Surname))
  # Set aside the IDs (and the label, if this is the training set) before scaling
  if ('Survived' %in% colnames(df)) {
    id <- select(df, Survived, PassengerId)
    df <- subset(df, select = -c(Survived, PassengerId))
  } else {
    id <- select(df, PassengerId)
    df <- subset(df, select = -c(PassengerId))
  }
  # Convert factors to integers
  factname <- c('Embarked', 'Title', 'Sex')
  df[, factname] <- lapply(df[, factname], as.integer)
  # Scale everything to [0, 1]
  df <- as_tibble(map(df, scaler, na.rm = TRUE))
  # Reattach the IDs and label
  df$PassengerId <- id$PassengerId
  if ('Survived' %in% colnames(id)) {
    df$Survived <- id$Survived
  }
  return(df)
}
train_scl <- cleanse(train)
test_scl <- cleanse(test)
summary(train_scl)
## Pclass Sex Age SibSp
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.5000 1st Qu.:0.0000 1st Qu.:0.2712 1st Qu.:0.00000
## Median :1.0000 Median :1.0000 Median :0.3679 Median :0.00000
## Mean :0.6543 Mean :0.6476 Mean :0.3679 Mean :0.06538
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.4345 3rd Qu.:0.12500
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Parch Fare Embarked Title
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.01544 1st Qu.:0.5000 1st Qu.:0.3333
## Median :0.0000 Median :0.02821 Median :1.0000 Median :0.5000
## Mean :0.0636 Mean :0.06286 Mean :0.7682 Mean :0.4805
## 3rd Qu.:0.0000 3rd Qu.:0.06051 3rd Qu.:1.0000 3rd Qu.:0.5000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## FamilySize PassengerId Survived
## Min. :0.00000 Min. : 1.0 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:223.5 1st Qu.:0.0000
## Median :0.00000 Median :446.0 Median :0.0000
## Mean :0.09046 Mean :446.0 Mean :0.3838
## 3rd Qu.:0.10000 3rd Qu.:668.5 3rd Qu.:1.0000
## Max. :1.00000 Max. :891.0 Max. :1.0000
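A related caveat to the age imputation: cleanse() rescales each data set with its own minima and maxima, so the same raw Fare can map to different scaled values in train and test. A variant that reuses the training ranges for the test set might look like this (a sketch; scaler_ref is a name I'm introducing):

# Hypothetical helper: scale by reference ranges taken from the training set
scaler_ref <- function(x, lo, hi) {(x - lo) / (hi - lo)}
# e.g. for Fare:
# test_scl$Fare <- scaler_ref(test$Fare,
#                             min(train$Fare, na.rm = TRUE),
#                             max(train$Fare, na.rm = TRUE))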
Let’s start the analysis with a good old-fashioned logistic regression. Throw everything we’ve got into the pot.
logit <- glm(Survived ~ Pclass + Sex + Age + SibSp +
Parch + Fare + Embarked + Title,
family = binomial(),
data = train_scl,
na.action = na.omit)
summary(logit)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked + Title, family = binomial(), data = train_scl,
## na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6460 -0.5874 -0.4168 0.6330 2.4134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.4881 0.5158 8.701 < 2e-16 ***
## Pclass -2.1785 0.2789 -7.811 5.66e-15 ***
## Sex -2.7493 0.1995 -13.783 < 2e-16 ***
## Age -2.7969 0.6686 -4.183 2.87e-05 ***
## SibSp -2.7339 0.8755 -3.123 0.00179 **
## Parch -0.5260 0.7104 -0.740 0.45910
## Fare 0.9325 1.2190 0.765 0.44427
## Embarked -0.4345 0.2300 -1.889 0.05891 .
## Title -0.8603 0.6672 -1.289 0.19724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 783.39 on 882 degrees of freedom
## AIC: 801.39
##
## Number of Fisher Scoring iterations: 5
Basically, what the above tells us is that pretty much everything decreases your chances of living. You start from a high baseline (the intercept is about 4.49) and decrease from there. After encoding and scaling, men have a Sex of 1 and women a Sex of 0, so being a man is a strong predictor of dying. Age has the largest coefficient in magnitude (being older decreases your chances of living), though Sex is by far the most significant predictor.
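Since these coefficients are on the log-odds scale, exponentiating them turns them into odds ratios, which are a bit easier to read. A quick check on the fitted model:

# Odds ratios: e.g. the value for Sex is the multiplicative change in
# odds of survival going from female (0) to male (1)
exp(coef(logit))

Now let's take the testing data set and predict what we think the results are likely to be.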
# Now we predict using the model
threshold <- 0.5
logit_pred <- predict(logit, newdata = test_scl, type = 'response')
hist(logit_pred)
logit_pred <- ifelse(logit_pred > threshold, 1, 0)
# The test set has one missing Fare, which survives scaling as an NA and
# yields an NA prediction; default it to 0 (died).
logit_pred[is.na(logit_pred)] <- 0
summary(logit_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3589 1.0000 1.0000
Cool. The predicted survival rate (roughly 36%) is close to the training set's 38%, which is reassuring. Let's export it and see what results we get!
all <- data.frame(test$PassengerId, logit_pred)
colnames(all) <- c("PassengerID", "Survived")
write_csv(all, "../../data/Titanic/predictions/logit_prediction.csv")
Neural Networks
After submitting to Kaggle, this method gives me an accuracy of 76%, worse than my earlier random forest submission, which gave 79%. Let me see if a neural network does any better.
library(neuralnet)
set.seed(91)
# Model the neural network: three hidden layers of two nodes each.
# neuralnet's `threshold` is a stopping criterion on the error gradient,
# not a classification cutoff.
nnet <- neuralnet(Survived ~ Pclass + Sex + Age + SibSp +
                    Parch + Fare + Embarked + Title,
                  hidden = c(2, 2, 2),
                  threshold = 0.035,
                  stepmax = 400000000,
                  data = train_scl,
                  lifesign = 'full')
## hidden: 2, 2, 2 thresh: 0.035 rep: 1/1 steps: 1000 min thresh: 0.1103394579
## 2000 min thresh: 0.05215387676
## 3000 min thresh: 0.04034919191
## 4000 min thresh: 0.03822729544
## 5000 min thresh: 0.03554090753
## 5122 error: 54.00459 time: 3.79 secs
# Predict the test set
nnet.c <- compute(nnet, test_scl[,1:8])
nnet.c <- nnet.c$net.result
hist(nnet.c)
# Classify with the 0.5 cutoff from earlier (the `threshold` variable,
# not neuralnet's threshold argument)
nnet.c <- ifelse(nnet.c > threshold, 1, 0)
# If we're missing data, predict 0.
nnet.c[is.na(nnet.c)] <- 0
summary(nnet.c)
## V1
## Min. :0.0000000
## 1st Qu.:0.0000000
## Median :0.0000000
## Mean :0.3755981
## 3rd Qu.:1.0000000
## Max. :1.0000000
all <- data.frame(test$PassengerId, nnet.c)
colnames(all) <- c("PassengerID", "Survived")
write_csv(all, "../../data/Titanic/predictions/nnet_prediction.csv")
I've run a lot of other experiments on a variety of neural networks, with up to five layers and a range of node counts. The only configuration that ever matched the random forest's 79% accuracy was a relatively uncomplicated network with three layers of two nodes. I suspect that survival in this data set is better predicted by other algorithms.
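For future experiments, a quick local holdout split would let me compare models without burning Kaggle submissions. A minimal sketch using the logistic specification from earlier (the 80/20 split is an arbitrary choice on my part):

set.seed(91)
idx <- sample(nrow(train_scl), size = floor(0.8 * nrow(train_scl)))
holdout_fit <- glm(Survived ~ Pclass + Sex + Age + SibSp +
                     Parch + Fare + Embarked + Title,
                   family = binomial(), data = train_scl[idx, ])
holdout_pred <- ifelse(predict(holdout_fit, newdata = train_scl[-idx, ],
                               type = 'response') > 0.5, 1, 0)
# Proportion of held-out passengers classified correctly
mean(holdout_pred == train_scl$Survived[-idx])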
[1] I used random forests and decision trees as my first submissions. DataCamp's tutorial does an excellent job explaining the methodology and code, so check out the hyperlink above if you're interested.