On April 15, 1912, the RMS Titanic sank into the depths of the Atlantic Ocean. This mighty ship was traveling from Southampton, England, to New York City with 2,224 souls, and the unthinkable happened — a collision with an iceberg. Without enough lifeboats to save all the passengers and crew on board, more than 1,500 people died.
In Kaggle’s Titanic — Machine Learning from Disaster challenge, you are given train and test data, and using machine learning your goal is to predict which passengers will survive or die. Before diving deep into the prediction part of the challenge, we must ask a few critical questions to understand the underlying data better.
- How many passenger classes did Titanic have, and what was the median age in each class?
- Were there more families or single passengers on the Titanic?
- What is the one characteristic among the passengers that determined the highest probability of survival?
- Lastly, what are the important features (or characteristics) highly correlated to survival?
How many passenger classes did Titanic have, and what was the median age in each class?
The Titanic had three passenger classes. First class was the most expensive and carried mainly the wealthier passengers, followed by the cheaper second and third classes. Let’s take a closer look at the age distribution for each class. In Figure 1, the three age density plots are pretty telling. The median age for first-class passengers was about 40, while the medians for second and third class were 30 and 29, respectively. We can infer that wealth was positively correlated with age during this time.
It’s also interesting to note that the third-class age density plot shows less variability around the mean and a higher peak than the other two. This finding indicates that most third-class passengers were relatively young. In addition, third-class passengers made up about 55% of the total passenger population.
The boxplots in Figure 2 show another perspective. We observe decreasing variability (interquartile range) as we move from first class to second and third. This indicates that first-class passengers’ ages ranged more widely from young to old, while third-class passengers’ ages clustered more tightly around the median (the 50th percentile).
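For readers who want to reproduce this kind of summary, here is a minimal pandas sketch. It assumes the standard Kaggle column names `Pclass` and `Age`; the `toy` rows and the `age_summary` helper are made up for illustration and are not the real passenger data or the code behind the figures.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Kaggle data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "Age": [22.0, 40.0, 58.0, 24.0, 30.0, 36.0, 26.0, 28.0, 29.0, 31.0],
})


def age_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Median age, interquartile range, and passenger share per class."""
    grouped = df.groupby("Pclass")["Age"]
    q1 = grouped.quantile(0.25)
    q3 = grouped.quantile(0.75)
    return pd.DataFrame({
        "median_age": grouped.median(),
        "iqr": q3 - q1,  # spread of the middle 50% of ages
        "share": df["Pclass"].value_counts(normalize=True).sort_index(),
    })


summary = age_summary(toy)
print(summary)
```

On the real data, you would pass the loaded train DataFrame (or train and test combined) to `age_summary` instead of `toy`.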
Were there more families or single passengers on the Titanic?
Let’s take a quick look to see whether passengers were traveling alone or as part of a family. This would later help show whether survival differed between solo passengers and families. To get the most accurate view, I combined the train and test data.
To determine each passenger’s family size, it’s logical to add SibSp (siblings and spouses aboard), Parch (parents and children aboard), and 1 for the passenger. However, rather than defining this at the passenger level, it made more sense to determine the maximum family size based on ticket number and surname.
I’m sticking with the assumption that families traveled together and, in this case, on the same ticket. As a result, the sum of SibSp, Parch, and 1 for the current passenger, which represents the family size, is matched with a ticket number and surname.
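The family-size logic described above can be sketched roughly as follows. This is one possible reading, not necessarily the exact code I ran: it assumes the Kaggle columns `Name`, `Ticket`, `SibSp`, and `Parch`, and that the surname is the part of `Name` before the comma. The `toy` rows, the `add_family_size` helper, and the `PersonalSize`, `FamilySize`, and `IsAlone` column names are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Kaggle data.
# The third Smith record deliberately under-reports Parch.
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Smith, Mrs. John (Mary)",
        "Smith, Master. Tom",
        "Jones, Miss. Ann",
        "Brown, Mr. Carl",
    ],
    "Ticket": ["A1", "A1", "A1", "B2", "C3"],
    "SibSp": [1, 1, 0, 0, 0],
    "Parch": [1, 1, 1, 0, 0],
})


def add_family_size(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Kaggle names look like "Surname, Title. Given names".
    out["Surname"] = out["Name"].str.split(",").str[0].str.strip()
    # Passenger-level size: siblings/spouses + parents/children + self.
    out["PersonalSize"] = out["SibSp"] + out["Parch"] + 1
    # Take the maximum within each (ticket, surname) group.
    out["FamilySize"] = (
        out.groupby(["Ticket", "Surname"])["PersonalSize"].transform("max")
    )
    out["IsAlone"] = out["FamilySize"] == 1
    return out


families = add_family_size(toy)
print(families[["Surname", "Ticket", "PersonalSize", "FamilySize", "IsAlone"]])
```

Taking the group maximum means one inconsistent record (the third Smith row here) does not shrink the family; every member on the ticket gets the largest reported size.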
Figure 3 shows that most passengers were traveling alone. If the remaining crew on board prioritized saving families over single passengers, the latter group would face a more dire situation.
What is the one characteristic among the passengers that determined the highest probability of survival?
As you dive deeper into the data, a clear pattern begins to emerge — female passengers had the highest survival rate of approximately 74.2%. The single overwhelming driving force determining survival was being female. On the other hand, the overall male survival rate was about 18.9%.
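Survival rates like these come from a simple group-by. Here is a minimal sketch, assuming the Kaggle columns `Sex` and `Survived` (coded 0 or 1); the `toy` rows are invented, so the printed rates will not match the figures quoted above.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Kaggle data.
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 5,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
survival_by_sex = toy.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```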
One might think that a predictive model, in this scenario, would need only one feature (sex) to determine survival, and that its predictions would be pretty close to the truth. However, to improve model predictions, we often tease out more information from the available training data. For example, if we examine only the male passengers and slice the data by title, such as “Mr” or “Master”, we are essentially dividing the male passenger group into adults and boys.
If we focus our attention on Mr and Master in Figure 5, we see that male children had a survival rate of 57.5%, while the adult male survival rate was about 16.01%.
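One way to slice male passengers by title is to pull the title out of the `Name` column with a regular expression. The sketch below assumes Kaggle-style names ("Surname, Title. Given names"); the `toy` rows and the `Title` column name are hypothetical, and the regex is my own choice rather than the only way to do it.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Kaggle data.
toy = pd.DataFrame({
    "Name": [
        "Adams, Mr. Xavier",
        "Baker, Mr. Yusuf",
        "Clark, Mr. Zane",
        "Davis, Master. Will",
        "Evans, Master. Vic",
        "Floyd, Miss. Una",
    ],
    "Sex": ["male", "male", "male", "male", "male", "female"],
    "Survived": [0, 0, 1, 1, 0, 1],
})

# Capture the text between the comma and the first period, e.g. "Mr".
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Keep only male passengers, then compare survival by title.
males = toy[toy["Sex"] == "male"]
male_rates = males.groupby("Title")["Survived"].mean()
print(male_rates)
```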
What are the important features (or characteristics) highly correlated to survival?
In summary, I shared a few compelling characteristics among the Titanic passengers that determined their survival. The reality was that many of the passengers had no chance of surviving even though they most likely fought to survive.
To make reasonably robust predictions, I built my model with the Random Forest algorithm. I also created my own features from the original training data and measured every feature’s importance. After using two frameworks to weigh each feature’s importance, I ranked the features and eliminated the unimportant ones. Figure 6 shows all the relevant features used in the predictive model.
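To give a feel for how features can be ranked, here is a small sketch using scikit-learn. The two importance measures shown, impurity-based importance and permutation importance, are stand-ins I picked for illustration; they are an assumption, not a statement of which two frameworks were used for Figure 6. The data is synthetic, and the feature names `IsFemale` and `Noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; NOT the real Kaggle data.
rng = np.random.default_rng(0)
n = 400
is_female = rng.integers(0, 2, n)
X = pd.DataFrame({
    "IsFemale": is_female,
    "Pclass": rng.integers(1, 4, n),
    "Noise": rng.normal(size=n),
})
# The label follows IsFemale, with roughly 10% of rows flipped at random.
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - is_female, is_female)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: drop in validation accuracy when a column is shuffled.
perm = permutation_importance(
    model, X_valid, y_valid, n_repeats=10, random_state=0
)

ranking = pd.DataFrame(
    {
        "impurity": model.feature_importances_,
        "permutation": perm.importances_mean,
    },
    index=X.columns,
).sort_values("permutation", ascending=False)
print(ranking)
```

Features that sit near zero on the held-out permutation score are natural candidates to drop before refitting.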
Under Feature Importance in Figure 6, the top three features are highly related to gender. They indicate that if you had the right title, were a woman or a child, and had the right surname, your chance of surviving increased dramatically. Furthermore, if you were on the right cabin level, part of a family, and in first class, your chance of survival also increased. In the end, it came down to having the right combination of values across the important features, as well as a little bit of luck, to survive the sinking ship and see another day on planet Earth.