Exploring the Titanic Dataset: A Data Analysis Project Using SQL and Power BI
As a data analyst with a keen interest in historical datasets and their modern implications, I recently undertook a project analyzing the Titanic passenger dataset. This project involved database creation, SQL querying, and data visualization in Power BI to uncover patterns in survival and socio-economic factors aboard the ill-fated ship. The goal was to not only practice technical skills but also derive meaningful insights that could inform contemporary safety practices in the maritime industry.
I'll walk you through the step-by-step process, highlight the importance of each stage, share key objectives, insights, and provide forward-looking recommendations for how companies could enhance operations over the next 2-10 years.
Project Overview
The Titanic disaster of 1912 remains one of history's most poignant tragedies, with over 1,500 lives lost when the "unsinkable" ship struck an iceberg. The dataset used in this project is a cleaned version of the Titanic passenger manifest (train_clean.csv), containing 891 records with variables such as Age, Sex, Pclass (passenger class), Fare, SibSp (siblings/spouses aboard), Parch (parents/children aboard), Survived (0 or 1), Embarked (port of embarkation), Title (extracted from names, e.g., Mr., Mrs.), and Family_Size (SibSp + Parch, excluding self).
The project followed a structured approach: creating a SQL database, importing the data, writing queries to answer specific questions, and visualizing results in Power BI. This mirrors real-world data analysis workflows, where raw data is transformed into insights for decision-making.
Objectives
The primary objectives were:
- Technical Proficiency: Demonstrate skills in SQL for data manipulation and Power BI for visualization.
- Insight Generation: Uncover demographic and socio-economic factors influencing survival rates.
- Storytelling: Craft a narrative around the data to highlight human elements, such as family dynamics and class disparities.
- Practical Application: Draw lessons for modern industries, emphasizing how data-driven decisions can prevent future tragedies.
- Portfolio Enhancement: Create a professional showcase for LinkedIn, illustrating my ability to handle historical data with contemporary tools.
These objectives ensured the project was not just an exercise but a bridge between past events and future improvements.
Step-by-Step Process
Step 1: Data Preparation and Database Setup
I began by creating a SQL Server database named "Titanic." This step is crucial as it establishes a structured environment for querying, ensuring data integrity and efficient retrieval. Importance: Without a proper database, queries could lead to errors or inefficiencies, especially with larger datasets. Next, I imported the train_clean.csv file into a table named [dbo].[train_clean] using SQL Server Management Studio's import wizard. This process involved mapping columns to appropriate data types (e.g., Age as float, Survived as int) and handling any null values (e.g., in Age or Cabin).
Importance: Data import ensures cleanliness and accessibility. In this case, the dataset was pre-cleaned (e.g., Titles extracted from Names, Family_Size calculated), but verifying for duplicates or inconsistencies is key to reliable analysis. This step took about 15 minutes and set the foundation for accurate querying.
Step 2: SQL Query Development
With the data loaded, I wrote SQL queries to address 10 specific tasks. Each query was designed to extract targeted insights, using techniques like aggregation (AVG, COUNT), conditional logic (CASE), grouping (GROUP BY), and filtering (WHERE). Below, I detail each query, its execution, and why it's important.
Calculate the Average Age of Passengers Who Survived and Those Who Did Not
Survivors averaged ~28.34 years; non-survivors ~30.63 years. This reveals age-related vulnerabilities, highlighting how younger passengers (potentially more agile) had a slight edge in survival. It's essential for understanding demographic risks in emergencies.
Find the Top 5 Most Common Titles Among Passengers
Mr. (517), Miss (182), Mrs. (125), Master (40), Dr. (7). Titles reflect social structure and gender distribution, aiding in segmentation analysis. This is vital for targeted insights, like how titles correlate with class or survival.
Calculate the Average Fare Paid by Passengers in Each Pclass
1st Class (~84.15), 2nd (~20.66), 3rd (~13.68). Illustrates economic disparities, showing how fare (proxy for wealth) ties to amenities and survival odds. Critical for pricing strategy analysis in business contexts.
Retrieve Passengers with the Highest and Lowest Fare
Highest (~512.33, e.g., Cardeza family); Lowest (0, e.g., crew or complimentary tickets). Identifies outliers, revealing luxury vs. economy dynamics. Useful for anomaly detection in financial data.
Find the Survival Rate for Male and Female Passengers
Females (74.20% survival, 233/314); Males (18.89%, 109/577). Exposes gender biases in evacuation ("women and children first"), underscoring equity in crisis protocols.
Recommended by LinkedIn
Calculate the Average Family Size for Survivors and Non-Survivors
Survivors (~1.97); Non-survivors (~1.88). Shows family size's subtle impact on mobility during disasters, informing group-based safety planning.
Retrieve Names of Passengers with Missing Age
177 passengers (e.g., Moran, Mr. James). Handles missing data, essential for imputation strategies and ensuring complete analysis.
Find the Most Common Port of Embarkation and Passenger Count
Southampton (644). Reveals geographic trends, useful for logistics and market analysis.
Calculate Average Age by Pclass and Gender
1st Class Females (~34.61); 3rd Class Males (~26.51). Highlights intersections of class, gender, and age, key for nuanced segmentation.
Retrieve Passengers Traveling Alone Who Did Not Survive
~163 such passengers (adjusted for Family_Size=0 meaning alone). Identifies isolated individuals' higher risks, emphasizing support systems.
Each query was tested for accuracy, with execution times under 1 second due to the small dataset.
Step 3: Data Visualization in Power BI
I connected the SQL database to Power BI and created dashboards for each query's results. Images like (survival by gender) and (family size comparison) were exported.
Importance: Visualizations make complex data accessible, enabling storytelling. Tableau's interactivity allowed for filters, enhancing exploratory analysis.
Key Insights
- Gender and Class Disparities: Women in 1st class had near-100% survival, vs. 3rd class males at ~13%, showing "privilege" in crises.
- Age and Family Dynamics: Younger passengers and smaller families fared better, likely due to mobility.
- Economic Factors: Higher fares correlated with better survival, underscoring resource access.
These insights humanize the data, turning numbers into narratives about inequality.
Recommendations and Future Insights
Drawing from this analysis, modern cruise companies (e.g., a hypothetical "OceanSafe Cruises") can apply lessons to improve safety:
- Short-Term (2-5 Years): Implement AI-driven evacuation simulations prioritizing vulnerable groups (e.g., elderly, solo travelers). Enhance data collection on passenger demographics for personalized safety briefings. Invest in equitable lifeboat access, addressing class-based disparities.
- Long-Term (5-10 Years): Integrate IoT sensors for real-time risk assessment (e.g., iceberg detection via satellite). Use predictive analytics on passenger data to forecast needs, reducing missing info issues. Promote inclusivity training to eliminate gender biases. By 2035, aim for 100% survival in simulations through VR training and automated systems.
These steps could reduce risks by 50-70%, based on historical parallels, turning data into lifesaving strategies.
Conclusion
This Titanic project honed my skills in SQL, Power BI, and data storytelling while revealing timeless lessons. It's a testament to how analyzing the past can shape a safer future. I'm eager to apply these techniques in professional roles and connect with me on LinkedIn to discuss collaborations!
Tools Used: SQL Server, Power BI. Dataset Source: Public Titanic repository (cleaned version).
Data Analyst | Power BI & SQL Expert | Turning Business Data into Actionable Insights
1moGood
Freelance Data Analyst | Transform Row Data Into Actionable Insights | Power BI Dashboards | SQL | Python | Data Visualization & Reporting | Excel Dashboard | Data Cleaning
1moWell done! A neat visualization that delivers strong insights.”
B.Sc. Statistics, First Class Honours || Data Scientist || Machine Learning || Deep Learning || Computer Vision || Artificial Intelligence || Researcher
1moYou're doing well Godwill Okwuchukwu
Area Manager at Samsung Electronics India Pvt Ltd
1moWell done
AI Software Engineer @ Nahda University | Prompt Engineering, Linux
1moGodwill Okwuchukwu, your analysis on the Titanic dataset is amazing! The key insights and recommendations you've shared are incredibly inspiring.