Data Integration
Data Transformation
Data Exploration
Data Wrangling & Subsetting
Data Merging
Geographical Visualizations in Python
Supervised Machine Learning - Regression
Unsupervised Machine Learning - Clustering
Visualizations & Forecasting
Analyze the state of children's education worldwide using out-of-school and completion rate data. The analysis highlights countries with significant gaps in education accessibility and gender parity for further action.
1. UNICEF’s State of the World’s Children 2021 on Education. Dataset can be downloaded from Completion Rate, Statistical tables, Other References.
2. GDP Per Capita data from World Bank. Dataset can be downloaded here.
3. 2022 Global Happiness Rank by World Happiness Report. Dataset can be downloaded here.
Country names have to be standardized every time data is merged from a new source or from the GeoJSON file.
Solution: Create a reference country list based on the UNICEF data. After each merge, look for left-only or right-only rows, then rename the countries that did not join so that they match the reference list.
Cons: This does not work with the GeoJSON file. Countries had to be renamed again to suit the GeoJSON file for visualization in Python. A solution that integrates the geometry into the main dataframe is in progress.
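The merge check described above can be sketched with the `indicator` option of pandas' `merge`. This is only an illustration under stated assumptions: the column names, the `unmatched` helper, the rename map and the numbers are all hypothetical placeholders, not values from the actual datasets.

```python
import pandas as pd

# Hypothetical toy frames; values are placeholders, not real statistics.
unicef = pd.DataFrame({
    "country": ["Bolivia (Plurinational State of)", "Kenya", "Viet Nam"],
    "completion_rate": [90.0, 80.0, 95.0],
})
gdp = pd.DataFrame({
    "country": ["Bolivia", "Kenya", "Vietnam"],
    "gdp_per_capita": [3000.0, 2000.0, 3500.0],
})


def unmatched(left, right, key="country"):
    """Return the rows that appear in only one of the two frames."""
    merged = left.merge(right, on=key, how="outer", indicator=True)
    return merged.loc[merged["_merge"] != "both", [key, "_merge"]]


# Inspect the mismatches, then map the new source onto the reference spelling.
print(unmatched(unicef, gdp))
rename_map = {  # hypothetical mapping built by hand from the mismatches
    "Bolivia": "Bolivia (Plurinational State of)",
    "Vietnam": "Viet Nam",
}
gdp_fixed = gdp.assign(country=gdp["country"].replace(rename_map))

# After renaming, every reference country should find a match.
combined = unicef.merge(gdp_fixed, on="country", how="left", indicator=True)
```

Keeping the UNICEF spelling as the single reference means each new source only needs its own small rename map.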
This section gives an excerpt of what was performed throughout the analysis, to show the tools used and the process followed to reach the findings. The full workings can be found on GitHub.
We begin with exploratory analysis in Python using pandas. Relationships were explored by running correlations, scatterplots, pair plots and categorical plots. These relationships were plotted using matplotlib. The following is a snapshot of a heatmap for female completion rates at primary level.
The heatmap suggests a slight correlation between completion rates and GDP per capita. Completion rates also correlate with the Happiness Rank; the coefficient is negative because a top rank is a smaller number. A top Happiness Rank (low number) also correlates with high GDP.
The figure below is an example of a scatterplot for male and female completion rates at upper secondary level against their country's GDP. Countries with very low GDP per capita mostly have very low secondary completion rates. However, a high completion rate does not necessarily mean a high GDP per capita.
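To make the sign of the Happiness Rank coefficient concrete, here is a minimal sketch of a correlation matrix computed with pandas. The column names and all values are made up for illustration; the matrix could then be handed to whichever plotting function draws the heatmap.

```python
import pandas as pd

# Made-up illustrative values, not taken from the UNICEF or World Bank data.
df = pd.DataFrame({
    "completion_rate": [40.0, 55.0, 70.0, 85.0, 95.0],
    "gdp_per_capita": [500.0, 1500.0, 4000.0, 12000.0, 40000.0],
    "happiness_rank": [140, 120, 90, 50, 10],  # 1 = happiest country
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
print(corr.round(2))

# Rank 1 is the best rank, so a "good" relationship shows up as a
# negative coefficient against completion rate and GDP.
rank_vs_completion = corr.loc["completion_rate", "happiness_rank"]
```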
Folium was selected for geographical visualization in Python. Below is a visualization of the male out-of-school rate at primary level.
We also explored running a supervised machine learning algorithm on GDP per capita against completion rates. The linear regression model from the scikit-learn library was used for this purpose.
The data is first split into a training set and a test set, with 30% of the data held out for testing. The regression is then fitted to the training set and used to predict the GDP value from the completion rate in the test set.
Once this value has been predicted and plotted, we can calculate the summary statistics (root mean squared error and R² score). We can also create a table to compare the actual and predicted GDP values.
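The regression steps above can be sketched with scikit-learn as follows. The data here is synthetic and generated only so the example runs; the variable names and the random seed are assumptions, and the resulting scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: completion rate (%) and a noisy GDP per capita.
rng = np.random.default_rng(0)
completion = rng.uniform(20, 100, size=100)
gdp = 300 * completion + rng.normal(0, 3000, size=100)

X = completion.reshape(-1, 1)  # scikit-learn expects a 2D feature array
y = gdp

# Hold out 30% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training set, then predict GDP for the test set.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Summary statistics on the held-out data.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# Side-by-side table of actual and predicted GDP values.
comparison = pd.DataFrame({"actual": y_test, "predicted": y_pred})
print(f"RMSE: {rmse:.1f}, R2: {r2:.3f}")
print(comparison.head())
```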
For the unsupervised machine learning analysis, we started with the elbow technique to determine a suitable number of clusters.
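As a rough sketch of the elbow technique, the k-means inertia (within-cluster sum of squares) is computed for a range of cluster counts, and the point where the curve flattens suggests a reasonable value. The data below is synthetic, and the range of 1 to 8 clusters is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D data with four well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

# Fit k-means for each candidate k and record the inertia.
ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Plotting ks against inertias shows the "elbow"; here we just print them.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```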
Using the k-means algorithm, we fit the model to the data and then predict the cluster number for each data point. Plotting the results gives you this:
We can clearly see these clusters assigned to the data, which can be categorized into Low GDP, Mid-Low GDP, Mid-High GDP and High GDP.
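A minimal sketch of the clustering and labelling step, assuming four clusters and synthetic data. K-means numbers its clusters arbitrarily, so the example orders them by mean GDP before attaching the four labels. The features are left unscaled here, which lets GDP dominate the distances; whether to scale them first is a modelling choice this sketch does not settle.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in data with four rough GDP bands (placeholder values).
rng = np.random.default_rng(1)
bands = [(1000, 200, 40), (5000, 500, 65), (15000, 1500, 85), (45000, 4000, 95)]
df = pd.DataFrame({
    "gdp_per_capita": np.concatenate(
        [rng.normal(mean, sd, 25) for mean, sd, _ in bands]
    ),
    "completion_rate": np.concatenate(
        [rng.normal(rate, 5, 25) for _, _, rate in bands]
    ),
})

# Fit k-means and assign a cluster number to every row.
features = df[["gdp_per_capita", "completion_rate"]]
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

# Cluster numbers are arbitrary, so rank the clusters by their mean GDP.
order = df.groupby("cluster")["gdp_per_capita"].mean().sort_values().index
names = ["Low GDP", "Mid-Low GDP", "Mid-High GDP", "High GDP"]
df["gdp_group"] = df["cluster"].map(dict(zip(order, names)))

group_means = df.groupby("gdp_group")["gdp_per_capita"].mean()
print(group_means.sort_values())
```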
All of this information is compiled into the Tableau storyboard below.
View it in full screen if anything is cut off, or visit the storyboard on Tableau here.