doeasyeda Tutorial

Exploring Gapminder with doeasyeda

This tutorial is adapted from Exploring Seattle Weather from Vega-Altair.

In this guide, you’ll learn how to create insightful visualizations with the doeasyeda package, using the Gapminder dataset as a practical example. The guide assumes basic familiarity with Python and data visualization concepts. If you’re new to the doeasyeda package or need a refresher on the basics with Altair, consider reviewing Basic Statistical Visualization first.

For this tutorial, we’ll delve into the rich, multidimensional Gapminder dataset. This dataset provides a treasure trove of information, detailing global socio-economic indicators across various countries and continents. The key metrics include population, GDP per capita, life expectancy, and more, spanning from 1952 to 2007. Each row represents a unique country-year pair, offering a snapshot of that nation’s status in the given year.

The doeasyeda package, tailored for efficient and straightforward exploratory data analysis, seamlessly integrates with data in the form of pandas DataFrames. To facilitate your learning journey, we’ll cover the following key steps:

Visual Analysis:

Crafting compelling visual narratives with doeasyeda’s visualization capabilities.
Creating various types of plots (such as histograms, line plots, area plots, and scatter plots) to represent the socio-economic trends effectively.

Interactive Features:

Leveraging doeasyeda’s interactive features to create dynamic visualizations.
Enabling user-driven query capabilities to explore the dataset from multiple angles.

Best Practices and Tips:

Tips for effective data visualization.
Common pitfalls to avoid and best practices to adopt for meaningful data exploration.

Conclusion and Further Resources:

Summarizing key takeaways from the tutorial.
Acknowledging the limitations of doeasyeda.
Providing resources for further learning and exploration using doeasyeda.

import pandas as pd

from doeasyeda.create_scatter_plot import create_scatter_plot
from doeasyeda.create_line_plot import create_line_plot
from doeasyeda.create_hist_plot import create_hist_plot
from doeasyeda.create_area_plot import create_area_plot

df = pd.read_csv('gapminder.csv')
df.head()

	country	year	population	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333	Asia	28.801	779.445314
1	Afghanistan	1957	9240934	Asia	30.332	820.853030
2	Afghanistan	1962	10267083	Asia	31.997	853.100710
3	Afghanistan	1967	11537966	Asia	34.020	836.197138
4	Afghanistan	1972	13079460	Asia	36.088	739.981106

Visual Analysis

Our data comes from the “gapminder.csv” file within the doc/ directory and is now loaded into a pandas DataFrame.
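Before plotting, it can help to confirm that the data has the structure described earlier: one row per country-year pair and no missing values. The following is a minimal sketch on a small stand-in DataFrame rather than the real file. The two Afghanistan rows are copied from the df.head() output, while "Exampleland" and its values are made up for illustration.

```python
import pandas as pd

# Small stand-in with the same columns as gapminder.csv.
# The "Exampleland" rows are hypothetical and exist only for this sketch.
sample = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Exampleland", "Exampleland"],
    "year": [1952, 1957, 1952, 1957],
    "population": [8425333, 9240934, 1000000, 1100000],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [28.801, 30.332, 65.0, 66.5],
    "gdpPercap": [779.445314, 820.853030, 5000.0, 5400.0],
})

# Each row should be a unique country-year pair, so this count should be 0.
n_duplicates = int(sample.duplicated(subset=["country", "year"]).sum())

# Count missing values per column; all zeros means nothing needs cleaning.
n_missing = sample.isna().sum()

print("duplicate country-year rows:", n_duplicates)
print(n_missing)
```

Running the same two checks on the full df should give a quick indication of whether any cleaning is needed before moving on to the plots.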

Dissecting Distribution with Scatter Plots

Let’s start by looking at lifeExp, using a scatter plot to see the distribution of life expectancy values within each continent:

create_scatter_plot(df, 'continent', 'lifeExp')

The create_scatter_plot call above instructs doeasyeda to render a scatter plot with ‘continent’ on the x-axis and ‘lifeExp’ on the y-axis. Because continent is categorical, the points for each continent line up in a vertical strip, which makes the plot a compact view of how life expectancy varies within and between continents.

This plot is useful in the following ways:

Visualizing Data Distribution: The spread of dots along the vertical axis within each continent’s category illustrates the distribution range of life expectancy. A wide vertical spread within a continent suggests significant disparities in life expectancy among its countries.
Capturing Data Density: Areas where dots are densely packed indicate a clustering of countries with similar life expectancy figures.
Detecting Outliers: Dots that stand apart from the main cluster within each continent can be viewed as outliers, suggesting countries with exceptionally high or low life expectancy relative to their continental peers.
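The three readings above (spread, density, outliers) can also be checked numerically. The sketch below uses a made-up two-group DataFrame, not the Gapminder data, and flags outliers with the common 1.5 × IQR rule. That rule is an assumption chosen for illustration; it is not something doeasyeda applies.

```python
import pandas as pd

# Hypothetical data: group "A" contains one unusually high value.
toy = pd.DataFrame({
    "continent": ["A"] * 6 + ["B"] * 6,
    "lifeExp": [50.0, 51.0, 52.0, 53.0, 54.0, 80.0,
                70.0, 71.0, 72.0, 73.0, 74.0, 75.0],
})

grouped = toy.groupby("continent")["lifeExp"]

# Spread: the vertical extent of each continent's strip of dots.
summary = grouped.agg(["min", "max"])
summary["range"] = summary["max"] - summary["min"]

# Outliers: values beyond 1.5 * IQR from the quartiles of their own group.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
is_outlier = (toy["lifeExp"] < q1 - 1.5 * iqr) | (toy["lifeExp"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(summary)
print(outliers)
```

Replacing toy with df applies the same summary to the real data, which gives a numeric counterpart to what the scatter plot shows visually.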

The scatter plot can be improved by adding a title and labels for the x- and y-axes, as the following example shows:

create_scatter_plot(df, 'continent', 'lifeExp',
                    title='Life Exp by Continent', x_title='Continent', y_title='Life Exp')

We illustrated the distribution of life expectancy across continents using a basic scatter plot. Now, let’s enhance the visual appeal and functionality of our plot with additional features provided by the doeasyeda package.

First, we will color the points by continent so that the groups are easier to tell apart. The modified function call is as follows:

create_scatter_plot(df, 'continent', 'lifeExp', color='continent',
                    title='Life Exp by Continent', x_title='Continent', y_title='Life Exp')

By specifying the color parameter as ‘continent’, we instruct the function to assign unique colors to each continent’s data points. This not only adds aesthetic value but also makes it easier to visually segregate the life expectancy data for each continent, thus enhancing the interpretability of our visualization.

We will now introduce three new functions from the doeasyeda package, each designed to illuminate distinct facets of the dataset through histograms, area plots, and line plots.

Exploring Averages with Histograms

Firstly, we shall compute the mean life expectancy for each continent to observe the average lifespan in these regions:

df_grouped1 = df.groupby(['continent'])['lifeExp'].mean().reset_index()
create_hist_plot(df_grouped1, 'continent', 'lifeExp', color='continent',
                 title='Average Life Exp by Continent', x_title='Continent', y_title='Average Life Exp')

Using this aggregated data, we can employ create_hist_plot to generate a histogram that illustrates the average life expectancy across continents. The chart gives a visual summary of the averages, making it easy to compare the aggregated metric across geographical regions.

The Gapminder dataset has three hierarchical levels: continent, country, and year. This structure is rich in detail, but in its raw, multi-dimensional form it can obscure broader trends. To distill the complexity and uncover the overarching patterns, we turn to aggregate measures. By calculating the mean life expectancy for each continent, pooled over all of its countries and years, we collapse the country and year levels and can compare continents directly.

Visualizing Trends with Area Plots

Next, create_area_plot can be used to gain insight into demographic changes over time. After summing the population by continent and year, we can draw an area plot that shows the total population growth or decline of each continent throughout the years:

df_grouped2 = df.groupby(['continent', 'year'])['population'].sum().reset_index()
create_area_plot(df_grouped2, 'year', 'population', color='continent',
                 title='Total Population by Continent', x_title='Year', y_title='Total Population')

This area plot will not only depict the magnitude of populations by region but also the dynamic shifts over the timeline, providing a powerful narrative of demographic evolution.

The choice of an area plot to analyze total population counts by continent is particularly apt due to the plot’s ability to convey volumes and changes over time. The cumulative nature of the area plot, with its filled regions under the lines, provides a visual sense of the weight and progression of population growth across different continents. It allows for an intuitive comparison of scales and trends, making it easier to discern how the population has expanded or contracted over the years within each geographic region. By layering these regions, the area plot not only shows individual continent trends but also how these trends stack up against each other, offering a comprehensive view of demographic shifts in a single, cohesive visualization.
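To make the stacking idea concrete, the sketch below computes where the top edge of each region would sit if the continents are stacked on top of one another. It assumes the areas are stacked, as the description above suggests; the continent names and population figures are invented for illustration.

```python
import pandas as pd

# Hypothetical aggregated data in the same long format as df_grouped2.
long_df = pd.DataFrame({
    "continent": ["A", "A", "B", "B", "C", "C"],
    "year": [1952, 2007, 1952, 2007, 1952, 2007],
    "population": [100, 150, 200, 260, 50, 90],
})

# One row per year, one column per continent.
wide = long_df.pivot(index="year", columns="continent", values="population")

# Running total across continents: the upper boundary of each stacked region.
stacked_top = wide.cumsum(axis=1)

# Each continent's share of the yearly total (the relative thickness of its band).
share = wide.div(wide.sum(axis=1), axis=0)

print(stacked_top)
print(share.round(2))
```

The last column of stacked_top is the overall total for each year, which is why the top edge of a stacked area plot can be read as the combined population of all the continents shown.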

Assessing Progress with Line Plots

Lastly, create_line_plot can also be used to analyze trends over time. For example, to examine economic progression, we calculate the GDP per capita of each continent in each year: we recover every country’s total GDP as gdpPercap multiplied by population, sum GDP and population within each continent-year group, and divide the two totals. With this calculation at hand, we turn to create_line_plot to draw a line plot that tracks the changes in GDP per capita over time, separated by continent:

df['gdp'] = df['gdpPercap'] * df['population']
df_grouped3 = df.groupby(['continent', 'year'])[['population', 'gdp']].sum().reset_index()
df_grouped3['gdpPercap'] = df_grouped3['gdp']/df_grouped3['population']

create_line_plot(df_grouped3, 'year', 'gdpPercap', color='continent',
                 title='GDP per capita by Continent', x_title='Year', y_title='GDP per capita')

This line plot will serve as a graphical chronicle of economic health, highlighting the growth trajectories or potential stagnations experienced by each continent.

The line plot, with its continuous line tracing the rise and fall of values, offers a clear narrative of economic development and provides immediate visual cues about upward or downward trajectories. Moreover, the ability to plot multiple lines differentiated by color allows for direct comparison between continents, making it easy to identify which regions are experiencing growth, stagnation, or decline.
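The GDP per capita calculation rebuilds total GDP before dividing, rather than averaging gdpPercap directly, and the difference matters when countries differ greatly in size. The sketch below uses two invented countries in a hypothetical continent "X" to compare the two approaches.

```python
import pandas as pd

# Hypothetical continent with a small rich country and a large poor one.
toy = pd.DataFrame({
    "continent": ["X", "X"],
    "year": [2007, 2007],
    "population": [1_000_000, 9_000_000],
    "gdpPercap": [10_000.0, 1_000.0],
})

# Unweighted: every country counts equally, regardless of population.
unweighted = toy.groupby(["continent", "year"])["gdpPercap"].mean().reset_index()

# Weighted: total GDP divided by total population, as in the tutorial code.
toy["gdp"] = toy["gdpPercap"] * toy["population"]
totals = toy.groupby(["continent", "year"])[["population", "gdp"]].sum().reset_index()
totals["gdpPercap"] = totals["gdp"] / totals["population"]

print("unweighted mean:", unweighted.loc[0, "gdpPercap"])
print("total GDP / total population:", totals.loc[0, "gdpPercap"])
```

Here the simple mean is 5500 while the population-weighted figure is 1900, because nine out of ten people live in the poorer country. Which of the two is appropriate depends on the question: the weighted figure describes the average person on the continent, while the simple mean describes the average country.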

Each of these plots (scatter, histogram, area, and line) contributes a unique vantage point, enriching our analysis and storytelling with multi-dimensional views of the life expectancy, population growth, and prosperity within our data.

Interactive Features

To further our exploration, we will now incorporate interactivity into our scatter plot. Interactive visualizations are instrumental in conducting a deeper, more intuitive examination of the data. By simply adding the interactive=True parameter, along with a tooltip feature, our scatter plot becomes a dynamic tool for users to engage with:

create_scatter_plot(df, 'continent', 'lifeExp', color='continent',
                    title='Life Exp by Continent', x_title='Continent', y_title='Life Exp',
                    interactive=True, tooltip='lifeExp')

The line plot can be enhanced with interactive elements in the same way. By setting interactive=True and specifying tooltip='gdpPercap', we transform our static visualization into an engaging, user-driven experience:

create_line_plot(df_grouped3, 'year', 'gdpPercap', color='continent',
                 title='GDP per capita by Continent', x_title='Year', y_title='GDP per capita',
                 interactive=True, tooltip='gdpPercap')

With interactivity enabled, users can hover over individual data points to display a tooltip. They can also scroll with the mouse to zoom in for a closer look at specific data points, or zoom out to regain a broader view of the data distribution. By clicking and dragging the plot area, they can navigate vertically across the full range of data values. These features are informative and encourage users to dig into the specifics of each continent’s data.

Best Practices and Tips

Mastering the art of data visualization involves more than just understanding how to use the tools at your disposal—it’s about storytelling and clarity. When leveraging the four powerful functions of the doeasyeda package (scatter plots, histograms, area plots, and line plots), it’s essential to observe certain best practices to ensure your visualizations are not only informative but also compelling:

Understand Your Data’s Story: Before diving into visualization, thoroughly understand the narrative behind your data. This comprehension will guide your choice of the doeasyeda function that best suits your storytelling needs—be it the distribution insights from a histogram, the trend examination in a line plot, the cumulative understanding from an area plot, or the detailed comparison in a scatter plot.
Embrace Clarity and Simplicity: The goal of visualization is to make complex data understandable. Opt for simplicity and clarity over complexity. A cluttered chart may hide the crucial insights your data holds. Use doeasyeda functions to create clear, uncluttered visualizations that speak directly to your audience.
Choose Colors Wisely: Colors are powerful, but when misused, they can lead to confusion. Use the color encoding features in doeasyeda thoughtfully to differentiate data categories or highlight key data points, ensuring that your color choices enhance the interpretability of your visualizations.
Purposeful Interactivity: Interactive features should enrich the user’s experience, offering them deeper insights or alternative perspectives. Utilize the interactive capabilities of doeasyeda to create plots that allow users to explore the nuances of your data, not just as a visual gimmick.
Preparation of Data: Before visualization, ensure your data is in the right structure, especially for aggregated metrics. Appropriate aggregation, like summing population for an area plot or dividing total GDP by total population to get GDP per capita for a line plot, is crucial to convey accurate and meaningful insights.
Consistency in Design: Maintain a consistent design language across all your visualizations. Consistency in axis labeling, color schemes, and overall design ensures that your suite of visualizations tells a cohesive story, enhancing both the professionalism and the interpretability of your presentation.

By adhering to these best practices, each function within doeasyeda can be utilized to its fullest potential, transforming raw data into compelling stories and insightful analyses.
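The data-preparation tip can be turned into a small pre-flight check for line and area plots, where each combination of group and x value should appear only once. The helper below, check_plot_ready, is a hypothetical function written for this sketch; it is not part of doeasyeda.

```python
import pandas as pd

def check_plot_ready(df, x_col, y_col, group_col=None):
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper for illustration only (not part of doeasyeda).
    """
    problems = []
    needed = [c for c in (x_col, y_col, group_col) if c is not None]
    missing = [c for c in needed if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    if df[y_col].isna().any():
        problems.append(f"{y_col} contains missing values")
    # For trend plots, each (group, x) pair should map to a single y value.
    keys = [x_col] if group_col is None else [group_col, x_col]
    if df.duplicated(subset=keys).any():
        problems.append(f"more than one row per {keys}; aggregate first")
    return problems

# Made-up raw data: two countries share the same continent and year.
raw = pd.DataFrame({
    "continent": ["A", "A"],
    "year": [2007, 2007],
    "population": [1, 2],
})
aggregated = raw.groupby(["continent", "year"])["population"].sum().reset_index()

problems_raw = check_plot_ready(raw, "year", "population", "continent")
problems_agg = check_plot_ready(aggregated, "year", "population", "continent")
print(problems_raw)
print(problems_agg)
```

The raw frame is flagged because it still has one row per country, while the aggregated frame passes. The check does not apply to the scatter plot of raw values, where repeated x values are expected.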

Conclusion and Further Resources

As we conclude this tutorial, remember that the journey of data exploration and visualization is an iterative and evolving process. The doeasyeda package provides a robust foundation for creating insightful visual narratives with your data.

However, while doeasyeda excels in its simplicity and ease of use, it’s important to acknowledge its limitations. The package is designed for foundational exploratory data analysis, and there might be instances where you require more specialized or advanced visualizations. In such cases, you may need to combine it with a more comprehensive visualization library.

The landscape of data science is dynamic and constantly evolving. As you continue your journey, stay curious and proactive. Delve into advanced visualization tools, engage with the vibrant community on forums like GitHub and Stack Overflow, and seek out scholarly articles for deeper insights. Practice is paramount—regularly challenge yourself with diverse datasets and strive to refine your storytelling abilities. Remember, each dataset tells a story, and with doeasyeda, you’re equipped to narrate these stories compellingly and informatively.