Malaria Data Visualization in Python

Interactive data visualization in Python with Altair and Plotly.

Image credit: Getty Images

In this post, I use malaria data from the World Health Organization (WHO) to demonstrate how to implement interactive data visualization in Python. I use three data sets with different aggregation levels: malaria_inc.csv, malaria_deaths.csv, and malaria_deaths_age.csv.

I experiment with two Python libraries for interactive data visualization: Altair and Plotly. Plotly is a popular library for making interactive data visualization. Altair is gaining more popularity as a Python visualization library. It is a wrapper for Vega-Lite, a JavaScript high-level visualization library.


First, I would like to know which parts of the world had the highest incidence of malaria from 2000 to 2015. The malaria_inc.csv data set can help answer this question. After loading the data with pandas, I notice that it contains too many countries and regions to plot on a single graph without the plot looking cluttered. Therefore, I focus on five geographical regions: East Asia & Pacific, Latin America & Caribbean, South Africa, South Asia, and Sub-Saharan Africa. I extract them from the Entity column with the convenient str methods in pandas. Here is my code:

import pandas as pd

inc_path = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_inc.csv"
inc = pd.read_csv(inc_path)

inc.columns = ['Entity', 'Code', 'Year', 'inc_per_100k']

inc["region"] = inc.Entity.str.extract(r"(East Asia & Pacific|Latin America & Caribbean|South Africa|South Asia|Sub-Saharan Africa)")

Now the malaria incidence data set has a column called region. It holds the matched region name and is missing for rows that match none of the five regions.
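To make the behavior of str.extract concrete, here is a minimal sketch on a toy table. The entity names and the shortened pattern are hypothetical and chosen only for illustration. I pass expand=False so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical entity names, not taken from the real data set.
toy = pd.DataFrame({"Entity": ["Sub-Saharan Africa", "South Asia", "Kenya", "World"]})

# The capture group returns the matched text; rows without a match get NaN.
pattern = r"(South Asia|Sub-Saharan Africa)"
toy["region"] = toy["Entity"].str.extract(pattern, expand=False)

# Keep only the rows that matched one of the regions.
regions_only = toy.loc[toy["region"].notna()]
print(regions_only)
```

Rows such as Kenya and World end up with a missing region, which is why a later step can filter on missing values to keep only the regional rows.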

To visualize where malaria is most common, I want to make a bar plot that shows the average malaria incidence between 2000 and 2015 (the data set only contains data from 2000 to 2015 in 5-year increments). I also want to show the trend in malaria incidence over the years, which can be shown with time series line plots. Using Altair, I can easily place the two plots side by side. Furthermore, I can define a selector in Altair that lets users highlight a region by clicking on it. Here is the code to implement it:

deaths_path = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths.csv"
deaths = pd.read_csv(deaths_path)
deaths.columns = ['Entity', 'Code', 'Year', 'deaths_per_100k']

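import numpy as np
import altair as alt

# Keep only the rows that matched one of the five regions.
inc_region = inc.loc[~inc.region.isna()].copy()
# Assumption: inc_log is the natural log of the incidence. Its definition is
# missing from this listing, and the 0-7 axis domain below suggests it.
inc_region["inc_log"] = np.log(inc_region["inc_per_100k"])

# Clicking a bar selects its region; with nothing selected, all regions show.
# In Altair 5 and later, alt.selection_point and .add_params replace these calls.
selector = alt.selection_single(empty='all', fields=['region'])

base = alt.Chart(inc_region).properties(
    width=500,
    height=300
).add_selection(selector)

points = base.mark_bar(filled=True, size=50).encode(
    x='region',
    # Assumption: the y encoding was lost; the mean incidence matches the text.
    y='mean(inc_per_100k)',
    color=alt.condition(selector, 'region:N', alt.value('lightgray'), legend=None)
)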


timeseries = base.mark_line().encode(
    x='Year:O',
    y=alt.Y('mean(inc_log)', scale=alt.Scale(domain=(0, 7)), title = "Log of The Average Incidence of Malaria per 100K"),
    color=alt.Color('region:N', legend=None)
).transform_filter(
    selector
)

points | timeseries

Here is a demo of the resulting plot:

Notice that the line plot is on a log scale. Incidence differs greatly between regions, so the log scale compresses the y-axis and keeps the lines for lower-incidence regions readable.
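As a rough illustration of what the log scale does, the sketch below takes the natural log of a few made-up incidence values. The values are hypothetical, and the use of the natural log is an assumption that fits the 0 to 7 axis domain in the chart code.

```python
import math

# Hypothetical incidence values spanning three orders of magnitude.
incidences = [1.0, 10.0, 100.0, 1000.0]

# Natural log of each value (assumed to be the transform behind inc_log).
logs = [math.log(v) for v in incidences]

for value, log_value in zip(incidences, logs):
    print(f"{value:>7.1f} -> {log_value:.2f}")
```

Each tenfold increase adds the same amount on the log axis, so a region with an incidence of 10 and a region with an incidence of 1,000 can share one readable chart.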

From the plot, we see that Sub-Saharan Africa has the highest incidence of malaria, followed by South Asia. However, I am happy to see that malaria incidence declines in all regions over the years.


Second, I want to know which countries have the most malaria deaths. A world map that shows the number of deaths visualizes this. Plotly Express scatter_geo is a quick and easy way to accomplish it, and the data needs minimal preprocessing. Here is the code to implement this:

import plotly.express as px

fig = px.scatter_geo(deaths,
                     locations="Code",
                     color="Entity",
                     size='deaths_per_100k',
                     size_max=40,
                     animation_frame ='Year',
                     hover_name = 'Entity',
                     title = 'Malaria Deaths Per 100,000 People between 1990 and 2016',
                     width = 1000, height = 700)
fig.update_layout(showlegend=False)
fig.show()

Here is the resulting animation.

In the animation, the bubble size corresponds to the number of malaria deaths per 100,000 people. The slider at the bottom animates the changes in deaths from 1990 to 2016. The plot also displays more granular details when we hover the mouse over a specific country.

From the animation, we see that Africa has the highest malaria death rates. However, we again see that malaria deaths decrease over time.
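The same question can also be checked in a table. Below is a minimal sketch on made-up rows that follow the column names used above. The entities, codes, and values are hypothetical, and dropping rows without a code is my assumption about how to exclude aggregate rows.

```python
import pandas as pd

# Hypothetical rows in the same shape as the deaths table; values are made up.
toy = pd.DataFrame({
    "Entity": ["A", "B", "A", "B", "Region X"],
    "Code": ["AAA", "BBB", "AAA", "BBB", None],
    "Year": [1990, 1990, 2016, 2016, 2016],
    "deaths_per_100k": [120.0, 80.0, 60.0, 70.0, 65.0],
})

# Keep only rows with a country code, since the map locates bubbles by code.
countries = toy.dropna(subset=["Code"])

# For each year, find the row with the largest death rate.
top_idx = countries.groupby("Year")["deaths_per_100k"].idxmax()
top_by_year = countries.loc[top_idx, ["Year", "Entity", "deaths_per_100k"]]
print(top_by_year)
```

A table like this complements the map: the animation shows the geographic pattern, while the grouped lookup names the top entity for each year.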


The third graph uses the malaria_deaths_age.csv data set. With this plot, I want to know whether younger or older people are more likely to die from malaria. Using the line function in Plotly Express, I make a line plot of malaria deaths from 1990 to 2015, separated by age group. Again, it is easy to implement in Plotly.

death_age_path = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths_age.csv"
death_age = pd.read_csv(death_age_path, index_col = 0)

c = 'World'
fig = px.line(death_age[death_age.entity == c],
              x='year',
              y='deaths',
              color='age_group',
              markers=True,
              title=f"Malaria Death by Age Group in The {c}",
              labels={'age_group': "Age Group"},
              width = 700, height = 400
             )
fig.show()

Here is the resulting plot:

The plot shows more details when the mouse hovers over the lines. From the plot, we see that children under 5 account for more malaria deaths than any other age group.
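To put a number on the comparison between age groups, one option is to compute each group's share of deaths per year. The sketch below uses the column names from the code above, but the age group labels and death counts are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same columns as the age data; values are made up.
toy = pd.DataFrame({
    "entity": ["World"] * 4,
    "year": [1990, 1990, 2016, 2016],
    "age_group": ["Under 5", "5-14", "Under 5", "5-14"],
    "deaths": [600.0, 200.0, 300.0, 100.0],
})

# Total deaths per year, aligned back to every row.
totals = toy.groupby("year")["deaths"].transform("sum")

# Each age group's share of that year's deaths.
toy["share"] = toy["deaths"] / totals

under5 = toy.loc[toy["age_group"] == "Under 5", ["year", "share"]]
print(under5)
```

Shares make the age comparison independent of the overall decline in deaths, which the line plot alone does not separate out.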

Refer to the complete code and notebook here!

Jiaman Betty Wu
Data Scientist
