Python graphics: Matplotlib fundamentals

Overview. We introduce and apply Python's popular graphics package, Matplotlib. We produce line plots, bar charts, scatterplots, and more. We do all this in Jupyter using a single notebook.

Python tools. Graphing with Matplotlib: dataframe plot methods, figure and axis objects.

Buzzwords. Data visualization

Applications. US GDP, GDP per capita and life expectancy, Fama-French asset returns, PISA math scores.

Code. Link.

Computer graphics are one of the great advances of the modern world. Graphs have always been helpful in describing data or concepts, and now they're a lot easier to produce. We've gotten so good at drawing pictures that we invented a new term for it: visualization. Done well, a graph tells us something new -- and gets us thinking about other things we'd like to know.

That's the good news. The bad news is that graphics are inherently complicated. Programs like Excel do their best to hide this fact, but if you ever try to customize a chart it quickly rears its ugly head. Have you ever spent a couple hours trying to fine-tune an Excel graph? More? The problem is that even simple graphs have lots of moving parts: the type (line, bar, scatter, etc); the color and thickness of lines, bars, or markers; title and axis labels; their location, fonts, and font sizes; tick marks (location, size); background color; grid lines (on or off); and so on. That's not an Excel problem, it's a problem with graphics in general.

Our goal here is to produce graphs with Matplotlib, Python's leading graphics package. There's a lot here, but don't panic, that's the nature of graphics. And it gets easier with experience.

One more thing before we start: Save the Jupyter notebook at the Code link above in your Data_Bootcamp directory/folder. The link goes to a display of the notebook; you need to click on the Raw button to get the real file. Be sure to download it as filetype ipynb.

Reminders

  • Packages. Collections of tools that extend Python's capabilities. We add them with import statements.

  • Pandas. Python's data management package. We typically add it to our programs with

    import pandas as pd

  • Objects and methods. Recall -- again! -- that we apply the method justdoit() to the object x with x.justdoit().

  • Dataframe. A data structure like a spreadsheet that includes a table of data plus row and column labels. Typically columns are variables and rows are observations. We get column labels for a dataframe df with df.columns and row labels with df.index.

  • Series. We express a single variable x in a dataframe df as df['x'], a series.

  • Reading spreadsheets. We "read" spreadsheet data into Python with the read_csv() and read_excel() functions in Pandas.
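As a quick refresher, we can try the reminders above on a tiny dataframe. This is a sketch only: the dataframe df, its variables x and y, and its row labels are made up for illustration.

```python
import pandas as pd

# hypothetical dataframe: two variables, three observations
df = pd.DataFrame({'x': [1, 2, 3], 'y': [10.0, 20.0, 30.0]},
                  index=['a', 'b', 'c'])

print(df.columns)       # column labels (the variable names)
print(df.index)         # row labels (the observations)
print(type(df['x']))    # a single variable is a series
print(df['x'].mean())   # apply the method mean() to the object df['x']
```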

Getting ready

We need to do a few things before we're ready to produce graphs.

Open the graphics notebook. If you followed instructions -- and we're confident you did -- you saved the notebook for this chapter in your Data_Bootcamp directory. Return to the Jupyter tab in your browser that points to that directory. Look for the file named bootcamp_graphics.ipynb. Click to open it. That will open the notebook in a new tab. The notebook will say at the top: "Python graphics: Matplotlib fundamentals" in large bold letters.

Import packages. We need to tell our program what packages we plan to use. The following code also checks their versions and prints the date:

import sys # system module
import pandas as pd # data package
import matplotlib as mpl # graphics package
import matplotlib.pyplot as plt # pyplot module
import datetime as dt # date and time module
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Today: ', dt.date.today())

All of these statements generally go at the top of our program -- right after the description.

Process data. We use three dataframes to illustrate Matplotlib graphics.

US GDP. The first one is several years of US GDP and Consumption. We got the numbers from FRED, but have written them out here for simplicity. The code is

gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
14783.8, 15020.6, 15369.2, 15710.3]
pce = [8867.6, 9208.2, 9531.8, 9821.7, 10041.6, 10007.2, 9847.0, 10036.3,
10263.5, 10449.7, 10699.7]
year = list(range(2003,2014)) # use range for years 2003-2013
# Note that we set the index
us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=year)
print(us)

Note that we created a dataframe from a dictionary. That's convenient here, but in most real applications we'll read in spreadsheets or access the data online through an "API".
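To give a flavor of the spreadsheet route, here is a sketch that reads the first three years of the same numbers from csv-formatted text. The text block stands in for a file on disk, an assumption we make so the example runs on its own; with a real file we would pass its name or url to read_csv(). The name us_small is ours.

```python
import io
import pandas as pd

# csv-formatted text standing in for a spreadsheet file (first three years only)
csv_text = """year,gdp,pce
2003,13271.1,8867.6
2004,13773.5,9208.2
2005,14234.2,9531.8
"""

# index_col=0 makes the first column (year) the index
us_small = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(us_small)
```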

World Bank. Our second dataframe contains 2013 data for GDP per capita (basically income per person) for several countries:

code = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
country = ['United States', 'France', 'Japan', 'China', 'India',
'Brazil', 'Mexico']
gdppc = [53.1, 36.9, 36.3, 11.9, 5.4, 15.0, 16.5]
wbdf = pd.DataFrame({'gdppc': gdppc, 'country': country}, index=code)
wbdf

In a notebook, the last line -- the dataframe name wbdf on its own -- results in the display of wbdf. That works as long as it's the last statement in the cell.

Fama-French returns. Our third dataframe consists of annual returns from our friends Fama and French:

import pandas_datareader.data as web # remote data access (formerly pandas.io.data)
ff = web.DataReader('F-F_Research_Data_factors', 'famafrench')[1]
ff.columns = ['xsm', 'smb', 'hml', 'rf']
ff['rm'] = ff['xsm'] + ff['rf']
ff = ff[['rm', 'rf']] # extract rm (market) and rf (riskfree)
ff.head(5)

This gives us a dataframe with two variables: rm is the return on the equity market overall and rf is the riskfree return.

Exercise. What kind of object is wbdf? What are its column and row labels?

Exercise. What is ff.index? What does that tell us?

Digression: Graphing in Excel

Before charging ahead, let's review how we would create what Excel calls a "chart". We need to choose:

  • Data. We would highlight a block of cells in a spreadsheet.

  • Chart type. Lines, bars, scatter plots, and so on.

  • x and y variables. Typically we graph some y variable -- or perhaps several of them -- against an x variable, with x on the horizontal axis and y on the vertical axis. We need to tell Excel which is which.

This might be followed by a long list of fine-tuning: what the lines look like, how the axes are labeled, and so on. We'll see the same in Matplotlib.

Two approaches to graphics in Matplotlib

Back to graphics. Python's leading graphics package is Matplotlib. Matplotlib can be used in a number of different ways:

  • Approach #1: Apply plot methods to dataframes.

  • Approach #2: Create figure objects and apply methods to them.

They call on similar functionality, but use different syntax to get it.

Approach #1: Apply plot methods to dataframes

The simplest way to produce graphics from a dataframe is to apply a plot method to it. Simple is good, we do this a lot.

If we compare this to Excel, we will see that a number of things are preset for us:

  • Data. By default (meaning, if we don't do anything to change it) the data consists of the whole dataframe.

  • Chart type. We'll see below that we have options for lines, bars, or other things.

  • x and y variables. By default, the x variable is the dataframe's index and the y variables are the columns of the dataframe -- all of them that can be plotted (e.g. columns with a numeric dtype).

We can change all of these things, just as we can in Excel, but that's the starting point.
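Here is a small sketch of these defaults at work, using the World Bank dataframe wbdf from above. It has one numeric column (gdppc) and one text column (country), so we would expect the plot method to draw a single line. The name labels is ours.

```python
import pandas as pd
import matplotlib.pyplot as plt

code = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
country = ['United States', 'France', 'Japan', 'China', 'India',
           'Brazil', 'Mexico']
gdppc = [53.1, 36.9, 36.3, 11.9, 5.4, 15.0, 16.5]
wbdf = pd.DataFrame({'gdppc': gdppc, 'country': country}, index=code)

ax = wbdf.plot()    # defaults: whole dataframe, line plot, index on the x axis

# list the variables that were actually plotted
labels = [line.get_label() for line in ax.get_lines()]
print(labels)       # the text column country is left out
```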

Example (line plot). Enter the statement us.plot() into a code cell and run it. This plots every column of the dataframe us as a line against the index, the year of the observation. The lines have different colors. We didn't ask for this, it's built in. A legend associates each variable name with a line color. This is also built in.

Example (single line plot). We just plotted all the variables -- all two of them -- in the dataframe us. To plot one line, we apply the same method to a single variable -- a series. The statement us['gdp'].plot() plots GDP alone. The first part -- us['gdp'] -- is the single variable GDP. The second part -- .plot() -- plots it.

Example (single line plot 2). In addition to getting a series from our dataframe and then plotting the series, we could also set the y argument when we call the plot method. The statement us.plot(y="gdp") will produce the same plot as us['gdp'].plot().
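If we want to convince ourselves that the two statements plot the same numbers, a sketch like this one compares the lines they draw. It borrows the fig and ax objects explained under Approach #2 below so that each plot gets its own figure; the names ax1, ax2, y1, and y2 are ours.

```python
import pandas as pd
import matplotlib.pyplot as plt

gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
       14783.8, 15020.6, 15369.2, 15710.3]
pce = [8867.6, 9208.2, 9531.8, 9821.7, 10041.6, 10007.2, 9847.0, 10036.3,
       10263.5, 10449.7, 10699.7]
us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=list(range(2003, 2014)))

fig1, ax1 = plt.subplots()
us['gdp'].plot(ax=ax1)       # series plot method

fig2, ax2 = plt.subplots()
us.plot(y='gdp', ax=ax2)     # dataframe plot method with the y argument

# compare the y values of the single line in each plot
y1 = ax1.get_lines()[0].get_ydata()
y2 = ax2.get_lines()[0].get_ydata()
print(list(y1) == list(y2))
```

One small difference we have noticed: the dataframe version adds a legend by default and the series version does not.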

Example (bar chart). The statement us.plot(kind='bar') produces a bar chart of the same data.

Example (scatter plot). In a scatter plot we need to be explicit about x and y. We'll use gdp as x and pce (consumption) as y. The general syntax for a dataframe df is df.plot.scatter(x,y). In this case we use

us.plot.scatter('gdp', 'pce')

The scatter here is not far from a straight line; evidently consumption and GDP go up and down together.
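One way to put a number on "go up and down together" is the correlation coefficient, which is close to one when two variables line up along an upward-sloping straight line. This sketch uses the Pandas series method corr(); it is a supplement to the graph, not a replacement for it.

```python
import pandas as pd

gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
       14783.8, 15020.6, 15369.2, 15710.3]
pce = [8867.6, 9208.2, 9531.8, 9821.7, 10041.6, 10007.2, 9847.0, 10036.3,
       10263.5, 10449.7, 10699.7]
us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=list(range(2003, 2014)))

# correlation between the two series
corr = us['gdp'].corr(us['pce'])
print('Correlation of GDP and consumption:', round(corr, 3))
```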

Exercise. Enter us.plot(kind='bar') and us.plot.bar() in separate cells. Show that they produce the same bar chart.

Exercise. Add each of these arguments, one at a time, to us.plot():

  • kind='area'

  • subplots=True

  • sharey=True

  • figsize=(3,6)

  • ylim=(0,16000)

What do they do?

Exercise. Type us.plot? in a new cell. Run the cell (shift-enter or click on the run cell icon). What options do you see for the kind= argument? Which ones have we tried? What are the other ones?

We can do similar things with the Fama-French dataframe ff. The basic plot statement is

ff.plot()

This has one series (the equity market return rm) that varies a lot and one (the riskfree return rf) that does not.

Let's think about the returns a little. What does the data tell us about them? That's an easier question to answer if we use a different plot. We like histograms because they describe all the outcomes in a convenient form. Try this code:

ff.plot(kind='hist', # histogram
bins=20, # 20 bins
subplots=True) # two separate subplots

It produces separate histograms of the two variables with 20 "bins" in each, as noted in the comments.

Exercise. Let's see if we can dress up the histogram a little. Try adding, one at a time, the arguments title='Fama-French returns', grid=True, and legend=False. What does the documentation say about them? What do they do?

Exercise. What do the histograms tell us about the two returns? How do they differ?

Exercise. Use the World Bank dataframe wbdf to create a bar chart of GDP per capita, the variable 'gdppc'. Bonus points: Create a horizontal bar chart. Which do you prefer?

Approach #2: Create figure objects and apply methods

This approach was mysterious to us at first, but it's now our favorite. The idea is to generate an object -- two objects, in fact -- and apply methods to them to produce the various elements of a graph: the data, their axes, their labels, and so on.

We do this -- as usual -- one step at a time.

Create objects. We'll see these two lines over and over:

import matplotlib.pyplot as plt # import pyplot module
fig, ax = plt.subplots() # create fig and ax objects

Note that we're using the pyplot function subplots(), which creates the objects fig and ax on the left. The subplots() function produces a blank figure, which is displayed in the Jupyter notebook. The names fig and ax can be anything, but these choices are standard.

We say fig is a figure object and ax is an axis object. (Try type(fig) and type(ax) to see why.) Once more, the words don't mean what we might think they mean:

  • fig is a blank canvas for creating a figure.

  • ax is everything in it: axes, labels, lines or bars, legend, and so on.

Once we have the objects, we apply methods to them to create graphs.
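A short sketch can make the relationship between the two objects concrete: the figure keeps a list of the axis objects it contains, and each axis object knows which figure it belongs to. The attributes fig.axes and ax.figure are part of Matplotlib.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

print(type(fig))          # the figure object
print(type(ax))           # the axis object
print(fig.axes)           # list of axis objects in the figure (just one here)
print(ax.figure is fig)   # the axis object points back to its figure
```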

Create graphs. We create graphs by applying plot-like methods to ax. We typically do this with dataframe plot methods:

fig, axe = plt.subplots() # create axis object axe
us.plot(ax=axe) # ax= looks for axis object, axe is it

(Note that we need to create and use the axis object in the same code cell.)
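To see that the plot really lands on the axis object we supplied, we can inspect what the plot method hands back. This is a sketch using the first three years of the US data; the name returned is ours.

```python
import pandas as pd
import matplotlib.pyplot as plt

# first three years of the US data, enough for a check
us = pd.DataFrame({'gdp': [13271.1, 13773.5, 14234.2],
                   'pce': [8867.6, 9208.2, 9531.8]},
                  index=[2003, 2004, 2005])

fig, axe = plt.subplots()        # create axis object axe
returned = us.plot(ax=axe)       # the plot method returns an axis object

print(returned is axe)           # is it the one we passed in?
print(len(axe.get_lines()))      # number of lines drawn on axe
```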

Example. Let's do the same with the Fama-French data:

fig, ax = plt.subplots()
ff.plot(ax=ax,
kind='line', # line plot
color=['blue', 'magenta'], # line color
title='Fama-French market and riskfree returns')

Exercise. Let's see if we can teach ourselves the rest:

  • Change the argument kind='line' to kind='bar' to convert this into a bar chart.

  • Add the argument alpha=0.65 to the bar chart. What does it do?

  • What would you change in the bar chart to make it look better? Use the help facility to find options that might help. Which ones appeal to you?

Exercise (somewhat challenging). Use the same approach to reproduce our earlier histograms of the Fama-French series.

Let's review

Take a deep breath. We've covered a lot of ground, it's time to recapitulate.

We looked at two ways to use Matplotlib. The pyplot function plot(x,y) gives us a third, which we have not covered:

  • Approach #1: Apply plot methods to dataframes.

  • Approach #2: Create fig, ax objects and apply plot methods to them.

  • The alternative: Use the plot(x,y) function.

This is what their syntax looks like applied to US GDP:

us['gdp'].plot() # Approach #1
fig, ax = plt.subplots() # Approach #2
us['gdp'].plot(ax=ax)
plt.plot(us.index, us['gdp']) # the alternative

Each one produces the same graph.

Which one should we use? Use Approach #2. Really. This is a case where choice is confusing.

We also suggest you not commit any of this to memory. If you end up using it a lot, you'll remember it. If you don't, it's not worth remembering. We typically start with examples anyway rather than creating new graphs from scratch.

Bells and whistles

We now know how to create graphs, but if we're honest with ourselves we'd admit they're a little basic. Fortunately, we just got started. We have a huge number of methods available for changing our plots in any way we wish: Add titles and axis labels, change axis limits, and many other things that haven't crossed our minds yet. Here's a short introduction.

Adding things to graphs. So far we've added things to our graph with arguments. Axis methods offer us a lot more flexibility. Consider these:

fig, ax = plt.subplots()
us.plot(ax=ax)
ax.set_title('US GDP and Consumption', fontsize=14, loc='left')
ax.set_ylabel('Billions of 2013 USD')
ax.legend(['GDP', 'Consumption']) # more descriptive variable names
ax.set_xlim(2002.5, 2013.5) # shrink x axis limits
ax.tick_params(labelcolor='red') # change tick labels to red

In this way we add a title (14-point type, left justified), add a label to the y axis, change the limits of the x axis, make the tick labels red, and use more descriptive names in the legend. The tick labels, in particular, are extremely ugly, but they illustrate the control we have over figures.

Exercise. Use the set_xlabel() method to add an x-axis label. What would you choose? Or would you prefer to leave it empty?

Exercise. Enter ax.legend? to access the documentation for the legend method. What options appeal to you?

Exercise. Change the line width to 2 and the line colors to blue and magenta. Hint: Use us.plot? to get the documentation.

Exercise (challenging). Use the set_ylim() method to start the y axis at zero. Hint: Use ax.set_ylim? to get the documentation.

Exercise. Create a line plot for the Fama-French dataframe ff that includes both returns. Bonus points: Add a title with the set_title method.

Multiple plots. We've produced, for the most part, single plots. But the same tools can produce multiple plots in one figure.

Here's an example that produces separate "subplots" of US GDP and consumption. We start by creating the objects:

fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
print('Object ax has dimension', len(ax))

The subplots statement asks for a graph with two rows (top and bottom) and one column. That is, two graphs, one on top of the other. The sharex=True argument makes the x axes the same. The print statement tells us "Object ax has dimension 2", one for the GDP graph, and one for the consumption graph.

Now do the same with content:

fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
us['gdp'].plot(ax=ax[0], color='green') # first plot
us['pce'].plot(ax=ax[1], color='red') # second plot

(Note that we start numbering the components of ax at zero, which should be getting familiar by now.) This gives us a double graph, with GDP at the top and consumption at the bottom. Put another way, the figure fig contains two axes (ax[0] and ax[1]) and each axis has one plot in it.
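The same idea extends to a grid. In this sketch we ask for two rows and two columns, in which case ax is a two-dimensional NumPy array and we pick out a component by row and column. The titles and grid lines are there only to show that each component is an ordinary axis object.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=2, ncols=2)

print(type(ax))      # a NumPy array of axis objects
print(ax.shape)      # rows and columns of the grid

ax[0, 1].set_title('top right')     # index by (row, column), starting at zero
ax[1, 0].set_title('bottom left')

for a in ax.flatten():              # loop over all four axis objects
    a.grid(True)
```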

Examples

We conclude with examples that take data from the previous chapter and make better graphs than we did there.

PISA test scores. Recall that we had a simple plot, but it didn't look very good. The code was

import pandas as pd
import matplotlib.pyplot as plt
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
skiprows=18, # skip the first 18 rows
skipfooter=7, # skip the last 7
usecols=[0,1,9,13], # select columns of interest (formerly parse_cols)
index_col=0, # set the index as the first column
header=[0,1] # set the variable names
)
pisa = pisa.dropna() # drop blank lines
pisa.columns = ['Math', 'Reading', 'Science'] # simplify variable names
fig, ax = plt.subplots()
pisa['Math'].plot(kind='barh', ax=ax) # create bar chart

Comment. Yikes! That's horrible! What can we do about it? Any suggestions?

The problem seems to be that the bars and labels are squeezed together, so perhaps we should make the figure taller. We set the figure's dimensions with the argument figsize=(width, height). The sizes are measured in inches, which get shrunk a bit when we display them in Jupyter. Here's a version with a much larger height that we discovered by experimenting:

fig, ax = plt.subplots()
pisa['Math'].plot(kind='barh', ax=ax, figsize=(4,13))
ax.set_title('PISA Math Score', loc='left')

This creates a figure that is 4 inches wide and 13 inches tall. We added a title, too, to be clear about what we have. The title is left justified.
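The figure's size can also be set when we create the objects, and checked or changed afterwards. Here is a minimal sketch with an empty figure; figsize, get_size_inches(), and set_size_inches() are all part of Matplotlib.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 13))   # (width, height) in inches
print(fig.get_size_inches())              # check the size

fig.set_size_inches(4, 6)                 # change it after the fact
print(fig.get_size_inches())
```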

Here's a more advanced version in which we made the US bar red. This is ridiculously complicated, but we used our Google fu and found a solution. (Remember: The solution to many programming problems is a combination of Google fu and patience.) The code is

fig, ax = plt.subplots()
pisa['Math'].plot(ax=ax, kind='barh', figsize=(4,13))
ax.set_title('PISA Math Score', loc='left')
ax.get_children()[36].set_color('r')

The 36 comes from experimenting. We count from the bottom starting with zero.
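A possible way to avoid the experimenting is to look up the bar's position from the index rather than counting. The sketch below uses the World Bank numbers from earlier in the chapter instead of the PISA data so that it runs without a download. It assumes the axis object contains only these bars, so that ax.patches lists them in the same order as the rows. The names s and pos are ours.

```python
import pandas as pd
import matplotlib.pyplot as plt

code = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
gdppc = [53.1, 36.9, 36.3, 11.9, 5.4, 15.0, 16.5]
s = pd.Series(gdppc, index=code)

fig, ax = plt.subplots()
s.plot(kind='barh', ax=ax)

pos = list(s.index).index('USA')   # position of the US row in the index
ax.patches[pos].set_color('r')     # ax.patches holds the bars in row order
```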

World Bank data. Our second example comes from using the World Bank's API, which gives us access to a huge amount of data for countries. We use it to produce two kinds of graphs and illustrate some tools we haven't seen yet:

  • Bar charts of GDP and GDP per capita

  • Scatter plot (bubble plot) of life expectancy v GDP per capita

We start with the data:

# load packages (redundancy is ok)
import pandas as pd # data management tools
from pandas_datareader import wb # World Bank api (formerly pandas.io.wb)
import matplotlib.pyplot as plt # plotting tools
# variable list (GDP per capita, GDP, life expectancy)
var = ['NY.GDP.PCAP.PP.KD', 'NY.GDP.MKTP.PP.KD', 'SP.DYN.LE00.IN']
# country list (ISO codes)
iso = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
year = 2013
# get data from World Bank
df = wb.download(indicator=var, country=iso, start=year, end=year)
# munge data
df = df.reset_index(level='year', drop=True)
df.columns = ['gdppc', 'gdp', 'life'] # rename variables
df['pop'] = df['gdp']/df['gdppc'] # population
df['gdp'] = df['gdp']/10**12 # convert to trillions
df['gdppc'] = df['gdppc']/10**3 # convert to thousands
df['order'] = [5, 3, 1, 4, 2, 6, 0] # reorder countries
df = df.sort_values(by='order', ascending=False)
df

Note that the index here is the country name -- that will label the bars in our charts.

Here's a horizontal bar chart for (total) GDP:

fig, ax = plt.subplots()
df['gdp'].plot(ax=ax, kind='barh', alpha=0.5)
ax.set_title('GDP', loc='left', fontsize=14)
ax.set_xlabel('Trillions of US Dollars')
ax.set_ylabel('')

What do you see? What's the takeaway?

We think the horizontal bar chart looks better than the usual vertical bar chart, which we'd get if we replaced barh above with bar. (Try it and see what you think.)

Here's a similar chart for GDP per capita:

fig, ax = plt.subplots()
df['gdppc'].plot(ax=ax, kind='barh', color='m', alpha=0.5)
ax.set_title('GDP Per Capita', loc='left', fontsize=14)
ax.set_xlabel('Thousands of US Dollars')
ax.set_ylabel('')

What do you see here? What's the takeaway?

And just because it's fun, here's an example of Tufte-like axes from Matplotlib examples:

fig, ax = plt.subplots()
df['gdppc'].plot(ax=ax, kind='barh', color='b', alpha=0.5)
ax.set_title('GDP Per Capita', loc='left', fontsize=14)
ax.set_xlabel('Thousands of US Dollars')
ax.set_ylabel('')
# Tufte-like axes
ax.spines['left'].set_position(('outward', 10))
ax.spines['bottom'].set_position(('outward', 10))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')

This gives us axes on the left and bottom only, separated slightly from the bars. It's another illustration of the benefits of Google fu.

We finish off with a bubble plot: a scatter plot in which the size of the dots ("bubbles") varies with a third variable. (Count them: we have x on the horizontal axis, y on the vertical axis, and a third variable represented by the size of the bubble.) From a technical perspective, this is simply another argument in a scatter plot. Here's an example in which x is GDP per capita, y is life expectancy, and the third variable is population:

fig, ax = plt.subplots()
ax.scatter(df['gdppc'], df['life'], # x,y variables
s=df['pop']/10**6, # size of bubbles
alpha=0.5)
ax.set_title('Life expectancy vs. GDP per capita', loc='left', fontsize=14)
ax.set_xlabel('GDP Per Capita')
ax.set_ylabel('Life Expectancy')
ax.text(58, 66, 'Bubble size represents population', horizontalalignment='right')

The only odd thing is the 10**6 "scaling" in the s= argument. The bubble size is a little tricky to calibrate. Without the scaling, the bubbles are larger than the graph. We played around until they looked reasonable.
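Some background that may help with the calibration: the s argument of scatter is the area of each marker, measured in points squared. One alternative to a hand-picked divisor is to scale sizes relative to the largest value, as in this sketch. The dataframe toy and the choice max_area = 1000 are made up for illustration; they are not real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# toy values for illustration only, not real data
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0],
                    'y': [3.0, 1.0, 2.0],
                    'pop': [50e6, 200e6, 1000e6]})

max_area = 1000                                    # area of the largest bubble
sizes = toy['pop'] / toy['pop'].max() * max_area   # scale relative to the largest

fig, ax = plt.subplots()
bubbles = ax.scatter(toy['x'], toy['y'], s=sizes, alpha=0.5)
print(bubbles.get_sizes())                         # marker areas actually used
```

With this approach the largest bubble always has the same size, whatever the units of the third variable.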

Styles

Ok, we lied, that wasn't the conclusion. But we think this is fun, and it's optional in any case.

Matplotlib has a lot of basic settings for graphs. If we find some we like, we can set them once and be done with it. Or we can use some of their preset combinations, which they call styles.

We'll start with one of the bar charts we produced with World Bank data:

fig, ax = plt.subplots()
df['gdp'].plot(ax=ax, kind='barh', alpha=0.5)
ax.set_title('GDP', loc='left', fontsize=14)
ax.set_xlabel('Trillions of US Dollars')
ax.set_ylabel('')

Now recreate the same graph with this statement at the top:

plt.style.use('fivethirtyeight')

Once we execute this statement, it stays executed, but we'll change it back at the end.

Here's another one, for fans of the popular xkcd webcomic:

plt.xkcd()
fig, ax = plt.subplots()
df['gdp'].plot(ax=ax, kind='barh', alpha=0.5)
ax.set_title('GDP', loc='left', fontsize=14)
ax.set_xlabel('Trillions of US Dollars')
ax.set_ylabel('')

Note the wiggly lines, perfect for suggesting a hand-drawn graph.

Exercise. Try one of these styles: ggplot, bmh, dark_background, and grayscale. Which ones do you like? Why?

When we're done, we reset the style with these two lines in a code cell:

mpl.rcParams.update(mpl.rcParamsDefault)
%matplotlib inline
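If we only want a style for one graph, an alternative that avoids the reset is to apply it temporarily with plt.style.context(). Here is a sketch with toy data; we check one setting, the axis background color, before, inside, and after the block to confirm that the style does not stick.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

print(plt.style.available)            # names of the preset styles

before = mpl.rcParams['axes.facecolor']
with plt.style.context('ggplot'):     # style applies inside this block only
    inside = mpl.rcParams['axes.facecolor']
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])     # toy data
after = mpl.rcParams['axes.facecolor']

print(before, inside, after)          # setting before, inside, and after
```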

Review

Consider the data from Randal Olson's blog post:

import pandas as pd
data = {'Food': ['French Fries', 'Potato Chips', 'Bacon', 'Pizza', 'Chili Dog'],
'Calories per 100g': [607, 542, 533, 296, 260]}
cals = pd.DataFrame(data)

The dataframe cals contains the calories in 100 grams of several different foods.

Exercise. We'll create and modify visualizations of this data:

  • Set 'Food' as the index of cals.

  • Create a bar chart with cals using figure and axis objects.

  • Add a title.

  • Change the color of the bars. What color do you prefer?

  • Add the argument alpha=0.5. What does it do?

  • Change your chart to a horizontal bar chart. Which do you prefer?

  • Challenging. Eliminate the legend.

  • Challenging. Skim the top of Olson's blog post. What do you see that you'd like to imitate?

Resources

We haven't found many non-technical resources on Matplotlib we like, but these are pretty good:

  • One of the best is Matplotlib's gallery of examples. It's a good starting point for learning new things. Find an example you like, download the code, and adapt it to your needs. We also like the Pandas summary of dataframe methods.

  • The documentation of Pandas plot methods is also pretty good.

  • The SciPy lectures are good overall. The Matplotlib section focusses on plot(x,y), which wouldn't be our choice, but the content is very good.

  • Randal Olson has lots of good examples on his blog.

If you find others you like, let us know.