**Overview.** We introduce and apply Python's popular graphics package, Matplotlib. We produce line plots, bar charts, scatterplots, and more. We do all this in Jupyter using a single notebook.

**Python tools.** Graphing with Matplotlib: dataframe plot methods, figure and axis objects.

**Buzzwords.** Data visualization

**Applications.** US GDP, GDP per capita and life expectancy, Fama-French asset returns, PISA math scores.

**Code.** Link.

Computer graphics are one of the great advances of the modern world. Graphs have always been helpful in describing data or concepts, and now they're a lot easier to produce. We've gotten so good at drawing pictures that we invented a new term for it: **visualization**. Done well, a graph tells us something new -- and gets us thinking about other things we'd like to know.

That's the good news. The bad news is that graphics are inherently complicated. Programs like Excel do their best to hide this fact, but if you ever try to customize a chart it quickly rears its ugly head. Have you ever spent a couple hours trying to fine-tune an Excel graph? More? The problem is that even simple graphs have lots of moving parts: the type (line, bar, scatter, etc); the color and thickness of lines, bars, or markers; title and axis labels; their location, fonts, and font sizes; tick marks (location, size); background color; grid lines (on or off); and so on. That's not an Excel problem, it's a problem with graphics in general.

Our goal here is to produce graphs with **Matplotlib**, Python's leading graphics package. There's a lot here, but don't panic, that's the nature of graphics. And it gets easier with experience.

One more thing before we start: **Save the Jupyter notebook** at the Code link above in your `Data_Bootcamp` directory/folder. The link goes to a display of the notebook; you need to click on the Raw button to get the real file. Be sure to download it as filetype ipynb.

* Packages. Collections of tools that extend Python's capabilities. We add them with `import` statements.
* Pandas. Python's data management package. We typically add it to our programs with `import pandas as pd`.
* Objects and methods. Recall -- again! -- that we apply the method `justdoit()` to the object `x` with `x.justdoit()`.
* Dataframe. A data structure like a spreadsheet that includes a table of data plus row and column labels. Typically columns are variables and rows are observations. We get column labels for a dataframe `df` with `df.columns` and row labels with `df.index`.
* Series. We express a single variable `x` in a dataframe `df` as `df['x']`, a series.
* Reading spreadsheets. We "read" spreadsheet data into Python with the `read_csv()` and `read_excel()` functions in Pandas.
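
Here's a quick sketch that pulls these reminders together. The dataframe and its contents are made up for illustration; nothing in it comes from the chapter's data.

```python
import pandas as pd

# A small made-up dataframe: two variables (columns), three observations (rows)
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10, 20, 30]},
                  index=['a', 'b', 'c'])

print(df.columns)        # column labels
print(df.index)          # row labels
print(type(df['x']))     # a single variable is a series
print(df['x'].mean())    # apply the method mean() to the object df['x']
```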

We need to do a few things before we're ready to produce graphs.

**Open the graphics notebook.** If you followed instructions -- and we're confident you did -- you saved the notebook for this chapter in your `Data_Bootcamp` directory. Return to the Jupyter tab in your browser that points to that directory. Look for the file named `bootcamp_graphics.ipynb`. Click to open it. That will open the notebook in a new tab. The notebook will say at the top: "Python graphics: Matplotlib fundamentals" in large bold letters.

**Import packages.** We need to tell our program what packages we plan to use. The following code also checks their versions and prints the date:

    import sys                             # system module
    import pandas as pd                    # data package
    import matplotlib as mpl               # graphics package
    import matplotlib.pyplot as plt        # pyplot module
    import datetime as dt                  # date and time module

    # check versions (overkill, but why not?)
    print('Python version:', sys.version)
    print('Pandas version: ', pd.__version__)
    print('Matplotlib version: ', mpl.__version__)
    print('Today: ', dt.date.today())

All of these statements generally go at the top of our program -- right after the description.

**Process data.** We use three dataframes to illustrate Matplotlib graphics.

*US GDP.* The first one is several years of US GDP and Consumption. We got the numbers from FRED, but have written them out here for simplicity. The code is

    gdp  = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
            14783.8, 15020.6, 15369.2, 15710.3]
    pce  = [8867.6, 9208.2, 9531.8, 9821.7, 10041.6, 10007.2, 9847.0, 10036.3,
            10263.5, 10449.7, 10699.7]
    year = list(range(2003, 2014))        # use range for years 2003-2013

    # Note that we set the index
    us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=year)
    print(us)

Note that we created a dataframe from a dictionary. That's convenient here, but in most real applications we'll read in spreadsheets or access the data online through an "API".

*World Bank.* Our second dataframe contains 2013 data for GDP per capita (basically income per person) for several countries:

    code    = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
    country = ['United States', 'France', 'Japan', 'China', 'India',
               'Brazil', 'Mexico']
    gdppc   = [53.1, 36.9, 36.3, 11.9, 5.4, 15.0, 16.5]

    wbdf = pd.DataFrame({'gdppc': gdppc, 'country': country}, index=code)
    wbdf

In a notebook, the last line -- the dataframe name `wbdf` on its own -- results in the display of `wbdf`. That works as long as it's the last statement in the cell.

*Fama-French returns.* Our third dataframe consists of annual returns from our friends Fama and French:

    import pandas_datareader.data as web     # formerly pandas.io.data, since removed from Pandas
    ff = web.DataReader('F-F_Research_Data_factors', 'famafrench')[1]
    ff.columns = ['xsm', 'smb', 'hml', 'rf']
    ff['rm'] = ff['xsm'] + ff['rf']
    ff = ff[['rm', 'rf']]                    # extract rm (market) and rf (riskfree)
    ff.head(5)

This gives us a dataframe with two variables: `rm` is the return on the equity market overall and `rf` is the riskfree return.

**Exercise.** What kind of object is `wbdf`? What are its column and row labels?

**Exercise.** What is `ff.index`? What does that tell us?

Before charging ahead, let's review how we would create what Excel calls a "chart". We need to choose:

* Data. We would highlight a block of cells in a spreadsheet.
* Chart type. Lines, bars, scatter plots, and so on.
* `x` and `y` variables. Typically we graph some `y` variable -- or perhaps several of them -- against an `x` variable, with `x` on the horizontal axis and `y` on the vertical axis. We need to tell Excel which is which.

This might be followed by a long list of fine-tuning: what the lines look like, how the axes are labeled, and so on. We'll see the same in Matplotlib.

Back to graphics. Python's leading graphics package is **Matplotlib**. Matplotlib can be used in a number of different ways:

* Approach #1: Apply plot methods to dataframes.
* Approach #2: Use the `plot(x,y)` function.
* Approach #3: Create figure objects and apply methods to them.

They call on similar functionality, but use different syntax to get it.

The simplest way to produce graphics from a dataframe is to apply a plot method to it. Simple is good, we do this a lot.

If we compare this to Excel, we will see that a number of things are preset for us:

* Data. By default (meaning, if we don't do anything to change it) the data consists of the whole dataframe.
* Chart type. We'll see below that we have options for lines, bars, or other things.
* `x` and `y` variables. By default, the `x` variable is the dataframe's index and the `y` variables are the columns of the dataframe -- all of them that can be plotted (e.g. columns with a numeric dtype).

We can change all of these things, just as we can in Excel, but that's the starting point.
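
To see these defaults at work, here is a minimal sketch with a made-up dataframe `toy` (our name, not part of the chapter's data). It has two numeric columns and one text column. We'd expect the plot method to take `x` from the index and to draw lines for the numeric columns only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: two numeric columns and one text column
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1], 'label': ['x', 'y', 'z']},
                   index=[2001, 2002, 2003])

ax = toy.plot()                   # the plot method returns the axis object it drew on
lines = ax.get_lines()            # the lines that were drawn
print(len(lines))                 # how many columns were plotted
print(lines[0].get_xdata())       # x values used for the first line
```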

**Example (line plot).** Enter the statement `us.plot()` into a code cell and run it. This plots every column of the dataframe `us` as a line against the index, the year of the observation. The lines have different colors. We didn't ask for this, it's built in. A legend associates each variable name with a line color. This is also built in.

**Example (single line plot).** We just plotted all the variables -- all two of them -- in the dataframe `us`. To plot one line, we apply the same method to a single variable -- a series. The statement `us['gdp'].plot()` plots GDP alone. The first part -- `us['gdp']` -- is the single variable GDP. The second part -- `.plot()` -- plots it.

**Example (single line plot 2).** In addition to getting a series from our dataframe and then plotting the series, we could also set the `y` argument when we call the plot method. The statement `us.plot(y="gdp")` will produce the same plot as `us['gdp'].plot()`.

**Example (bar chart).** The statement `us.plot(kind='bar')` produces a bar chart of the same data.

**Example (scatter plot).** In a scatter plot we need to be explicit about `x` and `y`. We'll use `gdp` as `x` and `pce` (consumption) as `y`. The general syntax for a dataframe `df` is `df.plot.scatter(x,y)`. In this case we use

    us.plot.scatter('gdp', 'pce')

The scatter here is not far from a straight line; evidently consumption and GDP go up and down together.
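
If we want a number to go with the picture, one option (our addition, not part of the original notebook) is the correlation between the two series, which Pandas computes with the series method `corr()`. A value close to one is consistent with points that lie near an upward-sloping line.

```python
import pandas as pd

# Same US data as above
gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
       14783.8, 15020.6, 15369.2, 15710.3]
pce = [8867.6, 9208.2, 9531.8, 9821.7, 10041.6, 10007.2, 9847.0, 10036.3,
       10263.5, 10449.7, 10699.7]
us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=list(range(2003, 2014)))

corr = us['gdp'].corr(us['pce'])    # correlation between GDP and consumption
print(round(corr, 3))
```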

**Exercise.** Enter `us.plot(kind='bar')` and `us.plot.bar()` in separate cells. Show that they produce the same bar chart.

**Exercise.** Add each of these arguments, one at a time, to `us.plot()`:

* `kind='area'`
* `subplots=True`
* `sharey=True`
* `figsize=(3,6)`
* `ylim=(0,16000)`

What do they do?

**Exercise.** Type `us.plot?` in a new cell. Run the cell (shift-enter or click on the run cell icon). What options do you see for the `kind=` argument? Which ones have we tried? What are the other ones?

We can do similar things with the Fama-French dataframe `ff`. The basic plot statement is

    ff.plot()

This has one series (the equity market return `rm`) that varies a lot and one (the riskfree return `rf`) that does not.

Let's think about the returns a little. What does the data tell us about them? That's an easier question to answer if we use a different plot. We like histograms because they describe all the outcomes in a convenient form. Try this code:

    ff.plot(kind='hist',          # histogram
            bins=20,              # 20 bins
            subplots=True)        # two separate subplots

It produces separate histograms of the two variables with 20 "bins" in each, as noted in the comments.
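
The Fama-French data requires an internet connection, so here's a sketch of the same plot statement applied to made-up data. The dataframe `fake`, its column names, and the random numbers are assumptions for illustration; only the `plot` arguments mirror the code above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)      # seeded so the numbers are repeatable
fake = pd.DataFrame({'volatile': rng.normal(8, 20, size=90),    # varies a lot
                     'steady': rng.normal(3, 1, size=90)})      # varies little

# with subplots=True the plot method returns one axis object per column
axes = fake.plot(kind='hist', bins=20, subplots=True)
print(len(axes))                    # number of subplots
```

Because the result is a collection of axis objects, we can go on to modify each histogram separately if we like.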

**Exercise.** Let's see if we can dress up the histogram a little. Try adding, one at a time, the arguments `title='Fama-French returns'`, `grid=True`, and `legend=False`. What does the documentation say about them? What do they do?

**Exercise.** What do the histograms tell us about the two returns? How do they differ?

**Exercise.** Use the World Bank dataframe `wbdf` to create a bar chart of GDP per capita, the variable `'gdppc'`. *Bonus points:* Create a horizontal bar chart. Which do you prefer?

This approach -- creating figure and axis objects -- was mysterious to us at first, but it's now our favorite. The idea is to generate an object -- two objects, in fact -- and apply methods to them to produce the various elements of a graph: the data, their axes, their labels, and so on.

We do this -- as usual -- one step at a time.

**Create objects.** We'll see these two lines over and over:

    import matplotlib.pyplot as plt   # import pyplot module
    fig, ax = plt.subplots()          # create fig and ax objects

Note that we're using the pyplot function `subplots()`, which creates the objects `fig` and `ax` on the left. The `subplots()` function produces a blank figure, which is displayed in the Jupyter notebook. The names `fig` and `ax` can be anything, but these choices are standard.

We say `fig` is a **figure object** and `ax` is an **axis object**. (Try `type(fig)` and `type(ax)` to see why.) Once more, the words don't mean what we might think they mean:

* `fig` is a blank canvas for creating a figure.
* `ax` is everything in it: axes, labels, lines or bars, legend, and so on.
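
Here's a short sketch of that check. The exact text that `type()` prints varies across Matplotlib versions, so we also test the objects against the classes `Figure` and `Axes`, which we import explicitly for the purpose.

```python
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
from matplotlib.axes import Axes

fig, ax = plt.subplots()

print(type(fig))                    # the figure class
print(type(ax))                     # the axis class (Matplotlib calls it Axes)
print(isinstance(fig, Figure))      # is fig a figure object?
print(isinstance(ax, Axes))         # is ax an axis object?
print(ax.figure is fig)             # the axis knows which figure it belongs to
```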

Once we have the objects, we apply methods to them to create graphs.

**Create graphs.** We create graphs by applying plot-like methods to `ax`. We typically do this with dataframe plot methods:

    fig, axe = plt.subplots()     # create axis object axe
    us.plot(ax=axe)               # ax= looks for axis object, axe is it

(Note again that we need to create and use the axis object in the same code cell.)
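
Dataframe plot methods aren't the only option: axis objects have plot methods of their own. Here's a minimal sketch that draws GDP with Matplotlib's `ax.plot()` directly, using plain lists rather than a dataframe. It's offered as an illustration of "plot-like methods", not as the approach the notebook takes.

```python
import matplotlib.pyplot as plt

years = list(range(2003, 2014))
gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
       14783.8, 15020.6, 15369.2, 15710.3]

fig, ax = plt.subplots()              # create figure and axis objects
ax.plot(years, gdp, label='gdp')      # Matplotlib's own plot method, applied to ax
ax.legend()                           # legend uses the label we supplied
```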

**Example.** Let's do the same with the Fama-French data:

    fig, ax = plt.subplots()
    ff.plot(ax=ax,
            kind='line',                 # line plot
            color=['blue', 'magenta'],   # line color
            title='Fama-French market and riskfree returns')

**Exercise.** Let's see if we can teach ourselves the rest:

* Add the argument `kind='bar'` to convert this into a bar chart.
* Add the argument `alpha=0.65` to the bar chart. What does it do?
* What would you change in the bar chart to make it look better? Use the help facility to find options that might help. Which ones appeal to you?

**Exercise (somewhat challenging).** Use the same approach to reproduce our earlier histograms of the Fama-French series.

Take a deep breath. We've covered a lot of ground, so it's time to recapitulate.

We looked at three ways to use Matplotlib:

* Approach #1: Apply plot methods to dataframes.
* Approach #2: Use the `plot(x,y)` function.
* Approach #3: Create `fig, ax` objects and apply plot methods to them.

This is what their syntax looks like applied to US GDP:

    us['gdp'].plot()                   # Approach #1

    plt.plot(us.index, us['gdp'])      # Approach #2

    fig, ax = plt.subplots()           # Approach #3
    us['gdp'].plot(ax=ax)

Each one produces the same graph.
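
One way to convince ourselves is to look at the data behind each line. This sketch runs the three approaches in separate figures and prints the first few `x` and `y` values of the line each one draws. The names `ax1`, `ax2`, and `ax3`, and the call to `plt.figure()` that keeps the second plot from landing on the first, are our additions.

```python
import pandas as pd
import matplotlib.pyplot as plt

gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
       14783.8, 15020.6, 15369.2, 15710.3]
us = pd.DataFrame({'gdp': gdp}, index=list(range(2003, 2014)))

ax1 = us['gdp'].plot()                 # Approach #1 (returns the axis it drew on)

plt.figure()                           # start a fresh figure
plt.plot(us.index, us['gdp'])          # Approach #2
ax2 = plt.gca()                        # grab the axis pyplot just used

fig3, ax3 = plt.subplots()             # Approach #3
us['gdp'].plot(ax=ax3)

for ax in (ax1, ax2, ax3):
    line = ax.get_lines()[0]           # the one line on each axis
    print(list(line.get_xdata())[:3], list(line.get_ydata())[:3])
```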

Which one should we use? **Use Approach #3.** Really. This is a case where choice is confusing.

We also suggest you not commit any of this to memory. If you end up using it a lot, you'll remember it. If you don't, it's not worth remembering. We typically start with examples anyway rather than creating new graphs from scratch.

We now know how to create graphs, but if we're honest with ourselves we'd admit they're a little basic. Fortunately, we just got started. We have a huge number of methods available for changing our plots in any way we wish: Add titles and axis labels, change axis limits, and many other things that haven't crossed our minds yet. Here's a short introduction.

**Adding things to graphs.** So far we've added things to our graph with arguments. Axis methods offer us a lot more flexibility. Consider these:

    fig, ax = plt.subplots()

    us.plot(ax=ax)
    ax.set_title('US GDP and Consumption', fontsize=14, loc='left')
    ax.set_ylabel('Billions of 2013 USD')
    ax.legend(['GDP', 'Consumption'])     # more descriptive variable names
    ax.set_xlim(2002.5, 2013.5)           # shrink x axis limits
    ax.tick_params(labelcolor='red')      # change tick labels to red

In this way we add a title (14-point type, left justified), add a label to the y axis, change the limits of the x axis, make the tick labels red, and use more descriptive names in the legend. The tick labels, in particular, are extremely ugly, but they illustrate the control we have over figures.
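
Most `set_` methods have a matching `get_` method, which is handy for checking what a figure currently contains. Here's a small sketch (our addition) that sets a few of the same properties and reads them back.

```python
import pandas as pd
import matplotlib.pyplot as plt

gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
       14783.8, 15020.6, 15369.2, 15710.3]
pce = [8867.6, 9208.2, 9531.8, 9821.7, 10041.6, 10007.2, 9847.0, 10036.3,
       10263.5, 10449.7, 10699.7]
us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=list(range(2003, 2014)))

fig, ax = plt.subplots()
us.plot(ax=ax)
ax.set_title('US GDP and Consumption', fontsize=14, loc='left')
ax.set_ylabel('Billions of 2013 USD')
ax.set_xlim(2002.5, 2013.5)

print(ax.get_title(loc='left'))     # the left-justified title we just set
print(ax.get_ylabel())              # the y axis label
print(ax.get_xlim())                # the x axis limits
```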

**Exercise.** Use the `set_xlabel()` method to add an x-axis label. What would you choose? Or would you prefer to leave it empty?

**Exercise.** Enter `ax.legend?` to access the documentation for the `legend` method. What options appeal to you?

**Exercise.** Change the line width to 2 and the line colors to blue and magenta. *Hint:* Use `us.plot?` to get the documentation.

**Exercise (challenging).** Use the `set_ylim()` method to start the `y` axis at zero. *Hint:* Use `ax.set_ylim?` to get the documentation.

**Exercise.** Create a line plot for the Fama-French dataframe `ff` that includes both returns. *Bonus points:* Add a title with the `set_title` method.

**Multiple plots.** We've produced, for the most part, single plots. But the same tools can produce multiple plots in one figure.

Here's an example that produces separate "subplots" of US GDP and consumption. We start by creating the objects:

    fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
    print('Object ax has dimension', len(ax))

The `subplots` statement asks for a graph with two rows (top and bottom) and one column. That is, two graphs, one on top of the other. The `sharex=True` argument makes the `x` axes the same. The `print` statement tells us "Object ax has dimension 2", one for the GDP graph, and one for the consumption graph.

Now do the same with content:

    fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)

    us['gdp'].plot(ax=ax[0], color='green')   # first plot
    us['pce'].plot(ax=ax[1], color='red')     # second plot

(Note that we start numbering the components of `ax` at zero, which should be getting familiar by now.) This gives us a double graph, with GDP at the top and consumption at the bottom. Put another way, the figure `fig` contains two axes (`ax[0]` and `ax[1]`) and each axis has one plot in it.
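
The same idea extends to grids. With more than one row and more than one column, `ax` becomes a two-dimensional array and we pick out a subplot with a row and a column number. A minimal sketch with made-up data:

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]                                   # made-up data

fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True)
print(ax.shape)                                    # rows and columns of axis objects

ax[0, 0].plot(x, [v for v in x])                   # top left
ax[0, 1].plot(x, [v**2 for v in x])                # top right
ax[1, 0].plot(x, [-v for v in x])                  # bottom left
ax[1, 1].plot(x, [v**3 for v in x])                # bottom right
```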

We conclude with examples that take data from the previous chapter and make better graphs than we did there.

**PISA test scores.** Recall that we had a simple plot, but it didn't look very good. The code was

    import pandas as pd
    import matplotlib.pyplot as plt

    url = 'http://dx.doi.org/10.1787/888932937035'
    pisa = pd.read_excel(url,
                         skiprows=18,             # skip the first 18 rows
                         skipfooter=7,            # skip the last 7
                         usecols=[0,1,9,13],      # select columns of interest (formerly parse_cols)
                         index_col=0,             # set the index as the first column
                         header=[0,1]             # set the variable names
                         )
    pisa = pisa.dropna()                          # drop blank lines
    pisa.columns = ['Math', 'Reading', 'Science'] # simplify variable names

    fig, ax = plt.subplots()
    pisa['Math'].plot(kind='barh', ax=ax)         # create bar chart

**Comment.** Yikes! That's horrible! What can we do about it? Any suggestions?

The problem seems to be that the bars and labels are squeezed together, so perhaps we should make the figure taller. We set the figure's dimensions with the argument `figsize=(width, height)`. The sizes are measured in inches, which get shrunk a bit when we display them in Jupyter. Here's a version with a much larger `height` that we discovered by experimenting:

    fig, ax = plt.subplots()
    pisa['Math'].plot(kind='barh', ax=ax, figsize=(4,13))
    ax.set_title('PISA Math Score', loc='left')

This creates a figure that is 4 inches wide and 13 inches tall. We added a title, too, to be clear about what we have. The title is left justified.

Here's a more advanced version in which we made the US bar red. This is ridiculously complicated, but we used our Google fu and found a solution. (Remember: The solution to many programming problems is a combination of Google fu and patience.) The code is

    fig, ax = plt.subplots()
    pisa['Math'].plot(ax=ax, kind='barh', figsize=(4,13))
    ax.set_title('PISA Math Score', loc='left')
    ax.get_children()[36].set_color('r')

The `36` comes from experimenting. We count from the bottom starting with zero.
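
A possible alternative to counting by hand is to look up the bar's position from the index label. This sketch is our suggestion rather than the notebook's method. It uses made-up scores (not the PISA data) and assumes the bars are stored in `ax.patches` in the same order as the rows of the series, which is how Pandas bar charts appear to behave.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up scores for illustration only (not the PISA data)
scores = pd.Series({'Country A': 480, 'United States': 481,
                    'Country B': 500, 'Country C': 520}, name='Math')

fig, ax = plt.subplots()
scores.plot(kind='barh', ax=ax)

pos = list(scores.index).index('United States')   # row number of the bar we want
ax.patches[pos].set_color('r')                    # color that bar red
```

If the label isn't in the index, the `index()` call raises an error, which beats silently coloring the wrong bar.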

**World Bank data.** Our second example comes from using the World Bank's API, which gives us access to a huge amount of data for countries. We use it to produce two kinds of graphs and illustrate some tools we haven't seen yet:

* Bar charts of GDP and GDP per capita
* Scatter plot (bubble plot) of life expectancy v GDP per capita

We start with the data:

    # load packages (redundancy is ok)
    import pandas as pd                   # data management tools
    from pandas_datareader import wb      # World Bank api (formerly pandas.io.wb)
    import matplotlib.pyplot as plt       # plotting tools

    # variable list (GDP per capita, GDP, life expectancy)
    var = ['NY.GDP.PCAP.PP.KD', 'NY.GDP.MKTP.PP.KD', 'SP.DYN.LE00.IN']
    # country list (ISO codes)
    iso = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
    year = 2013

    # get data from World Bank
    df = wb.download(indicator=var, country=iso, start=year, end=year)

    # munge data
    df = df.reset_index(level='year', drop=True)
    df.columns = ['gdppc', 'gdp', 'life'] # rename variables
    df['pop'] = df['gdp']/df['gdppc']     # population
    df['gdp'] = df['gdp']/10**12          # convert to trillions
    df['gdppc'] = df['gdppc']/10**3       # convert to thousands
    df['order'] = [5, 3, 1, 4, 2, 6, 0]   # reorder countries
    df = df.sort_values(by='order', ascending=False)
    df

Note that the index here is the country name -- that will be our x axis.

Here's a horizontal bar chart for (total) GDP:

    fig, ax = plt.subplots()
    df['gdp'].plot(ax=ax, kind='barh', alpha=0.5)
    ax.set_title('GDP', loc='left', fontsize=14)
    ax.set_xlabel('Trillions of US Dollars')
    ax.set_ylabel('')

What do you see? What's the takeaway?

We think the horizontal bar chart looks better than the usual vertical bar chart, which we'd get if we replaced `barh` above with `bar`. (Try it and see what you think.)

Here's a similar chart for GDP per capita:

    fig, ax = plt.subplots()
    df['gdppc'].plot(ax=ax, kind='barh', color='m', alpha=0.5)
    ax.set_title('GDP Per Capita', loc='left', fontsize=14)
    ax.set_xlabel('Thousands of US Dollars')
    ax.set_ylabel('')

What do you see here? What's the takeaway?

And just because it's fun, here's an example of Tufte-like axes from Matplotlib examples:

    fig, ax = plt.subplots()
    df['gdppc'].plot(ax=ax, kind='barh', color='b', alpha=0.5)
    ax.set_title('GDP Per Capita', loc='left', fontsize=14)
    ax.set_xlabel('Thousands of US Dollars')
    ax.set_ylabel('')

    # Tufte-like axes
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.yaxis.set_ticks_position('left')
    ax.xaxis.set_ticks_position('bottom')

This gives us axes on the left and bottom only, separated slightly from the bars. It's another illustration of the benefits of Google fu.

We finish off with a bubble plot: a scatter plot in which the size of the dots ("bubbles") varies with a third variable. (Count them: we have `x` on the horizontal axis, `y` on the vertical axis, and a third variable represented by the size of the bubble.) From a technical perspective, this is simply another argument in a scatter plot. Here's an example in which `x` is GDP per capita, `y` is life expectancy, and the third variable is population:

    fig, ax = plt.subplots()
    ax.scatter(df['gdppc'], df['life'],     # x,y variables
               s=df['pop']/10**6,           # size of bubbles
               alpha=0.5)
    ax.set_title('Life expectancy vs. GDP per capita', loc='left', fontsize=14)
    ax.set_xlabel('GDP Per Capita')
    ax.set_ylabel('Life Expectancy')
    ax.text(58, 66, 'Bubble size represents population', horizontalalignment='right')

The only odd thing is the `10**6` "scaling" in the `s=` argument. The bubble size is a little tricky to calibrate. Without the scaling, the bubbles are larger than the graph. We played around until they looked reasonable.

Ok, we lied, that wasn't the conclusion. But we think this is fun, and it's optional in any case.

Matplotlib has a lot of basic settings for graphs. If we find some we like, we can set them once and be done with it. Or we can use some of their preset combinations, which they call **styles**.

We'll start with one of the bar charts we produced with World Bank data:

    fig, ax = plt.subplots()
    df['gdp'].plot(ax=ax, kind='barh', alpha=0.5)
    ax.set_title('GDP', loc='left', fontsize=14)
    ax.set_xlabel('Trillions of US Dollars')
    ax.set_ylabel('')

Now recreate the same graph with this statement at the top:

    plt.style.use('fivethirtyeight')

Once we execute this statement, it stays executed, but we'll change it back at the end.

Here's another one, for fans of the popular xkcd webcomic:

    plt.xkcd()
    fig, ax = plt.subplots()
    df['gdp'].plot(ax=ax, kind='barh', alpha=0.5)
    ax.set_title('GDP', loc='left', fontsize=14)
    ax.set_xlabel('Trillions of US Dollars')
    ax.set_ylabel('')

Note the wiggly lines, perfect for suggesting a hand-drawn graph.

**Exercise.** Try one of these styles: `ggplot`, `bmh`, `dark_background`, and `grayscale`. Which ones do you like? Why?

When we're done, we reset the style with these two lines in a code cell:

    mpl.rcParams.update(mpl.rcParamsDefault)
    %matplotlib inline
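
If we only want a style for one graph, another option (our addition, not used in the notebook) is Matplotlib's `plt.style.context()`, which applies a style inside a `with` block and restores the previous settings afterward. This sketch uses made-up data and tracks one setting, the axis background color, before, during, and after the block.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

before = mpl.rcParams['axes.facecolor']              # background color before

with plt.style.context('ggplot'):                    # style applies only inside this block
    inside = mpl.rcParams['axes.facecolor']          # background color under ggplot
    fig, ax = plt.subplots()
    ax.barh(['A', 'B', 'C'], [1, 2, 3], alpha=0.5)   # made-up data

after = mpl.rcParams['axes.facecolor']               # background color after the block
print(before, inside, after)
```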

Consider the data from Randal Olson's blog post:

    import pandas as pd
    data = {'Food': ['French Fries', 'Potato Chips', 'Bacon', 'Pizza', 'Chili Dog'],
            'Calories per 100g': [607, 542, 533, 296, 260]}
    cals = pd.DataFrame(data)

The dataframe `cals` contains the calories in 100 grams of several different foods.

**Exercise.** We'll create and modify visualizations of this data:

* Set `'Food'` as the index of `cals`.
* Create a bar chart with `cals` using figure and axis objects.
* Add a title.
* Change the color of the bars. What color do you prefer?
* Add the argument `alpha=0.5`. What does it do?
* Change your chart to a horizontal bar chart. Which do you prefer?
* *Challenging.* Eliminate the legend.
* *Challenging.* Skim the top of Olson's blog post. What do you see that you'd like to imitate?

We haven't found many non-technical resources on Matplotlib we like, but these are pretty good:

* One of the best is Matplotlib's gallery of examples. It's a good starting point for learning new things. Find an example you like, download the code, and adapt it to your needs. We also like the Pandas summary of dataframe methods.
* The documentation of Pandas plot methods is also pretty good.
* The SciPy lectures are good overall. The Matplotlib section focusses on `plot(x,y)`, which wouldn't be our choice, but the content is very good.
* Randal Olson has lots of good examples on his blog.

If you find others you like, let us know.