Histograms in Python

Descriptive statistics relies on several important types of plots. One that appears frequently in newspapers and magazines is the histogram, a sequence of vertical bars in which the height of each bar represents a count of the data values falling in a "bin." The "count", "bins", and "bars" will be explained in the images that follow.

IPython

In this activity, we will use IPython, an interactive Python interpreter with powerful features. Documentation for IPython can be found at the following URL:

http://ipython.scipy.org/moin/Documentation

There are a number of useful manuals and tips on the IPython documentation page, but one is particularly useful: a 5-part video series by Jeff Rush that introduces the viewer to the capabilities of IPython.

http://showmedo.com/videos/series?name=CnluURUTV

One of the most spectacular features of IPython is the ability to work interactively with matplotlib from the IPython shell. To do this, we simply start IPython with the -pylab switch (later IPython releases spell this switch --pylab). In the command that follows, code $ is the command prompt of our operating system; you need only type the ipython -pylab that follows this prompt. You must have IPython installed to continue in this activity.

code $ ipython -pylab

When the IPython shell starts up, you are greeted with the following boilerplate.

Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53) 
Type "copyright", "credits" or "license" for more information.

IPython 0.8.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

  Welcome to pylab, a matplotlib-based Python environment.
  For more information, type 'help(pylab)'.

In [1]: 

Following some information and copyright for your installed Python system, the version of IPython is announced, followed by some useful command features.

  • You can type ? to get a quick introduction and overview of IPython's features.
  • You can get a "reference card" by typing %quickref.
  • Typing help() takes you into an interactive help system for Python. You can get help for a particular Python object by typing help(object). As an example, to receive help on Python's for statement, type help('for').
  • You can get help on any object in the namespace by typing object?. This is a powerful feature which we will employ in this activity.

Pylab Help

Note especially the last bit of boilerplate:

  Welcome to pylab, a matplotlib-based Python environment.
  For more information, type 'help(pylab)'.

This boilerplate announces that we are currently in a matplotlib-based Python environment. Readers who are familiar with Matlab's interactive command-line environment will be pleased to hear that this Python environment duplicates many of Matlab's command-line features.

You can get more information by typing help(pylab). The IPython system responds with a wealth of information about the matplotlib system. As an example, after typing help(pylab), you will see that one of the plotting commands is axes, which is used to "Create a new axes." Exit the pylab help system by typing q, then type the following command at the IPython prompt:

In [1]: axes?

IPython responds with information, part of which is duplicated below:

Docstring:    Add an axes at positon rect specified by:: 
axes() by itself creates a default full subplot(111) window axis
axes(rect, axisbg='w') where rect=[left, bottom, width, height] in
  normalized (0,1) units.  
axisbg is the background color for the axis, default white

You can get more extensive help with axes??. Finally, you quit Pylab's help system by typing q.
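
The object? syntax belongs to IPython itself. As a minimal sketch, assuming a plain Python session without IPython, the same docstring text can be read with the standard library. The built-in len is used here only as a convenient example object.

```python
import inspect

# IPython's "object?" displays, among other details, the object's docstring.
# inspect.getdoc() returns that docstring as a cleaned-up string.
doc = inspect.getdoc(len)
print(doc)

# help(len) prints a longer, formatted version of the same information.
```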

Quitting IPython

You can quit IPython at any time by one of two methods:

  1. Type quit() at the IPython prompt. You will be prompted with "Do you really want to exit ([y]/n)?" Note that "y" is the default. Hit Enter to quit, "n" and Enter to change your mind.
  2. Typing Ctrl+d provides the same prompt and choices.

Creating a Histogram of the Standard Normal Distribution

If you quit the IPython shell, restart with the command ipython -pylab. The standard normal distribution is the famous "bell-shaped" curve of statistics, known to have a mean value of zero and a standard deviation of one. If you are not familiar with the standard normal distribution, read on. This distribution will be made clear in the images that follow.

We first ask for help on randn.

In [5]: randn?

Part of the response follows:

Docstring:
    Returns zero-mean, unit-variance Gaussian random numbers in an 
    array of shape (d0, d1, ..., dn).

This is a good response. It tells us that:

  1. we are drawing random numbers from a Gaussian (normal) distribution,
  2. the distribution has mean zero, and
  3. the distribution has "unit-variance." The variance is the square of the standard deviation, so taking the square root reveals that the standard deviation of this distribution is one.
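
The relationship between variance and standard deviation can be checked numerically. The following is a minimal sketch that uses only the standard library (random.gauss rather than randn), with an arbitrary seed chosen so that the run is repeatable. The sample statistics will be close to, but not exactly, 0 and 1.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed, for repeatable draws

# Draw 1000 values from a normal distribution with mean 0 and
# standard deviation 1.
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]

mean = statistics.mean(samples)
variance = statistics.pvariance(samples)
std_dev = math.sqrt(variance)  # standard deviation = square root of variance

print("mean     =", round(mean, 3))
print("variance =", round(variance, 3))
print("std dev  =", round(std_dev, 3))
```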

So, let's begin by drawing 1000 numbers from this "standard normal" distribution.

In [6]: x=randn(1000)

Well, that was easy! And fast! The variable x now contains 1000 random numbers drawn from the standard normal distribution. You can view these numbers by typing x at the IPython prompt. The first few numbers are shown below.

In [7]: x
Out[7]: 
array([ -2.62954972e-01,   1.27938442e+00,  -1.83891371e-01,
         1.11853435e-02,  -1.23795745e+00,  -2.85110316e-01,
        -2.93591822e+00,   2.40757900e-01,   1.60797415e+00,
        ...

You can check the length of this vector with the len function, whose name is an abbreviation of the word "length."

In [10]: len(x)
Out[10]: 1000
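
The pylab shell places randn in the interactive namespace for us. A minimal sketch of the same step in an ordinary script, assuming NumPy is installed, reaches the function through numpy.random. The seed is an arbitrary choice made only so that repeated runs draw the same numbers.

```python
import numpy as np

np.random.seed(0)          # arbitrary seed, for repeatable draws
x = np.random.randn(1000)  # 1000 draws from the standard normal distribution

print(len(x))    # number of values drawn
print(x.shape)   # the array is one-dimensional
print(x[:3])     # first few values
```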

We have our 1000 random numbers, so just how do we go about creating a histogram of these numbers? If we were doing the problem by hand, we would first define some categories called "bins", and then sort the random numbers into these "bins". The final count of occurrences in each bin might look like the following.

Counting Occurrences In Each Bin

Bin         Count
(-3.5,-3]       1
(-3,-2.5]       6
(-2.5,-2]      21
(-2,-1.5]      46
(-1.5,-1]      82
(-1,-0.5]     147
(-0.5,0]      163
(0,0.5]       207
(0.5,1]       171
(1,1.5]       103
(1.5,2]        33
(2,2.5]        13
(2.5,3]         5
(3,3.5]         2

The table shows that only one number fell between -3.5 and -3, 6 numbers fell between -3 and -2.5, and so on. Proceeding by hand, we would next set up axes on graph paper, partition the horizontal axis into "bins", then scale the vertical axis to accommodate the counts in our table. Over each bin we would draw a rectangle whose height corresponds to the number of occurrences in that particular bin.
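
The sorting-and-counting step can be sketched in a few lines. This is a minimal illustration, assuming a small made-up list of ten values (not the 1000 draws above) and using only the standard library. The names edges, data, and counts are our own.

```python
from bisect import bisect_left

# Bin edges -3.5, -3.0, ..., 3.5 give the 14 bins used in the table.
edges = [-3.5 + 0.5 * i for i in range(15)]

# A small made-up data set, for illustration only.
data = [-3.2, -0.7, -0.5, -0.2, 0.0, 0.1, 0.4, 0.9, 1.0, 2.6]

counts = [0] * (len(edges) - 1)
for value in data:
    # bisect_left finds the first edge that is >= value, so the bin to its
    # left is the interval (left edge, right edge] that contains the value.
    k = bisect_left(edges, value) - 1
    if 0 <= k < len(counts):  # ignore values outside (-3.5, 3.5]
        counts[k] += 1

for left, right, count in zip(edges, edges[1:], counts):
    print(f"({left:g},{right:g}] {count}")
```

Because each bin is closed on the right, a value such as -0.5 is counted in (-1,-0.5] rather than in (-0.5,0].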

Performing this task by hand is a painstaking procedure. Python can automate the process for us, which will allow us to spend more time interpreting the results and less time dealing with the tedium of sorting 1000 numbers into bins and counting them by hand.

What follows is the simplest way to get a quick histogram of the data in the variable x.

In [19]: hist(x)

The command hist(x) responds by printing some output to the terminal window (the bin counts, the bin edges, and the rectangle objects that were drawn) and creating the nice looking histogram shown in Figure 1.

Figure 1. A histogram of the standard normal distribution data in the variable x.
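
The output that hist sends to the terminal is its return value. A minimal sketch, assuming NumPy and Matplotlib are installed and imported explicitly rather than through the pylab shell, captures that value instead of letting it print. The names counts, edges, and patches are our own, and the seed is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)          # arbitrary seed, for repeatable draws
x = np.random.randn(1000)

# hist draws the bars and also returns what it computed.
counts, edges, patches = plt.hist(x)

print(counts)        # number of values in each of the 10 default bins
print(edges)         # 11 bin edges: one more than the number of bins
print(len(patches))  # one rectangle per bin

# In a non-interactive script, plt.show() would open the figure window.
```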

Matplotlib offers fine-grained control over the appearance and form of its histograms. We can learn more about this command through the interactive shell of the IPython pylab environment.

In [20]: hist?

The help system responds with a wealth of information on Pylab's hist command, a snippet of which follows.

Docstring:    HIST(x, bins=10, normed=0, bottom=None, 
  align='edge', orientation='vertical', width=None,
  log=False, **kwargs)
  
Compute the histogram of x.  bins is either an integer number of
bins or a sequence giving the bins.  x are the data to be binned.

Let's first focus on the "bins" argument to the hist command. Apparently this can be used in one of two ways:

  1. You can simply declare the "number" of bins you want. For example, if we wanted 20 bins, we'd pass this request to the hist command as follows. The command clf() "clears" the current figure window.
    In [23]: clf()
    
    In [24]: hist(x,bins=20)
    

    Note that there are now 20 bins in Figure 2.

    Figure 2. The command hist(x,bins=20) produces 20 bins.

  2. There is a second way to proceed and that entails directing the hist command to use a specific set of bins. Let's say that we want the histogram to use the bins described in our tabular work. Use the arange command to produce this set of bins.
    In [25]: bins=arange(-3.5,4,0.5)
    
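    The stop value has to go a bit beyond 3.5 because arange excludes its stop value; stopping comfortably past the last edge we want also guards against roundoff error in the floating-point steps. You can see the result of this command by typing bins at the IPython prompt.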
    In [26]: bins
    Out[26]: 
    array([-3.5, -3. , -2.5, -2. , -1.5, -1. , -0.5,  0. ,  0.5,  1. ,  1.5,
            2. ,  2.5,  3. ,  3.5])
    
    These are precisely the bins in our table. We can now request that Matplotlib use these bins with the following command.
    In [28]: clf(), hist(x, bins=bins)
    
    The result is shown in Figure 3.

    Figure 3. The command hist(x,bins=bins) uses the bins stored in the variable bins.
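
The two uses of the bins argument can also be examined without drawing anything. The following is a minimal sketch, assuming NumPy is installed and using numpy.histogram, which only counts, in place of hist. The seed is arbitrary, and the name too_short is our own, introduced to show what happens when the stop value is exactly 3.5.

```python
import numpy as np

# arange(start, stop, step) excludes the stop value, so the stop must lie
# past 3.5 for 3.5 to appear as the final edge.
edges = np.arange(-3.5, 4, 0.5)
too_short = np.arange(-3.5, 3.5, 0.5)
print(len(edges), edges[-1])          # 15 edges, ending at 3.5
print(len(too_short), too_short[-1])  # 14 edges, ending at 3.0

np.random.seed(0)          # arbitrary seed, for repeatable draws
x = np.random.randn(1000)

# An integer asks for that many equal-width bins spanning the data.
counts20, edges20 = np.histogram(x, bins=20)

# A sequence is taken as the bin edges themselves.
counts, used_edges = np.histogram(x, bins=edges)

print(len(counts20), len(edges20))  # 20 bins need 21 edges
print(len(counts), counts.sum())    # 14 bins; values outside the edges are dropped
```

Note that numpy.histogram treats every bin except the last as closed on the left and open on the right, whereas the table above closes each bin on the right. For continuous data such as these draws the difference rarely changes a count.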

Finally, some students and instructors don't like the bit of spacing that occurs between consecutive rectangles in Figure 3. Matplotlib allows us to force the width of the rectangles.

In [29]: clf(), hist(x, bins=bins, width=0.5)

The result of this command is shown in Figure 4.

Figure 4. The command hist(x, bins=bins, width=0.5) fixes the width of each bin at 0.5.
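
The same picture can be assembled from its parts. This minimal sketch does not use hist with a width argument; it swaps in a different technique, counting with numpy.histogram and then drawing with bar, whose width and align arguments set the rectangle width explicitly. It assumes NumPy and Matplotlib are installed, and the seed is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)          # arbitrary seed, for repeatable draws
x = np.random.randn(1000)

edges = np.arange(-3.5, 4, 0.5)          # same bin edges as in the table
counts, _ = np.histogram(x, bins=edges)  # count the values in each bin

# Draw one bar per bin, starting at the bin's left edge and exactly
# 0.5 wide, so that neighbouring bars touch.
bars = plt.bar(edges[:-1], counts, width=0.5, align="edge")

print(len(bars))  # one bar for each of the 14 bins

# In a non-interactive script, plt.show() would open the figure window.
```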

Things to Note in Figure 4:

  1. Note how the histogram is approximately "balanced" about its mean, which appears to be located at approximately zero. This is precisely what we would anticipate, because the numbers in the variable x were randomly drawn from the standard normal distribution, which has mean zero.
  2. Because the numbers in the variable x were drawn from the standard normal distribution, which has a standard deviation of one, almost all of the data occurs within 3 standard deviations of the mean, in either direction. That is, almost all of the data in the variable x falls between -3 and 3, which is 3 standard deviations of 1 on either side of the mean 0.
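
Both observations can be checked numerically. The following is a minimal sketch, assuming NumPy is installed and using an arbitrary seed; the exact figures will vary from one set of draws to the next. The theoretical fraction comes from the error function in the standard library.

```python
import math
import numpy as np

np.random.seed(0)          # arbitrary seed, for repeatable draws
x = np.random.randn(1000)

sample_mean = x.mean()              # should be close to 0
within_3 = np.mean(np.abs(x) <= 3)  # fraction of draws in [-3, 3]

# For an exact standard normal distribution, the probability of landing
# within 3 standard deviations of the mean is erf(3 / sqrt(2)).
expected = math.erf(3 / math.sqrt(2))

print("sample mean:", round(float(sample_mean), 3))
print("fraction within 3 standard deviations:", float(within_3))
print("theoretical fraction:", round(expected, 4))
```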

Enjoy!

We hope you enjoyed this introduction to the IPython system. This interactive system, coupled with the Pylab environment, provides a strong interactive interface for exploration in mathematics, science, and engineering.

We encourage you to explore further. Use the URLs near the top of the activity to learn more about the IPython environment. Use the command hist? to learn more about what you can do with the hist command. The full use of the hist command is also outlined at the following URL:

http://matplotlib.sourceforge.net/matplotlib.pyplot.html#-hist