Working with Python packages

By Matt Murbach

This page is the third post in a series of introductory python tutorials:
1. Installing Python using Conda
2. Getting started with Jupyter
3. Using Python packages - this tutorial
4. Working with data
5. Making figures in Python


Python is a powerful language, however, the real strength of the Python environment comes from the open-source community that has written 1000s of packages to make certain tasks easier.

Eventually, we will cover how to write your own packages to better organize your own code or share your work, but this tutorial will focus on introducing some of the most widely used packages in scientific research. This is not meant to be a comprehensive review of the Python ecosystem but rather a high-level overview to provide a flavor of different packages.

Installing Python Packages

There are several methods for installing Python packages. In our tutorials, we will use Conda as our default package manager due to it’s relative ease of use. For packages that aren’t on Conda, we suggest using pip to install from the Python Package Index (PyPI).

Conda is a non-python specific package manager

Notes:

  • Conda installs from binaries and is much better at handling non-Python dependencies than pip

Examples:

conda install scipy

conda list --export > package-list.txt

PyPI and pip

PyPI is the Python Package Index - an open repository that anyone can submit a package to.

Notes:

  • We can use pip to install packages from PyPI and use them in our own code.
  • pip installs packages from source (or from wheels) that are uploaded to PyPI.
  • Pip is specific to Python code, which can make it difficult to install packages with non-Python dependencies (things that rely on C or Fortran code for example)

Examples:

pip install pandas
pip freeze > requirements.txt
pip install --upgrade pandas

We won’t cover all of these in depth here, but here are a list of packages that tend to be useful in different settings

  • Plotting: seaborn, matplotlib, bokeh
  • Data Manipulation and numerical computing: numpy, scipy, pandas
  • Machine learning: scikit-learn, TensorFlow, theano
  • Image processing: scikit-image, Pillow
  • Documentation and testing: Pytest, sphinx, unittest

Introduction to several packages with short examples

SciPy is a collection of open-source Python libraries for mathematics, science, and engineering

  • Contains many useful tools for optimization, signal processing, curve fitting
  • Tip: SciPy can be on the more difficult side to install due to all of the C dependencies. Highly recommended to use Conda as a package management tool
  • Example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

# define the data to be fit with some noise

xdata = np.linspace(0, 4, 50)
y = func(xdata, 2.5, 1.3, 0.5)
y_noise = 0.2 * np.random.normal(size=xdata.size)
ydata = y + y_noise
plt.plot(xdata, ydata, 'b-', label='data')

# Fit for the parameters a, b, c of the function func

popt, pcov = curve_fit(func, xdata, ydata)
plt.plot(xdata, func(xdata, *popt), 'r-', label='fit')

# Constrain the optimization to the region of 0 < a < 3, 0 < b < 2 and 0 < c < 1:

popt, pcov = curve_fit(func, xdata, ydata, bounds=(0, [3., 2., 1.]))
plt.plot(xdata, func(xdata, *popt), 'g--', label='fit-with-bounds')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()


NumPy a very useful library for scientific computing. It adds support for arrays, linear algebra tools and optimized integration with C/Fortran

  • Used by many, many other numerical/scientific computing packages
  • Potentially useful: NumPy for MATLAB users
  • Example:
import numpy as np

x = np.linspace(0,10, num=10)
y = np.random.random(len(x))

y_mean = y.mean()
y_std = y.std()

print('Mean = {0:.3f} \t Std. Dev. = {1:.3f}'.format(y_mean, y_std))

plt.plot(x, y, 'o')
plt.ylim(0,1)
plt.show()


Python Data Analysis Library (http://pandas.pydata.org/)

  • Amazingly useful library for importing, manipulating, cleaning data from a wide variety of sources
  • Examples:
Easily read in data from csv, excel, json, etc.

Note: you can download the example file we use here

# import the pandas package
import pandas as pd

# read in the data from our file
data = pd.read_csv('./time-data(833).txt')

# assign names to each of the columns for easy reference
data.columns = ['time(s)', 'current(A)', 'potential(V)', 'frequency(Hz)', 'amplitude(A)']

# print out the first 5 rows of our data
data.head()

Manipulate data (remove columns with NaN, calculate mean, apply function to all rows)
# drop columns with NaN
data.dropna(axis=1)

#calculate mean
data["current(A)"].mean()

#calculate power directly
power1 = data["current(A)"]*data["potential(V)"]

# or apply lambda function row-by-row
power2 = data.apply(lambda x: x["current(A)"]*x["potential(V)"], axis=1)

# compare the two methods
(power1 == power2).all()
Easily save data
data["power(W)"] = power1

data.to_csv("./add_power.txt", index=False)

Matplotlib is a powerful Python plotting library and forms the basis for many other plotting libraries

  • Simple to use simply, but a LOT of low-level control if you want it…
  • Examples:
time = data["time(s)"]
voltage = data["potential(V)"]
current = data["current(A)"]

fig = plt.figure(figsize=(6.5,5))
ax1 = fig.add_subplot(211)
ax1.plot(time, voltage)
ax1.set_ylabel("Voltage (V)", fontsize=20)
plt.ylim(-.15,.15)

ax2 = fig.add_subplot(212)
ax2.plot(time, current)
ax2.set_ylabel("Current (A)", fontsize=20)
ax2.set_xlabel("Time (s)", fontsize=20)
plt.ylim(-.35,.35)

labels = ['(a)', '(b)']
for i, ax in enumerate([ax1, ax2]):
    ax.set_xlim(0,10)
    ax.text(.055, .85,labels[i], transform=ax.transAxes, fontsize=20, fontweight='bold')

plt.tight_layout()
plt.show()

fig.savefig('./example_plotting.png', dpi=300, bbox_inches="tight")


Seaborn is a high-level library for plotting many different statistical plots

https://seaborn.pydata.org/examples/index.html

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(color_codes=True)

x = np.random.gamma(6, size=200)
sns.distplot(x, kde=False, fit=stats.gamma)
plt.show()


scikit-learn is machine learning in Python

  • Example:
# Code source: Gaƫl Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
              ('k_means_iris_3', KMeans(n_clusters=3)),
              ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                               init='random'))]

fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
    fig = plt.figure(fignum, figsize=(4, 3))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2],
               c=labels.astype(np.float), edgecolor='k')

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(titles[fignum - 1])
    ax.dist = 12
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean(),
              X[y == label, 2].mean() + 2, name,
              horizontalalignment='center',
              bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))

y = np.choose(y, [1, 2, 0]).astype(np.float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12

fig.show()