Python has a collection of scientific libraries that is called the SciPy stack ^{ 1 } . It consists of powerful data structures to manipulate multidimensional using simple expressions and powerful routines to perform analysis and data manipulation.
The SciPy stack consists of the following Python modules:
NumPy
.
Multidimensional arrays and basic operations
SciPy
.
A library for scientific computing which contains functionality
for statistics, signal processing, optimization and linear algebra.
matplotlib
for plotting
pandas
provides easy to use data structures for data with mixed data types.
Sympy
for symbolic math
IPython
for a rich Python shell
nose
for testing.
In addition to these libraries the
statsmodels
package for statistics and the
scikit-learn
package for machine learning are very useful.
We will use the following statement to import most important functionality from NumPy, SciPy and matplotlib into main namespace.
from pylab import *
This form is in my opinion easiest to use for beginners. The pylab module imports core components from Numpy, Scipy and matplotlib for easy interactive use. When you get more familiar with Python you can then import only the parts that you need. Bare in mind that there are also other forms that can be used to import modules when you read examples from various sources.
Numpy package contains efficient multidimensional arrays that form the
basics of data manipulation also in Scipy.
The
array
datatype is
the basic type we are using.
It can used in scalar
computations, like adding two arrays elementwise, or
in linear algebra operations.
Numpy has methods for reading data from files to arrays, creating arrays with a certain range and converting other Python sequences to arrays.
The
linspace
command can be used to create a vector between given
and start endpoints with predetermined number of elements and
arange
commands is used to create vector with elements with certain
difference between elements.
Here are some methods to create arrays:
>>> linspace(0, 1 ,5) #Five elements from 0 to 1
array([ 0. , 0.25, 0.5 , 0.75, 1. ])
>>> arange(0, 2, 0.3) #From 0 to 2 by 0.3
array([ 0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])
>>> array([3.2, 1, 4]) #From list
array([ 3.2, 1. , 4. ])
The advantage of arrays is that they can be used in equations easily without looping trough all elements like you would need to do with a list.
Suppose for instance that we have a 12V power source and we want to calculate the power for currents between 0 and 2A. This is given by formula \( P=VI \)
And the solution using Numpy arrays:
>>> V = 12
>>> I = linspace(0, 2, 6)
>>> I
array([ 0. , 0.4, 0.8, 1.2, 1.6, 2. ])
>>> P = V*I
>>> P
array([ 0. , 4.8, 9.6, 14.4, 19.2, 24. ])
The following table shows for basic operations on arrays. See Numpy refence for complete details http://docs.scipy.org/doc/numpy/reference/ .
Operation and remarks | Command |
Creating variables: | a = 1. |
vector: from 1-10 | a = arange(1. , 11) |
array | X = array([[3, 2], [4, 1], [6, 3]]) |
Basic element wise operations * + -/ | 2. + 5.5; a-3; a*b; X/Y |
Powers | 2**3; a**2; X**10 |
Square root | sqrt(a) |
Indexing arrays | b = a[0:5] |
Multidimensional arrays [row, column] | Y = X[0:2, 2:5] |
Number of elements | len(x); shape(X) |
Sum | sum(x); sum(X) |
by colums | sum(X, 0) |
by rows | sum(X, 1) |
Average | mean(x); mean(X) |
by colums | mean(X, 0) |
by rows | mean(X, 1) |
Standard deviation | std(x); std(X) |
by colums | std(X, 0) |
by rows | std(X, 1) |
Cumulative sum | cumsum(x); cumsum(X) |
by colums | cumsum(X, 0) |
by rows | cumsum(X, 1) |
Dot product | dot(a,b); dot(X,Y) |
Transposing a matrix | transpose(X) |
Random numbers | rand(100) |
from Normal distribution | randn(10,10) |
Delete a variable | del a |
Help for a command | help(mean); help(std) |
Numpy has easy methods for reading and writing numerical data to and from text files.
A simple example:
>>> x = randn(5, 3)
>>> savetxt('data.txt', x)
>>> x2 = loadtxt('data.txt')
>>> x
array([[ 0.57299917, 0.47370978, -2.60736374],
[ 0.5965621 , -0.85368156, 1.629525 ],
[ 1.6277449 , -1.07544523, -0.57351649],
[-2.44221353, -1.2686372 , 0.14524342],
[-1.24633481, 1.12646824, 1.00443323]])
>>> x2
array([[ 0.57299917, 0.47370978, -2.60736374],
[ 0.5965621 , -0.85368156, 1.629525 ],
[ 1.6277449 , -1.07544523, -0.57351649],
[-2.44221353, -1.2686372 , 0.14524342],
[-1.24633481, 1.12646824, 1.00443323]])
>>> x == x2
array([[ True, True, True],
[ True, True, True],
[ True, True, True],
[ True, True, True],
[ True, True, True]], dtype=bool)
You can also use converters to read in e.g. dates. The following example will load a datafile with dates in the first column and convert the iso formatted date to numeric format.
from matplotlib.dates import datestr2num
data = loadtxt("data.txt", converters = {0: datestr2num})
NumPy arrays are very useful and efficient when you only have numerical data, but they are a lot less convenient to work with when you have data consisting of multiple data types such as strings, factors and booleans. The pandas library provides data structures and functions that make working with mixed data types a lot easier. The library is very well documented at http://pandas.pydata.org , but I will show few examples to give an idea about how the library can be used.
The main pandas data structure is the DataFrame. The library provides a several functions to read a DataFrame from files (csv, excel, hdf5) or from SQL database. You can also construct DataFrames from several data structures e.g. lists, dicts and NumPy arrays.
Here is a simple example about making a DataFrame from a list of Python dicts:
>>> import pandas as pd
>>> cowList = [{"id" : 101, "weight" : 650},
... {"id" : 102, "weight" : 720},
... {"id" : 103, "weight" : 800},
... {"id" : 104, "weight" : 560}]
...
>>> #Create DataFrame
>>> df = pd.DataFrame(cowList)
>>>
>>> #Look at the contents
>>> df
id weight
0 101 650
1 102 720
2 103 800
3 104 560
Let’s look at a more interesting example of roughage intake data from dairy cows in feedintake.csv file. The dataset contains roughage feeder trough visit information from 48 cows during 24 hours. Each row corresponds to one visit from a cow with recorded trough number, start and end time, visit duration in seconds and feed intake in kilograms.
We will read the data from csv file using pandas:
>>> feed = pd.read_csv("data/feed_intake.csv")
>>> #Look at the first rows using head function
>>> feed.head()
cowID trough begin end duration intake
0 9 18 2015-08-18 00:03:50 2015-08-18 00:12:03 493 2.5
1 2 13 2015-08-18 00:06:27 2015-08-18 00:12:18 351 1.3
2 9 18 2015-08-18 00:12:32 2015-08-18 00:17:51 319 1.3
3 43 11 2015-08-18 00:08:06 2015-08-18 00:20:09 723 3.5
4 42 23 2015-08-18 00:03:38 2015-08-18 00:21:10 1052 4.6
We can access the columns of the DataFrame with the column name:
>>> feed["intake"].head()
0 2.5
1 1.3
2 1.3
3 3.5
4 4.6
Name: intake, dtype: float64
>>> feed["duration"].head()
0 493
1 351
2 319
3 723
4 1052
Name: duration, dtype: int64
And also remove columns, note that the inplace option can be used to modify an existing object and is more efficient.
# Make a copy with dropped column
feed2 = feed.drop(["trough"], 1)
# Modify existing object
feed.drop(["trough"], 1, inplace =True)
And select data from the DataFrame e.g. select all data from cow number 1 and plot feed intake vs feeding bout duration:
cow1 = feed[feed["cowID"] == 1]
plot(cow1["duration"], cow1["intake"], 'o')
xlabel("Feeding time (s)")
ylabel("Feed intake (kg)")
We can easily calculate the total feed intake and feeding time for each cow by grouping the dataset by “cowID” column and applying the sum function.
>>> feed_sums = feed.groupby("cowID").sum()
>>> feed_sums.head()
duration intake
cowID
1 9778 73.6
2 11868 66.0
3 858 2.4
4 9796 68.8
5 12848 69.7
You can also apply several aggregate function passing them as a list
>>> feed.groupby("cowID").agg([np.sum, np.std]).head()
duration intake
sum std sum std
cowID
1 9778 136.811512 73.6 0.995150
2 11868 235.607678 66.0 1.480723
3 858 109.874474 2.4 0.551362
4 9796 491.439436 68.8 3.548356
5 12848 227.878629 69.7 1.546469