
Pandas 101 in Python

How to use Pandas, the Python data analysis library, to manipulate and analyze data in Plotly.

Installation

You will need Python; if you don't have it, you can install it from the source releases.
Pandas supports Python versions 2.7, 3.4, and 3.5.

  • PyPI Installation
    $ sudo pip install pandas
    
    This will automatically download all the dependencies, though it may take some time.
  • Linux Distribution Installation
    The following commands install Pandas for Python 2 from your distribution's repositories.
    To install it for Python 3, use "python3-pandas" instead.
  • Ubuntu & Debian Installation
    $ sudo apt-get install python-pandas
    
  • Fedora & OpenSuse Installation
    $ sudo dnf install python-pandas   # Fedora
    $ sudo zypper in python-pandas     # OpenSuse
    
    These commands install from the official Fedora and OpenSuse repositories, respectively.
  • Anaconda Installation

Anaconda is a cross-platform Python distribution for scientific computing and data analytics. (It makes the installation phase easy for beginners.)
You can get Anaconda from here.
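
Whichever route you take, you can verify the installation by importing Pandas and printing its version (the version shown will depend on your setup):

$ python -c "import pandas; print(pandas.__version__)"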

You can also play with Pandas online, without installing it, using the IPython Notebook service by Wakari.

In [1]:
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

Reading, Writing, and Creating data

Pandas lets you work with data in multiple formats, such as CSV, Excel, and JSON.

You can read a data file from your local system or from an external URL.
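
As a quick sketch of the reader functions (the local file names below are hypothetical placeholders, not files from this tutorial):

import pandas as pd

# hypothetical local files
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')
df_json = pd.read_json('data.json')

# read_csv also accepts a URL directly
df_remote = pd.read_csv('http://bit.ly/1rq29Xf')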

Create data

DataFrame is the most commonly used data structure in Pandas; it's a 2-dimensional, table-like structure that can hold columns of multiple data types.

In [2]:
subjects = ['Mathematics', 'English', 'History', 'Science', 'Arts']
marks = [67, 60, 36, 61, 58]

Using Python's zip function, we can merge these two lists into one sequence of pairs. (In Python 3, zip returns an iterator; wrap it in list() to get the list shown below.)

In [3]:
marks_dataset = zip(subjects, marks)
marks_dataset
Out[3]:
[('Mathematics', 67),
 ('English', 60),
 ('History', 36),
 ('Science', 61),
 ('Arts', 58)]
In [4]:
marks_df = pd.DataFrame(marks_dataset, columns=['Subjects', 'Marks'])
marks_df
Out[4]:
Subjects Marks
0 Mathematics 67
1 English 60
2 History 36
3 Science 61
4 Arts 58

The columns argument lists the labels of the respective columns.
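
Equivalently, you could skip the zip step and build the same DataFrame from a dict, where each key becomes a column label (the columns argument keeps the column order deterministic):

marks_df = pd.DataFrame({'Subjects': subjects, 'Marks': marks},
                        columns=['Subjects', 'Marks'])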

Marks Distribution

The following bar chart represents the marks distribution per subject.

In [5]:
marks_data = [go.Bar(x=marks_df.Subjects, y=marks_df.Marks)]

py.iplot({
    'data': marks_data,
    'layout': {
        'title': 'Marks Distribution',
        'xaxis': {'title': 'Subjects'},
        'yaxis': {'title': 'Marks'}
    }
})
Out[5]:
[Plotly bar chart: 'Marks Distribution']

Add a new column 'Result' using numpy.where: set it to 'Pass' if Marks >= 40, else 'Fail'.

In [6]:
marks_df['Result'] = np.where(marks_df['Marks']>=40, 'Pass', 'Fail')
marks_df
Out[6]:
Subjects Marks Result
0 Mathematics 67 Pass
1 English 60 Pass
2 History 36 Fail
3 Science 61 Pass
4 Arts 58 Pass

To delete a column (say 'Result'), we can use marks_df.pop('Result').
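
For example (a standalone sketch; the tutorial itself keeps the column for the next steps):

# pop removes the column in place and returns it as a Series
result_col = marks_df.pop('Result')

# alternatively, drop returns a new DataFrame and leaves the original intact
# marks_df_smaller = marks_df.drop('Result', axis=1)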

Write data

We can write the DataFrame object to different file types.

In [7]:
# save the marks dataframe to a csv(comma-separated values) file in your directory
marks_df.to_csv('marks.csv', index=False)

The argument index=False prevents writing the row index (0...4) to the file.
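
Other writers follow the same pattern, for example (the file names are placeholders; to_excel may additionally require an engine package such as openpyxl):

marks_df.to_json('marks.json')
marks_df.to_excel('marks.xlsx', index=False)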

Read data

Here, we will be using the YouTube Channel Dataset.

Note:- Some of the "Views" and "Comments" columns have missing values, represented as -1.

In [8]:
channels_df = pd.read_csv('http://bit.ly/1rq29Xf')

Selecting and Filtering dataframes

You can use the describe method to show summary statistics of the dataset.

In [9]:
channels_df.describe()
Out[9]:
Views Comments
count 362.000000 362.000000
mean 288.229282 17.383978
std 147.604341 12.479304
min -1.000000 -1.000000
25% 172.750000 8.000000
50% 295.500000 14.000000
75% 399.000000 26.000000
max 533.000000 54.000000

Customer Reach Comparison between "WorldNews" and "WorldWeather"

In [10]:
reach_df = channels_df.groupby('Channel').describe()
reach_df
Out[10]:
Comments Views
Channel
WorldNews count 181.000000 181.000000
mean 16.182320 288.348066
std 11.131283 151.969936
min -1.000000 -1.000000
25% 7.000000 154.000000
50% 13.000000 294.000000
75% 24.000000 405.000000
max 54.000000 533.000000
WorldWeather count 181.000000 181.000000
mean 18.585635 288.110497
std 13.620638 143.527810
min -1.000000 -1.000000
25% 8.000000 180.000000
50% 15.000000 296.000000
75% 28.000000 394.000000
max 53.000000 532.000000

Notice that the result is a MultiIndex DataFrame.

In [11]:
channels = []

views_reach_data, comments_reach_data = [], []
views_mean, comments_mean = [], []

# MultiIndex DataFrame Iteration
for channel, new_df in reach_df.groupby(level=0):
    channels.append(channel)
    
    for col in new_df.columns:
        new_col = new_df[col][channel]
        x_label = channel
        
        # using the Box method of Plotly's graph objects
        box = go.Box(y=[new_col['min'], new_col['max']], name=x_label)
        
        # to prepare two different plots for each section ('views', 'comments')
        if (col == 'Views'):
            views_reach_data.append(box)
            views_mean.append(new_col['mean'])
        elif (col == 'Comments'):
            comments_reach_data.append(box)
            comments_mean.append(new_col['mean'])
In [12]:
# add lines for 'mean' value of 'views'
views_reach_data.append(go.Scatter(x=channels, y=views_mean, mode='lines', name='mean'))

py.iplot({
        'data': views_reach_data,
        'layout': {
            'title': 'Views Comparison',
            'xaxis': {'title': 'Channels'},
            'yaxis': {'title': 'Views'}
        }
})
Out[12]:
[Plotly box plot: 'Views Comparison']
In [13]:
# add lines for 'mean' values of 'comments'
comments_reach_data.append(go.Scatter(x=channels, y=comments_mean, mode='lines', name='mean'))

py.iplot({
        'data': comments_reach_data,
        'layout': {
            'title': 'Comments Comparison',
            'xaxis': {'title': 'Channels'},
            'yaxis': {'title': 'Comments'}
}})
Out[13]:
[Plotly box plot: 'Comments Comparison']

We can replace the missing data values (-1) with NaN for some statistical ease.

In [14]:
channels_df = channels_df.replace(-1, np.nan)

The head method selects the top 5 rows of the DataFrame. (The tail method returns the last 5 rows.)

In [15]:
channels_df.head()
Out[15]:
Channel Date Anchor Views Comments
0 WorldNews 2015-01-01 bob 495 39
1 WorldWeather 2015-01-01 emily 191 24
2 WorldNews 2015-01-02 bob 293 4
3 WorldWeather 2015-01-02 harry 238 NaN
4 WorldNews 2015-01-03 alice 509 13

Select rows only for the 'WorldNews' channel.

In [16]:
worldnews_df = channels_df[channels_df['Channel'] == 'WorldNews']

Select some sample rows from the new dataframe.

In [17]:
worldnews_df.sample(3)
Out[17]:
Channel Date Anchor Views Comments
140 WorldNews 2015-03-12 alice 293 25
82 WorldNews 2015-02-11 bob 432 9
124 WorldNews 2015-03-04 bob 133 13

Filter out the rows for 'WorldNews' channel having less than 100 views.

In [18]:
worldnews_less_views_df = worldnews_df[worldnews_df['Views'] < 100]

Count the days on which 'WorldNews' received fewer than 100 views.

In [19]:
worldnews_less_views_df.size
Out[19]:
105
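
Note that the size attribute counts every cell (rows × columns), so 105 here means 21 rows across the 5 columns. To count just the days (rows), use one of the following:

# number of matching rows, i.e. days
len(worldnews_less_views_df)        # 21
worldnews_less_views_df.shape[0]    # 21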

Indexing

To get and set subsets of a data object, Pandas supports 3 types of multi-axis indexing.

  • loc
    • It is based on the index labels.
  • iloc
    • It is based on the index positions, only integer values are accepted.
  • ix
    • Generally it behaves like loc, but falls back to iloc when the label is not present in the index. (Note that ix was deprecated in later Pandas releases in favor of loc and iloc.)

In our dataset the index contains only integer values, so all these methods will work similarly.

Note:- The DataFrame worldnews_df contains only even integers in the index.

In [20]:
worldnews_df.index
Out[20]:
Int64Index([  0,   2,   4,   6,   8,  10,  12,  14,  16,  18,
            ...
            342, 344, 346, 348, 350, 352, 354, 356, 358, 360],
           dtype='int64', length=181)

Selection By Label

Label-based indexing follows strict inclusion: for example, [10:14] will look for every value between 10 and 14, including both endpoints. At least one of the labels must be present in the index, otherwise a KeyError will be raised.

This will return the top 3 rows, having indexes 0, 2, and 4, because it looks at the labels.

In [21]:
worldnews_df.loc[0:4]
Out[21]:
Channel Date Anchor Views Comments
0 WorldNews 2015-01-01 bob 495 39
2 WorldNews 2015-01-02 bob 293 4
4 WorldNews 2015-01-03 alice 509 13

Selection By Position

Integer-based indexing follows the 0-based slicing you are used to from Python lists, where the starting bound is included but the upper bound is excluded. Using a non-integer, even a valid label, will raise an IndexError.

Hence, this will return the top 4 rows, because it looks at the positions.

In [22]:
worldnews_df.iloc[0:4]
Out[22]:
Channel Date Anchor Views Comments
0 WorldNews 2015-01-01 bob 495 39
2 WorldNews 2015-01-02 bob 293 4
4 WorldNews 2015-01-03 alice 509 13
6 WorldNews 2015-01-04 alice 490 54

This will work like loc because the index contains integer values.

In [23]:
worldnews_df.ix[0:4]
Out[23]:
Channel Date Anchor Views Comments
0 WorldNews 2015-01-01 bob 495 39
2 WorldNews 2015-01-02 bob 293 4
4 WorldNews 2015-01-03 alice 509 13

Grouping, Aggregating, and Pivoting data

Grouping

We can group (combine) the dataset based on certain parameters. For example, here we are grouping the 'WorldNews' dataset based on their 'Anchor' columns.

In [24]:
worldnews_anchors_group = worldnews_df.groupby(worldnews_df['Anchor'])

This will return a DataFrameGroupBy object.

In [25]:
type(worldnews_anchors_group)
Out[25]:
pandas.core.groupby.DataFrameGroupBy
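
Before aggregating, you can also inspect individual groups. As a small sketch: get_group pulls out the sub-DataFrame for a single key, and the object is directly iterable:

# sub-DataFrame containing only alice's rows
alice_df = worldnews_anchors_group.get_group('alice')

# iterating yields (group_key, sub_DataFrame) pairs
for anchor, group in worldnews_anchors_group:
    print('%s: %d rows' % (anchor, len(group)))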

Aggregating

Aggregating the resulting group with np.sum will total the respective 'Views' and 'Comments' for each anchor.

In [26]:
worldnews_anchors_group.aggregate(np.sum)
Out[26]:
Views Comments
Anchor
alice 17498 1227
bob 17754 821
jenn 16946 888

Views and Comments Share per Anchor

In [27]:
# we take the aggregation information as a DataFrame
share_df = worldnews_anchors_group.aggregate(np.sum)
anchors = share_df.index

# for each sections ('views', 'comments') update the data accordingly
share_data = [
    {
        'values': share_df['Views'],
        'labels': anchors,
        'domain': {'x': [0, .48]},
        'name': 'Views',
        'hoverinfo': 'label+value+name',
        'hole': 0.4,
        'type': 'pie'
    },
    {
        'values': share_df['Comments'],
        'labels': anchors,
        'domain': {'x': [0.52, 1]},
        'name': 'Comments',
        'hoverinfo': 'label+value+name',
        'hole': 0.4,
        'type': 'pie'
    }
]

layout = {
    'title': 'Views and Comments Share per Anchor [WorldNews]',
    'annotations': [
        {
            'text': 'Views',
            'font': {'size': 15},
            'x': 0.21,
            'y': 0.5,
            'showarrow': False
        },
        {
            'text': 'Comments',
            'font': {'size': 15},
            'x': 0.81,
            'y': 0.5,
            'showarrow': False
        }
    ]
}

py.iplot({'data': share_data, 'layout': layout})
Out[27]:
[Plotly donut charts: 'Views and Comments Share per Anchor [WorldNews]']

Pivoting

Using the pivot function, we can generate a new table from the original one.

It takes three arguments, which decide the index, columns, and cell values of the new table we want.

Consider this sample table for example.

In [28]:
worldnews_df.sample(10)
Out[28]:
Channel Date Anchor Views Comments
162 WorldNews 2015-03-23 alice 500 13
34 WorldNews 2015-01-18 bob 376 13
240 WorldNews 2015-05-01 jenn 97 9
102 WorldNews 2015-02-21 alice 73 9
28 WorldNews 2015-01-15 alice 118 4
244 WorldNews 2015-05-03 alice 99 47
228 WorldNews 2015-04-25 jenn 504 10
70 WorldNews 2015-02-05 jenn 497 9
138 WorldNews 2015-03-11 jenn 316 24
248 WorldNews 2015-05-05 jenn 135 6

We can reshape this table to find out how many 'Views' each anchor has received.

So we will create a new table whose index is 'Date' and whose columns are the distinct values of the 'Anchor' column.

The cell values will be the respective values of the 'Views' column.

In [29]:
worldnews_pivoted_df = worldnews_df.pivot(index='Date', columns='Anchor', values='Views')
worldnews_pivoted_df.sample(10)
Out[29]:
Anchor alice bob jenn
Date
2015-03-08 382 NaN NaN
2015-04-18 470 NaN NaN
2015-03-24 NaN NaN 154
2015-02-12 390 NaN NaN
2015-06-04 NaN 113 NaN
2015-01-28 NaN NaN 212
2015-04-17 NaN 301 NaN
2015-05-25 NaN 298 NaN
2015-01-12 NaN NaN 506
2015-03-23 500 NaN NaN

This is our derived table; you can see how many views different anchors have received.

Similarly, you can reshape your tables according to your use case.

Interpolation

Sometimes your dataset has missing values in it. Pandas provides the interpolate function to insert data between fixed points. It supports different methods for filling in the missing values; by default it uses linear interpolation.

We can count the absent values (NaN) using the isnull method.

In [30]:
worldnews_df.isnull().sum()
Out[30]:
Channel     0
Date        0
Anchor      0
Views       7
Comments    7
dtype: int64

Now let's interpolate the DataFrame.

In [31]:
worldnews_interpolated_df = worldnews_df.interpolate()

After interpolation, we can see that there are no more NaN values.

In [32]:
worldnews_interpolated_df.isnull().sum()
Out[32]:
Channel     0
Date        0
Anchor      0
Views       0
Comments    0
dtype: int64
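
Linear interpolation is only the default; the method argument selects other strategies. A sketch of two alternatives (method='nearest' requires SciPy to be installed):

# fill gaps with the nearest known value instead of a linear estimate
worldnews_df.interpolate(method='nearest')

# a non-interpolation alternative: replace NaN with a constant
worldnews_df.fillna(0)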

Joining data

Pandas provides multiple methods to easily combine DataFrame and Series objects, similar to SQL's join operations.

Consider these two small DataFrames.

In [33]:
Adf = worldnews_df.iloc[3:8]
Bdf = channels_df[channels_df['Channel'] == 'WorldWeather'].iloc[5:10]
In [34]:
Adf
Out[34]:
Channel Date Anchor Views Comments
6 WorldNews 2015-01-04 alice 490 54
8 WorldNews 2015-01-05 bob 231 13
10 WorldNews 2015-01-06 jenn 399 17
12 WorldNews 2015-01-07 jenn 463 26
14 WorldNews 2015-01-08 bob 479 27
In [35]:
Bdf
Out[35]:
Channel Date Anchor Views Comments
11 WorldWeather 2015-01-06 harry 293 9
13 WorldWeather 2015-01-07 emily 304 6
15 WorldWeather 2015-01-08 emily 470 10
17 WorldWeather 2015-01-09 tom NaN 15
19 WorldWeather 2015-01-10 harry 337 4

Concat

This function performs the concatenation operation along a given axis. axis=0 means the operation will be performed along the rows, and axis=1 means along the columns. (By default it operates along rows, i.e. axis=0.)

In [36]:
pd.concat([Adf, Bdf])
Out[36]:
Channel Date Anchor Views Comments
6 WorldNews 2015-01-04 alice 490 54
8 WorldNews 2015-01-05 bob 231 13
10 WorldNews 2015-01-06 jenn 399 17
12 WorldNews 2015-01-07 jenn 463 26
14 WorldNews 2015-01-08 bob 479 27
11 WorldWeather 2015-01-06 harry 293 9
13 WorldWeather 2015-01-07 emily 304 6
15 WorldWeather 2015-01-08 emily 470 10
17 WorldWeather 2015-01-09 tom NaN 15
19 WorldWeather 2015-01-10 harry 337 4
In [37]:
pd.concat([Adf, Bdf], axis=1)
Out[37]:
Channel Date Anchor Views Comments Channel Date Anchor Views Comments
6 WorldNews 2015-01-04 alice 490 54 NaN NaN NaN NaN NaN
8 WorldNews 2015-01-05 bob 231 13 NaN NaN NaN NaN NaN
10 WorldNews 2015-01-06 jenn 399 17 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN WorldWeather 2015-01-06 harry 293 9
12 WorldNews 2015-01-07 jenn 463 26 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN WorldWeather 2015-01-07 emily 304 6
14 WorldNews 2015-01-08 bob 479 27 NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN WorldWeather 2015-01-08 emily 470 10
17 NaN NaN NaN NaN NaN WorldWeather 2015-01-09 tom NaN 15
19 NaN NaN NaN NaN NaN WorldWeather 2015-01-10 harry 337 4

Join

Pandas provides a single function, merge, for all standard join operations between DataFrame objects. We will look at the basic arguments of this function.

merge(left, right, how='inner', on=None)
  • left - A DataFrame object
  • right - Another DataFrame to combine with.
  • how - Type of join operation, defaults to inner. Can be one of left, right, inner, and outer.
  • on - Column names on which the join operation will be performed. Must be available in both the left and right DataFrame.

Inner Join

By default, merge operates like SQL's inner join, using the intersection of keys from both DataFrames. Here, we are performing the join on the 'Date' column.

We will get 3 rows, because there are only three matches in the 'Date' column.

In [38]:
pd.merge(Adf, Bdf, on='Date')
Out[38]:
Channel_x Date Anchor_x Views_x Comments_x Channel_y Anchor_y Views_y Comments_y
0 WorldNews 2015-01-06 jenn 399 17 WorldWeather harry 293 9
1 WorldNews 2015-01-07 jenn 463 26 WorldWeather emily 304 6
2 WorldNews 2015-01-08 bob 479 27 WorldWeather emily 470 10

Left Outer Join

Uses the keys from the left DataFrame only.

how='left'

We can see that for the 'Date' values '2015-01-04' and '2015-01-05' there are no corresponding matches (NaN), because those rows are not present in the right DataFrame, Bdf.

In [39]:
pd.merge(Adf, Bdf, how='left', on='Date')
Out[39]:
Channel_x Date Anchor_x Views_x Comments_x Channel_y Anchor_y Views_y Comments_y
0 WorldNews 2015-01-04 alice 490 54 NaN NaN NaN NaN
1 WorldNews 2015-01-05 bob 231 13 NaN NaN NaN NaN
2 WorldNews 2015-01-06 jenn 399 17 WorldWeather harry 293 9
3 WorldNews 2015-01-07 jenn 463 26 WorldWeather emily 304 6
4 WorldNews 2015-01-08 bob 479 27 WorldWeather emily 470 10

Right Outer Join

Uses the keys from the right DataFrame only.

how='right'
In [40]:
pd.merge(Adf, Bdf, how='right', on='Date')
Out[40]:
Channel_x Date Anchor_x Views_x Comments_x Channel_y Anchor_y Views_y Comments_y
0 WorldNews 2015-01-06 jenn 399 17 WorldWeather harry 293 9
1 WorldNews 2015-01-07 jenn 463 26 WorldWeather emily 304 6
2 WorldNews 2015-01-08 bob 479 27 WorldWeather emily 470 10
3 NaN 2015-01-09 NaN NaN NaN WorldWeather tom NaN 15
4 NaN 2015-01-10 NaN NaN NaN WorldWeather harry 337 4

Full Outer Join

Uses the union of keys from both DataFrames.

how='outer'
In [41]:
pd.merge(Adf, Bdf, how='outer', on='Date')
Out[41]:
Channel_x Date Anchor_x Views_x Comments_x Channel_y Anchor_y Views_y Comments_y
0 WorldNews 2015-01-04 alice 490 54 NaN NaN NaN NaN
1 WorldNews 2015-01-05 bob 231 13 NaN NaN NaN NaN
2 WorldNews 2015-01-06 jenn 399 17 WorldWeather harry 293 9
3 WorldNews 2015-01-07 jenn 463 26 WorldWeather emily 304 6
4 WorldNews 2015-01-08 bob 479 27 WorldWeather emily 470 10
5 NaN 2015-01-09 NaN NaN NaN WorldWeather tom NaN 15
6 NaN 2015-01-10 NaN NaN NaN WorldWeather harry 337 4

Cleaning Data: Dates, Strings, and Formatting

We want our data in the cleanest form so that we can use it further. Getting it into the desired structure is called cleaning the data.

Numerical Formatting

Remember the interpolated DataFrame result? You can notice that the resulting values are of float64 type.

Some of the values are like 11.5, 7.5 etc.

In [42]:
type(worldnews_interpolated_df['Comments'][0])
Out[42]:
numpy.float64

As we know, 'Views' and 'Comments' should by definition be whole numbers. We can format them by applying the ceil function to the 'Comments' column.

In [43]:
worldnews_interpolated_df['Comments'] = worldnews_interpolated_df['Comments'].apply(np.ceil)

That will map a number to the smallest integer greater than or equal to it.
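
If you also want the column stored as integers rather than whole-valued floats, you can cast it afterwards. This is safe here because interpolation left no NaN values (an int column cannot hold NaN):

worldnews_interpolated_df['Comments'] = worldnews_interpolated_df['Comments'].astype(int)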

Dates Formatting

We can see that the Series objects in the column 'Date' are 'str' types.

In [44]:
type(worldnews_df['Date'][0])
Out[44]:
str

We can use the to_datetime function to convert them into 'Timestamp' type. The format argument defines the date format in the Series; '2015-01-01' matches '%Y-%m-%d'.

This can be used for timestamp-related calculations.

In [46]:
worldnews_df.loc[:, 'Date'] = pd.to_datetime(worldnews_df['Date'], format="%Y-%m-%d")
In [47]:
type(worldnews_df['Date'][0])
Out[47]:
pandas.tslib.Timestamp

Pandas also provides functionality equivalent to "datetime.timedelta", called "Timedelta".

In [48]:
time_a = worldnews_df['Date'][0]
time_b = worldnews_df['Date'][2]

time_diff = time_b - time_a
type(time_diff)
Out[48]:
pandas.tslib.Timedelta

We can access the different components of a "Timedelta" object, like "days", "hours", "minutes", etc.

In [49]:
time_diff.components
Out[49]:
Components(days=1, hours=0, minutes=0, seconds=0, milliseconds=0, microseconds=0, nanoseconds=0)
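
The components are also exposed as attributes on the Timedelta object, which is handy for arithmetic:

time_diff.days             # 1
time_diff.total_seconds()  # 86400.0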

String Formatting

Pandas provides a lot of string methods. For example, you can count the occurrences of a particular pattern in a column.

In [50]:
worldnews_df['Anchor'].str.count('jenn').sum()
Out[50]:
57

Using the unique method, you can check the distinct values in a column.

You can observe that the anchors' names are lowercase.

In [51]:
worldnews_df['Anchor'].unique()
Out[51]:
array(['bob', 'alice', 'jenn'], dtype=object)

Using the capitalize method, we can format the string values. There are other methods, such as upper, lower, etc.

In [52]:
worldnews_df['Anchor'].str.capitalize().unique()
Out[52]:
array(['Bob', 'Alice', 'Jenn'], dtype=object)

Now suppose there was a problem in data collection: the anchor's name should have been 'matt' instead of 'bob'.

We have a method for this situation too: replace, which takes the pattern and the replacement string as arguments.

In [53]:
worldnews_df['Anchor'].apply(lambda s: s.replace('bob', 'matt')).unique()
Out[53]:
array(['matt', 'alice', 'jenn'], dtype=object)