48 Pandas

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is built on top of NumPy, a library that provides support for multi-dimensional arrays, and integrates well with other libraries in the Python data science stack, such as Matplotlib for plotting, SciPy for scientific computing, and scikit-learn for machine learning.

48.1 Core Features

  • Data Structures: Pandas introduces two primary data structures: DataFrame and Series. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Series, on the other hand, is a one-dimensional array with axis labels.

  • Data Handling: Pandas excels at handling missing data, aligning data, and merging, reshaping, selecting, slicing, and indexing datasets.

  • File Import/Export: It provides extensive capabilities to read and write data in a wide variety of formats, including CSV, Excel, SQL databases, JSON, HTML, and more.

  • Time Series: Pandas offers comprehensive support for working with time series data, including date range generation, frequency conversion, moving window statistics, and more.
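
As a quick illustration of the time series support, the sketch below (dates and values are arbitrary placeholders) downsamples a daily series to weekly means and computes a 7-day moving average:

import pandas as pd
import numpy as np

# Daily index covering the first 30 days of 2023 (placeholder dates)
idx = pd.date_range('2023-01-01', periods=30, freq='D')
ts = pd.Series(np.random.randn(30), index=idx)

# Frequency conversion: downsample daily values to weekly means
weekly = ts.resample('W').mean()

# Moving window statistics: 7-day rolling average
rolling_avg = ts.rolling(window=7).mean()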

48.2 Installation

Pandas can be installed using pip, the package installer for Python (prefix the commands with ! when running them inside a Jupyter notebook):

pip install pandas
pip install numpy
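
To verify the installation, import the packages and print their versions:

import pandas as pd
import numpy as np

print(pd.__version__)
print(np.__version__)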

48.2.1 Basic Usage

Importing Pandas is typically done using the pd alias:

import pandas as pd
import numpy as np

Creating Data Structures

  • Series:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
  • DataFrame:
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

Viewing Data

  • View the top and bottom rows of the frame:
df.head()
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
df.tail(3)
                   A         B         C         D
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
  • Display the index, columns, and the underlying NumPy data:
df.index
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.to_numpy()
array([[-0.36040673,  0.19821671, -0.66628248,  1.38495845],
       [ 0.61175417,  0.35863229,  0.19958589, -0.86117245],
       [ 0.35689058, -1.23474669, -0.2697196 , -0.0451483 ],
       [ 0.65495322, -2.80022109, -1.24324498,  0.71775785],
       [ 0.9878161 , -1.33843215,  0.25138646,  1.44509816],
       [ 1.02166745, -0.39002205, -1.56088974, -0.1060528 ]])
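  • Two other commonly used inspection methods are describe(), which shows summary statistics per numeric column, and info(), which lists column dtypes and non-null counts (output omitted here):
df.describe()
df.info()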

Data Selection

  • Selecting a single column, which yields a Series:
df['A']
2023-01-01   -0.360407
2023-01-02    0.611754
2023-01-03    0.356891
2023-01-04    0.654953
2023-01-05    0.987816
2023-01-06    1.021667
Freq: D, Name: A, dtype: float64
  • Selecting via [], which slices the rows:
df[0:3]
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
  • Selection by label:
df.loc[dates[0]]
A   -0.360407
B    0.198217
C   -0.666282
D    1.384958
Name: 2023-01-01 00:00:00, dtype: float64
  • Selection by position:
df.iloc[3]
A    0.654953
B   -2.800221
C   -1.243245
D    0.717758
Name: 2023-01-04 00:00:00, dtype: float64
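  • Rows can also be filtered with boolean conditions (a minimal sketch; output omitted):
# Keep only the rows where column A is positive
df[df['A'] > 0]
# Combine conditions with & and |, wrapping each condition in parentheses
df[(df['A'] > 0) & (df['B'] < 0)]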

Missing Data

Pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations.

  • To drop any rows that contain missing data:
df.dropna(how='any')
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
  • Filling missing data:
df.fillna(value=5)
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
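
Note that the frame above contains no missing values, so both calls return it unchanged. A minimal sketch that introduces an actual NaN, via a hypothetical extra column 'E', shows the effect:

# Copy the frame and add a column containing missing values
df2 = df.copy()
df2['E'] = [1.0, np.nan, 2.0, np.nan, 3.0, 4.0]

df2.dropna(how='any')   # drops the rows where 'E' is NaN
df2.fillna(value=5)     # replaces each NaN with 5
df2['E'].mean()         # NaN values are skipped when computing the mean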

Operations

  • Stats:
df.mean()
A    0.545446
B   -0.867762
C   -0.548194
D    0.422573
dtype: float64
  • Applying functions to the data:
df.apply(np.cumsum)
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.251347  0.556849 -0.466697  0.523786
2023-01-03  0.608238 -0.677898 -0.736416  0.478638
2023-01-04  1.263191 -3.478119 -1.979661  1.196396
2023-01-05  2.251007 -4.816551 -1.728275  2.641494
2023-01-06  3.272675 -5.206573 -3.289164  2.535441
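  • apply() also accepts user-defined functions, for example a lambda computing the spread (max minus min) of each column:
df.apply(lambda x: x.max() - x.min())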

Grouping

  • Group by operations:
df.groupby('A').sum()
                  B         C         D
A                                      
-0.360407  0.198217 -0.666282  1.384958
 0.356891 -1.234747 -0.269720 -0.045148
 0.611754  0.358632  0.199586 -0.861172
 0.654953 -2.800221 -1.243245  0.717758
 0.987816 -1.338432  0.251386  1.445098
 1.021667 -0.390022 -1.560890 -0.106053
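
Because every value in column 'A' above is unique, each group contains exactly one row. Grouping is more commonly applied to a column with repeated values; a minimal sketch with a hypothetical frame:

# Hypothetical frame with a repeated grouping key
sales = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'Revenue': [100, 200, 150, 250]
})

# Total Revenue per City
sales.groupby('City').sum()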

Merging

  • Concatenating pandas objects together:
import pandas as pd

# Create first DataFrame
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Amit', 'Bina', 'Chirag'],
    'Marks': [85, 90, 78]
})

# Create second DataFrame
df2 = pd.DataFrame({
    'ID': [4, 5, 6],
    'Name': ['Divya', 'Eshan', 'Farah'],
    'Marks': [88, 92, 80]
})

# Concatenate the two DataFrames
df_combined = pd.concat([df1, df2])

# Display the combined DataFrame
print(df_combined)
   ID    Name  Marks
0   1    Amit     85
1   2    Bina     90
2   3  Chirag     78
0   4   Divya     88
1   5   Eshan     92
2   6   Farah     80
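Note that concat() preserves the original row labels, which is why the index above repeats 0, 1, 2. Passing ignore_index=True builds a fresh 0..n-1 index instead:
df_combined = pd.concat([df1, df2], ignore_index=True)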
  • SQL-style merges:
import pandas as pd

# Create the first DataFrame (left table)
left = pd.DataFrame({
    'key': ['K1', 'K2', 'K3', 'K4'],
    'Name': ['Amit', 'Bina', 'Chirag', 'Divya']
})

# Create the second DataFrame (right table)
right = pd.DataFrame({
    'key': ['K1', 'K2', 'K3', 'K5'],
    'Marks': [85, 90, 78, 88]
})

# Perform SQL-style inner join on the column 'key'
merged_df = pd.merge(left, right, on='key')

# Display the result
print(merged_df)
  key    Name  Marks
0  K1    Amit     85
1  K2    Bina     90
2  K3  Chirag     78
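
By default merge() performs an inner join, which is why keys K4 and K5 (each present in only one table) are dropped. The how parameter selects other SQL-style joins, e.g.:

# Left join: keep every row of 'left'; unmatched Marks become NaN
pd.merge(left, right, on='key', how='left')

# Outer join: keep the keys from both tables
pd.merge(left, right, on='key', how='outer')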

File I/O

  • Reading and writing to CSV:

pd.read_csv('filename.csv')
df.to_csv('my_dataframe.csv')

  • Reading and writing to Excel:

pd.read_excel('filename.xlsx', sheet_name='Sheet1')
df.to_excel('my_dataframe.xlsx', sheet_name='Sheet1')
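
  • Reading and writing other formats mentioned earlier follows the same read_*/to_* pattern; a minimal sketch for JSON (file names are placeholders):

pd.read_json('filename.json')
df.to_json('my_dataframe.json')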

Pandas is a foundational tool for data analysis in Python, offering a comprehensive set of functions and methods for efficient data manipulation and analysis. Its robust handling of complex data operations makes it indispensable for data scientists and analysts.

To load CSV data in Python, pass the file path (or just the file name, if the file is in the working directory) to pd.read_csv():

import pandas as pd

# Load data from a CSV file
data1 = pd.read_csv("customers-100.csv")