48 Pandas

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is built on top of NumPy, a library that provides support for multi-dimensional arrays, and integrates well with other libraries in the Python data science stack, such as Matplotlib for plotting, SciPy for scientific computing, and scikit-learn for machine learning.

48.1 Core Features

  • Data Structures: Pandas introduces two primary data structures: DataFrame and Series. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Series, on the other hand, is a one-dimensional array with axis labels.

  • Data Handling: Pandas excels at handling missing data, aligning data, and merging, reshaping, selecting, slicing, and indexing datasets.

  • File Import/Export: It provides extensive capabilities to read and write data in a wide variety of formats, including CSV, Excel, SQL databases, JSON, HTML, and more.

  • Time Series: Pandas offers comprehensive support for working with time series data, including date range generation, frequency conversion, moving window statistics, and more.
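
As a quick illustration of the time series support, the sketch below (dates and values are arbitrary placeholders) downsamples a daily series to weekly means and computes a 7-day moving average:

import pandas as pd
import numpy as np

# Daily index covering the first 30 days of 2023 (placeholder dates)
idx = pd.date_range('2023-01-01', periods=30, freq='D')
ts = pd.Series(np.random.randn(30), index=idx)

# Frequency conversion: downsample daily values to weekly means
weekly = ts.resample('W').mean()

# Moving window statistics: 7-day rolling average
rolling_avg = ts.rolling(window=7).mean()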

48.2 Installation

Pandas can be installed using pip, the package installer for Python (prefix the commands with ! when running them inside a Jupyter notebook):

pip install pandas
pip install numpy
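
To verify the installation, import the packages and print their versions:

import pandas as pd
import numpy as np

print(pd.__version__)
print(np.__version__)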

48.2.1 Basic Usage

Importing Pandas is typically done using the pd alias:

import pandas as pd
import numpy as np

Creating Data Structures

  • Series:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
  • DataFrame:
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

Viewing Data

  • View the top and bottom rows of the frame:
df.head()
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
df.tail(3)
                   A         B         C         D
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
  • Display the index, columns, and the underlying NumPy data:
df.index
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.to_numpy()
array([[-0.36040673,  0.19821671, -0.66628248,  1.38495845],
       [ 0.61175417,  0.35863229,  0.19958589, -0.86117245],
       [ 0.35689058, -1.23474669, -0.2697196 , -0.0451483 ],
       [ 0.65495322, -2.80022109, -1.24324498,  0.71775785],
       [ 0.9878161 , -1.33843215,  0.25138646,  1.44509816],
       [ 1.02166745, -0.39002205, -1.56088974, -0.1060528 ]])
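  • Two other commonly used inspection methods are describe(), which shows summary statistics per numeric column, and info(), which lists column dtypes and non-null counts (output omitted here):
df.describe()
df.info()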

Data Selection

  • Selecting a single column, which yields a Series:
df['A']
2023-01-01   -0.360407
2023-01-02    0.611754
2023-01-03    0.356891
2023-01-04    0.654953
2023-01-05    0.987816
2023-01-06    1.021667
Freq: D, Name: A, dtype: float64
  • Selecting via [], which slices the rows:
df[0:3]
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
  • Selection by label:
df.loc[dates[0]]
A   -0.360407
B    0.198217
C   -0.666282
D    1.384958
Name: 2023-01-01 00:00:00, dtype: float64
  • Selection by position:
df.iloc[3]
A    0.654953
B   -2.800221
C   -1.243245
D    0.717758
Name: 2023-01-04 00:00:00, dtype: float64
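  • Rows can also be filtered with boolean conditions (a minimal sketch; output omitted):
# Keep only the rows where column A is positive
df[df['A'] > 0]
# Combine conditions with & and |, wrapping each condition in parentheses
df[(df['A'] > 0) & (df['B'] < 0)]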

Missing Data

Pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations.

  • To drop any rows that contain missing data:
df.dropna(how='any')
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
  • Filling missing data:
df.fillna(value=5)
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
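
Note that the frame above contains no missing values, so both calls return it unchanged. A minimal sketch that introduces an actual NaN, via a hypothetical extra column 'E', shows the effect:

# Copy the frame and add a column containing missing values
df2 = df.copy()
df2['E'] = [1.0, np.nan, 2.0, np.nan, 3.0, 4.0]

df2.dropna(how='any')   # drops the rows where 'E' is NaN
df2.fillna(value=5)     # replaces each NaN with 5
df2['E'].mean()         # NaN values are skipped when computing the mean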

Operations

  • Stats:
df.mean()
A    0.545446
B   -0.867762
C   -0.548194
D    0.422573
dtype: float64
  • Applying functions to the data:
df.apply(np.cumsum)
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.251347  0.556849 -0.466697  0.523786
2023-01-03  0.608238 -0.677898 -0.736416  0.478638
2023-01-04  1.263191 -3.478119 -1.979661  1.196396
2023-01-05  2.251007 -4.816551 -1.728275  2.641494
2023-01-06  3.272675 -5.206573 -3.289164  2.535441
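  • apply() also accepts user-defined functions, for example a lambda computing the spread (max minus min) of each column:
df.apply(lambda x: x.max() - x.min())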

Grouping

  • Group by operations:
df.groupby('A').sum()
                  B         C         D
A                                      
-0.360407  0.198217 -0.666282  1.384958
 0.356891 -1.234747 -0.269720 -0.045148
 0.611754  0.358632  0.199586 -0.861172
 0.654953 -2.800221 -1.243245  0.717758
 0.987816 -1.338432  0.251386  1.445098
 1.021667 -0.390022 -1.560890 -0.106053
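
Because every value in column 'A' above is unique, each group contains exactly one row. Grouping is more commonly applied to a column with repeated values; a minimal sketch with a hypothetical frame:

# Hypothetical frame with a repeated grouping key
sales = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'Revenue': [100, 200, 150, 250]
})

# Total Revenue per City
sales.groupby('City').sum()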

Merging

  • Concatenating pandas objects together:
import pandas as pd

# Create first DataFrame
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Amit', 'Bina', 'Chirag'],
    'Marks': [85, 90, 78]
})

# Create second DataFrame
df2 = pd.DataFrame({
    'ID': [4, 5, 6],
    'Name': ['Divya', 'Eshan', 'Farah'],
    'Marks': [88, 92, 80]
})

# Concatenate the two DataFrames
df_combined = pd.concat([df1, df2])

# Display the combined DataFrame
print(df_combined)
   ID    Name  Marks
0   1    Amit     85
1   2    Bina     90
2   3  Chirag     78
0   4   Divya     88
1   5   Eshan     92
2   6   Farah     80
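Note that concat() preserves the original row labels, which is why the index above repeats 0, 1, 2. Passing ignore_index=True builds a fresh 0..n-1 index instead:
df_combined = pd.concat([df1, df2], ignore_index=True)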
  • SQL-style merges:
import pandas as pd

# Create the first DataFrame (left table)
left = pd.DataFrame({
    'key': ['K1', 'K2', 'K3', 'K4'],
    'Name': ['Amit', 'Bina', 'Chirag', 'Divya']
})

# Create the second DataFrame (right table)
right = pd.DataFrame({
    'key': ['K1', 'K2', 'K3', 'K5'],
    'Marks': [85, 90, 78, 88]
})

# Perform SQL-style inner join on the column 'key'
merged_df = pd.merge(left, right, on='key')

# Display the result
print(merged_df)
  key    Name  Marks
0  K1    Amit     85
1  K2    Bina     90
2  K3  Chirag     78
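
By default merge() performs an inner join, which is why keys K4 and K5 (each present in only one table) are dropped. The how parameter selects other SQL-style joins, e.g.:

# Left join: keep every row of 'left'; unmatched Marks become NaN
pd.merge(left, right, on='key', how='left')

# Outer join: keep the keys from both tables
pd.merge(left, right, on='key', how='outer')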

File I/O

  • Reading and writing to CSV:

pd.read_csv('filename.csv')
df.to_csv('my_dataframe.csv')

  • Reading and writing to Excel:

pd.read_excel('filename.xlsx', sheet_name='Sheet1')
df.to_excel('my_dataframe.xlsx', sheet_name='Sheet1')
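
  • Reading and writing other formats mentioned earlier follows the same read_*/to_* pattern; a minimal sketch for JSON (file names are placeholders):

pd.read_json('filename.json')
df.to_json('my_dataframe.json')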

Pandas is a foundational tool for data analysis in Python, offering a comprehensive set of functions and methods for efficient data manipulation and analysis. Its robust handling of complex data operations makes it indispensable for data scientists and analysts.

To load CSV data in Python, pass the file path (or just the file name, if the file is in the working directory) to pd.read_csv():

import pandas as pd

# Load data from a CSV file
data1 = pd.read_csv("customers-100.csv")