import pandas as pd
import numpy as np
48 Pandas
Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It’s built on top of NumPy, another library offering support for multi-dimensional arrays, and integrates well with other libraries in the Python Data Science stack like Matplotlib for plotting, SciPy for scientific computing, and scikit-learn for machine learning.
48.1 Core Features
- Data Structures: Pandas introduces two primary data structures: DataFrame and Series. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Series, on the other hand, is a one-dimensional array with axis labels.
- Handling of Data: Pandas excels at handling missing data, aligning data, and merging, reshaping, selecting, slicing, and indexing datasets.
- File Import/Export: It provides extensive capabilities to read and write data in a wide variety of formats, including CSV, Excel, SQL databases, JSON, HTML, and more.
- Time Series: Pandas offers comprehensive support for working with time series data, including date range generation, frequency conversion, moving window statistics, and more.
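As a quick illustration of these time series features, the minimal sketch below builds a daily series, downsamples it, and computes a rolling mean; the variable names are illustrative only.
import pandas as pd
import numpy as np

# a daily time series of 10 random values
ts = pd.Series(np.random.randn(10),
               index=pd.date_range('2023-01-01', periods=10, freq='D'))

# frequency conversion: downsample to 2-day bins and average
ts.resample('2D').mean()

# moving window statistics: 3-day rolling mean
ts.rolling(window=3).mean()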
48.2 Installation
Pandas (together with NumPy, which it builds on) can be installed using pip, the package installer for Python:
!pip install pandas
!pip install numpy
48.2.1 Basic Usage
Importing Pandas is typically done using the pd alias:
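import pandas as pd
import numpy as np   # NumPy is used alongside pandas in the examples below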
Creating Data Structures
- Series:
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
- DataFrame:
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
Viewing Data
- View the top and bottom rows of the frame:
df.head()
            A         B         C         D
2023-01-01 0.362678 1.880498 -0.686170 -1.094909
2023-01-02 0.075910 1.686341 0.765444 1.109194
2023-01-03 0.107877 0.570339 2.743528 -0.721506
2023-01-04 0.512200 1.050278 -0.391836 -0.441960
2023-01-05 -0.843603 1.105325 0.546459 -0.417343
df.tail(3)
            A         B         C         D
2023-01-04 0.512200 1.050278 -0.391836 -0.441960
2023-01-05 -0.843603 1.105325 0.546459 -0.417343
2023-01-06 1.585084 1.293396 -0.427987 0.450398
- Display the index, columns, and the underlying NumPy data:
df.index
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
'2023-01-05', '2023-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.to_numpy()
array([[ 0.36267751, 1.88049838, -0.6861699 , -1.0949088 ],
[ 0.07591005, 1.68634052, 0.76544427, 1.1091943 ],
[ 0.10787749, 0.57033876, 2.74352781, -0.72150573],
[ 0.51219984, 1.05027816, -0.39183599, -0.44195953],
[-0.84360338, 1.10532453, 0.54645899, -0.41734259],
[ 1.58508405, 1.29339627, -0.42798653, 0.45039807]])
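Beyond head() and tail(), a few other quick-look methods are commonly used; the calls below are standard pandas API, shown here without their output.
# quick statistical summary of the numeric columns
df.describe()

# sort by an axis, or by the values of a column
df.sort_index(axis=1, ascending=False)
df.sort_values(by='B')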
Data Selection
- Selecting a single column, which yields a Series:
df['A']
2023-01-01 0.362678
2023-01-02 0.075910
2023-01-03 0.107877
2023-01-04 0.512200
2023-01-05 -0.843603
2023-01-06 1.585084
Freq: D, Name: A, dtype: float64
- Selecting via [], which slices the rows:
df[0:3]
            A         B         C         D
2023-01-01 0.362678 1.880498 -0.686170 -1.094909
2023-01-02 0.075910 1.686341 0.765444 1.109194
2023-01-03 0.107877 0.570339 2.743528 -0.721506
- Selection by label:
df.loc[dates[0]]
A 0.362678
B 1.880498
C -0.686170
D -1.094909
Name: 2023-01-01 00:00:00, dtype: float64
- Selection by position:
df.iloc[3]
A 0.512200
B 1.050278
C -0.391836
D -0.441960
Name: 2023-01-04 00:00:00, dtype: float64
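Boolean indexing is another common selection pattern; the minimal sketch below filters rows with a condition on a column and is standard pandas usage.
# rows where column A is positive
df[df['A'] > 0]

# keep only values satisfying a condition (the rest become NaN)
df[df > 0]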
Missing Data
Pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations.
- To drop any rows that contain missing data:
df.dropna(how='any')
            A         B         C         D
2023-01-01 0.362678 1.880498 -0.686170 -1.094909
2023-01-02 0.075910 1.686341 0.765444 1.109194
2023-01-03 0.107877 0.570339 2.743528 -0.721506
2023-01-04 0.512200 1.050278 -0.391836 -0.441960
2023-01-05 -0.843603 1.105325 0.546459 -0.417343
2023-01-06 1.585084 1.293396 -0.427987 0.450398
- Filling missing data:
df.fillna(value=5)
            A         B         C         D
2023-01-01 0.362678 1.880498 -0.686170 -1.094909
2023-01-02 0.075910 1.686341 0.765444 1.109194
2023-01-03 0.107877 0.570339 2.743528 -0.721506
2023-01-04 0.512200 1.050278 -0.391836 -0.441960
2023-01-05 -0.843603 1.105325 0.546459 -0.417343
2023-01-06 1.585084 1.293396 -0.427987 0.450398
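Because df above contains no missing values, dropna() and fillna() return it unchanged. The sketch below introduces actual NaN values so the effect is visible; df2 and the column name 'E' are illustrative, not from the original text.
# copy a slice of df and add a new column 'E' that is mostly missing
df2 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df2.loc[dates[0]:dates[1], 'E'] = 1

# boolean mask of the missing values
df2.isna()

# drop rows containing any NaN, or fill NaN with a constant
df2.dropna(how='any')
df2.fillna(value=5)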
Operations
- Stats:
df.mean()
A 0.300024
B 1.264363
C 0.424906
D -0.186021
dtype: float64
- Applying functions to the data:
df.apply(np.cumsum)
            A         B         C         D
2023-01-01 0.362678 1.880498 -0.686170 -1.094909
2023-01-02 0.438588 3.566839 0.079274 0.014286
2023-01-03 0.546465 4.137178 2.822802 -0.707220
2023-01-04 1.058665 5.187456 2.430966 -1.149180
2023-01-05 0.215061 6.292780 2.977425 -1.566522
2023-01-06 1.800146 7.586177 2.549439 -1.116124
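apply() also accepts user-defined functions; the lambda below is a standard example (not from the original text) that computes the range of each column.
# spread (max - min) of each column
df.apply(lambda x: x.max() - x.min())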
Grouping
- Group by operations:
df.groupby('A').sum()
                   B         C         D
A
-0.843603 1.105325 0.546459 -0.417343
0.075910 1.686341 0.765444 1.109194
0.107877 0.570339 2.743528 -0.721506
0.362678 1.880498 -0.686170 -1.094909
0.512200 1.050278 -0.391836 -0.441960
1.585084 1.293396 -0.427987 0.450398
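Because every value in column A above is unique, each "group" contains a single row. Grouping is more instructive on a column with repeated labels; the small DataFrame below is an illustrative sketch, not from the original text.
sales = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'amount': [100, 200, 150, 250]
})

# total amount per city
sales.groupby('city')['amount'].sum()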
Merging
- Concatenating pandas objects together:
import pandas as pd
# Create first DataFrame
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Amit', 'Bina', 'Chirag'],
'Marks': [85, 90, 78]
})
# Create second DataFrame
df2 = pd.DataFrame({
'ID': [4, 5, 6],
'Name': ['Divya', 'Eshan', 'Farah'],
'Marks': [88, 92, 80]
})
# Concatenate the two DataFrames
df_combined = pd.concat([df1, df2])
# Display the combined DataFrame
print(df_combined)
   ID    Name  Marks
0 1 Amit 85
1 2 Bina 90
2 3 Chirag 78
0 4 Divya 88
1 5 Eshan 92
2 6 Farah 80
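Note that the original row labels (0, 1, 2) are repeated after concatenation. Passing ignore_index=True, a standard option of pd.concat, produces a fresh 0..5 index:
df_combined = pd.concat([df1, df2], ignore_index=True)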
- SQL style merges:
import pandas as pd
# Create the first DataFrame (left table)
left = pd.DataFrame({
'key': ['K1', 'K2', 'K3', 'K4'],
'Name': ['Amit', 'Bina', 'Chirag', 'Divya']
})
# Create the second DataFrame (right table)
right = pd.DataFrame({
'key': ['K1', 'K2', 'K3', 'K5'],
'Marks': [85, 90, 78, 88]
})
# Perform SQL-style inner join on the column 'key'
merged_df = pd.merge(left, right, on='key')
# Display the result
print(merged_df)
  key    Name  Marks
0 K1 Amit 85
1 K2 Bina 90
2 K3 Chirag 78
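By default pd.merge performs an inner join, so K4 and K5 (each present in only one table) are dropped. The how parameter selects other join types; the call below is a sketch of a left join.
# keep every row from `left`; Marks is NaN where no match exists in `right`
pd.merge(left, right, on='key', how='left')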
File I/O
- Reading and writing to CSV:
pd.read_csv('filename.csv')
df.to_csv('my_dataframe.csv')
- Reading and writing to Excel:
pd.read_excel('filename.xlsx', sheet_name='Sheet1')
df.to_excel('my_dataframe.xlsx', sheet_name='Sheet1')
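The feature list above also mentions JSON; round-tripping JSON uses the analogous pair of methods (the file name here is illustrative):
df.to_json('my_dataframe.json')
pd.read_json('my_dataframe.json')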
Pandas is a foundational tool for data analysis in Python, offering comprehensive functions and methods to perform efficient data manipulation and analysis. Its robust features for handling complex data operations make it an indispensable tool for data scientists and analysts working in Python.
To load CSV data in Python:
pd.read_csv("file path or file name")
For example:
import pandas as pd

# load data
data1 = pd.read_csv("customers-100.csv")