48 Pandas
Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It’s built on top of NumPy, another library offering support for multi-dimensional arrays, and integrates well with other libraries in the Python Data Science stack like Matplotlib for plotting, SciPy for scientific computing, and scikit-learn for machine learning.
48.1 Core Features
- Data Structures: Pandas introduces two primary data structures: DataFrame and Series. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Series, on the other hand, is a one-dimensional array with axis labels.
- Handling of Data: Pandas excels at handling missing data, aligning data, and merging, reshaping, selecting, slicing, and indexing datasets.
- File Import/Export: It provides extensive capabilities to read and write data in a wide variety of formats, including CSV, Excel, SQL databases, JSON, HTML, and more.
- Time Series: Pandas offers comprehensive support for working with time series data, including date range generation, frequency conversion, moving window statistics, and more. A brief sketch of these features follows this list.
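As a quick preview of that time-series support, here is a minimal sketch; the daily values, the 'W' resampling frequency, and the window size are made-up illustrations, and the imports anticipate the installation and import steps covered next:

```{python}
import pandas as pd
import numpy as np

# A made-up daily series indexed by a generated date range
idx = pd.date_range('2023-01-01', periods=10, freq='D')
prices = pd.Series(np.arange(10, dtype=float), index=idx)

# Frequency conversion: aggregate the daily values to weekly means
weekly_mean = prices.resample('W').mean()

# Moving-window statistic: 3-day rolling average
rolling_avg = prices.rolling(window=3).mean()
```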
48.2 Installation
Pandas can be installed using pip, a package installer for Python:
```{python}
!pip install pandas
!pip install numpy
```
48.2.1 Basic Usage
Importing Pandas is typically done using the pd alias, with NumPy imported alongside it as np:

```{python}
import pandas as pd
import numpy as np
```

Creating Data Structures

- Series:

```{python}
s = pd.Series([1, 3, 5, np.nan, 6, 8])
```

- DataFrame:
```{python}
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
```

Viewing Data
- View the top and bottom rows of the frame:
```{python}
df.head()
```

```
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
```

```{python}
df.tail(3)
```

```
                   A         B         C         D
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
```
- Display the index, columns, and the underlying NumPy data:
```{python}
df.index
```

```
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06'],
              dtype='datetime64[ns]', freq='D')
```

```{python}
df.columns
```

```
Index(['A', 'B', 'C', 'D'], dtype='object')
```

```{python}
df.to_numpy()
```

```
array([[-0.36040673,  0.19821671, -0.66628248,  1.38495845],
       [ 0.61175417,  0.35863229,  0.19958589, -0.86117245],
       [ 0.35689058, -1.23474669, -0.2697196 , -0.0451483 ],
       [ 0.65495322, -2.80022109, -1.24324498,  0.71775785],
       [ 0.9878161 , -1.33843215,  0.25138646,  1.44509816],
       [ 1.02166745, -0.39002205, -1.56088974, -0.1060528 ]])
```
Data Selection
- Selecting a single column, which yields a Series:
```{python}
df['A']
```

```
2023-01-01   -0.360407
2023-01-02    0.611754
2023-01-03    0.356891
2023-01-04    0.654953
2023-01-05    0.987816
2023-01-06    1.021667
Freq: D, Name: A, dtype: float64
```
- Selecting via [], which slices the rows:
```{python}
df[0:3]
```

```
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
```
- Selection by label:
```{python}
df.loc[dates[0]]
```

```
A   -0.360407
B    0.198217
C   -0.666282
D    1.384958
Name: 2023-01-01 00:00:00, dtype: float64
```
- Selection by position:
```{python}
df.iloc[3]
```

```
A    0.654953
B   -2.800221
C   -1.243245
D    0.717758
Name: 2023-01-04 00:00:00, dtype: float64
```
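Both accessors also take row and column selectors together. The sketch below reuses the df created earlier and is only one of several equivalent ways to slice it (outputs omitted):

```{python}
# Label-based selection: rows by date slice, columns by name
subset_by_label = df.loc['2023-01-02':'2023-01-04', ['A', 'B']]

# Position-based selection: first three rows, first two columns
subset_by_position = df.iloc[0:3, 0:2]
```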
Missing Data
Pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations.
- To drop any rows that contain missing data:
```{python}
df.dropna(how='any')
```

```
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
```
- Filling missing data:
```{python}
df.fillna(value=5)
```

```
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.611754  0.358632  0.199586 -0.861172
2023-01-03  0.356891 -1.234747 -0.269720 -0.045148
2023-01-04  0.654953 -2.800221 -1.243245  0.717758
2023-01-05  0.987816 -1.338432  0.251386  1.445098
2023-01-06  1.021667 -0.390022 -1.560890 -0.106053
```
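Because the example df above happens to contain no missing values, dropna and fillna leave it unchanged. A small sketch with an explicit np.nan (a made-up frame, not the chapter's running dataset) shows the effect more clearly:

```{python}
# A small frame with one missing value
df_missing = pd.DataFrame({'X': [1.0, np.nan, 3.0], 'Y': [4.0, 5.0, 6.0]})

df_missing.isna()             # boolean mask of missing entries
df_missing.dropna(how='any')  # drops the second row, which contains the NaN
df_missing.fillna(value=0)    # replaces the NaN with 0
```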
Operations
- Stats:
```{python}
df.mean()
```

```
A    0.545446
B   -0.867762
C   -0.548194
D    0.422573
dtype: float64
```
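Other summary methods follow the same pattern; a brief illustrative sketch (outputs omitted):

```{python}
df.mean(axis=1)   # row-wise means instead of column-wise
df.describe()     # count, mean, std, min, quartiles, and max per column
```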
- Applying functions to the data:
```{python}
df.apply(np.cumsum)
```

```
                   A         B         C         D
2023-01-01 -0.360407  0.198217 -0.666282  1.384958
2023-01-02  0.251347  0.556849 -0.466697  0.523786
2023-01-03  0.608238 -0.677898 -0.736416  0.478638
2023-01-04  1.263191 -3.478119 -1.979661  1.196396
2023-01-05  2.251007 -4.816551 -1.728275  2.641494
2023-01-06  3.272675 -5.206573 -3.289164  2.535441
```
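apply also accepts user-defined functions, for example a lambda computing each column's range (an illustrative addition, output omitted):

```{python}
# Difference between the max and min of each column
df.apply(lambda x: x.max() - x.min())
```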
Grouping
- Group by operations:
```{python}
df.groupby('A').sum()
```

```
                  B         C         D
A
-0.360407  0.198217 -0.666282  1.384958
 0.356891 -1.234747 -0.269720 -0.045148
 0.611754  0.358632  0.199586 -0.861172
 0.654953 -2.800221 -1.243245  0.717758
 0.987816 -1.338432  0.251386  1.445098
 1.021667 -0.390022 -1.560890 -0.106053
```
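Because every value in column 'A' above is unique, each "group" contains a single row, so the result simply mirrors the original frame. A more typical use groups on a repeated key column; the frame below is a made-up example:

```{python}
# A made-up frame with a repeating key column
sales = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'amount': [100, 200, 150, 250]
})

# Sum the amounts within each city
sales.groupby('city').sum()
#         amount
# city
# Delhi      250
# Mumbai     450
```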
Merging
- Concatenating pandas objects together:
```{python}
import pandas as pd

# Create first DataFrame
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Amit', 'Bina', 'Chirag'],
    'Marks': [85, 90, 78]
})

# Create second DataFrame
df2 = pd.DataFrame({
    'ID': [4, 5, 6],
    'Name': ['Divya', 'Eshan', 'Farah'],
    'Marks': [88, 92, 80]
})

# Concatenate the two DataFrames
df_combined = pd.concat([df1, df2])

# Display the combined DataFrame
print(df_combined)
```

```
   ID    Name  Marks
0   1    Amit     85
1   2    Bina     90
2   3  Chirag     78
0   4   Divya     88
1   5   Eshan     92
2   6   Farah     80
```
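Note that concat keeps the original row labels, which is why the index above repeats 0 to 2. Passing ignore_index=True rebuilds a fresh 0-based index (a small optional tweak):

```{python}
# Rebuild a continuous RangeIndex for the combined frame
df_combined = pd.concat([df1, df2], ignore_index=True)
```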
- SQL-style merges:
```{python}
import pandas as pd

# Create the first DataFrame (left table)
left = pd.DataFrame({
    'key': ['K1', 'K2', 'K3', 'K4'],
    'Name': ['Amit', 'Bina', 'Chirag', 'Divya']
})

# Create the second DataFrame (right table)
right = pd.DataFrame({
    'key': ['K1', 'K2', 'K3', 'K5'],
    'Marks': [85, 90, 78, 88]
})

# Perform SQL-style inner join on the column 'key'
merged_df = pd.merge(left, right, on='key')

# Display the result
print(merged_df)
```

```
  key    Name  Marks
0  K1    Amit     85
1  K2    Bina     90
2  K3  Chirag     78
```
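By default merge performs an inner join, so K4 (only in left) and K5 (only in right) are dropped. Other SQL-style joins are selected with the how parameter; a brief sketch (outputs omitted):

```{python}
# Keep every row from the left frame; unmatched Marks become NaN
left_join = pd.merge(left, right, on='key', how='left')

# Keep all keys from both frames
outer_join = pd.merge(left, right, on='key', how='outer')
```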
File I/O
- Reading and writing to CSV:
```{python}
pd.read_csv('filename.csv')
df.to_csv('my_dataframe.csv')
```
- Reading and writing to Excel:
```{python}
pd.read_excel('filename.xlsx', sheet_name='Sheet1')
df.to_excel('my_dataframe.xlsx', sheet_name='Sheet1')
```
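A common round trip writes a frame out without its index and reads it back; 'my_dataframe.csv' is just the illustrative filename from above:

```{python}
# Write without the index column, then read the file back
df.to_csv('my_dataframe.csv', index=False)
df_loaded = pd.read_csv('my_dataframe.csv')
```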
Pandas is a foundational tool for data analysis in Python, offering comprehensive functions and methods for efficient data manipulation and analysis. Its robust handling of complex data operations makes it indispensable for data scientists and analysts working in Python.
To load CSV data in Python:
{python} pd.read_csv("file path or file name") `
`{python} import pandas as pd # load data data1 = pd.read_csv(“customers-100.csv”)
`