Working with data using Pandas
Table of contents
- Adding and Removing Rows in a DataFrame
- Handling NaN Values in DataFrames
- Replace data
- Using bins
- Exercises
Adding and Removing Rows in a DataFrame
Manipulating rows in a DataFrame is a common task when working with data. Pandas provides various methods to add and remove rows efficiently. In this section, we’ll explore how to perform these operations.
Adding Rows
Using .loc[]
You can add a row to a DataFrame using the .loc[] indexer. To add a new row, specify the index label and assign the values using a dictionary.
# Syntax
df.loc[index_label] = {'column1': value1, 'column2': value2, ...}
Example:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}
data_df = pd.DataFrame(data)
# Add a new row with index 'Diana'
data_df.loc['Diana'] = {'Name': 'Diana', 'Age': 28}
data_df
---
Name Age
0 Alice 25
1 Bob 30
2 Charlie 22
Diana Diana 28
Using .append()
You can also use the .append() method to add a row to a DataFrame: create a new row as a dictionary and then append it to the original DataFrame. Note that .append() was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions, use pd.concat() instead (see the sketch after this example).
# Syntax
df = df.append(new_row, ignore_index=True)
Example:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22]}
data_df = pd.DataFrame(data)
# Create a new row
new_row = {'Name': 'Diana', 'Age': 28}
# Add the new row to the DataFrame
data_df = data_df.append(new_row, ignore_index=True)
data_df
---
Name Age
0 Alice 25
1 Bob 30
2 Charlie 22
3 Diana 28
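On pandas 2.0 and later, where .append() no longer exists, pd.concat() produces the same result. A minimal sketch, reusing the same data:
# Wrap the new row in a one-row DataFrame, then concatenate
data_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                        'Age': [25, 30, 22]})
new_row = pd.DataFrame([{'Name': 'Diana', 'Age': 28}])
data_df = pd.concat([data_df, new_row], ignore_index=True)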
Removing Rows using .drop()
You can remove rows by using the .drop() method and specifying the index label or row index to be dropped. Set inplace=True to modify the DataFrame in place.
# Syntax
df.drop(index_label, inplace=True)
Example:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22]}
data_df = pd.DataFrame(data)
# Remove the row with index label 1 (Bob)
data_df.drop(1, inplace=True)
data_df
---
Name Age
0 Alice 25
2 Charlie 22
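.drop() also accepts a list of labels, so several rows can be removed in one call. A small sketch on the same example data:
# Drop Alice (index 0) and Charlie (index 2) at once; only Bob remains
data_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                        'Age': [25, 30, 22]})
data_df = data_df.drop([0, 2])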
Handling NaN Values in DataFrames
NaN (Not a Number) values are commonly encountered in real-world data. They represent missing or unavailable data in a DataFrame. Properly managing these values is essential to ensure the accuracy and reliability of your data analysis. Pandas offers various methods to deal with NaN values in a DataFrame.
Detecting NaN Values
Before handling NaN values, it’s important to detect them. Here are some methods to identify NaN values in a DataFrame:
isna() and isnull()
These methods return a boolean mask indicating the locations where NaN values are present. You can use them as follows:
# Create a boolean mask to detect NaNs
mask = df.isna()
# Or, equivalently
mask = df.isnull()
Example:
import numpy as np
data = {'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8]}
data_df = pd.DataFrame(data)
# Detect NaN values
mask = data_df.isna()
mask
---
A B
0 False False
1 False True
2 True True
3 False False
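Combining isna() with sum() gives a quick count of missing values per column, which is often more practical than inspecting the full boolean mask:
# Count NaN values per column (each True counts as 1)
data_df.isna().sum()
---
A    1
B    2
dtype: int64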
info()
The info() method displays information about the DataFrame, including the number of non-null values and the data type of each column. NaN values are implicitly identified as missing data.
Example:
# Display information about the DataFrame
data_df.info()
---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null float64
1 B 2 non-null float64
dtypes: float64(2)
memory usage: 192.0 bytes
Handling NaN Values
Once you’ve detected NaN values, you can decide how to handle them based on your specific needs.
Removing Rows or Columns
dropna(): This method allows you to remove rows or columns containing at least one NaN value.
Remove rows containing NaNs:
data_df.dropna(axis=0)
---
A B
0 1.0 5.0
3 4.0 8.0
Remove columns containing NaNs:
data_df.dropna(axis=1)
---
0
1
2
3
It’s empty because every column contains at least one NaN.
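dropna() also accepts the how and thresh parameters to make the rule less aggressive. For instance, how='all' only drops a row or column when all of its values are NaN, and thresh keeps rows with at least a given number of non-NaN values:
# Drop columns only if every value is NaN (here, nothing is dropped)
data_df.dropna(axis=1, how='all')
# Keep only the rows that have at least 2 non-NaN values (rows 0 and 3)
data_df.dropna(thresh=2)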
Replacing NaN Values
fillna(): You can use fillna() to replace NaN values with a specific value. For instance, to replace all NaN values with zero:
data_df.fillna(0)
---
A B
0 1.0 5.0
1 2.0 0.0
2 0.0 0.0
3 4.0 8.0
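fillna() also accepts a dictionary mapping column names to fill values, and ffill() propagates the last valid observation forward; both are common alternatives to a single constant:
# Fill each column with its own value (here, the mean of column B is 6.5)
data_df.fillna({'A': 0, 'B': data_df['B'].mean()})
# Propagate the previous valid value forward along each column
data_df.ffill()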
Interpolation
interpolate(): Interpolation is used to replace NaN values with values calculated from adjacent values. This can be useful for sequential data like time series.
Linear interpolation to replace NaNs:
data_df.interpolate(method='linear')
---
A B
0 1.0 5.0
1 2.0 6.0
2 3.0 7.0
3 4.0 8.0
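For data with a datetime index, method='time' weights the interpolation by the actual time gaps instead of the row positions. A minimal sketch with made-up dates:
import numpy as np
import pandas as pd

ts = pd.Series([1.0, np.nan, 3.0],
               index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04']))
# Jan 2 is one third of the way from Jan 1 to Jan 4, so the NaN becomes ~1.67
ts.interpolate(method='time')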
Handling NaN Values in Context
The choice of NaN value handling method will depend on the context and the impact on your analyses. It’s essential to understand why data is missing and choose an appropriate approach to avoid biasing your results.
Replace data
Have you noticed that the column MainCategory in df has no real meaning? The actual meaning is in another file, MainCategoryCatalogue.csv, in 4 different languages. We already loaded that file into the df_MainCategory DataFrame. We would like to replace the alias values with the real ones.
This code can do the trick:
df['MainCategory'] = df_MainCategory.loc[df['MainCategory']]['en'].values
df.head(60)
The .loc indexer in Pandas allows you to access data in a DataFrame using labels or boolean indexing. In this case, it is used to map values from the MainCategory column of df to the corresponding values in the en column of df_MainCategory, based on the index.
.values extracts the raw values from the resulting Series so they can be assigned back to the MainCategory column of df.
As a result, the MainCategory column in df is updated with values from the en column of df_MainCategory, based on the index-label mappings.
This approach is particularly useful when you have a reference DataFrame with mapping information, and you want to replace values in another DataFrame based on that mapping. It can be an efficient way to transform and update data in Pandas.
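An equivalent, arguably more idiomatic way to perform this lookup is Series.map(), which matches each value of a column against the index of a mapping Series. A self-contained sketch with made-up data mimicking the structure described above (the aliases and labels are invented for illustration):
import pandas as pd

# Hypothetical catalogue: alias -> English label
df_MainCategory = pd.DataFrame({'en': ['Hydroelectric power', 'Photovoltaic']},
                               index=['maincat_1', 'maincat_2'])
df = pd.DataFrame({'MainCategory': ['maincat_1', 'maincat_2', 'maincat_1']})

# map() looks up each alias in the catalogue's index
df['MainCategory'] = df['MainCategory'].map(df_MainCategory['en'])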
Using bins
Another powerful categorizing tool is the function cut(). It allows you to create discrete categories from continuous data, which is useful for data analysis, visualization, and aggregation. By using cut(), you can easily segment data into specific intervals and assign labels to them, making your data easier to understand and manipulate. Whether you need to create age groups, value ranges, or other custom categories, cut() is an essential tool for data processing with Pandas.
For example, I would like to create year intervals for the year of BeginningOfOperation in df.
The first thing to do is to convert the column into a date format, so we can use the powerful datetime functions of pandas, and then extract the year into a new column:
df['BeginningOfOperation']=pd.to_datetime(df['BeginningOfOperation'])
df['YearOfOperation'] = df['BeginningOfOperation'].dt.year
I can now create my intervals (called bins) and label them.
bins = [0, 2000, 2005, 2010, 2015, 2023]
labels = ['Before 2000', '2000-2005', '2005-2010', '2010-2015', '2015-2023']
Then I call the function cut() to assign the intervals, and I create a new column called Period:
df['Period']=pd.cut(df['YearOfOperation'], bins=bins, labels=labels)
Note that by default the intervals are closed on the right: the bin (2000, 2005] includes 2005 but not 2000, so the year 2000 itself falls into the 'Before 2000' bin.
I can now use groupby() as before to group similar values:
df.groupby('Period')['TotalPower'].sum()
---
Period
Before 2000 18509498.94
2000-2005 150836.88
2005-2010 553208.85
2010-2015 1874547.10
2015-2023 3244985.45
Name: TotalPower, dtype: float64
Exercises
- Replace the columns SubCategory and PlantCategory with their exact values (from SubCategoryCatalogue.csv and PlantCategoryCatalogue.csv respectively).
- You might have some problems with PlantCategory containing NaN values.
- You need to replace the NaN values with Unknown.
- Then create a new row in the DataFrame containing the PlantCategory catalogue; its index should be Unknown and at least the value for the column en should be filled.