Working with data using Pandas

Table of contents

  1. Adding and Removing Rows in a DataFrame
    1. Adding Rows
      1. Using .loc[]
      2. Using .append()
    2. Removing Rows using .drop()
  2. Handling NaN Values in DataFrames
    1. Detecting NaN Values
      1. isna() and isnull()
      2. info()
    2. Handling NaN Values
      1. Removing Rows or Columns
      2. Replacing NaN Values
      3. Interpolation
    3. Handling NaN Values in Context
  3. Replace data
  4. Using bins
  5. Exercises

Adding and Removing Rows in a DataFrame

Manipulating rows in a DataFrame is a common task when working with data. Pandas provides various methods to add and remove rows efficiently. In this section, we’ll explore how to perform these operations.

Adding Rows

Using .loc[]

You can add a row to a DataFrame using the .loc[] indexer. To add a new row, specify the index and assign the values using a dictionary.

# Syntax: df.loc[index_label] = {'column1': value1, 'column2': value2, ...}
df.loc[index_label] = {'column1': value1, 'column2': value2, ...}

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}

data_df = pd.DataFrame(data)

# Add a new row with index 'Diana'
data_df.loc['Diana'] = {'Name': 'Diana', 'Age': 28}
data_df
---


	Name 	Age
0 	Alice 	25
1 	Bob 	30
2 	Charlie 	22
Diana 	Diana 	28

Using .append()

You can also use the .append() method to add a row to a DataFrame: create the row you want to add and then append it to the original DataFrame. Be aware that .append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the following only works with older versions; pd.concat() is the modern replacement.

# Syntax: df = df.append(new_row, ignore_index=True)
df = df.append(new_row, ignore_index=True)

Example:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}

data_df = pd.DataFrame(data)

# Create a new row
new_row = {'Name': 'Diana', 'Age': 28}

# Add the new row to the DataFrame
data_df = data_df.append(new_row, ignore_index=True)
data_df
---

	Name 	Age
0 	Alice 	25
1 	Bob 	30
2 	Charlie 	22
3 	Diana 	28
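On pandas 2.0 and later, where .append() no longer exists, the same result is obtained with pd.concat() by wrapping the new row in a one-row DataFrame:

```python
import pandas as pd

data_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                        'Age': [25, 30, 22]})

# Wrap the new row in a one-row DataFrame, then concatenate.
# ignore_index=True rebuilds the index as 0, 1, 2, 3
new_row = pd.DataFrame([{'Name': 'Diana', 'Age': 28}])
data_df = pd.concat([data_df, new_row], ignore_index=True)
```

Note that ignore_index=True gives the new row the integer label 3, unlike the .loc['Diana'] approach above, which uses the label you specify.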

Removing Rows using .drop()

You can remove rows by using the .drop() method and specifying the index label or row index to be dropped. Set inplace=True to modify the DataFrame in place.

# Syntax: df.drop(index_label, inplace=True)
df.drop(index_label, inplace=True)

Example:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22]}

data_df = pd.DataFrame(data)

# Remove the row with index label 1 (Bob)
data_df.drop(1, inplace=True)
data_df
---

	Name 	Age
0 	Alice 	25
2 	Charlie 	22
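.drop() also accepts a list of labels, which is handy for removing several rows at once or for removing rows that match a condition (compute the matching labels first, then drop them). A small sketch on the same toy data:

```python
import pandas as pd

data_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                        'Age': [25, 30, 22]})

# Drop several rows at once by passing a list of index labels
trimmed = data_df.drop([0, 2])           # only Bob remains

# Drop rows matching a condition: select the labels first
under_28 = data_df[data_df['Age'] < 28].index
filtered = data_df.drop(under_28)        # only Bob (Age 30) remains
```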

Handling NaN Values in DataFrames

NaN (Not a Number) values are commonly encountered in real-world data. They represent missing or unavailable data in a DataFrame. Properly managing these values is essential to ensure the accuracy and reliability of your data analysis. Pandas offers various methods to deal with NaN values in a DataFrame.

Detecting NaN Values

Before handling NaN values, it’s important to detect them. Here are some methods to identify NaN values in a DataFrame:

isna() and isnull()

These methods return a boolean mask indicating the locations where NaN values are present. You can use them as follows:

# Create a boolean mask to detect NaNs
mask = df.isna()
# Or, equivalently
mask = df.isnull()

Example:

import numpy as np

data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, np.nan, 8]}
data_df = pd.DataFrame(data)

# Detect NaN values
mask = data_df.isna()
mask
---

	A 	B
0 	False 	False
1 	False 	True
2 	True 	True
3 	False 	False
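Because the mask is boolean, a common trick is to sum it: True counts as 1, so .isna().sum() gives the number of NaNs per column, and a second .sum() gives the total for the whole DataFrame:

```python
import pandas as pd
import numpy as np

data_df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                        'B': [5, np.nan, np.nan, 8]})

# Column-wise NaN counts: True counts as 1 when summed
nan_counts = data_df.isna().sum()

# Total number of NaNs in the whole DataFrame
total = data_df.isna().sum().sum()
```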

info()

The info() method displays information about the DataFrame, including the number of non-null values and the data type for each column. NaN values show up implicitly as the gap between the total number of entries and a column’s non-null count.

Example:

# Display information about the DataFrame
data_df.info()
---

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       3 non-null      float64
 1   B       2 non-null      float64
dtypes: float64(2)
memory usage: 192.0 bytes

Handling NaN Values

Once you’ve detected NaN values, you can decide how to handle them based on your specific needs.

Removing Rows or Columns

  • dropna(): This method allows you to remove rows or columns containing at least one NaN value. For example, to remove rows containing NaNs:

Remove rows containing NaNs:

data_df.dropna(axis=0)
---

	A 	B
0 	1.0 	5.0
3 	4.0 	8.0

Remove columns containing NaNs:

data_df.dropna(axis=1)
---

0
1
2
3

It’s empty because every column contains at least one NaN.

Replacing NaN Values

  • fillna(): You can use fillna() to replace NaN values with a specific value. For instance, to replace all NaN values with zero:

Replace NaNs with 0:

data_df.fillna(0)
---

	A 	B
0 	1.0 	5.0
1 	2.0 	0.0
2 	0.0 	0.0
3 	4.0 	8.0
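fillna() is not limited to a single constant: passing a dict gives each column its own replacement value (a common choice is the column mean), and .ffill() propagates the last valid value forward. A short sketch on the same data:

```python
import pandas as pd
import numpy as np

data_df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                        'B': [5, np.nan, np.nan, 8]})

# A dict lets each column get its own replacement value
per_column = data_df.fillna({'A': 0, 'B': data_df['B'].mean()})

# Forward-fill propagates the last valid value downwards
forward = data_df.ffill()
```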

Interpolation

  • interpolate(): Interpolation is used to replace NaN values with values calculated based on adjacent values. This can be useful for sequential data like time series.

Linear interpolation to replace NaNs:

data_df.interpolate(method='linear')
---

	A 	B
0 	1.0 	5.0
1 	2.0 	6.0
2 	3.0 	7.0
3 	4.0 	8.0

Handling NaN Values in Context

The choice of NaN value handling method will depend on the context and the impact on your analyses. It’s essential to understand why data is missing and choose an appropriate approach to avoid biasing your results.

Replace data

Have you noticed that the MainCategory column in df has no real meaning? The actual labels live in another file, MainCategoryCatalogue.csv, in 4 different languages. We already loaded that file into the df_MainCategory DataFrame. We would like to replace each alias value with the real one.

This code can do the trick:

df['MainCategory'] = df_MainCategory.loc[df['MainCategory']]['en'].values
df.head(60)

The .loc indexer in Pandas allows you to access rows of a DataFrame by label (or by boolean indexing). Here it selects, for each alias in the MainCategory column of df, the matching row of df_MainCategory, then takes the en column of that selection.

The .values is used to extract the values from the Series and assign them back to the MainCategory column of df.

As a result, the MainCategory column in df is updated with values from the “en” column of df_MainCategory based on the index-label mappings.

This approach is particularly useful when you have a reference DataFrame with mapping information, and you want to replace values in another DataFrame based on that mapping. It can be an efficient way to transform and update data in Pandas.
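The idea can be reproduced on toy data (the alias and label values below are made up, standing in for the course’s MainCategoryCatalogue.csv):

```python
import pandas as pd

# Hypothetical stand-ins for the course DataFrames
df = pd.DataFrame({'MainCategory': ['maincat_1', 'maincat_2', 'maincat_1']})
df_MainCategory = pd.DataFrame(
    {'en': ['Hydroelectric power', 'Photovoltaic']},
    index=['maincat_1', 'maincat_2'])

# .loc picks one catalogue row per alias, in the order they appear
# in df; .values drops the catalogue index before assignment
df['MainCategory'] = df_MainCategory.loc[df['MainCategory']]['en'].values
```

An equivalent, NaN-tolerant alternative is df['MainCategory'].map(df_MainCategory['en']), which yields NaN for aliases missing from the catalogue instead of raising a KeyError.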

Using bins

Another powerful categorizing tool is the function cut(). It allows you to create discrete categories from continuous data, which is useful for data analysis, visualization, and aggregation. By using cut(), you can easily segment data into specific intervals and assign labels to them, making it easier to understand and manipulate your data. Whether you need to create age groups, value ranges, or other custom categories, cut() is an essential tool for data processing with Pandas.

For example, I would like to create year intervals for the year of BeginningOfOperation in df.

The first thing to do is to convert the column into a date format, so we can use the powerful time functions of pandas, and then extract the year into a new column:

df['BeginningOfOperation']=pd.to_datetime(df['BeginningOfOperation'])
df['YearOfOperation'] = df['BeginningOfOperation'].dt.year

I can now create my intervals (called bins) and label them.

bins = [0, 2000, 2005, 2010, 2015, 2023]  
labels = ['Before 2000', '2000-2005', '2005-2010', '2010-2015', '2015-2023']

Then I call the function .cut() to assign the intervals, and I create a new column called Period:

df['Period']=pd.cut(df['YearOfOperation'], bins=bins, labels=labels)

I can now use groupby() as before to group similar values:

df.groupby('Period')['TotalPower'].sum()
---

Period
Before 2000    18509498.94
2000-2005        150836.88
2005-2010        553208.85
2010-2015       1874547.10
2015-2023       3244985.45
Name: TotalPower, dtype: float64
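One subtlety worth knowing: by default, cut() builds right-closed intervals (right=True), so the bins above are (0, 2000], (2000, 2005], and so on. A plant from the year 2000 therefore lands in 'Before 2000', not '2000-2005'. A small sketch on a few sample years:

```python
import pandas as pd

years = pd.Series([1999, 2000, 2001, 2015, 2016])
bins = [0, 2000, 2005, 2010, 2015, 2023]
labels = ['Before 2000', '2000-2005', '2005-2010', '2010-2015', '2015-2023']

# Intervals are right-closed by default: (0, 2000], (2000, 2005], ...
periods = pd.cut(years, bins=bins, labels=labels)
```

Passing right=False would flip the edges to [2000, 2005), which may match the intent of the labels better.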

Exercises

  • Replace the columns SubCategory and PlantCategory with their exact values (from SubCategoryCatalogue.csv and PlantCategoryCatalogue.csv respectively).
  • You might have some problems with PlantCategory containing NaN values.
  • You need to replace the NaN values by Unknown.
  • And create a new row in the DataFrame containing PlantCategory; the index should be Unknown and at least the value for the column en should be filled.