Introduction to Pandas

Table of contents

  1. Download the data for this course
  2. Loading data
    1. .info()
    2. .describe()
  3. Pandas Objects
    1. DataFrame
    2. Series
    3. Index
    4. Using Objects

Download the data for this course

For this course we will use the Electricity production plants dataset from the Swiss Confederation.

You can download the data here: https://opendata.swiss/en/dataset/elektrizitatsproduktionsanlagen. Download the CSV resources.

  1. Create a folder for storing the files of this course
  2. Extract the zip file into the directory
  3. You should now have several CSV files:
    ElectricityProductionPlant.csv
    MainCategoryCatalogue.csv
    PlantCategoryCatalogue.csv
    PlantDetail.csv
    SubCategoryCatalogue.csv
    
  4. Using JupyterLab, open ElectricityProductionPlant.csv to inspect its contents
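To confirm that the extraction worked, the folder contents can be checked from Python. This is a minimal sketch: the folder name `pandas_course` is hypothetical, so replace it with the folder you created.

```python
from pathlib import Path

# Hypothetical folder name; replace it with the folder you created in step 1.
data_dir = Path("pandas_course")

# File names as listed in step 3.
expected_files = [
    "ElectricityProductionPlant.csv",
    "MainCategoryCatalogue.csv",
    "PlantCategoryCatalogue.csv",
    "PlantDetail.csv",
    "SubCategoryCatalogue.csv",
]

# Collect the names of the expected files that are not present in the folder.
missing = [name for name in expected_files if not (data_dir / name).is_file()]
print("Missing files:", missing if missing else "none")
```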

Loading data

You can press the TAB key to use autocompletion in your notebook.

We will now load the data using Pandas.

The first thing to do is to import Pandas:

import pandas as pd

To display all the columns of our dataset, we need to change the default display options:

pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping
pd.set_option('display.max_rows', 30)  # Show at most 30 rows (the default is 60)
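Display options can also be read back, or changed only temporarily. The sketch below uses `pd.get_option` and `pd.option_context`; the value 5 is an arbitrary choice for illustration.

```python
import pandas as pd

# Read the current value of a display option.
previous = pd.get_option("display.max_rows")

# Change the option only inside the "with" block.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")

# Outside the block, the previous value is restored automatically.
after = pd.get_option("display.max_rows")
print(previous, inside, after)
```

This is convenient when a single cell needs a different display setting from the rest of the notebook.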

Then we load ElectricityProductionPlant.csv into df:

df = pd.read_csv('ElectricityProductionPlant.csv')
df

The table should appear as output.
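For a large table it is often more practical to look at its size and first rows than at the whole output. The sketch below reads a tiny in-memory CSV instead of the real file so that it runs on its own; the three rows are made-up values, and only the column names come from the dataset.

```python
import io

import pandas as pd

# Made-up sample rows; only the column names match the real dataset.
csv_text = """xtf_id,Canton,TotalPower
1,VD,6.5
2,BE,10.2
3,ZH,17.0
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # first two rows only
```

With the real data, `df.shape` and `df.head()` work the same way.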

.info()

Run df.info()

df.info()
---

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182404 entries, 0 to 182403
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   xtf_id                182404 non-null  int64  
 1   Address               182404 non-null  object 
 2   PostCode              182404 non-null  int64  
 3   Municipality          182404 non-null  object 
 4   Canton                182404 non-null  object 
 5   BeginningOfOperation  182404 non-null  object 
 6   InitialPower          182404 non-null  float64
 7   TotalPower            182404 non-null  float64
 8   MainCategory          182404 non-null  object 
 9   SubCategory           182404 non-null  object 
 10  PlantCategory         177580 non-null  object 
 11  _x                    177483 non-null  float64
 12  _y                    177483 non-null  float64
dtypes: float64(4), int64(2), object(7)
memory usage: 18.1+ MB

This command displays the number of rows in the table (here 182404) and the number of columns (here 13).

For each column, it lists the name, the number of non-null values, and the data type.

At the end, a summary of the data types and the memory usage is displayed.
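The non-null counts reported by `.info()` can also be computed directly, which is useful for finding columns with missing values. A minimal sketch on a toy DataFrame (the values are made up; the column names are borrowed from the dataset):

```python
import pandas as pd

# Toy data with some missing values (None becomes a missing value).
toy = pd.DataFrame({
    "xtf_id": [1, 2, 3, 4],
    "PlantCategory": ["a", None, "b", None],
    "_x": [2600000.0, None, 2650000.0, 2700000.0],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()

# Number of non-null values per column, as shown by .info().
non_null_per_column = toy.count()

print(missing_per_column)
print(non_null_per_column)
print(toy.dtypes)  # data type of each column
```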

.describe()

Run df.describe():

df.describe()
---

       xtf_id         PostCode       InitialPower  TotalPower    _x            _y
count  182404.000000  182404.000000  1.824040e+05  1.824040e+05  1.774830e+05  1.774830e+05
mean   148642.500291  5038.845722    1.321199e+02  1.334021e+02  2.638471e+06  1.205438e+06
std    66446.581318   2836.307447    8.324163e+03  8.324198e+03  7.252629e+04  5.045017e+04
min    5646.000000    1000.000000    2.000000e-01  2.000000e-01  2.486221e+06  1.075930e+06
25%    97921.500000   2043.000000    6.600000e+00  6.720000e+00  2.577604e+06  1.169977e+06
50%    150343.500000  4702.000000    1.013500e+01  1.029000e+01  2.634925e+06  1.215729e+06
75%    203753.250000  8136.000000    1.680000e+01  1.722000e+01  2.701701e+06  1.249248e+06
max    262989.000000  9658.000000    1.872000e+06  1.872000e+06  2.831207e+06  1.294815e+06

This displays basic summary statistics (only for columns with numerical data).

For example, for the total power (TotalPower column) the mean is about 133.4 and the minimum is 0.2. Check the unit on the dataset website; the values here are consistent with kW, which would give a mean of about 133 kW and a minimum of 0.2 kW.
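Each row of the `.describe()` output corresponds to a method that can also be called directly on a column. A minimal sketch on made-up values (only the column name comes from the dataset):

```python
import pandas as pd

# Made-up power values for illustration.
toy = pd.DataFrame({"TotalPower": [0.2, 6.6, 10.0, 17.2]})

stats = toy.describe()

# Individual statistics computed directly on the column.
mean_power = toy["TotalPower"].mean()
min_power = toy["TotalPower"].min()
median_power = toy["TotalPower"].median()  # same as the "50%" row

print(stats)
print(mean_power, min_power, median_power)
```

Individual cells of the summary can be read with `.loc`, for example `stats.loc["mean", "TotalPower"]`.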

Pandas Objects

Pandas introduces several fundamental data structures, each designed for specific use cases. Understanding these Pandas objects is essential for effectively working with structured data.

In this section, we’ll explore the primary Pandas objects (DataFrame, Series, and Index):

DataFrame

  • Description: The DataFrame is the most commonly used Pandas object. It represents a two-dimensional, tabular data structure, similar to a spreadsheet or a SQL table. DataFrames consist of rows and columns, where each column can have a different data type.

  • Use Cases: DataFrames are ideal for storing and manipulating structured data, such as datasets from CSV files, SQL databases, or Excel spreadsheets. They provide a convenient way to perform operations like filtering, aggregation, and transformation.

  • Creation: DataFrames can be created from various data sources, including dictionaries, lists of dictionaries, NumPy arrays, and external data files.

  • Example:

The variable df we created is a Pandas DataFrame

type(df)
---

pandas.core.frame.DataFrame
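A DataFrame can also be built by hand, as mentioned under "Creation". The sketch below builds the same small table from a dictionary of columns and from a list of row dictionaries; the values are made up for illustration.

```python
import pandas as pd

# From a dictionary: one key per column, one list of values per column.
from_dict = pd.DataFrame({
    "Canton": ["VD", "BE"],
    "TotalPower": [6.5, 10.2],
})

# From a list of dictionaries: one dictionary per row.
from_records = pd.DataFrame([
    {"Canton": "VD", "TotalPower": 6.5},
    {"Canton": "BE", "TotalPower": 10.2},
])

print(from_dict)
print(from_dict.equals(from_records))  # compares the two tables
```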

Series

  • Description: A Series is a one-dimensional array-like object in Pandas. It consists of data values and an associated index, which can be used to label and access the data.

  • Use Cases: Series are commonly used to represent a single column of data or a single variable. They can be thought of as the building blocks of a DataFrame.

  • Creation: Series can be created from various data sources, including lists, NumPy arrays, and dictionaries.

  • Example:

Selecting one column of df returns a Series:

df['TotalPower']
type(df['TotalPower'])
---

pandas.core.series.Series
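A Series can also be created directly, and the way a column is selected determines whether the result is a Series or a DataFrame. A minimal sketch with made-up values:

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
power = pd.Series([6.5, 10.2, 17.0], name="TotalPower")

# From a dictionary: the keys become the index labels.
by_canton = pd.Series({"VD": 6.5, "BE": 10.2})

print(power.iloc[0])    # access by position
print(by_canton["BE"])  # access by label

# Single brackets return a Series, double brackets return a DataFrame.
frame = pd.DataFrame({"TotalPower": [6.5, 10.2], "Canton": ["VD", "BE"]})
one_column_series = frame["TotalPower"]
one_column_frame = frame[["TotalPower"]]
print(type(one_column_series), type(one_column_frame))
```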

Index

  • Description: An Index is an immutable sequence used to label data in Pandas objects like Series and DataFrame. Each Index object contains metadata about the data it labels.

  • Use Cases: Indexes are used for efficient data retrieval, selection, and alignment. They provide a way to uniquely identify and access data elements.

  • Creation: Indexes are typically created automatically when you create a Series or DataFrame. You can also create custom indexes if needed.

  • Example:

We can display the index of df:

df.index
---

RangeIndex(start=0, stop=182404, step=1)

This can be read as one label per row, starting at 0 and stopping before 182404 (so the last label is 182403), with a step of 1.
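The sketch below illustrates the default index, a custom index, and immutability on a toy DataFrame. The values are made up, and using `xtf_id` as the index is only an assumption for illustration, not something the course requires.

```python
import pandas as pd

toy = pd.DataFrame({
    "xtf_id": [5646, 5647, 5648],
    "TotalPower": [6.5, 10.2, 17.0],
})

# The default index is a RangeIndex; its stop value is excluded.
print(toy.index)
last_label = toy.index[-1]

# Use an existing column as a custom index, then look up a row by its label.
by_id = toy.set_index("xtf_id")
power_5647 = by_id.loc[5647, "TotalPower"]
print(power_5647)

# An Index is immutable: assigning to one of its elements raises TypeError.
immutable = False
try:
    by_id.index[0] = 1
except TypeError:
    immutable = True
print("Index is immutable:", immutable)
```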

Using Objects

In practice, DataFrames and Series are the primary Pandas objects you’ll work with for data analysis and manipulation.

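The Pandas documentation is organized by object type, because some operations only exist for certain types of object.

A Series can always be converted into a DataFrame with .to_frame(). The opposite is not possible in general: a DataFrame with several columns cannot be represented as a single Series, although selecting one of its columns returns a Series.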
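The conversion between the two objects can be sketched as follows, with made-up values. Going back from a DataFrame to a Series is only meaningful for a single column.

```python
import pandas as pd

series = pd.Series([6.5, 10.2], name="TotalPower")

# Series -> DataFrame: the Series name becomes the column name.
frame = series.to_frame()
print(type(frame), list(frame.columns))

# One-column DataFrame -> Series, either by selecting the column...
selected = frame["TotalPower"]

# ...or by squeezing the column axis.
squeezed = frame.squeeze("columns")
print(type(selected), type(squeezed))
```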