Introduction to Pandas
Download the data for this course
For this course we will use the Electricity production plants dataset from the Swiss Confederation.
You can download the data here: https://opendata.swiss/en/dataset/elektrizitatsproduktionsanlagen. Download the CSV resources.
- Create a folder for storing the files of this course
- Extract the zip file into the directory
- You should now have the following CSV files: ElectricityProductionPlant.csv, MainCategoryCatalogue.csv, PlantCategoryCatalogue.csv, PlantDetail.csv, SubCategoryCatalogue.csv
- Using JupyterLab, open ElectricityProductionPlant.csv to view its contents
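As a quick sanity check, the steps above can be followed by a small script that reports which of the expected files are missing. This is only a sketch: the helper name `missing_files` is hypothetical, and `"."` stands for whatever folder you created for the course.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "ElectricityProductionPlant.csv",
    "MainCategoryCatalogue.csv",
    "PlantCategoryCatalogue.csv",
    "PlantDetail.csv",
    "SubCategoryCatalogue.csv",
]


def missing_files(folder="."):
    """Return the expected CSV files that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Hypothetical usage: "." stands for the course folder.
print(missing_files("."))
```

An empty list means all five files were extracted into the folder.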
Loading data
You can press the TAB key to trigger autocompletion in your notebook.
We will now load the data using Pandas.
The first step is to import Pandas:
import pandas as pd
To display all the columns of our dataset, we need to change the default display options:
pd.set_option('display.max_columns', None) # Show all columns
pd.set_option('display.expand_frame_repr', False) # Prevent line wrapping
pd.set_option('display.max_rows', 30) # Show at most 30 rows
Then we load ElectricityProductionPlant.csv into df:
df = pd.read_csv('ElectricityProductionPlant.csv')
df
The table should appear as output.
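If you want to try `pd.read_csv` before the download is ready, the following sketch reads a tiny in-memory CSV instead of the real file. The three rows are hypothetical; only the column names come from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample; only the column names come from the dataset.
sample_csv = io.StringIO(
    "xtf_id,Canton,TotalPower\n"
    "1,VD,6.5\n"
    "2,BE,10.0\n"
    "3,ZH,\n"
)

sample = pd.read_csv(sample_csv)

print(sample.shape)    # (rows, columns)
print(sample.head(2))  # first two rows only
```

The empty field in the last row is read as a missing value (NaN), which is how gaps in the real file will appear as well.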
.info()
Run df.info()
df.info()
---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182404 entries, 0 to 182403
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 xtf_id 182404 non-null int64
1 Address 182404 non-null object
2 PostCode 182404 non-null int64
3 Municipality 182404 non-null object
4 Canton 182404 non-null object
5 BeginningOfOperation 182404 non-null object
6 InitialPower 182404 non-null float64
7 TotalPower 182404 non-null float64
8 MainCategory 182404 non-null object
9 SubCategory 182404 non-null object
10 PlantCategory 177580 non-null object
11 _x 177483 non-null float64
12 _y 177483 non-null float64
dtypes: float64(4), int64(2), object(7)
memory usage: 18.1+ MB
This command displays the number of rows in the table (here 182404) and the number of columns.
For each column, it shows the name, the number of non-null elements, and the data type.
At the end, a summary of the data types and the memory usage is displayed.
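Each piece of information reported by `.info()` can also be obtained separately. The sketch below uses a small hypothetical DataFrame (not the real dataset) with one missing value, so it runs on its own.

```python
import pandas as pd

# Hypothetical sample with one missing value, not the real dataset.
sample = pd.DataFrame(
    {
        "Canton": ["VD", "BE", "ZH"],
        "TotalPower": [6.5, 10.0, 250.0],
        "PlantCategory": ["plant_a", None, "plant_b"],
    }
)

n_rows, n_cols = sample.shape                  # number of rows and columns
non_null = sample.notna().sum()                # non-null count per column
dtypes = sample.dtypes                         # data type per column
memory = sample.memory_usage(deep=True).sum()  # total bytes, including strings

print(n_rows, n_cols)
print(non_null)
print(dtypes)
print(memory)
```

Comparing the non-null count with the number of rows is a quick way to spot columns with missing data, such as PlantCategory, _x and _y in the output above.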
.describe()
Run df.describe()
df.describe()
---
xtf_id PostCode InitialPower TotalPower _x _y
count 182404.000000 182404.000000 1.824040e+05 1.824040e+05 1.774830e+05 1.774830e+05
mean 148642.500291 5038.845722 1.321199e+02 1.334021e+02 2.638471e+06 1.205438e+06
std 66446.581318 2836.307447 8.324163e+03 8.324198e+03 7.252629e+04 5.045017e+04
min 5646.000000 1000.000000 2.000000e-01 2.000000e-01 2.486221e+06 1.075930e+06
25% 97921.500000 2043.000000 6.600000e+00 6.720000e+00 2.577604e+06 1.169977e+06
50% 150343.500000 4702.000000 1.013500e+01 1.029000e+01 2.634925e+06 1.215729e+06
75% 203753.250000 8136.000000 1.680000e+01 1.722000e+01 2.701701e+06 1.249248e+06
max 262989.000000 9658.000000 1.872000e+06 1.872000e+06 2.831207e+06 1.294815e+06
This displays basic summary statistics (only for columns with numerical data).
For example, for the total power (TotalPower column), the mean is about 133.4 and the minimum is 0.2. The unit is given on the dataset website; assuming kW, this is a mean of about 133 kW and a minimum of 0.2 kW (200 W).
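Individual statistics can be read from the result of `.describe()` or computed directly on a column. The sketch below uses four hypothetical values, chosen only to make the arithmetic easy to check by hand.

```python
import pandas as pd

# Hypothetical values, chosen only to make the arithmetic easy to follow.
sample = pd.DataFrame({"TotalPower": [2.0, 4.0, 6.0, 8.0]})

stats = sample.describe()

mean_power = stats.loc["mean", "TotalPower"]   # same as sample["TotalPower"].mean()
median_power = stats.loc["50%", "TotalPower"]  # same as sample["TotalPower"].median()
min_power = sample["TotalPower"].min()

print(mean_power, median_power, min_power)

# Scientific notation as used in the output above: 1.334021e+02 is 133.4021.
print(float("1.334021e+02"))
```

Note that the 50% row is the median, which is often more representative than the mean when a few very large values are present.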
Pandas Objects
Pandas introduces several fundamental data structures, each designed for specific use cases. Understanding these Pandas objects is essential for effectively working with structured data.
In this section, we’ll explore the primary Pandas objects (DataFrame, Series, and Index):
DataFrame
- Description: The DataFrame is the most commonly used Pandas object. It represents a two-dimensional, tabular data structure, similar to a spreadsheet or a SQL table. DataFrames consist of rows and columns, where each column can have a different data type.
- Use Cases: DataFrames are ideal for storing and manipulating structured data, such as datasets from CSV files, SQL databases, or Excel spreadsheets. They provide a convenient way to perform operations like filtering, aggregation, and transformation.
- Creation: DataFrames can be created from various data sources, including dictionaries, lists of dictionaries, NumPy arrays, and external data files.
- Example: The variable df we created is a Pandas DataFrame.
type(df)
---
pandas.core.frame.DataFrame
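Besides loading a file, a DataFrame can be built directly in Python. The sketch below creates the same small table from a dictionary of columns and from a list of row dictionaries; the values are hypothetical and only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset.
from_dict = pd.DataFrame(
    {"Canton": ["VD", "BE"], "TotalPower": [6.5, 10.0]}
)

from_records = pd.DataFrame(
    [
        {"Canton": "VD", "TotalPower": 6.5},
        {"Canton": "BE", "TotalPower": 10.0},
    ]
)

print(from_dict.equals(from_records))  # both constructions give the same table
print(from_dict.dtypes)                # each column keeps its own data type
```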
Series
- Description: A Series is a one-dimensional array-like object in Pandas. It consists of data values and an associated index, which can be used to label and access the data.
- Use Cases: Series are commonly used to represent a single column of data or a single variable. They can be thought of as the building blocks of a DataFrame.
- Creation: Series can be created from various data sources, including lists, NumPy arrays, and dictionaries.
- Example: Selecting one column of df.
df['TotalPower']
type(df['TotalPower'])
---
pandas.core.series.Series
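A Series can also be created on its own. The following sketch, with hypothetical values, builds one from a list and one from a dictionary, then checks that a column selected from a DataFrame is a Series as well.

```python
import pandas as pd

# Hypothetical values for illustration.
from_list = pd.Series([6.5, 10.0, 250.0], name="TotalPower")
from_dict = pd.Series({"VD": 6.5, "BE": 10.0})  # dict keys become the index

frame = pd.DataFrame({"TotalPower": [6.5, 10.0, 250.0]})
column = frame["TotalPower"]  # single brackets select a Series

print(type(column))
print(from_dict["BE"])           # access a value by its label
print(column.equals(from_list))  # same values, index and dtype
```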
Index
- Description: An Index is an immutable sequence used to label data in Pandas objects like Series and DataFrame. Each Index object also carries metadata, such as its name and data type.
- Use Cases: Indexes are used for efficient data retrieval, selection, and alignment. They provide a way to identify and access data elements by label.
- Creation: Indexes are typically created automatically when you create a Series or DataFrame. You can also create custom indexes if needed.
- Example: We can display the index of df.
df.index
---
RangeIndex(start=0, stop=182404, step=1)
This can be read as one integer label per row, from 0 up to but not including 182404 (so 0 to 182403), with a step of 1.
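To illustrate a custom index, the sketch below uses a small hypothetical table (the xtf_id values are made up), replaces the default RangeIndex with a column, and shows that an Index cannot be modified in place.

```python
import pandas as pd

# Hypothetical sample; the xtf_id values are made up.
sample = pd.DataFrame(
    {"xtf_id": [101, 102, 103], "TotalPower": [6.5, 10.0, 250.0]}
)

print(sample.index)       # default RangeIndex: labels 0, 1, 2 (stop is excluded)
print(len(sample.index))

by_id = sample.set_index("xtf_id")   # use a column as row labels
print(by_id.loc[102, "TotalPower"])  # label-based lookup

try:
    by_id.index[0] = 999             # an Index cannot be modified in place
except TypeError as err:
    print("TypeError:", err)
```

Note that `set_index` returns a new DataFrame and leaves `sample` unchanged.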
Using Objects
In practice, DataFrames and Series are the primary Pandas objects you’ll work with for data analysis and manipulation.
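The Pandas documentation is organized by object type, as some operations only exist for certain types of objects.
A Series can be converted into a DataFrame with .to_frame(). A DataFrame cannot be converted into a single Series in general, since it may hold several columns, but each of its columns can be selected as a Series.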
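A minimal sketch of the conversion between the two objects, using hypothetical values. Going from a DataFrame back to a Series is shown for the one-column case only.

```python
import pandas as pd

# Hypothetical values for illustration.
series = pd.Series([6.5, 10.0], name="TotalPower")

frame = series.to_frame()            # Series -> one-column DataFrame
print(type(frame), frame.shape)

back = frame["TotalPower"]           # selecting a column gives a Series again
squeezed = frame.squeeze("columns")  # works because there is exactly one column

print(back.equals(series), squeezed.equals(series))
```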