Accessing data using Pandas
Table of contents
Head and Tail
In data processing and analysis, the terms “head” and “tail” refer to two common operations used to inspect or extract specific portions of a dataset, typically at the beginning (head) or the end (tail) of the dataset. These operations are particularly useful for quickly previewing data or obtaining a sample of the data. Conversely, the “tail” of a dataset refers to the last few rows or elements of the dataset.
# Display the first 5 rows of a DataFrame
df.head()
# Display the last 5 rows of a DataFrame
df.tail()
You can also specify the number of lines you want to display :
# Display the top 10 rows of a DataFrame
df.head(10)
# Display the bottom 10 rows of a DataFrame
df.tail(10)
Indexing
Columns
We can display all the columns of the DataFrame by displaying the property columns
of df
df.columns
---
Index(['xtf_id', 'Address', 'PostCode', 'Municipality', 'Canton',
'BeginningOfOperation', 'InitialPower', 'TotalPower', 'MainCategory',
'SubCategory', 'PlantCategory', '_x', '_y'],
dtype='object')
To select one column we specify its name between []
, so for exemple if I want to select the column Municipality
:
df['Municipality']
---
0 Aesch BL
1 Kaisten
2 Wichtrach
3 Oey
4 Hinwil
...
182399 Oberägeri
182400 Zug
182401 Blumenstein
182402 Riehen
182403 Les Mosses
Name: Municipality, Length: 182404, dtype: object
To select multiple columns they should be in an array so for example if I want to select Municipality
, Postcode
,and Canton
:
df[['Municipality','PostCode','Canton']]
---
Municipality PostCode Canton
0 Aesch BL 4147 BL
1 Kaisten 5082 AG
2 Wichtrach 3114 BE
3 Oey 3753 BE
4 Hinwil 8340 ZH
... ... ... ...
182399 Oberägeri 6315 ZG
182400 Zug 6300 ZG
182401 Blumenstein 3638 BE
182402 Riehen 4125 BS
182403 Les Mosses 1862 VD
182404 rows × 3 columns
Slicing
iloc : selection by position
The .iloc
methods can be used to slice rows or/and columns by position (ie their number). So for example if I want to slice from the 5th to the 10th line of df
:
df.iloc[4:10]
---
xtf_id Address PostCode Municipality Canton BeginningOfOperation InitialPower TotalPower MainCategory SubCategory PlantCategory _x _y
4 9476 Holzweidstrasse 8 8340 Hinwil ZH 2006-04-21 4.8 4.8 maincat_2 subcat_2 plantcat_8 2705863.0 1240553.0
5 14729 Lärchentobelstrasse 33 8700 Küsnacht ZH 2010-11-04 21.8 21.8 maincat_2 subcat_2 plantcat_9 2687331.0 1241205.0
6 14730 Schulhaustrasse 24 9470 Buchs SG 2008-09-24 3.6 3.6 maincat_2 subcat_2 plantcat_8 2754146.0 1225474.0
7 9365 Renggerstrasse 58 5000 Aarau AG 2006-09-14 1.6 1.6 maincat_2 subcat_2 plantcat_8 2645858.0 1248560.0
8 14742 Gartenweg 8 5018 Erlinsbach AG 2011-09-19 7.7 7.7 maincat_2 subcat_2 plantcat_9 2643326.0 1250567.0
9 14747 Hofbergstrasse 21 9500 Wil SG 2011-12-16 50.9 50.9 maincat_2 subcat_2 plantcat_9 2721432.0 1258999.0
Array in Python (as most programming language) starts at 0
Python function range(i,n)
start at ì
but finish at n-1
If I want to select the 2nd columns to the end :
df.iloc[: , 1: ]
---
Address PostCode Municipality Canton BeginningOfOperation InitialPower TotalPower MainCategory SubCategory PlantCategory _x _y
0 Schlossstrasse 15 4147 Aesch BL BL 2009-05-05 14.65 18.81 maincat_2 subcat_2 plantcat_9 2611936.0 1257011.0
1 Ob der Steig 1 5082 Kaisten AG 2011-10-28 5.80 5.80 maincat_2 subcat_2 plantcat_8 2645758.0 1265094.0
2 Gässli 4 3114 Wichtrach BE 2008-10-07 3.00 3.00 maincat_2 subcat_2 plantcat_8 2610547.0 1188979.0
3 Diemtigtalstrasse 46 3753 Oey BE 2008-06-27 8.40 8.40 maincat_2 subcat_2 plantcat_8 2610529.0 1167346.0
4 Holzweidstrasse 8 8340 Hinwil ZH 2006-04-21 4.80 4.80 maincat_2 subcat_2 plantcat_8 2705863.0 1240553.0
... ... ... ... ... ... ... ... ... ... ... ... ...
182399 Kalchrainstrasse 12 6315 Oberägeri ZG 1996-04-04 3.00 3.00 maincat_2 subcat_2 NaN 2689646.0 1221007.0
182400 Bahnhofstrasse 18 6300 Zug ZG 2001-07-04 3.00 3.00 maincat_2 subcat_2 NaN 2681678.0 1224818.0
182401 Zugimattstrasse 7 3638 Blumenstein BE 2004-07-21 1.00 1.00 maincat_2 subcat_2 NaN 2606413.0 1176995.0
182402 Hirzenstrasse 14 4125 Riehen BS 2005-07-08 4.90 4.90 maincat_2 subcat_2 NaN 2616010.0 1269781.0
182403 (Lioson-Dessous) Les Fontaines-Goupe 3 1862 Les Mosses VD 1990-01-01 40.00 40.00 maincat_1 subcat_1 plantcat_5 NaN NaN
182404 rows × 12 columns
And if I want to slice rows and columns :
df.iloc[10:21 , 5:8 ]
---
BeginningOfOperation InitialPower TotalPower
10 2011-11-22 2.6 2.60
11 2008-11-20 12.5 12.50
12 2020-10-19 2550.0 2550.00
13 2008-11-14 4.0 4.00
14 2009-08-17 124.2 1541.08
15 2008-05-05 9.0 9.00
16 2009-03-24 10.4 10.40
17 2009-12-07 6.3 6.30
18 2008-11-07 3.2 3.20
19 2007-11-01 8.7 8.70
20 2020-11-16 2550.0 2550.00
loc : selection by index
the loc
works on the same way than loc
but with index.
Let’s load another dataset and specify a column as index :
df_MainCategory=pd.read_csv('MainCategoryCatalogue.csv',index_col='Catalogue_id')
df_MainCategory
---
Catalogue_id de fr it en
maincat_1 Wasserkraft Énergie hydraulique Forza idrica Hydroelectric power
maincat_2 Übrige erneuerbare Energien Autres énergies renouvelables Altre energie rinnovabili Other renewable energies
maincat_3 Kernenergie Énergie nucléaire Energia nucleare Nuclear energy
maincat_4 Fossile Energieträger Agents énergétiques fossiles Vettori energetici fossili Fossil fuel
I can display the index of this new DataFrame :
df_MainCategory.index
---
Index(['maincat_1', 'maincat_2', 'maincat_3', 'maincat_4'], dtype='object', name='Catalogue_id')
I can for example slice from maintcat_2
to maincat_3
the column en
:
df_MainCategory.loc['maincat_2':'maincat_3','en']
---
Catalogue_id
maincat_2 Other renewable energies
maincat_3 Nuclear energy
Name: en, dtype: object
Filtering
Filtering data is a fundamental operation in data analysis that allows you to select specific rows or columns from a DataFrame based on certain conditions or criteria. Pandas provides several methods and techniques for filtering data effectively.
Using Boolean Indexing
The primary way to filter data in Pandas is through boolean indexing. Boolean indexing involves creating a boolean mask—an array of True
and False
values—that specifies which rows or columns meet a certain condition. You can then use this boolean mask to extract the desired data.
Here’s how to perform filtering with Pandas:
Creating a Boolean Mask
To create a boolean mask, you apply a condition to a DataFrame or Series. For example, you can create a mask that checks if values in a column meet a certain criteria:
# Create a boolean mask for values greater than 50 in a DataFrame column TotalPower
mask = df['TotalPower'] > 50
mask
---
0 False
1 False
2 False
3 False
4 False
...
182399 False
182400 False
182401 False
182402 False
182403 False
Name: TotalPower, Length: 182404, dtype: bool
Applying the Mask
Once you have a boolean mask, you can use it to filter data by applying it to the DataFrame:
# Use the boolean mask to filter the DataFrame
filtered_data = df[mask]
filtered_data
---
xtf_id Address PostCode Municipality Canton BeginningOfOperation InitialPower TotalPower MainCategory SubCategory PlantCategory _x _y
9 14747 Hofbergstrasse 21 9500 Wil SG 2011-12-16 50.9 50.90 maincat_2 subcat_2 plantcat_9 2721432.0 1258999.0
12 13179 Passo San Gottardo 6780 Airolo TI 2020-10-19 2550.0 2550.00 maincat_2 subcat_3 NaN 2686253.0 1157143.0
14 11795 Via Industria 1 6934 Bioggio TI 2009-08-17 124.2 1541.08 maincat_2 subcat_2 plantcat_8 2714261.0 1097044.0
20 13171 Passo San Gottardo 6780 Airolo TI 2020-11-16 2550.0 2550.00 maincat_2 subcat_3 NaN 2686251.0 1156372.0
28 13159 Passo San Gottardo 6777 Quinto TI 2020-11-23 2550.0 2550.00 maincat_2 subcat_3 NaN 2685886.0 1156734.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
182380 27675 Giarsun 7545 Guarda GR 1998-01-22 350.0 350.00 maincat_1 subcat_1 plantcat_2 NaN NaN
182381 27685 Charal 7545 Guarda GR 1991-02-20 182.0 182.00 maincat_1 subcat_1 plantcat_5 NaN NaN
182382 27695 Chasura 7559 Tschlin GR 1994-11-07 300.0 300.00 maincat_1 subcat_1 plantcat_5 NaN NaN
182383 27705 Prättigauerstrasse 1 7208 Malans GR GR 1999-09-01 70.0 70.00 maincat_1 subcat_1 plantcat_4 2763587.0 1204857.0
182394 38488 Wispel 1 9472 Grabs SG 1995-08-01 287.0 287.00 maincat_1 subcat_1 plantcat_7 2751336.0 1227305.0
13373 rows × 13 columns
This results in a new DataFrame containing only the rows that satisfy the condition.
Combining Conditions
You can combine multiple conditions using logical operators such as &
(AND) and |
(OR). For example, to filter rows where two conditions must be met:
# Filter rows where 'TotalPower' is greater than 50 and 'Canton' is ZH
combined_condition = (df['TotalPower'] > 50) & (df['Canton'] == 'ZH')
filtered_rows = df[combined_condition]
filtered_rows
---
xtf_id Address PostCode Municipality Canton BeginningOfOperation InitialPower TotalPower MainCategory SubCategory PlantCategory _x _y
114 17067 Lochrütistrasse 12 8633 Wolfhausen ZH 2011-09-22 165.4 165.4 maincat_2 subcat_2 plantcat_8 2702965.0 1235106.0
124 17075 Salenstrasse 8162 Steinmaur ZH 2012-02-23 174.7 174.7 maincat_2 subcat_2 plantcat_8 NaN NaN
177 9517 Weberrütistrasse 2 8833 Samstagern ZH 2008-11-14 123.6 123.6 maincat_2 subcat_2 plantcat_8 2694034.0 1227382.0
225 17158 Neunfornerstrasse 56 8479 Altikon ZH 2011-12-21 189.4 189.4 maincat_2 subcat_2 plantcat_8 2701091.0 1270696.0
248 17244 Regensbergerstrasse 21 8162 Sünikon ZH 2011-06-16 64.4 64.4 maincat_2 subcat_2 plantcat_8 2675004.0 1260504.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
182111 9283 Zürichstrasse 23a 8607 Aathal-Seegräben ZH 1943-11-01 245.0 245.0 maincat_1 subcat_1 plantcat_2 2700523.0 1243585.0
182158 9549 Hagenholzstrasse 73 8050 Zürich ZH 2001-12-13 170.0 170.0 maincat_2 subcat_2 NaN 2684419.0 1252231.0
182159 9552 Tièchestrasse 99 8037 Zürich ZH 2004-05-13 55.5 55.5 maincat_2 subcat_2 NaN 2681641.0 1250361.0
182307 10752 Mühlau 21 8482 Sennhof ZH 1942-01-01 600.0 600.0 maincat_1 subcat_1 plantcat_2 2699195.0 1258012.0
182349 12365 Usterstrasse 206 8620 Wetzikon ZH 1900-01-01 170.0 170.0 maincat_1 subcat_1 plantcat_2 2701241.0 1243033.0
1523 rows × 13 columns
you can use the function reset_index()
to restart the counting on the index.
Summary
Filtering data is a critical skill in data analysis, and Pandas provides powerful tools for this purpose. By creating boolean masks and applying them to DataFrames, you can extract specific subsets of data that meet your criteria. Whether you need to filter rows, columns, or both, Pandas offers flexible and efficient methods for data filtering, allowing you to gain valuable insights from your datasets.
Exercices
- Generate a new DataFrame called
df2
that only containsCanton
,TotalPower
columns . - On
df2
Display only the plants that are in Vaud canton (VD
) save the result indf2vd
. - Sum the column
TotalPower
ofdf2vd
using the.sum()
function. - Now we would like to do the same operation for all the canton (on
df2
) for that we can use the function groupby(_column_name) , you have then to specify the column you want to group (in[]
) and the operation of grouping (.sum() in our case
). - Save this new Serie into the
powerbycanton
variable.