Accessing data using Pandas

Table of contents

  1. Head and Tail
  2. Indexing
    1. Columns
    2. Slicing
      1. iloc : selection by position
      2. loc : selection by index
  3. Filtering
    1. Using Boolean Indexing
      1. Creating a Boolean Mask
      2. Applying the Mask
    2. Combining Conditions
    3. Summary
  4. Exercices

Head and Tail

In data processing and analysis, the terms “head” and “tail” refer to two common operations used to inspect or extract specific portions of a dataset, typically at the beginning (head) or the end (tail) of the dataset. These operations are particularly useful for quickly previewing data or obtaining a sample of the data. Conversely, the “tail” of a dataset refers to the last few rows or elements of the dataset.

# Display the first 5 rows of a DataFrame
df.head()
# Display the last 5 rows of a DataFrame
df.tail()

You can also specify the number of lines you want to display :

# Display the top 10 rows of a DataFrame
df.head(10)
# Display the bottom 10 rows of a DataFrame
df.tail(10)

Indexing

Columns

We can display all the columns of the DataFrame by displaying the property columns of df

df.columns
---

Index(['xtf_id', 'Address', 'PostCode', 'Municipality', 'Canton',
       'BeginningOfOperation', 'InitialPower', 'TotalPower', 'MainCategory',
       'SubCategory', 'PlantCategory', '_x', '_y'],
      dtype='object')

To select one column we specify its name between [], so for exemple if I want to select the column Municipality :

df['Municipality']
---

0            Aesch BL
1             Kaisten
2           Wichtrach
3                 Oey
4              Hinwil
             ...     
182399      Oberägeri
182400            Zug
182401    Blumenstein
182402         Riehen
182403     Les Mosses
Name: Municipality, Length: 182404, dtype: object

To select multiple columns they should be in an array so for example if I want to select Municipality, Postcode ,and Canton :

df[['Municipality','PostCode','Canton']]
---


	Municipality 	PostCode 	Canton
0 	Aesch BL 	    4147 	    BL
1 	Kaisten 	    5082 	    AG
2 	Wichtrach 	    3114 	    BE
3 	Oey 	        3753        BE
4 	Hinwil 	        8340    	ZH
... 	    ... 	    ... 	...
182399 	Oberägeri 	6315 	    ZG
182400 	Zug 	    6300 	    ZG
182401 	Blumenstein 3638 	    BE
182402 	Riehen 	    4125 	    BS
182403 	Les Mosses 	1862 	    VD

182404 rows × 3 columns

Slicing

iloc : selection by position

The .ilocmethods can be used to slice rows or/and columns by position (ie their number). So for example if I want to slice from the 5th to the 10th line of df:

df.iloc[4:10]
---



	xtf_id 	Address 	PostCode 	Municipality 	Canton 	BeginningOfOperation 	InitialPower 	TotalPower 	MainCategory 	SubCategory 	PlantCategory 	_x 	_y
4 	9476 	Holzweidstrasse 8 	8340 	Hinwil 	ZH 	2006-04-21 	4.8 	4.8 	maincat_2 	subcat_2 	plantcat_8 	2705863.0 	1240553.0
5 	14729 	Lärchentobelstrasse 33 	8700 	Küsnacht 	ZH 	2010-11-04 	21.8 	21.8 	maincat_2 	subcat_2 	plantcat_9 	2687331.0 	1241205.0
6 	14730 	Schulhaustrasse 24 	9470 	Buchs 	SG 	2008-09-24 	3.6 	3.6 	maincat_2 	subcat_2 	plantcat_8 	2754146.0 	1225474.0
7 	9365 	Renggerstrasse 58 	5000 	Aarau 	AG 	2006-09-14 	1.6 	1.6 	maincat_2 	subcat_2 	plantcat_8 	2645858.0 	1248560.0
8 	14742 	Gartenweg 8 	5018 	Erlinsbach 	AG 	2011-09-19 	7.7 	7.7 	maincat_2 	subcat_2 	plantcat_9 	2643326.0 	1250567.0
9 	14747 	Hofbergstrasse 21 	9500 	Wil 	SG 	2011-12-16 	50.9 	50.9 	maincat_2 	subcat_2 	plantcat_9 	2721432.0 	1258999.0


Array in Python (as most programming language) starts at 0

Python function range(i,n) start at ì but finish at n-1

If I want to select the 2nd columns to the end :

df.iloc[: , 1: ]
---


	Address 	PostCode 	Municipality 	Canton 	BeginningOfOperation 	InitialPower 	TotalPower 	MainCategory 	SubCategory 	PlantCategory 	_x 	_y
0 	Schlossstrasse 15 	4147 	Aesch BL 	BL 	2009-05-05 	14.65 	18.81 	maincat_2 	subcat_2 	plantcat_9 	2611936.0 	1257011.0
1 	Ob der Steig 1 	5082 	Kaisten 	AG 	2011-10-28 	5.80 	5.80 	maincat_2 	subcat_2 	plantcat_8 	2645758.0 	1265094.0
2 	Gässli 4 	3114 	Wichtrach 	BE 	2008-10-07 	3.00 	3.00 	maincat_2 	subcat_2 	plantcat_8 	2610547.0 	1188979.0
3 	Diemtigtalstrasse 46 	3753 	Oey 	BE 	2008-06-27 	8.40 	8.40 	maincat_2 	subcat_2 	plantcat_8 	2610529.0 	1167346.0
4 	Holzweidstrasse 8 	8340 	Hinwil 	ZH 	2006-04-21 	4.80 	4.80 	maincat_2 	subcat_2 	plantcat_8 	2705863.0 	1240553.0
... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	...
182399 	Kalchrainstrasse 12 	6315 	Oberägeri 	ZG 	1996-04-04 	3.00 	3.00 	maincat_2 	subcat_2 	NaN 	2689646.0 	1221007.0
182400 	Bahnhofstrasse 18 	6300 	Zug 	ZG 	2001-07-04 	3.00 	3.00 	maincat_2 	subcat_2 	NaN 	2681678.0 	1224818.0
182401 	Zugimattstrasse 7 	3638 	Blumenstein 	BE 	2004-07-21 	1.00 	1.00 	maincat_2 	subcat_2 	NaN 	2606413.0 	1176995.0
182402 	Hirzenstrasse 14 	4125 	Riehen 	BS 	2005-07-08 	4.90 	4.90 	maincat_2 	subcat_2 	NaN 	2616010.0 	1269781.0
182403 	(Lioson-Dessous) Les Fontaines-Goupe 3 	1862 	Les Mosses 	VD 	1990-01-01 	40.00 	40.00 	maincat_1 	subcat_1 	plantcat_5 	NaN 	NaN

182404 rows × 12 columns

And if I want to slice rows and columns :

df.iloc[10:21 , 5:8 ]
---


	BeginningOfOperation 	InitialPower 	TotalPower
10 	2011-11-22 	2.6 	2.60
11 	2008-11-20 	12.5 	12.50
12 	2020-10-19 	2550.0 	2550.00
13 	2008-11-14 	4.0 	4.00
14 	2009-08-17 	124.2 	1541.08
15 	2008-05-05 	9.0 	9.00
16 	2009-03-24 	10.4 	10.40
17 	2009-12-07 	6.3 	6.30
18 	2008-11-07 	3.2 	3.20
19 	2007-11-01 	8.7 	8.70
20 	2020-11-16 	2550.0 	2550.00

loc : selection by index

the loc works on the same way than loc but with index.

Let’s load another dataset and specify a column as index :

df_MainCategory=pd.read_csv('MainCategoryCatalogue.csv',index_col='Catalogue_id')
df_MainCategory
---


	
Catalogue_id 	de 	            fr 	                    it 	            en			
maincat_1 	Wasserkraft 	Énergie hydraulique 	Forza idrica 	Hydroelectric power
maincat_2 	Übrige erneuerbare Energien 	Autres énergies renouvelables 	Altre energie rinnovabili 	Other renewable energies
maincat_3 	Kernenergie 	Énergie nucléaire 	Energia nucleare 	Nuclear energy
maincat_4 	Fossile Energieträger 	Agents énergétiques fossiles 	Vettori energetici fossili 	Fossil fuel

I can display the index of this new DataFrame :

df_MainCategory.index
---

Index(['maincat_1', 'maincat_2', 'maincat_3', 'maincat_4'], dtype='object', name='Catalogue_id')

I can for example slice from maintcat_2 to maincat_3 the column en :

df_MainCategory.loc['maincat_2':'maincat_3','en']
---

Catalogue_id
maincat_2    Other renewable energies
maincat_3              Nuclear energy
Name: en, dtype: object

Filtering

Filtering data is a fundamental operation in data analysis that allows you to select specific rows or columns from a DataFrame based on certain conditions or criteria. Pandas provides several methods and techniques for filtering data effectively.

Using Boolean Indexing

The primary way to filter data in Pandas is through boolean indexing. Boolean indexing involves creating a boolean mask—an array of True and False values—that specifies which rows or columns meet a certain condition. You can then use this boolean mask to extract the desired data.

Here’s how to perform filtering with Pandas:

Creating a Boolean Mask

To create a boolean mask, you apply a condition to a DataFrame or Series. For example, you can create a mask that checks if values in a column meet a certain criteria:

# Create a boolean mask for values greater than 50 in a DataFrame column TotalPower
mask = df['TotalPower'] > 50
mask
---

0         False
1         False
2         False
3         False
4         False
          ...  
182399    False
182400    False
182401    False
182402    False
182403    False
Name: TotalPower, Length: 182404, dtype: bool

Applying the Mask

Once you have a boolean mask, you can use it to filter data by applying it to the DataFrame:

# Use the boolean mask to filter the DataFrame
filtered_data = df[mask]
filtered_data
---

	xtf_id 	Address 	PostCode 	Municipality 	Canton 	BeginningOfOperation 	InitialPower 	TotalPower 	MainCategory 	SubCategory 	PlantCategory 	_x 	_y
9 	14747 	Hofbergstrasse 21 	9500 	Wil 	SG 	2011-12-16 	50.9 	50.90 	maincat_2 	subcat_2 	plantcat_9 	2721432.0 	1258999.0
12 	13179 	Passo San Gottardo 	6780 	Airolo 	TI 	2020-10-19 	2550.0 	2550.00 	maincat_2 	subcat_3 	NaN 	2686253.0 	1157143.0
14 	11795 	Via Industria 1 	6934 	Bioggio 	TI 	2009-08-17 	124.2 	1541.08 	maincat_2 	subcat_2 	plantcat_8 	2714261.0 	1097044.0
20 	13171 	Passo San Gottardo 	6780 	Airolo 	TI 	2020-11-16 	2550.0 	2550.00 	maincat_2 	subcat_3 	NaN 	2686251.0 	1156372.0
28 	13159 	Passo San Gottardo 	6777 	Quinto 	TI 	2020-11-23 	2550.0 	2550.00 	maincat_2 	subcat_3 	NaN 	2685886.0 	1156734.0
... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	...
182380 	27675 	Giarsun 	7545 	Guarda 	GR 	1998-01-22 	350.0 	350.00 	maincat_1 	subcat_1 	plantcat_2 	NaN 	NaN
182381 	27685 	Charal 	7545 	Guarda 	GR 	1991-02-20 	182.0 	182.00 	maincat_1 	subcat_1 	plantcat_5 	NaN 	NaN
182382 	27695 	Chasura 	7559 	Tschlin 	GR 	1994-11-07 	300.0 	300.00 	maincat_1 	subcat_1 	plantcat_5 	NaN 	NaN
182383 	27705 	Prättigauerstrasse 1 	7208 	Malans GR 	GR 	1999-09-01 	70.0 	70.00 	maincat_1 	subcat_1 	plantcat_4 	2763587.0 	1204857.0
182394 	38488 	Wispel 1 	9472 	Grabs 	SG 	1995-08-01 	287.0 	287.00 	maincat_1 	subcat_1 	plantcat_7 	2751336.0 	1227305.0

13373 rows × 13 columns

This results in a new DataFrame containing only the rows that satisfy the condition.

Combining Conditions

You can combine multiple conditions using logical operators such as & (AND) and | (OR). For example, to filter rows where two conditions must be met:

# Filter rows where 'TotalPower' is greater than 50 and 'Canton' is ZH
combined_condition = (df['TotalPower'] > 50) & (df['Canton'] == 'ZH')
filtered_rows = df[combined_condition]
filtered_rows
---

	xtf_id 	Address 	PostCode 	Municipality 	Canton 	BeginningOfOperation 	InitialPower 	TotalPower 	MainCategory 	SubCategory 	PlantCategory 	_x 	_y
114 	17067 	Lochrütistrasse 12 	8633 	Wolfhausen 	ZH 	2011-09-22 	165.4 	165.4 	maincat_2 	subcat_2 	plantcat_8 	2702965.0 	1235106.0
124 	17075 	Salenstrasse 	8162 	Steinmaur 	ZH 	2012-02-23 	174.7 	174.7 	maincat_2 	subcat_2 	plantcat_8 	NaN 	NaN
177 	9517 	Weberrütistrasse 2 	8833 	Samstagern 	ZH 	2008-11-14 	123.6 	123.6 	maincat_2 	subcat_2 	plantcat_8 	2694034.0 	1227382.0
225 	17158 	Neunfornerstrasse 56 	8479 	Altikon 	ZH 	2011-12-21 	189.4 	189.4 	maincat_2 	subcat_2 	plantcat_8 	2701091.0 	1270696.0
248 	17244 	Regensbergerstrasse 21 	8162 	Sünikon 	ZH 	2011-06-16 	64.4 	64.4 	maincat_2 	subcat_2 	plantcat_8 	2675004.0 	1260504.0
... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	... 	...
182111 	9283 	Zürichstrasse 23a 	8607 	Aathal-Seegräben 	ZH 	1943-11-01 	245.0 	245.0 	maincat_1 	subcat_1 	plantcat_2 	2700523.0 	1243585.0
182158 	9549 	Hagenholzstrasse 73 	8050 	Zürich 	ZH 	2001-12-13 	170.0 	170.0 	maincat_2 	subcat_2 	NaN 	2684419.0 	1252231.0
182159 	9552 	Tièchestrasse 99 	8037 	Zürich 	ZH 	2004-05-13 	55.5 	55.5 	maincat_2 	subcat_2 	NaN 	2681641.0 	1250361.0
182307 	10752 	Mühlau 21 	8482 	Sennhof 	ZH 	1942-01-01 	600.0 	600.0 	maincat_1 	subcat_1 	plantcat_2 	2699195.0 	1258012.0
182349 	12365 	Usterstrasse 206 	8620 	Wetzikon 	ZH 	1900-01-01 	170.0 	170.0 	maincat_1 	subcat_1 	plantcat_2 	2701241.0 	1243033.0

1523 rows × 13 columns

you can use the function reset_index() to restart the counting on the index.

Summary

Filtering data is a critical skill in data analysis, and Pandas provides powerful tools for this purpose. By creating boolean masks and applying them to DataFrames, you can extract specific subsets of data that meet your criteria. Whether you need to filter rows, columns, or both, Pandas offers flexible and efficient methods for data filtering, allowing you to gain valuable insights from your datasets.

Exercices

  • Generate a new DataFrame called df2 that only contains Canton, TotalPower columns .
  • On df2 Display only the plants that are in Vaud canton (VD) save the result in df2vd.
  • Sum the column TotalPower of df2vd using the .sum() function.
  • Now we would like to do the same operation for all the canton (on df2) for that we can use the function groupby(_column_name) , you have then to specify the column you want to group (in []) and the operation of grouping (.sum() in our case).
  • Save this new Serie into the powerbycanton variable.