Joining Data with Pandas in Python

Hands-on Mentor
4 min readApr 3, 2023

--

Joining data is an important task in data analysis and is a common requirement in many data science projects. Pandas is a powerful data manipulation library in Python that provides several techniques to join dataframes. In this post, we’ll explore the different types of joins available in Pandas and how to use them with examples.

Dataset

We’ll be working with two datasets to demonstrate the different types of joins. The first dataset contains information about employees and their departments, while the second dataset contains information about the departments and their managers.

employees.csv

departments.csv

Inner Join

An inner join returns only the rows that have matching values in both dataframes. To perform an inner join in Pandas, we use the merge() function.

import pandas as pd
employees = pd.read_csv('employees.csv')
departments = pd.read_csv('departments.csv')
inner_join = pd.merge(employees, departments, left_on='department_id', right_on='id')

The merge() function takes two dataframes and the columns to join on. In this example, we join employees and departments on the department_id column in employees and the id column in departments.

The resulting dataframe inner_join looks like this:

Note that the join columns department_id and id are duplicated in the resulting dataframe. We can drop one of them using the drop() function.

inner_join = inner_join.drop('id_x', axis=1)

Left Join

A left join returns all the rows from the left dataframe and the matching rows from the right dataframe. To perform a left join in Pandas, we use the merge() function with the how='left' parameter.

left_join = pd.merge(employees, departments, left_on='department_id', right_on='id', how='left')

The resulting dataframe left_join looks like this:

Note that all the rows from employees are included in the resulting dataframe, even if there is no matching row in departments. The id_y, name_y, and manager columns are filled with NaN values for these rows.

Right Join

A right join returns all the rows from the right dataframe and the matching rows from the left dataframe. To perform a right join in Pandas, we use the merge() function with the how='right' parameter.

right_join = pd.merge(employees, departments, left_on='department_id', right_on='id', how='right')

The resulting dataframe right_join looks like this:

Note that all the rows from departments are included in the resulting dataframe, even if there is no matching row in employees. The id_x, name_x, and department_id columns are filled with NaN values for these rows.

Outer Join

An outer join returns all the rows from both dataframes. To perform an outer join in Pandas, we use the merge() function with the how='outer' parameter.

outer_join = pd.merge(employees, departments, left_on='department_id', right_on='id', how='outer')

The resulting dataframe outer_join looks like this:

Note that all the rows from both employees and departments are included in the resulting dataframe, even if there is no matching row in the other dataframe. The id_x, name_x, department_id, id_y, name_y, and manager columns are filled with NaN values for these rows.

Conclusion

Joining data is an important task in data analysis, and Pandas provides several techniques to join dataframes. In this post, we explored the different types of joins available in Pandas and how to use them with examples. Remember that choosing the right type of join depends on the requirements of your project and the nature of your data.

--

--

Hands-on Mentor
Hands-on Mentor

Written by Hands-on Mentor

Practical Learning with Personalized Mentorship

No responses yet