Joining Data with Pandas in Python
Joining data is an important task in data analysis and is a common requirement in many data science projects. Pandas is a powerful data manipulation library in Python that provides several techniques to join dataframes. In this post, we’ll explore the different types of joins available in Pandas and how to use them with examples.
Dataset
We’ll be working with two datasets to demonstrate the different types of joins. The first dataset contains information about employees and their departments, while the second dataset contains information about the departments and their managers.
employees.csv
departments.csv
Inner Join
An inner join returns only the rows that have matching values in both dataframes. To perform an inner join in Pandas, we use the merge()
function.
import pandas as pd
employees = pd.read_csv('employees.csv')
departments = pd.read_csv('departments.csv')inner_join = pd.merge(employees, departments, left_on='department_id', right_on='id')
The merge()
function takes two dataframes and the columns to join on. In this example, we join employees
and departments
on the department_id
column in employees
and the id
column in departments
.
The resulting dataframe inner_join
looks like this:
Note that the join columns department_id
and id
are duplicated in the resulting dataframe. We can drop one of them using the drop()
function.
inner_join = inner_join.drop('id_x', axis=1)
Left Join
A left join returns all the rows from the left dataframe and the matching rows from the right dataframe. To perform a left join in Pandas, we use the merge()
function with the how='left'
parameter.
left_join = pd.merge(employees, departments, left_on='department_id', right_on='id', how='left')
The resulting dataframe left_join
looks like this:
Note that all the rows from employees
are included in the resulting dataframe, even if there is no matching row in departments
. The id_y
, name_y
, and manager
columns are filled with NaN
values for these rows.
Right Join
A right join returns all the rows from the right dataframe and the matching rows from the left dataframe. To perform a right join in Pandas, we use the merge()
function with the how='right'
parameter.
right_join = pd.merge(employees, departments, left_on='department_id', right_on='id', how='right')
The resulting dataframe right_join
looks like this:
Note that all the rows from departments
are included in the resulting dataframe, even if there is no matching row in employees
. The id_x
, name_x
, and department_id
columns are filled with NaN
values for these rows.
Outer Join
An outer join returns all the rows from both dataframes. To perform an outer join in Pandas, we use the merge()
function with the how='outer'
parameter.
outer_join = pd.merge(employees, departments, left_on='department_id', right_on='id', how='outer')
The resulting dataframe outer_join
looks like this:
Note that all the rows from both employees
and departments
are included in the resulting dataframe, even if there is no matching row in the other dataframe. The id_x
, name_x
, department_id
, id_y
, name_y
, and manager
columns are filled with NaN
values for these rows.
Conclusion
Joining data is an important task in data analysis, and Pandas provides several techniques to join dataframes. In this post, we explored the different types of joins available in Pandas and how to use them with examples. Remember that choosing the right type of join depends on the requirements of your project and the nature of your data.