Outer Join of Data Frames in R: An Essential Guide

Data manipulation and transformation are integral parts of data analysis and R programming, offering a variety of tools and functions to manipulate datasets efficiently. Among those, joining tables is a fundamental technique that combines data from two different sources based on a common key or set of keys. In this article, we will delve into the concept of outer joins, one of the several types of joins available in R, and explore how you can use them to enrich your data analysis experience.

Understanding Outer Joins

An outer join is a method to combine two data frames by merging rows that have matchable values in their key columns, and also by including rows that do not have a match in the other data frame. In R, there are three types of outer joins:

  • Left outer join: Includes all rows from the left data frame and the matched rows from the right data frame. Rows in the left data frame that do not have a corresponding match in the right data frame are still included, but with NA in the columns from the right data frame.
  • Right outer join: Includes all rows from the right data frame and the matched rows from the left data frame. Rows in the right data frame that do not have a match in the left are included with NA in the columns from the left data frame.
  • Full outer join: Combines the results of both left and right outer joins, including all rows from both data frames, with NAs filling in for non-matching rows on either side.

These joins are especially useful when you need to retain as much information as possible from your datasets, even if the records do not match perfectly. Now, let’s explore how to perform each type of outer join in R using the merge() function and the dplyr package.

Performing Outer Joins in R

Using the merge() Function

The merge() function is a base R function that allows you to perform merges, including outer joins, between two data frames. Here is how you can use merge() to do an outer join.

For each outer join, we’ll consider two example data frames, df1 and df2, which we will define as follows:


df1 <- data.frame(ID = c(1, 2, 3), Value = c("A", "B", "C"))
df2 <- data.frame(ID = c(1, 3, 4), Score = c(10, 20, 30))

# Printing data frames
print(df1)
print(df2)

Now, let’s perform different types of outer joins on these data frames.

Left Outer Join


left_join <- merge(df1, df2, by = "ID", all.x = TRUE)
print(left_join)

The output will be:


  ID Value Score
1  1     A    10
2  2     B    NA
3  3     C    20

Right Outer Join


right_join <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_join)

The output for this would be:


  ID Value Score
1  1     A    10
2  3     C    20
3  4  <NA>    30

Full Outer Join


full_join <- merge(df1, df2, by = "ID", all = TRUE)
print(full_join)

And the result for the full outer join is:


  ID Value Score
1  1     A    10
2  2     B    NA
3  3     C    20
4  4  <NA>    30

Using the dplyr Package

The dplyr package provides a more intuitive and flexible syntax for data manipulation, including joins. To use dplyr, you first need to install and load the package with the following commands:


install.packages("dplyr")
library(dplyr)

Now, we can replicate the outer joins executed previously using the join functions from dplyr.

Left Outer Join


left_join_dplyr <- left_join(df1, df2, by = "ID")
print(left_join_dplyr)

Right Outer Join


right_join_dplyr <- right_join(df1, df2, by = "ID")
print(right_join_dplyr)

Full Outer Join


full_join_dplyr <- full_join(df1, df2, by = "ID")
print(full_join_dplyr)

You will find that the results using dplyr are the same as those from the merge() function, demonstrating the consistency between different methods of conducting joins in R. Note that dplyr‘s syntax is often preferred for its readability and pipeline capabilities, which make complex data manipulation tasks more manageable and more readable.

Dealing with Different Key Column Names

Sometimes, the columns you wish to join by have different names in each data frame. Here’s how to handle that scenario using both merge() and dplyr.

Using merge()


df1_alt <- data.frame(Key1 = c(1, 2, 3), Value = c("A", "B", "C"))
df2_alt <- data.frame(Key2 = c(1, 3, 4), Score = c(10, 20, 30))

full_join_diff_keys <- merge(df1_alt, df2_alt, by.x = "Key1", by.y = "Key2", all = TRUE)
print(full_join_diff_keys)

Using dplyr

You can rename one of the columns to match the other before the join or specify the different key names using the `by` argument in dplyr functions.


full_join_dplyr_diff_keys <- full_join(df1_alt, rename(df2_alt, Key1 = Key2), by = "Key1")
print(full_join_dplyr_diff_keys)

# or

full_join_dplyr_diff_keys <- full_join(df1_alt, df2_alt, by = c("Key1" = "Key2"))
print(full_join_dplyr_diff_keys)

Now that we have seen both methods in action, let’s briefly consider the advantages of using dplyr over base R’s merge() for outer joins.

Advantages of Using dplyr for Outer Joins

The dplyr package is part of the tidyverse, a suite of R packages designed for data science. Using dplyr for outer joins provides the following benefits:

  • Readability: The dplyr syntax is generally very readable and easier to understand at a glance.
  • Consistency: The dplyr package provides a set of consistent and simple functions for data manipulation, reducing the need to remember different function arguments and behaviors.
  • Chaining Operations: With %>% (the pipe operator), you can easily chain multiple data manipulation steps together, which makes your code more readable and concise.
  • Performance: For large data sets, dplyr can be faster than base R functions due to its underlying implementation.
  • Extended functionality: dplyr offers additional convenience functions and features not present in base R.

In conclusion, understanding and utilizing outer joins in R is essential for anyone working with relational data. Whether using base R’s merge() function or dplyr‘s suite of join functions, outer joins allow you to combine data frames in a way that preserves as much information as possible, which is critical for comprehensive data analysis. As demonstrated, both methods have their use cases, but the dplyr package has become a staple in the R community for its elegance and additional capabilities.

About Editorial Team

Our Editorial Team is made up of tech enthusiasts deeply skilled in Apache Spark, PySpark, and Machine Learning, alongside proficiency in Pandas, R, Hive, PostgreSQL, Snowflake, and Databricks. They're not just experts; they're passionate educators, dedicated to demystifying complex data concepts through engaging and easy-to-understand tutorials.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top