Data manipulation/ Joining data

More dplyr (package in the tidyverse)

Review dplyr

  • dplyr is a powerful R package for data manipulation.

  • Provides a coherent set of verbs (functions) to help you resolve most data manipulation challenges.

  • Part of the tidyverse collection of R packages.

  • Simplifies data manipulation tasks.

  • Intuitive syntax and function names.

  • Today, we’ll focus on the join functions

Review key functions in dplyr

function() Action
glimpse() get a glimpse of your data
count() count the unique values of one or more variables
filter() picks rows based on their values
mutate() creates new variables (columns)
select() picks variables (columns)
summarize() reduces multiple values down to a single statistic
arrange() changes the order of the rows based on their values
group_by() create subsets of data to apply functions to

Join Functions

  • Join functions combine data from two data frames (tables)

  • Based on a common column (key)

  • Different types of joins return different subsets of data

Join functions

function() Action
left_join(x,y) all rows from x; all columns from x and y
right_join(x,y) all rows from y; all columns from x and y
inner_join(x,y) only rows from x and y that have the same column names in x and y
full_join(x,y) all rows from x and y; all columns in x and y
semi_join(x,y) all rows from x that have matching column names in y (only x columns)
anti_join(x,y) all rows from x that don’t NOT matches in y (only x columns)

Steps in data analysis

Load the packages

library(tidyverse)


Create or load data

Creating products_df in tribble format

products_df <- tribble(
  ~product_id, ~product_name, ~category,
  101, "Laptop", "Electronics",
  102, "Mouse", "Accessories",
  103, "Keyboard", "Accessories",
  104, "Monitor", "Electronics",
  105, "Headphones", "Accessories"
)


Creating sales_df in tribble format

sales_df <- tribble(
  ~product_id, ~month, ~units_sold, ~revenue,
  101, "January", 50, 50000,  # Assuming Laptop costs $1000
  102, "January", 150, 3000,  # Mouse $20
  103, "February", 120, 4800,  # Keyboard $40
  105, "February", 80, 4000    # Headphones $50
)

View the tribbles

products_df
sales_df

left_join

  • All rows from the left data frame (products_df) and adds the matching rows from the right data frame (sales_df)
  • All columns
left_join(products_df, sales_df)

right_join

  • All rows from the right data frame (sales_df) and adds the matching rows from the left data frame (products_df)

  • All columns

right_join(products_df, sales_df)

inner_join

  • All rows from the left data frame (products_df) that have matching columns in the right data frame (sales_df)

  • All columns

inner_join(products_df, sales_df)

semi_join

  • All rows from the left data frame (products_df) that have matching columns in the right data frame (sales_df)

  • Only columns in left data frame (products_df)

semi_join(products_df, sales_df)

anti_join

  • All rows from the left data frame (products_df) that do NOT have matching columns in the right data frame (sales_df)

  • Only columns in left data frame (products_df)

anti_join(products_df, sales_df)

Your turn

Create 2 data frames employees and departments

employees <- tribble(
  ~employee_id, ~name,  ~dept_id,
  1,           "John",  100,
  2,           "Jane",  200,
  3,           "Doe",   100,
  4,           "Smith", 300,
  5,           "Stone", 500
)

departments <- tribble(
  ~dept_id, ~dept_name,
  100,      "HR",
  200,      "Finance",
  300,      "IT",
  400,      "Marketing"
)

Your turn

  • Use Inner Join to find out which employees are in which departments that we have records for.

Your turn

  • Use Left Join to list all employees and their departments, even if we don’t have a department record for them.

Your turn

  • Use Right Join to list all departments and see which ones have employees.

Your turn

  • Full Join: Get a comprehensive view of all employees and all departments, regardless of matches.

Business Applications

  • Customer Segmentation: Join customer profiles with purchase data to segment customers based on behavior.
  • Inventory Management: Join product details with sales data to forecast inventory needs.
  • Employee Performance: Join employee details with sales or performance metrics to evaluate and reward top performers.

Conclusion

  • dplyr provides a powerful and readable way to manipulate data
  • Join functions are essential for combining datasets
  • Business analysts can leverage these tools to derive insights and make data-driven decisions

Quiz

  • 6 multiple choice questions