Analytics mindset case studies

ETL in R, part 4

Authors

© 2022 Ernst & Young Foundation (US). SCORE no. 17924-221US_2

modified by E Stanny

Deliverable 6 (cont.)

Finally, you are prepared to join your two data frames. Merging or joining (we use these words interchangeably) can be tricky. It is important to think about what you want before merging, and then to make sure you get what you want after merging. To do this, you should do the following:

  • Think about how you want your data to look after you join (e.g., if you want all of the rows from both dataframes, only the rows that match up from both dataframes, only the rows from one dataframe).

  • Know the level of aggregation in your dataframes.

  • Check the number of rows of your dataframes before and after you join to make sure you did what you thought you did.

What does “level of aggregation” mean? By this, we just mean what each row of your data represents. For example, each row of df_clean represents one or more of the same product from one transaction. Thus, the level of aggregation is product-transaction. On the other hand, each row in df_states is a unique ZIP code, so the level of aggregation is ZIP code.

Once you know the level of aggregation of your two dataframes, you can ask the critical question: What level of aggregation do you want when you join these dataframes? You want to just add state names to rows that already exist in df_clean, so your final dataframe should have exactly the number of rows that “df_clean” has right now. Keep that in mind. That does not sound so difficult, so what could go wrong? Well, a lot. Look at these potential problems:

  • First, maybe df_states has duplicate ZIP codes. If this is true, then you might get duplicate rows in df_clean. You could check this with distinct(), but we already have done this and all ZIP codes are unique.
  • The next problem is that you use the wrong join method. There are multiple ways to join things, and using the wrong method will cause problems.
  • Using the wrong method could add rows that you don’t want because they are in the “df_states” dataframe but not in the df_clean dataframe. Alternatively, using the wrong method could take away rows from the df_clean dataframe because they are not in the state dataframe. Remember, you want to keep everything in the df_clean dataframe and add just states from the state dataframe.

There are many ways to join two dataframes in R. Use the methods from tidyverse. Search the internet and find the documentation from the package “dplyr” in tidyverse about mutating joins. The mutating joins add columns from the second dataframe to the first dataframe, matching rows based on the keys supplied in the function. Here are the different types you can choose from:

  • inner_join(): includes all rows in the first dataframe and the second dataframe that match up using the variable you are matching on
  • left_join(): includes all rows in the first dataframe and only those rows from the second dataframe that match up with the first dataframe using the variable you are matching on
  • right_join(): includes all rows in the second dataframe and only those rows from the first dataframe that match up with the second dataframe using the variable you are matching on
  • full_join(): includes all rows in the first dataframe and all rows in the second dataframe

Take a minute and think about which one of these methods you want to use:

  1. Use the wrong join type first, just to illustrate why it is important to use the correct join method. Use the full_join() method to join df_clean and df_states and create a new dataframe called df_clean_wrong. Match up these dataframes using the zip column.
  • Write a few sentences that discuss why this method of joining did not work correctly. Provide evidence about why it failed.
  1. Now, join the data correctly. What join type do you actually want? You want the left_join() method. This includes all rows from the first dataframe and only those rows from the second dataframe that match up with the first dataframe using the variable you are matching on. Use the left_join() method to join df_clean and df_states and create a new dataframe called df_clean_correct. Match up these dataframes using the zip column.
  • Write df_clean_correct to the posts folder in csv (comma separated value) format.
  • Write a few sentences to discuss why this method of joining worked correctly.