Analytics mindset case studies
ETL in R, part 4
Deliverable 6 (cont.)
Finally, you are prepared to join your two data frames. Merging or joining (we use these words interchangeably) can be tricky. It is important to think about what you want before merging, and then to make sure you get what you want after merging. To do this, you should do the following:
Think about how you want your data to look after you join (e.g., if you want all of the rows from both dataframes, only the rows that match up from both dataframes, only the rows from one dataframe).
Know the level of aggregation in your dataframes.
Check the number of rows of your dataframes before and after you join to make sure you did what you thought you did.
What does “level of aggregation” mean? By this, we just mean what each row of your data represents. For example, each row of df_clean
represents one or more of the same product from one transaction. Thus, the level of aggregation is product-transaction. On the other hand, each row in df_states
is a unique ZIP code, so the level of aggregation is ZIP code.
Once you know the level of aggregation of your two dataframes, you can ask the critical question: What level of aggregation do you want when you join these dataframes? You want to just add state names to rows that already exist in df_clean,
so your final dataframe should have exactly the number of rows that “df_clean” has right now. Keep that in mind. That does not sound so difficult, so what could go wrong? Well, a lot. Look at these potential problems:
- First, maybe
df_states
has duplicate ZIP codes. If this is true, then you might get duplicate rows indf_clean.
You could check this withdistinct(),
but we already have done this and all ZIP codes are unique. - The next problem is that you use the wrong join method. There are multiple ways to join things, and using the wrong method will cause problems.
- Using the wrong method could add rows that you don’t want because they are in the “df_states” dataframe but not in the
df_clean
dataframe. Alternatively, using the wrong method could take away rows from thedf_clean
dataframe because they are not in the state dataframe. Remember, you want to keep everything in thedf_clean
dataframe and add just states from the state dataframe.
There are many ways to join two dataframes in R. Use the methods from tidyverse. Search the internet and find the documentation from the package “dplyr” in tidyverse about mutating joins. The mutating joins add columns from the second dataframe to the first dataframe, matching rows based on the keys supplied in the function. Here are the different types you can choose from:
inner_join()
: includes all rows in the first dataframe and the second dataframe that match up using the variable you are matching onleft_join()
: includes all rows in the first dataframe and only those rows from the second dataframe that match up with the first dataframe using the variable you are matching onright_join()
: includes all rows in the second dataframe and only those rows from the first dataframe that match up with the second dataframe using the variable you are matching onfull_join()
: includes all rows in the first dataframe and all rows in the second dataframe
Take a minute and think about which one of these methods you want to use:
- Use the wrong join type first, just to illustrate why it is important to use the correct join method. Use the
full_join()
method to joindf_clean
anddf_states
and create a new dataframe calleddf_clean_wrong.
Match up these dataframes using thezip
column.
- Write a few sentences that discuss why this method of joining did not work correctly. Provide evidence about why it failed.
- Now, join the data correctly. What join type do you actually want? You want the
left_join()
method. This includes all rows from the first dataframe and only those rows from the second dataframe that match up with the first dataframe using the variable you are matching on. Use theleft_join()
method to joindf_clean
anddf_states
and create a new dataframe calleddf_clean_correct
. Match up these dataframes using thezip
column.
- Write
df_clean_correct
to theposts
folder in csv (comma separated value) format. - Write a few sentences to discuss why this method of joining worked correctly.