Analytics mindset case studies

ETL in R, part 2

Authors

© 2022 Ernst & Young Foundation (US). SCORE no. 17924-221US_2

modified by E Stanny

Deliverable 4

Now that you have a column that RStudio recognizes as a date, you can easily create additional useful columns that break the date into additional fields, such as year and month.

Required

  • Using the mutate() function and the appropriate functions from the lubridate package, create a new column called month that extracts the month from date.

  • Make sure your month column returns the months by their abbreviated names (Jan, Feb, etc.) and not by their numbers.

  • Look at 10 random rows from the dataframe to make sure you created the column correctly.

Deliverable 5

When cleansing your data, it is important to make sure there are no issues with data format, no missing data, data outliers or other data issues that would be problematic. You will practice these skills by examining the revenue in the Swift data, since revenue is critical to any business. Specifically, below you will look for missing values, data format issues and outliers in the revenue column.

Required

  1. Make certain that the values are all numbers in this column as you expect revenue to be a number. You could check this by looking in the Environment pane/tab in RStudio, but instead use the glimpse() function to look at just the revenue feature/column. What format is the data in? Write a sentence about how you know that.

  2. Make certain that revenue has no missing values and initially assess whether there might be really high or low values, that is, any outliers. You can check that with the skim() function (from the skimr package). Use the select function to look at just the revenue feature/column.

    1. Are there any missing values? Write a sentence about how you learned that answer.
    2. What are the maximum and minimum values for the column? Are these in line with your expectations? What additional steps could you take to gain even more perspective about the reasonableness of the values?
  3. Further analyze your data set for the products with the highest revenue to make certain the revenue meets your expectations and is not an outlier. Sort the data set in the revenue column and display the first 25 rows of the dataframe after it is sorted so that the highest revenue is at the top (sort the dataframe by the revenue column from highest to lowest). Is there any product that has a revenue value that does not appear reasonable to you? Explain why or why not.

  4. When missing data or outliers are identified, you would typically investigate to find out why the revenue is misstated (the root cause of the problem) and then fix the problem in the data. However, for the purposes of this case, you will just eliminate these instead. There are several ways to do this, but for our purposes, just delete all of these rows. To do this, first, create a new dataframe called df_clean and use the filter() function to remove this product. The filter() function is a useful function from tidyverse that easily subsets the data. Specifically, you can use it to select only a subset of rows based upon a condition. Look it up in the help section to better understand it. Then, use it to select only the rows of the dataframe that do not have this product in them. Finally, check the distribution of the revenue columns now using the skim() function again and discuss whether and why it seems more reasonable now.