Analytics mindset case studies

ETL in R, part 1

Authors

© 2022 Ernst & Young Foundation (US). SCORE no. 17924-221US_2

modified by E Stanny

Background

Many business analysts claim that 80% of their job is extracting data from various systems, transforming it so that it can be analyzed and loading it into the analysis tool that will be used for analysis. These procedures are typically called extract, transform and load (ETL) procedures, but they can also be referred to as data wrangling, data munging or just cleaning data, among others. Getting data ready to analyze takes significant time, resources and expertise. This case will give you hands-on practice with the ETL process.

In this case, you will be performing ETL procedures on transactional data for Swift Arrow convenience stores (Swift). Swift operates in over 160 cities in the US and has over 180 stores. Note that while Swift is a fictional company, the data is based on a real company. Modifications have been made only to store addresses (e.g., store address information is oriented to a post office rather than the actual store address).

Tools

The programming language R is commonly used for data analytics, and particularly for statistical analysis. It is generally seen as a relatively “learnable” programming language. There are many ways to use R. We recommend the integrated development environment (IDE) RStudio. This case assumes basic knowledge of R and RStudio. Your instructor will give you further guidance regarding your tools to use for this case.

Data

Your data set for this case is provided in two CSV files:

  • ey_swift.csv
  • ey_states.csv

The Swift file includes a random selection of customer transactions from Swift’s convenience stores for the years 2017 through 2019. These transactions include things like purchases of fuel, lottery tickets, soda and candy. The provided data set includes a very small sample of the transactions during that period. A full data set would be over 1450 times larger than this and would consist of hundreds of millions of transactions. A smaller data set is provided to teach basic principles that could be applied to the full data set without requiring you to purchase a very fast computer to process the data in a timely manner. The following table describes each column contained in the Swift data set.

Data Definitions for Swift dataset
Data_field Description

unique_id

The unique identifier for each row that has no duplicates and no nulls/omissions. Each row represents a single product purchased as part of a single transaction. Thus, the data is organized at the product by transaction level. For example, if you were to enter a store and purchase a bag of chips, a hot dog and an air freshener in one transaction, it would be recorded as three unique lines in the data set — one line for the purchase of each product. Example: 1161545

transaction_id

An identifier that is unique to each transaction and, therefore, has duplicates in the data set because some transactions have more than one product purchased as referenced by the unique_ID. Example: 20180417|433|2|1|4888093

unformatted_date

The unformatted date of the transaction in the years 2017 through 2019. Example: 2018-07-11

customer_id

A customer that participates in the customer loyalty program is assigned a loyalty card with a number as a unique identifier. When the customer uses this card at purchase, the number is populated for the transaction. Transactions with a null value represent customers that are not loyalty customers or those that made their purchase without providing their loyalty card. Example: 5362

product_name

The name of the product sold. There are over 9,000 different products. Example: DORITO SMPLY WHT CHDDR 2.5OZ

category_name

The name of the product category. There are over 200 different categories of products. Example: Salty Snacks -tort/corn Chips (152)

parent_name

The name identifier for over 80 parent categories for the products. The parent category is the top of the hierarchy under which is the category name and then the product name. Example: Salty Snacks

site_id

A numeric unique identifier for each store that pertains to the site_name. There are over 180 different stores. Example: 197

site_name

The name of the store for the site_id. Example: 433 Bay Minette

address

The street address of the store. Example: 701 Mcmeans Ave

city

The city location for the store. Stores are spread across more than 160 different cities within the US. Example: Bay Minette

zip

The postal ZIP code for the store. Example: 36507

latitude

The latitude coordinate for the store. Example: 40.1

longitude

The longitude coordinate for the store. Example: -92.4

revenue

The revenue earned for each product. This is typically how much the customer paid for the product (a positive number), but it can also represent how much a customer received for a product (negative number), i.e., a winning lottery ticket. Example: 1.69

costs

The direct cost of the product. Example: 0.511

gross_profit

The gross profit of the product calculated by subtracting the costs from the revenue. Example: 1.181

units

The number of units of the product in the transaction. This is generally one unit with the major exception being fuel, for which units represent the gallons of fuel sold. Example: 1

gp_margin

The gross profit margin for the product calculated as gross_profit divided by the revenue. Gross profit margin represents the percentage of profit for each dollar of revenue. It is an important metric for measuring the profitability of the overall business and individual product categories. Example: 0.3021

1 Note: that costs, gross_profit and gp_margin have several thousand NAs — missing values. These are generally for abnormal and fairly rare transactions, like prepaid fuel, money orders, coupons, etc.

The States file includes a select list of US states by postal ZIP code. The following table describes each column contained in the data set.

Data Definitions for State dataset
Data_field Description

postal_code

The postal ZIP code.

state_province

The full name of the state based on the ZIP code.

state_province_code

The two-letter abbreviated name of the state based on the ZIP code.

country_name

The abbreviated code of the name of the country for the state based on the ZIP code. These are all USA.

Deliverable 1

In order to start cleaning this data, you need to load the data into the R program you are using. But, before you do that, you need to load a package that you will use later on.

Required

  • Load the packages “tidyverse” and “skimr”.

  • Load the transactions data from a comma separated file (csv) into the object swift. Use the function read_csv().

    • swift <- read_csv(“https://bus320-quarto.netlify.app/data/ey_swift.csv”)

Deliverable 2

You will now examine the Swift data to get a better understanding of it to help draw initial conclusions about the data in regard to its usefulness and necessary cleansing.

Required Use the following specific functions to examine the Swift data:

  • glimpse(): This gives the overall structure of the data. This is a common function to use, but note that you could just see this in the environment tab in RStudio on the right side of the screen.
  • skim(): This gives descriptive statistics of each column in the dataframe.
  • slice_head() – show 10 rows: This prints the first few rows of the dataframe.
  • slice_tail() – show 10 rows: This prints the last 10 rows of the dataframe.
  • slice_sample() – show 50 rows: This prints a random selection of 50 rows from the dataframe.

After examining the data based on the results of the functions above, provide your answers to the following questions:

  1. Can you identify three analyses that you might find beneficial to perform for Swift? Which Swift data columns (also known as features) would you utilize for each analysis?

  2. Which aspects of the data do you think need to be cleansed? Explain why.

Deliverable 3

You can perform operations on the date

Required

  1. Load the lubridate package.

  2. To see what these strings look like, use the select function to look at only the unformatted_date column (not the whole dataframe, just that column). Use the select function to rename unformatted_date to date.

  3. Use the mutate function to create a new column called year using the mutate() function. The mutate() function is a handy way to create new variables. Look up how it works within RStudio since you will use it again. Specifically, you need to use two functions, nested together, to create this new variable:

    • The year() function
    • The mutate() function

Assign the output to swift_df. Print swift_df.