Data Definitions for Swift dataset | |
---|---|
Data_field | Description |
unique_id |
The unique identifier for each row that has no duplicates and no nulls/omissions. Each row represents a single product purchased as part of a single transaction. Thus, the data is organized at the product by transaction level. For example, if you were to enter a store and purchase a bag of chips, a hot dog and an air freshener in one transaction, it would be recorded as three unique lines in the data set — one line for the purchase of each product. Example: 1161545 |
transaction_id |
An identifier that is unique to each transaction and, therefore, has duplicates in the data set because some transactions have more than one product purchased as referenced by the unique_ID. Example: 20180417|433|2|1|4888093 |
unformatted_date |
The unformatted date of the transaction in the years 2017 through 2019. Example: 2018-07-11 |
customer_id |
A customer that participates in the customer loyalty program is assigned a loyalty card with a number as a unique identifier. When the customer uses this card at purchase, the number is populated for the transaction. Transactions with a null value represent customers that are not loyalty customers or those that made their purchase without providing their loyalty card. Example: 5362 |
product_name |
The name of the product sold. There are over 9,000 different products. Example: DORITO SMPLY WHT CHDDR 2.5OZ |
category_name |
The name of the product category. There are over 200 different categories of products. Example: Salty Snacks -tort/corn Chips (152) |
parent_name |
The name identifier for over 80 parent categories for the products. The parent category is the top of the hierarchy under which is the category name and then the product name. Example: Salty Snacks |
site_id |
A numeric unique identifier for each store that pertains to the site_name. There are over 180 different stores. Example: 197 |
site_name |
The name of the store for the site_id. Example: 433 Bay Minette |
address |
The street address of the store. Example: 701 Mcmeans Ave |
city |
The city location for the store. Stores are spread across more than 160 different cities within the US. Example: Bay Minette |
zip |
The postal ZIP code for the store. Example: 36507 |
latitude |
The latitude coordinate for the store. Example: 40.1 |
longitude |
The longitude coordinate for the store. Example: -92.4 |
revenue |
The revenue earned for each product. This is typically how much the customer paid for the product (a positive number), but it can also represent how much a customer received for a product (negative number), i.e., a winning lottery ticket. Example: 1.69 |
costs |
The direct cost of the product. Example: 0.511 |
gross_profit |
The gross profit of the product calculated by subtracting the costs from the revenue. Example: 1.181 |
units |
The number of units of the product in the transaction. This is generally one unit with the major exception being fuel, for which units represent the gallons of fuel sold. Example: 1 |
gp_margin |
The gross profit margin for the product calculated as gross_profit divided by the revenue. Gross profit margin represents the percentage of profit for each dollar of revenue. It is an important metric for measuring the profitability of the overall business and individual product categories. Example: 0.3021 |
1 Note: that costs, gross_profit and gp_margin have several thousand NAs — missing values. These are generally for abnormal and fairly rare transactions, like prepaid fuel, money orders, coupons, etc. |
Analytics mindset case studies
ETL in R, part 1
Background
Many business analysts claim that 80% of their job is extracting data from various systems, transforming it so that it can be analyzed and loading it into the analysis tool that will be used for analysis. These procedures are typically called extract, transform and load (ETL) procedures, but they can also be referred to as data wrangling, data munging or just cleaning data, among others. Getting data ready to analyze takes significant time, resources and expertise. This case will give you hands-on practice with the ETL process.
In this case, you will be performing ETL procedures on transactional data for Swift Arrow convenience stores (Swift). Swift operates in over 160 cities in the US and has over 180 stores. Note that while Swift is a fictional company, the data is based on a real company. Modifications have been made only to store addresses (e.g., store address information is oriented to a post office rather than the actual store address).
Tools
The programming language R is commonly used for data analytics, and particularly for statistical analysis. It is generally seen as a relatively “learnable” programming language. There are many ways to use R. We recommend the integrated development environment (IDE) RStudio. This case assumes basic knowledge of R and RStudio. Your instructor will give you further guidance regarding your tools to use for this case.
Data
Your data set for this case is provided in two CSV files:
- ey_swift.csv
- ey_states.csv
The Swift file includes a random selection of customer transactions from Swift’s convenience stores for the years 2017 through 2019. These transactions include things like purchases of fuel, lottery tickets, soda and candy. The provided data set includes a very small sample of the transactions during that period. A full data set would be over 1450 times larger than this and would consist of hundreds of millions of transactions. A smaller data set is provided to teach basic principles that could be applied to the full data set without requiring you to purchase a very fast computer to process the data in a timely manner. The following table describes each column contained in the Swift data set.
The States file includes a select list of US states by postal ZIP code. The following table describes each column contained in the data set.
Data Definitions for State dataset | |
---|---|
Data_field | Description |
postal_code |
The postal ZIP code. |
state_province |
The full name of the state based on the ZIP code. |
state_province_code |
The two-letter abbreviated name of the state based on the ZIP code. |
country_name |
The abbreviated code of the name of the country for the state based on the ZIP code. These are all USA. |
Deliverable 1
In order to start cleaning this data, you need to load the data into the R program you are using. But, before you do that, you need to load a package that you will use later on.
Required
Load the packages “tidyverse” and “skimr”.
Load the transactions data from a comma separated file (csv) into the object swift. Use the function
read_csv().
- swift <- read_csv(“https://bus320-quarto.netlify.app/data/ey_swift.csv”)
Deliverable 2
You will now examine the Swift data to get a better understanding of it to help draw initial conclusions about the data in regard to its usefulness and necessary cleansing.
Required Use the following specific functions to examine the Swift data:
- glimpse(): This gives the overall structure of the data. This is a common function to use, but note that you could just see this in the environment tab in RStudio on the right side of the screen.
- skim(): This gives descriptive statistics of each column in the dataframe.
- slice_head() – show 10 rows: This prints the first few rows of the dataframe.
- slice_tail() – show 10 rows: This prints the last 10 rows of the dataframe.
- slice_sample() – show 50 rows: This prints a random selection of 50 rows from the dataframe.
After examining the data based on the results of the functions above, provide your answers to the following questions:
Can you identify three analyses that you might find beneficial to perform for Swift? Which Swift data columns (also known as features) would you utilize for each analysis?
Which aspects of the data do you think need to be cleansed? Explain why.
Deliverable 3
You can perform operations on the date
Required
Load the lubridate package.
To see what these strings look like, use the
select
function to look at only theunformatted_date
column (not the whole dataframe, just that column). Use the select function to rename unformatted_date to date.Use the mutate function to create a new column called
year
using themutate()
function. Themutate()
function is a handy way to create new variables. Look up how it works within RStudio since you will use it again. Specifically, you need to use two functions, nested together, to create this new variable:- The
year()
function - The
mutate()
function
- The
Assign the output to swift_df. Print swift_df.