Analytics mindset case studies

ETL in R, part 3

Authors

© 2022 Ernst & Young Foundation (US). SCORE no. 17924-221US_2

modified by E Stanny

Deliverable 6: Preparing to join Swift and States Data Sets

Objective: Prepare to join Swift data with states data to include a state column based on ZIP codes.

a. Data Loading and Examination

  1. Import Data: Import the ey_swift.csv file into your R environment using the read_csv() function. Name the dataframe swift.

    • swift <- read_csv("http://bus320-quarto.netlify.app/data/ey_swift.csv")

    • Delete product “CASH COUNTDOWN 475” and assign output to df_clean

    • df_clean <- swift |> filter(product_name != "CASH COUNTDOWN 475")

  2. Import Data: Import the ey_states.csv file into your R environment using the read_csv() function. Name the resultant dataframe df_states.

  3. Examine df_states:

    • Use the glimpse() function to review its structure.
    • Examine column-level statistics with the skim() function.
    • select and arrange() the postal_code column in both ascending and descending orders
  4. Examine df_clean:

    • Use the glimpse() function to review its structure.
    • Examine column-level statistics with the skim() function.
    • select and arrange() the zip column in ascending and descending order.
  5. Question Response: Write a few sentences detailing any potential challenges you might face when merging these two dataframes.

b. Preparing ZIP Column in df_clean for Joining

  1. Use mutate() and as.character() to convert zip in df_clean to character type, saving as zip save output to df_clean.

  2. Use select() and skim() to examine zip in df_clean.

c. Preparing ZIP Column in df_states for Joining

  1. Modify the postal_code column in df_states. Use mutate() and str_sub() to remove “+four codes” and ‘-’ in ensuring postal_code retains only the initial five characters. Call the new variable zip.

  2. Use select() and skim() to examine zip in df_states.

d. Verify Formatting

  1. Use select() and arrange in descending order to confirm the removal of redundant characters from the zip column in df_states.

e. Select Relevant Columns in df_states

  1. Use select() to keep only zip and state_province with zip positioned first.

  2. Show df_states

f. Clean and Export the Dataframes

  1. Write the modified dataframes df_states and df_clean to .csv files, saving them in the “posts” directory.

    • First install the package here
    • write_csv(df_states, here::here("posts", "df_states.csv"))
    • write_csv(df_clean, here::here("posts", "df_clean.csv"))

Note: After completing these steps, you’ll be ready to join these datasets by zip.