Data

Material based on Chapter 1 of Introduction to Modern Statistics and on A/B Testing in R

Learning objectives

  • Types of data collection

    • experimental vs observational
  • Types of variables

    • numeric vs categorical
  • Tidy data

  • Summary statistics

Medical case

Question: does the use of stents reduce the risk of strokes?

Data to address the question?

451 patients at risk of stroke were randomly assigned to 2 groups:

  • treatment: receive a stent (224 patients)

  • control: no stent (227 patients)

Data

Outcome of stent experiment

group      no event  stroke  Total
control    199       28      227
treatment  179       45      224


Does the use of stents reduce the risk of stroke?

Compare 2 summary statistics:

  • Proportion who had a stroke in the treatment (stent) group: \(45/(179 + 45) = 45/224 \approx 0.20 = 20\%.\)

  • Proportion who had a stroke in the control group: \(28/(199 + 28) = 28/227 \approx 0.12 = 12\%.\)

  • A summary statistic is a single number summarizing data from a sample.

group      no event  stroke
control    88%       12%
treatment  80%       20%
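The two stroke proportions above can be reproduced from the counts in the outcome table. This is a minimal sketch in Python; the dictionary layout and the helper name `stroke_proportion` are illustrative choices, not part of the source material.

```python
# Counts from the stent experiment outcome table.
counts = {
    "control": {"no event": 199, "stroke": 28},
    "treatment": {"no event": 179, "stroke": 45},
}


def stroke_proportion(group_counts):
    """Share of patients in one group who had a stroke."""
    total = sum(group_counts.values())
    return group_counts["stroke"] / total


p_control = stroke_proportion(counts["control"])
p_treatment = stroke_proportion(counts["treatment"])

print(f"control:   {p_control:.2f}")
print(f"treatment: {p_treatment:.2f}")
print(f"difference (treatment - control): {p_treatment - p_control:.2f}")
```

Each printed value is a summary statistic: one number computed from the sample of 451 patients.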

Your turn

How would you calculate 88% and 80% in the table?

group      no event  stroke
control    88%       12%
treatment  80%       20%

Types of variables

Tidy Data

A data frame where

  • each row is a unique case (observational unit)
  • each column is a variable
  • each cell is a single value
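To make the definition concrete, the stent outcome table (a table of counts) can be expanded into tidy form, with one row per patient. This is a rough sketch in Python; `patient_id` is a made-up label for illustration, since the source table carries no patient identifiers.

```python
from collections import Counter

# Counts from the stent experiment outcome table.
table = {
    ("control", "no event"): 199,
    ("control", "stroke"): 28,
    ("treatment", "no event"): 179,
    ("treatment", "stroke"): 45,
}

# Expand to one row per patient (the observational unit).
# Each row has one value per variable: patient_id, group, outcome.
# patient_id is hypothetical; it only numbers the rows.
patients = []
for (group, outcome), n in table.items():
    for _ in range(n):
        patients.append(
            {"patient_id": len(patients) + 1, "group": group, "outcome": outcome}
        )

print(len(patients))  # number of rows = number of patients
print(patients[0])

# Tabulating the tidy rows recovers the original counts.
recovered = Counter((p["group"], p["outcome"]) for p in patients)
```

The count table and the tidy rows hold the same information; the tidy form has one row per case and one column per variable.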

Explanatory and response variables

explanatory variable → might affect → response variable

Relationship between variables

  • associated variables: two variables that show some connection with one another

  • independent variables: not associated

Types of studies: observational vs experiment

  • observational: observe associations
    • collect information via surveys or company records, or follow a cohort
  • experimental: can infer causation when treatments are effectively randomized
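Random assignment in an experiment can be sketched as shuffling the units and splitting them into groups. The example below assumes the group sizes from the stent study (224 treatment, 227 control); the function name `randomize`, the patient ids 1 to 451, and the seed are placeholders, and the actual study's randomization procedure is not described in the source.

```python
import random


def randomize(unit_ids, n_treatment, seed=None):
    """Randomly split units into a treatment group and a control group."""
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    return {
        "treatment": sorted(shuffled[:n_treatment]),
        "control": sorted(shuffled[n_treatment:]),
    }


# Hypothetical ids 1..451; the seed only makes the sketch reproducible.
groups = randomize(range(1, 452), n_treatment=224, seed=1)
print(len(groups["treatment"]), len(groups["control"]))
```

Because chance alone decides who gets the treatment, the two groups should be similar on average in everything except the treatment, which is what supports a causal conclusion.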

Your turn

12. Smoking habits of UK residents. A survey was conducted to study the smoking habits of 1,691 UK residents.

  • Is this observational or experimental data collection?

  • What does each row of the data frame represent?

  • How many participants were included in the survey?

  • Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

gender  age  marital_status  highest_qualification  nationality  ethnicity  gross_income      region     smoke  amt_weekends  amt_weekdays  type
Male    38   Divorced        No Qualification       British      White      2,600 to 5,200    The North  No     NA            NA
Female  42   Single          No Qualification       British      White      Under 2,600       The North  Yes    12            12            Packets
Male    40   Married         Degree                 English      White      28,600 to 36,400  The North  No     NA            NA
Female  40   Married         Degree                 English      White      10,400 to 15,600  The North  No     NA            NA
Female  39   Married         GCSE/O Level           British      White      2,600 to 5,200    The North  No     NA            NA
Female  37   Married         GCSE/O Level           British      White      15,600 to 20,800  The North  No     NA            NA

Experiment: online wine retailer

Wine retailer email test data

Data

user_id  cpgn_id    group    email  open  click  purch
1000001  1901Email  ctrl     FALSE  0     0      0.00
1000002  1901Email  email_B  TRUE   1     0      0.00
1000003  1901Email  email_A  TRUE   1     1      200.51


chard   sav_blanc  syrah  cab    past_purch  days_since  visits
0.00    0          33.94  0.00   33.94       119         11
0.00    0          16.23  76.31  92.54       60          3
516.39  0          16.63  0.00   533.02      9           9
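A quick consistency check helps when getting to know a data frame. In the three rows shown, `past_purch` appears to equal the sum of the four wine columns. The sketch below tests this in Python; that `past_purch` is defined as this sum is an assumption based only on these three rows, and the helper name `matches_total` is illustrative.

```python
import math

# The three rows shown in the wine retailer tables above.
rows = [
    {"user_id": 1000001, "chard": 0.00, "sav_blanc": 0.0, "syrah": 33.94,
     "cab": 0.00, "past_purch": 33.94},
    {"user_id": 1000002, "chard": 0.00, "sav_blanc": 0.0, "syrah": 16.23,
     "cab": 76.31, "past_purch": 92.54},
    {"user_id": 1000003, "chard": 516.39, "sav_blanc": 0.0, "syrah": 16.63,
     "cab": 0.00, "past_purch": 533.02},
]

WINE_COLUMNS = ["chard", "sav_blanc", "syrah", "cab"]


def matches_total(row):
    """True if the four wine columns add up to past_purch (to the cent)."""
    total = sum(row[col] for col in WINE_COLUMNS)
    # Compare with a tolerance because the amounts are floating-point values.
    return math.isclose(total, row["past_purch"], abs_tol=0.005)


checks = {row["user_id"]: matches_total(row) for row in rows}
print(checks)
```

If the relationship holds across the full data set, `past_purch` carries no information beyond the four wine columns.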

Your turn

names       var type
user_id
cpgn_id
group
email
open
click
purch
chard
sav_blanc
syrah
cab
past_purch
days_since
visits

Summary

  • tidy data

  • variables

    • explanatory vs response variables
    • associated vs independent
    • type of variable
      • numeric: discrete vs continuous
      • categorical: nominal vs ordinal
  • summary statistic

Quiz

  • two questions
  • there may be multiple correct answers
  • use the sample presented in the question (not the entire sample)