9 Practical 3b: Data preparation in R

During this practical you will recap data cleaning in R.

You will recap:

Reading your data
Data cleaning
Merging your data (and how this is different to joining your data in Tableau)

9.1 Exercises

For the exercises you will require three data files which can be found here or alternatively you can read them in using the following links:

athlete_events2.csv: “https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/ESeWdmN4mBZAph9ZeEqM_C4By1dnAk-XDtRQhroYAXiaNA?download=1”
noc_regions.csv: “https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/EWv6iX7hQflHtl4-uOfta38BmqKTE6wTL3KseCRaUiEymA?download=1”
world_pop.csv: “https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/EXuHQBaqMwBHsvXx6j7tBccBEFUAMg3saPssLixJNejr4Q?download=1”

Exercise 1: Open the ‘athlete_events2.csv’, ‘noc_regions.csv’, and ‘world_pop.csv’ files in R name the dataframes AthleteEventsDF, NOCRegionDF, and PopulationDF.

Show the answer

library(tidyverse) #only package needed in this practical

AthleteEventsDF <- as_tibble(read.csv("https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/ESeWdmN4mBZAph9ZeEqM_C4By1dnAk-XDtRQhroYAXiaNA?download=1"))

NOCRegionDF <- as_tibble(read.csv("https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/EWv6iX7hQflHtl4-uOfta38BmqKTE6wTL3KseCRaUiEymA?download=1"))

PopulationDF <- as_tibble(read.csv("https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/EXuHQBaqMwBHsvXx6j7tBccBEFUAMg3saPssLixJNejr4Q?download=1", check.names=FALSE))

Looking at the population table we see that this table isn’t tidy which will create issues further down so we need to pivot it to a long format using pivot_longer.

Show the code

PopulationDF <-PopulationDF[,c(-3,-4)]
PopulationDF <-PopulationDF %>%
  pivot_longer(cols=c(3:59),
               names_to='Year',
               values_to="Population")

Exercise 2: Show the first 10 rows of AthleteEventsDF

Show the answer

head(AthleteEventsDF, 10)

# A tibble: 10 × 14
      ID Name     Sex     Age Height Weight NOC   Games  Year Season City  Sport
   <int> <chr>    <chr> <int>  <int>  <dbl> <chr> <chr> <int> <chr>  <chr> <chr>
 1     1 A Dijia… M        24    180     80 CHN   1992…  1992 Summer Barc… Bask…
 2     2 A Lamusi M        23    170     60 CHN   2012…  2012 Summer Lond… Judo 
 3     3 Gunnar … M        24     NA     NA DEN   1920…  1920 Summer Antw… Foot…
 4     4 Edgar L… M        34     NA     NA DEN   1900…  1900 Summer Paris Tug-…
 5     5 Christi… F        21    185     82 NED   1988…  1988 Winter Calg… Spee…
 6     5 Christi… F        21    185     82 NED   1988…  1988 Winter Calg… Spee…
 7     5 Christi… F        25    185     82 NED   1992…  1992 Winter Albe… Spee…
 8     5 Christi… F        25    185     82 NED   1992…  1992 Winter Albe… Spee…
 9     5 Christi… F        27    185     82 NED   1994…  1994 Winter Lill… Spee…
10     5 Christi… F        27    185     82 NED   1994…  1994 Winter Lill… Spee…
# ℹ 2 more variables: Event <chr>, Medal <chr>

From this we can see some athletes have multiple entries, this is important to remember (e.g. if we want to calculate average height we do not want to double count the same athlete). We can also scroll through the variables to see the variables names and the type of data entered.

Next up we start cleaning our data. First let’s check for missing values.

Exercise 3: Check and print the number of missing values per column.

Show the answer

head(AthleteEventsDF, 10)

# A tibble: 10 × 14
      ID Name     Sex     Age Height Weight NOC   Games  Year Season City  Sport
   <int> <chr>    <chr> <int>  <int>  <dbl> <chr> <chr> <int> <chr>  <chr> <chr>
 1     1 A Dijia… M        24    180     80 CHN   1992…  1992 Summer Barc… Bask…
 2     2 A Lamusi M        23    170     60 CHN   2012…  2012 Summer Lond… Judo 
 3     3 Gunnar … M        24     NA     NA DEN   1920…  1920 Summer Antw… Foot…
 4     4 Edgar L… M        34     NA     NA DEN   1900…  1900 Summer Paris Tug-…
 5     5 Christi… F        21    185     82 NED   1988…  1988 Winter Calg… Spee…
 6     5 Christi… F        21    185     82 NED   1988…  1988 Winter Calg… Spee…
 7     5 Christi… F        25    185     82 NED   1992…  1992 Winter Albe… Spee…
 8     5 Christi… F        25    185     82 NED   1992…  1992 Winter Albe… Spee…
 9     5 Christi… F        27    185     82 NED   1994…  1994 Winter Lill… Spee…
10     5 Christi… F        27    185     82 NED   1994…  1994 Winter Lill… Spee…
# ℹ 2 more variables: Event <chr>, Medal <chr>

The table above shows that a number of demographical variables have missing data (i.e. height, weight, and age). I have a feeling this may be related to the year of the event (i.e. no data collected during the Olympics back in the day). We will check this in a second. In addition, Medals has the highest number of missing data (231,333). This can be expected as only three medals are up for grabs per event so many athletes will not have won a medal. We could choose to replace the missing variables with No Medal.

Exercise 4: Check whether missing demographical data is related to the olympics year and replace all missing values within the Medals variable with No Medal.

Show the answer

# check if missing data is year related
AthleteEventsDF %>% 
  group_by(Year)%>%
  summarize(MissingAge=sum(is.na(Age)),
          MissingHeight=sum(is.na(Weight)),
          MissingWeight=sum(is.na(Height)),
          .groups="drop")

# A tibble: 35 × 4
    Year MissingAge MissingHeight MissingWeight
   <int>      <int>         <int>         <int>
 1  1896        163           331           334
 2  1900        790          1857          1820
 3  1904        274          1154          1088
 4  1906        743          1528          1476
 5  1908        649          2618          2626
 6  1912        156          3444          3319
 7  1920        845          3821          3525
 8  1924       1142          5003          4719
 9  1928        963          4856          4599
10  1932        330          2771          2108
# ℹ 25 more rows

# replace N/A with "No Medal"
AthleteEventsDF$Medal[is.na(AthleteEventsDF$Medal)]<- "No Medal"

# could have also used the ifelse statement for this: Athlete_events$Medal <- ifelse(is.na(Athlete_events$Medal), "No Medal", Athlete_events$Medal)

Next we would like to merge the athlete data with the NOC dataset so we can assign an NOC-region (i.e. country) to each athlete.

We need to merge the original dataset with the NOC dataset. Looking at both data tables we can see they both include the NOC code so we can use this as the primary joining ID. We will need to use a left join (all.x=TRUE) since we want all participating athletes to remain in the data even if their NOC-region is not found in the master.

Show the code

MergedDataDF <- merge(AthleteEventsDF,NOCRegionDF,by="NOC", all.x=TRUE)

Note

By default all.x and all.y are set to FALSE, this results in an inner join. To apply a left join use all.x=TRUE, to use a right join use all.y=TRUE, to use a full join use all.x=TRUE AND all.y=TRUE

Now let’s check if there are any athletes with missing region data.

Exercise 6: Filter the merged data for missing region data and then list the unique NOC codes.

Show the answer

MissingRegionDF<-MergedDataDF %>%
  filter(is.na(region))
  
unique(MissingRegionDF$NOC)

[1] "ROT" "SGP" "TUV" "UNK"

We can now see that we have 4 NOC codes which do not have a region assigned to them (so they were not present in our master NOC sheet or they were named differently). We can manually look these codes up and assign the region names.

ROT = Refugee Olympic Athletes
SGP = Singapore
TUV = Tuvalu
UNK = Unknown

Let’s assign these values in the MergedDataDF.

Exercise 7: Replace the N/A values in region with the correct regions based on the NOC codes.

Show the answer

MergedDataDF<-MergedDataDF %>%
  mutate(region=replace(region,NOC=="ROT", "Refugee Olympic Athletes"),
         region=replace(region,NOC=="SGP", "Singapore"),
         region=replace(region,NOC=="TUV", "Tuvalu"),
         region=replace(region,NOC=="UNK", "Unknown"))

#Again this could also be done with an ifelse statement within the mutate function.

Next we would like to rename region to country.

Exercise 8: Replace the region variable name with Country in the MergedDataDF and filter the MergedDataDF to only include data from 1986 onwards.

Show the answer

ColNumber <- which(colnames(MergedDataDF) == "region") #this is useful if you have a very large dataset and don't know the column number of the variable you want to rename.
colnames(MergedDataDF)[ColNumber]<-"Country"
MergedDataDF <- MergedDataDF %>%
  filter(Year>=1986)

Last lets save the MergedDataDF and Population dataframe as two .csv files.

Exercise 9: Save your MergedDataDF as a .csv file named “P3_output1” and save your PopulationDF as a .csv named “P3_output2”

Show the answer

write.csv(MergedDataDF,"C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1703/P3_output1.csv")

write.csv(PopulationDF,"C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1703/P3_output2.csv")