library(tidyverse) #only package needed in this practical
AthleteEventsDF <- as_tibble(read.csv("https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/ESeWdmN4mBZAph9ZeEqM_C4By1dnAk-XDtRQhroYAXiaNA?download=1"))
NOCRegionDF <- as_tibble(read.csv("https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/EWv6iX7hQflHtl4-uOfta38BmqKTE6wTL3KseCRaUiEymA?download=1"))
PopulationDF <- as_tibble(read.csv("https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/EXuHQBaqMwBHsvXx6j7tBccBEFUAMg3saPssLixJNejr4Q?download=1", check.names=FALSE))9 Practical 3b: Data preparation in R
During this practical you will recap data cleaning in R.
You will recap:
- Reading your data
- Data cleaning
- Merging your data (and how this is different to joining your data in Tableau)
9.1 Exercises
For the exercises you will require three data files which can be found here or alternatively you can read them in using the following links:
athlete_events2.csv: “https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/ESeWdmN4mBZAph9ZeEqM_C4By1dnAk-XDtRQhroYAXiaNA?download=1”
noc_regions.csv: “https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/EWv6iX7hQflHtl4-uOfta38BmqKTE6wTL3KseCRaUiEymA?download=1”
world_pop.csv: “https://strath-my.sharepoint.com/:x:/g/personal/xanne_janssen_strath_ac_uk/EXuHQBaqMwBHsvXx6j7tBccBEFUAMg3saPssLixJNejr4Q?download=1”
Exercise 1: Open the ‘athlete_events2.csv’, ‘noc_regions.csv’, and ‘world_pop.csv’ files in R name the dataframes AthleteEventsDF, NOCRegionDF, and PopulationDF.
Looking at the population table we see that this table isn’t tidy which will create issues further down so we need to pivot it to a long format using pivot_longer.
PopulationDF <-PopulationDF[,c(-3,-4)]
PopulationDF <-PopulationDF %>%
pivot_longer(cols=c(3:59),
names_to='Year',
values_to="Population")Exercise 2: Show the first 10 rows of AthleteEventsDF
head(AthleteEventsDF, 10)# A tibble: 10 × 14
ID Name Sex Age Height Weight NOC Games Year Season City Sport
<int> <chr> <chr> <int> <int> <dbl> <chr> <chr> <int> <chr> <chr> <chr>
1 1 A Dijia… M 24 180 80 CHN 1992… 1992 Summer Barc… Bask…
2 2 A Lamusi M 23 170 60 CHN 2012… 2012 Summer Lond… Judo
3 3 Gunnar … M 24 NA NA DEN 1920… 1920 Summer Antw… Foot…
4 4 Edgar L… M 34 NA NA DEN 1900… 1900 Summer Paris Tug-…
5 5 Christi… F 21 185 82 NED 1988… 1988 Winter Calg… Spee…
6 5 Christi… F 21 185 82 NED 1988… 1988 Winter Calg… Spee…
7 5 Christi… F 25 185 82 NED 1992… 1992 Winter Albe… Spee…
8 5 Christi… F 25 185 82 NED 1992… 1992 Winter Albe… Spee…
9 5 Christi… F 27 185 82 NED 1994… 1994 Winter Lill… Spee…
10 5 Christi… F 27 185 82 NED 1994… 1994 Winter Lill… Spee…
# ℹ 2 more variables: Event <chr>, Medal <chr>
From this we can see some athletes have multiple entries, this is important to remember (e.g. if we want to calculate average height we do not want to double count the same athlete). We can also scroll through the variables to see the variables names and the type of data entered.
Next up we start cleaning our data. First let’s check for missing values.
Exercise 3: Check and print the number of missing values per column.
head(AthleteEventsDF, 10)# A tibble: 10 × 14
ID Name Sex Age Height Weight NOC Games Year Season City Sport
<int> <chr> <chr> <int> <int> <dbl> <chr> <chr> <int> <chr> <chr> <chr>
1 1 A Dijia… M 24 180 80 CHN 1992… 1992 Summer Barc… Bask…
2 2 A Lamusi M 23 170 60 CHN 2012… 2012 Summer Lond… Judo
3 3 Gunnar … M 24 NA NA DEN 1920… 1920 Summer Antw… Foot…
4 4 Edgar L… M 34 NA NA DEN 1900… 1900 Summer Paris Tug-…
5 5 Christi… F 21 185 82 NED 1988… 1988 Winter Calg… Spee…
6 5 Christi… F 21 185 82 NED 1988… 1988 Winter Calg… Spee…
7 5 Christi… F 25 185 82 NED 1992… 1992 Winter Albe… Spee…
8 5 Christi… F 25 185 82 NED 1992… 1992 Winter Albe… Spee…
9 5 Christi… F 27 185 82 NED 1994… 1994 Winter Lill… Spee…
10 5 Christi… F 27 185 82 NED 1994… 1994 Winter Lill… Spee…
# ℹ 2 more variables: Event <chr>, Medal <chr>
The table above shows that a number of demographical variables have missing data (i.e. height, weight, and age). I have a feeling this may be related to the year of the event (i.e. no data collected during the Olympics back in the day). We will check this in a second. In addition, Medals has the highest number of missing data (231,333). This can be expected as only three medals are up for grabs per event so many athletes will not have won a medal. We could choose to replace the missing variables with No Medal.
Exercise 4: Check whether missing demographical data is related to the olympics year and replace all missing values within the Medals variable with No Medal.
# check if missing data is year related
AthleteEventsDF %>%
group_by(Year)%>%
summarize(MissingAge=sum(is.na(Age)),
MissingHeight=sum(is.na(Weight)),
MissingWeight=sum(is.na(Height)),
.groups="drop")# A tibble: 35 × 4
Year MissingAge MissingHeight MissingWeight
<int> <int> <int> <int>
1 1896 163 331 334
2 1900 790 1857 1820
3 1904 274 1154 1088
4 1906 743 1528 1476
5 1908 649 2618 2626
6 1912 156 3444 3319
7 1920 845 3821 3525
8 1924 1142 5003 4719
9 1928 963 4856 4599
10 1932 330 2771 2108
# ℹ 25 more rows
# replace N/A with "No Medal"
AthleteEventsDF$Medal[is.na(AthleteEventsDF$Medal)]<- "No Medal"
# could have also used the ifelse statement for this: Athlete_events$Medal <- ifelse(is.na(Athlete_events$Medal), "No Medal", Athlete_events$Medal)Next we would like to merge the athlete data with the NOC dataset so we can assign an NOC-region (i.e. country) to each athlete.
We need to merge the original dataset with the NOC dataset. Looking at both data tables we can see they both include the NOC code so we can use this as the primary joining ID. We will need to use a left join (all.x=TRUE) since we want all participating athletes to remain in the data even if their NOC-region is not found in the master.
MergedDataDF <- merge(AthleteEventsDF,NOCRegionDF,by="NOC", all.x=TRUE)By default all.x and all.y are set to FALSE, this results in an inner join. To apply a left join use all.x=TRUE, to use a right join use all.y=TRUE, to use a full join use all.x=TRUE AND all.y=TRUE
Now let’s check if there are any athletes with missing region data.
Exercise 6: Filter the merged data for missing region data and then list the unique NOC codes.
MissingRegionDF<-MergedDataDF %>%
filter(is.na(region))
unique(MissingRegionDF$NOC)[1] "ROT" "SGP" "TUV" "UNK"
We can now see that we have 4 NOC codes which do not have a region assigned to them (so they were not present in our master NOC sheet or they were named differently). We can manually look these codes up and assign the region names.
ROT = Refugee Olympic Athletes
SGP = Singapore
TUV = Tuvalu
UNK = Unknown
Let’s assign these values in the MergedDataDF.
Exercise 7: Replace the N/A values in region with the correct regions based on the NOC codes.
MergedDataDF<-MergedDataDF %>%
mutate(region=replace(region,NOC=="ROT", "Refugee Olympic Athletes"),
region=replace(region,NOC=="SGP", "Singapore"),
region=replace(region,NOC=="TUV", "Tuvalu"),
region=replace(region,NOC=="UNK", "Unknown"))
#Again this could also be done with an ifelse statement within the mutate function.Next we would like to rename region to country.
Exercise 8: Replace the region variable name with Country in the MergedDataDF and filter the MergedDataDF to only include data from 1986 onwards.
ColNumber <- which(colnames(MergedDataDF) == "region") #this is useful if you have a very large dataset and don't know the column number of the variable you want to rename.
colnames(MergedDataDF)[ColNumber]<-"Country"
MergedDataDF <- MergedDataDF %>%
filter(Year>=1986)Last lets save the MergedDataDF and Population dataframe as two .csv files.
Exercise 9: Save your MergedDataDF as a .csv file named “P3_output1” and save your PopulationDF as a .csv named “P3_output2”
write.csv(MergedDataDF,"C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1703/P3_output1.csv")
write.csv(PopulationDF,"C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1703/P3_output2.csv")