7 Practical 2: Data preparation in Tableau
During this practical we will go over cleaning and preparing our data in Tableau Prep. Firstly, I will demonstrate how to use Tableau Prep after which you will get hands on yourself. We will go over:
Reading your data
Data cleaning
Merging your data (and how this is different to joining your data in Tableau)
During the demonstration we will use seven data files which you can find here.
7.1 Exercises
For the exercises you will use three data files which can be found here.
Exercise 1: Download the ‘athlete_events2.csv’, ‘noc_regions.csv’, and ‘world_pop.csv’ files from myplace if you haven’t already done so and open them up in Tableau Prep.
If the video above does not work click here
- To open a file - click on the little arrow on the left hand side which opens the connection pane.
- Click on “Text file” as we are trying to open .csv files.
- Load in the ‘athlete_events2.csv’, ‘noc_regions.csv’, and ‘world_pop.csv’ files.
Exercise 2: Show the first rows of athlete_events2 by adding a clean step.
If the video above does not work click here
- Add a clean step to “athletes_events2” by clicking on the little “+” symbol followed by “+ Clean Step”.
From this you may have noticed some athletes have multiple entries, this is important to remember (e.g. if we want to calculate average height we do not want to double count the same athlete). We can also scroll through the variables to see the variables names and the type of data entered.
First let’s ensure all data types are correct.
Exercise 3: Check and where relevant change the data types for each variable in athlete_events2.
If the video above does not work click here
You should have changed:
- Age to numerical (click on “abc” and change to “# Number (whole)”)
- Height to numerical (click on “abc” and change to “# Number (decimal)”)
- Weight to numerical (click on “abc” and change to “# Number (decimal)”)
- Year to date (optional - you can leave it as numerical; if you change it click on “abc” and change to “Date”)
Next we want to take note of the missing values.
Exercise 4: Check for missing values (in Tableau they are often indicated as null or NA). Which variables in the athlete_events2 file have missing values?
If the video above does not work click here
Missing values show as “Null” or “NA” and were found in:
- Age
- Height
- Weight
- Medal
You may now have seen that there are several variables with missing values. I have a feeling this may be related to the year of the event (i.e. no data collected during the Olympics back in the day. We may want to check this later on in our analysis. In addition, medals has the highest number of missing data (231,333). This can be expected as only three medals are up for grabs per event so many athletes will not have won a medal. We could choose to replace the missing variables with “No Medal”.
Exercise 5: Replace all missing values within the ‘Medals’ variable with “No Medal”.
If the video above does not work click here
To replace a value with a different value:
- Click on the value you want to replace (in this case “NA”)
- Type new variable value (in this case “No Medal”)
- Hit Enter
Next we would like to merge the athlete data with the NOC dataset so we can assign an NOC-region (i.e. country) to each athlete.
Exercise 6: Add a clean step to noc_regions
Looking at both data tables we can see they both include the NOC code so we can use this as the primary joining ID. We want all participating athletes to remain in the data even if their NOC-region is not found in the master.
Exercise 7: Merge the two data files together using the correct join.
If the video above does not work click here
To join two files:
- Pull the clean sheet of one of the files into the “+” symbol of the other sheet (“join” should show up).
- Check which file is located on the left and which file is located on right in the
Applied Join Clausessection (in the video ourathlete_events2data is located on the left and thenoc_regionson the right. This is important for setting the correct join type. - Select the join type. As we want all data from the
athlete_events2dataset to be included but only those noc_regions of which we have athletes in theathlete_events2dataset we will select a left join (if yourathlete_events2dataset is located on the right you would have selected a right join). - Next check any NOC codes which are mismatched. Tick the box next to
Join Clauses, you will see that “SGP” and “SIN” have not been assigned a matching value. We now both refer to Singapore so we will change one of them so they match (I chose to change SIN to SGP but you could have changed SGP to SIN). - Last we will add a clean step to our join.
In the video above you will have seen that I corrected a inconsistency in NOC codes, with one file indicating Singapore as SIN and another as SGP. Tableau makes it very easy to correct this and ensure your join works perfect.
Now let’s check our joined data and remove any duplicate columns and make sure there isn’t any missing data which we can correct. Looking at the NOC and Region variables, we can see that we have 3 NOC codes which did not have a region assigned to them (so NOC master sheet had missing region data). We can manually look these codes up and assign the region names.
ROT = Refugee Olympic Athletes
TUV = Tuvula
UNK = Unknown
Exercise 8: Create a calculated field which creates a new variable called Region. When done make sure the ROT, TUV and UNK NOC-codes have a region assigned to them.
If the video above does not work click here
- First check which NOC-codes have missing data (i.e. NA) by searching for NA in the region field (click on little magnifier symbol to search).
- Now let’s create a calculated field by clicking on the three dots within
Regionand selectingCreate Calculated Field-Custom Calculation - Name your variable “Region calculated”
- Enter IF[NOC]=“ROT” THEN “Refuge Olympic Team” ELSEIF [NOC]=“TOV” THEN “Tuvula” ELSEIF NOC=“UNK” THEN “Unknown” ELSE [region]END
- Click Save
- Last remove the duplicated and unused variables
NOC-1,region, andNotesby clicking on the three dots followed by remove.
Last we would like to rename region to country.
Exercise 9: Replace the region variable name with Country.
Now we have finished with the athletes_events data and the noc_regions data we will quickly check the world_pop data set. This dataset is currently set up in a wide format but both Tableau and R work better if data is in a long format. We will therefore need to pivot our data set and we will also remove some redundant columns.
7.1.1 Watch the video below to see how to pivot data in Tableau
If the video above does not work click here
To pivot:
- Add a clean step to the world_pop file.
- Remove redundant variables (i.e. “ndicator Name” and “Indicator Code”
- Next click on the little “+” symbol on the canvas (for wolrd pop) and select Pivot.
- Select all the years and drag them into the “Pivot Fields” box.
- Change the “Pivot Names” column Name to Years and the “Pivot Values” column name to “Population”.
- Add another clean step to check your pivot result and then add an output to save your new file.
We have now cleaned our data and are ready to move forward in Tableau. We will save our joined dataset and our new world_pop data set as .csv files so it’s easy to import into Tableau but also into R may we wish to do so.
Exercise 10: Save your data file as merged_olympics.csv and world_pop_pivot.csv.
If the video above does not work click here
- Ensure you have added an output to your flow
- Select the location you want to save your data and enter your file name
- Select the file type you want to save it as (I recommend .csv)
- Run the flow (this is important - if you don’t do this your file will not be saved!).