19 Practical 8: Advanced plots in Tableau
Up until this point we have mainly worked on visualising one variable at a time. But more often than not you will have multiple variables you want to display in one visualisation (e.g. player position and goal scoring ability). We will look at this during this practical. First up we will demonstrate this with F1 data, and then you will work through the exercises below using an ice hockey data set which includes the top 100 players who played in the NHL.
If you want to follow along with the demonstration you can download the P6_Demonstration_Completed.twbx from here.
19.1 Exercises
Throughout these exercises you will require two files (i.e. NHL Data.xlsx and TenK.xlsx) which can be found here
Exercise 1: Open the NHL data file in Tableau.
19.2 Creating scatter plots
You will see the variable names are not very self-explanatory. I have listed the meaning of the relevant variables (the ones we will use in this practical) below:
POS - Centre, Right Wing, Defense, Left wing
GP - Games Played
G - Goals
A - Assists
P - Goals + Assists (also referred to as total points)
+/- - A team’s goal differential while a particular player is on the ice
PIM - Penalty minutes
Shots - Total number of shots
First up we are interested in the correlation between goals and assists.
Exercise 2: Create a scatter dot plot which displays the Goals on the x-axis and the Assists on the y-axis.
For video demonstration click here
Exercise 3: We will discuss adding trend lines in a little bit but what can you already see from this graph?
- There is a large cloud of points not really indicating a correlation
- There is one extreme outlier with one player scoring almost 2000 assists and 900 goals
- The majority of players record more assists than goals
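The visual impression of a correlation can also be checked numerically. The following is a minimal Python sketch, not part of the Tableau workflow, that computes Pearson's correlation coefficient. The `goals` and `assists` lists are made-up values (they are not taken from the NHL file) and the function name `pearson_r` is simply an illustrative choice. It is only meant to show how a single extreme point can influence the coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical career totals (NOT the NHL data); the last player is an extreme outlier.
goals = [300, 450, 500, 620, 900]
assists = [700, 520, 640, 560, 1950]

r_all = pearson_r(goals, assists)
r_without_outlier = pearson_r(goals[:-1], assists[:-1])
print(f"r with outlier: {r_all:.2f}, without outlier: {r_without_outlier:.2f}")
```

In this toy data the coefficient is positive with the outlier and negative without it, which is a reminder to look at the plot before trusting a single number.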
We now have a very simple scatter dot but what if we want to compare players on different variables? We can do this by adding more aesthetics. Let’s use size for total points and colour for total number of games played.
Exercise 4: Adjust your first plot to include these two extra variables with size being total points and colour representing total number of games played.
For video demonstration click here
Consider the importance of the information displayed on the x and y-axes when choosing which variable to assign to each aesthetic. In this case, goals and assists sit on the two axes, and the size and colour aesthetics represent the additional variables. By placing the primary information on the x and y-axes, the graph conveys the most critical details at a glance.
When displaying multiple variables in one visualisation it is important we ensure our aesthetics are formatted correctly. In the graph above, the colour and size differences are hard to read. By adjusting the colour scheme we can improve this.
Exercise 5: Can you change the colours so the plot uses a diverging colour scheme (I will use orange-white-blue but you can choose whatever you like)?
For video demonstration click here
If you watched the video you will have seen I changed Marks from automatic to Circle. I did this so the dots are coloured in. I also reduced the opacity a little and added a grey border. These are just personal choices but show how you can play around with design options.
The graph we created now shows us a lot of information in just one visualisation. However, it would be good to know who the players actually are. Who, for example, is the outlier?
Exercise 6: Can you add player names as labels to the plot above?
For video demonstration click here
Tableau automatically prevents labels from overlapping. However, you can override this (not recommended) by clicking Label on the Marks card. When you do this you will see there are several options which you can play around with (e.g. try to show the min and max based on points scored).
Exercise 7: Can you show the names of the players with the highest and lowest total points?
For video demonstration click here
Using the labelling function is useful but limits the way in which you can show your data points. What if we wanted to show the label for a specific player who cannot be selected using the options on the Label marks card? In that case we can use annotations.
Exercise 8: Can you annotate the plot so we can see Wayne Gretzky and Mark Messier (he recorded roughly 1200 assists and 700 goals)?
For video demonstration click here
Okay, so we are getting to a pretty decent graph. The last thing we would like to change is the x and y-axis labels as well as the legend names.
Exercise 9: Rename these as Assists, Goals, Games Played, and Points.
Labels, tooltips, and annotations are three different ways to focus on specific marks, or data points. Which ones to use depends on how many you want to label and how/when you want them to be visible.
19.3 Creating stacked bar charts
Scatterplots are not the only way to show multiple quantities. Another visualization type we can use is the stacked bar chart.
We will stick with the NHL dataset but we are going to look at the per game rates. For this we will need to first calculate the goals, assists and points per game.
Exercise 10: Create three calculated fields calculating the assists, goals, and points per game for each player, and name the variables APG, GPG, and PPG.
For video demonstration click here
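The arithmetic behind these three calculated fields is a simple division by games played. As a rough Python sketch of the same logic, assuming one row per player and using the column names from the variable list (G, A and GP), the calculation looks like this. The `Player` class, the `per_game` helper and the two example players are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    gp: int  # games played
    g: int   # goals
    a: int   # assists

    @property
    def p(self):
        """Total points: goals plus assists."""
        return self.g + self.a


def per_game(total, games_played):
    """Return total / games_played, or None when no games were played."""
    if games_played == 0:
        return None
    return total / games_played


# Hypothetical players (NOT taken from the NHL file).
players = [
    Player("Player A", gp=100, g=50, a=75),
    Player("Player B", gp=80, g=20, a=60),
]

rates = {
    pl.name: {
        "GPG": per_game(pl.g, pl.gp),
        "APG": per_game(pl.a, pl.gp),
        "PPG": per_game(pl.p, pl.gp),
    }
    for pl in players
}

for name, r in rates.items():
    print(name, r)
```

Because points are goals plus assists, GPG and APG add up to PPG for every player, which is what makes stacking the two rates in one bar meaningful.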
Now that we have our per-game variables, we can plot APG and GPG as a stacked bar chart. However, this is not as straightforward as you may think. If we place both APG and GPG on the Columns shelf, Tableau will create two separate graphs, and the “Show me - Stacked Bar” option does not work either. Instead, we need to place Measure Values on the Columns shelf and remove all irrelevant measures.
Exercise 11: Create a stacked bar chart with GPG and APG as variables on the x-axis and players on the y-axis.
For video demonstration click here
While we have a stacked bar chart above, this is still far from ideal. We would like the two variables to have different colours.
Exercise 12: Change the colours so both variables have a unique colour and then sort the data ascending using total points scored.
For video demonstration click here
From the visualization above we can see that Gretzky, who at first seemed to be exceptional, now has company at the top of this list. Lemieux seems very close to Gretzky when it comes to total points per game but Bossy scored more goals per game than either of them.
What is also noticeable from this graph is that it is a little more difficult to compare their Assists per Game rates. This is one drawback of the stacked bar chart: it’s easy to compare the lengths of the first bars and the overall lengths of the bars, but the bars without a common baseline are much more difficult to compare.
If comparing assists per game were more important than the total points, we would be better off plotting the two values as individual data points in the same chart. We can do this because the unit is the same (i.e. a per-game value). Doing this enables us to see both the goals and the assists and compare players on each of those.
Exercise 13: Create a copy of the chart you just made. Can you turn it into a dot chart with the assists on the left and the goals on the right (each having a different colour)?
For video demonstration click here
You can play around with how you sort your data. There is no right or wrong here, but it should be informed by the message you want to share. For example, our stacked bar chart clearly showed the overall performance in total points, but it was much more difficult to compare the two individual components. The dot chart clearly visualises the difference in goal scoring between players as well as in assists, i.e. Gretzky excels in assists whereas Lemieux and Bossy perform much better on scoring.
For video demonstration click here
19.4 Regression/ association plots
Next up we will look at how we can plot the strength of an association between variables. We will have a look at shots and goals. Note that shots result in goals but goals do not result in shots. The shots will therefore become our predictor/independent variable (normally placed on the x-axis) and the goals our dependent/outcome variable (normally placed on the y-axis).
Exercise 14: Create a scatter dot for shots vs goals and add position using the colour aesthetic.
For video demonstration click here
From the graph above we can see there may be a bit of a correlation between number of shots taken and number of goals. We can also see that the strength of the correlation may be different between positions of play.
Exercise 15: Let’s check the different correlations by adding regression lines for each individual category. Can you ensure each line crosses 0 (i.e. 0 shots will always result in 0 goals)?
For video demonstration click here
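Forcing the line through 0 changes the fit: instead of estimating a slope and an intercept, only the slope is estimated, as the sum of shots × goals divided by the sum of shots squared. The Python sketch below illustrates this least-squares formula per position. The shot and goal totals are invented for the example and the function name `slope_through_origin` is a hypothetical helper, so treat it as an illustration of the idea rather than a reproduction of what Tableau computes internally.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope of y = b * x (intercept fixed at 0)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    sum_xx = sum(x * x for x in xs)
    if sum_xx == 0:
        raise ValueError("slope is undefined when all x values are 0")
    return sum(x * y for x, y in zip(xs, ys)) / sum_xx


# Hypothetical (shots, goals) totals per position group, NOT the NHL data.
shots_and_goals = {
    "Forward": ([1000, 2000, 3000], [150, 310, 440]),
    "Defense": ([1000, 2000, 3000], [60, 110, 190]),
}

slopes = {
    position: slope_through_origin(shots, goals)
    for position, (shots, goals) in shots_and_goals.items()
}

for position, slope in slopes.items():
    # With the intercept fixed at 0, the slope reads as "goals per shot".
    print(f"{position}: {slope:.3f} goals per shot")
```

With the intercept fixed at 0, each slope can be read directly as goals per shot, which makes the lines for different positions easy to compare.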
From the graph above we can see the correlation between shots and goals is fairly similar for Center and Wing players but quite different for Defense players. We could choose to simplify the graph by only showing the two regressions (i.e. combined Center and Wingers vs Defense).
For video demonstration click here
19.5 Quadrants
The last way to look at this data is to divide the graph into quadrants. We can look at those who took a low number of shots and scored little (bottom left corner), those who took a low number of shots but scored quite a lot (top left corner), those who took a high number of shots but scored little (bottom right corner), and lastly those who took a lot of shots and scored a lot (top right corner). We can do this by adding reference lines to our original correlation plot.
Exercise 16: Copy your scatter plot, remove the regression lines, ungroup your position variable and add a horizontal and vertical reference line using the average goals and shots to the updated scatter dot.
For video demonstration click here
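The two average reference lines split the plot into four regions, and each player falls into one of them depending on whether their shots and goals are above or below the averages. A small Python sketch of that classification is given below. The players and their totals are hypothetical, and the choice to count a value exactly equal to the average as "low" is an assumption made for this example.

```python
from statistics import mean


def quadrant(shots, goals, avg_shots, avg_goals):
    """Name the quadrant of a point relative to the two average lines."""
    horizontal = "right" if shots > avg_shots else "left"
    vertical = "top" if goals > avg_goals else "bottom"
    return f"{vertical} {horizontal}"


# Hypothetical (shots, goals) totals, NOT the NHL data.
players = {
    "Player A": (1000, 100),
    "Player B": (1000, 300),
    "Player C": (3000, 100),
    "Player D": (3000, 300),
}

# These two averages play the role of the vertical and horizontal reference lines.
avg_shots = mean([shots for shots, _ in players.values()])
avg_goals = mean([goals for _, goals in players.values()])

labels = {
    name: quadrant(shots, goals, avg_shots, avg_goals)
    for name, (shots, goals) in players.items()
}

for name, label in labels.items():
    print(f"{name}: {label}")
```

Players in the top left quadrant are the efficient scorers (few shots, many goals), while those in the bottom right take many shots for relatively few goals.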
19.6 Date data
The last part of this practical will focus on using date data. The Olympic dataset used in the first few practicals contained historical date-based data. However, we did not tap into this as much as we could. In this section we will start looking at creating timelines and using multiple plots to explain findings.
Exercise 17: Load in TenK.xlsx, which can be found here.
As you can see, this data consists of all the winning times for the Olympic 10K race since 1912. It also includes the total number of competitors, countries and weather details for the day of the final.
Let’s start with looking at how the winning 10K time has changed over the years.
Exercise 18: Create a line graph with year of the Olympics on the x-axis and 10K time on the y-axis (think about your formatting).
For video demonstration click here
Exercise 19: Let’s also add a trendline to this graph.
For video demonstration click here
Exercise 20: What would you take away from this graph immediately?
- There is a downward trend in finishing times (i.e. athletes are getting faster)
- 1968 (Mexico City) is a very clear outlier
- 2012 and 2020 also seem to break the trend
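The observations above (a downward trend plus a point that sits well above it) can also be expressed numerically: fit a straight line and look at how far each point lies from it. The Python sketch below does this with ordinary least squares. The edition numbers and winning times in seconds are invented for illustration and are not the values in TenK.xlsx. The function name `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: edition number and winning time in seconds (NOT TenK.xlsx).
editions = [1, 2, 3, 4, 5]
times = [1800, 1790, 1840, 1770, 1760]

slope, intercept = fit_line(editions, times)

# Residual = observed time minus the time predicted by the trend line.
residuals = [t - (intercept + slope * e) for e, t in zip(editions, times)]

# The edition with the largest positive residual is the slowest relative to the trend.
outlier_edition = editions[residuals.index(max(residuals))]
print(f"slope: {slope:.1f} s per edition, slowest vs trend: edition {outlier_edition}")
```

A negative slope corresponds to winning times getting faster, and a large positive residual flags a race that was much slower than the trend would suggest.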
As a data analyst I would like to know why Mexico City appears to be such an outlier. Given that we have weather and altitude data, and knowing that humidity and altitude both play a big role in performance, we will start there.
We could decide to create another graph like the one above but with Altitude on the y-axis and plot the two side by side.
Exercise 21: Create a second plot with Altitude on the y-axis and plot this side by side with the TimePlot.
For video demonstration click here
From the graphs above we can see that Mexico City is a huge outlier with regard to altitude: it is the only high-altitude city that has hosted the Olympics. However, it is not always this clear, and sometimes it may be easier to overlap two graphs.
Overlapping graphs with the same axis scale is okay, but be careful when overlapping graphs with different axis scales (as we do in the next exercise!).
Exercise 22: Can you lay the altitude graph over the time graph? Can you also create a label which indicates the altitude of Mexico City?
The graph above shows us, without much effort, that the high altitude correlates with the unexpectedly slow time during the Mexico City Olympics. Note that the peak in 1996 does not appear to have an impact on the times. This is because, even though Atlanta is located higher than most Olympic cities, it is not classed as a high-altitude city. Also note how I changed the colour of the y-axis; this helps clarify which line belongs to which axis.