Beijing 2022 Biathlon Mass Start Clustering

By Billy Fryer in R

June 8, 2022

Introduction

With the warm summer making its presence felt here in many parts of the United States, it is only logical to discuss a sport that is a favorite in snowy Nordic winters: biathlon. After biathlon events, normally all we see is the final ordering of athletes based on their finish times. However, this singular time only provides one value for a sport with multiple phases. In particular, an athlete’s shooting performance has a huge effect on the outcome of their race, hence the name biathlon. Even still, this is ignoring some of the unusual stresses of the Olympics that include jet lag, potentially a different diet, or sleeping in an unfamiliar (thankfully not cardboard) bed - all of which have an effect on how well an athlete performs. Fortunately, there are machine learning techniques that allow us to begin to peel the layers back and better understand some of the patterns from the event that may not seem quite as obvious. In order to take a deeper look at the performance of athletes who competed in this year’s events, I used k-means clustering to tier biathletes to see who was actually in medal contention during the Mass Start Biathlon events of the Beijing 2022 Winter Olympic Games.

How Biathlon Works

Biathlon is a sport that combines the two disciplines of cross country skiing and shooting. This analysis focuses on the Mass Start format. In the Mass Start, all 30 biathletes line up, begin the race at the exact same time, and are on the course alongside their competitors. In other variations, such as the Individual or Pursuit, biathletes have staggered starts. For the skiing portion, biathletes ski a total of 5 loops; for men these loops are 3 km each (for a total of 15 km) and for women the loops are 2.5 km each (for a total of 12.5 km) (1). For the shooting portion of the biathlon, the athletes shoot at 5 targets from 50 meters away after each of the first four laps (1). After completing laps 1 and 2, the athletes shoot at a target with a diameter of 45 millimeters from the prone or lying down position (1). After laps 3 and 4, the athletes shoot at a target with a diameter of 115 millimeters from a standing position (1). After every shooting range session, biathletes must ski a 150 meter penalty loop for each shot they miss (2). After their final range session and any necessary penalty loops, the athletes complete their fifth lap and finish the race. The biathlete who completes the course the fastest is the winner.

For those interested in watching the Mass Start Events from the 2022 Beijing Olympics, they are linked below:

Biathlon - Men’s 15km Mass Start from the Olympics YouTube Page

Biathlon - Women’s 12.5km Mass Start from the Olympics YouTube Page

Data

The data for this project comes from the FHSTR R package that I created. This package contains data from NBC’s API for all events in the 2022 Beijing Winter Olympic Games. For more information about the FHSTR package, visit https://billyfryer.github.io/FHSTR/. If you are not an R user, the data is available in the following GitHub Repo: https://github.com/billyfryer/Beijing-Olympics-Data-Repo.

For this analysis, data for all 30 participants from both the men’s and women’s races were used for a total of 60 athletes. Three variables were used to cluster the athletes together: number of shooting misses in the prone or lying down position, number of shooting misses in the standing position, and total time taken to complete the event. These variables were centered and scaled for analysis. Unfortunately, this was all the data available for analysis. If the data were available, more variables to consider would include time spent on the range, time spent skiing penalty loops, and time in between shots, among others. This data would further help distinguish which areas of the event biathletes could focus on to improve their times.
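As a minimal sketch of what centering and scaling does, the snippet below applies base R's scale() to a small made-up data frame. The column names and values are hypothetical and are not taken from the FHSTR package; the actual script may prepare the data differently.

```r
# Hypothetical toy data: these values are made up for illustration,
# not taken from the FHSTR package.
athletes <- data.frame(
  prone_misses    = c(0, 1, 2, 0, 3, 1),
  standing_misses = c(1, 2, 4, 0, 5, 2),
  time_minutes    = c(38.2, 39.5, 41.0, 37.9, 44.3, 40.1)
)

# scale() centers each column to mean 0 and scales it to standard deviation 1
athletes_scaled <- scale(athletes)

# After scaling, column means are (numerically) zero and standard deviations are one
round(colMeans(athletes_scaled), 10)
apply(athletes_scaled, 2, sd)
```

Scaling matters here because k-means relies on distances: without it, the time column (measured in minutes) could dominate the miss counts simply because of its units.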

To some, there may appear to be one variable that is missing: the sex of the biathlete. However, I purposefully excluded this for two reasons. The primary reason is that the race distances are different. The distance skied in the women’s event is shorter than in the men’s race. This artificially reduces the times of the women in comparison to the men, making the times more comparable. With this in place, I find it unnecessary to account for the sex of the athletes yet again. The second reason comes from a statistical modeling perspective. The algorithm I chose, k-means clustering, is built around distances between numeric values and is not well suited to binary variables. I attempted different methods of scaling times based on sex; however, those results led to clusters segregated by sex, which was not the goal for this study.

Modeling

I decided to use k-means clustering to group these athletes together. In this case, the k-means algorithm iteratively reassigns athletes to clusters until the assignments stop changing. To determine the proper number of clusters, I plotted the total within-cluster sum of squares against the number of clusters and determined that the proper number of clusters to use should be 5 (3).
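The within-cluster sum of squares plot described above (often called an elbow plot) can be sketched in base R as follows. The data here is simulated as a hypothetical stand-in for the 60 athletes, and the seed, the range of k, and nstart = 25 are my assumptions rather than settings confirmed by the original script.

```r
set.seed(2022)  # k-means uses random starts, so fix the seed for reproducibility

# Hypothetical simulated stand-in for the 60 athletes (not the real data)
athletes <- data.frame(
  prone_misses    = rpois(60, lambda = 1.5),
  standing_misses = rpois(60, lambda = 2),
  time_minutes    = rnorm(60, mean = 40, sd = 2)
)
athletes_scaled <- scale(athletes)

# Total within-cluster sum of squares for k = 1 through 10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(athletes_scaled, centers = k, nstart = 25)$tot.withinss
})

# Look for the "elbow" where adding clusters stops reducing the total much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On simulated data like this there is no reason to expect an elbow at 5; the sketch only shows the mechanics of producing the plot.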

I then ran the k-means algorithm to assign athletes to the 5 clusters. Each of these clusters appears to be distinct, with no overlap, when looking at the plot of the first two principal components below. As an aside, principal component analysis was not used in modeling, only for this visualization. Cluster names are arbitrarily assigned.
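A rough sketch of this step, again on hypothetical simulated data rather than the real results, fits k-means with 5 centers and then uses prcomp() only to get two axes for plotting. The variable names and plotting choices are illustrative assumptions.

```r
set.seed(2022)

# Hypothetical simulated stand-in for the 60 athletes (not the real data)
athletes <- data.frame(
  prone_misses    = rpois(60, lambda = 1.5),
  standing_misses = rpois(60, lambda = 2),
  time_minutes    = rnorm(60, mean = 40, sd = 2)
)
athletes_scaled <- scale(athletes)

# Fit k-means with 5 clusters on the three scaled variables
fit <- kmeans(athletes_scaled, centers = 5, nstart = 25)
athletes$cluster <- factor(fit$cluster)

# PCA is used for plotting only; the clusters were fit on the scaled variables
pca <- prcomp(athletes_scaled)

# Plot the first two principal components, colored by cluster label
plot(pca$x[, 1], pca$x[, 2], col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(athletes$cluster),
       col = seq_along(levels(athletes$cluster)), pch = 19, title = "Cluster")
```

Because the cluster labels come out of random starts, the numbering (1 through 5) carries no meaning, which is why the names are arbitrary.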

The table below shows average values for each cluster for the factors used in the model as well as a few others. For average rank, I left each athlete with their original rank separated by sex. Therefore, the fastest male and female athletes both receive a rank of 1 while the slowest athlete of each sex receives a rank of 30.
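One way such a summary could be computed in base R is sketched below. The data is simulated and hypothetical, the rank is derived from the simulated times within each sex, and the column names are my own; the real table was built from the actual race results.

```r
set.seed(2022)

# Hypothetical simulated data: 30 men and 30 women (not the real results)
athletes <- data.frame(
  sex             = rep(c("M", "W"), each = 30),
  prone_misses    = rpois(60, lambda = 1.5),
  standing_misses = rpois(60, lambda = 2),
  time_minutes    = rnorm(60, mean = 40, sd = 2)
)

# Rank within each sex, so the fastest man and fastest woman both get rank 1
athletes$rank <- ave(athletes$time_minutes, athletes$sex, FUN = rank)

# Cluster on the three scaled variables only; sex and rank are not model inputs
features <- scale(athletes[, c("prone_misses", "standing_misses", "time_minutes")])
athletes$cluster <- kmeans(features, centers = 5, nstart = 25)$cluster

# Average of each variable per cluster, plus the cluster size
cluster_summary <- aggregate(
  cbind(prone_misses, standing_misses, time_minutes, rank) ~ cluster,
  data = athletes, FUN = mean
)
cluster_summary$n <- as.vector(table(athletes$cluster))
cluster_summary
```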

At first glance, these clusters appear very imbalanced in size, with Cluster 1 in particular having only 2 people. However, these athletes were strong outliers from the rest of the field, as explained later.

An interactive 3D scatter plot of all athletes in the study is below. The axes are the number of misses from the prone position (“x”), the number of misses from the standing position (“y”), and the biathlete’s time in minutes (“z”). Hover over a point to see the athlete’s name, prone shots missed, standing shots missed, time, and assigned cluster. The plot can also be rotated to view it from different perspectives and zoomed in as needed.

As a reference guide, the table below includes all 60 biathletes that competed in the mass start events. They are arranged by sex and then by time. If you are looking for a particular biathlete, you can find them here and then go read about their specific cluster below. Otherwise, similar tables are present when each cluster is discussed individually afterwards.

Cluster 1

This cluster captures the two athletes who placed 29th and 30th in the women’s race and were significantly slower than all other participants. These two also missed significantly more shots than the rest of the biathletes especially from the standing position. Due to these misses and slow skiing times, the model separated them from the rest of the field which caused the cluster size imbalance.

Cluster 2

Athletes in cluster 2 were very good shooting from the prone position - on average only missing 0.6 shots. This is about 1.4 fewer shots than the next closest cluster. However, they missed more shots than average from the standing position (when ignoring the outliers of cluster 1), which brings their average number of penalty laps closer to the rest of the field. Even so, these athletes still maintain a little bit of the edge gained from the prone position. Overall, this cluster ended up with the second fastest average time and an average rank of 11.1.

The fantastic average placement of this cluster is due in large part to the women: 5 of the top 10 are in this cluster, including both the silver and bronze medalists. Both Tiril Eckhoff and Marte Olsbu Røiseland were perfect from the prone position but missed 4 each from the standing position. This was the same number of shots missed by gold medalist Justine Braisaz-Bouchet. However, Braisaz-Bouchet missed some of her shots earlier in the race when she was fresher skiing. Shooting cleaner from the standing position when she was tired allowed Braisaz-Bouchet to gain ground at the end of the race compared to her competitors who were forced to ski more penalty loops. This ultimately is what kept Eckhoff and Røiseland from both the gold medal and being in the same cluster as Justine Braisaz-Bouchet.

Cluster 3

Cluster 3 is a rather interesting mix of biathletes. The group contains the lower finishes in the men’s race and the middle of the pack finishes in the women’s race. Logically this makes sense because their times are similar even if their placements in their respective races are not. These biathletes failed to shoot well from the standing position and that seems to be what held them back. Outside of cluster 1, this cluster had the highest average number of misses from that position. However, these athletes skied and shot well enough from the prone position to keep them with the rest of the field.

Without pulling data from other Biathlon World Cup events in the past year, it’s impossible to know whether these athletes just had poor shooting days or if they are not great shooters to begin with. On both the men’s and women’s sides, shooting a cleaner slate could easily boost the majority of athletes in Cluster 3 up a few spots in the standings and for some potentially into medal contention.

Cluster 4

Cluster 4 is where the elite biathletes reside. These biathletes averaged one shooting miss less than the rest of the field. This helped them avoid penalty laps, which shaves time off their race. However, it also means it is harder for athletes in this cluster to improve their chances of medaling. The athletes in this cluster are all likely both elite shooters and skiers; those who missed the podium may simply not have had their best day of competition in comparison to other Olympic athletes. Other factors that were mentioned earlier, such as the magnitude of the Olympic Games or the lack of a good night of rest, may have also added pressure, but this would be impossible to measure.

Cluster 4 contains the top 7 times in the men’s event as well as Justine Braisaz-Bouchet, who won gold in the women’s race and was referred to earlier. With an average rank of 6.4, these were the athletes to watch to see who took home the gold, silver and bronze. One of the most fascinating biathletes in the sport right now is men’s Mass Start gold medalist Johannes Thingnes Bø, who placed at the top of this cluster. He finished a full 40 seconds clear of the silver medalist and more than a minute faster than the bronze medalist! During the Beijing 2022 Games, he finished with 4 golds, a bronze, and a 5th place finish across all biathlon events. Combined, he and his brother Tarjei (also in Cluster 4) won enough medals to place 9th in the country medal count.

Cluster 5

Cluster 5 includes placements in the teens and early twenties from the women’s race as well as some of the last place men. Although they were below average in prone shooting, they shot decently enough from the standing position to have a mean of 6.3 missed shots, which ranks 3rd among all 5 clusters. They finished with an average rank of 20.2, which is very similar to cluster 3. This makes sense as this cluster’s average time was 17 seconds faster than that of cluster 3. The main difference between these two clusters was their ability to hit shots from the prone position.

One of the more peculiar athletes in this cluster is Kristina Reztsova, who placed 5th with a very good time of 41:29. It doesn’t seem like she belongs in this cluster. However, her shooting performance was much worse than that of the competitors around her. Reztsova’s 6 missed shots ultimately cost her a medal, because her skiing ability is clearly among the best in the event: she was able to make up the majority of the time that she lost skiing penalty laps. Luckily, she did not go home empty handed from Beijing, winning Silver in the Relay and Bronze in the Mixed Relay with the ROC. With an improved shooting skillset, Kristina Reztsova will be a competitor to look out for in Milano Cortina in 2026, barring a Russian ban from those Olympic Games.

Conclusion

In conclusion, outside of the gold medal on the men’s side, this event was wide open. A few fewer seconds of skiing or one fewer missed shot could have been enough to steal a medal in either race. We also learned that while missing 0 shots is ideal, everyone misses eventually. Almost as important as how much you miss is when you miss. As we saw in the women’s race, missing shots later on in the standing position may cost you more in penalty lap time when fatigue sets in and could allow competitors that shoot cleaner to pass you. We also discussed athletes to watch out for in the future such as Kristina Reztsova. In the future, when you see race standings, I hope you imagine the racers in different tiers of competitiveness and contention for medals rather than solely the gold, silver, and bronze.

Author’s Notes

I really enjoyed doing this project as it allowed me to do a couple different things. I was able to use the FHSTR package that I created which I find pretty cool. This project also allowed me to analyze the sport that helped me fall in love with the Winter Olympics back in 2010 (the first Olympic Games I can remember). Finally, I was able to use one of the more advanced modeling techniques that I learned this past school year. I absolutely want to say thanks to everyone that helped me out with regards to the feedback for some of the data visualizations and to Sean Sullivan for helping me edit this!

For those interested in seeing the code, click here to see the R Script that contains how the modeling and data visualizations were completed. The flags that were used in the table were downloaded to my computer. If you want to see more Olympics visualizations, check out the Visualizations tab on my website!

Sources

  1. Biathlon 101: https://www.nbcolympics.com/news/biathlon-101-rules

  2. Biathlon World Article: https://www.biathlonworld.com/inside-ibu/sports-and-event/biathlon-mass-start

  3. K-Means Clustering Assistance: https://uc-r.github.io/kmeans_clustering#optimal

Men’s 15 km Mass Start video from the Olympics YouTube Page: https://www.youtube.com/watch?v=B6uXE5-SPbI

Women’s 12.5 km Mass Start video from the Olympics YouTube Page: https://www.youtube.com/watch?v=ZF2gCGLmBlk
