For my capstone project in Data Science, I attempted to predict the future winners of Tour De France using Machine Learning.
For my Data Set I went looking all over, including Kaggle, but discovered only incomplete datasets (such as ones containing only winners). So, I decided to utilize Web Scraping to pull my own data from ProCycingStats.com, a website that carefully tracks and stores cycling results as well as information on the riders. I randomly cross-referenced some of my data with Wikipedia and Tour De France’s own website to make sure the data I was collecting is correct.
I then did some cleaning and EDA on my dataset to identify key characteristics for the previous winners, convert any non-numeric data into numeric, and reduce data type to speed up processing time.
At this stage I also played around with different target variables to define how I was going to predict my final results. These included different bin sizes for classifying rider positions, and the exact rank for an approach based on regression. (Time was considered as a target, but not utilized as the length of the race changes every year, and so time would not accurately identify winners year to year)
For my modeling I applied a series to Classification and Regression Machine Learning Models using PyCaret and Sci-Kit Learn. After hyper-parameter optimization I saw the best results through the Decision Tree model. The first iteration of modeling can bin riders into different classes based on their features with 80% accuracy against the test set. The conclusion was two fold that my data-set still lacked the depth to identify true winners and podium placers from the rest, and that machine learning might not be good enough to get the best results. For my second iteration I applied Deep Learning through Convolutional Neural Network to see if it would perform better than my Machine Learning models.
I am currently working on my third iteration where I collect more data on the riders and the Tour De France so I can have more features to train my model on and be able to predict outcome solely using Machine Learning as the models are more interpretable and the give actionable insights.