Horse with a Pointy Hat

Musings of a data scientist and recovering astrophysicist

How to run a half-marathon in 10 lines of python

I got into running a number of years ago and one of the ways I've always found productive to get me out and make sure I get a bit of exercise is to enter races. That way I know I better make an effort to train or I'm going to look like an idiot on race day. A couple of months ago I was registered to run in the ABP Southampton Half Marathon. A couple of years ago there was a feature that allowed you to register your social media accounts and then they would tweet/post your progress out as you ran. This wasn't available in 2016 and didn't seem to be available this either so I decided to engineer my own solution.

Post half-marathon refreshment.
Post half-marathon refreshment. © Amy McQuillan

What do we need to do?

The goal of the project is to enable almost real-time reporting of my performance in the race via my twitter account. To achieve this we are going to need to solve a number of discrete problems:

  1. I need to be able to programatically send a tweet from my twitter account.
  2. I need to know my current status in the race.
  3. I need to check if the race status has changed from the last time I got an update to decide whether there is something new to share.
  4. If there is a new result I need to craft an appropriate tweet.

How to send a tweet from python

In all honesty, this was what I expected to be the tricky part, turned out to be one of the easiest bits. Key to the ease of this solution was the Tweepy library.

import tweepy

# Need to setup our twitter api credentials
consumer_key = 'abcdefghijkl'
consumer_secret = 'ABCDEFGHIJKL'
access_token = '1234567890'
access_token_secret = '1234567890-secret'

# Now we connect to twitter using our credentials
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
my_api = tweepy.API(auth)

# Finally we can then send our tweet
my_api.update_status("I'm sending a tweet using Python! Hello world.")

How to keep track of a runners race status

Helpfully the ABP Southampton Half Marathon maintains a live website of each runners status thanks to DB MAX Sports Timing. The screen shot below shows my results after the race was over, however, in the weeks before the race the page already existed and I could see the empty results table. Hence, I was able to design a simple web scraper that would extract out my results.

Example of live web timing results.

So we need to write a web scraper.

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_timings(uri):
    """Get current web page data"""
    page = requests.get(uri)
    soup = BeautifulSoup(page.text, 'lxml')

    # Now find the table of timings
    start_row = soup.findAll('td', text='Start')[0]
    header = start_row.find_parent().find_previous_siblings()[0]
    columns = []
    for tr in header.findAll('tr'):
        columns.append(tr.td.text)
    data = []
    for row in header.find_next_siblings():
        data_row = []
        for td in row.findAll('td'):
            data_row.append(td.text)
        data.append(data_row)
    timings_df = pd.DataFrame(data, columns=columns)
    return timings_df

This little piece of python is a bit more complicated so it's worth taking the time to explain what's going on within the get_timings() function. The function takes a single argument that is the url of the webpage where the timings are located. We use the requests library to fetch the web content and then BeautifulSoup to search for all tables that have the text "Start" in them.

I always find webscraping to be more art than science. In this example I use the tools of BeautifulSoup to talk through the results table to get the features I need by first finding the header row and then looping over all of its siblings. I store the data as I go in a couple of lists and then convert it to a pandas DataFrame at the very end.

Keeping track of whether something has changed

Getting the instantaneous results from the webpage is all well and good, however, we don't want our tweetbot continually spamming the same tweet if nothing has changed. So to handle that we need to store what we saw when we last scraped the data. We could use a csv file but I decided to use a sqlite database in case I wanted to scale this up ever for multiple runners or track multiple races. If you aren't familiar with it sqlite is a simple, file based SQL database; you can read more here

Storing our scraped data in SQLITE

Fortunately for us pandas has already thought about people wanting to push results into a SQL database and has the DataFrame.to_sql() function. So having used out web-scraper to create a dataframe we just need to provide the database connection details to the function. Here we use the sqlite3 library.

We take bib number and a timings dataframe as inputs. We try and make a connection to our local database file and push the results out; in this instance we are always happy to overwrite any previous version of the database. If the connection isn't established we exit gracefully.

import sqlite3 as lite
def push_timings_to_db(bib, timings_df):
    """Store the timings data in a sqlite database"""
    con = None
    try:
        con = lite.connect('abp_half_results.db')
        timings_df.to_sql('bib_{}'.format(bib), con, if_exists='replace')
    except lite.Error as e:
        print("Error %s:" % e.args[0])
        sys.exit(1)
    finally:
        if con:
            con.close()

Comparing a new scrape to the old data

This is only half the battle, we also need to be able to fetch data back from the database in order to make comparisons to the current data. This function is almost the complete inverse of the previous piece of code. Now we connect to our sqlite database and use the pandas.read_sql() function to pull our data back out into a dataframe.

def fetch_timings_from_db(bib, timings_df):
    """Fetch the timings data from a sqlite database"""
    con = None
    try:
        con = lite.connect('abp_half_results.db')
        query_columns = ', '.join(['"'+col+'"' for col in timings_df.columns])
        query = "SELECT {} FROM bib_{}".format(query_columns, bib)
        result = pd.read_sql(query, con)
    except lite.Error as error:
        print("Error %s:" % error.args[0])
        sys.exit(1)
    finally:
        if con:
            con.close()
            return result

Now we need to test whether the stored results look any different to the current web results. This is achieved by doing a 'not equals' != check between the two python dataframe objects. If only one row has changed then a True boolean and the row is returned. If more than one row has changed then the last (i.e. most recent) row is the one returned.

def have_timings_changed(timings_df):
    """Testing if the timing data has changed"""
    result = fetch_timings_from_db()
    check = timings_df[(result != timings_df)["TOD"] == True]
    if check.shape[0] == 1:
        return True, check
    elif check.shape[0] > 1:
        nrows = check.shape[0]
        filt = [False if i<nrows-1 else True for i in range(nrows)]
        return True, check[filt]
    else:
        return False, check

Crafting a tweet for the race status

This component of the project is entirely open to what you would want to say. Personally, I chose to have a dedicated tweet for the start and another for the finish. Any intermediate results during the race would be a boiler plate reply that would then have the current time and pace.

def update_text(data_row, timings_df):
    """Function to generate text of a tweet update"""
    def _convert_to_dt(inp):
        """Try and convert to a datetime else return NaT"""
        try:
            return pd.to_datetime(inp)
        except ValueError:
            return pd.NaT

    timings = timings_df['TOD'].map(_convert_to_dt)
    max_time = (timings.max() - timings.min())
    str_time = "{}h{}m{}s".format(max_time.components.hours,
                                  max_time.components.minutes,
                                  max_time.components.seconds)

    text = ''
    finished = False
    if data_row['Split Name'].iloc[0] == 'Start':
        text = "I crossed the start line of the Southampton Half Marathon at {} #ABPHalfand10K".format(data_row['TOD'].iloc[0])
    elif data_row['Dist Done'].iloc[0] == "13.1":
        pace = max_time.seconds/60/13.1
        str_pace = '{}:{} per mile'.format(int(pace), int((pace%1)*60))
        text = "I've completed the Southampton Half Marathon in a time of {} with an average pace of {} #ABPHalfand10K".format(str_time,
                                                  str_pace)
        finished = True
    else:
        curr_time = (pd.to_datetime(data_row['TOD'].iloc[0]) - timings.min())
        str_curr_time = "{}h{}m{}s".format(curr_time.components.hours, curr_time.components.minutes, curr_time.components.seconds)
        text = "I've made it to {}; I've been running for {} and currently have a pace of {} #ABPHalfand10K".format(data_row['Split Name'].iloc[0],
                                  str_curr_time,
                                  data_row['Pace'].iloc[0])
    return text, finished

Stitching it all together

If we now combine all of this together into a runner() class then we can make it generic for any runner and easy to "run" (excuse the pun). I'll also wrap my original tweeting code into a simple function.

import pandas as import pd
import requests
from bs4 import BeautifulSoup
import sqlite3 as lite

class runner(object):
    """Class to act as a runner"""
    def __init__(self, bib_number):
        """Constructor"""
        self.bib = bib_number
        self.uri = "http://dbmaxresults.co.uk/MyResults.aspx?CId=16421&RId=2166&EId=1&AId={}".format(int(self.bib)+192232)

    def get_timings(self):
        """Get current web page data"""
        page = requests.get(self.uri)
        soup = BeautifulSoup(page.text, 'lxml')

        # Now find the table of timings
        start_row = soup.findAll('td', text='Start')[0]
        header = start_row.find_parent().find_previous_siblings()[0]
        columns = []
        for tr in header.findAll('tr'):
            columns.append(tr.td.text)
        data = []
        for row in header.find_next_siblings():
            data_row = []
            for td in row.findAll('td'):
                data_row.append(td.text)
            data.append(data_row)
        self.timings_df = pd.DataFrame(data, columns=columns)

    def push_timings_to_db(self):
        """Store the timings data in a sqlite database"""
        con = None
        try:
            con = lite.connect('abp_half_results.db')
            self.timings_df.to_sql('bib_'+str(self.bib), con, if_exists='replace')
        except lite.Error as e:
            print("Error %s:" % e.args[0])
            sys.exit(1)
        finally:
            if con:
                con.close()

    def fetch_timings_from_db(self):
        """Fetch the timings data from a sqlite database"""
        con = None
        try:
            con = lite.connect('abp_half_results.db')
            query_columns = ', '.join(['"'+col+'"' for col in self.timings_df.columns])
            result = pd.read_sql("SELECT {} FROM bib_{}".format(query_columns, self.bib), con)
        except lite.Error as error:
            print("Error %s:" % error.args[0])
            sys.exit(1)
        finally:
            if con:
                con.close()
                return result

    def have_timings_changed(self):
        """Testing if the timing data has changed"""
        result = self.fetch_timings_from_db()
        check = self.timings_df[(result != self.timings_df)["TOD"] == True]
        if check.shape[0] == 1:
            return True, check
        elif check.shape[0] > 1:
            nrows = check.shape[0]
            filt = [False if i<nrows-1 else True for i in range(nrows)]
            return True, check[filt]
        else:
            return False, check

    def update_text(self, data_row):
        """Function to generate text of a tweet update"""
        def _convert_to_dt(inp):
            """Try and convert to a datetime else return NaT"""
            try:
                return pd.to_datetime(inp)
            except ValueError:
                return pd.NaT

        timings = self.timings_df['TOD'].map(_convert_to_dt)
        max_time = (timings.max() - timings.min())
        str_time = "{}h{}m{}s".format(max_time.components.hours,
                                      max_time.components.minutes,
                                      max_time.components.seconds)

        text = ''
        finished = False
        if data_row['Split Name'].iloc[0] == 'Start':
            text = "I crossed the start line of the Southampton Half Marathon at {} #ABPHalfand10K".format(data_row['TOD'].iloc[0])
        elif data_row['Dist Done'].iloc[0] == "13.1":
            pace = max_time.seconds/60/13.1
            str_pace = '{}:{} per mile'.format(int(pace), int((pace%1)*60))
            text = "I've completed the Southampton Half Marathon in a time of {} with an average pace of {} #ABPHalfand10K".format(str_time, str_pace)
            finished = True
        else:
            curr_time = (pd.to_datetime(data_row['TOD'].iloc[0]) - timings.min())
            str_curr_time = "{}h{}m{}s".format(curr_time.components.hours,
                                               curr_time.components.minutes,
                                               curr_time.components.seconds)
            text = "I've made it to {}; I've been running for {} and currently have a pace of {} #ABPHalfand10K".format(data_row['Split Name'].iloc[0],
                                         str_curr_time, data_row['Pace'].iloc[0])
        return text, finished
def astroadamh_tweeter(tweet_txt):
    """Function to connect to the astroadamh twitter account"""
    consumer_key = 'abcdefghijkl'
    consumer_secret = 'ABCDEFGHIJKL'
    access_token = '1234567890'
    access_token_secret = '1234567890-secret'

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    my_api = tweepy.API(auth)
    my_api.update_status(tweet_txt)

Now we can run a half marathon in 10 lines of Python

We can tell that there aren't going to be an awful lot of updates based upon the timings webpage, so we:

  • Initiate an instance of the runner class with the appropriate bib number.
  • Create a twitterbot ready to send out our tweets.
  • Use a while loop to monitor if we've finished the race.
  • For the duration of the race we check to see if there is a status change and if so we tweet about it. Then we wait 5 minutes to check again.
myRunner = runner(7117); myTweetBot = astroadamh_twitter()
finished = False
while not finished:
    myRunner.get_timings()
    status, data = myRunner.have_timings_changed()
    if status:
        tweet, finished = myRunner.update_text(data)
        myTweetBot.tweet(tweet)
        myRunner.push_timings_to_db()
    time.sleep(5*60)

And at the end of that we see the following

Result of the twitterbot's tweets

Comments

Comments powered by Disqus