Web Scraping Olympics: Python

Susan Vanderplas

2024-11-02

Basic Game Plan

  1. Try to scrape each sport’s athlete table directly

    • If table is injected with JavaScript, use Selenium
  2. Create a single table of athlete links + sport information

  3. Use info from athlete table to get birthdays

    • Start out with Olympic website
    • Use Wikipedia if that doesn’t work

Scraping Table of Athletes

Obstacle: The default User-Agent sent by Python's HTTP libraries is blocked

Solution: Set the User Agent to look like a normal browser.

(This ended up not mattering)
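
A minimal sketch of the User-Agent fix, using only the standard library. The browser string, the `build_request` helper, and the listing URL are illustrative placeholders, not values taken from the original scraper.

```python
from urllib.request import Request

# Example browser-like User-Agent string (an assumption; any current
# browser's value would serve the same purpose).
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)

def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that announces itself as a regular browser."""
    return Request(url, headers={"User-Agent": user_agent})

# Placeholder URL for illustration; nothing is downloaded here.
req = build_request("https://olympics.com/en/paris-2024/athletes")

# urllib normalizes header names, so the key is looked up as "User-agent".
print(req.get_header("User-agent"))

# To actually fetch the page: urllib.request.urlopen(req).read()
```

The same idea carries over to other HTTP clients, which generally accept a headers dictionary in a similar way.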

Scraping Table of Athletes

  • Hardest part: scrolling so the “next” button is visible in the frame #JustSeleniumProblems

    • Error handling was essential
  • Decided not to find the max number of pages and instead use a while loop and a break statement

  • Save data incrementally – don’t re-scrape
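
The loop described above might look roughly like this. It is a sketch under assumptions, not the original scraper: `scrape_all_pages`, `selenium_click_next`, and the `button.next` selector are hypothetical names, and the page-reading step is passed in as a function so the loop itself stays independent of Selenium.

```python
import csv
from pathlib import Path

def scrape_all_pages(read_rows, click_next, out_path, max_pages=1000):
    """Read pages until click_next() reports that there is no next page.

    read_rows() returns the rows on the current page; click_next() returns
    True after moving to the next page and False when there is none.
    """
    out_path = Path(out_path)
    n_pages = 0
    while True:
        rows = read_rows()
        # Append each page to disk right away so a crash never costs
        # more than the page currently being read.
        with out_path.open("a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        n_pages += 1
        if n_pages >= max_pages:  # safety cap against an endless loop
            break
        try:
            if not click_next():
                break
        except Exception as err:  # e.g. the click was intercepted
            print(f"Stopping after page {n_pages}: {err}")
            break
    return n_pages

def selenium_click_next(driver, selector="button.next"):
    """Scroll the 'next' button into view and click it (hypothetical selector)."""
    # Imported lazily so the loop above can be used without Selenium installed.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    try:
        button = driver.find_element(By.CSS_SELECTOR, selector)
    except NoSuchElementException:
        return False
    if not button.is_enabled():
        return False
    # Scrolling first avoids clicking an element that is outside the frame.
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", button)
    button.click()
    return True
```

With a live driver, `click_next` would be something like `lambda: selenium_click_next(driver)`, and `read_rows` a function that extracts the table from `driver.page_source`.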

BeautifulSoup Coolness

The SoupStrainer class is very cool and allows you to screen out anything you don’t care about!

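from bs4 import BeautifulSoup, SoupStrainer

# `page` is assumed to hold the HTML source of the athlete listing as a string
# Only parse <a> tags with the class "competitor-container"; skip everything else
items_to_keep = SoupStrainer("a", attrs={"class": "competitor-container"})
for link in BeautifulSoup(page, "html.parser", parse_only=items_to_keep):
    # Guard against anchors that have no href attribute
    if link.has_attr("href"):
        print(link["href"])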

Approach

  • Write a function to get all athletes from each sport

    • read table and append
    • if “Next” is enabled, scroll down and click it
  • map a lambda function over a Series of URLs for each sport

  • Use df.explode() 💥 to move from one sport per line to one athlete per line

  • Convert stub of per-athlete URL to a full URL
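
The map / explode / full-URL steps can be sketched as follows. `get_athlete_stubs` is a hypothetical stand-in that returns canned values instead of scraping, the sport page URLs are placeholders, and the archery stub is made up for illustration.

```python
import pandas as pd
from urllib.parse import urljoin

BASE_URL = "https://olympics.com"

def get_athlete_stubs(sport_url):
    """Hypothetical stand-in for the per-sport scraper: returns canned stubs."""
    canned = {
        "https://olympics.com/en/paris-2024/athletes/football": [
            "/en/paris-2024/athlete/guillaume-restes_1900702",
            "/en/paris-2024/athlete/santiago-hezze_1930405",
        ],
        "https://olympics.com/en/paris-2024/athletes/archery": [
            "/en/paris-2024/athlete/example-archer_0000000",  # made-up stub
        ],
    }
    return canned.get(sport_url, [])

# One row per sport (placeholder URLs)
sports = pd.DataFrame({
    "urlname": ["football", "archery"],
    "sport_url": [
        "https://olympics.com/en/paris-2024/athletes/football",
        "https://olympics.com/en/paris-2024/athletes/archery",
    ],
})

# Each cell of `stub` becomes a list of athlete stubs for that sport
sports["stub"] = sports["sport_url"].map(lambda u: get_athlete_stubs(u))

# explode() turns one row per sport into one row per athlete
athletes = sports.explode("stub", ignore_index=True)

# Turn the site-relative stub into a full URL
athletes["url"] = athletes["stub"].map(lambda s: urljoin(BASE_URL, s))
print(athletes[["urlname", "url"]])
```

Using `urljoin` rather than string concatenation avoids doubled or missing slashes between the base and the stub.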

Table of Athletes

import pandas as pd

athlete_urls = pd.read_csv("athlete-addresses.csv")
type       sport   urlname   url
type-team  soccer  football  https://olympics.com/en/paris-2024/athlete/guillaume-restes_1900702
type-team  soccer  football  https://olympics.com/en/paris-2024/athlete/santiago-hezze_1930405
type-team  soccer  football  https://olympics.com/en/paris-2024/athlete/maksym-talovierov_1564454
type-team  soccer  football  https://olympics.com/en/paris-2024/athlete/mayra-ramirez_1918067
type-team  soccer  football  https://olympics.com/en/paris-2024/athlete/boubacar-traore_1567378

Scraping Athlete Data

  • Start with the Olympics website

  • Athlete data is in an HTML object with the ID #PersonInfo

  • This information is not actually a table 😭

    • Get the bold elements – variable names
    • Get the full text for each line
      variable name : variable value
    • Process to get a DataFrame - find/replace
  • Return a single row DataFrame
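
A rough sketch of the bold-label approach described above. The sample HTML is invented to imitate a "label : value" layout inside `#PersonInfo`; the real page markup may differ, and `parse_person_info` is a hypothetical name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; not copied from the Olympics site.
SAMPLE_HTML = """
<div id="PersonInfo">
  <div><b>Date of Birth</b>: 11 Mar 2005</div>
  <div><b>Age</b>: 19</div>
  <div><b>Birth Country</b>: France</div>
</div>
"""

def parse_person_info(html):
    """Return a single-row DataFrame built from 'label : value' lines."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find(id="PersonInfo")
    record = {}
    for bold in info.find_all("b"):
        # The bold element holds the variable name
        name = bold.get_text(strip=True)
        # The parent holds the full line: "name : value"
        line = bold.parent.get_text(" ", strip=True)
        # Drop the label, then the separating colon and spaces
        value = line[len(name):].lstrip(" :").strip()
        record[name.replace(" ", "_")] = value
    return pd.DataFrame([record])

row = parse_person_info(SAMPLE_HTML)
print(row)
```

Every value comes back as a string here, so numeric columns such as Age would still need converting after the per-athlete rows are concatenated.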

Scraping Athlete Data

╭─────────────────────────────── skimpy summary ───────────────────────────────╮
│          Data Summary                Data Types                              │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                       │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                       │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                       │
│ │ Number of rows    │ 4213   │ │ string      │ 9     │                       │
│ │ Number of columns │ 12     │ │ int64       │ 2     │                       │
│ └───────────────────┴────────┘ │ float64     │ 1     │                       │
│                                └─────────────┴───────┘                       │
│                                   number                                     │
│ ┏━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━┳━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━┓  │
│ ┃ colu ┃      ┃      ┃      ┃      ┃     ┃     ┃      ┃      ┃      ┃     ┃  │
│ ┃ mn_n ┃      ┃      ┃      ┃      ┃     ┃     ┃      ┃      ┃      ┃ his ┃  │
│ ┃ ame  ┃ NA   ┃ NA % ┃ mean ┃ sd   ┃ p0  ┃ p25 ┃ p50  ┃ p75  ┃ p100 ┃ t   ┃  │
│ ┡━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━╇━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━┩  │
│ │ id   │    0 │    0 │    0 │    0 │   0 │   0 │    0 │    0 │    0 │     │  │
│ │      │      │      │      │      │     │     │      │      │      │  ▇  │  │
│ │ Age  │    0 │    0 │ 25.6 │ 4.68 │  14 │  22 │   25 │   29 │   47 │ ▁▇▇ │  │
│ │      │      │      │    7 │    6 │     │     │      │      │      │ ▃▁  │  │
│ │ Heig │ 1072 │ 25.4 │ 1.78 │ 0.12 │ 1.4 │ 1.7 │ 1.78 │ 1.87 │ 2.22 │  ▃▇ │  │
│ │ ht_m │      │    5 │    7 │   05 │     │     │      │      │      │ ▆▂  │  │
│ └──────┴──────┴──────┴──────┴──────┴─────┴─────┴──────┴──────┴──────┴─────┘  │
│                                   string                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓  │
│ ┃ column_name           ┃ NA    ┃ NA %   ┃ words per row   ┃ total words  ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩  │
│ │ Birth_Country         │   628 │  14.91 │               1 │         4357 │  │
│ │ Date_of_Birth         │     0 │      0 │               3 │        12639 │  │
│ │ Function              │     0 │      0 │               1 │         4275 │  │
│ │ Gender                │     0 │      0 │               1 │         4213 │  │
│ │ Place_of_birth        │   927 │     22 │             1.1 │         4774 │  │
│ │ Residence_Country     │  1218 │  28.91 │            0.91 │         3841 │  │
│ │ name                  │     0 │      0 │             2.2 │         9060 │  │
│ │ events                │     0 │      0 │              59 │       249062 │  │
│ │ Place_of_residence    │  1971 │  46.78 │            0.77 │         3257 │  │
│ └───────────────────────┴───────┴────────┴─────────────────┴──────────────┘  │
╰──────────────────────────────────── End ─────────────────────────────────────╯

Scraping Athlete Data

id  Age  Birth_Country       Date_of_Birth  name
0   19   France              11 Mar 2005    RESTES Guillaume
0   23   Argentina           22 Oct 2001    HEZZE Santiago
0   24   Russian Federation  28 Jun 2000    TALOVIEROV Maksym
0   25   Colombia            25 Mar 1999    RAMIREZ Mayra
0   23   Mali                20 Aug 2001    TRAORE Boubacar

Cleaning

  • Create a date variable that’s not a string

  • Create a decimal_date() function in Python… missing R…


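from datetime import datetime

def decimal_date(date):
    """Convert a datetime to a fractional year (1 July is roughly year + 0.5)."""
    start_of_year = datetime(date.year, 1, 1)
    end_of_year = datetime(date.year + 1, 1, 1)
    # 365 or 366, so leap years are handled
    days_in_year = (end_of_year - start_of_year).days
    # Whole days elapsed since 1 January, as a fraction of the year
    return date.year + (date - start_of_year).days / days_in_year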
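
Before decimal_date() can be applied, the Date_of_Birth strings have to become real dates. A minimal sketch, assuming every value follows the "11 Mar 2005" pattern seen in the table of athletes; `parse_birth_date` is a hypothetical helper name.

```python
from datetime import datetime

def parse_birth_date(text):
    """Parse strings like '11 Mar 2005' (day, abbreviated month, year).

    %b depends on the locale; this assumes English month abbreviations.
    """
    return datetime.strptime(text, "%d %b %Y")

# Sample values taken from the athlete table
dates = [parse_birth_date(s) for s in ["11 Mar 2005", "22 Oct 2001", "28 Jun 2000"]]

# The birth month is the variable of interest for the plots
months = [d.month for d in dates]
print(months)
```

For a whole DataFrame column, `pd.to_datetime(athlete_data["Date_of_Birth"], format="%d %b %Y")` should do the same conversion in one call.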
  

Birthdays by Team vs. Individual

import seaborn as sns
import seaborn.objects as so

p = (
  so.Plot(athlete_data, y = "month")
  .add(so.Bar(), so.Count())
  .facet("type")
  .scale()
)
p.show()

Birthdays by Team vs. Individual

p = (
  so.Plot(athlete_data, y = "month")
  .add(
    so.Bar(), 
    so.Hist(stat='proportion', 
            common_norm=False))
  .facet("type")
  .scale()
)
p.show()

Birthdays by Sport

p = (
  so.Plot(athlete_data, y = "month")
  .add(so.Bar(), so.Hist(stat='proportion', common_norm=False))
  .facet("sport").share(x=False)
  .layout(size=(8,4))
)
p.show()

Birthdays by Sport