2024-11-02
Try to scrape each sport’s athlete table directly
Create a single table of athlete links + sport information
Use info from athlete table to get birthdays
Obstacle: Python’s User Agent is blocked by default
Solution: Set the User Agent to look like a normal browser.
(This ended up not mattering)
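A minimal sketch of the User Agent fix, using only the standard library. The `BROWSER_UA` string and the `browser_request` helper are illustrative assumptions, not the values used in the original scrape, and the request is built but never sent so the example stays offline.

```python
from urllib.request import Request

# Hypothetical browser-like User-Agent string (an assumption for illustration).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def browser_request(url: str) -> Request:
    """Build a request whose User-Agent looks like a desktop browser."""
    return Request(url, headers={"User-Agent": BROWSER_UA})


req = browser_request("https://olympics.com/en/paris-2024/")
# urllib.request.urlopen(req) would send it; omitted to keep the sketch offline.
```

The same header dictionary can be passed to other HTTP clients; with Selenium the browser supplies its own User Agent, which may be why this step ended up not mattering.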
Hardest part: scrolling so the “next” button is visible in the frame #JustSeleniumProblems
Decided not to find the max number of pages and instead use a while loop and a break statement
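A rough sketch of the scroll-then-click loop, assuming a Selenium-style `driver`. The function name `scrape_all_pages`, the `parse_page` callback, and the CSS selector passed in are hypothetical; the real page's selector is not given in these notes. The string `"css selector"` is the value behind Selenium's `By.CSS_SELECTOR`.

```python
import time


def scrape_all_pages(driver, parse_page, next_selector, pause=1.0):
    """Collect rows from every page by clicking "next" until it is gone."""
    rows = []
    while True:
        rows.extend(parse_page(driver.page_source))
        # find_elements returns an empty list (no exception) when nothing matches.
        buttons = driver.find_elements("css selector", next_selector)
        if not buttons or not buttons[0].is_enabled():
            break  # no usable "next" button, so this was the last page
        next_button = buttons[0]
        # Scroll the button into the viewport so the click can land on it.
        driver.execute_script(
            "arguments[0].scrollIntoView({block: 'center'});", next_button
        )
        next_button.click()
        time.sleep(pause)  # crude wait for the next page to render
    return rows
```

Breaking when the button disappears means the page count never has to be known up front, at the cost of one extra lookup on the final page.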
Save data incrementally – don’t re-scrape
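One way to avoid re-scraping is a small on-disk cache, sketched below. The helper name `scrape_once`, the JSON file layout, and the `cache` directory are assumptions for illustration; the notes do not say how the data was actually saved.

```python
import json
from pathlib import Path


def scrape_once(key, scrape, cache_dir="cache"):
    """Return cached data for `key`, calling `scrape(key)` only on a miss."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        # Already scraped in an earlier run: read it back instead of refetching.
        return json.loads(path.read_text(encoding="utf-8"))
    data = scrape(key)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Writing one file per sport (or per athlete) means a crash halfway through only costs the item in progress.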
The SoupStrainer class (from Beautiful Soup) is very cool and allows you to screen out anything you don’t care about while the page is being parsed!
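A small illustration of `SoupStrainer`, assuming only the links are wanted. The HTML snippet and the athlete paths in it are made up for the example.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up page fragment; the real page structure is not shown in these notes.
html = """
<html><body>
  <nav><p>Menu text we do not need</p></nav>
  <a href="/en/paris-2024/athlete/example-one_1">One</a>
  <div><a href="/en/paris-2024/athlete/example-two_2">Two</a></div>
  <a name="no-href">ignored</a>
</body></html>
"""

# Keep only <a> tags that carry an href; everything else is never parsed.
only_links = SoupStrainer("a", href=True)
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
hrefs = [a["href"] for a in soup.find_all("a")]
```

Here `hrefs` holds just the two athlete paths, and the `<nav>` paragraph never makes it into the parse tree.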
Write a function to get all athletes from each sport
Map a lambda function over a Series of URLs for each sport
Use df.explode() 💥 to move from one sport per line to one athlete per line
Convert stub of per-athlete URL to a full URL
type | sport | urlname | url |
---|---|---|---|
type-team | soccer | football | https://olympics.com/en/paris-2024/athlete/guillaume-restes_1900702 |
type-team | soccer | football | https://olympics.com/en/paris-2024/athlete/santiago-hezze_1930405 |
type-team | soccer | football | https://olympics.com/en/paris-2024/athlete/maksym-talovierov_1564454 |
type-team | soccer | football | https://olympics.com/en/paris-2024/athlete/mayra-ramirez_1918067 |
type-team | soccer | football | https://olympics.com/en/paris-2024/athlete/boubacar-traore_1567378 |
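The map, explode, and stub-to-URL steps that lead to a table like this might look roughly as follows. `get_athlete_stubs`, `FAKE_PAGES`, the `example.com` sport URL, and the leading-slash stub format are all assumptions standing in for the real scraper, which is not shown here.

```python
from urllib.parse import urljoin

import pandas as pd

BASE_URL = "https://olympics.com"

# Stand-in for scraped pages: sport page URL -> athlete URL stubs (assumed format).
FAKE_PAGES = {
    "https://example.com/football": [
        "/en/paris-2024/athlete/guillaume-restes_1900702",
        "/en/paris-2024/athlete/santiago-hezze_1930405",
    ],
}


def get_athlete_stubs(sport_url):
    """Hypothetical scraper stand-in: return the athlete stubs for one sport."""
    return FAKE_PAGES[sport_url]


sports = pd.DataFrame(
    {
        "sport": ["soccer"],
        "urlname": ["football"],
        "sport_url": ["https://example.com/football"],
    }
)

# One list of stubs per sport row ...
sports["stub"] = sports["sport_url"].map(lambda u: get_athlete_stubs(u))
# ... then one row per athlete, with the sport columns repeated.
athletes = sports.explode("stub", ignore_index=True)
# Resolve each stub against the site root to get a full URL.
athletes["url"] = athletes["stub"].map(lambda s: urljoin(BASE_URL, s))
```

`explode()` copies the other columns onto every new row, so the sport information travels with each athlete without a separate merge.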
Start with the Olympics website
Athlete data is in an HTML element with the id PersonInfo (CSS selector #PersonInfo)
This information is not actually a table 😭
Return a single-row DataFrame for each athlete
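Since the block is not a table, each label/value pair has to be pulled out by hand. The sketch below assumes a hypothetical markup where every label sits in a `<b>` tag followed by its value as plain text; the real `#PersonInfo` markup is not reproduced in these notes and may well differ. `person_info_to_row` is an illustrative name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the real #PersonInfo structure may differ.
SAMPLE_HTML = """
<div id="PersonInfo">
  <div><b>Date of Birth:</b> 11 Mar 2005</div>
  <div><b>Age:</b> 19</div>
  <div><b>Birth Country:</b> France</div>
</div>
"""


def person_info_to_row(html):
    """Turn label/value pairs inside #PersonInfo into a one-row DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find(id="PersonInfo")
    record = {}
    if info is not None:
        for label_tag in info.find_all("b"):
            # "Date of Birth:" -> "Date_of_Birth"
            label = label_tag.get_text(strip=True).rstrip(":").replace(" ", "_")
            value = label_tag.next_sibling  # text node right after the label
            record[label] = str(value).strip() if value is not None else None
    return pd.DataFrame([record])


row = person_info_to_row(SAMPLE_HTML)
```

Returning one row per athlete makes it easy to stack the results later with `pd.concat`, and athletes with missing fields simply end up with missing values in those columns.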
skimpy summary

Data Summary
dataframe | Values |
---|---|
Number of rows | 4213 |
Number of columns | 12 |

Data Types
Column Type | Count |
---|---|
string | 9 |
int64 | 2 |
float64 | 1 |

number
column_name | NA | NA % | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ▇ |
Age | 0 | 0 | 25.67 | 4.686 | 14 | 22 | 25 | 29 | 47 | ▁▇▇▃▁ |
Height_m | 1072 | 25.45 | 1.787 | 0.1205 | 1.4 | 1.7 | 1.78 | 1.87 | 2.22 | ▃▇▆▂ |

string
column_name | NA | NA % | words per row | total words |
---|---|---|---|---|
Birth_Country | 628 | 14.91 | 1 | 4357 |
Date_of_Birth | 0 | 0 | 3 | 12639 |
Function | 0 | 0 | 1 | 4275 |
Gender | 0 | 0 | 1 | 4213 |
Place_of_birth | 927 | 22 | 1.1 | 4774 |
Residence_Country | 1218 | 28.91 | 0.91 | 3841 |
name | 0 | 0 | 2.2 | 9060 |
events | 0 | 0 | 59 | 249062 |
Place_of_residence | 1971 | 46.78 | 0.77 | 3257 |
id | Age | Birth_Country | Date_of_Birth | name |
---|---|---|---|---|
0 | 19 | France | 11 Mar 2005 | RESTES Guillaume |
0 | 23 | Argentina | 22 Oct 2001 | HEZZE Santiago |
0 | 24 | Russian Federation | 28 Jun 2000 | TALOVIEROV Maksym |
0 | 25 | Colombia | 25 Mar 1999 | RAMIREZ Mayra |
0 | 23 | Mali | 20 Aug 2001 | TRAORE Boubacar |
Create a date variable that’s not a string
Create a decimal_date() function in Python… missing R…
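A sketch of both date steps, assuming the `Date_of_Birth` strings follow the "11 Mar 2005" pattern shown in the table above and that month abbreviations are English. The `decimal_date` function here is a hand-rolled approximation in the spirit of R's lubridate `decimal_date()`, not a port of it; the `births` frame and its new column names are illustrative.

```python
from datetime import datetime

import pandas as pd


def decimal_date(d):
    """Return a date as a fractional year, e.g. 2 Jul 2004 -> 2004.5."""
    year_start = datetime(d.year, 1, 1)
    next_year_start = datetime(d.year + 1, 1, 1)
    # Fraction of the year elapsed; leap years get 366 days automatically.
    return d.year + (d - year_start) / (next_year_start - year_start)


births = pd.DataFrame({"Date_of_Birth": ["11 Mar 2005", "22 Oct 2001"]})

# Parse the strings into real datetimes (day, abbreviated month, 4-digit year).
births["dob"] = pd.to_datetime(births["Date_of_Birth"], format="%d %b %Y")
# Fractional-year version, handy for plotting or age arithmetic.
births["dob_decimal"] = births["dob"].map(decimal_date)
```

Dividing by the actual length of the year, rather than a fixed 365.25, keeps 1 January at exactly `.0` in every year.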