library(tidyverse)
library(readxl)
library(janitor)
library(teamcolors)
Cleaning
Goals of this notebook
Our data all comes from the MLS Players Association but it was published in different formats throughout the years.
- We are starting with some historical data already cleaned by Jeremy Singer-Vine for the Data is Plural newsletter: Github archive. It goes from 2007 to 2017.
- It’s interesting that the DIP PDF for 2017 is different from what I downloaded from the MLSPA site. It appears to have pool MLS players that the downloaded PDFs do not.
- I used Tablula to convert 2018 to 2021 data from PDF into CSVs. The data won’t have headers and each file is a little different.
- The 2022 is available as an Excel file. It did need cleaning.
- The May version of 2023 data came in Excel, but the September 2023 update was published as a PDF and also as an HTML table that I was able to collect through browser-based scraper.
- The early 2024 version was released as a CSV and an HTML table. I scraped the HTML table and created an Excel file because it had a player not listed in the csv file.
Once all the data was imported, I still needed to do some normalization:
- All columns needed the same datatype, etc.
- We needed to normalize team names. Sometimes team initials are used, and other times full names are used. I wanted both. I used this data to build a complete reference file in Google Sheets and the re-updated the data with that. I also consulted usfootballr for how names and abbreviations were used in hopes of joining with data from that package.
- I wanted to add conference affiliations, which I did manually in the Google Sheet reference.
- I normalized positions with preference for letter designations using hyphens for multi-position players.
Things to do
- I need to double-check all the conference designations before I actually use them.
- I might add color values to the Google Sheet reference since all the packages I’ve checked have been out of date.
Setup
Import the archive data
This brings in the Data is Plural files, which should all be the same.
<- read_csv(c(
dip_data "data-processed/data-is-plural/mls-salaries-2007.csv",
"data-processed/data-is-plural/mls-salaries-2008.csv",
"data-processed/data-is-plural/mls-salaries-2009.csv",
"data-processed/data-is-plural/mls-salaries-2010.csv",
"data-processed/data-is-plural/mls-salaries-2011.csv",
"data-processed/data-is-plural/mls-salaries-2012.csv",
"data-processed/data-is-plural/mls-salaries-2013.csv",
"data-processed/data-is-plural/mls-salaries-2014.csv",
"data-processed/data-is-plural/mls-salaries-2015.csv",
"data-processed/data-is-plural/mls-salaries-2016.csv",
"data-processed/data-is-plural/mls-salaries-2017.csv"),
id = "file_name"
|>
) mutate(
year = str_sub(file_name, 44, 47)
|>
) relocate(year) |> # moving year to front
select(!file_name) |> # dropping file_name
rename(club_short = club)
Rows: 5553 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): club, last_name, first_name, position
dbl (2): base_salary, guaranteed_compensation
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# peek
|> head() dip_data
Cleaning club abbeviations
There are times when club abbreviations vary. We fix those here.
Checking them:
|>
dip_data count(club_short)
Looking through this, I want to make a number of changes. I also ended up checking the usfootballR package for how they do abbreviations. Using SKC for historical data since it was the same club that changed names.
<- dip_data |>
dip_changed mutate(
club_short_clean = recode(
club_short,"KC" = "SKC",
"MNUFC" = "MIN",
"TFC" = "TOR",
"NYRB" = "NY",
"NYCFC" = "NYC",
"None" = NULL,
"POOL" = "MLS",
"Pool" = "MLS"
) )
Checking the changes to make sure they are right.
|>
dip_changed count(club_short, club_short_clean)
Cleaning column names for DIP files
<- dip_changed |>
dip_cleaned select(!club_short) |>
rename(club_short = club_short_clean) |>
relocate(year, club_short)
|> head() dip_cleaned
Cleaning Tabula conversions
Each of these files from years 2018 to 2021 will have different issues.
2018 salaries
For this one, the dollar signs came in as their own column, which was actually helpful so the numbers were actually numbers. I also need to add the year.
<- read_csv(
s2018_imported "data-raw/tabula-2018-09-15-Salary-Information-Alphabetical.csv",
col_names = c(
"club_long",
"last_name",
"first_name",
"position",
"currency_1",
"base_salary",
"currency_2",
"guaranteed_compensation"
)|>
) select(!starts_with("currency")) |> # removes dollar signs
add_column(year = "2018", .before = "club_long")
Rows: 694 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, currency_1, currency_2
num (2): base_salary, guaranteed_compensation
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
|> head() s2018_imported
2019 salaries
For this one, our salaries come in with $
in the salary fields, and also with a trailing space and )
. We import those a “raw” versions and clean them before using type_convert()
to have dplyr try to type them again, changing them to numberic. For some reason I couldn’t just convert to as.numeric()
even after it was cleaned up as they came up NA. I also add the year.
<- read_csv(
s2019_imported "data-raw/tabula-2019-Salary-List-Fall-Release-FINAL-Salary-List-Fall-Release-MLS.csv",
col_names = c(
"club_long",
"last_name",
"first_name",
"position",
"base_salary_raw",
"guaranteed_compensation_raw"
)|>
) mutate(
base_salary = str_remove_all(base_salary_raw, "[\\)\\$]") |> str_trim(),
guaranteed_compensation = str_remove_all(guaranteed_compensation_raw, "[\\)\\$]") |> str_trim()
|>
) type_convert() |>
select(!ends_with("raw")) |>
add_column(year = "2019", .before = "club_long")
Rows: 714 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, base_salary_raw, guaran...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
── Column specification ────────────────────────────────────────────────────────
cols(
club_long = col_character(),
last_name = col_character(),
first_name = col_character(),
position = col_character(),
base_salary_raw = col_character(),
guaranteed_compensation_raw = col_character(),
base_salary = col_number(),
guaranteed_compensation = col_number()
)
|> head() s2019_imported
2020 salaries
<- read_csv(
s2020_imported "data-raw/tabula-2020-Fall-Winter-Salary-List-alphabetical.csv",
col_names = c(
"last_name",
"first_name",
"club_long",
"position",
"base_salary_raw",
"guaranteed_compensation_raw"
)|>
) mutate(
base_salary = str_remove(base_salary_raw, "\\$"),
guaranteed_compensation = str_remove(guaranteed_compensation_raw, "\\$"),
|>
) select(!ends_with("raw")) |>
type_convert() |>
add_column(year = "2020", .before = "last_name")
Rows: 778 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): last_name, first_name, club_long, position, base_salary_raw, guaran...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
── Column specification ────────────────────────────────────────────────────────
cols(
last_name = col_character(),
first_name = col_character(),
club_long = col_character(),
position = col_character(),
base_salary = col_number(),
guaranteed_compensation = col_number()
)
|> head() s2020_imported
2021 salaries
<- read_csv(
s2021_imported "data-raw/tabula-2021-MLSPA-Fall-Salary-release (1).csv",
col_names = c(
"club_long",
"last_name",
"first_name",
"position",
"base_salary_raw",
"guaranteed_compensation_raw"
)|>
) mutate(
base_salary = str_remove(base_salary_raw, "\\$"),
guaranteed_compensation = str_remove(guaranteed_compensation_raw, "\\$"),
|>
) select(!ends_with("raw")) |>
type_convert() |>
add_column(year = "2021", .before = "club_long")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 871 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, base_salary_raw, guaran...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
── Column specification ────────────────────────────────────────────────────────
cols(
club_long = col_character(),
last_name = col_character(),
first_name = col_character(),
position = col_character(),
base_salary = col_number(),
guaranteed_compensation = col_number()
)
|> head() s2021_imported
Import excel data
2022 data
<- read_excel("data-raw/MLS Salary 2022.xlsx") |> clean_names() |>
s2022_imported add_column(year = "2022", .before = "first_name") |>
rename(position = position_s,
club_long = club)
|> glimpse() s2022_imported
Rows: 841
Columns: 7
$ year <chr> "2022", "2022", "2022", "2022", "2022", "2022"…
$ first_name <chr> "Artur", "Nanú", "Zeca", "Luquinhas", "Judson"…
$ last_name <chr> NA, NA, NA, NA, NA, NA, "Ågren", "Aaronson", "…
$ club_long <chr> "Columbus Crew", "FC Dallas", "Houston Dynamo"…
$ position <chr> "M", "D", "D", "F", "M", "D", "D", "M", "D", "…
$ base_salary <dbl> 564000, 600000, 218182, 1050000, 420000, 40000…
$ guaranteed_compensation <dbl> 615633, 668750, 259015, 1173125, 425000, 42875…
2023 data
The first version for 2023 was available as an Excel file. However, there was an update in September of 2023 that was available for download as a PDF (saved in data-raw) but also published as an HTML table that I was able to use a browser-based scraper to create an Excel file. It is this Excel file that I’m using.
<- read_excel("data-raw/2023-Salary-Report-as-of-Sept-15-2023.xlsx") |> clean_names() |>
s2023_imported add_column(year = "2023", .before = "first_name") |>
rename(position = position_s,
club_long = club)
|> glimpse() s2023_imported
Rows: 924
Columns: 7
$ year <chr> "2023", "2023", "2023", "2023", "2023", "2023"…
$ first_name <chr> "Luis", "Lalas", "Nicolás", "Alonso", "Ifunany…
$ last_name <chr> "Abram", "Abubakar", "Acevedo", "Aceves", "Ach…
$ club_long <chr> "Atlanta United", "Colorado Rapids", "New York…
$ position <chr> "D", "D", "D-M", "D", "F", "D-M", "M", "M-F", …
$ base_salary <dbl> 556364, 665000, 230000, 330000, 85444, 1250000…
$ guaranteed_compensation <dbl> 695977, 702125, 274800, 368900, 85444, 1365000…
2024 data
This 2024 Excel file was created by scraping an HTML table from the MLSPA site. There is a May 2024 csv file available, but it is missing Toronto FC player Aimé Mabika.
<- read_excel("data-raw/2024-Salary-Report-as-of-Apr-25-2023.xlsx") |> clean_names() |>
s2024_imported add_column(year = "2024", .before = "first_name") |>
rename(position = position_s,
club_long = club)
|> glimpse() s2024_imported
Rows: 874
Columns: 7
$ year <chr> "2024", "2024", "2024", "2024", "2024", "2024"…
$ first_name <chr> "Liel", "Luis", "Lalas", "Nicolás", "Kellyn", …
$ last_name <chr> "Abada", "Abram", "Abubakar", "Acevedo", "Acos…
$ club_long <chr> "Charlotte FC", "Atlanta United", "Colorado Ra…
$ position <chr> "Right Wing", "Center-back", "Center-back", "D…
$ base_salary <dbl> 2300000, 732275, 665000, 300000, 1392188, 2960…
$ guaranteed_compensation <dbl> 2448500, 871888, 702125, 344800, 1558869, 4216…
Checking 2024 csv
I would prefer to use this downloadable csv, but it is missing a player.
<- read_csv("data-raw/MLS-Salaries-as-of-4-25-2024-for-release-Final.txt") |> clean_names() |>
s2024_csv add_column(year = "2024", .before = "first_name") |>
rename(position = position_s,
club_long = club,
guaranteed_compensation = guaranteed_comp) |>
mutate(across(6:7, parse_number))
Rows: 873 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): First Name, Last Name, Club, Position(s), Base Salary, Guaranteed Comp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
|>
s2024_csv glimpse()
Rows: 873
Columns: 7
$ year <chr> "2024", "2024", "2024", "2024", "2024", "2024"…
$ first_name <chr> "Daniel", "Santiago", "Franco", "Edwin", "Tris…
$ last_name <chr> "Ríos", "Sosa", "Ibarra", "Mosquera", "Muyumba…
$ club_long <chr> "Atlanta United", "Atlanta United", "Atlanta U…
$ position <chr> "Center Forward", "Defensive Midfield", "Defen…
$ base_salary <dbl> 89716, 685000, 600000, 400000, 525000, 500000,…
$ guaranteed_compensation <dbl> 126383, 778100, 660000, 437000, 566600, 595000…
Who is missing?
|>
s2024_imported anti_join(s2024_csv)
Joining with `by = join_by(year, first_name, last_name, club_long, position,
base_salary, guaranteed_compensation)`
Merge non-dip data
The tabula and excel data both use long club names, so I’ll put those together.
<- s2018_imported |>
non_dip_data bind_rows(s2019_imported, s2020_imported, s2021_imported, s2022_imported, s2023_imported, s2024_imported)
|> glimpse() non_dip_data
Rows: 5,696
Columns: 7
$ year <chr> "2018", "2018", "2018", "2018", "2018", "2018"…
$ club_long <chr> "New York Red Bulls", "New York City FC", "Col…
$ last_name <chr> "Abang", "Abdul-Salaam", "Abu", "Abubakar", "A…
$ first_name <chr> "Anatole", "Saad", "Mohammed", "Lalas", "David…
$ position <chr> "F", "D", "M", "D", "F-M", "M", "F-M", "M", "M…
$ base_salary <dbl> 68927.00, 106480.00, 175008.00, 68250.04, 1250…
$ guaranteed_compensation <dbl> 68927.00, 120230.00, 181258.00, 75750.04, 1250…
Cleaning long names in non-dip data
Checking the names for errors.
|>
non_dip_data count(club_long)
Some things of note:
- 2021 New England Revolution may be missing last
n
. - Montreal should probably be CF Montréal
- “MLS Pool” changed to “Major League Soccer” for consistency
There are other changes as well based on usfootballR package, etc.
<- non_dip_data |>
non_dip_changed mutate(
club_long_changed = case_match(
club_long,"Atlanta United" ~ "Atlanta United FC",
"Chicago Fire" ~ "Chicago Fire FC",
"Houston Dynamo" ~ "Houston Dynamo FC",
"CF Montreal" ~ "CF Montréal",
"Montreal" ~ "CF Montréal",
# "Montreal Impact" ~ "CF Montréal",
"Inter Miami" ~ "Inter Miami CF",
"New England Revolutio" ~ "New England Revolution",
"DC United" ~ "D.C. United",
"St. Louis City SC" ~ "St. Louis CITY SC",
"Minnesota United" ~ "Minnesota United FC",
"MLS Pool" ~ "Major League Soccer",
.default = club_long
)
)
|>
non_dip_changed count(club_long, club_long_changed)
Now we’ll fix up the columns.
<- non_dip_changed |>
non_dip_cleaned select(-club_long) |>
rename(club_long = club_long_changed) |>
relocate(year, club_long)
|> glimpse() non_dip_cleaned
Rows: 5,696
Columns: 7
$ year <chr> "2018", "2018", "2018", "2018", "2018", "2018"…
$ club_long <chr> "New York Red Bulls", "New York City FC", "Col…
$ last_name <chr> "Abang", "Abdul-Salaam", "Abu", "Abubakar", "A…
$ first_name <chr> "Anatole", "Saad", "Mohammed", "Lalas", "David…
$ position <chr> "F", "D", "M", "D", "F-M", "M", "F-M", "M", "M…
$ base_salary <dbl> 68927.00, 106480.00, 175008.00, 68250.04, 1250…
$ guaranteed_compensation <dbl> 68927.00, 120230.00, 181258.00, 75750.04, 1250…
Building both names
I’d like to have club_long
and club_short
for each row of data, but to do that I need to build a conversion chart of some kind.
It is club_short
that I’d like to be consistent for a market, where club_long
changes as branding changes.
How I tackled this:
- Get a list of all the short names
- Get a list of all the long names
- Add them both to a Google Sheet and match them manually
- Download that sheet and then use joins to fill gaps
I used the following to get the names, though just to copy into a spreadsheet for a lot of manual research and work.
<- dip_cleaned |> distinct(year, club_short) |>
clubs_short arrange(year, club_short)
# clubs_short |> clipr::write_clip()
<- non_dip_data |> distinct(year, club_long) |>
clubs_long arrange(year, club_long)
# clubs_long |> clipr::write_clip()
Downloading in the spreadsheet
The edited Google Sheet is here: rws-mls-salaries
Download the clubs list. Commented out after saved locally. Then viewed club names to verify/check them.
# download.file("https://docs.google.com/spreadsheets/d/e/2PACX-1vQqXJxbbrBsikirZrGyXYV_G6cFZp_dYmcf52UfSYM7Kw3akGlkO5jKP8ZL8WtRA5qUJgFMNPG8JYov/pub?output=csv", "data-processed/clubs-list.csv")
<- read_csv("data-processed/clubs-list.csv") |>
clubs_list mutate(year = as.character(year))
Rows: 410 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): club_short, club_long, conference
dbl (1): year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
|>
clubs_list count(club_short, club_long)
Fill in our DIP
The DIP data doesn’t have club_long. We’ll join with our master clubs list to get them.
<- dip_cleaned |>
dip_full left_join(clubs_list, join_by(year == year, club_short == club_short))
|> head() dip_full
Check for those that did not get a long name to see why.
|>
dip_full filter(is.na(club_long))
These players are not attached to teams.
Fill in non-dip
The non-DIP data doesn’t have the short name, so we join with clubs_list reference file for them.
<- non_dip_cleaned |>
non_dip_full left_join(clubs_list, join_by(year == year, club_long == club_long))
|> head() non_dip_full
Check those that did not get a short name to research why.
|>
non_dip_full filter(is.na(club_short))
Merge all the versions together
<- dip_full |>
binded bind_rows(non_dip_full) |>
select(
year,
club_short,
last_name,
first_name,
position,
base_salary,compensation = guaranteed_compensation,
club_long,
conference
)
|> glimpse() binded
Rows: 11,249
Columns: 9
$ year <chr> "2007", "2007", "2007", "2007", "2007", "2007", "2007", "…
$ club_short <chr> "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "…
$ last_name <chr> "Armas", "Banner", "Barrett", "Blanco", "Brown", "Busch",…
$ first_name <chr> "Chris", "Michael", "Chad", "Cuauhtemoc", "C.J.", "Jon", …
$ position <chr> "M", "M", "F", "F", "D", "GK", "F", "D", "M", "D", "D", "…
$ base_salary <dbl> 225000.0, 12900.0, 41212.5, 2492316.0, 106391.0, 58008.0,…
$ compensation <dbl> 225000.0, 12900.0, 48712.5, 2666778.0, 106391.0, 58008.0,…
$ club_long <chr> "Chicago Fire FC", "Chicago Fire FC", "Chicago Fire FC", …
$ conference <chr> "Eastern", "Eastern", "Eastern", "Eastern", "Eastern", "E…
Clean positions
Peek at the positions.
|>
binded count(position)
## Help to get me a list of positions
# binded |> distinct(position) |> pull() |> clipr::write_clip()
It’s clear some cleanup is needed, and will take some decisions. I’m going to go with letter designations with hyphens when necessary.
<- binded |>
positions mutate(
position_new =
case_match(
position,# "M" ~ "",
# "F" ~ "",
# "D" ~ "",
# "GK" ~ "",
# "D-M" ~ "",
# "F-D" ~ "",
# "F-M" ~ "",
# "M-D" ~ "",
# "M-F" ~ "",
# "D-F" ~ "",
"MF" ~ "M-F",
"M/F" ~ "M-F",
"M/D" ~ "M-D",
"D/M" ~ "D-M",
"D/F" ~ "D-F",
"F/M" ~ "F-M",
# "NA" ~ "",
"Right Wing" ~ "F",
"Center-back" ~ "D",
"Defensive Midfield" ~ "D-M",
"Attacking Midfield" ~ "M-F",
"Center Forward" ~ "F",
"Left-back" ~ "D",
"Central Midfield" ~ "M",
"Forward" ~ "F",
"Right-back" ~ "F",
"Left Wing" ~ "F",
"Goalkeeper" ~ "GK",
"Left Midfield" ~ "M",
"Right Midfield" ~ "M",
"Midfielder" ~ "M",
"Defender" ~ "D",
.default = position
)
)
|>
positions count(position, position_new)
Clean up the columns
<- positions |>
positions_fixed select(!position) |>
rename(position = position_new) |>
relocate(position, .after = first_name)
positions_fixed
Export the cleaned data
Assign last cleaning dataframe and export
<- positions_fixed |>
mls_salaries arrange(year, club_short, last_name, first_name)
|> write_rds("data-processed/mls-salaries.rds") mls_salaries