Cleaning

Goals of this notebook

Our data all comes from the MLS Players Association but it was published in different formats throughout the years.

  • We are starting with some historical data already cleaned by Jeremy Singer-Vine for the Data is Plural newsletter: Github archive. It goes from 2007 to 2017.
    • It’s interesting that the DIP PDF for 2017 is different from what I downloaded from the MLSPA site. It appears to have pool MLS players that the downloaded PDFs do not.
  • I used Tablula to convert 2018 to 2021 data from PDF into CSVs. The data won’t have headers and each file is a little different.
  • The 2022 is available as an Excel file. It did need cleaning.
  • The May version of 2023 data came in Excel, but the September 2023 update was published as a PDF and also as an HTML table that I was able to collect through browser-based scraper.
  • The early 2024 version was released as a CSV and an HTML table. I scraped the HTML table and created an Excel file because it had a player not listed in the csv file.

Once all the data was imported, I still needed to do some normalization:

  • All columns needed the same datatype, etc.
  • We needed to normalize team names. Sometimes team initials are used, and other times full names are used. I wanted both. I used this data to build a complete reference file in Google Sheets and the re-updated the data with that. I also consulted usfootballr for how names and abbreviations were used in hopes of joining with data from that package.
  • I wanted to add conference affiliations, which I did manually in the Google Sheet reference.
  • I normalized positions with preference for letter designations using hyphens for multi-position players.

Things to do

  • I need to double-check all the conference designations before I actually use them.
  • I might add color values to the Google Sheet reference since all the packages I’ve checked have been out of date.

Setup

library(tidyverse)
library(readxl)
library(janitor)
library(teamcolors)

Import the archive data

This brings in the Data is Plural files, which should all be the same.

dip_data <- read_csv(c(
  "data-processed/data-is-plural/mls-salaries-2007.csv",
  "data-processed/data-is-plural/mls-salaries-2008.csv",
  "data-processed/data-is-plural/mls-salaries-2009.csv",
  "data-processed/data-is-plural/mls-salaries-2010.csv",
  "data-processed/data-is-plural/mls-salaries-2011.csv",
  "data-processed/data-is-plural/mls-salaries-2012.csv",
  "data-processed/data-is-plural/mls-salaries-2013.csv",
  "data-processed/data-is-plural/mls-salaries-2014.csv",
  "data-processed/data-is-plural/mls-salaries-2015.csv",
  "data-processed/data-is-plural/mls-salaries-2016.csv",
  "data-processed/data-is-plural/mls-salaries-2017.csv"),
  id = "file_name"
) |> 
  mutate(
    year = str_sub(file_name, 44, 47)
  ) |> 
  relocate(year) |> # moving year to front
  select(!file_name) |>  # dropping file_name
  rename(club_short = club)
Rows: 5553 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): club, last_name, first_name, position
dbl (2): base_salary, guaranteed_compensation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# peek
dip_data |> head()

Cleaning club abbeviations

There are times when club abbreviations vary. We fix those here.

Checking them:

dip_data |> 
  count(club_short)

Looking through this, I want to make a number of changes. I also ended up checking the usfootballR package for how they do abbreviations. Using SKC for historical data since it was the same club that changed names.

dip_changed <- dip_data |> 
  mutate(
    club_short_clean = recode(
      club_short,
      "KC" = "SKC",
      "MNUFC" = "MIN",
      "TFC" = "TOR",
      "NYRB" = "NY",
      "NYCFC" = "NYC",
      "None" = NULL,
      "POOL" = "MLS",
      "Pool" = "MLS"
    )
  )

Checking the changes to make sure they are right.

dip_changed |> 
  count(club_short, club_short_clean)

Cleaning column names for DIP files

dip_cleaned <- dip_changed |> 
  select(!club_short) |> 
  rename(club_short = club_short_clean) |> 
  relocate(year, club_short)

dip_cleaned |> head()  

Cleaning Tabula conversions

Each of these files from years 2018 to 2021 will have different issues.

2018 salaries

For this one, the dollar signs came in as their own column, which was actually helpful so the numbers were actually numbers. I also need to add the year.

s2018_imported <- read_csv(
  "data-raw/tabula-2018-09-15-Salary-Information-Alphabetical.csv",
  col_names = c(
    "club_long",
    "last_name",
    "first_name",
    "position",
    "currency_1",
    "base_salary",
    "currency_2",
    "guaranteed_compensation"
  )
) |> 
  select(!starts_with("currency")) |> # removes dollar signs
  add_column(year = "2018", .before = "club_long")
Rows: 694 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, currency_1, currency_2
num (2): base_salary, guaranteed_compensation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
s2018_imported |> head()

2019 salaries

For this one, our salaries come in with $ in the salary fields, and also with a trailing space and ). We import those a “raw” versions and clean them before using type_convert() to have dplyr try to type them again, changing them to numberic. For some reason I couldn’t just convert to as.numeric() even after it was cleaned up as they came up NA. I also add the year.

s2019_imported <- read_csv(
  "data-raw/tabula-2019-Salary-List-Fall-Release-FINAL-Salary-List-Fall-Release-MLS.csv",
  col_names = c(
    "club_long",
    "last_name",
    "first_name",
    "position",
    "base_salary_raw",
    "guaranteed_compensation_raw"
  )
) |> 
  mutate(
    base_salary = str_remove_all(base_salary_raw, "[\\)\\$]") |> str_trim(),
    guaranteed_compensation = str_remove_all(guaranteed_compensation_raw, "[\\)\\$]") |> str_trim()
  ) |> 
  type_convert() |> 
  select(!ends_with("raw")) |> 
  add_column(year = "2019", .before = "club_long")
Rows: 714 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, base_salary_raw, guaran...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

── Column specification ────────────────────────────────────────────────────────
cols(
  club_long = col_character(),
  last_name = col_character(),
  first_name = col_character(),
  position = col_character(),
  base_salary_raw = col_character(),
  guaranteed_compensation_raw = col_character(),
  base_salary = col_number(),
  guaranteed_compensation = col_number()
)
s2019_imported |> head()

2020 salaries

s2020_imported <- read_csv(
  "data-raw/tabula-2020-Fall-Winter-Salary-List-alphabetical.csv",
  col_names = c(
    "last_name",
    "first_name",
    "club_long",
    "position",
    "base_salary_raw",
    "guaranteed_compensation_raw"
  )
) |> 
  mutate(
    base_salary = str_remove(base_salary_raw, "\\$"),
    guaranteed_compensation = str_remove(guaranteed_compensation_raw, "\\$"),
  ) |> 
  select(!ends_with("raw")) |> 
  type_convert() |> 
  add_column(year = "2020", .before = "last_name")
Rows: 778 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): last_name, first_name, club_long, position, base_salary_raw, guaran...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

── Column specification ────────────────────────────────────────────────────────
cols(
  last_name = col_character(),
  first_name = col_character(),
  club_long = col_character(),
  position = col_character(),
  base_salary = col_number(),
  guaranteed_compensation = col_number()
)
s2020_imported |> head()

2021 salaries

s2021_imported <- read_csv(
  "data-raw/tabula-2021-MLSPA-Fall-Salary-release (1).csv",
  col_names = c(
    "club_long",
    "last_name",
    "first_name",
    "position",
    "base_salary_raw",
    "guaranteed_compensation_raw"
  )
) |> 
  mutate(
    base_salary = str_remove(base_salary_raw, "\\$"),
    guaranteed_compensation = str_remove(guaranteed_compensation_raw, "\\$"),
  ) |> 
  select(!ends_with("raw")) |> 
  type_convert() |> 
  add_column(year = "2021", .before = "club_long")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 871 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, base_salary_raw, guaran...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

── Column specification ────────────────────────────────────────────────────────
cols(
  club_long = col_character(),
  last_name = col_character(),
  first_name = col_character(),
  position = col_character(),
  base_salary = col_number(),
  guaranteed_compensation = col_number()
)
s2021_imported |> head()

Import excel data

2022 data

s2022_imported <- read_excel("data-raw/MLS Salary 2022.xlsx") |> clean_names() |> 
  add_column(year = "2022", .before = "first_name") |> 
  rename(position = position_s,
         club_long = club)

s2022_imported |> glimpse()
Rows: 841
Columns: 7
$ year                    <chr> "2022", "2022", "2022", "2022", "2022", "2022"…
$ first_name              <chr> "Artur", "Nanú", "Zeca", "Luquinhas", "Judson"…
$ last_name               <chr> NA, NA, NA, NA, NA, NA, "Ågren", "Aaronson", "…
$ club_long               <chr> "Columbus Crew", "FC Dallas", "Houston Dynamo"…
$ position                <chr> "M", "D", "D", "F", "M", "D", "D", "M", "D", "…
$ base_salary             <dbl> 564000, 600000, 218182, 1050000, 420000, 40000…
$ guaranteed_compensation <dbl> 615633, 668750, 259015, 1173125, 425000, 42875…

2023 data

The first version for 2023 was available as an Excel file. However, there was an update in September of 2023 that was available for download as a PDF (saved in data-raw) but also published as an HTML table that I was able to use a browser-based scraper to create an Excel file. It is this Excel file that I’m using.

s2023_imported <- read_excel("data-raw/2023-Salary-Report-as-of-Sept-15-2023.xlsx") |> clean_names() |> 
  add_column(year = "2023", .before = "first_name") |> 
  rename(position = position_s,
         club_long = club)

s2023_imported |> glimpse()
Rows: 924
Columns: 7
$ year                    <chr> "2023", "2023", "2023", "2023", "2023", "2023"…
$ first_name              <chr> "Luis", "Lalas", "Nicolás", "Alonso", "Ifunany…
$ last_name               <chr> "Abram", "Abubakar", "Acevedo", "Aceves", "Ach…
$ club_long               <chr> "Atlanta United", "Colorado Rapids", "New York…
$ position                <chr> "D", "D", "D-M", "D", "F", "D-M", "M", "M-F", …
$ base_salary             <dbl> 556364, 665000, 230000, 330000, 85444, 1250000…
$ guaranteed_compensation <dbl> 695977, 702125, 274800, 368900, 85444, 1365000…

2024 data

This 2024 Excel file was created by scraping an HTML table from the MLSPA site. There is a May 2024 csv file available, but it is missing Toronto FC player Aimé Mabika.

s2024_imported <- read_excel("data-raw/2024-Salary-Report-as-of-Apr-25-2023.xlsx") |> clean_names() |> 
  add_column(year = "2024", .before = "first_name") |>
  rename(position = position_s,
         club_long = club)

s2024_imported |> glimpse()
Rows: 874
Columns: 7
$ year                    <chr> "2024", "2024", "2024", "2024", "2024", "2024"…
$ first_name              <chr> "Liel", "Luis", "Lalas", "Nicolás", "Kellyn", …
$ last_name               <chr> "Abada", "Abram", "Abubakar", "Acevedo", "Acos…
$ club_long               <chr> "Charlotte FC", "Atlanta United", "Colorado Ra…
$ position                <chr> "Right Wing", "Center-back", "Center-back", "D…
$ base_salary             <dbl> 2300000, 732275, 665000, 300000, 1392188, 2960…
$ guaranteed_compensation <dbl> 2448500, 871888, 702125, 344800, 1558869, 4216…

Checking 2024 csv

I would prefer to use this downloadable csv, but it is missing a player.

s2024_csv <- read_csv("data-raw/MLS-Salaries-as-of-4-25-2024-for-release-Final.txt") |> clean_names() |> 
  add_column(year = "2024", .before = "first_name") |>
  rename(position = position_s,
         club_long = club,
         guaranteed_compensation = guaranteed_comp) |> 
  mutate(across(6:7, parse_number))
Rows: 873 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): First Name, Last Name, Club, Position(s), Base Salary, Guaranteed Comp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
s2024_csv |> 
  glimpse()
Rows: 873
Columns: 7
$ year                    <chr> "2024", "2024", "2024", "2024", "2024", "2024"…
$ first_name              <chr> "Daniel", "Santiago", "Franco", "Edwin", "Tris…
$ last_name               <chr> "Ríos", "Sosa", "Ibarra", "Mosquera", "Muyumba…
$ club_long               <chr> "Atlanta United", "Atlanta United", "Atlanta U…
$ position                <chr> "Center Forward", "Defensive Midfield", "Defen…
$ base_salary             <dbl> 89716, 685000, 600000, 400000, 525000, 500000,…
$ guaranteed_compensation <dbl> 126383, 778100, 660000, 437000, 566600, 595000…

Who is missing?

s2024_imported |> 
  anti_join(s2024_csv)
Joining with `by = join_by(year, first_name, last_name, club_long, position,
base_salary, guaranteed_compensation)`

Merge non-dip data

The tabula and excel data both use long club names, so I’ll put those together.

non_dip_data <- s2018_imported |> 
  bind_rows(s2019_imported, s2020_imported, s2021_imported, s2022_imported, s2023_imported, s2024_imported)

non_dip_data |> glimpse()
Rows: 5,696
Columns: 7
$ year                    <chr> "2018", "2018", "2018", "2018", "2018", "2018"…
$ club_long               <chr> "New York Red Bulls", "New York City FC", "Col…
$ last_name               <chr> "Abang", "Abdul-Salaam", "Abu", "Abubakar", "A…
$ first_name              <chr> "Anatole", "Saad", "Mohammed", "Lalas", "David…
$ position                <chr> "F", "D", "M", "D", "F-M", "M", "F-M", "M", "M…
$ base_salary             <dbl> 68927.00, 106480.00, 175008.00, 68250.04, 1250…
$ guaranteed_compensation <dbl> 68927.00, 120230.00, 181258.00, 75750.04, 1250…

Cleaning long names in non-dip data

Checking the names for errors.

non_dip_data |> 
  count(club_long)

Some things of note:

  • 2021 New England Revolution may be missing last n.
  • Montreal should probably be CF Montréal
  • “MLS Pool” changed to “Major League Soccer” for consistency

There are other changes as well based on usfootballR package, etc.

non_dip_changed <- non_dip_data |> 
  mutate(
    club_long_changed = case_match(
      club_long,
      "Atlanta United" ~ "Atlanta United FC",
      "Chicago Fire" ~ "Chicago Fire FC",
      "Houston Dynamo" ~ "Houston Dynamo FC",
      "CF Montreal" ~ "CF Montréal",
      "Montreal" ~ "CF Montréal",
      # "Montreal Impact" ~ "CF Montréal",
      "Inter Miami" ~ "Inter Miami CF",
      "New England Revolutio" ~ "New England Revolution",
      "DC United" ~ "D.C. United",
      "St. Louis City SC" ~ "St. Louis CITY SC",
      "Minnesota United" ~ "Minnesota United FC",
      "MLS Pool" ~ "Major League Soccer",
      .default = club_long
    )
  )

non_dip_changed  |> 
  count(club_long, club_long_changed)

Now we’ll fix up the columns.

non_dip_cleaned <- non_dip_changed |> 
  select(-club_long) |> 
  rename(club_long = club_long_changed) |> 
  relocate(year, club_long)

non_dip_cleaned |> glimpse()
Rows: 5,696
Columns: 7
$ year                    <chr> "2018", "2018", "2018", "2018", "2018", "2018"…
$ club_long               <chr> "New York Red Bulls", "New York City FC", "Col…
$ last_name               <chr> "Abang", "Abdul-Salaam", "Abu", "Abubakar", "A…
$ first_name              <chr> "Anatole", "Saad", "Mohammed", "Lalas", "David…
$ position                <chr> "F", "D", "M", "D", "F-M", "M", "F-M", "M", "M…
$ base_salary             <dbl> 68927.00, 106480.00, 175008.00, 68250.04, 1250…
$ guaranteed_compensation <dbl> 68927.00, 120230.00, 181258.00, 75750.04, 1250…

Building both names

I’d like to have club_long and club_short for each row of data, but to do that I need to build a conversion chart of some kind.

It is club_short that I’d like to be consistent for a market, where club_long changes as branding changes.

How I tackled this:

  • Get a list of all the short names
  • Get a list of all the long names
  • Add them both to a Google Sheet and match them manually
  • Download that sheet and then use joins to fill gaps

I used the following to get the names, though just to copy into a spreadsheet for a lot of manual research and work.

clubs_short <- dip_cleaned |> distinct(year, club_short) |> 
  arrange(year, club_short)

# clubs_short |> clipr::write_clip()

clubs_long <- non_dip_data |> distinct(year, club_long) |> 
  arrange(year, club_long)

# clubs_long |> clipr::write_clip()

Downloading in the spreadsheet

The edited Google Sheet is here: rws-mls-salaries

Download the clubs list. Commented out after saved locally. Then viewed club names to verify/check them.

# download.file("https://docs.google.com/spreadsheets/d/e/2PACX-1vQqXJxbbrBsikirZrGyXYV_G6cFZp_dYmcf52UfSYM7Kw3akGlkO5jKP8ZL8WtRA5qUJgFMNPG8JYov/pub?output=csv", "data-processed/clubs-list.csv")

clubs_list <- read_csv("data-processed/clubs-list.csv") |> 
  mutate(year = as.character(year))
Rows: 410 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): club_short, club_long, conference
dbl (1): year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
clubs_list |> 
  count(club_short, club_long)

Fill in our DIP

The DIP data doesn’t have club_long. We’ll join with our master clubs list to get them.

dip_full <- dip_cleaned |> 
  left_join(clubs_list, join_by(year == year, club_short == club_short))

dip_full |> head()

Check for those that did not get a long name to see why.

dip_full |> 
  filter(is.na(club_long))

These players are not attached to teams.

Fill in non-dip

The non-DIP data doesn’t have the short name, so we join with clubs_list reference file for them.

non_dip_full <- non_dip_cleaned |> 
  left_join(clubs_list, join_by(year == year, club_long == club_long))

non_dip_full |> head()

Check those that did not get a short name to research why.

non_dip_full |> 
  filter(is.na(club_short))

Merge all the versions together

binded <- dip_full |> 
  bind_rows(non_dip_full) |> 
  select(
    year,
    club_short,
    last_name,
    first_name,
    position,
    base_salary,
    compensation = guaranteed_compensation,
    club_long,
    conference
  )

binded |> glimpse()
Rows: 11,249
Columns: 9
$ year         <chr> "2007", "2007", "2007", "2007", "2007", "2007", "2007", "…
$ club_short   <chr> "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "…
$ last_name    <chr> "Armas", "Banner", "Barrett", "Blanco", "Brown", "Busch",…
$ first_name   <chr> "Chris", "Michael", "Chad", "Cuauhtemoc", "C.J.", "Jon", …
$ position     <chr> "M", "M", "F", "F", "D", "GK", "F", "D", "M", "D", "D", "…
$ base_salary  <dbl> 225000.0, 12900.0, 41212.5, 2492316.0, 106391.0, 58008.0,…
$ compensation <dbl> 225000.0, 12900.0, 48712.5, 2666778.0, 106391.0, 58008.0,…
$ club_long    <chr> "Chicago Fire FC", "Chicago Fire FC", "Chicago Fire FC", …
$ conference   <chr> "Eastern", "Eastern", "Eastern", "Eastern", "Eastern", "E…

Clean positions

Peek at the positions.

binded |> 
  count(position)
## Help to get me a list of positions
# binded |> distinct(position) |> pull() |> clipr::write_clip()

It’s clear some cleanup is needed, and will take some decisions. I’m going to go with letter designations with hyphens when necessary.

positions <- binded |> 
  mutate(
    position_new = 
    case_match(
      position,
      # "M" ~ "",
      # "F" ~ "",
      # "D" ~ "",
      # "GK" ~ "",
      # "D-M" ~ "",
      # "F-D" ~ "",
      # "F-M" ~ "",
      # "M-D" ~ "",
      # "M-F" ~ "",
      # "D-F" ~ "",
      "MF" ~ "M-F",
      "M/F" ~ "M-F",
      "M/D" ~ "M-D",
      "D/M" ~ "D-M",
      "D/F" ~ "D-F",
      "F/M" ~ "F-M",
      # "NA" ~ "",
      "Right Wing" ~ "F",
      "Center-back" ~ "D",
      "Defensive Midfield" ~ "D-M",
      "Attacking Midfield" ~ "M-F",
      "Center Forward" ~ "F",
      "Left-back" ~ "D",
      "Central Midfield" ~ "M",
      "Forward" ~ "F",
      "Right-back" ~ "F",
      "Left Wing" ~ "F",
      "Goalkeeper" ~ "GK",
      "Left Midfield" ~ "M",
      "Right Midfield" ~ "M",
      "Midfielder" ~ "M",
      "Defender" ~ "D",
      .default = position
    )
  )

positions |> 
  count(position, position_new)

Clean up the columns

positions_fixed <- positions |> 
  select(!position) |> 
  rename(position = position_new) |> 
  relocate(position, .after = first_name)

positions_fixed  

Export the cleaned data

Assign last cleaning dataframe and export

mls_salaries <- positions_fixed |> 
  arrange(year, club_short, last_name, first_name)

mls_salaries |> write_rds("data-processed/mls-salaries.rds")