Cleaning

Goals of this notebook

Our data all comes from the MLS Players Association but it was published in different formats throughout the years.

  • We are starting with some historical data already cleaned by Jeremy Singer-Vine for the Data is Plural newsletter: Github archive. It goes from 2007 to 2017.
    • It’s interesting that the DIP PDF for 2017 is different from what I downloaded from the MLSPA site. It appears to have pool MLS players that the downloaded PDFs do not.
  • I used Tablula to convert 2018 to 2021 data from PDF into CSVs. The data won’t have headers and each file is a litte different.
  • The 2022 is available as an Excel file. I did need cleaning.
  • The initial version of 2023 data came in Excel, but the September 2023 update was published as a PDF and also as an HTML table that I was able to collect through browser-based scraper.

Once all the data was imported, I still needed to do some normalization:

  • All columns needed the same datatype, etc.
  • We needed to normalize team names. Sometimes team initials are used, and other times full names are used. I wanted both. I used this data to build a complete reference file in Google Sheets and the re-updated the data with that. I also consulted usfootballr for how names and abbreviations were used in hopes of joining with data from that package.
  • I wanted to add conference affiliations, which I did manually in the Google Sheet reference.

Things to do

  • I need to double-check all the conference designations before I actually use them.
  • I might add color values to the Google Sheet reference since all the packages I’ve checked have been out of date.

Setup

library(tidyverse)
library(readxl)
library(janitor)
library(teamcolors)

Import the archive data

This brings in the Data is Plural files, which should all be the same.

dip_data <- read_csv(c(
  "data-processed/data-is-plural/mls-salaries-2007.csv",
  "data-processed/data-is-plural/mls-salaries-2008.csv",
  "data-processed/data-is-plural/mls-salaries-2009.csv",
  "data-processed/data-is-plural/mls-salaries-2010.csv",
  "data-processed/data-is-plural/mls-salaries-2011.csv",
  "data-processed/data-is-plural/mls-salaries-2012.csv",
  "data-processed/data-is-plural/mls-salaries-2013.csv",
  "data-processed/data-is-plural/mls-salaries-2014.csv",
  "data-processed/data-is-plural/mls-salaries-2015.csv",
  "data-processed/data-is-plural/mls-salaries-2016.csv",
  "data-processed/data-is-plural/mls-salaries-2017.csv"),
  id = "file_name"
) |> 
  mutate(
    year = str_sub(file_name, 44, 47)
  ) |> 
  relocate(year) |> # moving year to front
  select(!file_name) |>  # dropping file_name
  rename(club_short = club)
Rows: 5553 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): club, last_name, first_name, position
dbl (2): base_salary, guaranteed_compensation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# peek
dip_data |> head()

Cleaning club abbeviations

There are times when club abbreviations vary. We fix those here.

Checking them:

dip_data |> 
  count(club_short)

Looking through this, I want to make a number of changes. I also ended up checking the usfootballR package for how they do abbreviations. Using SKC for historical data since it was the same club that changed names.

dip_changed <- dip_data |> 
  mutate(
    club_short_clean = recode(
      club_short,
      "KC" = "SKC",
      "MNUFC" = "MIN",
      "TFC" = "TOR",
      "NYRB" = "NY",
      "NYCFC" = "NYC",
      "None" = NULL,
      "POOL" = "MLS",
      "Pool" = "MLS"
    )
  )

Checking the changes to make sure they are right.

dip_changed |> 
  count(club_short, club_short_clean)

Cleaning column names for DIP files

dip_cleaned <- dip_changed |> 
  select(!club_short) |> 
  rename(club_short = club_short_clean) |> 
  relocate(year, club_short)

dip_cleaned |> head()  

Cleaning Tabula conversions

Each of these files from years 2018 to 2021 will have different issues.

2018 salaries

For this one, the dollar signs came in as their own column, which was actually helpful so the numbers were actually numbers. I also need to add the year.

s2018_imported <- read_csv(
  "data-raw/tabula-2018-09-15-Salary-Information-Alphabetical.csv",
  col_names = c(
    "club_long",
    "last_name",
    "first_name",
    "position",
    "currency_1",
    "base_salary",
    "currency_2",
    "guaranteed_compensation"
  )
) |> 
  select(!starts_with("currency")) |> # removes dollar signs
  add_column(year = "2018", .before = "club_long")
Rows: 694 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, currency_1, currency_2
num (2): base_salary, guaranteed_compensation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
s2018_imported |> head()

2019 salaries

For this one, our salaries come in with $ in the salary fields, and also with a trailing space and ). We import those a “raw” versions and clean them before using type_convert() to have dplyr try to type them again, changing them to numberic. For some reason I couldn’t just convert to as.numeric() even after it was cleaned up as they came up NA. I also add the year.

s2019_imported <- read_csv(
  "data-raw/tabula-2019-Salary-List-Fall-Release-FINAL-Salary-List-Fall-Release-MLS.csv",
  col_names = c(
    "club_long",
    "last_name",
    "first_name",
    "position",
    "base_salary_raw",
    "guaranteed_compensation_raw"
  )
) |> 
  mutate(
    base_salary = str_remove_all(base_salary_raw, "[\\)\\$]") |> str_trim(),
    guaranteed_compensation = str_remove_all(guaranteed_compensation_raw, "[\\)\\$]") |> str_trim()
  ) |> 
  type_convert() |> 
  select(!ends_with("raw")) |> 
  add_column(year = "2019", .before = "club_long")
Rows: 714 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, base_salary_raw, guaran...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

── Column specification ────────────────────────────────────────────────────────
cols(
  club_long = col_character(),
  last_name = col_character(),
  first_name = col_character(),
  position = col_character(),
  base_salary_raw = col_character(),
  guaranteed_compensation_raw = col_character(),
  base_salary = col_number(),
  guaranteed_compensation = col_number()
)
s2019_imported |> head()

2020 salaries

s2020_imported <- read_csv(
  "data-raw/tabula-2020-Fall-Winter-Salary-List-alphabetical.csv",
  col_names = c(
    "last_name",
    "first_name",
    "club_long",
    "position",
    "base_salary_raw",
    "guaranteed_compensation_raw"
  )
) |> 
  mutate(
    base_salary = str_remove(base_salary_raw, "\\$"),
    guaranteed_compensation = str_remove(guaranteed_compensation_raw, "\\$"),
  ) |> 
  select(!ends_with("raw")) |> 
  type_convert() |> 
  add_column(year = "2020", .before = "last_name")
Rows: 778 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): last_name, first_name, club_long, position, base_salary_raw, guaran...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

── Column specification ────────────────────────────────────────────────────────
cols(
  last_name = col_character(),
  first_name = col_character(),
  club_long = col_character(),
  position = col_character(),
  base_salary = col_number(),
  guaranteed_compensation = col_number()
)
s2020_imported |> head()

2021 salaries

s2021_imported <- read_csv(
  "data-raw/tabula-2021-MLSPA-Fall-Salary-release (1).csv",
  col_names = c(
    "club_long",
    "last_name",
    "first_name",
    "position",
    "base_salary_raw",
    "guaranteed_compensation_raw"
  )
) |> 
  mutate(
    base_salary = str_remove(base_salary_raw, "\\$"),
    guaranteed_compensation = str_remove(guaranteed_compensation_raw, "\\$"),
  ) |> 
  select(!ends_with("raw")) |> 
  type_convert() |> 
  add_column(year = "2021", .before = "club_long")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 871 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): club_long, last_name, first_name, position, base_salary_raw, guaran...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

── Column specification ────────────────────────────────────────────────────────
cols(
  club_long = col_character(),
  last_name = col_character(),
  first_name = col_character(),
  position = col_character(),
  base_salary = col_number(),
  guaranteed_compensation = col_number()
)
s2021_imported |> head()

Import excel data

2022 data

s2022_imported <- read_excel("data-raw/MLS Salary 2022.xlsx") |> clean_names() |> 
  add_column(year = "2022", .before = "first_name") |> 
  rename(position = position_s,
         club_long = club)

s2022_imported |> glimpse()
Rows: 841
Columns: 7
$ year                    <chr> "2022", "2022", "2022", "2022", "2022", "2022"…
$ first_name              <chr> "Artur", "Nanú", "Zeca", "Luquinhas", "Judson"…
$ last_name               <chr> NA, NA, NA, NA, NA, NA, "Ågren", "Aaronson", "…
$ club_long               <chr> "Columbus Crew", "FC Dallas", "Houston Dynamo"…
$ position                <chr> "M", "D", "D", "F", "M", "D", "D", "M", "D", "…
$ base_salary             <dbl> 564000, 600000, 218182, 1050000, 420000, 40000…
$ guaranteed_compensation <dbl> 615633, 668750, 259015, 1173125, 425000, 42875…

2023 data

The first version for 2023 was available as an Excel file. However, there was an update in September of 2023 that was available for download as a PDF (saved in data-raw) but also published as an HTML table that I was able to use a browser-based scraper to create an Excel file. It is this Excel file that I’m using.

s2023_imported <- read_excel("data-raw/2023-Salary-Report-as-of-Sept-15-2023.xlsx") |> clean_names() |> 
  add_column(year = "2023", .before = "first_name") |> 
  rename(position = position_s,
         club_long = club)

s2023_imported |> glimpse()
Rows: 924
Columns: 7
$ year                    <chr> "2023", "2023", "2023", "2023", "2023", "2023"…
$ first_name              <chr> "Luis", "Lalas", "Nicolás", "Alonso", "Ifunany…
$ last_name               <chr> "Abram", "Abubakar", "Acevedo", "Aceves", "Ach…
$ club_long               <chr> "Atlanta United", "Colorado Rapids", "New York…
$ position                <chr> "D", "D", "D-M", "D", "F", "D-M", "M", "M-F", …
$ base_salary             <dbl> 556364, 665000, 230000, 330000, 85444, 1250000…
$ guaranteed_compensation <dbl> 695977, 702125, 274800, 368900, 85444, 1365000…

Merge non-dip data

The tabula and excel data both use long club names, so I’ll put those together.

non_dip_data <- s2018_imported |> 
  bind_rows(s2019_imported, s2020_imported, s2021_imported, s2022_imported, s2023_imported)

non_dip_data |> glimpse()
Rows: 4,822
Columns: 7
$ year                    <chr> "2018", "2018", "2018", "2018", "2018", "2018"…
$ club_long               <chr> "New York Red Bulls", "New York City FC", "Col…
$ last_name               <chr> "Abang", "Abdul-Salaam", "Abu", "Abubakar", "A…
$ first_name              <chr> "Anatole", "Saad", "Mohammed", "Lalas", "David…
$ position                <chr> "F", "D", "M", "D", "F-M", "M", "F-M", "M", "M…
$ base_salary             <dbl> 68927.00, 106480.00, 175008.00, 68250.04, 1250…
$ guaranteed_compensation <dbl> 68927.00, 120230.00, 181258.00, 75750.04, 1250…

Cleaning long names in non-dip data

Checking the names for errors.

non_dip_data |> 
  count(club_long)

Some things of note:

  • 2021 New England Revolution may be missing last n.
  • Montreal should probably be CF Montréal

There are other changes as well based on usfootballR package, etc.

non_dip_changed <- non_dip_data |> 
  mutate(
    club_long_changed = recode(
      club_long,
      "Atlanta United" = "Atlanta United FC",
      "Chicago Fire" = "Chicago Fire FC",
      "Houston Dynamo" = "Houston Dynamo FC",
      "Montreal" = "CF Montréal",
      "Inter Miami" = "Inter Miami CF",
      "New England Revolutio" = "New England Revolution",
      "DC United" = "D.C. United",
      "St. Louis City SC" = "St. Louis CITY SC",
      "Minnesota United" = "Minnesota United FC"
    )
  )

non_dip_changed  |> 
  count(club_long, club_long_changed)

Now we’ll fix up the columns.

non_dip_cleaned <- non_dip_changed |> 
  select(-club_long) |> 
  rename(club_long = club_long_changed) |> 
  relocate(year, club_long)

non_dip_cleaned |> glimpse()
Rows: 4,822
Columns: 7
$ year                    <chr> "2018", "2018", "2018", "2018", "2018", "2018"…
$ club_long               <chr> "New York Red Bulls", "New York City FC", "Col…
$ last_name               <chr> "Abang", "Abdul-Salaam", "Abu", "Abubakar", "A…
$ first_name              <chr> "Anatole", "Saad", "Mohammed", "Lalas", "David…
$ position                <chr> "F", "D", "M", "D", "F-M", "M", "F-M", "M", "M…
$ base_salary             <dbl> 68927.00, 106480.00, 175008.00, 68250.04, 1250…
$ guaranteed_compensation <dbl> 68927.00, 120230.00, 181258.00, 75750.04, 1250…

Building both names

I’d like to have club_long and club_short for each row of data, but to do that I need to build a conversion chart of some kind. How I tackled this:

  • Get a list of all the short names
  • Get a list of all the long names
  • Add them both to a Google Sheet and match them manually
  • Download that sheet and then use joins to fill gaps

I used the following to get the names, though just to copy into a spreadsheet for a lot of manual research and work.

clubs_short <- dip_cleaned |> distinct(year, club_short) |> 
  arrange(year, club_short)

# clubs_short |> clipr::write_clip()

clubs_long <- non_dip_data |> distinct(year, club_long) |> 
  arrange(year, club_long)

# clubs_long |> clipr::write_clip()

Downloading in the spreadsheet

The edited Google Sheet is here: rws-mls-salaries

Download the clubs list. Commented out after saved locally. Then viewed club names to verify/check them.

# download.file("https://docs.google.com/spreadsheets/d/e/2PACX-1vQqXJxbbrBsikirZrGyXYV_G6cFZp_dYmcf52UfSYM7Kw3akGlkO5jKP8ZL8WtRA5qUJgFMNPG8JYov/pub?output=csv", "data-processed/clubs-list.csv")

clubs_list <- read_csv("data-processed/clubs-list.csv") |> 
  mutate(year = as.character(year))
Rows: 379 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): club_short, club_long, conference
dbl (1): year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
clubs_list |> 
  count(club_short, club_long)

Fill in our DIP

The DIP data doesn’t have club_long. We’ll join with our master clubs list to get them.

dip_full <- dip_cleaned |> 
  left_join(clubs_list, join_by(year == year, club_short == club_short))

dip_full |> head()

Check for those that did not get a long name to see why.

dip_full |> 
  filter(is.na(club_long))

Fill in non-dip

The non-DIP data doesn’t have the short name, so we join with clubs_list reference file for them.

non_dip_full <- non_dip_cleaned |> 
  left_join(clubs_list, join_by(year == year, club_long == club_long))

non_dip_full |> head()

Check those that did not get a short name to research why.

non_dip_full |> 
  filter(is.na(club_short))

Merge all the versions together

mls_salaries <- dip_full |> 
  bind_rows(non_dip_full) |> 
  select(
    year,
    club_short,
    last_name,
    first_name,
    position,
    base_salary,
    compensation = guaranteed_compensation,
    club_long,
    conference
  )

mls_salaries |> glimpse()
Rows: 10,375
Columns: 9
$ year         <chr> "2007", "2007", "2007", "2007", "2007", "2007", "2007", "…
$ club_short   <chr> "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "CHI", "…
$ last_name    <chr> "Armas", "Banner", "Barrett", "Blanco", "Brown", "Busch",…
$ first_name   <chr> "Chris", "Michael", "Chad", "Cuauhtemoc", "C.J.", "Jon", …
$ position     <chr> "M", "M", "F", "F", "D", "GK", "F", "D", "M", "D", "D", "…
$ base_salary  <dbl> 225000.0, 12900.0, 41212.5, 2492316.0, 106391.0, 58008.0,…
$ compensation <dbl> 225000.0, 12900.0, 48712.5, 2666778.0, 106391.0, 58008.0,…
$ club_long    <chr> "Chicago Fire FC", "Chicago Fire FC", "Chicago Fire FC", …
$ conference   <chr> "Eastern", "Eastern", "Eastern", "Eastern", "Eastern", "E…

Export the cleaned data

mls_salaries |> write_rds("data-processed/mls-salaries.rds")