library(tidyverse)
library(janitor)
options(dplyr.summarise.inform = FALSE)
Points vs salaries
Goal of this notebook
This notebook is a work in progress. There are no particular findings at this time.
Does spending more on players ensure a better season? I want to compare season points to salaries spent by MLS teams.
I thought I had found some standings data on Kaggle, but it only goes through part of 2021.
I worked it out using 2020 data, but there is nothing to really show.
Setup
Download
I scraped the MLS 2023 standings and put them in a Google Sheet, where I hand-edited some columns to clean them up. (It was just quicker because of some inconsistent patterns in the scraped team names.)
download.file("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwM9NKGZEc5CKVR2z7IoEUq0nij77xNu6kz8Rua4JFNfqt0sOe5RaE8gwW2lZ0nSsAFH5aCNUQJrcV/pub?output=csv", "data-raw/mls-standings-2023.csv")
I haven’t used this file yet.
Importing
mls_standings <- read_csv("data-raw/all_tables.csv") |> clean_names()
mls_salaries <- read_rds("data-processed/mls-salaries.rds")
mls_standings_2023 <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwM9NKGZEc5CKVR2z7IoEUq0nij77xNu6kz8Rua4JFNfqt0sOe5RaE8gwW2lZ0nSsAFH5aCNUQJrcV/pub?output=csv")
Salaries
Starting with salaries data since this is “cleaned” already.
sal_20 <- mls_salaries |>
  filter(year == "2020") |>
  group_by(club_long, conference) |>
  summarise(team_salary = sum(compensation))
sal_20
Standings
Checking what we have for standings data.
mls_standings |>
  summarise(
    yr_old = min(year),
    yr_new = max(year)
  )
Trying this with 2020 anyway
Peek at the data
std_20 <- mls_standings |> filter(year == "2020")
Cleaning names
There are some issues here with team names. Some names carry note markers, and some differ from the names used in the MLS salaries data.
std_20_clean <- mls_standings |>
  filter(year == "2020", conference == "Overall") |>
  mutate(
    team_clean = str_remove_all(team, regex("\\[.\\]|\\(.\\)")) |> str_trim(),
    team_new = case_match(
      team_clean,
      "Columbus Crew SC" ~ "Columbus Crew",
      "Houston Dynamo" ~ "Houston Dynamo FC",
      "Los Angeles FC" ~ "LAFC",
      "Vancouver Whitecaps FC" ~ "Vancouver Whitecaps",
      .default = team_clean
    )
  ) |>
  arrange(team_new) |>
  distinct(team_new, pts)
std_20_clean
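To see what the pattern in the cleaning step strips, here is a minimal sketch on made-up team names. The note markers `(x)` and `[a]` are assumptions for illustration only; the real standings data may use different markers.

```r
library(stringr)

# Hypothetical team names with note markers; not taken from the real data
toy_names <- c("Columbus Crew SC (x)", "Toronto FC [a]", "LAFC")

# Remove a single character wrapped in [] or (), then trim leftover spaces
cleaned <- str_remove_all(toy_names, regex("\\[.\\]|\\(.\\)")) |> str_trim()
cleaned
```

Because `.` matches exactly one character, a note longer than one character, such as `[ab]`, would survive this pattern.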
Put them together
combined_20 <- sal_20 |>
  left_join(std_20_clean, by = join_by(club_long == team_new)) |>
  drop_na()
combined_20
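Since `drop_na()` silently discards any club whose name did not match, it may help to list the mismatches before joining. The sketch below uses `anti_join()` on small made-up tables; `sal_toy` and `std_toy` are hypothetical stand-ins for `sal_20` and `std_20_clean`, and the point values are invented.

```r
library(dplyr)

# Hypothetical stand-ins for sal_20 and std_20_clean; values are invented
sal_toy <- tibble(club_long = c("Columbus Crew", "LAFC", "Houston Dynamo FC"))
std_toy <- tibble(
  team_new = c("Columbus Crew", "Los Angeles FC", "Houston Dynamo FC"),
  pts = c(10, 20, 30)
)

# Clubs in the salaries table with no matching standings row
unmatched_salaries <- anti_join(sal_toy, std_toy, by = join_by(club_long == team_new))

# Standings rows with no matching salaries club
unmatched_standings <- anti_join(std_toy, sal_toy, by = join_by(team_new == club_long))

unmatched_salaries
unmatched_standings
```

Any name that shows up in either result is a candidate for another line in the `case_match()` recode.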
Plot the relationship
ggplot(combined_20) +
  aes(x = pts, y = team_salary) +
  geom_point(aes(color = conference), size = 4) +
  # geom_text(aes(label = club_long)) +
  ggrepel::geom_text_repel(aes(label = club_long), size = 3)
There isn’t much of a relationship, at least not for this year. The plot is also difficult to read.
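One way to put a number on that impression is a correlation coefficient. This is only a sketch on invented values; `toy_20` is a hypothetical stand-in for `combined_20`, and none of the figures are real MLS data.

```r
# Invented values standing in for combined_20; not real MLS data
toy_20 <- data.frame(
  pts = c(21, 35, 41, 28, 47),
  team_salary = c(9.0e6, 12.5e6, 8.2e6, 15.1e6, 11.3e6)
)

# Pearson correlation: values near 0 suggest little linear relationship
pts_salary_cor <- cor(toy_20$pts, toy_20$team_salary)
pts_salary_cor
```

With the notebook's data, the equivalent call would be `cor(combined_20$pts, combined_20$team_salary)`.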