Using Logistic Regression to Find the Best Features for Fighters
What does the statistically ultimate fighter look like? In this article I explore the use of logistic regression to determine which features give a fighter the edge in the sport of mixed martial arts.
What does it take to be a good fighter?
Perhaps I, typing away at my desk while sipping some hot cocoa, may not quite know.
What I do know is a bit of computational economics, and through a model known as logistic regression, we are going to dig into the numbers to see if we can find out something new that perhaps even the most veteran coach or skilled fighter may not know yet.
We have a dataset of 5,512 fights taken from Kaggle user mdabbert and cleaned slightly by user Andrew Ritchie here. Each row describes a bout between two fighters including their physical attributes, fight record, betting odds, and the eventual outcome of the fight.
Our analysis begins by treating this as a binary classification problem. We want to look at several of the features mentioned above and determine whether each has a significant impact on one fighter's chances of winning. Enter:
The logistic regression
A logistic regression is a statistical tool that helps us predict binary classes. Unlike linear regression, which predicts a continuous variable, a logistic regression predicts discrete classes, which makes it well suited for predicting a yes/no or a win/loss.
This works by passing the output of a linear regression through a sigmoid function. The sigmoid function takes our input $y$ and returns $\sigma$, which is always a value between 0 and 1. The closer the output is to 0, the stronger the prediction that the outcome is a loss for one fighter, and vice-versa.
Linear regression:
$ \large y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n $
Sigmoid function:
$ \large \sigma = \frac{1}{1 + \large e^{-y}} $
Logistic regression:
$ \large \sigma = \frac{1}{1 + \large e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n)}} $
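For instance, a linear index of $y = 0.8$ gives $\sigma = \frac{1}{1 + e^{-0.8}} \approx 0.69$, i.e. roughly a 69% predicted chance of a win for that fighter.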
Let's try recreating that in Python:
# import all needed packages
import pandas as pd
import numpy as np
import copy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from scipy.optimize import minimize
# define a function that takes in our data (df) and the coefficients we'll be estimating (beta)
def sigmoid(df, beta):
    # dot-multiply df and beta to match each beta to its corresponding independent variable;
    # Xb is essentially a linear regression model
    Xb = np.dot(df, beta)
    # exponentiate Xb; this becomes the numerator of each class's probability
    eXb = np.exp(Xb)
    # this is where we go off script a bit: instead of the plain sigmoid we use its generalization,
    # the softmax, which normalizes across classes and lets us handle multinomial choice models
    probability = eXb / eXb.sum(1)[:, None]
    return probability
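Before we move on, here is a quick sanity check of that function on some made-up numbers (the toy inputs below are purely illustrative, not from our dataset): every row of the output should be a valid probability distribution over the classes.
# toy example: 3 "fights", 2 columns (a constant plus one feature), 2 classes
toy_X = np.array([[1.0, 0.5],
                  [1.0, -1.2],
                  [1.0, 2.0]])
# one column of betas per class; the first class's betas are fixed at 0 (more on that below)
toy_beta = np.array([[0.0, 0.3],
                     [0.0, -0.8]])
toy_probs = sigmoid(toy_X, toy_beta)
print(toy_probs)              # each row holds P(class 0) and P(class 1)
print(toy_probs.sum(axis=1))  # each row sums to 1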
You might notice one part of the code we haven't discussed yet: estimating the beta coefficients. To do this, we will use something called:
Maximum Likelihood Estimation (MLE)
MLE is an estimation method that takes our observed data as given and tests different sets of coefficients $\beta$ to see which set is most likely to have produced our actual dataset (much like estimating $\hat{\mu}$ and $\hat{\sigma}$ for a distribution).
In statistical terms, if the probability density function is given as:
$$ Pr (data | distribution) $$
Then likelihood is:
$$ L (distribution | data) $$
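Concretely, for our softmax setup the quantity we want to maximize is the log of that likelihood, summed over every fight $i$ and outcome class $j$:

$$ \large \ln L(\beta) = \sum_{i} \sum_{j} d_{ij} \ln p_{ij} $$

where $d_{ij}$ is 1 if fight $i$ actually ended in class $j$ (and 0 otherwise), and $p_{ij}$ is the softmax probability from our sigmoid function above.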
We're not out of the woods yet, because we still need a way of converging on the best likelihood value as we test multiple guesses, or iterations, of $\beta$. I'll spare you the calculus and present the log-likelihood function, which will do just that:
# define a function that takes in our coefficient guesses (betas) and *args
# *args contains our dependent variable (y), independent variables (X), the number of parameters/X's (n_params), and the number of classes (n_classes)
def LL(betas, *args):
    # unpack args
    y, X, n_params, n_classes = args[0], args[1], args[2], args[3]
    # reshape betas into an (n_params x n_classes) matrix so we can properly dot multiply them in our sigmoid function
    beta_shaped = np.array(betas).reshape(n_params, -1, order='F')
    # fix the first class's entire coefficient column at 0 so we only fit J-1 parameter sets (for identification)
    beta_shaped[:, 0] = [0] * n_params
    # one-hot encode the outcome with dummy variables, in line with our use of softmax
    d = pd.get_dummies(y).to_numpy()
    probs = sigmoid(X, beta_shaped)
    log_probs = np.log(probs)
    # the log-likelihood sums the log-probabilities of the classes that actually occurred
    ll = d * log_probs
    # return the negative so that scipy's minimize() effectively maximizes the likelihood
    return -np.sum(ll)
Okay that was a lot to take in (especially if you're writing this in the morning like I am) so let's take a step back and look at why we're writing all these functions in the first place. We want to find out:
What does it take to be the statistically advantageous fighter?
Well, we can start by asking the question "what do we normally think of as being advantageous in a fight?"
Traditionally in combat sports, we have what is called "the tale of the tape" which highlights a few key stats about each fighter that may influence the outcome of the fight. In the UFC, one of the biggest MMA promotions today, these stats are: age, height, weight, and reach.
Conventional wisdom dictates that a weight disparity between two combatants would have the most significant impact on who wins a fight. Because of this, the pool of professional fighters is always divided into weight classes, which means this variable is held roughly constant for every observation.
(Side note: it is common these days for athletes to "cut weight" for fights. They shed a ridiculous amount of body weight to make the limit at the weigh-in for their division, then immediately rehydrate and bulk back up before the actual fight a few days later. It would be interesting to see how the regular "walkaround" weight disparity between fighters may play a role in determining the outcome of a fight, but we'll leave that scraping task for another day.)
What does our data look like?
Our table contains 5,512 observations of UFC fights from March 21, 2010 (billed as UFC Live: Vera vs. Jones) through December 10, 2022 (UFC 282). There are some inconsistencies and inaccuracies in the data which I won't get into here; we will either clean those features as we go or not use them at all.
# Read data from the CSV file
ufc_data = pd.read_csv(r"/Users/shawnlee/Documents/School/USF/ECON611/final_project/ufc-master-[corrected3].csv")
ufc_data.head()
| | R_fighter | B_fighter | R_odds | B_odds | R_ev | B_ev | date | location | country | Winner | ... | finish_details | finish_round | finish_round_time | total_fight_time_secs | r_dec_odds | b_dec_odds | r_sub_odds | b_sub_odds | r_ko_odds | b_ko_odds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Paddy Pimblett | Jared Gordon | -270.0 | 220.0 | 37.037037 | 220.000000 | 2022-12-10 | Las Vegas, Nevada, USA | USA | Red | ... | NaN | 3.0 | 5:00 | 900.0 | 300.0 | 425.0 | 200.0 | 1400.0 | 265.0 | 700.0 |
| 1 | Santiago Ponzinibbio | Alex Morono | -150.0 | 125.0 | 66.666667 | 125.000000 | 2022-12-10 | Las Vegas, Nevada, USA | USA | Red | ... | Punches | 3.0 | 2:29 | 749.0 | 180.0 | 270.0 | 1600.0 | 1200.0 | 200.0 | 700.0 |
| 2 | Darren Till | Dricus Du Plessis | 180.0 | -220.0 | 180.000000 | 45.454545 | 2022-12-10 | Las Vegas, Nevada, USA | USA | Blue | ... | Neck Crank | 3.0 | 2:43 | 763.0 | 400.0 | 325.0 | 2300.0 | 475.0 | 360.0 | 180.0 |
| 3 | Bryce Mitchell | Ilia Topuria | NaN | NaN | NaN | NaN | 2022-12-10 | Las Vegas, Nevada, USA | USA | Blue | ... | Arm Triangle | 2.0 | 3:10 | 490.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Raul Rosas Jr. | Jay Perrin | -300.0 | 240.0 | 33.333333 | 240.000000 | 2022-12-10 | Las Vegas, Nevada, USA | USA | Red | ... | Neck Crank | 1.0 | 2:44 | 164.0 | 200.0 | 360.0 | 180.0 | 1000.0 | 650.0 | 800.0 |

5 rows × 119 columns
Let's start by seeing the spread of observations across our four features: height, reach, age, and betting odds. Because we have two fighters for every row of observation (Blue fighter, Red fighter), we will create a df combined_feats
which appends the rows vertically for the four features so that our df has shape (11024, 4).
# Combine
combined_feats = pd.concat([
ufc_data[['B_Height_cms', 'B_Reach_cms', 'B_age', 'B_odds']].rename(columns=lambda x: x.replace('B_', '')),
ufc_data[['R_Height_cms', 'R_Reach_cms', 'R_age', 'R_odds']].rename(columns=lambda x: x.replace('R_', ''))
], ignore_index=True)
print("combined_feats has a shape of: ", combined_feats.shape)
combined_feats has a shape of: (11024, 4)
We'll clean the data a bit with the dropna()
method, then we'll create four histograms side-by-side. We'll also add in boxplots at the bottom so we can tease out any outliers.
# Drop rows with missing values
combined_feats = combined_feats.dropna()
# Set Seaborn style
sns.set(style="whitegrid")
# Plotting histograms using Seaborn
fig, axs = plt.subplots(2, 4, figsize=(15, 4), gridspec_kw={'height_ratios': [6, 1]})
# Plot histograms for height
axs[0, 0].hist(combined_feats['Height_cms'], bins=20, color='blue', alpha=0.7, edgecolor='black')
sns.despine(ax=axs[0, 0])
axs[0, 0].set_title('Fighter heights observed')
axs[0, 0].set_xlabel('')
axs[0, 0].set_ylabel('Frequency')
# Plot histograms for reach
axs[0, 1].hist(combined_feats['Reach_cms'], bins=20, color='orange', alpha=0.7, edgecolor='black')
sns.despine(ax=axs[0, 1])
axs[0, 1].set_title('Fighter reaches observed')
axs[0, 1].set_xlabel('')
axs[0, 1].set_ylabel('Frequency')
# Plot histograms for age
axs[0, 2].hist(combined_feats['age'], bins=20, color='green', alpha=0.7, edgecolor='black')
sns.despine(ax=axs[0, 2])
axs[0, 2].set_title('Fighter ages observed')
axs[0, 2].set_xlabel('')
axs[0, 2].set_ylabel('Frequency')
# Plot histograms for odds
axs[0, 3].hist(combined_feats['odds'], bins=20, color='grey', alpha=0.7, edgecolor='black')
sns.despine(ax=axs[0, 3])
axs[0, 3].set_title('Fight odds observed')
axs[0, 3].set_xlabel('')
axs[0, 3].set_ylabel('Frequency')
# Plot boxplots using Seaborn
# Set the color palette for boxplots
colors = ['blue', 'orange', 'green', 'grey']
# Plot boxplots for height
sns.boxplot(x=combined_feats['Height_cms'], ax=axs[1, 0], vert=False, boxprops=dict(facecolor=colors[0], alpha=0.7), medianprops=dict(color='black'))
axs[1, 0].set_title('')
axs[1, 0].set_xlabel('Height (cms)')
# Plot boxplots for reach
sns.boxplot(x=combined_feats['Reach_cms'], ax=axs[1, 1], vert=False, boxprops=dict(facecolor=colors[1], alpha=0.7), medianprops=dict(color='black'))
axs[1, 1].set_title('')
axs[1, 1].set_xlabel('Reach (cms)')
# Plot boxplots for age
sns.boxplot(x=combined_feats['age'], ax=axs[1, 2], vert=False, boxprops=dict(facecolor=colors[2], alpha=0.7), medianprops=dict(color='black'))
axs[1, 2].set_title('')
axs[1, 2].set_xlabel('Age')
# Plot boxplots for odds
sns.boxplot(x=combined_feats['odds'], ax=axs[1, 3], vert=False, boxprops=dict(facecolor=colors[3], alpha=0.7), medianprops=dict(color='black'))
axs[1, 3].set_title('')
axs[1, 3].set_xlabel('Odds (lower is favored)')
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
Since fighters tend to compete within one weightclass for their whole career, their matchups with other fighters tend to be limited to whoever else is in their division. This leads to certain trends that may form within each division and can be an interesting way of clustering the fights together. Let's see how our observations are spread across weightclasses.
# Count the occurrences of each unique value in the "weight_class" column
weight_class_counts = ufc_data['weight_class'].value_counts()
# Define the upper limits for each weight class
upper_limits = {
"Women's Strawweight": 115, 'Flyweight': 125, "Women's Flyweight": 125,
"Women's Bantamweight": 135, 'Bantamweight': 135, 'Featherweight': 145,
"Women's Featherweight": 145, 'Lightweight': 155, 'Welterweight': 170,
'Middleweight': 185, 'Light Heavyweight': 205, 'Heavyweight': 265, 'Catch Weight': None
}
# Create a DataFrame with weight_class, observed_frequency
observed_frequency_df = pd.DataFrame({
'weight_class': weight_class_counts.index,
'observed_frequency': weight_class_counts.values
})
# Create a DataFrame with weight_class, upper_limit
upper_limits_df = pd.DataFrame(list(upper_limits.items()), columns=['weight_class', 'upper_limit'])
# Merge the DataFrames based on the 'weight_class' column
merged_df = pd.merge(observed_frequency_df, upper_limits_df, on='weight_class', how='left')
# Sort the DataFrame based on the 'upper_limit' column
merged_df = merged_df.sort_values(by='upper_limit')
# Separate men's and women's divisions
men_df = merged_df[~merged_df['weight_class'].str.contains("Women's")]
women_df = merged_df[merged_df['weight_class'].str.contains("Women's")]
# Set Seaborn style
sns.set(style="whitegrid")
# Plotting the reversed horizontal bar chart for men's divisions
plt.figure(figsize=(15, 3))
bars_men = sns.barplot(x=men_df['observed_frequency'], y=men_df['weight_class'],
color='darkblue', edgecolor='black', alpha=0.7)
# Plotting the reversed horizontal bar chart for women's divisions
bars_women = sns.barplot(x=women_df['observed_frequency'], y=women_df['weight_class'],
color='darkred', edgecolor='black', alpha=0.7)
# Adding labels and title
plt.xlabel('Observed Frequency')
plt.ylabel('Weight Class')
plt.title('Weight Class Observations by Frequency')
# Display the plot
plt.show()
Logistic Regression
We'll start by making some adjustments to our data. First, we create an outcome
column that gives us '1' if the Red fighter wins and '0' if the Blue fighter wins.
We'll also create odds_dif
to present the differential between Red's betting odds and Blue's betting odds (lower odds represent the favorite to win).
ufc_data['outcome'] = ufc_data['Winner'].replace({'Red':1, 'Blue':0})
ufc_data['odds_dif'] = ufc_data['R_odds'] - ufc_data['B_odds']
Now we'll standardize all our ordinal features by their z-scores.
# Standardize all ordinal
def z_std_col(df, col_name):
    std_col = StandardScaler().fit_transform(df[col_name].values.reshape(-1, 1))
    return std_col
ufc_data['age_dif_std'] = z_std_col(ufc_data, 'age_dif')
ufc_data['reach_dif_std'] = z_std_col(ufc_data, 'reach_dif')
ufc_data['height_dif_std'] = z_std_col(ufc_data, 'height_dif')
ufc_data['odds_dif_std'] = z_std_col(ufc_data, 'odds_dif')
Thankfully, the data we have already comes with a column of 1's, constant_1, but we could have easily done that ourselves with df['intercept'] = 1.
To make things neater, we will create df_logit
to contain only our intercept, independent, and dependent variables.
df_logit = ufc_data[['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std', 'outcome']].copy(deep=True)
df_logit = df_logit.dropna()
df_logit.head()
| | constant_1 | age_dif_std | reach_dif_std | height_dif_std | odds_dif_std | outcome |
|---|---|---|---|---|---|---|
| 0 | 1 | 1.308071 | -1.336975 | -0.369988 | -0.604725 | 1 |
| 1 | 1 | -0.814472 | -0.244517 | -0.369988 | -0.186696 | 1 |
| 2 | 1 | -0.235597 | 0.574826 | 0.368403 | 1.125721 | 0 |
| 4 | 1 | 2.079904 | 0.028597 | -0.739183 | -0.701941 | 1 |
| 5 | 1 | -0.235597 | -0.517631 | 0.368403 | -0.303355 | 1 |
Finally, we will use statsmodels.api
to perform our logistic regression for us.
We simply define X
as our independent variable columns, and y
as our dependent variable column.
We pass the two into sm.Logit().fit()
and use the summary()
method. Like how we discussed MLE, this method will test multiple iterations of coefficients for our X
variabes to see which iteration has the highest likelihood of producing our current dataset.
X = df_logit[['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std']]
y = df_logit['outcome']
log_reg = sm.Logit(y, X).fit()
log_reg.summary()
Optimization terminated successfully.
Current function value: 0.615285
Iterations 5
| Dep. Variable: | outcome | No. Observations: | 5489 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 5484 |
| Method: | MLE | Df Model: | 4 |
| Date: | Fri, 15 Dec 2023 | Pseudo R-squ.: | 0.09359 |
| Time: | 17:23:36 | Log-Likelihood: | -3377.3 |
| converged: | True | LL-Null: | -3726.0 |
| Covariance Type: | nonrobust | LLR p-value: | 1.235e-149 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| constant_1 | 0.3935 | 0.029 | 13.356 | 0.000 | 0.336 | 0.451 |
| age_dif_std | -0.0793 | 0.030 | -2.667 | 0.008 | -0.138 | -0.021 |
| reach_dif_std | -0.0900 | 0.041 | -2.208 | 0.027 | -0.170 | -0.010 |
| height_dif_std | 0.0275 | 0.039 | 0.707 | 0.480 | -0.049 | 0.104 |
| odds_dif_std | -0.7863 | 0.034 | -23.178 | 0.000 | -0.853 | -0.720 |
Real quickly, let's check if the functions we defined earlier would generate the same coefficients.
# convert to numpy array for more efficient processing
arr_logit = np.array(df_logit)
# initialize variables to be passed into *args
n_params = 5
n_classes = 2
# initialize starting guesses
starting_values = np.random.rand(n_params*n_classes)
# use scipy.optimize's minimize function
# minimize the function LL, with starting guesses starting_values, passing args with y, X, n_params, n_classes
results = minimize(LL, x0 = starting_values, args = (arr_logit[:, -1], arr_logit[:, :-1], n_params, n_classes))
#print results
parameter_labels = ['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std']
coefficients = results.x[n_params:]
print('\n'.join([f" {label} {round(coeff, 4)}" for i, (label, coeff) in enumerate(zip(parameter_labels, coefficients), start=1)]))
constant_1 0.3935
age_dif_std -0.0793
reach_dif_std -0.09
height_dif_std 0.0275
odds_dif_std -0.7863
Awesome! We got the same results as our statsmodels package.
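If you'd rather verify that programmatically than by eye, a quick check like this (a small sketch) should do:
# optional sanity check: the hand-rolled MLE and statsmodels coefficients should agree to a few decimals
print(np.allclose(log_reg.params.values, coefficients, atol=1e-3))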
Interpreting the logistic regression results
Before we draw conclusions from our coefficients, we must first understand that we are working with a logistic regression and not a linear regression. This means that y does not change by $\beta$ per unit change in x.

Instead, the change in $x \cdot \beta$ is passed through the sigmoid function, which has some positive or negative effect on the output depending on where we already sit on the curve. At most we can say that the direction of the effect matches the sign of $\beta$, but that's about all we can say at this point.
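To see that non-linearity concretely, here is a small sketch that plugs the fitted coefficients back into the sigmoid (the helper function and the example values below are just for illustration): the same one-standard-deviation change in odds_dif_std shifts the predicted probability by different amounts depending on where on the curve we start.
# pull the fitted coefficients from statsmodels (a pandas Series indexed by variable name)
beta_hat = log_reg.params

# hypothetical helper: predicted P(Red wins) for chosen (standardized) feature values
def predicted_prob(age=0, reach=0, height=0, odds=0):
    xb = (beta_hat['constant_1'] + beta_hat['age_dif_std'] * age
          + beta_hat['reach_dif_std'] * reach + beta_hat['height_dif_std'] * height
          + beta_hat['odds_dif_std'] * odds)
    return 1 / (1 + np.exp(-xb))

# the same +1 SD move in odds_dif_std...
print(predicted_prob(odds=1) - predicted_prob(odds=0))  # ...near the middle of the curve
print(predicted_prob(odds=4) - predicted_prob(odds=3))  # ...versus far out in the tail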
Lucky for us, we have the get_margeff()
method which translates our coefficients to more linear terms in dy/dx
:
margeff_frame = log_reg.get_margeff().summary_frame()
margeff_frame
| | dy/dx | Std. Err. | z | Pr(>|z|) | Conf. Int. Low | Conf. Int. Hi. |
|---|---|---|---|---|---|---|
| age_dif_std | -0.016932 | 0.006333 | -2.673574 | 7.504765e-03 | -0.029344 | -0.004519 |
| reach_dif_std | -0.019221 | 0.008694 | -2.210949 | 2.703939e-02 | -0.036261 | -0.002182 |
| height_dif_std | 0.005871 | 0.008307 | 0.706721 | 4.797399e-01 | -0.010411 | 0.022152 |
| odds_dif_std | -0.167899 | 0.005815 | -28.875034 | 2.457884e-183 | -0.179295 | -0.156502 |
Of course, we have another caveat in that our data was standardized by z-scores, so every unit of change is measured in standard deviations $\sigma$ rather than natural units. For example, the standard deviation of the height differences observed is 9.30 cm.

Take reach_dif_std, which has a dy/dx of -0.0192: every one-standard-deviation increase in Red's reach advantage lowers Red's predicted probability of winning by roughly 1.9 percentage points. Controversially, this suggests that having shorter arms than your opponent may actually be an advantage in professional fighting.
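If you prefer the effect per centimetre rather than per standard deviation, a rough back-of-the-envelope conversion (sketched below, and assuming the raw reach_dif column is in centimetres like the *_cms columns) is to divide the marginal effect by the un-standardized standard deviation:
# convert the per-standard-deviation marginal effect into a per-cm effect
reach_sd = ufc_data['reach_dif'].std()
per_cm_effect = margeff_frame.loc['reach_dif_std', 'dy/dx'] / reach_sd
print(f"Approximate change in Red's win probability per extra cm of reach advantage: {per_cm_effect:.4f}")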
Indeed, our analysis seems consistent for all other features:
- age_dif_std with a negative dy/dx suggests that being younger is advantageous
- height_dif_std with a positive dy/dx suggests that being taller is advantageous
- odds_dif_std with a negative dy/dx suggests that having more negative betting odds (i.e. being the favorite) is advantageous
Before we close the book on this analysis, however, we cannot just take our results as the absolute truth. With any statistical test, it is important to also evaluate our p-value to see if we have enough evidence in the data to support our claim. In this case, height_dif_std
has a surprisingly high p-value.
A p-value of 0.480 is way above the conventional threshold of 0.05. This means we do not have enough evidence to say that height has an impact on the outcome of a fight!
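(If you'd rather pull that number programmatically than read it off the summary table, the fitted results object exposes the p-values directly:)
# p-values are available as a pandas Series on the fitted Logit results
print(log_reg.pvalues['height_dif_std'])  # roughly 0.48, far above the 0.05 threshold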
Clustering data for additional analysis
Earlier we mentioned the importance of weight class and how each weight class may form different trends. Out of curiosity, let's see if we can code up a way to show how each weight class' logistic regression results vary.
Note that at this stage, we are working with significantly smaller datasets so we'll have to take any conclusions made with a grain of salt.
# Copy the original dataset
wc_logit = ufc_data[['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std', 'outcome', 'weight_class']].copy(deep=True)
wc_logit = wc_logit.dropna()
# Group by 'weight_class' and create a dictionary of DataFrames
grouped_df_dict = dict(iter(wc_logit.groupby('weight_class')))
# Initialize lists to store DataFrames for each feature
age_dif_std_dfs = []
reach_dif_std_dfs = []
height_dif_std_dfs = []
odds_dif_std_dfs = []
# Iterate over weight classes
for weightclass in grouped_df_dict:
    test_class = weightclass
    X = grouped_df_dict.get(test_class)[['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std']]
    y = grouped_df_dict.get(test_class)['outcome']
    log_reg = sm.Logit(y, X).fit(disp=0)
    margeff_frame = log_reg.get_margeff().summary_frame()
    margeff_frame_select = margeff_frame.iloc[:, [0, 3]]
    margeff_frame_select = margeff_frame_select.round(4)
    # Append DataFrames for each feature to the respective list
    age_dif_std_dfs.append(margeff_frame_select.loc['age_dif_std'].rename(weightclass))
    reach_dif_std_dfs.append(margeff_frame_select.loc['reach_dif_std'].rename(weightclass))
    height_dif_std_dfs.append(margeff_frame_select.loc['height_dif_std'].rename(weightclass))
    odds_dif_std_dfs.append(margeff_frame_select.loc['odds_dif_std'].rename(weightclass))
# Concatenate the DataFrames in each list to create the final DataFrames
age_dif_std_df = pd.concat(age_dif_std_dfs, axis=1).T
reach_dif_std_df = pd.concat(reach_dif_std_dfs, axis=1).T
height_dif_std_df = pd.concat(height_dif_std_dfs, axis=1).T
odds_dif_std_df = pd.concat(odds_dif_std_dfs, axis=1).T
# Create a 2x2 grid of scatter plots
fig, axs = plt.subplots(2, 2, figsize=(12, 6))
fig.suptitle('Scatter Plots of Marginal Effects')
# Function to add a horizontal line at the p = 0.05 significance threshold
def add_threshold_line(ax):
    ax.axhline(y=0.05, color='black', linestyle='-', linewidth=1)

# Function to add a vertical line at dy/dx = 0
def add_zero_line(ax):
    ax.axvline(x=0, color='black', linestyle='--', linewidth=1)
# Scatter plot for Age Difference Standardized
sns.scatterplot(data=age_dif_std_df, x='dy/dx', y='Pr(>|z|)', hue=age_dif_std_df.index, s=50, ax=axs[0, 0])
add_threshold_line(axs[0, 0])
add_zero_line(axs[0, 0])
axs[0, 0].set_title('Age Difference Standardized')
axs[0, 0].set_xlabel('dy/dx')
axs[0, 0].set_ylabel('Pr(>|z|)')
axs[0, 0].legend().set_visible(False)
# Scatter plot for Reach Difference Standardized
sns.scatterplot(data=reach_dif_std_df, x='dy/dx', y='Pr(>|z|)', hue=reach_dif_std_df.index, s=50, ax=axs[0, 1])
add_threshold_line(axs[0, 1])
add_zero_line(axs[0, 1])
axs[0, 1].set_title('Reach Difference Standardized')
axs[0, 1].set_xlabel('dy/dx')
axs[0, 1].set_ylabel('Pr(>|z|)')
axs[0, 1].legend().set_visible(False)
# Scatter plot for Height Difference Standardized
sns.scatterplot(data=height_dif_std_df, x='dy/dx', y='Pr(>|z|)', hue=height_dif_std_df.index, s=50, ax=axs[1, 0])
add_threshold_line(axs[1, 0])
add_zero_line(axs[1, 0])
axs[1, 0].set_title('Height Difference Standardized')
axs[1, 0].set_xlabel('dy/dx')
axs[1, 0].set_ylabel('Pr(>|z|)')
axs[1, 0].legend().set_visible(False)
# Scatter plot for Odds Difference Standardized
scatter = sns.scatterplot(data=odds_dif_std_df, x='dy/dx', y='Pr(>|z|)', hue=odds_dif_std_df.index, s=50, ax=axs[1, 1])
add_threshold_line(axs[1, 1])
add_zero_line(axs[1, 1])
axs[1, 1].set_title('Odds Difference Standardized')
axs[1, 1].set_xlabel('dy/dx')
axs[1, 1].set_ylabel('Pr(>|z|)')
axs[1, 1].legend().set_visible(False)
# Move the common legend outside the plot
scatter.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
# Display the DataFrames for each feature
print("Age Difference Standardized:")
print(age_dif_std_df, "\n")
print("Reach Difference Standardized:")
print(reach_dif_std_df, "\n")
print("Height Difference Standardized:")
print(height_dif_std_df, "\n")
print("Odds Difference Standardized:")
print(odds_dif_std_df, "\n")
Age Difference Standardized:

| weight_class | dy/dx | Pr(>|z|) |
|---|---|---|
| Bantamweight | -0.0260 | 0.1769 |
| Catch Weight | -0.0968 | 0.1440 |
| Featherweight | -0.0635 | 0.0022 |
| Flyweight | -0.0415 | 0.1716 |
| Heavyweight | -0.0074 | 0.7064 |
| Light Heavyweight | -0.0123 | 0.5475 |
| Lightweight | -0.0238 | 0.1461 |
| Middleweight | -0.0104 | 0.5792 |
| Welterweight | -0.0019 | 0.9076 |
| Women's Bantamweight | 0.0038 | 0.8994 |
| Women's Featherweight | -0.0834 | 0.3853 |
| Women's Flyweight | 0.0780 | 0.0282 |
| Women's Strawweight | -0.0376 | 0.2121 |

Reach Difference Standardized:

| weight_class | dy/dx | Pr(>|z|) |
|---|---|---|
| Bantamweight | 0.0354 | 0.1747 |
| Catch Weight | 0.0535 | 0.5316 |
| Featherweight | -0.0333 | 0.2057 |
| Flyweight | -0.0270 | 0.5685 |
| Heavyweight | -0.0977 | 0.0015 |
| Light Heavyweight | -0.0606 | 0.0473 |
| Lightweight | -0.0027 | 0.9028 |
| Middleweight | -0.0437 | 0.0841 |
| Welterweight | 0.0132 | 0.5847 |
| Women's Bantamweight | -0.0607 | 0.2921 |
| Women's Featherweight | -0.1791 | 0.2981 |
| Women's Flyweight | -0.0016 | 0.9719 |
| Women's Strawweight | 0.0000 | 0.9999 |

Height Difference Standardized:

| weight_class | dy/dx | Pr(>|z|) |
|---|---|---|
| Bantamweight | -0.0013 | 0.9630 |
| Catch Weight | 0.0356 | 0.6512 |
| Featherweight | 0.0157 | 0.5054 |
| Flyweight | 0.0631 | 0.1583 |
| Heavyweight | 0.0577 | 0.0328 |
| Light Heavyweight | 0.0028 | 0.9297 |
| Lightweight | 0.0130 | 0.5250 |
| Middleweight | 0.0163 | 0.5255 |
| Welterweight | -0.0380 | 0.0968 |
| Women's Bantamweight | -0.0014 | 0.9789 |
| Women's Featherweight | 0.0926 | 0.4381 |
| Women's Flyweight | -0.0310 | 0.5445 |
| Women's Strawweight | 0.0274 | 0.4174 |

Odds Difference Standardized:

| weight_class | dy/dx | Pr(>|z|) |
|---|---|---|
| Bantamweight | -0.1891 | 0.0000 |
| Catch Weight | -0.1614 | 0.0001 |
| Featherweight | -0.1387 | 0.0000 |
| Flyweight | -0.1520 | 0.0000 |
| Heavyweight | -0.1639 | 0.0000 |
| Light Heavyweight | -0.1601 | 0.0000 |
| Lightweight | -0.1807 | 0.0000 |
| Middleweight | -0.1539 | 0.0000 |
| Welterweight | -0.1720 | 0.0000 |
| Women's Bantamweight | -0.1516 | 0.0000 |
| Women's Featherweight | -0.2617 | 0.0000 |
| Women's Flyweight | -0.1364 | 0.0000 |
| Women's Strawweight | -0.1798 | 0.0000 |
Interestingly yet unsurprisingly, paring our 5,000+ observations down to just 200-300 per cluster significantly affects our p-values and our ability to reject the null hypothesis.
While this might seem anticlimactic, we actually do learn new tidbits of information. For example:
- Heavyweights are the only fighters who gain a statistically significant advantage from being taller.
- On the other hand, Heavyweights and Light Heavyweights are the only ones to show a significant effect from reach differential, and the advantage goes to those whose reach is shorter than their opponents'!
- With p-values virtually equal to 0, betting odds are a great predictor for fight outcome.
- Interestingly, catch weight fights (which often take place on short notice or because a fighter missed weight) have the least extreme, though still tiny, p-value of 0.0001. This suggests that the uncertain nature of catch weight fights is hard for bettors to read, too!
Final thoughts
There's still so much we can dig through and learn from studying fight data, but at the end of the day, it's great to learn that physical attributes and conventional wisdom only play a minor role in determining the outcome of a fight. Training, experience, matchups, and a lot of other factors likely play a much larger role in helping you win a fight. But until we get more granular data with bigger datasets, we will leave those features shrouded in the noise of our model.