Using Logistic Regression to Find the Best Features for Fighters
What does the statistically ultimate fighter look like? In this article I explore the use of logistic regression to determine which features give a fighter the edge in the sport of mixed martial arts.
What does it take to be a good fighter?
Perhaps I, typing away at my desk while sipping some hot cocoa, may not quite know.
What I do know is a bit of computational economics, and through a model known as logistic regression, we are going to dig into the numbers to see if we can find out something new that perhaps even the most veteran coach or skilled fighter may not know yet.
We have a dataset of 5,512 fights taken from Kaggle user mdabbert and cleaned slightly by user Andrew Ritchie here. Each row describes a bout between two fighters including their physical attributes, fight record, betting odds, and the eventual outcome of the fight.
Our analysis begins by treating this as a binary classification problem. We want to look at several of the features mentioned above and determine whether each has a significant impact on one fighter's chances of winning. Enter:
The logistic regression
A logistic regression is a statistical tool that helps us predict binary classes. Unlike linear regression, which predicts a continuous variable, a logistic regression predicts discrete classes, which makes it well suited for predicting a yes/no or a win/loss.
This works by passing the output of a linear regression through a sigmoid function. The sigmoid function takes our input $y$ and returns $\sigma$, which is always a value between 0 and 1. The closer the output is to 0, the stronger the prediction that the outcome is a loss for one fighter, and vice-versa.
Linear regression:
$ \large y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n $
Sigmoid function:
$ \large \sigma = \frac{1}{1 + \large e^{-y}} $
Logistic regression:
$ \large \sigma = \frac{1}{1 + \large e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n)}} $
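For instance, a linear index of $y = 0.8$ gives $\sigma = \frac{1}{1 + e^{-0.8}} \approx 0.69$, i.e. roughly a 69% predicted chance of a win for that fighter.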
Let's try recreating that in Python:
# import all needed packages
import pandas as pd
import numpy as np
import copy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from scipy.optimize import minimize
# define a function that takes in our data (df) and the coefficients we'll be estimating (beta)
def sigmoid(df, beta):
    # dot-multiply df and beta to match each beta to its corresponding independent variable;
    # Xb is essentially a linear regression model
    Xb = np.dot(df, beta)
    # exponentiate Xb; this becomes the numerator of each class's probability
    eXb = np.exp(Xb)
    # this is where we go off script a bit: instead of the plain sigmoid we use its generalization,
    # the softmax, which normalizes across classes and lets us handle multinomial choice models
    probability = eXb / eXb.sum(1)[:, None]
    return probability
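Before we move on, here is a quick sanity check of that function on some made-up numbers (the toy inputs below are purely illustrative, not from our dataset): every row of the output should be a valid probability distribution over the classes.
# toy example: 3 "fights", 2 columns (a constant plus one feature), 2 classes
toy_X = np.array([[1.0, 0.5],
                  [1.0, -1.2],
                  [1.0, 2.0]])
# one column of betas per class; the first class's betas are fixed at 0 (more on that below)
toy_beta = np.array([[0.0, 0.3],
                     [0.0, -0.8]])
toy_probs = sigmoid(toy_X, toy_beta)
print(toy_probs)              # each row holds P(class 0) and P(class 1)
print(toy_probs.sum(axis=1))  # each row sums to 1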
You might notice one part of the code we haven't discussed yet: estimating the beta coefficients. To do this, we will use something called:
Maximum Likelihood Estimation (MLE)
MLE is an estimation method that takes our observed data as given and tests different sets of coefficients $\beta$ to see which set is most likely to have produced our actual dataset (much like estimating $\hat{\mu}$ and $\hat{\sigma}$ for a distribution).
In statistical terms, if the probability density function is given as:
$$ Pr (data | distribution) $$
Then likelihood is:
$$ L (distribution | data) $$
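Concretely, for our softmax setup the quantity we want to maximize is the log of that likelihood, summed over every fight $i$ and outcome class $j$:

$$ \large \ln L(\beta) = \sum_{i} \sum_{j} d_{ij} \ln p_{ij} $$

where $d_{ij}$ is 1 if fight $i$ actually ended in class $j$ (and 0 otherwise), and $p_{ij}$ is the softmax probability from our sigmoid function above.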
We're not out of the woods yet, because we still need a way of converging on the best likelihood value as we test multiple guesses, or iterations, of $\beta$. I'll spare you the calculus and present the log-likelihood function, which will do just that:
# define a function that takes in our coefficient guesses (betas) and *args
# *args contains our dependent variable (y), independent variables (X), the number of parameters/X's (n_params), and the number of classes (n_classes)
def LL(betas, *args):
    # unpack args
    y, X, n_params, n_classes = args[0], args[1], args[2], args[3]
    # reshape betas into an (n_params x n_classes) matrix so we can properly dot multiply them in our sigmoid function
    beta_shaped = np.array(betas).reshape(n_params, -1, order='F')
    # fix the first class's entire coefficient column at 0 so we only fit J-1 parameter sets (for identification)
    beta_shaped[:, 0] = [0] * n_params
    # one-hot encode the outcome with dummy variables, in line with our use of softmax
    d = pd.get_dummies(y).to_numpy()
    probs = sigmoid(X, beta_shaped)
    log_probs = np.log(probs)
    # the log-likelihood sums the log-probabilities of the classes that actually occurred
    ll = d * log_probs
    # return the negative so that scipy's minimize() effectively maximizes the likelihood
    return -np.sum(ll)
Okay that was a lot to take in (especially if you're writing this in the morning like I am) so let's take a step back and look at why we're writing all these functions in the first place. We want to find out:
What does it take to be the statistically advantageous fighter?
Well, we can start by asking the question "what do we normally think of as being advantageous in a fight?"
Traditionally in combat sports, we have what is called "the tale of the tape" which highlights a few key stats about each fighter that may influence the outcome of the fight. In the UFC, one of the biggest MMA promotions today, these stats are: age, height, weight, and reach.
Conventional wisdom dictates that a weight disparity between two combatants would have the most significant impact on who wins a fight. Because of this, the pool of professional fighters is always divided into weight classes, which means this variable is held roughly constant for every observation.
(Side note: it is common these days for athletes to "cut weight" for fights. They shed a ridiculous amount of body weight to make the limit at the weigh-in for their division, then immediately rehydrate and bulk back up before the actual fight a few days later. It would be interesting to see how the regular "walkaround" weight disparity between fighters may play a role in determining the outcome of a fight, but we'll leave that scraping task for another day.)
What does our data look like?
Our table contains 5,512 observations of UFC fights from March 21, 2010 (billed as UFC Live: Vera vs. Jones) through December 10, 2022 (UFC 282). There are some inconsistencies and inaccuracies in the data which I won't get into here; we will either clean those features as we go or not use them at all.
# Read data from the CSV file
ufc_data = pd.read_csv(r"/Users/shawnlee/Documents/School/USF/ECON611/final_project/ufc-master-[corrected3].csv")
ufc_data.head()
| | R_fighter | B_fighter | R_odds | B_odds | R_ev | B_ev | date | location | country | Winner | ... | finish_details | finish_round | finish_round_time | total_fight_time_secs | r_dec_odds | b_dec_odds | r_sub_odds | b_sub_odds | r_ko_odds | b_ko_odds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Paddy Pimblett | Jared Gordon | -270.0 | 220.0 | 37.037037 | 220.000000 | 2022-12-10 | Las Vegas, Nevada, USA | USA | Red | ... | NaN | 3.0 | 5:00 | 900.0 | 300.0 | 425.0 | 200.0 | 1400.0 | 265.0 | 700.0 |
| 1 | Santiago Ponzinibbio | Alex Morono | -150.0 | 125.0 | 66.666667 | 125.000000 | 2022-12-10 | Las Vegas, Nevada, USA | USA | Red | ... | Punches | 3.0 | 2:29 | 749.0 | 180.0 | 270.0 | 1600.0 | 1200.0 | 200.0 | 700.0 |
| 2 | Darren Till | Dricus Du Plessis | 180.0 | -220.0 | 180.000000 | 45.454545 | 2022-12-10 | Las Vegas, Nevada, USA | USA | Blue | ... | Neck Crank | 3.0 | 2:43 | 763.0 | 400.0 | 325.0 | 2300.0 | 475.0 | 360.0 | 180.0 |
| 3 | Bryce Mitchell | Ilia Topuria | NaN | NaN | NaN | NaN | 2022-12-10 | Las Vegas, Nevada, USA | USA | Blue | ... | Arm Triangle | 2.0 | 3:10 | 490.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Raul Rosas Jr. | Jay Perrin | -300.0 | 240.0 | 33.333333 | 240.000000 | 2022-12-10 | Las Vegas, Nevada, USA | USA | Red | ... | Neck Crank | 1.0 | 2:44 | 164.0 | 200.0 | 360.0 | 180.0 | 1000.0 | 650.0 | 800.0 |

5 rows × 119 columns
Let's start by seeing the spread of observations across our four features: height, reach, age, and betting odds. Because we have two fighters for every row of observation (Blue fighter, Red fighter), we will create a df combined_feats
which appends the rows vertically for the four features so that our df has shape (11024, 4).
# Combine
combined_feats = pd.concat([
ufc_data[['B_Height_cms', 'B_Reach_cms', 'B_age', 'B_odds']].rename(columns=lambda x: x.replace('B_', '')),
ufc_data[['R_Height_cms', 'R_Reach_cms', 'R_age', 'R_odds']].rename(columns=lambda x: x.replace('R_', ''))
], ignore_index=True)
print("combined_feats has a shape of: ", combined_feats.shape)
combined_feats has a shape of: (11024, 4)
We'll clean the data a bit with the dropna()
method, then we'll create four histograms side-by-side. We'll also add in boxplots at the bottom so we can tease out any outliers.
# Drop rows with missing values
combined_feats = combined_feats.dropna()
# Set Seaborn style
sns.set(style="whitegrid")
# Plotting histograms using Seaborn
fig, axs = plt.subplots(2, 4, figsize=(15, 4), gridspec_kw={'height_ratios': [6, 1]})
# Plot histograms for height
axs[0, 0].hist(combined_feats['Height_cms'], bins=20, color='blue', alpha=0.7, edgecolor='black')
sns.despine(ax=axs[0, 0])
axs[0, 0].set_title('Fighter heights observed')
axs[0, 0].set_xlabel('')
axs[0, 0].set_ylabel('Frequency')
# Plot histograms for reach
axs[0, 1].hist(combined_feats['Reach_cms'], bins=20, color='orange', alpha=0.7, edgecolor='black')
sns.despine(ax=axs[0, 1])
axs[0, 1].set_title('Fighter reaches observed')
axs[0, 1].set_xlabel('')
axs[0, 1].set_ylabel('Frequency')
# Plot histograms for age
axs[0, 2].hist(combined_feats['age'], bins=20, color='green', alpha=0.7, edgecolor='black')
sns.despine(ax=axs[0, 2])
axs[0, 2].set_title('Fighter ages observed')
axs[0, 2].set_xlabel('')
axs[0, 2].set_ylabel('Frequency')
# Plot histograms for odds
axs[0, 3].hist(combined_feats['odds'], bins=20, color='grey', alpha=0.7, edgecolor='black')
sns.despine(ax=axs[0, 3])
axs[0, 3].set_title('Fight odds observed')
axs[0, 3].set_xlabel('')
axs[0, 3].set_ylabel('Frequency')
# Plot boxplots using Seaborn
# Set the color palette for boxplots
colors = ['blue', 'orange', 'green', 'grey']
# Plot boxplots for height
sns.boxplot(x=combined_feats['Height_cms'], ax=axs[1, 0], vert=False, boxprops=dict(facecolor=colors[0], alpha=0.7), medianprops=dict(color='black'))
axs[1, 0].set_title('')
axs[1, 0].set_xlabel('Height (cms)')
# Plot boxplots for reach
sns.boxplot(x=combined_feats['Reach_cms'], ax=axs[1, 1], vert=False, boxprops=dict(facecolor=colors[1], alpha=0.7), medianprops=dict(color='black'))
axs[1, 1].set_title('')
axs[1, 1].set_xlabel('Reach (cms)')
# Plot boxplots for age
sns.boxplot(x=combined_feats['age'], ax=axs[1, 2], vert=False, boxprops=dict(facecolor=colors[2], alpha=0.7), medianprops=dict(color='black'))
axs[1, 2].set_title('')
axs[1, 2].set_xlabel('Age')
# Plot boxplots for odds
sns.boxplot(x=combined_feats['odds'], ax=axs[1, 3], vert=False, boxprops=dict(facecolor=colors[3], alpha=0.7), medianprops=dict(color='black'))
axs[1, 3].set_title('')
axs[1, 3].set_xlabel('Odds (lower is favored)')
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
Since fighters tend to compete within one weightclass for their whole career, their matchups with other fighters tend to be limited to whoever else is in their division. This leads to certain trends that may form within each division and can be an interesting way of clustering the fights together. Let's see how our observations are spread across weightclasses.
# Count the occurrences of each unique value in the "weight_class" column
weight_class_counts = ufc_data['weight_class'].value_counts()
# Define the upper limits for each weight class
upper_limits = {
"Women's Strawweight": 115, 'Flyweight': 125, "Women's Flyweight": 125,
"Women's Bantamweight": 135, 'Bantamweight': 135, 'Featherweight': 145,
"Women's Featherweight": 145, 'Lightweight': 155, 'Welterweight': 170,
'Middleweight': 185, 'Light Heavyweight': 205, 'Heavyweight': 265, 'Catch Weight': None
}
# Create a DataFrame with weight_class, observed_frequency
observed_frequency_df = pd.DataFrame({
'weight_class': weight_class_counts.index,
'observed_frequency': weight_class_counts.values
})
# Create a DataFrame with weight_class, upper_limit
upper_limits_df = pd.DataFrame(list(upper_limits.items()), columns=['weight_class', 'upper_limit'])
# Merge the DataFrames based on the 'weight_class' column
merged_df = pd.merge(observed_frequency_df, upper_limits_df, on='weight_class', how='left')
# Sort the DataFrame based on the 'upper_limit' column
merged_df = merged_df.sort_values(by='upper_limit')
# Separate men's and women's divisions
men_df = merged_df[~merged_df['weight_class'].str.contains("Women's")]
women_df = merged_df[merged_df['weight_class'].str.contains("Women's")]
# Set Seaborn style
sns.set(style="whitegrid")
# Plotting the reversed horizontal bar chart for men's divisions
plt.figure(figsize=(15, 3))
bars_men = sns.barplot(x=men_df['observed_frequency'], y=men_df['weight_class'],
color='darkblue', edgecolor='black', alpha=0.7)
# Plotting the reversed horizontal bar chart for women's divisions
bars_women = sns.barplot(x=women_df['observed_frequency'], y=women_df['weight_class'],
color='darkred', edgecolor='black', alpha=0.7)
# Adding labels and title
plt.xlabel('Observed Frequency')
plt.ylabel('Weight Class')
plt.title('Weight Class Observations by Frequency')
# Display the plot
plt.show()
Logistic Regression
We'll start by making some adjustments to our data. First, we create an outcome
column that gives us '1' if the Red fighter wins and '0' if the Blue fighter wins.
We'll also create odds_dif
to present the differential between Red's betting odds and Blue's betting odds (lower odds represent the favorite to win).
ufc_data['outcome'] = ufc_data['Winner'].replace({'Red':1, 'Blue':0})
ufc_data['odds_dif'] = ufc_data['R_odds'] - ufc_data['B_odds']
Now we'll standardize all our ordinal features by their z-scores.
# Standardize all ordinal
def z_std_col(df, col_name):
    std_col = StandardScaler().fit_transform(df[col_name].values.reshape(-1, 1))
    return std_col
ufc_data['age_dif_std'] = z_std_col(ufc_data, 'age_dif')
ufc_data['reach_dif_std'] = z_std_col(ufc_data, 'reach_dif')
ufc_data['height_dif_std'] = z_std_col(ufc_data, 'height_dif')
ufc_data['odds_dif_std'] = z_std_col(ufc_data, 'odds_dif')
Thankfully, the data we have already comes with a column of 1's, constant_1, but we could have easily done that ourselves with df['intercept'] = 1.
To make things neater, we will create df_logit
to contain only our intercept, independent, and dependent variables.
df_logit = ufc_data[['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std', 'outcome']].copy(deep=True)
df_logit = df_logit.dropna()
df_logit.head()
| | constant_1 | age_dif_std | reach_dif_std | height_dif_std | odds_dif_std | outcome |
|---|---|---|---|---|---|---|
| 0 | 1 | 1.308071 | -1.336975 | -0.369988 | -0.604725 | 1 |
| 1 | 1 | -0.814472 | -0.244517 | -0.369988 | -0.186696 | 1 |
| 2 | 1 | -0.235597 | 0.574826 | 0.368403 | 1.125721 | 0 |
| 4 | 1 | 2.079904 | 0.028597 | -0.739183 | -0.701941 | 1 |
| 5 | 1 | -0.235597 | -0.517631 | 0.368403 | -0.303355 | 1 |
Finally, we will use statsmodels.api
to perform our logistic regression for us.
We simply define X
as our independent variable columns, and y
as our dependent variable column.
We pass the two into sm.Logit().fit()
and use the summary()
method. Like how we discussed MLE, this method will test multiple iterations of coefficients for our X
variabes to see which iteration has the highest likelihood of producing our current dataset.
X = df_logit[['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std']]
y = df_logit['outcome']
log_reg = sm.Logit(y, X).fit()
log_reg.summary()
Optimization terminated successfully.
Current function value: 0.615285
Iterations 5
| Dep. Variable: | outcome | No. Observations: | 5489 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 5484 |
| Method: | MLE | Df Model: | 4 |
| Date: | Fri, 15 Dec 2023 | Pseudo R-squ.: | 0.09359 |
| Time: | 17:23:36 | Log-Likelihood: | -3377.3 |
| converged: | True | LL-Null: | -3726.0 |
| Covariance Type: | nonrobust | LLR p-value: | 1.235e-149 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| constant_1 | 0.3935 | 0.029 | 13.356 | 0.000 | 0.336 | 0.451 |
| age_dif_std | -0.0793 | 0.030 | -2.667 | 0.008 | -0.138 | -0.021 |
| reach_dif_std | -0.0900 | 0.041 | -2.208 | 0.027 | -0.170 | -0.010 |
| height_dif_std | 0.0275 | 0.039 | 0.707 | 0.480 | -0.049 | 0.104 |
| odds_dif_std | -0.7863 | 0.034 | -23.178 | 0.000 | -0.853 | -0.720 |
Real quickly, let's check if the functions we defined earlier would generate the same coefficients.
# convert to numpy array for more efficient processing
arr_logit = np.array(df_logit)
# initialize variables to be passed into *args
n_params = 5
n_classes = 2
# initialize starting guesses
starting_values = np.random.rand(n_params*n_classes)
# use scipy.optimize's minimize function
# minimize the function LL, with starting guesses starting_values, passing args with y, X, n_params, n_classes
results = minimize(LL, x0 = starting_values, args = (arr_logit[:, -1], arr_logit[:, :-1], n_params, n_classes))
#print results
parameter_labels = ['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std']
coefficients = results.x[n_params:]
print('\n'.join([f" {label} {round(coeff, 4)}" for i, (label, coeff) in enumerate(zip(parameter_labels, coefficients), start=1)]))
constant_1 0.3935
age_dif_std -0.0793
reach_dif_std -0.09
height_dif_std 0.0275
odds_dif_std -0.7863
Awesome! We got the same results as our statsmodels package.
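If you'd rather verify that programmatically than by eye, a quick check like this (a small sketch) should do:
# optional sanity check: the hand-rolled MLE and statsmodels coefficients should agree to a few decimals
print(np.allclose(log_reg.params.values, coefficients, atol=1e-3))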
Interpreting the logistic regression results
Before we draw conclusions from our coefficients, we must first understand that we are working with a logistic regression and not a linear regression. This means that y does not change by $\beta$ per unit change in x.

Instead, the change in $x \cdot \beta$ is passed through the sigmoid function, which has some positive or negative effect on the output depending on where we already sit on the curve. At most we can say that the direction of the effect matches the sign of $\beta$, but that's about all we can say at this point.
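To see that non-linearity concretely, here is a small sketch that plugs the fitted coefficients back into the sigmoid (the helper function and the example values below are just for illustration): the same one-standard-deviation change in odds_dif_std shifts the predicted probability by different amounts depending on where on the curve we start.
# pull the fitted coefficients from statsmodels (a pandas Series indexed by variable name)
beta_hat = log_reg.params

# hypothetical helper: predicted P(Red wins) for chosen (standardized) feature values
def predicted_prob(age=0, reach=0, height=0, odds=0):
    xb = (beta_hat['constant_1'] + beta_hat['age_dif_std'] * age
          + beta_hat['reach_dif_std'] * reach + beta_hat['height_dif_std'] * height
          + beta_hat['odds_dif_std'] * odds)
    return 1 / (1 + np.exp(-xb))

# the same +1 SD move in odds_dif_std...
print(predicted_prob(odds=1) - predicted_prob(odds=0))  # ...near the middle of the curve
print(predicted_prob(odds=4) - predicted_prob(odds=3))  # ...versus far out in the tail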
Lucky for us, we have the get_margeff()
method which translates our coefficients to more linear terms in dy/dx
:
margeff_frame = log_reg.get_margeff().summary_frame()
margeff_frame
| | dy/dx | Std. Err. | z | Pr(>|z|) | Conf. Int. Low | Conf. Int. Hi. |
|---|---|---|---|---|---|---|
| age_dif_std | -0.016932 | 0.006333 | -2.673574 | 7.504765e-03 | -0.029344 | -0.004519 |
| reach_dif_std | -0.019221 | 0.008694 | -2.210949 | 2.703939e-02 | -0.036261 | -0.002182 |
| height_dif_std | 0.005871 | 0.008307 | 0.706721 | 4.797399e-01 | -0.010411 | 0.022152 |
| odds_dif_std | -0.167899 | 0.005815 | -28.875034 | 2.457884e-183 | -0.179295 | -0.156502 |
Of course, we have another caveat in that our data was standardized by z-scores, so every unit of change is measured in standard deviations $\sigma$ rather than natural units. For example, the standard deviation of the height differences observed is 9.30 cm.

Take reach_dif_std, which has a dy/dx of -0.0192: every one-standard-deviation increase in Red's reach advantage lowers Red's predicted probability of winning by roughly 1.9 percentage points. Controversially, this suggests that having shorter arms than your opponent may actually be an advantage in professional fighting.
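If you prefer the effect per centimetre rather than per standard deviation, a rough back-of-the-envelope conversion (sketched below, and assuming the raw reach_dif column is in centimetres like the *_cms columns) is to divide the marginal effect by the un-standardized standard deviation:
# convert the per-standard-deviation marginal effect into a per-cm effect
reach_sd = ufc_data['reach_dif'].std()
per_cm_effect = margeff_frame.loc['reach_dif_std', 'dy/dx'] / reach_sd
print(f"Approximate change in Red's win probability per extra cm of reach advantage: {per_cm_effect:.4f}")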
Indeed, our analysis seems consistent for all other features:
- age_dif_std with a negative dy/dx suggests that being younger is advantageous
- height_dif_std with a positive dy/dx suggests that being taller is advantageous
- odds_dif_std with a negative dy/dx suggests that having more negative betting odds (i.e. being the favorite) is advantageous
Before we close the book on this analysis, however, we cannot just take our results as the absolute truth. With any statistical test, it is important to also evaluate our p-value to see if we have enough evidence in the data to support our claim. In this case, height_dif_std
has a surprisingly high p-value.
A p-value of 0.480 is way above the conventional threshold of 0.05. This means we do not have enough evidence to say that height has an impact on the outcome of a fight!
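(If you'd rather pull that number programmatically than read it off the summary table, the fitted results object exposes the p-values directly:)
# p-values are available as a pandas Series on the fitted Logit results
print(log_reg.pvalues['height_dif_std'])  # roughly 0.48, far above the 0.05 threshold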
Clustering data for additional analysis
Earlier we mentioned the importance of weight class and how each weight class may form different trends. Out of curiosity, let's see if we can code up a way to show how each weight class' logistic regression results vary.
Note that at this stage, we are working with significantly smaller datasets so we'll have to take any conclusions made with a grain of salt.
# Copy the original dataset
wc_logit = ufc_data[['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std', 'outcome', 'weight_class']].copy(deep=True)
wc_logit = wc_logit.dropna()
# Group by 'weight_class' and create a dictionary of DataFrames
grouped_df_dict = dict(iter(wc_logit.groupby('weight_class')))
# Initialize lists to store DataFrames for each feature
age_dif_std_dfs = []
reach_dif_std_dfs = []
height_dif_std_dfs = []
odds_dif_std_dfs = []
# Iterate over weight classes
for weightclass in grouped_df_dict:
    test_class = weightclass
    X = grouped_df_dict.get(test_class)[['constant_1', 'age_dif_std', 'reach_dif_std', 'height_dif_std', 'odds_dif_std']]
    y = grouped_df_dict.get(test_class)['outcome']
    log_reg = sm.Logit(y, X).fit(disp=0)
    margeff_frame = log_reg.get_margeff().summary_frame()
    margeff_frame_select = margeff_frame.iloc[:, [0, 3]]
    margeff_frame_select = margeff_frame_select.round(4)
    # Append DataFrames for each feature to the respective list
    age_dif_std_dfs.append(margeff_frame_select.loc['age_dif_std'].rename(weightclass))
    reach_dif_std_dfs.append(margeff_frame_select.loc['reach_dif_std'].rename(weightclass))
    height_dif_std_dfs.append(margeff_frame_select.loc['height_dif_std'].rename(weightclass))
    odds_dif_std_dfs.append(margeff_frame_select.loc['odds_dif_std'].rename(weightclass))
# Concatenate the DataFrames in each list to create the final DataFrames
age_dif_std_df = pd.concat(age_dif_std_dfs, axis=1).T
reach_dif_std_df = pd.concat(reach_dif_std_dfs, axis=1).T
height_dif_std_df = pd.concat(height_dif_std_dfs, axis=1).T
odds_dif_std_df = pd.concat(odds_dif_std_dfs, axis=1).T
# Create a 2x2 grid of scatter plots
fig, axs = plt.subplots(2, 2, figsize=(12, 6))
fig.suptitle('Scatter Plots of Marginal Effects')
# Function to add a horizontal line at the p = 0.05 significance threshold
def add_threshold_line(ax):
    ax.axhline(y=0.05, color='black', linestyle='-', linewidth=1)

# Function to add a vertical line at dy/dx = 0
def add_zero_line(ax):
    ax.axvline(x=0, color='black', linestyle='--', linewidth=1)
# Scatter plot for Age Difference Standardized
sns.scatterplot(data=age_dif_std_df, x='dy/dx', y='Pr(>|z|)', hue=age_dif_std_df.index, s=50, ax=axs[0, 0])
add_threshold_line(axs[0, 0])
add_zero_line(axs[0, 0])
axs[0, 0].set_title('Age Difference Standardized')
axs[0, 0].set_xlabel('dy/dx')
axs[0, 0].set_ylabel('Pr(>|z|)')
axs[0, 0].legend().set_visible(False)
# Scatter plot for Reach Difference Standardized
sns.scatterplot(data=reach_dif_std_df, x='dy/dx', y='Pr(>|z|)', hue=reach_dif_std_df.index, s=50, ax=axs[0, 1])
add_threshold_line(axs[0, 1])
add_zero_line(axs[0, 1])
axs[0, 1].set_title('Reach Difference Standardized')
axs[0, 1].set_xlabel('dy/dx')
axs[0, 1].set_ylabel('Pr(>|z|)')
axs[0, 1].legend().set_visible(False)
# Scatter plot for Height Difference Standardized
sns.scatterplot(data=height_dif_std_df, x='dy/dx', y='Pr(>|z|)', hue=height_dif_std_df.index, s=50, ax=axs[1, 0])
add_threshold_line(axs[1, 0])
add_zero_line(axs[1, 0])
axs[1, 0].set_title('Height Difference Standardized')
axs[1, 0].set_xlabel('dy/dx')
axs[1, 0].set_ylabel('Pr(>|z|)')
axs[1, 0].legend().set_visible(False)
# Scatter plot for Odds Difference Standardized
scatter = sns.scatterplot(data=odds_dif_std_df, x='dy/dx', y='Pr(>|z|)', hue=odds_dif_std_df.index, s=50, ax=axs[1, 1])
add_threshold_line(axs[1, 1])
add_zero_line(axs[1, 1])
axs[1, 1].set_title('Odds Difference Standardized')
axs[1, 1].set_xlabel('dy/dx')
axs[1, 1].set_ylabel('Pr(>|z|)')
axs[1, 1].legend().set_visible(False)
# Move the common legend outside the plot
scatter.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
# Display the DataFrames for each feature
print("Age Difference Standardized:")
print(age_dif_std_df, "\n")
print("Reach Difference Standardized:")
print(reach_dif_std_df, "\n")
print("Height Difference Standardized:")
print(height_dif_std_df, "\n")
print("Odds Difference Standardized:")
print(odds_dif_std_df, "\n")
Age Difference Standardized:

| weight_class | dy/dx | Pr(>|z|) |
|---|---|---|
| Bantamweight | -0.0260 | 0.1769 |
| Catch Weight | -0.0968 | 0.1440 |
| Featherweight | -0.0635 | 0.0022 |
| Flyweight | -0.0415 | 0.1716 |
| Heavyweight | -0.0074 | 0.7064 |
| Light Heavyweight | -0.0123 | 0.5475 |
| Lightweight | -0.0238 | 0.1461 |
| Middleweight | -0.0104 | 0.5792 |
| Welterweight | -0.0019 | 0.9076 |
| Women's Bantamweight | 0.0038 | 0.8994 |
| Women's Featherweight | -0.0834 | 0.3853 |
| Women's Flyweight | 0.0780 | 0.0282 |
| Women's Strawweight | -0.0376 | 0.2121 |

Reach Difference Standardized:

| weight_class | dy/dx | Pr(>|z|) |
|---|---|---|
| Bantamweight | 0.0354 | 0.1747 |
| Catch Weight | 0.0535 | 0.5316 |
| Featherweight | -0.0333 | 0.2057 |
| Flyweight | -0.0270 | 0.5685 |
| Heavyweight | -0.0977 | 0.0015 |
| Light Heavyweight | -0.0606 | 0.0473 |
| Lightweight | -0.0027 | 0.9028 |
| Middleweight | -0.0437 | 0.0841 |
| Welterweight | 0.0132 | 0.5847 |
| Women's Bantamweight | -0.0607 | 0.2921 |
| Women's Featherweight | -0.1791 | 0.2981 |
| Women's Flyweight | -0.0016 | 0.9719 |
| Women's Strawweight | 0.0000 | 0.9999 |

Height Difference Standardized:

| weight_class | dy/dx | Pr(>|z|) |
|---|---|---|
| Bantamweight | -0.0013 | 0.9630 |
| Catch Weight | 0.0356 | 0.6512 |
| Featherweight | 0.0157 | 0.5054 |
| Flyweight | 0.0631 | 0.1583 |
| Heavyweight | 0.0577 | 0.0328 |
| Light Heavyweight | 0.0028 | 0.9297 |
| Lightweight | 0.0130 | 0.5250 |
| Middleweight | 0.0163 | 0.5255 |
| Welterweight | -0.0380 | 0.0968 |
| Women's Bantamweight | -0.0014 | 0.9789 |
| Women's Featherweight | 0.0926 | 0.4381 |
| Women's Flyweight | -0.0310 | 0.5445 |
| Women's Strawweight | 0.0274 | 0.4174 |

Odds Difference Standardized:

| weight_class | dy/dx | Pr(>|z|) |
|---|---|---|
| Bantamweight | -0.1891 | 0.0000 |
| Catch Weight | -0.1614 | 0.0001 |
| Featherweight | -0.1387 | 0.0000 |
| Flyweight | -0.1520 | 0.0000 |
| Heavyweight | -0.1639 | 0.0000 |
| Light Heavyweight | -0.1601 | 0.0000 |
| Lightweight | -0.1807 | 0.0000 |
| Middleweight | -0.1539 | 0.0000 |
| Welterweight | -0.1720 | 0.0000 |
| Women's Bantamweight | -0.1516 | 0.0000 |
| Women's Featherweight | -0.2617 | 0.0000 |
| Women's Flyweight | -0.1364 | 0.0000 |
| Women's Strawweight | -0.1798 | 0.0000 |
Interestingly yet unsurprisingly, paring our 5,000+ observations down to just 200-300 per cluster significantly affects our p-values and our ability to reject the null hypothesis.
While this might seem anticlimactic, we actually do learn new tidbits of information. For example:
- Heavyweights are the only fighters who gain a statistically significant advantage from being taller.
- On the other hand, Heavyweights and Light Heavyweights are the only ones to show a significant effect from reach differential, and the advantage goes to those whose reach is shorter than their opponents'!
- With p-values virtually equal to 0, betting odds are a great predictor for fight outcome.
- Interestingly, catch weight fights (which often take place on short notice or because a fighter missed weight) have the least extreme, though still tiny, p-value of 0.0001. This suggests that the uncertain nature of catch weight fights is hard for bettors to read, too!
Final thoughts
There's still so much we can dig through and learn from studying fight data, but at the end of the day, it's great to learn that physical attributes and conventional wisdom only play a minor role in determining the outcome of a fight. Training, experience, matchups, and a lot of other factors likely play a much larger role in helping you win a fight. But until we get more granular data with bigger datasets, we will leave those features shrouded in the noise of our model.