AI-Driven Soccer Formations Analysis: Exploring Player Data for Strategic Insights (Part I)

15 min read · Sep 17, 2024

Leveraging Machine Learning to Uncover Winning Formations and Player Performance in Professional Football.

Summary:

In this project, we leverage AI and machine learning to analyze soccer player data and predict the optimal positions for players based on their physical and technical attributes. By examining how attributes such as speed, agility, and passing affect a player’s performance in different roles, our AI model offers strategic insights into player positioning. The model identifies versatile talents, suggests unconventional roles, and helps manage squad depth during injuries. This approach could revolutionize how coaches assess player capabilities, unlocking hidden potential and improving team performance through data-driven decisions.

All the code:

GuillemMiralles/Field-Wizard (github.com)

2nd Part here:

AI-Driven Soccer Formations Analysis: Optimizing Team Formations Through Deep Learning (Part II) | by Guillem Miralles | Sep, 2024 | Medium

Let’s go to what we are passionate about: Football and Data Science.

Introduction: Understanding AI’s Role in Predicting Optimal Soccer Positions

Historical Shifts: How Positional Changes Have Shaped Football Careers

Football history is full of moments where coaches have made bold decisions that altered players’ careers and enhanced team performance. One famous example is Pep Guardiola’s decision to move Philipp Lahm, traditionally a right-back, into a central midfield role at Bayern Munich [1]. This shift unlocked a new dimension in Lahm’s game, allowing him to control the tempo and contribute defensively and offensively with greater impact. Similarly, Steven Gerrard, who spent the majority of his career as a central midfielder, was shifted to a more defensive midfield role towards the end of his playing days under Brendan Rodgers at Liverpool, to maximize his distribution and leadership from deeper positions.

Steven Gerrard: A Complete Midfielder — YouTube

There are also countless examples of players being moved due to injuries within their teams. For instance, Javier Mascherano, a world-class defensive midfielder, was pushed into the center-back role by Barcelona’s management when their defensive line was plagued with injuries [2]. Mascherano’s versatility shone through, and he became an integral part of their defense during the club’s dominant era. Barcelona’s Sergi Roberto has exemplified adaptability throughout his career, playing as a full-back, midfielder, and winger [3]. His versatility was particularly important in covering for injuries in a number of positions, demonstrating the vital role flexibility plays in modern football. In addition, players who have positional flexibility are becoming increasingly important due to rotations, injuries and the increasing number of games in football [4].

Harnessing AI to Optimize Player Performance and Positioning

Building on the impactful examples from football’s past, this project focuses on creating an AI-driven model and data analysis that go beyond instinct and circumstance, using data to make precise predictions about player suitability across multiple positions.

Key Objectives:

Analyze Player Attributes
We will develop AI models to analyze players’ physical and technical data, providing an in-depth understanding of how individual attributes correlate with performance in various roles.
Predict Optimal Positions
The AI will help predict the best positions for players, offering evidence-based recommendations rather than relying on traditional assessments or trial-and-error during games.
Discover Unconventional Roles
AI can reveal hidden potential in players, identifying roles they might not typically be considered for but could excel in.
Injury Management and Squad Depth
Injuries are a constant challenge for football teams. Our AI can be used to find the best positional replacements for injured players, ensuring minimal disruption to team performance.

Data Sources

In this project, we utilized datasets from open-source platforms like Kaggle, which provide rich information to train our AI models and make insightful predictions. Below are the main sources of data:

1. European Soccer Database

This dataset is a comprehensive collection of football data from 11 European countries, covering over 25,000 matches and more than 10,000 players from the 2008–2016 seasons [5]. It includes key tables such as:

Player Table: Contains basic player information like api_id, player name, height, and weight.

Player_Attributes Table: Offers 38 detailed player attributes recorded at different times during the season. Some attributes are:

sprint_speed: Player's maximum running speed.
agility: Player's flexibility and quick movement.
reactions: How fast a player responds to game events.
balance: Stability and coordination.
shot_power: Strength of the player's shot.

Correlation between Player’s Overall Rating VS Attributes — Snapshot of the Author

Match Table: This table provides detailed information about each match, including:

team_api_id: IDs of both home and away teams.
season: The season in which the match took place.
outcome: Goals scored by the home and away teams.
player_api_id: IDs of the players who participated in the match (e.g., home_player_1 to home_player_11 for home team and similarly for away team).
player positions: The exact position of each player on the field, represented by coordinates on the x-axis and y-axis (e.g., home_player_X1 and home_player_Y1 for Player 1).

Matches by league in European Soccer Database (2008–2016): Snapshot of the Author

Team Table: Used only to map the API IDs of the home and away teams to the teams’ real names.

La Liga Achievements between 2008–2016 : Snapshot of the Author

2. FIFA Complete Player Dataset

We used detailed player statistics from the FIFA Career Mode datasets for the years 2015 and 2016 [6]. This data includes more than 100 individual attributes per player, allowing for comparisons across different game versions.

players_2015.csv
players_2016.csv

These files contain key statistics, and we leverage them because they provide information about the positions a player can play, which is crucial for our analysis.

Some players and their position: FIFA 15 and 16 — Snapshot of the Author

Unlocking Positional Insights: Data Analysis and Model Building

Data Cleaning and Preprocessing

The raw data we used was rich but required extensive cleaning. Here’s a breakdown of the steps we followed:

Handling Missing Values: Several attributes, such as player stats or match information, had missing values. We addressed this by either filling them with appropriate averages or removing incomplete entries to ensure data quality.
Encoding Categorical Variables: Attributes like “preferred foot” or “defensive work rate” were categorical, meaning they needed to be transformed into numerical formats using techniques such as one-hot encoding [7].
Combining Datasets:

We first merged the FIFA datasets from 2015 and 2016, ensuring that every player’s name, along with their corresponding positions, was accurately matched.
Afterward, we combined the player data with the Player Attributes table to enrich our dataset with information about the players’ physical and technical attributes, such as height, weight, speed, and agility.
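The cleaning steps above can be sketched roughly as follows. This is a minimal illustration with pandas, not the project's actual pipeline: the toy values, the player_name join key, and the choice of mean imputation for every numeric column are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy frames: column names echo the tables described above,
# but the rows are invented for illustration.
attributes = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "sprint_speed": [80.0, None, 60.0],
    "preferred_foot": ["left", "right", "right"],
})
positions = pd.DataFrame({
    "player_name": ["Player A", "Player B"],
    "player_positions": ["LB, CB", "CM"],
})

def preprocess(attributes: pd.DataFrame, positions: pd.DataFrame) -> pd.DataFrame:
    df = attributes.copy()
    # Handling missing values: fill numeric gaps with the column mean.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    # Encoding categorical variables: one 0/1 column per category.
    df = pd.get_dummies(df, columns=["preferred_foot"], dtype=int)
    # Combining datasets: the inner join keeps only players with known positions.
    return df.merge(positions, on="player_name", how="inner")

merged = preprocess(attributes, positions)
print(merged)
```

Joining on names alone can mismatch players who share a name, so a real pipeline would likely need an extra key such as birth date or a shared ID.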

For each player, we gathered approximately 40 key statistics per season, giving us an extensive overview of their performance. These stats included:

Physical Attributes: Height, weight, agility, balance, and speed.
Technical Attributes: Dribbling, finishing, passing, and tackling abilities.
Positions Played: Importantly, we had access to the positions or multiple positions a player has played during a given season.

We consolidated datasets from the European Soccer Database (2008–2016), allowing us to track players’ development over multiple seasons. This means that for each player, we have several records reflecting how their performance evolves.

For example, Mohamed Salah has multiple entries, one for each season, showing shifts in his attributes. In the PCA visualization below, red dots represent Salah’s stats over time compared to other Right-Wing (RW) players (in blue). This gives us valuable insights into how players’ performances vary, helping us predict their optimal playing positions.

PCA visualization of RW players with Mohamed Salah’s stats highlighted — Snapshot of the Author

Advanced Feature Engineering: Capturing Key Aspects of Player Performance

To improve the accuracy of our model, we applied feature engineering techniques [8] that consolidate individual player statistics into broader, more meaningful categories. This approach not only simplifies the data but also allows us to focus on the most relevant aspects of player performance.

For example, we created a Performance Index by averaging each player’s overall rating and potential, providing a more holistic view of their ability. Additionally, the Finishing Accuracy metric merges finishing, heading accuracy, and shot power — three key components for scoring potential — into a single, comprehensive score.

Other features include:

Skills Score, which aggregates crossing, dribbling, curve, and ball control for technical proficiency.
Attacking, Midfield, and Defensive Ratings, which combine relevant statistics to better capture positional abilities.
Goalkeeper Rating, which merges diving, handling, kicking, positioning, and reflex attributes to give a full overview of a goalkeeper’s potential.

# X is our dataframe (a pandas DataFrame containing the attribute columns used below)
def NewFeatures(X):
  # Create aggregated features; the new columns are added to X in place

  # Size proxy: simple mean of weight and height (not a true Body Mass Index)
  X['wh'] = (X['weight'] + X['height']) / 2.0

  # Overall rating and potential
  X['performance_index'] = (X['overall_rating'] + X['potential']) / 2.0

  # Finishing
  X['finishing_acc'] = (X['finishing'] + X['heading_accuracy'] + X['shot_power']) / 3.0

  # Skills: mean of the four technical attributes
  X['skills'] = (X['crossing'] + X['dribbling'] + X['curve'] + X['ball_control']) / 4.0

  # Mentality
  X['mentality'] = (X['aggression'] + X['interceptions'] + X['positioning'] + X['vision']) / 4.0

  # Movement (stored under the column name 'movment')
  X['movment'] = (X['acceleration'] + X['sprint_speed'] + X['agility'] +
                  X['reactions'] + X['balance']) / 5.0

  # Attacking rating
  X['attacking_rating'] = (X['crossing'] + X['finishing'] + X['heading_accuracy'] +
                           X['short_passing'] + X['volleys']) / 5.0

  # Midfield rating
  X['midfield_rating'] = (X['short_passing'] + X['long_passing'] + X['ball_control'] +
                          X['vision'] + X['crossing'] + X['curve'] + X['dribbling'] +
                          X['free_kick_accuracy'] + X['long_shots']) / 9.0

  # Defensive rating
  X['defensive_rating'] = (X['marking'] + X['standing_tackle'] + X['sliding_tackle']) / 3.0

  # Goalkeeper rating
  X['goalkeeper_rating'] = (X['gk_diving'] + X['gk_handling'] + X['gk_kicking'] +
                            X['gk_positioning'] + X['gk_reflexes']) / 5.0

  return X

Key Observations from Exploratory Data Analysis (EDA)

In this scatter plot, players are clustered by position, with goalkeepers clearly separated from outfield players. The separation suggests that goalkeepers have distinctly different attributes compared to other positions. Midfielders, defenders, and forwards overlap more, but two groups still stand out: attackers (green) at the top and defenders (blue) below them.

PCA Scatter Plot: Player Position — Snapshot of the Author

This diagram helps us visually associate the positions on the field with player roles. It emphasizes how the different positions are spread out in relation to each other, particularly how the midfield acts as a bridge between defense and attack. It gives a clear visual of where players typically perform and shows how their physical and technical attributes align with specific roles on the pitch.

Soccer Field Positional Mapping — Source: FIFAUTEAM

We can also observe that certain player positions are represented more frequently than others, indicating a class imbalance in the dataset. This imbalance could pose a challenge when training a model, as it may bias predictions toward the more frequent positions. The uneven distribution of players across roles suggests the need for balancing techniques, such as oversampling or undersampling, so that the model can predict all positions accurately without being skewed by the dominant classes.

Number of Appearances of the Positions — Snapshot of the Author
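One quick way to quantify this imbalance is to count how often each position label appears. The sketch below assumes positions are stored as comma-separated strings in a column named player_positions; both the column name and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample: one row per player-season, positions as a comma-separated string.
players = pd.DataFrame({
    "player_positions": ["ST", "CM, CDM", "CB", "CB, LB", "CM", "GK", "CB"],
})

position_counts = (
    players["player_positions"]
    .str.split(",")   # "CM, CDM" -> ["CM", " CDM"]
    .explode()        # one row per (player, position) pair
    .str.strip()      # remove the leftover spaces
    .value_counts()   # appearances per position, most frequent first
)

# Ratio between the most and least frequent position: a rough imbalance indicator.
imbalance_ratio = position_counts.max() / position_counts.min()
print(position_counts)
print(imbalance_ratio)
```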

One-Hot Encoding and Multilabel Problem

To handle the fact that a player can perform well in multiple positions, we used One-Hot Encoding [9]. This method transforms each player’s list of positions into a binary vector with one column per position. If a player is suited for a particular position, the value in that column is set to 1; otherwise, it is set to 0. Because a player can have several 1s at once, the result is sometimes called a multi-hot vector. This encoding is essential for the multilabel classification problem, where each player can be classified into multiple categories (positions) simultaneously.
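As a small illustration of this encoding, the sketch below turns comma-separated position strings into one binary column per position using pandas. The player_positions column name and the three sample players are assumptions for the example.

```python
import pandas as pd

# Hypothetical players with one or more positions each.
players = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "player_positions": ["LB, CB", "CM", "RW, ST, CM"],
})

# One 0/1 column per position; a row can contain several 1s (multilabel target).
labels = players["player_positions"].str.get_dummies(sep=", ")
print(labels)
```

Each row of labels can then serve as the target vector for the multilabel classifier.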

Different Approaches to Improve Model Performance

To find the best performing model, we tried various techniques to optimize the data and model performance:

Oversampling of Underrepresented Positions: Since some positions were underrepresented in the dataset, leading to a class imbalance, we applied oversampling. This technique increased the number of samples in the underrepresented classes, ensuring the model would learn equally from all positions and not just the most common ones [10].
Principal Component Analysis (PCA): We utilized PCA to reduce the dimensionality of the dataset, simplifying it while retaining over 97.5% of the data variance. By doing this, we aimed to eliminate noise and redundant information [11].
Feature Selection with Boruta: Lastly, we applied the Boruta algorithm to select the most important features for predicting player positions. This method helps filter out unnecessary attributes, allowing the model to focus on the key features that contribute to a player’s success in certain roles [12].
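For the PCA step, scikit-learn lets you pass the target fraction of variance directly as n_components. The sketch below runs on synthetic data; standardizing the features first is an assumption of this example, since the article does not say how the data was scaled before PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for player attributes: 200 players, 10 correlated
# features driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep just enough components
# to explain at least that fraction of the variance.
pca = PCA(n_components=0.975, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(explained, 4))
```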

Final Model Architecture: Zlatanizer

For the final model, we achieved the best performance using the following approach:

No Oversampling: We chose not to apply oversampling to keep the natural distribution of the positions in the data.
No Dimensionality Reduction: We kept all the features intact without applying techniques like PCA to reduce dimensionality.
Normalized Data with MinMax: To scale the data and bring all attributes to a common range, we normalized the dataset using the MinMaxScaler [13].
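As a brief illustration of the scaling step, the sketch below fits scikit-learn's MinMaxScaler on training data only and reuses the learned minimum and maximum for new data. The two feature columns and their values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features: [sprint_speed, height]
X_train = np.array([[60.0, 170.0], [80.0, 180.0], [100.0, 190.0]])
X_test = np.array([[90.0, 175.0]])

scaler = MinMaxScaler()                         # rescales each column to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on the training set
X_test_scaled = scaler.transform(X_test)        # reuse the same min/max

print(X_train_scaled)
print(X_test_scaled)
```

Fitting the scaler on the training split alone avoids leaking information from the validation or test data into the model.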

In this neural network architecture, we used multiple layers and applied dropout to prevent overfitting. The network was designed to predict the optimal playing positions of players, and we set it up as a multilabel classification problem. This model uses a sigmoid activation function in the output layer to handle the multilabel nature, and binary cross-entropy was chosen as the loss function.
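To make the output layer and loss concrete, here is a small NumPy sketch of what a sigmoid output and binary cross-entropy compute for one player. It is not the Zlatanizer implementation; the three positions and the raw scores are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Squashes each raw score independently into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-label losses.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_label = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_label.mean())

# Hypothetical raw network outputs for one player and positions [LB, CB, ST].
logits = np.array([[2.0, 0.5, -3.0]])
y_true = np.array([[1.0, 1.0, 0.0]])   # the player is labeled LB and CB

probs = sigmoid(logits)
loss = binary_cross_entropy(y_true, probs)
print(probs, loss)
```

Because each output is squashed independently, the probabilities do not have to sum to 1, which is what allows a player to score high in several positions at once.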

Furthermore, we implemented early stopping to ensure the model would halt training once the validation loss stopped improving, thereby preventing overfitting.
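The logic behind early stopping can be sketched in a few lines of plain Python. The patience value and the validation losses below are hypothetical; the article does not state the settings that were actually used.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Patience was never exhausted: training runs through the last epoch.
    return len(val_losses) - 1

# Hypothetical validation losses per epoch.
history = [0.60, 0.52, 0.48, 0.49, 0.50, 0.51, 0.47]
stop_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch)
```

In this made-up history the loss would have improved again one epoch after stopping, which is the usual trade-off when choosing a patience value.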

Zlatanizer: Final Model to predict position based on attributes — Snapshot of the Author

Results, Conclusions, and Future Directions

Results: Model Performance

After testing various models on our dataset, we achieved the following results in terms of accuracy and F1-score:

As observed, the Neural Network outperformed the other models, achieving the highest accuracy and F1-score. This demonstrates the robustness of using deep learning approaches, especially in complex, multi-label tasks such as predicting player positions based on a wide range of attributes.

Position-wise Evaluation Metrics for the Neural Network Model (Precision, Recall, F1-Score) — Snapshot of the Author

As shown in the evaluation plots, the model performs remarkably well across a range of positions, with consistently high precision, recall, and F1 scores. The slight variations between different roles suggest that the model can adapt to different positional requirements, ensuring a tailored prediction for each player based on their attributes.
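Position-wise metrics like these can be computed by treating each position as its own binary problem. The sketch below uses made-up labels and probabilities for two positions, and the 0.5 decision threshold is an assumption of the example.

```python
import numpy as np

def per_position_metrics(y_true, y_pred, positions):
    """Precision, recall and F1 for each position column."""
    metrics = {}
    for j, pos in enumerate(positions):
        tp = int(np.sum((y_true[:, j] == 1) & (y_pred[:, j] == 1)))
        fp = int(np.sum((y_true[:, j] == 0) & (y_pred[:, j] == 1)))
        fn = int(np.sum((y_true[:, j] == 1) & (y_pred[:, j] == 0)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[pos] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

positions = ["CB", "CM"]
# Hypothetical ground truth and predicted probabilities for four players.
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_prob = np.array([[0.9, 0.2], [0.4, 0.8], [0.6, 0.7], [0.1, 0.3]])
y_pred = (y_prob >= 0.5).astype(int)   # assumed decision threshold

report = per_position_metrics(y_true, y_pred, positions)
print(report)
```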

Interpreting Model Predictions: Probability Scores and Player Position Suggestions

A key strength of our model is that it provides a probability score for each position. Instead of simply assigning a player to a single role, the model outputs an independent probability for every position (the sigmoid outputs are not forced to sum to 1). This allows us to understand the likelihood of a player excelling in various positions, rather than limiting them to one predefined role.

For instance, a player might have a high probability of performing well as a central midfielder (CM) but also show moderate potential as a right winger (RW). This information can be invaluable for coaches and analysts, as it gives them the flexibility to experiment with different formations and placements, especially in situations where squad rotations or injuries require quick decisions.

By leveraging this multi-label approach, the model can also uncover hidden potential. Players might excel in roles that are not traditionally associated with their profile, revealing untapped versatility.
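One way to turn these probability scores into suggestions is to rank the positions and keep the alternatives above a cut-off. The helper below is a hypothetical sketch: the function name, the 0.5 threshold, and the example probabilities are all invented for illustration.

```python
def suggest_positions(probabilities, current_position, threshold=0.5):
    """Return alternative positions at or above the threshold, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [
        (position, prob)
        for position, prob in ranked
        if prob >= threshold and position != current_position
    ]

# Hypothetical model output for a central midfielder.
player_probs = {"CM": 0.82, "RW": 0.55, "CDM": 0.40, "ST": 0.10}
alternatives = suggest_positions(player_probs, current_position="CM")
print(alternatives)
```

Lowering the threshold surfaces more speculative roles, while raising it keeps only the positions the model is most confident about.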

Unlocking New Potential: Real-World Examples

Our model’s strength lies in offering probability scores for various positions, allowing coaches to explore potential roles beyond a player’s typical position. This flexibility can uncover hidden potential and provide strategic versatility. Here are a few examples:

1. David Alaba (2013 — Bayern Munich)

Current Position: Left-Back (LB)
Suggested Position: Center-Back (CB)
Model Insights: The model estimated a 71% probability of Alaba excelling as a left-back (LB) due to his vision, aerial ability, and defensive awareness. The center-back (CB) role showed a 63% probability, with no other position exceeding 35%. [14]

From 2016 onwards, Alaba successfully transitioned to center-back, playing a key role for Bayern Munich and continuing this role at Real Madrid, proving his success in both positions.

2. Joshua Kimmich (2014 — RB Leipzig)

Current Position: Defensive Midfielder (DM)
Suggested Position: Right-Back (RB)
Model Insights: The model shows a 78% probability for Kimmich to perform optimally as a defensive midfielder (DM), while also suggesting a 59% probability of excelling as a right-back (RB). [15]

Post-2016, Kimmich became one of the best right-backs in the world for Bayern Munich, proving the versatility the model highlighted, while also excelling in midfield when needed.

3. Fabinho (2015 — AS Monaco)

Current Position: Right-Back (RB)
Suggested Position: Defensive Midfielder (DM)
Model Insights: Fabinho stood out with a 77% probability of thriving as a defensive midfielder (DM), compared to 72% in his right-back (RB) role. His strong tackling, interception skills, and ability to read the game were highlighted as key factors for the shift. [16]

In 2018, Fabinho made this positional switch at Liverpool, becoming one of the most reliable defensive midfielders in Europe, validating the model’s predictions.

How a Team Can Apply This AI Model

Teams can use the AI model to make personalized predictions for each player, helping to analyze their potential in various positions based on their physical and technical attributes. By offering probability scores for multiple roles, the model allows coaching staff to:

Personalize player roles: Instead of fixed positions, the model suggests where players are most likely to succeed.
Conduct detailed analysis: Coaches can explore individual strengths, weaknesses, and compare players for specific positions, enabling smarter tactical decisions.
Tailor development plans: Based on the predictions, players can receive role-specific training to optimize their growth and performance across different positions.
Make dynamic adjustments: During games, the model’s insights can guide in-game position switches or strategic substitutions.

Future Directions and Enhancements

While our AI model has shown promising results, there are several ways we can further improve and expand its capabilities:

Incorporate Real-Time Data: Integrating live match data, such as GPS tracking, could enhance the accuracy of player position predictions, making the model more responsive to current performance trends.
Expand Dataset: By including newer datasets that span additional seasons and leagues, we could provide a more comprehensive view of player development across different competitive environments.
Custom Tactical Modeling: Extending the model to account for specific team tactics and formations would provide even deeper insights, allowing coaches to tailor strategies to their squad’s strengths in real-time.
Player Transfers and Scouting: This tool can be integrated into scouting systems to evaluate potential signings, predicting how new players might fit into existing tactical frameworks or excel in multiple positions.

By refining the model and expanding its applications, we can continue to enhance the strategic edge it offers teams, enabling them to better manage players, cover for injuries, and make more informed decisions on and off the pitch.

WOOOW! You made it to the end, thank you very much. If you liked it, help me with a clap to encourage me, and if you want to see how the project continues, go to the second part: AI-Driven Soccer Formations Analysis: Optimizing Team Formations Through Deep Learning (Part II).

It has been a pleasure to share this project with you :)

Feel free to contact me at guillemmiralles1@gmail.com with any questions you may have.

Resources:

[1] BBC Sport. (2024, September 18). The evolution of a full-back: Philipp Lahm on how position has changed. https://www.bbc.com/sport/football/67747395

[2] Marsden, S. (2023, March 4). How I played centre-back to survive at Barcelona. ESPN. https://www.espn.com/soccer/story/_/id/37544474/played-centre-back-survive-barcelona

[3] FootballCritic. (2024). Sergi Roberto: Player positions. https://www.footballcritic.com/sergi-roberto/player-positions/5657

[4] PlayerScout. (2024). Role flexibility in football: The ultimate guide. https://playerscout.co.uk/role-flexibility-football/

[5] Kaggle. (2024). European Soccer Database. https://www.kaggle.com/datasets/hugomathien/soccer

[6] Kaggle. (2024). FIFA 15 and 16 complete player dataset. https://www.kaggle.com/datasets/stefanoleone992/fifa-15-complete-player-dataset

[7] Upadhyay, R. (2019, October 10). Categorical feature encoding. Towards Data Science. https://towardsdatascience.com/categorical-feature-encoding-547707acf4e5

[8] IBM. (2024). What is feature engineering? https://www.ibm.com/topics/feature-engineering

[9] GeeksforGeeks. (2021, February 3). One-Hot Encoding in ML. https://www.geeksforgeeks.org/ml-one-hot-encoding/

[10] Khandelwal, S. (2019, October 24). Oversampling and undersampling in data science. Towards Data Science. https://towardsdatascience.com/oversampling-and-undersampling-5e2bbaf56dcf

[11] Lever, J., Krzywinski, M., & Altman, N. (2017). Principal component analysis. Nature Methods, 14, 641–642. https://www.nature.com/articles/nmeth.4346

[12] Sayak, P. (2020, July 18). Feature selection with Boruta in Python. Towards Data Science. https://towardsdatascience.com/feature-selection-with-boruta-in-python-676e3877e596

[14] Bundesliga. (2020). Is David Alaba the best centre-back in Germany? https://www.bundesliga.com/en/bundesliga/news/is-david-alaba-the-best-centre-back-in-germany-bayern-munich-austria-11365

[15] Bundesliga. (2022). Joshua Kimmich: A world-class right-back and midfielder. https://www.bundesliga.com/en/bundesliga/news/joshua-kimmich-a-world-class-right-back-and-midfielder-bayern-nagelsmann-18670

[16] The New York Times. (2023, May 17). Liverpool’s Fabinho and the tactical tweak: How Alexander-Arnold’s new role is changing the formation. https://www.nytimes.com/athletic/4523937/2023/05/17/liverpool-fabinho-formation-alexander-arnold/