📍 Recipe: Statsbomb API Integration#

This example shows how to connect directly to the official StatsBomb API using Flow, then query, filter, and process match data with minimal boilerplate.

Flow.statsbomb wraps the statsbombpy client to give you fast, flexible access to event, lineup, and 360 data - in a format ready for analysis.

🧰 In This Recipe, You’ll Learn:#

  • How to access StatsBomb data using Flow.statsbomb methods

  • How to query nested JSON records without flattening

  • How to filter, transform, and prepare data for analysis or storage

  • How to use Flow as an ETL tool to move data from API to file or database

This is the easiest way to stream structured football data into your own pipelines.

Imports#

[1]:
from pprint import pprint
import sqlite3
import warnings

from penaltyblog.matchflow import Flow, where_equals, get_field, get_index
from statsbombpy.api_client import NoAuthWarning

# Suppress Statsbomb's NoAuthWarning warnings since we're using the open data
warnings.filterwarnings("ignore", category=NoAuthWarning)

Competitions#

Get the first competition listed for Italy.

[2]:
result = (
    Flow.statsbomb.competitions()
    .filter(where_equals("country_name", "Italy"))
    .collect()
)

pprint(result[0])
{'competition_gender': 'male',
 'competition_id': 12,
 'competition_international': False,
 'competition_name': 'Serie A',
 'competition_youth': False,
 'country_name': 'Italy',
 'match_available': '2024-06-25T23:56:11.910924',
 'match_available_360': None,
 'match_updated': '2024-06-25T23:56:11.910924',
 'match_updated_360': None,
 'season_id': 27,
 'season_name': '2015/2016'}

Matches#

Get the matches for a given competition and season and filter to specific nested fields using “dot” notation.

[3]:
result = (
    Flow.statsbomb.matches(competition_id=1238, season_id=108)
    .select(
        "competition.competition_name",
        "home_team.home_team_name",
        "away_team.away_team_name",
        "referee.name",
        "referee.country.name",
    )
    .rename(
        **{
            "competition.competition_name": "competition_name",
            "home_team.home_team_name": "home_team_name",
            "away_team.away_team_name": "away_team_name",
            "referee.name": "referee_name",
            "referee.country.name": "referee_country_name",
        }
    )
    .collect()
)

pprint(result[:3])
[{'away_team': {},
  'away_team_name': 'Kerala Blasters',
  'competition': {},
  'competition_name': 'Indian Super league',
  'home_team': {},
  'home_team_name': 'Hyderabad',
  'referee': {'country': {}},
  'referee_country_name': 'India',
  'referee_name': 'Crystal John'},
 {'away_team': {},
  'away_team_name': 'Jamshedpur',
  'competition': {},
  'competition_name': 'Indian Super league',
  'home_team': {},
  'home_team_name': 'Kerala Blasters',
  'referee': {'country': {}},
  'referee_country_name': 'India',
  'referee_name': 'Harish Kundu'},
 {'away_team': {},
  'away_team_name': 'Hyderabad',
  'competition': {},
  'competition_name': 'Indian Super league',
  'home_team': {},
  'home_team_name': 'ATK Mohun Bagan',
  'referee': {'country': {}},
  'referee_country_name': 'India',
  'referee_name': 'Ramachandran Venkatesh'}]

Lineups#

Get the first player in the lineup for each team for a given match_id.

[4]:
result = (
    Flow.statsbomb.lineups(match_id=8657)
    .assign(player=lambda x: get_field(x, "lineup.0"))
    .select("team_name", "player")
    .collect()
)

pprint(result)
[{'player': {'cards': [],
             'country': {'id': 22, 'name': 'Belgium'},
             'jersey_number': 17,
             'player_id': 2954,
             'player_name': 'Youri Tielemans',
             'player_nickname': None,
             'positions': [{'end_reason': 'Substitution - Off (Tactical)',
                            'from': '00:00',
                            'from_period': 1,
                            'position': 'Right Defensive Midfield',
                            'position_id': 9,
                            'start_reason': 'Starting XI',
                            'to': '77:47',
                            'to_period': 2}]},
  'team_name': 'Belgium'},
 {'player': {'cards': [],
             'country': {'id': 68, 'name': 'England'},
             'jersey_number': 20,
             'player_id': 3094,
             'player_name': 'Bamidele Alli',
             'player_nickname': 'Dele Alli',
             'positions': [{'end_reason': 'Final Whistle',
                            'from': '83:28',
                            'from_period': 2,
                            'position': 'Right Center Midfield',
                            'position_id': 13,
                            'start_reason': 'Substitution - On (Tactical)',
                            'to': None,
                            'to_period': None}]},
  'team_name': 'England'}]

Events#

Get all events for a given match_id and count who took the most shots.

[5]:
result = (
    Flow.statsbomb.events(match_id=8657)
    .filter(where_equals("type.name", "Shot"))
    .group_by("player.name")
    .summary({"n_shots": ("count", "player.name")})
    .sort_by("n_shots", ascending=False)
    .limit(3)
)

for record in result:
    print(record)
{'player.name': 'Harry Maguire', 'n_shots': 3}
{'player.name': 'Eric Dier', 'n_shots': 3}
{'player.name': 'Thomas Meunier', 'n_shots': 2}

Filter the events for a given match_id to select only passes and save them to a database table. For simplicity, we’ll just use a local SQLite database.

[6]:
# Create a connection to the SQLite database
conn = sqlite3.connect("/tmp/passes.db")

# Save the DataFrame to the SQLite database, appending to the table
results = (
    Flow.statsbomb.events(match_id=8657)
    .filter(lambda r: get_field(r, "type.name") == "Pass")
    .assign(
        player_id=lambda r: get_field(r, "player.id"),
        player_name=lambda r: get_field(r, "player.name"),
        start_x=lambda r: get_field(r, "location.0"),
        start_y=lambda r: get_field(r, "location.1"),
        end_x=lambda r: get_field(r, "pass.end_location.0"),
        end_y=lambda r: get_field(r, "pass.end_location.1"),
        outcome=lambda r: get_field(r, "pass.outcome.name"),
    )
    .select(
        "player_id", "player_name", "start_x", "start_y", "end_x", "end_y", "outcome"
    )
    .to_pandas()
    .to_sql("passes", conn, if_exists="append", index=False)
)

# Close the connection
conn.close()