Using Flow with StatsBomb Data

Flow includes a built-in integration with the StatsBomb API, making it easy to stream structured football data directly into your pipelines.

Rather than loading everything upfront, Flow wraps the API as lazy operations: each call builds a plan, and the data is fetched only when needed (e.g., on .collect() or .to_pandas()).
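To make the idea of "building a plan" concrete, here is a minimal standard-library sketch. The LazyPlan class and fake_fetch function are hypothetical names invented for this illustration; this is not how penaltyblog implements Flow, only a toy showing that steps can be recorded first and executed later.

```python
class LazyPlan:
    """Toy stand-in for a lazy pipeline; not penaltyblog's actual implementation."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that yields records
        self._steps = tuple(steps)   # recorded operations, not yet executed

    def filter(self, predicate):
        # Return a new plan with one more recorded step; nothing runs here.
        return LazyPlan(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Only now is the source called and the recorded steps applied.
        records = self._source()
        for kind, fn in self._steps:
            if kind == "filter":
                records = filter(fn, records)
        return list(records)


calls = []


def fake_fetch():
    # Hypothetical source standing in for an API request.
    calls.append("fetch")
    return iter([{"id": 1}, {"id": 2}, {"id": 3}])


plan = LazyPlan(fake_fetch).filter(lambda r: r["id"] > 1)
fetches_before_collect = len(calls)  # the plan exists, but the source is untouched
result = plan.collect()              # the single fetch happens here
```

In this sketch the source is called exactly once, and only when collect() runs.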

βš™οΈ Setup#

Ensure your StatsBomb credentials are set as environment variables if you're using private access:

export SB_USERNAME="your_username"
export SB_PASSWORD="your_password"

🚀 Getting Started

from penaltyblog.matchflow import Flow

# Fetch all competitions
comps = Flow.statsbomb.competitions()

for comp in comps.head(3):
    print(comp)

All API calls return a Flow, so you can apply the usual transformations such as .filter(), .select(), and .assign().

πŸ” Available Endpoints#

| Method | Description |
|---|---|
| .competitions() | All competitions available via API |
| .matches(competition_id, season_id) | Matches for a specific season |
| .events(match_id) | All events in a match |
| .lineups(match_id) | Lineups and formation for a match |
| .player_match_stats(match_id) | Player-level stats for a match |
| .player_season_stats(competition_id, season_id) | Player stats over a season |
| .team_match_stats(match_id) | Team stats for a match |
| .team_season_stats(competition_id, season_id) | Team stats over a season |

All of these return a lazy Flow.

🧪 Example: Shots in a Match

from penaltyblog.matchflow import Flow, where_equals

shots = (
    Flow.statsbomb.events(match_id=19716)
    .filter(where_equals("type.name", "Shot"))
    .select("player.name", "location", "shot.outcome.name")
)

for shot in shots.head(3):
    print(shot)

🧼 Filtering & Transforming

Because Flow supports deep access to nested fields, you can work directly with StatsBomb's JSON structure without needing to flatten first:

from penaltyblog.matchflow import Flow

top_scorers = (
    Flow.statsbomb.player_season_stats(competition_id=43, season_id=106)
    .filter(lambda r: r["goals"] >= 5)
    .select("player.name", "team.name", "goals")
)
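As a rough picture of what "deep access" with a dotted name such as "player.name" means, the following standard-library sketch walks a nested dict one key at a time. The helper get_path is hypothetical and written for this illustration only, and the sample record is made up rather than real StatsBomb data; Flow's own lookup may behave differently, for example around missing keys.

```python
def get_path(record, path, default=None):
    """Follow a dotted path such as "player.name" through nested dicts."""
    current = record
    for key in path.split("."):
        # Stop early if the path runs into a non-dict value or a missing key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up record shaped like a nested event; not real StatsBomb data.
event = {"type": {"name": "Shot"}, "player": {"name": "Example Player"}}

event_type = get_path(event, "type.name")
missing = get_path(event, "shot.outcome.name")  # path absent, so default is returned
```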

🐒 Lazy Until Needed

Remember, nothing is downloaded or processed until you materialize the flow:

  • .collect() → fetches all records

  • .to_pandas() → fetches and converts to DataFrame

  • .head(n) → fetches just the first n records

df = Flow.statsbomb.competitions().to_pandas()
print(df)
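The difference between taking the first n records and collecting everything can be illustrated with a plain Python generator. This is a sketch under assumptions: record_stream and head are hypothetical names for this example, and how Flow implements .head(n) internally is not shown in this guide.

```python
from itertools import islice

fetched = []


def record_stream():
    # Hypothetical source: yields records one at a time, logging each pull.
    for i in range(1000):
        fetched.append(i)
        yield {"id": i}


def head(records, n):
    # Take the first n records and stop pulling from the source.
    return list(islice(records, n))


first_three = head(record_stream(), 3)
pulled = len(fetched)  # how many records the source actually produced
```

Here the generator is asked for only three of its 1000 records, which is the behaviour a streaming source makes possible.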

🔒 Authenticated Access

All API methods accept a creds dictionary, or you can use environment variables:

Flow.statsbomb.events(match_id=123, creds={"user": "...", "passwd": "..."})
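If you prefer to pass credentials explicitly, one option is to assemble the dictionary from the same environment variables used in Setup. The helper build_creds below is a hypothetical name for this sketch; it uses the "user" and "passwd" keys shown above and returns None when either variable is unset.

```python
import os


def build_creds(env=None):
    """Return a creds dict from SB_USERNAME / SB_PASSWORD, or None if either is unset."""
    env = os.environ if env is None else env
    user = env.get("SB_USERNAME")
    passwd = env.get("SB_PASSWORD")
    if not user or not passwd:
        return None
    return {"user": user, "passwd": passwd}


# Passing a plain dict instead of os.environ keeps the example self-contained.
creds = build_creds({"SB_USERNAME": "your_username", "SB_PASSWORD": "your_password"})
```

When build_creds() returns a dictionary, it can be passed as creds=... to any of the API methods; when it returns None, simply omit the argument.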

🧠 Tips

  • Useful for clubs or analysts already using StatsBomb data

  • Flows can be joined with your internal data or flattened and saved

  • Try .flatten().to_jsonl() to export clean JSONL for later
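For readers unfamiliar with the terms in the last tip, here is a standard-library sketch of flattening nested records and writing them as JSONL (one JSON object per line). The functions flatten and write_jsonl are hypothetical helpers for this illustration, the dotted key naming is an assumption that may not match what Flow's .flatten() produces, and the sample record is made up.

```python
import io
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat


def write_jsonl(records, handle):
    # One JSON object per line, which is all JSONL requires.
    for record in records:
        handle.write(json.dumps(flatten(record)) + "\n")


# Made-up record; not real StatsBomb data.
rows = [{"player": {"name": "Example Player"}, "location": [100.0, 40.0]}]
buffer = io.StringIO()  # stands in for an open file
write_jsonl(rows, buffer)
```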

πŸ“ Summary#

Flow's StatsBomb integration:

  • ✅ Keeps your data structured

  • ✅ Streams on demand (not loaded eagerly)

  • ✅ Integrates with full Flow pipeline tools

  • ✅ Works with both open and authenticated endpoints

Interactive Example

For a comprehensive, hands-on demonstration of working with StatsBomb data, try the interactive Colab notebook. The notebook walks you through loading data from the StatsBomb API, filtering it, and visualizing the results. You can modify the code, experiment with different parameters, and see how the results change as you go.

Open in Colab