MatchFlow#

MatchFlow is a lightweight toolkit for working with structured football data, especially nested JSON like StatsBomb event files or match-level logs. Whether you’re building quick explorations or full pipelines, MatchFlow helps you work directly with deeply structured data using a clean, lazy, and chainable API.

What is MatchFlow?#

Flow is not a DataFrame; it is a stream-first query engine built for irregular, event-based football data.

You can:

  • Load JSON, JSONL, or entire folders of match data

  • Filter and transform records lazily with .filter(), .assign(), .select()

  • Group and summarize using .group_by() + .summary()

  • Join datasets, explode lists, split arrays, pivot rows

  • Work with nested data without flattening too early

  • Chain steps fluently, materialize only when ready

  • Filter records using string expressions, like "age > 30 and team == @team_name"

  • Stream data directly from the StatsBomb or Opta APIs
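To make the group-and-summarize idea concrete, here is a plain-Python sketch of the kind of result such a step produces. It does not use MatchFlow's API: the function name `summarize_xg_by_team`, the field names, and the sample records are all made up for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Made-up sample records (illustrative only, not real match data).
shots: List[Dict[str, Any]] = [
    {"team": {"name": "Home"}, "shot": {"xg": 0.25}},
    {"team": {"name": "Away"}, "shot": {"xg": 0.5}},
    {"team": {"name": "Home"}, "shot": {"xg": 0.125}},
]


def summarize_xg_by_team(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Group records by the nested team name and total their xG (hypothetical helper)."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: {"shots": 0, "total_xg": 0.0})
    for record in records:
        team = record["team"]["name"]  # read the nested key in place, no flattening
        totals[team]["shots"] += 1
        totals[team]["total_xg"] += record["shot"]["xg"]
    return dict(totals)


summary = summarize_xg_by_team(shots)
print(summary)
```

The grouping key is read straight from the nested record, which is the same idea as working with nested data without flattening it first.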

All transformations are lazy; nothing runs until you ask for results with .collect(), .to_pandas(), .to_jsonl(), etc.
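The lazy, chainable behaviour described above can be illustrated with a toy class built on Python generators. This is only a sketch of the general technique: `TinyFlow` and `load_events` are hypothetical names, the records are made up, and none of this is MatchFlow's actual implementation.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


class TinyFlow:
    """Toy lazy pipeline (hypothetical stand-in, not MatchFlow's code)."""

    def __init__(self, source: Callable[[], Iterable[Record]]):
        # Keep a factory instead of data, so nothing is read until collect().
        self._source = source

    def filter(self, predicate: Callable[[Record], bool]) -> "TinyFlow":
        return TinyFlow(lambda: (r for r in self._source() if predicate(r)))

    def assign(self, **fields: Callable[[Record], Any]) -> "TinyFlow":
        def gen() -> Iterable[Record]:
            for r in self._source():
                yield {**r, **{name: fn(r) for name, fn in fields.items()}}
        return TinyFlow(gen)

    def select(self, *keys: str) -> "TinyFlow":
        return TinyFlow(lambda: ({k: r[k] for k in keys} for r in self._source()))

    def collect(self) -> List[Record]:
        # Materialization point: only here is the source actually consumed.
        return list(self._source())


reads: List[int] = []  # records which events were actually read


def load_events() -> Iterable[Record]:
    events = [
        {"id": 1, "type": {"name": "Pass"}},
        {"id": 2, "type": {"name": "Shot"}, "shot": {"outcome": "Goal"}},
        {"id": 3, "type": {"name": "Shot"}, "shot": {"outcome": "Saved"}},
    ]
    for event in events:
        reads.append(event["id"])
        yield event


pipeline = (
    TinyFlow(load_events)
    .filter(lambda r: r["type"]["name"] == "Shot")
    .assign(is_goal=lambda r: r["shot"]["outcome"] == "Goal")
    .select("id", "is_goal")
)

reads_before = len(reads)   # building the chain reads nothing
shots = pipeline.collect()  # the source is consumed only now
print(reads_before, shots)
```

Each method returns a new object that wraps the previous step, so the chain is just a description of work until `collect()` is called.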

Interactive Examples#

For a comprehensive, hands-on demonstration of MatchFlow, try the interactive Colab notebook. The notebook walks you through downloading data directly from the StatsBomb API (including StatsBomb’s free, open data sets), building data pipelines, and creating interactive visualizations using penaltyblog’s Pitch plotting library. You can modify the code, experiment with different parameters, and see how the data changes in real time.

Open in Colab

Guide Index#

| Section | Description |
|---|---|
| Why Nested Data Isn’t a Problem - It’s the Point | Why working with nested football data needs a new tool |
| Introduction to Flow | Introduction to MatchFlow |
| Basic Pipelines: Transforming Your Data | Filtering, assigning, selecting, and shaping your data |
| Grouping and Aggregating Data | Summarizing by team, player, period, and more |
| Advanced Pipeline Operations | Sorting, ranking, joining and deduplicating |
| Schema Validation and Type Casting | Schema inference, casting, and field validation |
| Working with Files: Input & Output | Working with JSON, JSONL, folders, glob patterns |
| Utility, Inspection & Interoperability | Exploring structure, peeking at records, debugging |
| Best Practices, Performance & Troubleshooting | Materialization, memory, performance, clean code |
| Filtering Data with Predicates and Helpers | Reusable filters like where_equals(), and_() |
| The Query Method | Filtering using string expressions, like "age > 30 and team == @team_name" |
| Query Optimization | Smart plan rewrites for faster execution |
| Using Flow with StatsBomb Data | Streaming data directly from the StatsBomb API |
| Using Flow with Opta Data | Streaming data directly from the Opta API |

Quick Start#

from penaltyblog.matchflow import Flow, where_equals

# Load StatsBomb events for one match, keep only shots,
# and select a few (nested) fields by dotted path
flow = (
    Flow.statsbomb.events(match_id=19716)
    .filter(where_equals("type.name", "Shot"))
    .select("player.name", "location", "shot.statsbomb_xg")
)

# Print the first five matching shots
for shot in flow.head(5):
    print(shot)
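The dotted field names in the snippet above, such as "type.name", point at keys inside nested objects. A minimal sketch of how such a path could be resolved against a plain dict, assuming simple dot-separated keys: `get_path` is a hypothetical helper, not part of MatchFlow, and the sample event is made up.

```python
from typing import Any, Mapping


def get_path(record: Mapping[str, Any], path: str, default: Any = None) -> Any:
    """Follow a dot-separated path through nested mappings (hypothetical helper)."""
    current: Any = record
    for key in path.split("."):
        # Stop early if the path runs into a non-mapping or a missing key.
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


# Made-up event shaped loosely like a nested event record (illustrative only).
event = {
    "type": {"name": "Shot"},
    "player": {"name": "Example Player"},
    "location": [102.0, 34.5],
    "shot": {"statsbomb_xg": 0.12},
}

print(get_path(event, "type.name"))
print(get_path(event, "shot.statsbomb_xg"))
print(get_path(event, "pass.length", default="n/a"))
```

Returning a default for missing paths matters with event data, where fields such as shot details exist only on some record types.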

Ready to Flow?#

Pick a section from the guide above, or jump in with .from_jsonl(), .from_folder(), or .statsbomb.events() and start building your pipeline.

Need help? Ask questions, file issues, or suggest improvements any time.