MatchFlow#

MatchFlow is a lightweight toolkit for working with structured football data, especially nested JSON like StatsBomb event files or match-level logs. Whether you’re building quick explorations or full pipelines, MatchFlow helps you work directly with deeply structured data using a clean, lazy, and chainable API.

What is MatchFlow?#

Flow is not a DataFrame; it’s a stream-first query engine built for irregular, event-based football data.

You can:

Load JSON, JSONL, or entire folders of match data
Filter and transform records lazily with .filter(), .assign(), .select()
Group and summarize using .group_by() + .summary()
Join datasets, explode lists, split arrays, pivot rows
Work with nested data without flattening too early
Chain steps fluently, materialize only when ready
Filter using string expressions, like "age > 30 and team == @team_name"
Stream data directly from the StatsBomb or Opta APIs

All transformations are lazy; nothing runs until you ask for results with .collect(), .to_pandas(), .to_jsonl(), or a similar method.
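
To make the lazy model concrete, here is a minimal pure-Python sketch of the general idea. It is not MatchFlow's implementation: the `TinyFlow` class, its internals, and the sample events are hypothetical, and only the method names `filter`, `assign`, and `collect` mirror those mentioned above.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


class TinyFlow:
    """Hypothetical stand-in that illustrates lazy, chainable steps."""

    def __init__(self, source: Iterable[Record], steps=None):
        self._source = source
        self._steps = list(steps or [])

    def _with(self, step) -> "TinyFlow":
        # Each method only records a step and returns a new TinyFlow.
        return TinyFlow(self._source, self._steps + [step])

    def filter(self, predicate: Callable[[Record], bool]) -> "TinyFlow":
        def step(records):
            return (r for r in records if predicate(r))
        return self._with(step)

    def assign(self, **fields: Callable[[Record], Any]) -> "TinyFlow":
        def step(records):
            for r in records:
                out = dict(r)  # copy so the input record is left untouched
                for name, fn in fields.items():
                    out[name] = fn(r)  # computed from the incoming record
                yield out
        return self._with(step)

    def collect(self) -> List[Record]:
        # Only here are the recorded steps chained and actually run.
        stream = iter(self._source)
        for step in self._steps:
            stream = step(stream)
        return list(stream)


calls = []  # records which events the predicate has inspected


def is_shot(record: Record) -> bool:
    calls.append(record["id"])
    return record["type"]["name"] == "Shot"


# Made-up events, loosely shaped like nested event data
events = [
    {"id": 1, "type": {"name": "Pass"}, "minute": 3},
    {"id": 2, "type": {"name": "Shot"}, "minute": 17},
]

pipeline = TinyFlow(events).filter(is_shot).assign(
    half=lambda r: 1 if r["minute"] < 45 else 2
)
calls_before_collect = len(calls)  # 0: building the pipeline ran nothing
shots = pipeline.collect()         # the predicate runs only now
```

In this sketch, chaining builds a plan and `collect()` executes it, so the predicate is never called until results are requested.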

Interactive Examples#

For a comprehensive, hands-on demonstration of MatchFlow, try the interactive Colab notebook. The notebook walks you through downloading data directly from the StatsBomb API (including StatsBomb’s free, open data sets), building data pipelines, and creating interactive visualizations using penaltyblog’s Pitch plotting library. You can modify the code, experiment with different parameters, and see how the data changes in real time.

Guide Index#

Section	Description
Why Nested Data Isn’t a Problem - It’s the Point	Why working with nested football data needs a new tool
Introduction to Flow	Introduction to MatchFlow
Basic Pipelines: Transforming Your Data	Filtering, assigning, selecting, and shaping your data
Grouping and Aggregating Data	Summarizing by team, player, period, and more
Advanced Pipeline Operations	Sorting, ranking, joining and deduplicating
Schema Validation and Type Casting	Schema inference, casting, and field validation
Working with Files: Input & Output	Working with JSON, JSONL, folders, glob patterns
Utility, Inspection & Interoperability	Exploring structure, peeking at records, debugging
Best Practices, Performance & Troubleshooting	Materialization, memory, performance, clean code
Filtering Data with Predicates and Helpers	Reusable filters like `where_equals()`, `and_()`
The Query Method	Filtering using string expressions, like `"age > 30 and team == @team_name"`
Query Optimization	Smart plan rewrites for faster execution
Using Flow with StatsBomb Data	Streaming data directly from the StatsBomb API
Using Flow with Opta Data	Streaming data directly from the Opta API

Quick Start#

from penaltyblog.matchflow import Flow, where_equals

# Load and filter StatsBomb shots
flow = (
    Flow.statsbomb.events(match_id=19716)
    .filter(where_equals("type.name", "Shot"))  # keep only shot events
    .select("player.name", "location", "shot.statsbomb_xg")  # keep three fields
)

# Print the first five matching records
for shot in flow.head(5):
    print(shot)
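
The dotted field names in the example, such as "type.name", point at keys inside nested objects. As a rough illustration of what such a path means, here is a plain-Python sketch. It is not MatchFlow's own resolver: `get_path` and the sample record are hypothetical, and keying the result by path is a choice made for this illustration only.

```python
from typing import Any


def get_path(record: dict, path: str, default: Any = None) -> Any:
    """Follow a dotted path such as "type.name" through nested dicts."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # the path does not exist in this record
        current = current[key]
    return current


# Made-up record, loosely shaped like a StatsBomb shot event
event = {
    "type": {"name": "Shot"},
    "player": {"name": "Example Player"},
    "location": [102.0, 34.5],
    "shot": {"statsbomb_xg": 0.12},
}

fields = ["player.name", "location", "shot.statsbomb_xg"]
selected = {path: get_path(event, path) for path in fields}
missing = get_path(event, "pass.length")  # falls back to the default, None
```

Because event records are irregular, a lookup like this has to tolerate missing branches: a shot has no "pass" object, so the sketch returns a default instead of raising an error.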

Ready to Flow?#

Pick a section from the guide above, or jump in with .from_jsonl(), .from_folder(), or .statsbomb.events() and start building your pipeline.

Need help? Ask questions, file issues, or suggest improvements any time.
