Query Optimization
MatchFlow includes a built-in query optimizer that transparently rewrites your pipeline to improve performance while preserving semantics.
In general, optimization gives you:
✅ safer pipelines
✅ faster execution
✅ smaller intermediate data
✅ better scalability
🚀 When Optimization Happens
By default, flows are unoptimized:
flow = Flow.from_jsonl("match_events.jsonl")
To enable optimization, pass optimize=True:
flow = Flow.from_jsonl("match_events.jsonl", optimize=True)
Or explicitly at collect-time:
flow.collect(optimize=True)
Any visualization (explain(), plot_plan(), etc.) can also show optimized plans.
🧠 What The Optimizer Does
The optimizer currently performs conservative rule-based rewrites, such as moving a filter() earlier in the plan when that is safe.
🧐 Example
Consider the following flow:
flow = (
    Flow.from_jsonl("match_events.jsonl")
    .assign(team_name=lambda r: r["team"]["name"])
    .filter(lambda r: r["type"]["name"] == "Pass")
    .select("minute", "second", "team_name")
    .limit(100)
)
Without optimization:
from_jsonl → assign → filter → select → limit
With optimization:
from_jsonl → filter → assign → select → limit
The filter() is pushed earlier.
The assign() and select() are reordered.
The limit() is moved earlier.
Fewer rows flow through the pipeline.
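To make the benefit of the filter pushdown concrete, here is a minimal sketch in plain Python. It is not MatchFlow's implementation: the sample events and the helper names `run_unoptimized` and `run_filter_first` are made up for illustration. Because the predicate only reads `type.name`, which the `assign` step does not touch, both orderings return the same rows, but the filter-first version runs `assign` on fewer records.

```python
from itertools import islice

# Hypothetical sample events, shaped like the records used in the example above.
EVENTS = [
    {"minute": 1, "second": 5, "team": {"name": "A"}, "type": {"name": "Pass"}},
    {"minute": 1, "second": 9, "team": {"name": "B"}, "type": {"name": "Shot"}},
    {"minute": 2, "second": 0, "team": {"name": "A"}, "type": {"name": "Pass"}},
    {"minute": 3, "second": 30, "team": {"name": "B"}, "type": {"name": "Foul"}},
]


def is_pass(r):
    return r["type"]["name"] == "Pass"


def select(r):
    return {k: r[k] for k in ("minute", "second", "team_name")}


def run_unoptimized(events, limit=100):
    """assign -> filter -> select -> limit. Returns (rows, number of assign calls)."""
    calls = 0

    def assign(r):
        nonlocal calls
        calls += 1
        return {**r, "team_name": r["team"]["name"]}

    assigned = (assign(r) for r in events)
    passes = (r for r in assigned if is_pass(r))
    rows = list(islice((select(r) for r in passes), limit))
    return rows, calls


def run_filter_first(events, limit=100):
    """filter -> assign -> select -> limit. Returns (rows, number of assign calls)."""
    calls = 0

    def assign(r):
        nonlocal calls
        calls += 1
        return {**r, "team_name": r["team"]["name"]}

    passes = (r for r in events if is_pass(r))
    assigned = (assign(r) for r in passes)
    rows = list(islice((select(r) for r in assigned), limit))
    return rows, calls


rows_a, calls_a = run_unoptimized(EVENTS)
rows_b, calls_b = run_filter_first(EVENTS)
print(rows_a == rows_b)   # True: same output either way
print(calls_a, calls_b)   # 4 2: assign runs only on the rows that survive the filter
```

The swap is only valid here because the predicate does not depend on `team_name` and `assign` has no side effects; if either assumption fails, the two orderings can differ.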
🔍 Explain Your Plans
You can always inspect both raw and optimized plans:
flow.explain(compare=True)
Or visualize them:
flow.plot_plan(compare=True)
🚫 What The Optimizer Does Not Do (yet…)
Complex join reordering
Predicate simplification
Cost-based optimization
Group-by optimizations
The optimizer is designed to be safe-by-default: it will only reorder steps when correctness can be statically guaranteed.
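One way to picture a rewrite that only fires when independence can be checked up front is the sketch below. It is an illustration under assumptions, not MatchFlow's internals: the `Step` class, its `reads`, `writes` and `opaque` fields, and the `push_filters_earlier` function are all hypothetical. The rule swaps a filter with the assign before it only when neither step is opaque and the filter reads none of the fields the assign writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical plan node; these field names are assumptions for the sketch."""
    op: str                           # e.g. "assign", "filter", "select"
    reads: frozenset = frozenset()    # fields the step reads, when known
    writes: frozenset = frozenset()   # fields the step creates or overwrites
    opaque: bool = False              # True when the step wraps an arbitrary function


def push_filters_earlier(plan):
    """Swap a filter with the assign before it only when they are provably independent."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, cur = plan[i - 1], plan[i]
            independent = not (cur.reads & prev.writes)
            if (cur.op == "filter" and prev.op == "assign"
                    and not cur.opaque and not prev.opaque and independent):
                plan[i - 1], plan[i] = cur, prev
                changed = True
    return plan


base = [
    Step("from_jsonl"),
    Step("assign", writes=frozenset({"team_name"})),
    Step("filter", reads=frozenset({"type"})),
    Step("select", reads=frozenset({"minute", "second", "team_name"})),
    Step("limit"),
]
optimized_ops = [s.op for s in push_filters_earlier(base)]

# A filter that reads the assigned field must stay where it is.
dependent = [
    Step("assign", writes=frozenset({"team_name"})),
    Step("filter", reads=frozenset({"team_name"})),
]
dependent_ops = [s.op for s in push_filters_earlier(dependent)]

# An opaque filter (unknown function) is never moved.
opaque = [
    Step("assign", writes=frozenset({"team_name"})),
    Step("filter", opaque=True),
]
opaque_ops = [s.op for s in push_filters_earlier(opaque)]

print(optimized_ops)  # ['from_jsonl', 'filter', 'assign', 'select', 'limit']
print(dependent_ops)  # ['assign', 'filter']
print(opaque_ops)     # ['assign', 'filter']
```

The design choice in this sketch is that the default answer is "do not move": a step is reordered only when its field usage is declared and disjoint, which mirrors the safe-by-default stance described above.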
⚙ Optimizer Safety Model
MatchFlow applies conservative optimizations to preserve correctness when working with arbitrary user-defined functions:
- ✅ Safe to optimize:
Operations like select(), drop(), limit(), filter() (when independent), sort(), group_by(), and other structural plan steps.
- 🚫 Not assumed safe to reorder:
  - map()
  - assign()
  - any user-defined filter() with non-trivial predicates
  - any lambdas or custom functions
- 🔒 Why conservative?
Unlike SQL engines, MatchFlow makes no assumptions about:
  - Commutativity: e.g. map() and filter() may not commute.
  - Determinism: user functions may depend on external state, random values, timestamps, etc.
  - Purity: functions may have side-effects or depend on execution order.
- ⚠ Fusion:
Consecutive map()/assign()/filter() steps may be fused together at plan build time (syntactic fusion).
Fusion never involves reordering; it only combines adjacent steps for efficiency.
- 🔬 Invariant:
The execution semantics of any user-specified Flow remain the same under optimization, unless steps were fused at creation time.
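The commutativity and fusion points above can be illustrated with a small plain-Python sketch. Nothing here is MatchFlow code: `double`, `is_big` and `fuse` are hypothetical helpers. The first half shows a map and a filter that give different results when swapped; the second half shows that fusing adjacent steps into a single per-record function, without changing their order, leaves the result unchanged.

```python
def double(r):
    return {**r, "x": r["x"] * 2}


def is_big(r):
    return r["x"] > 5


rows = [{"x": 3}, {"x": 4}, {"x": 6}]

# map then filter: values are doubled first, so 6, 8 and 12 all pass the test.
map_then_filter = [r for r in map(double, rows) if is_big(r)]

# filter then map: only the original 6 passes, and it is then doubled.
filter_then_map = [double(r) for r in rows if is_big(r)]

print(map_then_filter)  # [{'x': 6}, {'x': 8}, {'x': 12}]
print(filter_then_map)  # [{'x': 12}]


def fuse(steps):
    """Combine adjacent ("map", fn) / ("filter", fn) steps into one function.

    The steps run in their original order; None marks a dropped record
    (this sketch assumes real records are never None).
    """
    def fused(record):
        for kind, fn in steps:
            if kind == "map":
                record = fn(record)
            elif not fn(record):
                return None
        return record
    return fused


fused = fuse([("map", double), ("filter", is_big)])
fused_result = [out for out in map(fused, rows) if out is not None]
print(fused_result == map_then_filter)  # True: fusion kept the original order
```

Fusion saves a pass over the data, whereas swapping the two steps changes which records survive, which is why only the former is treated as always safe.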
Summary
| You Write | Optimizer Makes Fast |
|---|---|
| Declarative pipelines | Minimal and efficient execution plans |
| Readable code | Faster runtime |
| Safe transformations | Transparent optimization |