Query Optimization

MatchFlow includes a built-in query optimizer that transparently rewrites your pipeline to improve performance while preserving semantics.

In general, optimization gives you:

  • ✅ safer pipelines

  • ✅ faster execution

  • ✅ smaller intermediate data

  • ✅ better scalability

🚀 When Optimization Happens

By default, flows are unoptimized:

flow = Flow.from_jsonl("match_events.jsonl")

To enable optimization, pass optimize=True:

flow = Flow.from_jsonl("match_events.jsonl", optimize=True)

Or explicitly at collect-time:

flow.collect(optimize=True)

Any visualization (explain(), plot_plan(), etc.) can also show optimized plans.

🧠 What The Optimizer Does

The optimizer currently performs conservative rule-based rewrites, such as pushing an independent filter() earlier in the plan.

🧐 Example

Consider the following flow:

flow = (
    Flow.from_jsonl("match_events.jsonl")
    .assign(team_name=lambda r: r["team"]["name"])
    .filter(lambda r: r["type"]["name"] == "Pass")
    .select("minute", "second", "team_name")
    .limit(100)
)

Without optimization:

from_jsonl → assign → filter → select → limit

With optimization:

from_jsonl → filter → assign → select → limit
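
  • The filter() is pushed ahead of assign(), because its predicate reads only type and not the newly assigned team_name.

  • The assign() and select() steps keep their relative order, but now run only on rows that pass the filter.

  • The limit() stays last and still caps the output at 100 rows.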

  • Fewer rows flow through the pipeline.
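
The saving can be illustrated without MatchFlow at all. The plain-Python sketch below mimics the two step orders on four hand-made records. The helper names (run_unoptimized, run_optimized), the record values, and the call counter are illustrative assumptions, not MatchFlow internals.

```python
# Plain-Python illustration only; this does not use or reproduce MatchFlow's executor.
events = [
    {"minute": 1, "second": 5, "type": {"name": "Pass"}, "team": {"name": "A"}},
    {"minute": 1, "second": 9, "type": {"name": "Shot"}, "team": {"name": "B"}},
    {"minute": 2, "second": 0, "type": {"name": "Pass"}, "team": {"name": "B"}},
    {"minute": 3, "second": 30, "type": {"name": "Foul"}, "team": {"name": "A"}},
]


def is_pass(row):
    # The predicate reads only "type", never the assigned "team_name".
    return row["type"]["name"] == "Pass"


def project(row):
    # Equivalent of select("minute", "second", "team_name").
    return {key: row[key] for key in ("minute", "second", "team_name")}


def run_unoptimized(rows, limit=100):
    """assign -> filter -> select -> limit; returns (result, number of assign calls)."""
    assign_calls = 0
    assigned = []
    for row in rows:
        assign_calls += 1
        assigned.append({**row, "team_name": row["team"]["name"]})
    kept = [row for row in assigned if is_pass(row)]
    return [project(row) for row in kept][:limit], assign_calls


def run_optimized(rows, limit=100):
    """filter -> assign -> select -> limit; returns (result, number of assign calls)."""
    assign_calls = 0
    result = []
    for row in (r for r in rows if is_pass(r)):
        assign_calls += 1
        result.append(project({**row, "team_name": row["team"]["name"]}))
    return result[:limit], assign_calls


raw_result, raw_calls = run_unoptimized(events)
opt_result, opt_calls = run_optimized(events)

print(raw_result == opt_result)  # True: same output either way
print(raw_calls, opt_calls)      # 4 2: assign runs on half as many rows
```

The two orders agree here only because the predicate never touches team_name; a filter on team_name could not be moved ahead of the assign that creates it.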

🔍 Explain Your Plans

You can always inspect both raw and optimized plans:

flow.explain(compare=True)

Or visualize them:

flow.plot_plan(compare=True)

🚫 What The Optimizer Does Not Do (yet…)

  • Complex join reordering

  • Predicate simplification

  • Cost-based optimization

  • Group-by optimizations

The optimizer is designed to be safe-by-default: it will only reorder steps when correctness can be statically guaranteed.

⚙ Optimizer Safety Model

MatchFlow applies conservative optimizations to preserve correctness when working with arbitrary user-defined functions:

✅ Safe to optimize:

Operations like select(), drop(), limit(), filter() (when its predicate is independent of the steps it moves past), sort(), group_by(), and other structural plan steps.

🚫 Not assumed safe to reorder:
  • map()

  • assign()

  • any user-defined filter() with non-trivial predicates

  • any lambdas or custom functions
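
One way to picture a statically checkable notion of independence is a rule that swaps an assign and the filter that follows it only when the filter provably reads nothing the assign writes. The sketch below is hypothetical: the Step class, its reads/writes fields, and push_filters_early are invented for illustration and are not MatchFlow's actual plan representation. A step whose reads or writes are unknown (None), as with an opaque lambda, is never moved.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Step:
    """Hypothetical plan step. None means 'fields unknown', e.g. an opaque lambda."""
    op: str
    reads: Optional[FrozenSet[str]] = None
    writes: Optional[FrozenSet[str]] = None


def can_swap(assign_step, filter_step):
    # Refuse to reorder unless both field sets are known and disjoint.
    if assign_step.writes is None or filter_step.reads is None:
        return False
    return not (assign_step.writes & filter_step.reads)


def push_filters_early(plan):
    """Move each filter left past adjacent assigns it is provably independent of."""
    steps = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(steps) - 1):
            left, right = steps[i], steps[i + 1]
            if left.op == "assign" and right.op == "filter" and can_swap(left, right):
                steps[i], steps[i + 1] = right, left
                changed = True
    return steps


assign_team = Step("assign", reads=frozenset({"team"}), writes=frozenset({"team_name"}))

independent = [Step("from_jsonl"), assign_team,
               Step("filter", reads=frozenset({"type"})), Step("select"), Step("limit")]
dependent = [Step("from_jsonl"), assign_team,
             Step("filter", reads=frozenset({"team_name"}))]
opaque = [Step("from_jsonl"), Step("assign"), Step("filter", reads=frozenset({"type"}))]

print([s.op for s in push_filters_early(independent)])
# ['from_jsonl', 'filter', 'assign', 'select', 'limit']
print([s.op for s in push_filters_early(dependent)])
# ['from_jsonl', 'assign', 'filter']
print([s.op for s in push_filters_early(opaque)])
# ['from_jsonl', 'assign', 'filter']
```

Treating "unknown" as "touches everything" is what keeps a rule like this conservative: it may miss a valid rewrite, but it never applies an invalid one.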

🔒 Why conservative?

Unlike SQL engines, MatchFlow makes no assumptions about:

  • Commutativity: e.g. map() and filter() may not commute.

  • Determinism: user functions may depend on external state, random values, timestamps, etc.

  • Purity: functions may have side-effects or depend on execution order.
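
A small plain-Python case shows why these assumptions cannot be made. The numbering function below is stateful, so applying it before or after the filter gives different results. The names make_numberer and is_pass and the sample rows are illustrative and not part of MatchFlow.

```python
import itertools


def make_numberer():
    """Return a map function that stamps each row it sees with a running number."""
    counter = itertools.count(1)

    def add_seq(row):
        return {**row, "seq": next(counter)}

    return add_seq


def is_pass(row):
    return row["type"] == "Pass"


rows = [{"type": "Pass"}, {"type": "Shot"}, {"type": "Pass"}]

# Order as written: map, then filter. Every row consumes a sequence number.
number = make_numberer()
map_then_filter = [row for row in map(number, rows) if is_pass(row)]

# Reordered: filter, then map. Only surviving rows consume a sequence number.
number = make_numberer()
filter_then_map = [number(row) for row in rows if is_pass(row)]

print([row["seq"] for row in map_then_filter])  # [1, 3]
print([row["seq"] for row in filter_then_map])  # [1, 2]
```

The filter never reads seq, yet swapping the two steps still changes the output, because the map function depends on how many rows it has already seen.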

⚠ Fusion:
  • Consecutive map() / assign() / filter() steps may be fused together at plan build time (syntactic fusion).

  • Fusion never involves reordering; it only combines adjacent steps for efficiency.

🔬 Invariant:

The execution semantics of any user-specified Flow remain the same under optimization, unless steps were fused at creation time.

Summary

You Write                Optimizer Makes Fast
Declarative pipelines    Minimal and efficient execution plans
Readable code            Faster runtime
Safe transformations     Transparent optimization