==================
Query Optimization
==================

MatchFlow includes a **built-in query optimizer** that transparently rewrites your pipeline to improve performance while preserving semantics.

In general, optimization aims to give you:

- ✅ safer pipelines
- ✅ faster execution
- ✅ smaller intermediate data
- ✅ better scalability

🚀 When Optimization Happens
============================

By default, flows are unoptimized:

.. code-block:: python

   flow = Flow.from_jsonl("match_events.jsonl")

To enable optimization, pass ``optimize=True``:

.. code-block:: python

   flow = Flow.from_jsonl("match_events.jsonl", optimize=True)

Or enable it explicitly at collect time:

.. code-block:: python

   flow.collect(optimize=True)

Any visualization (``explain()``, ``plot_plan()``, etc.) can also show optimized plans.

🧠 What The Optimizer Does
==========================

The optimizer currently performs **conservative rule-based** rewrites, including:

+--------------------------------+-----------------------------------------------------------------------------+
| Optimization                   | Description                                                                 |
+================================+=============================================================================+
| **Filter Pushdown**            | Moves ``filter()`` earlier so that fewer rows reach later steps             |
+--------------------------------+-----------------------------------------------------------------------------+
| **Limit Pushdown**             | Moves ``limit()`` closer to the source                                      |
+--------------------------------+-----------------------------------------------------------------------------+
| **Select/Drop Pushdown**       | Drops unused fields as early as safely possible                             |
+--------------------------------+-----------------------------------------------------------------------------+
| **Map/Assign Fusion**          | Fuses consecutive ``map()``, ``assign()``, ``filter()`` into a single step  |
+--------------------------------+-----------------------------------------------------------------------------+
| **Redundant Step Elimination** | Removes unnecessary repeated ``drop()`` and ``dropna()`` steps              |
+--------------------------------+-----------------------------------------------------------------------------+
| **Rolling Validation**         | Warns if rolling summaries lack a prior ``sort()`` step                     |
+--------------------------------+-----------------------------------------------------------------------------+
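
To make "rule-based rewrite" concrete, the sketch below applies one rule, redundant step elimination, to a toy plan. The plan representation (a list of ``(operation, arguments)`` tuples), the helper ``eliminate_redundant_steps``, and the field names are assumptions for illustration only; they are not MatchFlow's internal API.

.. code-block:: python

   # Toy plan: a list of (operation, arguments) tuples. This representation
   # is hypothetical and only illustrates the shape of a rewrite rule.
   Plan = list[tuple[str, tuple]]

   # Steps that give the same result when repeated with the same arguments.
   IDEMPOTENT_OPS = {"drop", "dropna"}


   def eliminate_redundant_steps(plan: Plan) -> Plan:
       """Remove a step that exactly repeats the step directly before it."""
       optimized: Plan = []
       for step in plan:
           op, _args = step
           if optimized and op in IDEMPOTENT_OPS and optimized[-1] == step:
               continue  # same op with same args twice in a row: keep one
           optimized.append(step)
       return optimized


   raw_plan: Plan = [
       ("from_jsonl", ("match_events.jsonl",)),
       ("drop", ("related_events",)),   # hypothetical field name
       ("drop", ("related_events",)),
       ("dropna", ("player",)),         # hypothetical field name
       ("dropna", ("player",)),
       ("limit", (100,)),
   ]

   optimized_plan = eliminate_redundant_steps(raw_plan)
   print([op for op, _ in optimized_plan])

The rule only looks at adjacent, identical steps, so it never changes the order of anything; that is what keeps a rewrite like this conservative.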

🧐 Example
==========

Consider the following flow:

.. code-block:: python

   flow = (
       Flow.from_jsonl("match_events.jsonl")
       .assign(team_name=lambda r: r["team"]["name"])
       .filter(lambda r: r["type"]["name"] == "Pass")
       .select("minute", "second", "team_name")
       .limit(100)
   )

Without optimization:

.. code-block:: text

   from_jsonl → assign → filter → select → limit

With optimization:

.. code-block:: text

   from_jsonl → filter → assign → select → limit

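- The ``filter()`` is pushed ahead of ``assign()``, because its predicate reads only ``type`` and not the new ``team_name`` field.
- The ``select()`` and ``limit()`` keep their positions: ``select()`` needs ``team_name``, and ``limit()`` must count rows after filtering.
- Fewer rows flow through the pipeline, since ``assign()`` now runs only on rows that pass the filter.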
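
The effect of this reordering can be checked with plain Python stand-ins for the steps. The sketch below uses no MatchFlow calls; the sample records and helper names are made up for illustration, and ``limit()`` is left out because the sample is tiny.

.. code-block:: python

   # Made-up sample records shaped like the events used above.
   events = [
       {"minute": 1, "second": 5, "type": {"name": "Pass"}, "team": {"name": "Home"}},
       {"minute": 1, "second": 9, "type": {"name": "Shot"}, "team": {"name": "Away"}},
       {"minute": 2, "second": 0, "type": {"name": "Pass"}, "team": {"name": "Away"}},
   ]

   calls = {"assign": 0}  # counts how often the assign step runs


   def assign_team_name(record):
       calls["assign"] += 1
       return {**record, "team_name": record["team"]["name"]}


   def is_pass(record):
       return record["type"]["name"] == "Pass"


   def select(record):
       return {key: record[key] for key in ("minute", "second", "team_name")}


   # Original order: assign -> filter -> select
   calls["assign"] = 0
   unoptimized = [select(r) for r in map(assign_team_name, events) if is_pass(r)]
   calls_unoptimized = calls["assign"]

   # Rewritten order: filter -> assign -> select
   calls["assign"] = 0
   optimized = [select(assign_team_name(r)) for r in events if is_pass(r)]
   calls_optimized = calls["assign"]

   assert unoptimized == optimized          # same rows either way
   print(calls_unoptimized, calls_optimized)  # 3 2

Both orders return the same rows, but the rewritten order calls the assign step only for the two passes instead of all three events.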

🔍 Explain Your Plans
=====================

You can always inspect both raw and optimized plans:

.. code-block:: python

   flow.explain(compare=True)

Or visualize them:

.. code-block:: python

   flow.plot_plan(compare=True)

🚫 What The Optimizer Does Not Do (yet...)
==========================================

- Complex join reordering
- Predicate simplification
- Cost-based optimization
- Group-by optimizations

The optimizer is designed to be **safe-by-default**: it will only reorder steps when correctness can be statically guaranteed.

⚙ Optimizer Safety Model
========================

MatchFlow applies **conservative optimizations** to preserve correctness when working with arbitrary user-defined functions:

✅ Safe to optimize:
  Operations like ``select()``, ``drop()``, ``limit()``, ``filter()`` (when it does not depend on the steps it moves past), ``sort()``, ``group_by()``, and other structural plan steps.

🚫 Not assumed safe to reorder:
  - ``map()``
  - ``assign()``
  - any user-defined ``filter()`` with non-trivial predicates
  - any lambdas or custom functions

🔒 Why conservative?
  Unlike SQL engines, MatchFlow makes no assumptions about:

  - Commutativity: e.g. ``map()`` and ``filter()`` may not commute.
  - Determinism: user functions may depend on external state, random values, timestamps, etc.
  - Purity: functions may have side-effects or depend on execution order.
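
  As a small illustration of the commutativity point, the plain-Python sketch below shows a ``map()``-style step and a ``filter()``-style step giving different results depending on their order. It uses no MatchFlow calls; the ``score`` records and helper names are made up.

  .. code-block:: python

     def double_score(record):
         """A map-style step: returns a new record with the score doubled."""
         return {**record, "score": record["score"] * 2}


     def is_high(record):
         """A filter-style predicate that reads the field the map step changes."""
         return record["score"] >= 10


     records = [{"score": 4}, {"score": 6}, {"score": 12}]

     map_then_filter = [r for r in map(double_score, records) if is_high(r)]
     filter_then_map = [double_score(r) for r in records if is_high(r)]

     print(map_then_filter)  # [{'score': 12}, {'score': 24}]
     print(filter_then_map)  # [{'score': 24}]

  Because the predicate reads a field that the map step rewrites, swapping the two steps changes the output, which is why such a swap cannot be assumed safe for arbitrary functions.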

⚠ Fusion:
  - Consecutive ``map()`` / ``assign()`` / ``filter()`` steps may be fused together at plan build time (syntactic fusion).
  - Fusion never involves reordering; it only combines adjacent steps for efficiency.
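
  A minimal sketch of what order-preserving fusion could look like, in plain Python. The ``fuse`` helper, the ``(kind, function)`` step representation, and the use of ``None`` as a "filtered out" marker are assumptions for illustration, not MatchFlow's actual implementation.

  .. code-block:: python

     def fuse(steps):
         """Combine adjacent per-record steps into one function, keeping their order.

         Each step is a ("map" | "filter", function) pair. The fused function
         returns the transformed record, or None if a filter rejected it
         (this sketch assumes map functions never return None themselves).
         """
         def fused(record):
             for kind, fn in steps:
                 if kind == "map":
                     record = fn(record)
                 elif not fn(record):  # kind == "filter"
                     return None
             return record
         return fused


     steps = [
         ("map", lambda r: {**r, "team_name": r["team"]["name"]}),
         ("filter", lambda r: r["type"]["name"] == "Pass"),
         ("map", lambda r: {**r, "label": f'{r["team_name"]} pass'}),
     ]

     # Made-up sample records.
     events = [
         {"type": {"name": "Pass"}, "team": {"name": "Home"}},
         {"type": {"name": "Shot"}, "team": {"name": "Away"}},
     ]

     run = fuse(steps)
     results = [out for out in map(run, events) if out is not None]
     print([r["label"] for r in results])  # ['Home pass']

  Each record passes through the three steps in their written order within a single call, so the data is traversed once rather than three times.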

🔬 Invariant:
  The execution semantics of any user-specified Flow remain the same under optimization, unless steps were fused at creation time.

Summary
=======

+---------------------------+---------------------------------------+
| You Write                 | Optimizer Provides                    |
+===========================+=======================================+
| **Declarative pipelines** | Minimal and efficient execution plans |
+---------------------------+---------------------------------------+
| **Readable code**         | Faster runtime                        |
+---------------------------+---------------------------------------+
| **Safe transformations**  | Transparent optimization              |
+---------------------------+---------------------------------------+