============================================= Best Practices, Performance & Troubleshooting ============================================= Flow is designed for **clarity**, **composability**, and **structured JSON pipelines**. But to use it effectively โ€” especially on large or semi-structured data โ€” you need to understand how Flow executes and when data is consumed. ๐Ÿง  Think in DAGs, Not DataFrames ================================ Flow builds a **deferred plan** of steps (like a DAG). Nothing runs until you collect results: .. code-block:: python flow = Flow.from_folder("data/").filter(...).assign(...).select(...) At this point, **no data has been read**. ๐Ÿšจ When Execution Happens ========================= Flow starts processing only when you: - ``.collect()`` - ``.to_pandas()`` - ``.to_json()``, ``.to_jsonl()`` - Iterate over the Flow - Call ``.first()``, ``.keys()``, ``len()`` โš ๏ธ When Materialization Happens =============================== Certain operations require the **full dataset** and will materialize in memory: +---------------+-----------------------------+ | Operation | Reason | +===============+=============================+ | ``.group_by()`` | Groups must be fully built | +---------------+-----------------------------+ | ``.summary()`` | Aggregates need full access | +---------------+-----------------------------+ | ``.sort_by()`` | Requires sorting all rows | +---------------+-----------------------------+ | ``.join()`` | Right side is pre-loaded | +---------------+-----------------------------+ | ``.pivot()`` | Reshapes after aggregation | +---------------+-----------------------------+ | ``.limit()`` | Buffers to truncate | +---------------+-----------------------------+ | ``.cache()`` | Explicit full collection | +---------------+-----------------------------+ You can always call ``.explain()`` to see where materialization occurs: .. code-block:: python flow.explain() ๐Ÿงช Inspect Safely ================== You can preview without consuming the full plan: .. code-block:: python Flow.from_jsonl("match.jsonl").head(3) ``.head(n)`` adds a ``.limit()`` and returns the first ``n`` results via ``.collect()``. It's a safe way to preview data. ๐Ÿ” Fork Pipelines Naturally ============================ .. code-block:: python f = Flow.from_jsonl("match.jsonl") attacks = f.filter(lambda r: r["team"] == "Arsenal") defence = f.filter(lambda r: r["team"] == "Manchester City") Because the pipeline is just a plan, each branch is safe and isolated. ๐Ÿงฐ Use ``.pipe()`` for Debugging or Custom Steps ================================================ You can insert custom logic mid-pipeline with ``.pipe()``: .. code-block:: python def peek(flow): print(flow.head(3)) return flow Flow.from_jsonl("match.jsonl").pipe(peek).filter(...) ๐Ÿ”„ Pure Functions = Safer Pipelines =================================== Since ``.map()`` and ``.assign()`` modify records, avoid side effects or mutating shared input. Prefer using ``.from_records(copy.deepcopy(data))`` if you're passing mutable records from outside. ๐Ÿ’ก Performance Tips =================== - Prefer ``.from_jsonl()`` over ``.from_json()`` for large files - Minimize ``.sort_by()`` or ``.group_by()`` until late in pipeline - Use ``.filter()`` early to reduce data as soon as possible - Avoid flattening too early. Use ``.select()`` to access nested fields instead ๐Ÿง  Summary ========== +---------------------------+------------------------------------------+ | Principle | Recommendation | +===========================+==========================================+ | Inspection | Use ``.head(n)`` to preview | +---------------------------+------------------------------------------+ | Debugging | Use ``.pipe()`` for custom hooks | +---------------------------+------------------------------------------+ | Materialization Awareness | Use ``.explain()`` to understand plan | +---------------------------+------------------------------------------+ | Filtering Early | Always filter before heavy ops | +---------------------------+------------------------------------------+ Flow gives you a structured, schema-aware, and composable pipeline for working with JSON, especially valuable when you want to defer flattening and stay close to raw data.