Best Practices, Performance & Troubleshooting#

Flow is designed for clarity, composability, and structured JSON pipelines. But to use it effectively — especially on large or semi-structured data — you need to understand how Flow executes and when data is consumed.

🧠 Think in DAGs, Not DataFrames#

Flow builds a deferred plan of steps (like a DAG). Nothing runs until you collect results:

flow = Flow.from_folder("data/").filter(...).assign(...).select(...)

At this point, no data has been read.

🚨 When Execution Happens#

Flow starts processing only when you:

.collect()
.to_pandas()
.to_json(), .to_jsonl()
Iterate over the Flow
Call .first(), .keys(), len()

⚠️ When Materialization Happens#

Certain operations require the full dataset and will materialize in memory:

You can always call .explain() to see where materialization occurs:

flow.explain()

🧪 Inspect Safely#

You can preview without consuming the full plan:

Flow.from_jsonl("match.jsonl").head(3)

.head(n) adds a .limit() and returns the first n results via .collect(). It’s a safe way to preview data.

🔁 Fork Pipelines Naturally#

f = Flow.from_jsonl("match.jsonl")

attacks = f.filter(lambda r: r["team"] == "Arsenal")
defence = f.filter(lambda r: r["team"] == "Manchester City")

Because the pipeline is just a plan, each branch is safe and isolated.

🧰 Use `.pipe()` for Debugging or Custom Steps#

You can insert custom logic mid-pipeline with .pipe():

def peek(flow):
    print(flow.head(3))
    return flow

Flow.from_jsonl("match.jsonl").pipe(peek).filter(...)

🔄 Pure Functions = Safer Pipelines#

Since .map() and .assign() modify records, avoid side effects or mutating shared input.

Prefer using .from_records(copy.deepcopy(data)) if you’re passing mutable records from outside.

💡 Performance Tips#

Prefer .from_jsonl() over .from_json() for large files
Minimize .sort_by() or .group_by() until late in pipeline
Use .filter() early to reduce data as soon as possible
Avoid flattening too early. Use .select() to access nested fields instead

🧠 Summary#

Principle	Recommendation
Inspection	Use `.head(n)` to preview
Debugging	Use `.pipe()` for custom hooks
Materialization Awareness	Use `.explain()` to understand plan
Filtering Early	Always filter before heavy ops

Flow gives you a structured, schema-aware, and composable pipeline for working with JSON, especially valuable when you want to defer flattening and stay close to raw data.

Table of Contents