Best Practices, Performance & Troubleshooting
Flow is designed for clarity, composability, and structured JSON pipelines. But to use it effectively, especially on large or semi-structured data, you need to understand how Flow executes and when data is consumed.
Think in DAGs, Not DataFrames
Flow builds a deferred plan of steps (like a DAG). Nothing runs until you collect results:
flow = Flow.from_folder("data/").filter(...).assign(...).select(...)
At this point, no data has been read.
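To make the deferred-plan idea concrete, here is a minimal sketch in plain Python. `TinyPlan` and `make_source` are hypothetical names invented for this illustration; they are not part of Flow and make no claim about how Flow is implemented internally. The sketch only shows the general pattern: chaining records steps, and the source is read when results are requested.

```python
from typing import Callable, Iterable, Iterator


class TinyPlan:
    """Hypothetical stand-in for a deferred pipeline; not Flow's real class."""

    def __init__(self, source: Callable[[], Iterable[dict]], steps: tuple = ()):
        self._source = source  # called only when results are requested
        self._steps = steps    # recorded transformations, not yet applied

    def filter(self, predicate):
        # Returns a new plan with one more step; no records are touched here.
        return TinyPlan(self._source, self._steps + (("filter", predicate),))

    def assign(self, **fields):
        return TinyPlan(self._source, self._steps + (("assign", fields),))

    def __iter__(self) -> Iterator[dict]:
        for record in self._source():
            row = dict(record)  # work on a copy of each record
            for kind, arg in self._steps:
                if kind == "filter" and not arg(row):
                    break  # record rejected, skip the remaining steps
                if kind == "assign":
                    row.update({name: fn(row) for name, fn in arg.items()})
            else:
                yield row  # reached only if no filter rejected the record

    def collect(self):
        return list(self)


def make_source(read_log):
    """Build a source that logs every record it actually reads."""
    def load():
        for record in [{"team": "Arsenal", "xg": 0.4}, {"team": "Chelsea", "xg": 0.1}]:
            read_log.append(record["team"])
            yield record
    return load


read_log = []
plan = (
    TinyPlan(make_source(read_log))
    .filter(lambda r: r["xg"] > 0.2)
    .assign(high_quality=lambda r: r["xg"] > 0.3)
)
reads_before_collect = list(read_log)  # still empty: only a plan exists
result = plan.collect()                # the source is read here
```

In this toy version, `read_log` stays empty until `collect()` runs, which mirrors the behaviour described above.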
When Execution Happens
Flow starts processing only when you:
- Call .collect(), .to_pandas(), .to_json(), or .to_jsonl()
- Iterate over the Flow
- Call .first(), .keys(), or len()
When Materialization Happens
Certain operations require the full dataset and will materialize it in memory.
You can always call .explain() to see where materialization occurs:
flow.explain()
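Why some steps can stream while others must materialize can be illustrated with plain Python generators. This is a conceptual sketch, not Flow code: `source`, `streaming_filter`, and `materializing_sort` are hypothetical helpers, and the read logs exist only to count how many records each approach pulls before producing its first result.

```python
def source(read_log):
    """Yield records one at a time, logging each read."""
    for minute in [67, 12, 45, 3, 88]:
        read_log.append(minute)
        yield {"minute": minute}


def streaming_filter(records):
    # A filter can decide on each record in isolation, so it stays lazy.
    return (r for r in records if r["minute"] > 10)


def materializing_sort(records):
    # A sort cannot emit its first result until it has seen every record.
    return iter(sorted(records, key=lambda r: r["minute"]))


filter_log = []
first_filtered = next(streaming_filter(source(filter_log)))

sort_log = []
first_sorted = next(materializing_sort(source(sort_log)))

print(len(filter_log))  # 1: one record read to produce the first result
print(len(sort_log))    # 5: the whole source was read first
```

The same reasoning applies to any step that needs a global view of the data, which is why it helps to know where such steps sit in a plan.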
Inspect Safely
You can preview without consuming the full plan:
Flow.from_jsonl("match.jsonl").head(3)
.head(n) adds a .limit() and returns the first n results via .collect(). It's a safe way to preview data.
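The reason a limit makes previews cheap can be sketched with the standard library. The example below is an analogy rather than Flow's implementation: `read_jsonl` and `preview` are hypothetical helpers, and an in-memory `io.StringIO` stands in for a JSONL file.

```python
import io
import json
from itertools import islice


def read_jsonl(handle):
    """Lazily parse one JSON object per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def preview(records, n):
    """Take the first n records and stop pulling from the source."""
    return list(islice(records, n))


# 1000 JSON lines held in memory, standing in for a large file.
raw = io.StringIO("\n".join(json.dumps({"id": i}) for i in range(1000)))

sample = preview(read_jsonl(raw), 3)
unread_lines = sum(1 for _ in raw)  # lines the preview never touched

print(sample)  # [{'id': 0}, {'id': 1}, {'id': 2}]
```

Only the first three lines are parsed; the rest of the input is left unread.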
Fork Pipelines Naturally
f = Flow.from_jsonl("match.jsonl")
attacks = f.filter(lambda r: r["team"] == "Arsenal")
defence = f.filter(lambda r: r["team"] == "Manchester City")
Because the pipeline is just a plan, each branch is safe and isolated.
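One way to see why branches do not interfere with each other is to model a plan as an immutable tuple of steps. This is a simplified sketch under that assumption; `with_filter` and `run` are hypothetical helpers, not Flow functions.

```python
def with_filter(plan: tuple, predicate) -> tuple:
    """Return a new plan; the original tuple is never modified."""
    return plan + (predicate,)


def run(records, plan: tuple) -> list:
    """Apply every predicate in the plan to each record."""
    return [r for r in records if all(p(r) for p in plan)]


events = [
    {"team": "Arsenal", "type": "shot"},
    {"team": "Arsenal", "type": "pass"},
    {"team": "Manchester City", "type": "shot"},
]

base = with_filter((), lambda r: r["type"] == "shot")
arsenal = with_filter(base, lambda r: r["team"] == "Arsenal")
city = with_filter(base, lambda r: r["team"] == "Manchester City")

# Each branch extends the shared prefix without changing it.
print(len(base), len(arsenal), len(city))  # 1 2 2
print(run(events, arsenal))  # [{'team': 'Arsenal', 'type': 'shot'}]
```

Because adding a step always produces a new plan, `base` can be reused for as many branches as needed.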
Use .pipe() for Debugging or Custom Steps
You can insert custom logic mid-pipeline with .pipe():
def peek(flow):
    # Print a small preview, then hand the same flow back so the chain continues.
    print(flow.head(3))
    return flow
Flow.from_jsonl("match.jsonl").pipe(peek).filter(...)
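The pipe pattern itself is small: the pipeline object hands itself to a function and returns whatever that function returns. The sketch below shows the idea on a toy `Chain` class, which is hypothetical and unrelated to Flow's internals. Passing extra arguments through `pipe` is an assumption of this sketch and may not match the signature of Flow's own .pipe().

```python
class Chain:
    """Toy eager pipeline used only to illustrate the pipe pattern."""

    def __init__(self, items):
        self.items = list(items)

    def filter(self, predicate):
        return Chain(item for item in self.items if predicate(item))

    def pipe(self, fn, *args, **kwargs):
        # Call fn with this chain and return its result, so chaining continues.
        return fn(self, *args, **kwargs)


def log_size(chain, label, sink):
    """Debug step: record how many items reach this point, change nothing."""
    sink.append((label, len(chain.items)))
    return chain


sizes = []
result = (
    Chain(range(10))
    .pipe(log_size, "start", sizes)
    .filter(lambda n: n % 2 == 0)
    .pipe(log_size, "after filter", sizes)
)

print(sizes)  # [('start', 10), ('after filter', 5)]
```

A debug step written this way can be added or removed without restructuring the chain around it.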
Pure Functions = Safer Pipelines
Since .map() and .assign() modify records, avoid side effects or mutating shared input.
Prefer using .from_records(copy.deepcopy(data)) if you're passing mutable records from outside.
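The hazard is easiest to see with nested dictionaries in plain Python. The functions below are hypothetical examples of what might be passed to a mapping step; no Flow calls are involved.

```python
import copy

shared = [{"player": "Saka", "location": {"x": 60, "y": 40}}]


def normalise_in_place(record):
    # Impure: rewrites the nested dict that the caller also holds.
    record["location"]["x"] /= 120
    return record


def normalise_pure(record):
    # Pure: builds new dicts and leaves the input untouched.
    location = {**record["location"], "x": record["location"]["x"] / 120}
    return {**record, "location": location}


pure_out = [normalise_pure(r) for r in shared]
x_after_pure = shared[0]["location"]["x"]  # original value is preserved

# If an impure function cannot be avoided, run it on a deep copy instead.
isolated = copy.deepcopy(shared)
impure_out = [normalise_in_place(r) for r in isolated]
x_after_impure = shared[0]["location"]["x"]  # still preserved, thanks to the copy

print(pure_out[0]["location"]["x"], x_after_pure, x_after_impure)  # 0.5 60 60
```

A shallow copy would not be enough here, because the nested `location` dict would still be shared between the copy and the original.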
Performance Tips
- Prefer .from_jsonl() over .from_json() for large files
- Defer .sort_by() and .group_by() until late in the pipeline
- Use .filter() early to reduce data as soon as possible
- Avoid flattening too early. Use .select() to access nested fields instead
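The effect of filtering early can be measured with a small plain-Python experiment. `expensive_enrich` is a hypothetical stand-in for any costly per-record step, and the call logs simply count how often it runs.

```python
def expensive_enrich(record, call_log):
    """Stand-in for a costly step; logs every invocation."""
    call_log.append(record["id"])
    distance = (record["x"] ** 2 + record["y"] ** 2) ** 0.5
    return {**record, "distance": distance}


records = [
    {"id": i, "type": "shot" if i % 10 == 0 else "pass", "x": i, "y": 0}
    for i in range(100)
]

# Filter late: every record is enriched, then most are thrown away.
late_calls = []
enriched = [expensive_enrich(rec, late_calls) for rec in records]
late = [rec for rec in enriched if rec["type"] == "shot"]

# Filter early: only the records that survive the filter are enriched.
early_calls = []
early = [expensive_enrich(rec, early_calls) for rec in records if rec["type"] == "shot"]

print(len(late_calls), len(early_calls))  # 100 10
print(early == late)                      # True
```

Both orderings give the same result here, but the early filter does a tenth of the work.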
Summary
| Principle | Recommendation |
|---|---|
| Inspection | Use .head() |
| Debugging | Use .pipe() |
| Materialization Awareness | Use .explain() |
| Filtering Early | Always filter before heavy ops |
Flow gives you a structured, schema-aware, and composable pipeline for working with JSON, especially valuable when you want to defer flattening and stay close to raw data.