Expected Threat (xT)#
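This module implements a position-based Expected Threat (xT) model for football event data. The implementation follows the direct linear algebra formulation:
\[ \mathbf{xT} = \mathbf{s} \odot \mathbf{g} + M \, \mathbf{xT} \]
which is solved as:
\[ \mathbf{xT} = (I - M)^{-1} (\mathbf{s} \odot \mathbf{g}) \]
Here \(\mathbf{s}\) is the per-cell shot probability, \(\mathbf{g}\) is the per-cell goal probability, and \(M\) is the combined move-transition product described under per-family transitions.
This is the direct solver approach (no iterative convergence path).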
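As a minimal sketch of the direct solve, the toy example below uses a hypothetical three-cell pitch with made-up probabilities (not values from the library) and solves (I - M) xT = shot_prob * goal_prob with NumPy. All array names and numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical 3-cell pitch, ordered from own half (0) to attacking third (2).
shot_prob = np.array([0.00, 0.02, 0.20])  # P(action is a shot) per cell
goal_prob = np.array([0.00, 0.03, 0.12])  # P(goal | shot) per cell

# M[i, j]: probability that an action from cell i is a successful move ending in cell j.
# Each row sums to less than 1 - shot_prob[i]; the remainder is lost possession.
M = np.array([
    [0.50, 0.25, 0.05],
    [0.20, 0.40, 0.15],
    [0.05, 0.20, 0.30],
])

# Solve (I - M) xT = shot_prob * goal_prob in one step, without iterating.
rhs = shot_prob * goal_prob
xt = np.linalg.solve(np.eye(3) - M, rhs)
print(xt)
```

The solution satisfies the fixed-point relation xT = rhs + M @ xT, which is what an iterative scheme would converge to.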
Key characteristics#
Position-based xT on a normalized 0-100 pitch.
Grid discretization with a practical default of 16x12.
One unified xT surface from all included attacking event families.
Passes, carries, throw-ins, goal kicks, corners, and free kicks are treated as move actions.
Shots and direct free-kick shots are treated as shot actions.
Each move family maintains its own transition matrix, with sparse families shrunk toward the pooled baseline (see per-family transitions below).
Failed actions act as an implicit turnover discount — they consume probability without contributing transitions, so distant cells have lower xT (see turnover discount below).
Successful moves are scored by the delta xT(end) - xT(start).
Goal probability is estimated per cell with light beta-binomial smoothing.
Plotting integrates with penaltyblog.viz.Pitch.
Provider-specific coordinate ranges can be normalized into the internal 0-100 xT coordinate system via XTEventSchema.
Out-of-bounds coordinate handling is explicit and configurable via XTModel(coord_policy=...).
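The per-cell goal probability smoothing mentioned above can be sketched as a posterior mean under a Beta prior. The prior mean and prior strength used here are illustrative assumptions; the library's actual prior values are not stated in this section, and the function name is hypothetical.

```python
def smoothed_goal_probability(goals, shots, prior_mean=0.1, prior_strength=2.0):
    """Posterior-mean goal rate under a Beta(alpha, beta) prior.

    prior_mean and prior_strength are hypothetical values chosen for illustration.
    """
    alpha = prior_mean * prior_strength
    beta = (1.0 - prior_mean) * prior_strength
    return (goals + alpha) / (shots + alpha + beta)

# A cell with one goal from one shot is pulled strongly toward the prior,
# while a cell with many shots stays close to its raw conversion rate.
print(smoothed_goal_probability(1, 1))
print(smoothed_goal_probability(30, 200))
```

A cell with no shots at all simply returns the prior mean, which avoids division by zero and extreme values in sparse cells.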
Supported event families#
Move events:
pass: always included
carry: included by default (include_carries=True)
throw_in: included by default (include_throw_ins=True)
goal_kick: included by default (include_goal_kicks=True)
free_kick: included by default (include_free_kicks=True)
corner: included by default (include_corners=True)
Shot events:
shot: always included
free_kick_shot: included when include_free_kicks=True
Ignored events:
penalty, penalty_kick, own_goal, shot_against, shootout, postmatch_penalty
Any event not classifiable into a move or shot
Turnover discount#
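The denominator for action probabilities includes all move attempts (successful and failed), not just successful ones. This means:
\[ \text{shot\_prob} = \frac{n_{\text{shots}}}{n_{\text{shots}} + n_{\text{move attempts}}}, \qquad \text{move\_prob} = \frac{n_{\text{successful moves}}}{n_{\text{shots}} + n_{\text{move attempts}}} \]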
The gap \(1 - \text{shot\_prob} - \text{move\_prob}\) is the per-cell probability of losing possession without progressing the ball. This acts as a natural discount: cells far from goal need many successful transitions to reach a shooting position, and each step has a chance of failure that compounds multiplicatively through the linear solve.
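A small worked example may make the discount concrete. The counts below are made up, and the helper name is hypothetical; the only rule taken from the text is that failed move attempts stay in the denominator.

```python
def action_probabilities(n_shots, n_moves_ok, n_moves_failed):
    """Per-cell action probabilities with all attempts in the denominator."""
    total = n_shots + n_moves_ok + n_moves_failed
    if total == 0:
        # Empty cell: no evidence, so report zeros (an illustrative choice).
        return 0.0, 0.0, 0.0
    shot_prob = n_shots / total
    move_prob = n_moves_ok / total
    turnover = 1.0 - shot_prob - move_prob
    return shot_prob, move_prob, turnover

# Hypothetical cell: 5 shots, 80 completed moves, 15 failed moves.
shot_prob, move_prob, turnover = action_probabilities(5, 80, 15)
print(shot_prob, move_prob, turnover)

# The chance of completing several moves in a row shrinks geometrically.
print([round(move_prob ** steps, 3) for steps in range(1, 5)])
```

With these counts, 15% of actions from the cell end the possession, and a chain of four moves at the same success rate survives well under half the time.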
Per-family transitions#
Rather than pooling all move events into a single transition matrix, the model builds a separate transition matrix per move family. This means a throw-in from a cell near the touchline has a different destination distribution from an open-play pass originating in the same cell.
To handle sparse families (e.g. free kicks, which may have very few observations from any single cell), each family’s transition row is shrunk toward the pooled transition, with a shrinkage strength k that controls how much data a family needs before its own pattern dominates.
With the default k = 5, a family needs roughly 5+ events from a cell
before its pattern meaningfully diverges from the pooled baseline.
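One common way to implement this kind of shrinkage is to weight the family's empirical row by n / (n + k), where n is the number of that family's events from the cell. The sketch below assumes that form; the library's exact weighting is not shown in this section, and the function name is hypothetical.

```python
import numpy as np

def shrink_row(family_counts, pooled_row, k=5.0):
    """Blend a family's empirical transition row with the pooled row.

    Assumed form: weight = n / (n + k), with n the family's event count from the cell.
    """
    family_counts = np.asarray(family_counts, dtype=float)
    pooled_row = np.asarray(pooled_row, dtype=float)
    n = family_counts.sum()
    if n == 0:
        # No family data from this cell: fall back to the pooled row.
        return pooled_row.copy()
    weight = n / (n + k)
    return weight * (family_counts / n) + (1.0 - weight) * pooled_row

pooled = np.array([0.5, 0.3, 0.2])
sparse = shrink_row([0, 0, 1], pooled)   # 1 event: stays close to pooled
dense = shrink_row([0, 0, 50], pooled)   # 50 events: close to the family's own pattern
print(sparse)
print(dense)
```

Because both inputs are probability rows, the blended row still sums to 1 for any k.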
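The combined move-transition product used in the solve is:
\[ M(\text{cell}, \text{dest}) = \sum_{f} p_f(\text{cell}) \, T_f(\text{cell}, \text{dest}) \]
where \(p_f(\text{cell})\) is the fraction of all actions (shots + all move attempts) from that cell that are successful moves of family f, and \(T_f\) is that family's shrunk transition matrix.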
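The combination step can be sketched in a few lines of NumPy. The two-cell matrices and probabilities below are made up for illustration, and the variable names are assumptions rather than library attributes.

```python
import numpy as np

# Hypothetical per-family transition matrices for a 2-cell pitch (rows sum to 1).
T = {
    "pass": np.array([[0.6, 0.4], [0.3, 0.7]]),
    "throw_in": np.array([[0.9, 0.1], [0.5, 0.5]]),
}
# p[f][cell]: share of all actions from the cell that are successful moves of family f.
p = {
    "pass": np.array([0.70, 0.60]),
    "throw_in": np.array([0.10, 0.05]),
}

# Scale each family's rows by its per-cell probability, then sum over families.
M = sum(p[f][:, None] * T[f] for f in T)
move_prob = M.sum(axis=1)
print(M)
print(move_prob)
```

Since each family's rows sum to 1, the row sums of the combined matrix recover the overall per-cell probability of a successful move.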
Provider-agnostic schema#
The XTEventSchema class defines column/range/label
mapping for a DataFrame (or penaltyblog.matchflow.Flow). A single
is_success column has consistent meaning
across event types: for moves it means the action was completed
successfully; for shots it means a goal was scored.
This success signal is required when fitting or scoring xT.
XTEventSchema and XTModel are strict about success labels
by default.
The recommended input is a boolean is_success column. Numeric 0/1
is also accepted. Provider-specific strings such as "Complete",
"Incomplete", "Goal", or "Saved" must be mapped explicitly with
success_value_map.
Why strict validation is the default#
The canonical schema expects is_success to be boolean, and that remains
the recommended format. Many real event feeds encode success as strings
or provider-specific labels, but guessing those labels is error-prone and
can silently corrupt xT values.
xT therefore rejects non-boolean string labels by default and tells you
to provide an explicit success_value_map. This is safer for analysts,
new Python users, and coding agents because incorrect labels fail fast
instead of producing plausible but wrong results.
For maximum control, pass
XTEventSchema(success_value_map=...) to
XTModel.fit(...) / XTModel.score(...).
If xT encounters unsupported values, it raises a ValueError that shows
the offending labels and points you to success_value_map.
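The fail-fast behaviour described above can be illustrated with a small stand-alone helper. This is a hypothetical sketch of strict label validation, not the library's code; only the idea of an explicit success_value_map and a ValueError listing the offending labels comes from the text.

```python
def map_success_labels(values, success_value_map=None):
    """Convert raw success labels to booleans, failing fast on unknown labels."""
    mapping = dict(success_value_map or {})
    mapped, unknown = [], []
    for value in values:
        if isinstance(value, bool):
            mapped.append(value)
        elif value in mapping:
            mapped.append(bool(mapping[value]))
        elif isinstance(value, (int, float)) and value in (0, 1):
            mapped.append(bool(value))
        else:
            unknown.append(value)
    if unknown:
        labels = sorted({str(v) for v in unknown})
        raise ValueError(
            f"Invalid values found in is_success: {labels}. "
            "Provide an explicit success_value_map."
        )
    return mapped

raw = ["Complete", "Incomplete", "Goal", True, 0]
label_map = {"Complete": True, "Incomplete": False, "Goal": True}
print(map_success_labels(raw, label_map))
```

Calling the helper with an unmapped label such as "Saved" raises instead of guessing, which is the property the strict default is meant to guarantee.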
Coordinate ranges and normalization#
Internally, xT uses a normalized 0-100 pitch for both axes.
If your provider uses different ranges (for example x=0..120,
y=0..80), declare them in XTEventSchema:
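from penaltyblog.xt import XTEventSchema

# Declare provider ranges so coordinates are rescaled to 0-100 internally.
schema = XTEventSchema(
    x="x",
    y="y",
    event_type="event_type",
    end_x="end_x",
    end_y="end_y",
    is_success="is_success",
    x_range=(0, 120),
    y_range=(0, 80),
)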
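The rescaling implied by x_range and y_range can be sketched as below, assuming a plain linear map with no axis flip. The function name is hypothetical and the library may handle edge cases differently.

```python
def normalize(value, value_range, target=(0.0, 100.0)):
    """Linearly rescale a provider coordinate into the 0-100 system."""
    lo, hi = value_range
    t_lo, t_hi = target
    return t_lo + (value - lo) / (hi - lo) * (t_hi - t_lo)

print(normalize(60, (0, 120)))  # halfway line on a 0..120 x-axis
print(normalize(80, (0, 80)))   # far touchline on a 0..80 y-axis
```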
Coordinate validation policy#
After normalization, XTModel validates coordinates before clipping.
Use coord_policy to control behavior when values fall outside 0..100:
"warn" (default): emit a warning and clip.
"error": raise ValueError.
"clip": silently clip.
This applies in both fit and score.
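The three policies can be illustrated with a short stand-alone function. It is a hypothetical sketch of the validate-then-clip order described above, not the library's implementation.

```python
import warnings

def apply_coord_policy(values, policy="warn", lo=0.0, hi=100.0):
    """Validate normalized coordinates, then clip them according to the policy."""
    if policy not in {"warn", "error", "clip"}:
        raise ValueError(f"unknown policy: {policy!r}")
    n_bad = sum(1 for v in values if v < lo or v > hi)
    if n_bad and policy == "error":
        raise ValueError(f"{n_bad} coordinates outside {lo}..{hi}")
    if n_bad and policy == "warn":
        warnings.warn(f"{n_bad} coordinates outside {lo}..{hi}; clipping")
    # "warn" and "clip" both end by clipping into the valid range.
    return [min(max(v, lo), hi) for v in values]

print(apply_coord_policy([-5, 50, 104.2], policy="clip"))
```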
Usage#
Fit on a raw DataFrame or MatchFlow Flow (quick path)#
You can pass a DataFrame or penaltyblog.matchflow.Flow directly to
fit/score.
If your columns already use canonical names (x, y, event_type,
end_x, end_y, is_success), no extra arguments are needed:
from penaltyblog.xt import XTModel
xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
xt.fit(df)
scored = xt.score(df)
This quick path assumes is_success is already boolean (or numeric 0/1).
If your provider uses strings such as "Complete" or "Goal", pass
XTEventSchema(success_value_map=...) as shown below.
With MatchFlow:
from penaltyblog.matchflow import Flow
from penaltyblog.xt import XTModel
flow = Flow.from_records(records)
xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
xt.fit(flow)
scored = xt.score(flow)
For non-canonical column names, ranges, or label mapping, define an
XTEventSchema:
from penaltyblog.xt import XTEventSchema
schema = XTEventSchema(
    x="location_x",
    y="location_y",
    event_type="type_primary",
    end_x="end_location_x",
    end_y="end_location_y",
    is_success="is_successful",
    x_range=(0, 120),
    y_range=(0, 80),
    event_type_map={"Pass": "pass", "Shot": "shot"},
    success_value_map={"Complete": True, "Incomplete": False, "Goal": True},
)
xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
xt.fit(df, schema=schema)
scored = xt.score(df) # reuses fitted schema by default
For many users this is the best default workflow:
Start with a raw provider DataFrame.
Pass explicit column mappings.
Pass event_type_map for event labels.
Pass success_value_map for success/outcome labels.
That keeps the transformation visible and reproducible.
If you call score with default options, it reuses the schema
from fit so you do not need to repeat column mappings.
Load a pretrained surface#
from penaltyblog.xt import load_pretrained_xt
model = load_pretrained_xt(name="default")
The bundled "default" artifact is fit on ~14 million events
across multiple seasons, including passes, throw-ins, free kicks, corners,
and shots taken from the big five European leagues.
Scoring#
Scoring adds three columns to the input DataFrame:
xt_start: xT value at the start location (set for all moves with valid start coordinates, including failed moves)
xt_end: xT value at the end location (set only for successful moves)
xt_added: xt_end - xt_start (set only for successful moves)
Note
xt_start is populated for all moves with valid start coordinates,
regardless of whether the move succeeded. This lets you measure the
possession value that was risked on a failed pass. xt_end and
xt_added are only set for successful moves that also have valid
end coordinates.
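The scoring rule can be sketched on a toy surface. The grid orientation (rows along y, columns along x), the use of None for unset values, and both function names are assumptions made for illustration only.

```python
import numpy as np

def cell_value(surface, x, y):
    """Look up the surface value for a normalized 0-100 coordinate."""
    n_rows, n_cols = surface.shape
    col = min(int(x / 100.0 * n_cols), n_cols - 1)
    row = min(int(y / 100.0 * n_rows), n_rows - 1)
    return float(surface[row, col])

def score_move(surface, event):
    """Return (xt_start, xt_end, xt_added) for one move event dict."""
    xt_start = cell_value(surface, event["x"], event["y"])
    if not event["is_success"]:
        # Failed move: only the value that was risked is known.
        return xt_start, None, None
    xt_end = cell_value(surface, event["end_x"], event["end_y"])
    return xt_start, xt_end, xt_end - xt_start

# Hypothetical 2x4 surface whose value rises toward the attacking end.
surface = np.array([[0.01, 0.02, 0.05, 0.20],
                    [0.01, 0.02, 0.05, 0.20]])
completed = {"x": 30, "y": 20, "end_x": 80, "end_y": 60, "is_success": True}
failed = {"x": 30, "y": 20, "end_x": 80, "end_y": 60, "is_success": False}
print(score_move(surface, completed))
print(score_move(surface, failed))
```

The failed pass still reports the 0.02 it started from, which is the "value risked" reading of xt_start described in the note.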
By default, score() reuses the schema saved during fit:
scored = model.score(data)
scored[["xt_start", "xt_end", "xt_added"]]
To score with a different schema (for example, data from a different
provider), pass a new XTEventSchema and set use_fit_schema=False:
scored = model.score(other_df, schema=other_schema, use_fit_schema=False)
Querying xT values#
After fitting (or loading a pretrained model), you can query the xT value
at any normalized 0-100 coordinate with XTModel.value_at():
model.value_at(85, 50) # xT near the penalty spot
model.value_at(10, 50) # xT in the defensive half
For bulk lookups, use the vectorised XTModel.values_at():
import numpy as np
xs = np.linspace(0, 100, 16)
ys = np.linspace(0, 100, 12)
xx, yy = np.meshgrid(xs, ys)
vals = model.values_at(xx.ravel(), yy.ravel())
heatmap = vals.reshape(12, 16)
Both methods clip coordinates to the valid 0..100 range according to
the model’s coord_policy.
Saving and loading models#
Fitted models can be persisted to .npz files and reloaded later:
model.save("my_xt_model.npz")
loaded = XTModel.load("my_xt_model.npz")
Saved files are portable and contain all arrays needed to score new events or query values. You do not need access to the original training data.
Plotting#
Plotting uses penaltyblog.viz.Pitch under the hood:
model.plot()
You can pass an existing Pitch instance or
override the default colours and opacity:
model.plot(colorscale="Viridis", opacity=0.7, show_colorbar=True)
API Reference#
XTModel#
XTEventSchema#
XTData#
load_pretrained_xt#
Fitted attributes#
After calling XTModel.fit(), the following attributes are available:
surface_: numpy.ndarray of shape (n_rows, n_cols) containing the fitted xT values.
shot_probability_: per-cell probability that an action is a shot.
goal_probability_: per-cell probability that a shot results in a goal (beta-binomial smoothed).
move_probability_: per-cell probability that an action is a successful move.
transition_matrix_: effective combined transition matrix of shape (n_cells, n_cells).
metadata_: dict with model hyperparameters, included families, and fit schema.
included_move_families_: list of move event types that were present in the training data.
included_shot_families_: list of shot event types that were present in the training data.
fitted_: True once the model has been fit or loaded.
Troubleshooting#
Missing is_success#
Error: Missing success information. Provide an is_success column ...
The is_success column is mandatory for both fit and score.
For moves it records whether the action was completed; for shots it records
whether a goal was scored. If your provider uses string labels, map them
with XTEventSchema(success_value_map={...}).
Missing move end coordinates#
Error: xT fit requires move end coordinates when move events are present.
When the data contains move events (passes, carries, etc.), fit needs
end_x and end_y columns so it can learn transition patterns. Use
XTEventSchema(end_x=..., end_y=...) to map your provider’s destination
coordinate columns.
No shot events in training data#
Warning: No shot events found in the training data. The fitted xT surface will be all zeros.
A model trained without shots cannot learn goal probability and will return
zero everywhere. Check that your event_type_map correctly maps provider
shot labels to "shot" (and "free_kick_shot" if applicable).
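The all-zeros result follows from the linear solve itself: with no shots, the right-hand side is zero in every cell, so the only solution is the zero vector. A toy check with a hypothetical two-cell move matrix:

```python
import numpy as np

# Hypothetical move-transition product for a 2-cell pitch.
M = np.array([[0.5, 0.3],
              [0.2, 0.6]])

# No shots anywhere: shot_prob * goal_prob is zero in every cell.
rhs = np.zeros(2)

xt = np.linalg.solve(np.eye(2) - M, rhs)
print(xt)
```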
Invalid success labels#
Error: Invalid values found in is_success: ...
xT rejects non-boolean, non-numeric labels by default. If your feed uses
strings such as "Complete" or "Goal", provide an explicit mapping:
schema = XTEventSchema(success_value_map={"Complete": True, "Incomplete": False, "Goal": True})
model.fit(df, schema=schema)
Ill-conditioned matrix#
Warning: xT transition matrix is ill-conditioned ...
This can happen with very sparse data or a very fine grid relative to the dataset size. The surface may contain extreme or unstable values. Remedies:
Increase the amount of training data.
Use a coarser grid (e.g. n_cols=12, n_rows=8 instead of 16x12).
Check that your data contains a representative mix of shots and moves.
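To get a feel for what ill-conditioning means here, the sketch below computes the condition number of (I - M) for two made-up two-cell matrices. The helper name is hypothetical, and the threshold at which the library emits its warning is not stated in this section.

```python
import numpy as np

def solve_condition_number(M):
    """Condition number of (I - M), the matrix inverted by the direct solve."""
    return float(np.linalg.cond(np.eye(M.shape[0]) - M))

# Some possession is lost from every cell: the system is well behaved.
healthy = np.array([[0.5, 0.3],
                    [0.2, 0.6]])

# Almost no possession is ever lost: (I - M) is close to singular.
near_singular = np.array([[0.0, 0.999999],
                          [0.999999, 0.0]])

print(solve_condition_number(healthy))
print(solve_condition_number(near_singular))
```

Large condition numbers mean small changes in the estimated probabilities can produce large swings in the solved surface, which is why more data or a coarser grid helps.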
Out-of-bounds coordinates#
Warning/Error: xT fit/score received coordinates outside expected 0..100
Your provider probably uses a different coordinate scale (e.g. StatsBomb
0-120 x 0-80). Declare the correct ranges in XTEventSchema:
schema = XTEventSchema(x_range=(0, 120), y_range=(0, 80))
Notes#
Each move family has its own transition matrix, shrunk toward the pooled baseline for sparse cells.
Failed actions act as a turnover discount: they reduce move_prob without contributing transitions.
Goal probability is estimated per cell with light smoothing.
Each event type maps unambiguously to a role, e.g. corner is always a move, free_kick_shot is always a shot.