Expected Threat (xT)#

This module implements a position-based Expected Threat (xT) model for football event data. The implementation follows the direct linear algebra formulation:

\[X = S + MTX\]

which is solved as:

\[(I - MT) X = S\]

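Here X is the vector of xT values (one entry per grid cell), S collects the immediate shot payoff of each cell, and MT is the combined move-transition product defined under per-family transitions below. The system is solved directly in a single linear solve; there is no iterative value-update loop.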
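A minimal numerical sketch of the direct solve on a made-up three-cell pitch. The arrays `S` and `MT` are illustrative values chosen for this example, not output from penaltyblog, and reading `S` as a per-cell shot payoff is an assumption of the sketch.

```python
import numpy as np

# Toy 3-cell example (illustrative numbers only, not from penaltyblog).
# S[i]: assumed immediate shot payoff from cell i.
S = np.array([0.00, 0.02, 0.10])

# MT[i, j]: probability that an action from cell i is a successful move
# ending in cell j. Rows sum to less than 1; the remainder covers shots
# and lost possession.
MT = np.array([
    [0.30, 0.40, 0.05],
    [0.10, 0.30, 0.35],
    [0.05, 0.15, 0.30],
])

# Solve (I - MT) X = S in one step instead of iterating X <- S + MT @ X.
X = np.linalg.solve(np.eye(3) - MT, S)

# The solution is a fixed point of the original recursion X = S + MT X.
assert np.allclose(X, S + MT @ X)
print(X)
```

Because the solve is exact up to floating-point error, there is no convergence tolerance or iteration count to tune.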

Key characteristics#

Position-based xT on a normalized 0-100 pitch.
Grid discretization with a practical default of 16x12.
One unified xT surface from all included attacking event families.
Passes, carries, throw-ins, goal kicks, corners, and free kicks are treated as move actions.
Shots and direct free-kick shots are treated as shot actions.
Each move family maintains its own transition matrix, with sparse families shrunk toward the pooled baseline (see per-family transitions below).
Failed actions act as an implicit turnover discount — they consume probability without contributing transitions, so distant cells have lower xT (see turnover discount below).
Successful moves are scored by the delta xT(end) - xT(start).
Goal probability is estimated per cell with light beta-binomial smoothing.
Plotting integrates with penaltyblog.viz.Pitch.
Provider-specific coordinate ranges can be normalized into the internal 0-100 xT coordinate system via XTEventSchema.
Out-of-bounds coordinate handling is explicit and configurable via XTModel(coord_policy=...).
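The beta-binomial smoothing of goal probability mentioned in the list above can be sketched as a posterior mean under a Beta prior. The function name and the `prior_rate` and `prior_strength` parameters are hypothetical; the prior the library actually uses is not stated here.

```python
import numpy as np

def smoothed_goal_probability(goals, shots, prior_rate, prior_strength=2.0):
    """Posterior-mean goal rate per cell under a Beta prior.

    prior_rate and prior_strength are hypothetical parameters for this
    sketch, not documented penaltyblog settings.
    """
    goals = np.asarray(goals, dtype=float)
    shots = np.asarray(shots, dtype=float)
    alpha = prior_rate * prior_strength          # pseudo-goals
    beta = (1.0 - prior_rate) * prior_strength   # pseudo-misses
    return (goals + alpha) / (shots + alpha + beta)

# Three cells: no shots, a single lucky goal, and a well-sampled cell.
goals = np.array([0, 1, 30])
shots = np.array([0, 1, 100])
p = smoothed_goal_probability(goals, shots, prior_rate=0.1)
print(p)
```

An empty cell falls back to the prior rate, a cell with one shot is pulled strongly toward it, and a well-sampled cell stays close to its raw rate.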

Supported event families#

Move events:

pass — always included
carry — included by default (include_carries=True)
throw_in — included by default (include_throw_ins=True)
goal_kick — included by default (include_goal_kicks=True)
free_kick — included by default (include_free_kicks=True)
corner — included by default (include_corners=True)

Shot events:

shot — always included
free_kick_shot — included when include_free_kicks=True

Ignored events:

penalty, penalty_kick, own_goal, shot_against, shootout, postmatch_penalty
Any event not classifiable into a move or shot

Turnover discount#

The denominator for action probabilities includes all move attempts (successful and failed), not just successful ones. This means:

\[\text{move\_prob}(i) = \frac{\text{successful\_moves}(i)}{\text{shots}(i) + \text{all\_moves}(i)}\]

The gap \(1 - \text{shot\_prob} - \text{move\_prob}\) is the per-cell probability of losing possession without progressing the ball. This acts as a natural discount: cells far from goal need many successful transitions to reach a shooting position, and each step has a chance of failure that compounds multiplicatively through the linear solve.
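As a worked example of the formula above, with hypothetical counts for a single cell:

```python
# Hypothetical counts for one grid cell (not real data).
shots = 5
successful_moves = 70
failed_moves = 25

# The denominator includes failed moves as well as successful ones.
all_actions = shots + successful_moves + failed_moves

shot_prob = shots / all_actions
move_prob = successful_moves / all_actions
turnover_prob = 1.0 - shot_prob - move_prob

# Chance of stringing together three successful moves from cells like this one.
three_step_survival = move_prob ** 3

print(shot_prob, move_prob, turnover_prob, three_step_survival)
```

A quarter of actions from this cell end in a turnover, and only about a third of three-move sequences survive, which is why cells far from goal end up with low xT.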

Per-family transitions#

Rather than pooling all move events into a single transition matrix, the model builds a separate transition matrix per move family. This means a throw-in from a cell near the touchline has a different destination distribution from an open-play pass originating in the same cell.

To handle sparse families (e.g. free kicks, which may have very few observations from any single cell), each family’s transition row is shrunk toward the pooled transition:

\[T_f^{smooth}(i) = \frac{counts_f(i) + k \cdot T_{pooled}(i)}{n_f(i) + k}\]

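With the default k = 5, the pooled row carries weight k / (n_f(i) + k). A cell with 5 events from a family is therefore an even blend of that family's own pattern and the pooled baseline, and the family's pattern dominates only once it has clearly more than 5 events from the cell.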
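A small sketch of the shrinkage formula for a single cell. The helper name `shrink_row` is hypothetical, and the sketch assumes n_f(i) is the sum of the family's destination counts from the cell.

```python
import numpy as np

def shrink_row(family_counts, pooled_row, k=5.0):
    """Shrink one cell's family transition row toward the pooled row.

    Assumes n_f(i) equals the sum of family_counts (a sketch assumption).
    """
    family_counts = np.asarray(family_counts, dtype=float)
    pooled_row = np.asarray(pooled_row, dtype=float)
    n = family_counts.sum()
    return (family_counts + k * pooled_row) / (n + k)

pooled = np.array([0.5, 0.3, 0.2])       # pooled destination distribution
sparse = shrink_row([0, 0, 1], pooled)   # family seen once from this cell
dense = shrink_row([0, 0, 100], pooled)  # family seen 100 times

print(sparse)  # still close to the pooled row
print(dense)   # dominated by the family's own counts
```

Both rows remain valid probability distributions, since the pooled row sums to 1 and the denominator adds exactly k.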

The combined move-transition product used in the solve is:

\[MT = \sum_f \operatorname{diag}(p_f) \cdot T_f^{smooth}\]

where \(p_f(i)\) is the fraction of all actions (shots + all move attempts) from cell i that are successful moves of family f.
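The combination step can be sketched with two made-up families on a three-cell grid. All matrices and probabilities below are illustrative values, not library output.

```python
import numpy as np

# Smoothed transition matrices for two hypothetical move families.
T = {
    "pass": np.array([[0.2, 0.5, 0.3], [0.1, 0.4, 0.5], [0.1, 0.2, 0.7]]),
    "throw_in": np.array([[0.6, 0.3, 0.1], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]),
}

# p[f][i]: share of all actions from cell i that are successful moves of family f.
p = {
    "pass": np.array([0.60, 0.55, 0.40]),
    "throw_in": np.array([0.10, 0.05, 0.02]),
}

# Scaling row i of T_f by p_f(i) is equivalent to diag(p_f) @ T_f.
MT = sum(p[f][:, None] * T[f] for f in T)

# Each row of MT sums to the cell's overall successful-move probability.
move_prob = MT.sum(axis=1)
print(move_prob)
```

Since every family row sums to 1, the row sums of MT recover the total move probability per cell, and the shortfall from 1 is shots plus turnovers.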

Provider-agnostic schema#

The XTEventSchema class defines column/range/label mapping for a DataFrame (or penaltyblog.matchflow.Flow). A single is_success column has consistent meaning across event types: for moves it means the action was completed successfully; for shots it means a goal was scored.

This success signal is required when fitting or scoring xT.

XTEventSchema and XTModel are strict about success labels by default. The recommended input is a boolean is_success column. Numeric 0/1 is also accepted. Provider-specific strings such as "Complete", "Incomplete", "Goal", or "Saved" must be mapped explicitly with success_value_map.

Why strict validation is the default#

The canonical schema expects is_success to be boolean, and that remains the recommended format. Many real event feeds encode success as strings or provider-specific labels, but guessing those labels is error-prone and can silently corrupt xT values.

xT therefore rejects non-boolean string labels by default and tells you to provide an explicit success_value_map. This is safer for analysts, new Python users, and coding agents because incorrect labels fail fast instead of producing plausible but wrong results.

For maximum control, pass XTEventSchema(success_value_map=...) to XTModel.fit(...) / XTModel.score(...).

If xT encounters unsupported values, it raises a ValueError that shows the offending labels and points you to success_value_map.
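The fail-fast behaviour described above can be illustrated with a standalone sketch. The function `map_success_labels` is hypothetical and is not part of penaltyblog; it only mirrors the rule that booleans and 0/1 pass through, mapped labels are translated, and anything else raises.

```python
def map_success_labels(values, success_value_map=None):
    """Map raw success labels to booleans, failing on anything unmapped.

    Hypothetical helper for illustration; not the library's implementation.
    """
    mapping = dict(success_value_map or {})
    out, unknown = [], set()
    for v in values:
        if isinstance(v, bool):
            out.append(v)
        elif isinstance(v, (int, float)) and v in (0, 1):
            out.append(bool(v))
        elif v in mapping:
            out.append(mapping[v])
        else:
            unknown.add(v)
    if unknown:
        # Report every offending label at once so the fix is a single edit.
        raise ValueError(
            f"Invalid values found in is_success: {sorted(unknown, key=str)}. "
            "Provide an explicit success_value_map."
        )
    return out

labels = map_success_labels(
    [True, 0, "Complete", "Goal"],
    {"Complete": True, "Incomplete": False, "Goal": True},
)
print(labels)
```

Calling the same function with an unmapped label such as "Saved" raises a ValueError instead of guessing, which is the property the strict default is meant to protect.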

Coordinate ranges and normalization#

Internally, xT uses a normalized 0-100 pitch for both axes. If your provider uses different ranges (for example x=0..120, y=0..80), declare them in XTEventSchema:

schema = XTEventSchema(
    x="x",
    y="y",
    event_type="event_type",
    end_x="end_x",
    end_y="end_y",
    is_success="is_success",
    x_range=(0, 120),
    y_range=(0, 80),
)
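For intuition, the rescaling implied by x_range and y_range can be sketched as a linear map onto 0-100. This assumes a plain linear rescale; the helper `normalize` is hypothetical and not a penaltyblog function.

```python
def normalize(value, lo, hi):
    """Linearly rescale value from [lo, hi] onto the internal 0-100 range."""
    return (value - lo) / (hi - lo) * 100.0

# A point on a 0-120 by 0-80 pitch.
x = normalize(60.0, 0, 120)  # halfway along the x-axis
y = normalize(20.0, 0, 80)   # a quarter of the way along the y-axis
print(x, y)
```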

Coordinate validation policy#

After normalization, XTModel validates coordinates before clipping. Use coord_policy to control behavior when values fall outside 0..100:

"warn" (default): emit a warning and clip.
"error": raise ValueError.
"clip": silently clip.

This applies in both fit and score.
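The three policies can be sketched as follows. The function `apply_coord_policy` and its messages are hypothetical stand-ins written for this example; only the policy names and their described behaviour come from the text above.

```python
import warnings

import numpy as np

def apply_coord_policy(coords, policy="warn"):
    """Validate normalized coordinates, then clip them into 0..100.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if policy not in ("warn", "error", "clip"):
        raise ValueError(f"unknown policy: {policy!r}")
    coords = np.asarray(coords, dtype=float)
    out_of_bounds = (coords < 0.0) | (coords > 100.0)
    if out_of_bounds.any():
        n_bad = int(out_of_bounds.sum())
        if policy == "error":
            raise ValueError(f"{n_bad} coordinate(s) outside expected 0..100")
        if policy == "warn":
            warnings.warn(f"{n_bad} coordinate(s) outside expected 0..100; clipping")
        # policy == "clip" falls through silently.
    return np.clip(coords, 0.0, 100.0)

clipped = apply_coord_policy([-5.0, 50.0, 120.0], policy="clip")
print(clipped)
```

Validation runs before clipping, so "error" sees the raw out-of-range values rather than already-clipped ones.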

Usage#

Fit on a raw DataFrame or MatchFlow Flow (quick path)#

You can pass a DataFrame or penaltyblog.matchflow.Flow directly to fit/score. If your columns already use canonical names (x, y, event_type, end_x, end_y, is_success), no extra arguments are needed:

from penaltyblog.xt import XTModel

xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
xt.fit(df)
scored = xt.score(df)

This quick path assumes is_success is already boolean (or numeric 0/1). If your provider uses strings such as "Complete" or "Goal", pass XTEventSchema(success_value_map=...) as shown below.

With MatchFlow:

from penaltyblog.matchflow import Flow
from penaltyblog.xt import XTModel

flow = Flow.from_records(records)
xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
xt.fit(flow)
scored = xt.score(flow)

For non-canonical column names, ranges, or label mapping, define an XTEventSchema:

from penaltyblog.xt import XTEventSchema, XTModel

schema = XTEventSchema(
    x="location_x",
    y="location_y",
    event_type="type_primary",
    end_x="end_location_x",
    end_y="end_location_y",
    is_success="is_successful",
    x_range=(0, 120),
    y_range=(0, 80),
    event_type_map={"Pass": "pass", "Shot": "shot"},
    success_value_map={"Complete": True, "Incomplete": False, "Goal": True},
)
xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
xt.fit(df, schema=schema)
scored = xt.score(df)  # reuses fitted schema by default

For many users this is the best default workflow:

Start with a raw provider DataFrame.
Pass explicit column mappings.
Pass event_type_map for event labels.
Pass success_value_map for success/outcome labels.

That keeps the transformation visible and reproducible.

If you call score with default options, it reuses the schema from fit so you do not need to repeat column mappings.

Load a pretrained surface#

from penaltyblog.xt import load_pretrained_xt

model = load_pretrained_xt(name="default")

The bundled "default" artifact is fit on ~14 million events from the big five European leagues across multiple seasons, including passes, throw-ins, free kicks, corners, and shots.

Scoring#

Scoring adds three columns to the input DataFrame:

xt_start — xT value at the start location (set for all moves with valid start coordinates, including failed moves)
xt_end — xT value at the end location (set only for successful moves)
xt_added — xt_end - xt_start (set only for successful moves)

Note

xt_start is populated even when the move failed, as long as the start coordinates are valid. This lets you measure the possession value that was risked on a failed pass. xt_end and xt_added additionally require the move to be successful and to have valid end coordinates.
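A rough sketch of how the three columns relate for a single move, using a toy 2x2 surface. The helpers `cell_value` and `score_move` are hypothetical, the row/column orientation is assumed, and representing unset values as NaN is also an assumption of this sketch.

```python
import numpy as np

def cell_value(surface, x, y):
    """Look up the surface value for a normalized 0-100 coordinate.

    Assumes rows index y and columns index x (a sketch assumption).
    """
    n_rows, n_cols = surface.shape
    col = min(int(x / 100.0 * n_cols), n_cols - 1)
    row = min(int(y / 100.0 * n_rows), n_rows - 1)
    return surface[row, col]

def score_move(surface, x, y, end_x, end_y, is_success):
    """Return (xt_start, xt_end, xt_added) for one move."""
    xt_start = cell_value(surface, x, y)
    if not is_success:
        # Failed moves keep xt_start but get no end value or delta.
        return xt_start, np.nan, np.nan
    xt_end = cell_value(surface, end_x, end_y)
    return xt_start, xt_end, xt_end - xt_start

surface = np.array([[0.01, 0.05], [0.02, 0.10]])  # toy 2x2 surface
ok = score_move(surface, 10, 10, 90, 90, True)
failed = score_move(surface, 10, 10, 90, 90, False)
print(ok, failed)
```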

By default, score() reuses the schema saved during fit:

scored = model.score(data)
scored[["xt_start", "xt_end", "xt_added"]]

To score with a different schema (for example, data from a different provider), pass a new XTEventSchema and set use_fit_schema=False:

scored = model.score(other_df, schema=other_schema, use_fit_schema=False)

Querying xT values#

After fitting (or loading a pretrained model), you can query the xT value at any normalized 0-100 coordinate with XTModel.value_at():

model.value_at(85, 50)   # xT near the penalty spot
model.value_at(10, 50)   # xT in the defensive half

For bulk lookups, use the vectorised XTModel.values_at():

import numpy as np

xs = np.linspace(0, 100, 16)
ys = np.linspace(0, 100, 12)
xx, yy = np.meshgrid(xs, ys)
vals = model.values_at(xx.ravel(), yy.ravel())
heatmap = vals.reshape(12, 16)

Both methods handle coordinates outside the valid 0..100 range according to the model’s coord_policy (warn and clip, raise, or clip silently).

Saving and loading models#

Fitted models can be persisted to .npz files and reloaded later:

from penaltyblog.xt import XTModel

model.save("my_xt_model.npz")
loaded = XTModel.load("my_xt_model.npz")

Saved files are portable and contain all arrays needed to score new events or query values. You do not need access to the original training data.

Plotting#

Plotting uses penaltyblog.viz.Pitch under the hood:

model.plot()

You can pass an existing Pitch instance or override the default colours and opacity:

model.plot(colorscale="Viridis", opacity=0.7, show_colorbar=True)

API Reference#

XTModel#

XTEventSchema#

XTData#

load_pretrained_xt#

Fitted attributes#

After calling XTModel.fit(), the following attributes are available:

surface_ — numpy.ndarray of shape (n_rows, n_cols) containing the fitted xT values.
shot_probability_ — per-cell probability that an action is a shot.
goal_probability_ — per-cell probability that a shot results in a goal (beta-binomial smoothed).
move_probability_ — per-cell probability that an action is a successful move.
transition_matrix_ — effective combined transition matrix of shape (n_cells, n_cells).
metadata_ — dict with model hyperparameters, included families, and fit schema.
included_move_families_ — list of move event types that were present in the training data.
included_shot_families_ — list of shot event types that were present in the training data.
fitted_ — True once the model has been fit or loaded.

Troubleshooting#

Missing `is_success`#

Error: Missing success information. Provide an is_success column ...

The is_success column is mandatory for both fit and score. For moves it records whether the action was completed; for shots it records whether a goal was scored. If your provider uses string labels, map them with XTEventSchema(success_value_map={...}).

Missing move end coordinates#

Error: xT fit requires move end coordinates when move events are present.

When the data contains move events (passes, carries, etc.), fit needs end_x and end_y columns so it can learn transition patterns. Use XTEventSchema(end_x=..., end_y=...) to map your provider’s destination coordinate columns.

No shot events in training data#

Warning: No shot events found in the training data. The fitted xT surface will be all zeros.

A model trained without shots cannot learn goal probability and will return zero everywhere. Check that your event_type_map correctly maps provider shot labels to "shot" (and "free_kick_shot" if applicable).

Invalid success labels#

Error: Invalid values found in is_success: ...

xT rejects non-boolean, non-numeric labels by default. If your feed uses strings such as "Complete" or "Goal", provide an explicit mapping:

schema = XTEventSchema(success_value_map={"Complete": True, "Incomplete": False, "Goal": True})
model.fit(df, schema=schema)

Ill-conditioned matrix#

Warning: xT transition matrix is ill-conditioned ...

This can happen with very sparse data or a very fine grid relative to the dataset size. The surface may contain extreme or unstable values. Remedies:

Increase the amount of training data.
Use a coarser grid (e.g. n_cols=12, n_rows=8 instead of 16x12).
Check that your data contains a representative mix of shots and moves.
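To see what ill-conditioning means for the solve, the condition number of (I - MT) can be inspected directly with NumPy. The function name and the `max_cond` threshold below are hypothetical choices for this sketch, not the library's actual check.

```python
import warnings

import numpy as np

def solve_with_condition_check(MT, S, max_cond=1e8):
    """Solve (I - MT) X = S and warn when the system is ill-conditioned.

    max_cond is an arbitrary threshold chosen for this illustration.
    """
    A = np.eye(MT.shape[0]) - MT
    cond = np.linalg.cond(A)
    if cond > max_cond:
        warnings.warn(f"(I - MT) is ill-conditioned (cond={cond:.3g})")
    return np.linalg.solve(A, S), cond

S = np.array([0.01, 0.10])

# Healthy case: a meaningful share of probability leaks out of every row.
healthy = np.array([[0.5, 0.3], [0.2, 0.5]])
_, cond_ok = solve_with_condition_check(healthy, S)

# Pathological case: two cells pass to each other almost without loss.
a = 1.0 - 1e-10
trapped = np.array([[0.0, a], [a, 0.0]])
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    _, cond_bad = solve_with_condition_check(trapped, S)

print(cond_ok, cond_bad)
```

The pathological case is the kind of pattern that sparse data can produce: cells whose observed moves almost never end in a shot or a turnover, leaving (I - MT) close to singular.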

Out-of-bounds coordinates#

Warning/Error: xT fit/score received coordinates outside expected 0..100

Your provider probably uses a different coordinate scale (e.g. StatsBomb 0-120 x 0-80). Declare the correct ranges in XTEventSchema:

schema = XTEventSchema(x_range=(0, 120), y_range=(0, 80))

Notes#

Each move family has its own transition matrix, shrunk toward the pooled baseline for sparse cells.
Failed actions act as a turnover discount — they reduce move_prob without contributing transitions.
Goal probability is estimated per cell with light smoothing.
Each event type maps unambiguously to a role — e.g. corner is always a move, free_kick_shot is always a shot.