Expected Threat (xT)
====================

This module implements a **position-based Expected Threat (xT)** model for football event data.

The implementation follows the direct linear algebra formulation:

.. math::

    X = S + MTX

which is solved as:

.. math::

    (I - MT) X = S

This is the **direct solver** approach (no iterative convergence path).

Key characteristics
-------------------

- Position-based xT on a normalized ``0-100`` pitch.
- Grid discretization with a practical default of ``16x12``.
- **One unified xT surface** from all included attacking event families.
- Passes, carries, throw-ins, goal kicks, corners, and free kicks are treated as **move** actions.
- Shots and direct free-kick shots are treated as **shot** actions.
- Each move family maintains its own transition matrix, with sparse families shrunk toward the pooled baseline (see :ref:`per-family transitions <per-family-transitions>` below).
- Failed actions act as an implicit **turnover discount**: they consume probability without contributing transitions, so distant cells have lower xT (see :ref:`turnover discount <turnover-discount>` below).
- Successful moves are scored by the delta ``xT(end) - xT(start)``.
- Goal probability is estimated **per cell** with light beta-binomial smoothing.
- Plotting integrates with :class:`penaltyblog.viz.Pitch`.
- Provider-specific coordinate ranges can be normalized into the internal ``0-100`` xT coordinate system via :class:`~penaltyblog.xt.XTEventSchema`.
- Out-of-bounds coordinate handling is explicit and configurable via ``XTModel(coord_policy=...)``.

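
As a minimal sketch of the direct solve, consider a toy three-cell pitch. All probabilities are invented for illustration, and the names ``shot_prob``, ``goal_prob``, ``move_prob``, ``T``, ``S`` and ``MT`` simply mirror the symbols above; they are assumptions of this example, not penaltyblog attributes.

.. code-block:: python

    import numpy as np

    # Toy 3-cell pitch ordered from own half (0) to attacking third (2).
    # Every number below is made up for illustration only.
    shot_prob = np.array([0.00, 0.05, 0.30])  # P(shoot | cell)
    goal_prob = np.array([0.00, 0.05, 0.20])  # P(goal | shot from cell)
    move_prob = np.array([0.80, 0.70, 0.50])  # P(successful move | cell)

    # Row-stochastic destination distribution of successful moves.
    T = np.array([
        [0.6, 0.4, 0.0],
        [0.2, 0.5, 0.3],
        [0.0, 0.4, 0.6],
    ])

    S = shot_prob * goal_prob      # immediate scoring threat per cell
    MT = np.diag(move_prob) @ T    # move probability times transition

    # Solve (I - MT) X = S in one step instead of iterating X <- S + MT X.
    X = np.linalg.solve(np.eye(3) - MT, S)

    # The solution satisfies the original fixed-point equation.
    assert np.allclose(X, S + MT @ X)

Because ``shot_prob + move_prob`` is below 1 in every cell, each row of ``MT`` sums to less than 1, which keeps ``I - MT`` invertible in this toy setup.
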
Supported event families
------------------------

Move events:

- ``pass``: always included
- ``carry``: included by default (``include_carries=True``)
- ``throw_in``: included by default (``include_throw_ins=True``)
- ``goal_kick``: included by default (``include_goal_kicks=True``)
- ``free_kick``: included by default (``include_free_kicks=True``)
- ``corner``: included by default (``include_corners=True``)

Shot events:

- ``shot``: always included
- ``free_kick_shot``: included when ``include_free_kicks=True``

Ignored events:

- ``penalty``, ``penalty_kick``, ``own_goal``, ``shot_against``, ``shootout``, ``postmatch_penalty``
- Any event not classifiable into a move or shot

.. _turnover-discount:

Turnover discount
-----------------

The denominator for action probabilities includes **all move attempts** (successful and failed), not just successful ones. This means:

.. math::

    \text{move\_prob}(i) = \frac{\text{successful\_moves}(i)}{\text{shots}(i) + \text{all\_moves}(i)}

The gap :math:`1 - \text{shot\_prob} - \text{move\_prob}` is the per-cell probability of losing possession without progressing the ball. This acts as a natural discount: cells far from goal need many successful transitions to reach a shooting position, and each step has a chance of failure that compounds multiplicatively through the linear solve.

.. _per-family-transitions:

Per-family transitions
----------------------

Rather than pooling all move events into a single transition matrix, the model builds a **separate transition matrix per move family**. This means a throw-in from a cell near the touchline has a different destination distribution from an open-play pass originating in the same cell.

To handle sparse families (e.g. free kicks, which may have very few observations from any single cell), each family's transition row is **shrunk toward the pooled transition**:

.. math::

    T_f^{smooth}(i) = \frac{counts_f(i) + k \cdot T_{pooled}(i)}{n_f(i) + k}

With the default ``k = 5``, a family needs roughly 5+ events from a cell before its pattern meaningfully diverges from the pooled baseline.

The combined move-transition product used in the solve is:

.. math::

    MT = \sum_f \operatorname{diag}(p_f) \cdot T_f^{smooth}

where :math:`p_f(cell)` is the fraction of all actions (shots + all move attempts) from that cell that are successful moves of family *f*.

Provider-agnostic schema
------------------------

The :class:`~penaltyblog.xt.XTEventSchema` class defines column/range/label mapping for a DataFrame (or ``penaltyblog.matchflow.Flow``).

A single ``is_success`` column has consistent meaning across event types: for moves it means the action was completed successfully; for shots it means a goal was scored. This success signal is required when fitting or scoring xT.

``XTEventSchema`` and ``XTModel`` are strict about success labels by default. The recommended input is a boolean ``is_success`` column. Numeric ``0``/``1`` is also accepted. Provider-specific strings such as ``"Complete"``, ``"Incomplete"``, ``"Goal"``, or ``"Saved"`` must be mapped explicitly with ``success_value_map``.

Why strict validation is the default
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The canonical schema expects ``is_success`` to be boolean, and that remains the recommended format. Many real event feeds encode success as strings or provider-specific labels, but guessing those labels is error-prone and can silently corrupt xT values.

xT therefore rejects non-boolean string labels by default and tells you to provide an explicit ``success_value_map``. This is safer for analysts, new Python users, and coding agents because incorrect labels fail fast instead of producing plausible but wrong results.

For maximum control, pass ``XTEventSchema(success_value_map=...)`` to ``XTModel.fit(...)`` / ``XTModel.score(...)``.

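
The fail-fast idea can be illustrated with a small standalone sketch. This is not penaltyblog's implementation: ``normalize_success`` is a hypothetical helper written for this example, and the labels are sample provider strings.

.. code-block:: python

    def normalize_success(values, success_value_map=None):
        """Convert raw success labels to booleans, failing fast on unknowns.

        Hypothetical helper for illustration; not part of penaltyblog.
        """
        mapping = dict(success_value_map or {})
        out, unknown = [], set()
        for v in values:
            if isinstance(v, bool):
                out.append(v)                  # already boolean
            elif v in mapping:
                out.append(bool(mapping[v]))   # explicitly mapped label
            elif isinstance(v, (int, float)) and v in (0, 1):
                out.append(bool(v))            # numeric 0/1
            else:
                unknown.add(v)                 # never guess
        if unknown:
            raise ValueError(
                f"Invalid values found in is_success: {sorted(unknown, key=str)}; "
                "provide an explicit success_value_map"
            )
        return out


    labels = ["Complete", "Incomplete", "Goal", True, 0]
    mapping = {"Complete": True, "Incomplete": False, "Goal": True}
    print(normalize_success(labels, mapping))  # [True, False, True, True, False]

An unmapped label such as ``"Saved"`` raises immediately in this sketch rather than being silently coerced.
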
If xT encounters unsupported values, it raises a ``ValueError`` that shows the offending labels and points you to ``success_value_map``.

Coordinate ranges and normalization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Internally, xT uses a normalized ``0-100`` pitch for both axes. If your provider uses different ranges (for example ``x=0..120``, ``y=0..80``), declare them in ``XTEventSchema``:

.. code-block:: python

    from penaltyblog.xt import XTEventSchema

    schema = XTEventSchema(
        x="x",
        y="y",
        event_type="event_type",
        end_x="end_x",
        end_y="end_y",
        is_success="is_success",
        x_range=(0, 120),
        y_range=(0, 80),
    )

Coordinate validation policy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After normalization, ``XTModel`` validates coordinates before clipping. Use ``coord_policy`` to control behavior when values fall outside ``0..100``:

- ``"warn"`` (default): emit a warning and clip.
- ``"error"``: raise ``ValueError``.
- ``"clip"``: silently clip.

This applies in both ``fit`` and ``score``.

Usage
-----

Fit on a raw DataFrame or MatchFlow Flow (quick path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can pass a DataFrame or ``penaltyblog.matchflow.Flow`` directly to ``fit``/``score``. If your columns already use canonical names (``x``, ``y``, ``event_type``, ``end_x``, ``end_y``, ``is_success``), no extra arguments are needed:

.. code-block:: python

    from penaltyblog.xt import XTModel

    xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
    xt.fit(df)
    scored = xt.score(df)

This quick path assumes ``is_success`` is already boolean (or numeric ``0``/``1``). If your provider uses strings such as ``"Complete"`` or ``"Goal"``, pass ``XTEventSchema(success_value_map=...)`` as shown below.

With MatchFlow:

.. code-block:: python

    from penaltyblog.matchflow import Flow
    from penaltyblog.xt import XTModel

    flow = Flow.from_records(records)

    xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
    xt.fit(flow)
    scored = xt.score(flow)

For non-canonical column names, ranges, or label mapping, define an ``XTEventSchema``:

.. code-block:: python

    from penaltyblog.xt import XTEventSchema, XTModel

    schema = XTEventSchema(
        x="location_x",
        y="location_y",
        event_type="type_primary",
        end_x="end_location_x",
        end_y="end_location_y",
        is_success="is_successful",
        x_range=(0, 120),
        y_range=(0, 80),
        event_type_map={"Pass": "pass", "Shot": "shot"},
        success_value_map={"Complete": True, "Incomplete": False, "Goal": True},
    )

    xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
    xt.fit(df, schema=schema)
    scored = xt.score(df)  # reuses fitted schema by default

For many users this is the best default workflow:

1. Start with a raw provider DataFrame.
2. Pass explicit column mappings.
3. Pass ``event_type_map`` for event labels.
4. Pass ``success_value_map`` for success/outcome labels.

That keeps the transformation visible and reproducible.

If you call ``score`` with default options, it reuses the schema from ``fit`` so you do not need to repeat column mappings.

Load a pretrained surface
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from penaltyblog.xt import load_pretrained_xt

    model = load_pretrained_xt(name="default")

The bundled ``"default"`` artifact is fit on ~14 million events across multiple seasons, including passes, throw-ins, free kicks, corners, and shots taken from the big five European leagues.

Scoring
^^^^^^^

Scoring adds three columns to the input DataFrame:

- ``xt_start``: xT value at the start location (set for all moves with valid start coordinates, **including failed moves**)
- ``xt_end``: xT value at the end location (set only for successful moves)
- ``xt_added``: ``xt_end - xt_start`` (set only for successful moves)

.. note::

    ``xt_start`` is populated for **all** moves with valid start coordinates, regardless of whether the move succeeded. This lets you measure the possession value that was risked on a failed pass. ``xt_end`` and ``xt_added`` are only set for *successful* moves that also have valid end coordinates.

By default, ``score()`` reuses the schema saved during ``fit``:

.. code-block:: python

    scored = model.score(data)
    scored[["xt_start", "xt_end", "xt_added"]]

To score with a different schema (for example, data from a different provider), pass a new ``XTEventSchema`` and set ``use_fit_schema=False``:

.. code-block:: python

    scored = model.score(other_df, schema=other_schema, use_fit_schema=False)

Querying xT values
^^^^^^^^^^^^^^^^^^

After fitting (or loading a pretrained model), you can query the xT value at any normalized ``0-100`` coordinate with :meth:`XTModel.value_at`:

.. code-block:: python

    model.value_at(85, 50)  # xT near the penalty spot
    model.value_at(10, 50)  # xT in the defensive half

For bulk lookups, use the vectorised :meth:`XTModel.values_at`:

.. code-block:: python

    import numpy as np

    xs = np.linspace(0, 100, 16)
    ys = np.linspace(0, 100, 12)
    xx, yy = np.meshgrid(xs, ys)
    vals = model.values_at(xx.ravel(), yy.ravel())
    heatmap = vals.reshape(12, 16)

Both methods clip coordinates to the valid ``0..100`` range according to the model's ``coord_policy``.

Saving and loading models
^^^^^^^^^^^^^^^^^^^^^^^^^

Fitted models can be persisted to ``.npz`` files and reloaded later:

.. code-block:: python

    model.save("my_xt_model.npz")
    loaded = XTModel.load("my_xt_model.npz")

Saved files are portable and contain all arrays needed to score new events or query values. You do not need access to the original training data.

Plotting
^^^^^^^^

Plotting uses :class:`penaltyblog.viz.Pitch` under the hood:

.. code-block:: python

    model.plot()

You can pass an existing :class:`~penaltyblog.viz.Pitch` instance or override the default colours and opacity:

.. code-block:: python

    model.plot(colorscale="Viridis", opacity=0.7, show_colorbar=True)

API Reference
-------------

XTModel
^^^^^^^

.. autoclass:: penaltyblog.xt.XTModel
    :members:
    :undoc-members:

XTEventSchema
^^^^^^^^^^^^^

.. autoclass:: penaltyblog.xt.XTEventSchema
    :members:
    :undoc-members:

XTData
^^^^^^

.. autoclass:: penaltyblog.xt.XTData
    :members:
    :undoc-members:

load_pretrained_xt
^^^^^^^^^^^^^^^^^^

.. autofunction:: penaltyblog.xt.load_pretrained_xt

Fitted attributes
^^^^^^^^^^^^^^^^^

After calling :meth:`XTModel.fit`, the following attributes are available:

- ``surface_``: ``numpy.ndarray`` of shape ``(n_rows, n_cols)`` containing the fitted xT values.
- ``shot_probability_``: per-cell probability that an action is a shot.
- ``goal_probability_``: per-cell probability that a shot results in a goal (beta-binomial smoothed).
- ``move_probability_``: per-cell probability that an action is a successful move.
- ``transition_matrix_``: effective combined transition matrix of shape ``(n_cells, n_cells)``.
- ``metadata_``: dict with model hyperparameters, included families, and fit schema.
- ``included_move_families_``: list of move event types that were present in the training data.
- ``included_shot_families_``: list of shot event types that were present in the training data.
- ``fitted_``: ``True`` once the model has been fit or loaded.

Troubleshooting
---------------

Missing ``is_success``
^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``Missing success information. Provide an is_success column ...``

The ``is_success`` column is mandatory for both ``fit`` and ``score``. For moves it records whether the action was completed; for shots it records whether a goal was scored. If your provider uses string labels, map them with ``XTEventSchema(success_value_map={...})``.

Missing move end coordinates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``xT fit requires move end coordinates when move events are present.``

When the data contains move events (passes, carries, etc.), ``fit`` needs ``end_x`` and ``end_y`` columns so it can learn transition patterns. Use ``XTEventSchema(end_x=..., end_y=...)`` to map your provider's destination coordinate columns.

No shot events in training data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Warning:** ``No shot events found in the training data. The fitted xT surface will be all zeros.``

A model trained without shots cannot learn goal probability and will return zero everywhere. Check that your ``event_type_map`` correctly maps provider shot labels to ``"shot"`` (and ``"free_kick_shot"`` if applicable).

Invalid success labels
^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``Invalid values found in is_success: ...``

xT rejects non-boolean, non-numeric labels by default. If your feed uses strings such as ``"Complete"`` or ``"Goal"``, provide an explicit mapping:

.. code-block:: python

    schema = XTEventSchema(success_value_map={"Complete": True, "Incomplete": False, "Goal": True})
    model.fit(df, schema=schema)

Ill-conditioned matrix
^^^^^^^^^^^^^^^^^^^^^^

**Warning:** ``xT transition matrix is ill-conditioned ...``

This can happen with very sparse data or a very fine grid relative to the dataset size. The surface may contain extreme or unstable values. Remedies:

- Increase the amount of training data.
- Use a coarser grid (e.g. ``n_cols=12, n_rows=8`` instead of ``16x12``).
- Check that your data contains a representative mix of shots and moves.

Out-of-bounds coordinates
^^^^^^^^^^^^^^^^^^^^^^^^^

**Warning/Error:** ``xT fit/score received coordinates outside expected 0..100``

Your provider probably uses a different coordinate scale (e.g. StatsBomb ``0-120`` x ``0-80``). Declare the correct ranges in ``XTEventSchema``:

.. code-block:: python

    schema = XTEventSchema(x_range=(0, 120), y_range=(0, 80))

Notes
-----

- Each move family has its own transition matrix, shrunk toward the pooled baseline for sparse cells.
- Failed actions act as a turnover discount: they reduce ``move_prob`` without contributing transitions.
- Goal probability is estimated per cell with light smoothing.
- Each event type maps unambiguously to a role, e.g. ``corner`` is always a move, ``free_kick_shot`` is always a shot.
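
As a rough illustration of the shrinkage toward the pooled baseline, the sketch below applies the smoothing formula from the per-family transitions section to one start cell. The counts are invented, and ``shrink``, ``pooled_row``, ``sparse_counts`` and ``dense_counts`` are hypothetical names for this example rather than penaltyblog internals.

.. code-block:: python

    import numpy as np

    k = 5.0  # shrinkage strength, matching the documented default

    # Pooled destination distribution for one start cell (sums to 1).
    pooled_row = np.array([0.5, 0.3, 0.2])

    # Destination counts for two families from the same start cell.
    sparse_counts = np.array([0.0, 0.0, 2.0])    # e.g. free kicks, 2 events
    dense_counts = np.array([10.0, 20.0, 70.0])  # e.g. passes, 100 events


    def shrink(counts, pooled, k):
        """Shrink a family's transition row toward the pooled row."""
        n = counts.sum()  # n_f(i): total events of this family from the cell
        return (counts + k * pooled) / (n + k)


    sparse_row = shrink(sparse_counts, pooled_row, k)
    dense_row = shrink(dense_counts, pooled_row, k)

    # Both smoothed rows remain valid probability distributions.
    assert np.isclose(sparse_row.sum(), 1.0)
    assert np.isclose(dense_row.sum(), 1.0)

With only two events, the sparse family's row stays close to the pooled row, while the family with 100 events is dominated by its own counts.
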