Expected Threat (xT)
====================

This module implements a **position-based Expected Threat (xT)** model for
football event data. The implementation follows the direct linear algebra
formulation:

.. math::

   X = S + MTX

which is solved as:

.. math::

   (I - MT) X = S

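Here :math:`X` is the vector of per-cell xT values, :math:`S` is the per-cell
shot payoff (shot probability times goal probability), and :math:`MT` is the
combined move-transition product defined under
:ref:`per-family transitions <per-family-transitions>`.

This is the **direct solver** approach: the linear system is solved in one
step, with no iterative convergence path.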
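
As a minimal sketch of the direct solve (not the library's internal code), the
toy example below uses a made-up three-cell pitch. The arrays ``S`` and ``MT``
are illustrative assumptions; the only library call is ``numpy.linalg.solve``.

.. code-block:: python

   import numpy as np

   # Hypothetical 3-cell pitch, ordered from own half (0) to attacking third (2).
   # S[i]: probability of shooting from cell i times the probability of scoring.
   S = np.array([0.00, 0.01, 0.10])

   # MT[i, j]: probability that an action from cell i is a successful move to j.
   # Row sums are below 1; the remainder covers shots and turnovers.
   MT = np.array([
       [0.30, 0.40, 0.05],
       [0.10, 0.30, 0.30],
       [0.05, 0.15, 0.30],
   ])

   # Solve (I - MT) X = S in one step instead of iterating X <- S + MT @ X.
   X = np.linalg.solve(np.eye(3) - MT, S)

   # The solution satisfies the fixed-point equation X = S + MT X.
   assert np.allclose(X, S + MT @ X)
   print(X)

In this toy setup the values increase toward the attacking cell, because that
is where most of the shot payoff sits.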

Key characteristics
-------------------

- Position-based xT on a normalized ``0-100`` pitch.
- Grid discretization with a practical default of ``16x12``.
- **One unified xT surface** from all included attacking event families.
- Passes, carries, throw-ins, goal kicks, corners, and free kicks
  are treated as **move** actions.
- Shots and direct free-kick shots are treated as **shot** actions.
- Each move family maintains its own transition matrix, with sparse
  families shrunk toward the pooled baseline (see
  :ref:`per-family transitions <per-family-transitions>` below).
- Failed actions act as an implicit **turnover discount** — they consume
  probability without contributing transitions, so distant cells have
  lower xT (see :ref:`turnover discount <turnover-discount>` below).
- Successful moves are scored by the delta ``xT(end) - xT(start)``.
- Goal probability is estimated **per cell** with light beta-binomial smoothing.
- Plotting integrates with :class:`penaltyblog.viz.Pitch`.
- Provider-specific coordinate ranges can be normalized into the internal
  ``0-100`` xT coordinate system via :class:`~penaltyblog.xt.XTEventSchema`.
- Out-of-bounds coordinate handling is explicit and configurable via
  ``XTModel(coord_policy=...)``.
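
The per-cell goal probability smoothing mentioned in the list above can be
sketched as a Beta posterior mean. This is an illustration under stated
assumptions: the prior is centred on the overall conversion rate, and
``prior_strength`` is a hypothetical pseudo-count, not a documented
``penaltyblog`` parameter.

.. code-block:: python

   import numpy as np

   def smooth_goal_probability(goals, shots, prior_strength=1.0):
       """Posterior-mean goal rate per cell under a Beta prior (sketch only)."""
       goals = np.asarray(goals, dtype=float)
       shots = np.asarray(shots, dtype=float)
       total_shots = shots.sum()
       base_rate = goals.sum() / total_shots if total_shots > 0 else 0.0
       # Beta(alpha, beta) prior whose mean equals the overall conversion rate.
       alpha = prior_strength * base_rate
       beta = prior_strength * (1.0 - base_rate)
       return (goals + alpha) / (shots + alpha + beta)

   # Hypothetical counts: an empty cell, a sparse cell, and a busy cell.
   goals = np.array([0, 1, 30])
   shots = np.array([0, 2, 98])
   goal_prob = smooth_goal_probability(goals, shots)
   print(goal_prob)

The sparse cell's raw rate of 1/2 is pulled toward the overall rate, while the
busy cell barely moves.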

Supported event families
------------------------

Move events:

- ``pass`` — always included
- ``carry`` — included by default (``include_carries=True``)
- ``throw_in`` — included by default (``include_throw_ins=True``)
- ``goal_kick`` — included by default (``include_goal_kicks=True``)
- ``free_kick`` — included by default (``include_free_kicks=True``)
- ``corner`` — included by default (``include_corners=True``)

Shot events:

- ``shot`` — always included
- ``free_kick_shot`` — included when ``include_free_kicks=True``

Ignored events:

- ``penalty``, ``penalty_kick``, ``own_goal``, ``shot_against``,
  ``shootout``, ``postmatch_penalty``
- Any event not classifiable into a move or shot

.. _turnover-discount:

Turnover discount
-----------------

The denominator for action probabilities includes **all move attempts**
(successful and failed), not just successful ones. This means:

.. math::

   \text{move\_prob}(i) = \frac{\text{successful\_moves}(i)}{\text{shots}(i) + \text{all\_moves}(i)}

The gap :math:`1 - \text{shot\_prob} - \text{move\_prob}` is the per-cell
probability of losing possession without progressing the ball. This acts
as a natural discount: cells far from goal need many successful transitions
to reach a shooting position, and each step has a chance of failure that
compounds multiplicatively through the linear solve.
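
The arithmetic above can be checked with a few made-up counts. The arrays
below are hypothetical per-cell totals, not output from the library.

.. code-block:: python

   import numpy as np

   # Hypothetical per-cell counts for a 3-cell pitch.
   shots = np.array([0, 5, 40])
   successful_moves = np.array([700, 600, 300])
   failed_moves = np.array([300, 395, 160])

   # Denominator: every action from the cell, including failed moves.
   all_moves = successful_moves + failed_moves
   total = shots + all_moves

   shot_prob = shots / total
   move_prob = successful_moves / total

   # What is left over is the chance of losing the ball from that cell.
   turnover_prob = 1.0 - shot_prob - move_prob
   print(turnover_prob)

Because failed moves only appear in the denominator, the leftover probability
equals ``failed_moves / total`` in each cell.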

.. _per-family-transitions:

Per-family transitions
----------------------

Rather than pooling all move events into a single transition matrix, the
model builds a **separate transition matrix per move family**. This means
a throw-in from a cell near the touchline has a different destination
distribution from an open-play pass originating in the same cell.

To handle sparse families (e.g. free kicks, which may have very few
observations from any single cell), each family's transition row is
**shrunk toward the pooled transition**:

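.. math::

   T_f^{\text{smooth}}(i) = \frac{\text{counts}_f(i) + k \cdot T_{\text{pooled}}(i)}{n_f(i) + k}

Here :math:`\text{counts}_f(i)` is the row of destination counts for family
*f* from cell *i*, and :math:`n_f(i)` is the total of that row. With the
default ``k = 5``, the family's own counts and the pooled row carry equal
weight at :math:`n_f(i) = 5`, so a family needs more than about 5 events from
a cell before its pattern meaningfully diverges from the pooled baseline.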
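
A short numerical sketch of the shrinkage formula for a single start cell
follows. The helper name ``shrink_row`` and the example rows are hypothetical,
and the sketch assumes :math:`n_f(i)` is the sum of the family's destination
counts from that cell.

.. code-block:: python

   import numpy as np

   def shrink_row(counts_row, pooled_row, k=5.0):
       """Blend a family's destination counts with the pooled transition row."""
       counts_row = np.asarray(counts_row, dtype=float)
       pooled_row = np.asarray(pooled_row, dtype=float)
       n = counts_row.sum()
       return (counts_row + k * pooled_row) / (n + k)

   # Pooled transition row for one start cell over 4 destination cells.
   pooled = np.array([0.10, 0.40, 0.40, 0.10])

   # Hypothetical free-kick counts: all 5 events ended in destination cell 2.
   free_kick_counts = np.array([0, 0, 5, 0])

   smoothed = shrink_row(free_kick_counts, pooled)
   print(smoothed)  # halfway between [0, 0, 1, 0] and the pooled row

   # With no observations, the row falls back to the pooled baseline.
   empty = shrink_row(np.zeros(4), pooled)

With exactly ``k`` observations the result is the midpoint of the empirical
and pooled rows, and the smoothed row still sums to 1.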

The combined move-transition product used in the solve is:

.. math::

   MT = \sum_f \operatorname{diag}(p_f) \cdot T_f^{\text{smooth}}

where :math:`p_f(i)` is the fraction of all actions (shots + all move
attempts) from cell *i* that are successful moves of family *f*.
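
The sum over families can be sketched with two made-up families on a two-cell
pitch. The dictionaries ``p`` and ``T`` are illustrative assumptions, not
attributes of ``XTModel``.

.. code-block:: python

   import numpy as np

   # p[f][i]: share of all actions from cell i that are successful moves of f.
   p = {
       "pass": np.array([0.60, 0.50]),
       "throw_in": np.array([0.10, 0.05]),
   }
   # T[f][i, j]: smoothed probability that a successful move of family f
   # starting in cell i ends in cell j (each row sums to 1).
   T = {
       "pass": np.array([[0.5, 0.5], [0.2, 0.8]]),
       "throw_in": np.array([[0.9, 0.1], [0.4, 0.6]]),
   }

   # Scaling row i of T_f by p_f[i] is the same as diag(p_f) @ T_f.
   MT = sum(p[f][:, None] * T[f] for f in p)
   print(MT)

   # Row sums of MT equal the total successful-move probability per cell.
   print(MT.sum(axis=1))

Each row of ``MT`` sums to the cell's overall successful-move probability,
which ties this formula back to the turnover discount.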

Provider-agnostic schema
------------------------

The :class:`~penaltyblog.xt.XTEventSchema` class defines column/range/label
mapping for a DataFrame (or ``penaltyblog.matchflow.Flow``). A single
``is_success`` column has consistent meaning
across event types: for moves it means the action was completed
successfully; for shots it means a goal was scored.

This success signal is required when fitting or scoring xT.

``XTEventSchema`` and ``XTModel`` are strict about success labels
by default.
The recommended input is a boolean ``is_success`` column. Numeric ``0``/``1``
is also accepted. Provider-specific strings such as ``"Complete"``,
``"Incomplete"``, ``"Goal"``, or ``"Saved"`` must be mapped explicitly with
``success_value_map``.

Why strict validation is the default
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The canonical schema expects ``is_success`` to be boolean, and that remains
the recommended format. Many real event feeds encode success as strings
or provider-specific labels, but guessing those labels is error-prone and
can silently corrupt xT values.

xT therefore rejects non-boolean string labels by default and tells you
to provide an explicit ``success_value_map``. This is safer for analysts,
new Python users, and coding agents because incorrect labels fail fast
instead of producing plausible but wrong results.

For maximum control, build an ``XTEventSchema(success_value_map=...)`` and
pass it as the ``schema`` argument of ``XTModel.fit(...)`` or
``XTModel.score(...)``.

If xT encounters unsupported values, it raises a ``ValueError`` that shows
the offending labels and points you to ``success_value_map``.
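
The fail-fast behaviour described above can be illustrated with a small
standalone helper. ``map_success_labels`` is a hypothetical function written
for this sketch; it mimics the documented behaviour but is not part of
``penaltyblog``, and its error message is only modelled on the one shown in
the troubleshooting section.

.. code-block:: python

   def map_success_labels(values, success_value_map=None):
       """Map raw success labels to booleans, rejecting unknown labels."""
       mapping = dict(success_value_map or {})
       mapped, unknown = [], set()
       for value in values:
           if isinstance(value, bool):
               mapped.append(value)
           elif value in mapping:
               mapped.append(bool(mapping[value]))
           elif isinstance(value, (int, float)) and value in (0, 1):
               mapped.append(bool(value))
           else:
               unknown.add(value)
       if unknown:
           # Fail fast rather than guessing what a provider label means.
           raise ValueError(
               f"Invalid values found in is_success: {sorted(map(str, unknown))}. "
               "Provide an explicit success_value_map."
           )
       return mapped

   labels = ["Complete", "Incomplete", "Goal", True, 0]
   value_map = {"Complete": True, "Incomplete": False, "Goal": True}
   result = map_success_labels(labels, value_map)
   print(result)

Calling the helper with an unmapped label such as ``"Saved"`` raises a
``ValueError`` instead of silently treating it as a failure.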

Coordinate ranges and normalization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Internally, xT uses a normalized ``0-100`` pitch for both axes.
If your provider uses different ranges (for example ``x=0..120``,
``y=0..80``), declare them in ``XTEventSchema``:

.. code-block:: python

   from penaltyblog.xt import XTEventSchema

   schema = XTEventSchema(
       x="x",
       y="y",
       event_type="event_type",
       end_x="end_x",
       end_y="end_y",
       is_success="is_success",
       x_range=(0, 120),
       y_range=(0, 80),
   )
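
Conceptually, declaring a range amounts to rescaling each axis onto ``0-100``.
The sketch below assumes a plain linear rescale; ``normalize_coords`` is a
hypothetical helper, not a ``penaltyblog`` function.

.. code-block:: python

   import numpy as np

   def normalize_coords(values, value_range, target=(0.0, 100.0)):
       """Linearly rescale coordinates from value_range onto the target range."""
       lo, hi = value_range
       t_lo, t_hi = target
       values = np.asarray(values, dtype=float)
       return t_lo + (values - lo) * (t_hi - t_lo) / (hi - lo)

   # A 120 x 80 provider pitch: corner, centre spot, opposite corner.
   x = normalize_coords([0, 60, 120], (0, 120))
   y = normalize_coords([0, 40, 80], (0, 80))
   print(x, y)

The centre spot of the ``120 x 80`` pitch lands at ``(50, 50)`` after
rescaling.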

Coordinate validation policy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After normalization, ``XTModel`` validates coordinates before clipping.
Use ``coord_policy`` to control behavior when values fall outside ``0..100``:

- ``"warn"`` (default): emit a warning and clip.
- ``"error"``: raise ``ValueError``.
- ``"clip"``: silently clip.

This applies in both ``fit`` and ``score``.
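
The three policies can be sketched with a standalone helper.
``apply_coord_policy`` is hypothetical and only mirrors the behaviour listed
above; the real validation lives inside ``XTModel``.

.. code-block:: python

   import warnings

   import numpy as np

   def apply_coord_policy(values, policy="warn", lo=0.0, hi=100.0):
       """Validate coordinates, then clip them unless the policy is "error"."""
       if policy not in ("warn", "error", "clip"):
           raise ValueError(f"Unknown coord_policy: {policy!r}")
       values = np.asarray(values, dtype=float)
       out_of_bounds = (values < lo) | (values > hi)
       if out_of_bounds.any():
           message = f"{int(out_of_bounds.sum())} coordinate(s) outside {lo}..{hi}"
           if policy == "error":
               raise ValueError(message)
           if policy == "warn":
               warnings.warn(message + "; clipping", stacklevel=2)
       return np.clip(values, lo, hi)

   # "clip" stays silent and pulls values back onto the pitch.
   clipped = apply_coord_policy([-5, 50, 104], policy="clip")
   print(clipped)

Validation runs before clipping so that ``"error"`` and ``"warn"`` can still
see the offending values.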

Usage
-----

Fit on a raw DataFrame or MatchFlow Flow (quick path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can pass a DataFrame or ``penaltyblog.matchflow.Flow`` directly to
``fit``/``score``.
If your columns already use canonical names (``x``, ``y``, ``event_type``,
``end_x``, ``end_y``, ``is_success``), no extra arguments are needed:

.. code-block:: python

   from penaltyblog.xt import XTModel

   xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
   xt.fit(df)
   scored = xt.score(df)

This quick path assumes ``is_success`` is already boolean (or numeric ``0``/``1``).
If your provider uses strings such as ``"Complete"`` or ``"Goal"``, pass
``XTEventSchema(success_value_map=...)`` as shown below.

With MatchFlow:

.. code-block:: python

   from penaltyblog.matchflow import Flow
   from penaltyblog.xt import XTModel

   flow = Flow.from_records(records)
   xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
   xt.fit(flow)
   scored = xt.score(flow)

For non-canonical column names, ranges, or label mapping, define an
``XTEventSchema``:

.. code-block:: python

   from penaltyblog.xt import XTEventSchema, XTModel

   schema = XTEventSchema(
       x="location_x",
       y="location_y",
       event_type="type_primary",
       end_x="end_location_x",
       end_y="end_location_y",
       is_success="is_successful",
       x_range=(0, 120),
       y_range=(0, 80),
       event_type_map={"Pass": "pass", "Shot": "shot"},
       success_value_map={"Complete": True, "Incomplete": False, "Goal": True},
   )
   xt = XTModel(n_cols=16, n_rows=12, coord_policy="warn")
   xt.fit(df, schema=schema)
   scored = xt.score(df)  # reuses fitted schema by default

For many users this is the best default workflow:

1. Start with a raw provider DataFrame.
2. Pass explicit column mappings.
3. Pass ``event_type_map`` for event labels.
4. Pass ``success_value_map`` for success/outcome labels.

That keeps the transformation visible and reproducible.

If you call ``score`` with default options, it reuses the schema
from ``fit`` so you do not need to repeat column mappings.

Load a pretrained surface
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from penaltyblog.xt import load_pretrained_xt

   model = load_pretrained_xt(name="default")

The bundled ``"default"`` artifact is fit on ~14 million events from the big
five European leagues across multiple seasons, including passes, throw-ins,
free kicks, corners, and shots.

Scoring
^^^^^^^

Scoring adds three columns to the input DataFrame:

- ``xt_start`` — xT value at the start location (set for all moves with
  valid start coordinates, **including failed moves**)
- ``xt_end`` — xT value at the end location (set only for successful moves)
- ``xt_added`` — ``xt_end - xt_start`` (set only for successful moves)

.. note::

   ``xt_start`` is populated for **all** moves with valid start coordinates,
   regardless of whether the move succeeded. This lets you measure the
   possession value that was risked on a failed pass. ``xt_end`` and
   ``xt_added`` are only set for *successful* moves that also have valid
   end coordinates.
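
The column semantics can be illustrated on a toy surface. The arrays below are
made up, and representing unset values as ``NaN`` is an assumption of this
sketch rather than a documented guarantee.

.. code-block:: python

   import numpy as np

   # Hypothetical fitted surface on a 2x2 grid, indexed as surface[row, col].
   surface = np.array([[0.01, 0.05],
                       [0.02, 0.20]])

   # Start and end cells as (row, col) pairs, plus a success flag, for 3 moves.
   start = np.array([[0, 0], [1, 0], [0, 1]])
   end = np.array([[1, 1], [1, 1], [0, 0]])
   is_success = np.array([True, False, True])

   # xt_start is looked up for every move, including the failed one.
   xt_start = surface[start[:, 0], start[:, 1]]

   # xt_end is only kept for successful moves; the failed move gets NaN.
   xt_end = np.where(is_success, surface[end[:, 0], end[:, 1]], np.nan)
   xt_added = xt_end - xt_start
   print(xt_start, xt_end, xt_added)

The third move is successful but goes backwards, so its ``xt_added`` is
negative; the failed move keeps its ``xt_start`` and has no ``xt_added``.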

By default, ``score()`` reuses the schema saved during ``fit``:

.. code-block:: python

   scored = model.score(data)
   scored[["xt_start", "xt_end", "xt_added"]]

To score with a different schema (for example, data from a different
provider), pass a new ``XTEventSchema`` and set ``use_fit_schema=False``:

.. code-block:: python

   scored = model.score(other_df, schema=other_schema, use_fit_schema=False)

Querying xT values
^^^^^^^^^^^^^^^^^^

After fitting (or loading a pretrained model), you can query the xT value
at any normalized ``0-100`` coordinate with :meth:`XTModel.value_at`:

.. code-block:: python

   model.value_at(85, 50)   # xT near the penalty spot
   model.value_at(10, 50)   # xT in the defensive half

For bulk lookups, use the vectorised :meth:`XTModel.values_at`:

.. code-block:: python

   import numpy as np

   # Sample the centre of each cell on the default 16x12 grid.
   xs = (np.arange(16) + 0.5) * 100 / 16
   ys = (np.arange(12) + 0.5) * 100 / 12
   xx, yy = np.meshgrid(xs, ys)
   vals = model.values_at(xx.ravel(), yy.ravel())
   heatmap = vals.reshape(12, 16)  # (n_rows, n_cols)

Both methods clip coordinates to the valid ``0..100`` range according to
the model's ``coord_policy``.
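
For intuition, a coordinate lookup on a regular grid can be sketched as
follows. ``cell_index`` is a hypothetical helper, and the row orientation
(row 0 at ``y = 0``) is an assumption of this sketch; the library's internal
indexing may differ.

.. code-block:: python

   import numpy as np

   def cell_index(x, y, n_cols=16, n_rows=12):
       """Return (row, col) grid indices for normalized 0-100 coordinates."""
       x = np.clip(np.asarray(x, dtype=float), 0.0, 100.0)
       y = np.clip(np.asarray(y, dtype=float), 0.0, 100.0)
       # Scale to grid units; the minimum keeps 100 inside the last cell.
       col = np.minimum((x / 100.0 * n_cols).astype(int), n_cols - 1)
       row = np.minimum((y / 100.0 * n_rows).astype(int), n_rows - 1)
       return row, col

   row, col = cell_index([0, 85, 100], [0, 50, 100])
   print(row, col)

Every coordinate inside the same cell maps to the same index, so a lookup
returns one value per cell rather than an interpolated value.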

Saving and loading models
^^^^^^^^^^^^^^^^^^^^^^^^^

Fitted models can be persisted to ``.npz`` files and reloaded later:

.. code-block:: python

   from penaltyblog.xt import XTModel

   model.save("my_xt_model.npz")
   loaded = XTModel.load("my_xt_model.npz")

Saved files are portable and contain all arrays needed to score new events
or query values. You do not need access to the original training data.

Plotting
^^^^^^^^

Plotting uses :class:`penaltyblog.viz.Pitch` under the hood:

.. code-block:: python

   model.plot()

You can pass an existing :class:`~penaltyblog.viz.Pitch` instance or
override the default colours and opacity:

.. code-block:: python

   model.plot(colorscale="Viridis", opacity=0.7, show_colorbar=True)

API Reference
-------------

XTModel
^^^^^^^

.. autoclass:: penaltyblog.xt.XTModel
   :members:
   :undoc-members:

XTEventSchema
^^^^^^^^^^^^^

.. autoclass:: penaltyblog.xt.XTEventSchema
   :members:
   :undoc-members:

XTData
^^^^^^

.. autoclass:: penaltyblog.xt.XTData
   :members:
   :undoc-members:

load_pretrained_xt
^^^^^^^^^^^^^^^^^^

.. autofunction:: penaltyblog.xt.load_pretrained_xt

Fitted attributes
^^^^^^^^^^^^^^^^^

After calling :meth:`XTModel.fit`, the following attributes are available:

- ``surface_`` — ``numpy.ndarray`` of shape ``(n_rows, n_cols)`` containing the fitted xT values.
- ``shot_probability_`` — per-cell probability that an action is a shot.
- ``goal_probability_`` — per-cell probability that a shot results in a goal (beta-binomial smoothed).
- ``move_probability_`` — per-cell probability that an action is a successful move.
- ``transition_matrix_`` — effective combined transition matrix of shape ``(n_cells, n_cells)``.
- ``metadata_`` — dict with model hyperparameters, included families, and fit schema.
- ``included_move_families_`` — list of move event types that were present in the training data.
- ``included_shot_families_`` — list of shot event types that were present in the training data.
- ``fitted_`` — ``True`` once the model has been fit or loaded.

Troubleshooting
---------------

Missing ``is_success``
^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``Missing success information. Provide an is_success column ...``

The ``is_success`` column is mandatory for both ``fit`` and ``score``.
For moves it records whether the action was completed; for shots it records
whether a goal was scored. If your provider uses string labels, map them
with ``XTEventSchema(success_value_map={...})``.

Missing move end coordinates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``xT fit requires move end coordinates when move events are present.``

When the data contains move events (passes, carries, etc.), ``fit`` needs
``end_x`` and ``end_y`` columns so it can learn transition patterns. Use
``XTEventSchema(end_x=..., end_y=...)`` to map your provider's destination
coordinate columns.

No shot events in training data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Warning:** ``No shot events found in the training data. The fitted xT surface will be all zeros.``

A model trained without shots cannot learn goal probability and will return
zero everywhere. Check that your ``event_type_map`` correctly maps provider
shot labels to ``"shot"`` (and ``"free_kick_shot"`` if applicable).
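
The all-zero result follows directly from the linear system: with no shots,
the shot payoff vector is zero, and so is the solution. The toy ``MT`` below
is a made-up example.

.. code-block:: python

   import numpy as np

   # Hypothetical move-transition product on a 2-cell pitch.
   MT = np.array([[0.3, 0.4],
                  [0.2, 0.5]])

   # No shots anywhere means the shot payoff vector S is all zeros.
   S = np.zeros(2)

   X = np.linalg.solve(np.eye(2) - MT, S)
   print(X)  # every cell has zero xT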

Invalid success labels
^^^^^^^^^^^^^^^^^^^^^^

**Error:** ``Invalid values found in is_success: ...``

xT rejects non-boolean, non-numeric labels by default. If your feed uses
strings such as ``"Complete"`` or ``"Goal"``, provide an explicit mapping:

.. code-block:: python

   from penaltyblog.xt import XTEventSchema

   schema = XTEventSchema(
       success_value_map={"Complete": True, "Incomplete": False, "Goal": True}
   )
   model.fit(df, schema=schema)

Ill-conditioned matrix
^^^^^^^^^^^^^^^^^^^^^^

**Warning:** ``xT transition matrix is ill-conditioned ...``

This can happen with very sparse data or a very fine grid relative to the
dataset size. The surface may contain extreme or unstable values. Remedies:

- Increase the amount of training data.
- Use a coarser grid (e.g. ``n_cols=12, n_rows=8`` instead of ``16x12``).
- Check that your data contains a representative mix of shots and moves.
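
To see what ill-conditioning looks like, the sketch below checks the condition
number of :math:`I - MT` before solving. The helper name, the ``max_cond``
threshold, and both toy matrices are hypothetical; the library's own check may
use different criteria.

.. code-block:: python

   import warnings

   import numpy as np

   def solve_with_condition_check(MT, S, max_cond=1e8):
       """Solve (I - MT) X = S and warn when the system is ill-conditioned."""
       A = np.eye(MT.shape[0]) - MT
       cond = np.linalg.cond(A)
       if cond > max_cond:
           warnings.warn(f"system is ill-conditioned (cond={cond:.2e})", stacklevel=2)
       return np.linalg.solve(A, S), cond

   S = np.array([0.01, 0.10])
   healthy = np.array([[0.3, 0.4], [0.2, 0.5]])
   # Cell 0 almost always moves back to itself, so I - MT is nearly singular.
   degenerate = np.array([[0.999999999, 0.0], [0.5, 0.3]])

   X_ok, cond_ok = solve_with_condition_check(healthy, S)
   with warnings.catch_warnings():
       warnings.simplefilter("ignore")  # silence the expected warning here
       X_bad, cond_bad = solve_with_condition_check(degenerate, S)
   print(cond_ok, cond_bad)

In the degenerate case the solved value for cell 0 is several orders of
magnitude above any probability, which is the kind of extreme value the
warning is meant to flag.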

Out-of-bounds coordinates
^^^^^^^^^^^^^^^^^^^^^^^^^

**Warning/Error:** ``xT fit/score received coordinates outside expected 0..100``

Your provider probably uses a different coordinate scale (e.g. StatsBomb's
``0-120`` by ``0-80`` pitch). Declare the correct ranges in ``XTEventSchema``:

.. code-block:: python

   schema = XTEventSchema(x_range=(0, 120), y_range=(0, 80))

Notes
-----

- Each move family has its own transition matrix, shrunk toward the pooled
  baseline for sparse cells.
- Failed actions act as a turnover discount — they reduce ``move_prob``
  without contributing transitions.
- Goal probability is estimated per cell with light smoothing.
- Each event type maps unambiguously to a role — e.g. ``corner``
  is always a move, ``free_kick_shot`` is always a shot.