How to read the verification metrics
ROC AUC measures how well the forecast ranks lightning-prone cells above non-lightning cells across every possible decision threshold. A value near 1.00 means the system almost always ranks true lightning environments above quiet ones, while a value near 0.50 is no better than random guessing. Because the ROC curve considers both hit rate and false-alarm rate over all thresholds, it is a strong measure of overall discrimination skill.
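The ranking interpretation above can be made concrete with a small sketch. It computes ROC AUC as the fraction of (lightning, quiet) pairs in which the lightning case received the higher probability, counting ties as half. The function name `roc_auc` and the sample values are illustrative assumptions, not HOCO's verification code.

```python
def roc_auc(probs, labels):
    """ROC AUC as the chance that a random positive outranks a random negative."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical forecast probabilities and observed occurrence labels.
probs = [0.9, 0.7, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(probs, labels)
print(round(auc, 3))  # 0.889 (8 of 9 pairs ranked correctly)
```

The pairwise loop is quadratic in the number of cases, so it is only meant to show the definition; a production verifier would normally use a sort-based implementation.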
PR-AUC focuses on precision and recall, so it answers a slightly different question: when the model highlights a grid cell as risky, how often is that signal correct, and how many of the true lightning cells are being captured? This matters for lightning because positive events are rare relative to the total number of grid cells. A model can post a respectable ROC AUC while still having modest precision in rare-event situations, which is why PR-AUC is a valuable companion metric.
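A toy rare-event case shows how the two metrics can diverge. The sketch below uses average precision, one common estimate of the area under the precision-recall curve; whether HOCO computes PR-AUC this way is an assumption. All names and values are hypothetical.

```python
def average_precision(probs, labels):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(probs, labels), key=lambda t: t[0], reverse=True)
    total_pos = sum(y for _, y in ranked)
    if total_pos == 0:
        raise ValueError("need at least one positive case")
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank  # precision when this positive is reached
    return precision_sum / total_pos


def roc_auc(probs, labels):
    """Pairwise ROC AUC (ties count as half), included for comparison."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical case: 100 cells, only 2 with lightning, ranked 3rd and 10th.
probs = [(100 - i) / 100 for i in range(100)]
labels = [1 if i in (2, 9) else 0 for i in range(100)]

auc = roc_auc(probs, labels)
ap = average_precision(probs, labels)
print(round(auc, 3))  # 0.949
print(round(ap, 3))   # 0.267
```

Both lightning cells sit in the top tenth of the ranking, so ROC AUC looks strong, yet most of the cells flagged ahead of them are false alarms, which is what the much lower average precision reflects.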
Brier score measures the mean squared error between the forecast probability and the observed binary outcome. Lower is better, with 0.00 being perfect. It complements ROC AUC and PR-AUC by checking whether the probabilities themselves are numerically sensible, not just well ranked.
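As a minimal sketch of that definition, the snippet below scores two hypothetical forecasts that rank the cases identically but differ in how confident the probabilities are. The function name and values are illustrative assumptions.

```python
def brier_score(probs, labels):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


labels = [1, 0, 1, 0]

# Two hypothetical forecasts with the same ranking of cases.
sharp = [0.9, 0.1, 0.8, 0.3]         # confident and mostly right
squashed = [0.59, 0.51, 0.58, 0.53]  # same order, but hedged towards 0.5

sharp_score = brier_score(sharp, labels)
squashed_score = brier_score(squashed, labels)
print(round(sharp_score, 4))     # 0.0375
print(round(squashed_score, 4))  # 0.2214
```

Because both forecasts order the cases the same way, a ranking metric cannot separate them; the Brier score does, since it penalises the distance between each probability and what actually happened.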
The reliability curve compares the average forecast probability in each occupied bin with the observed lightning frequency in that bin. Curves close to the diagonal are well calibrated. A point above the diagonal means HOCO was underconfident there (lightning occurred more often than forecast); a point below it means the forecast probabilities were too high.
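The binning behind a reliability curve can be sketched as follows. The ten equal-width bins, the function name, and the sample data are assumptions for illustration; the source does not state how HOCO chooses its bins.

```python
def reliability_curve(probs, labels, n_bins=10):
    """Return (mean forecast, observed frequency, count) for each occupied bin."""
    prob_sums = [0.0] * n_bins
    hit_counts = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        prob_sums[b] += p
        hit_counts[b] += y
        counts[b] += 1
    return [
        (prob_sums[b] / counts[b], hit_counts[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0  # skip empty bins, as the curve only uses occupied ones
    ]


# Hypothetical forecasts and outcomes.
probs = [0.05, 0.08, 0.12, 0.15, 0.62, 0.68, 0.85, 0.88]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

curve = reliability_curve(probs, labels)
for mean_forecast, observed_freq, n in curve:
    position = "above" if observed_freq > mean_forecast else "below"
    print(f"forecast {mean_forecast:.3f} observed {observed_freq:.2f} (n={n}, {position} diagonal)")
```

In this toy data the bin around 0.65 verifies every time (above the diagonal, underconfident), while the bin around 0.865 verifies only half the time (below the diagonal, overconfident). With so few cases per bin these frequencies are noisy, which is why the per-bin count is returned alongside them.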
The flash-density chart adds intensity context by plotting observed strikes per approximately 1,000 km^2 against forecast-probability bins. HOCO still does not forecast explicit flash counts, so this is a diagnostic relationship rather than a direct count-forecast skill score, but it helps show whether higher probabilities line up with more concentrated lightning.
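One way such a density diagnostic could be computed is sketched below. It assumes each verification unit covers a nominal 24 km x 24 km area and that density is total strikes divided by total area within each probability bin; the bin count, names, and data are hypothetical and may not match the actual chart.

```python
BLOCK_AREA_KM2 = 24.0 * 24.0  # assumed nominal block area (576 km^2)


def flash_density_by_bin(probs, strike_counts, n_bins=5):
    """Observed strikes per 1,000 km^2 for each occupied forecast-probability bin."""
    strikes = [0] * n_bins
    blocks = [0] * n_bins
    for p, s in zip(probs, strike_counts):
        b = min(int(p * n_bins), n_bins - 1)
        strikes[b] += s
        blocks[b] += 1
    density = {}
    for b in range(n_bins):
        if blocks[b] > 0:
            area_km2 = blocks[b] * BLOCK_AREA_KM2
            density[b] = 1000.0 * strikes[b] / area_km2
    return density


# Hypothetical block probabilities and observed strike counts.
probs = [0.05, 0.10, 0.45, 0.50, 0.90, 0.95]
strike_counts = [0, 0, 1, 3, 12, 20]

density = flash_density_by_bin(probs, strike_counts)
for b, value in sorted(density.items()):
    print(f"bin {b}: {value:.2f} strikes per 1,000 km^2")
```

A density that rises with the bin index is the pattern the chart is looking for: higher forecast probabilities coinciding with more concentrated lightning.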
The reference systems below are included as context only. Their domains, lead times, targets (e.g. total lightning vs cloud-to-ground lightning), spatial scales, and verification methods differ from HOCO's, so they should not be treated as like-for-like rankings against the curves above.
HOCO verification scale: the current verification logic evaluates forecasts on an approximately 8 km grid over the UK and then aggregates neighbouring cells into approximately 24 km spatial blocks before scoring. That reduces the worst of the storm-clustering problem that arises from treating every adjacent grid cell as fully independent. It is still not a full storm-object verification system, but it now includes both calibration diagnostics and an observed-density view in addition to occurrence discrimination.
A block is still treated as a binary event for ROC AUC, PR-AUC, Brier score, and the reliability curve: if at least one lightning strike lands inside it, the observed occurrence label is positive. The added density diagnostic partly addresses that limitation by also tracking how many strikes accumulated within each probability bin, but it should still be read as supporting context rather than a full explicit-intensity forecast.
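The block aggregation and binary labelling described above can be sketched on a toy grid. The 3 x 3 grouping follows from the approximate 8 km and 24 km figures; dropping incomplete edge blocks, the function names, and the data are assumptions made for illustration.

```python
CELLS_PER_BLOCK = 3  # approx. 24 km block / approx. 8 km cell


def aggregate_to_blocks(cell_strikes, factor=CELLS_PER_BLOCK):
    """Sum a 2D grid of per-cell strike counts into factor x factor blocks."""
    n_rows = len(cell_strikes) // factor
    n_cols = len(cell_strikes[0]) // factor
    blocks = [[0] * n_cols for _ in range(n_rows)]
    for r, row in enumerate(cell_strikes):
        for c, count in enumerate(row):
            br, bc = r // factor, c // factor
            if br < n_rows and bc < n_cols:  # assumed: drop incomplete edge blocks
                blocks[br][bc] += count
    return blocks


def occurrence_labels(block_strikes):
    """A block is positive if at least one strike landed inside it."""
    return [[1 if count >= 1 else 0 for count in row] for row in block_strikes]


# Hypothetical 6 x 6 cell grid: one storm over four adjacent cells, plus one stray strike.
cell_strikes = [
    [0, 0, 0, 0, 0, 0],
    [0, 4, 7, 0, 0, 0],
    [0, 3, 9, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

block_strikes = aggregate_to_blocks(cell_strikes)
labels = occurrence_labels(block_strikes)
print(block_strikes)  # [[23, 0], [0, 1]]
print(labels)         # [[1, 0], [0, 1]]
```

The four storm cells collapse into a single positive block, so one storm is no longer scored as four independent hits. The block with 23 strikes and the block with 1 strike also receive the same occurrence label, which is the limitation the density diagnostic is meant to soften.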
Forecast polygons are now matched more strictly as well: there is no extra blanket 10 km polygon buffer in the verifier, and overlapping forecast zones are combined into a single block probability instead of stopping at the first matching zone. That makes the benchmark less optimistic than the earlier version.
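The difference between stopping at the first matching zone and combining all matching zones can be illustrated with a simplified sketch. The source does not state the combination rule, so taking the maximum probability is purely an assumption here, as are the rectangular zones, the block-centre test, and every name in the code.

```python
def zones_covering(point, zones):
    """Zones whose rectangle contains the point exactly (no buffer is applied)."""
    x, y = point
    return [
        z for z in zones
        if z["xmin"] <= x <= z["xmax"] and z["ymin"] <= y <= z["ymax"]
    ]


def first_match_probability(point, zones):
    """Earlier-style behaviour: use the first zone that matches."""
    hits = zones_covering(point, zones)
    return hits[0]["prob"] if hits else 0.0


def combined_probability(point, zones):
    """Assumed combination rule: the highest probability among all matching zones."""
    hits = zones_covering(point, zones)
    return max((z["prob"] for z in hits), default=0.0)


# Hypothetical zones in km coordinates: a broad low-risk area with a high-risk core.
zones = [
    {"xmin": 0, "xmax": 100, "ymin": 0, "ymax": 100, "prob": 0.2},
    {"xmin": 40, "xmax": 60, "ymin": 40, "ymax": 60, "prob": 0.6},
]

block_centre = (50, 50)
print(first_match_probability(block_centre, zones))  # 0.2
print(combined_probability(block_centre, zones))     # 0.6
```

With first-match logic the result depends on the order in which zones happen to be listed; combining all matches removes that dependence. Other combination rules are equally plausible, so the maximum should be read only as a placeholder.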
Research note: public lightning papers do not all publish the same skill metrics. A few provide scalar ROC AUC or PR-AUC values, while others focus on CSI, POD/FAR, reliability, or lead time instead. The table below prioritizes studies that actually report AUC-style values, and marks the metric cell as not reported where a directly comparable AUC/PR-AUC was not published in the cited source.
| System | Typical resolution / scale | Published ROC AUC / PR-AUC | Published skill note | Source |
| --- | --- | --- | --- | --- |
| ECMWF IFS lightning parametrization | Global model around 9 km operationally in the cited note; some verification examples are aggregated to 50 km+ scales and 6 h+ windows. | Not reported in the cited newsletter article | Useful ensemble skill to at least day 3; reported map correlations up to about 0.75 for daily averages over 5-degree boxes. | ECMWF Newsletter 155 |
| Met Office UKV / UKCP CPM | Convection-permitting kilometre-scale guidance; the cited UKCP CPM report evaluates lightning more qualitatively than with a directly comparable ROC curve. | Not reported in the cited report | Lightning output is assessed only subjectively in this report; the scheme is described as performing better in summer and overpredicting in winter. | Met Office UKCP convection-permitting report |
| HARMONIE-AROME | 2.5 km horizontal grid spacing in the cited Met Eireann description. | Not reported in the cited overview | Met Eireann notes that the higher-resolution HARMONIE-AROME setup tends to forecast small-scale features such as thunderstorms better than coarser global models. | Met Eireann: The Ins and Outs of Weather Models |
| DWD CellMOS + ICON | ICON-EU is about 6.5 km native (~7 km output grid); ICON-D2 is 2.2 km for very-short-range hazardous convection. | Not reported in the cited operational page | Operational thunderstorm guidance updated every 5 minutes from radar, lightning observations, and ICON model guidance, including empirical flash-count information. | DWD CellMOS system |
| NOAA LightningCast | Satellite-based next-hour nowcasting on geostationary imager grids rather than a UK land-only verification mesh like HOCO's 8 km grid. | Not reported as ROC AUC / PR-AUC in the cited paper | Deep-learning next-hour lightning nowcasting; the NOAA paper reports that the system frequently provides around 20 minutes or more of lead time to new lightning activity. | Weather and Forecasting (2022): LightningCast |
| Operational gridded CG lightning forecast (ConUS + Alaska) | Uniform 20 km continental grid. | ROC AUC = 0.9398 (NARR-driven AK), 0.9444 (GFS-init AK); 0.9328 down to 0.8697 from 0-3 h out to 168-171 h. | One of the clearest publicly reported scalar AUC benchmarks I found for a gridded lightning-probability system. | Fire (2024): Probabilistic Forecasting of Lightning Strikes |
| 3D weather-radar CNN lightning-strike identifier | Strike-location identification from 3D weather radar volumes; not directly the same task as HOCO's grid verification. | CNN: ROC AUC = 0.798, PR-AUC = 0.534. | Useful because it reports both ROC AUC and PR-AUC on an imbalanced lightning-identification problem. | Frontiers (2021): Lightning Strike Location Identification Based on 3D Weather Radar Data |
| 3D weather-radar RF baseline | Same dataset and task as the CNN row above. | RF: ROC AUC = 0.765, PR-AUC = 0.475. | A useful non-deep-learning comparison point from the same benchmark study. | Frontiers (2021): same benchmark table |
| Western U.S. grid-cell CNN lightning-day model | Individually trained grid-cell models over the western United States; climate-oriented lightning-day prediction rather than event-scale HOCO polygons. | AUC > 0.9 in parts of the interior Southwest; AUC < 0.6 in some Pacific coastal areas. | This is an AMS conference abstract, so it gives a regional AUC range rather than a single published national mean. | AMS 2024 abstract: Using Deep Learning to Predict Cloud-to-Ground Lightning in the Western United States |
| Southern Great Plains RF lightning-occurrence model | Site-focused summertime lightning occurrence around the ARM Southern Great Plains region. | AUC = 0.850. | Preprint result based on ARM variables plus Earth Networks lightning data; useful as another published scalar AUC reference. | EGUsphere preprint: ML investigation of summertime lightning frequency |
| Bangladesh pre-monsoon XGBoost lightning model | Regional ERA5-based pre-monsoon lightning classifier over Bangladesh. | AUC = 0.76. | The abstract also reports POD = 86.01% and accuracy = 71.08%, which helps contextualize the moderate AUC. | JASTP (2025): Prediction of lightning events over Bangladesh |
| MDE-UNet lightning-identification network | 2 km spatial resolution over Guangdong Province in the published experiment. | AUC-ROC = 0.9939 (vs. baseline UNet at 0.9907). | Very high score, but on a lightning-identification task with a different dataset, target, and regional setup than HOCO. | Remote Sensing (2026): MDE-UNet |