Statistical and AI methods to detect school-level outbreaks from routinely collected absence data

This PhD opportunity is being offered as part of the LSTM and Lancaster University Doctoral Training Partnership. Find out more about the studentships and how to apply

Abstract Research during the SARS-CoV-2 pandemic demonstrated the utility of novel data sources for understanding pathogen transmission in and between schools. In particular, school-level absence data acted as a complementary SARS-CoV-2 surveillance system and have been used to parameterise and calibrate transmission-dynamic models of school interventions. Because school absence data are already routinely collected in the UK, there is underutilised potential to harness them for early outbreak detection and control in future.

Several statistical approaches can be used to predict infectious disease cases based on time-series data from small geographical areas (1-3). Non-linear random effects models are a useful statistical approach that allows spatially correlated random effects to be introduced. These correlated random effects need not be based on spatial distance alone; in the context of schools, correlations could also depend upon the connectedness of schools within the school-household network. The use of machine learning (ML) and artificial intelligence (AI) methodology in epidemiological surveillance and modelling is in its infancy and has not yet been applied to school-level transmission control. Such approaches have the potential to outperform traditional statistical approaches because of their ability to capture complex non-linear relationships from high-dimensional inputs.

Early outbreak detection could transform the way we control transmission in schools. During the SARS-CoV-2 pandemic, schools were given rigid guidance at either the regional or national level; advice to schools was independent of a school’s specific epidemiological context. Approaches that identify which schools are likely to experience a large wave of infections if control measures are not implemented could allow for targeted interventions that minimise school-level transmission and disruption simultaneously.
For outbreaks of other pathogens, including measles and scarlet fever, schools are currently advised to contact their local UK Health Security Agency (UKHSA) health protection team if they have ‘a higher than previously experienced and/or rapidly increasing number of absences due to the same infection’ (4). Statistical/AI approaches may be able to detect such rapid increases earlier and more consistently than current practice.

Aims: This PhD project will aim to:
1. Develop methods capable of identifying which schools are in the early stages of an outbreak from routinely collected data in schools, using the SARS-CoV-2 pandemic as a case study.
2. Understand whether ML/AI approaches outperform ‘traditional’ statistical approaches in this task.
3. Use transmission modelling to demonstrate the efficacy of this approach in response to future infection outbreaks.
Where does this project lie in the translational pathway? T1 - Basic Research
Methodological Aspects The candidate will develop statistical and AI approaches to detect school-level outbreaks, initially using school-level data collected by the Department for Education (DfE) during the SARS-CoV-2 pandemic as a case study. These data, combined with other relevant data, could potentially be used to predict which schools are likely to be in the early stages of an outbreak and to go on to have high levels of absence in the near future. Initially, the candidate would use non-linear random effects models, but will have the flexibility to explore alternative approaches.

The project would then involve exploring whether temporal graph convolutional networks (GCNs), or other AI approaches, outperform non-linear random effects models in the task of predicting future cases at the school level. A recent study used temporal GCNs to forecast SARS-CoV-2 infections in German subregions, outperforming other considered approaches (5). GCNs are an extension of convolutional neural networks, which use convolutional filters to aggregate information over adjacent areas in a grid. In a GCN, the convolution is applied to the local neighbourhood of nodes in a graph, which in the context of schools could capture the connectedness of schools via shared siblings and spatial proximity. AI models would be trained on i) current and previous pupil/staff absences in a school, ii) current and previous pupil/staff absences in 'nearby' schools, iii) epidemiological surveillance data and iv) school-level and other sociodemographic factors, with the goal of predicting future absence levels in schools.

Provided that outbreaks can be successfully predicted when absences are tied to a specific infection (as during the COVID-19 pandemic), the candidate could go on to explore whether such methods could be used to predict ‘outbreaks’ or impending periods of high absence when the cause of absence is less specific (e.g. when ‘illness-related absences’ are considered).
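As an illustration of the graph convolution idea described above, the following is a minimal sketch of a single GCN layer in Python with NumPy. It is not the method of the cited study or a design for this project: the four-school network, the absence rates, the random weights and the function names are all hypothetical, and the symmetric adjacency normalisation is one common choice among several. The temporal component of a temporal GCN is omitted.

```python
import numpy as np


def normalise_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, one common GCN normalisation."""
    a_hat = adj + np.eye(adj.shape[0])  # self-loops: each school keeps its own signal
    deg = a_hat.sum(axis=1)             # always >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, features, weights):
    """One graph convolution: aggregate over neighbours, transform, apply ReLU."""
    return np.maximum(a_norm @ features @ weights, 0.0)


# Hypothetical network of 4 schools. An edge could represent shared siblings
# or spatial proximity; school 3 is isolated. All values are made up.
adj = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
    ],
    dtype=float,
)

# Hypothetical node features: absence rate this week and last week.
features = np.array(
    [
        [0.10, 0.04],  # school 0: absences rising sharply
        [0.03, 0.03],
        [0.02, 0.02],
        [0.02, 0.02],
    ]
)

# Untrained random weights, standing in for parameters learned from data.
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 3))

a_norm = normalise_adjacency(adj)
aggregated = a_norm @ features               # neighbourhood-weighted absence signal
hidden = gcn_layer(a_norm, features, weights)  # one hidden representation per school
```

In this toy example, the aggregated signal for school 1 is pulled upwards by the rising absences in its neighbour, school 0, while the isolated school 3 retains only its own values. A trained model would learn the weights from data and would typically stack such layers with a component that handles the time dimension.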
Transmission-dynamic models of transmission within and between schools could then be used to assess whether early outbreak detection could help mitigate infections and absences in schools in a future epidemic context. A range of modelling approaches could be used, including detailed agent-based simulation of transmission in schools.
Expected Outputs The project is anticipated to lead to three publications in reputable journals, corresponding to the three aforementioned aims.
Training Opportunities Training in statistical methods, programming, and epidemiological modelling will be delivered through the Health Data Science MSc programme taught at Lancaster. Extra support for machine learning techniques will also be available through Lancaster’s research hub for Mathematics for AI in Real-world Systems (MARS). Further training opportunities may be explored at IDDconf conference workshops or similar, and through links with other institutions as part of the JUNIPER consortium.
Skills Required The project would be suitable for a candidate with a very strong statistical background. The candidate need not already be experienced in ML/AI methodology, but should have the ability to learn these methods.
Subject Areas Health Policy and Health Systems Research
Key Publications associated with this project

1. Paul, Michaela, and Leonhard Held. "Predictive assessment of a non-linear random effects model for multivariate time series of infectious disease counts." Statistics in Medicine 30.10 (2011): 1118-1136.

2. Held, Leonhard, Sebastian Meyer, and Johannes Bracher. "Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture." Statistics in Medicine 36.22 (2017): 3443-3460.

3. Mellor, Jonathon, et al. "Forecasting influenza hospital admissions within English sub-regions using hierarchical generalised additive models." Communications Medicine 3.1 (2023): 190.

4. UKHSA, https://www.gov.uk/government/publications/health-protection-in-schools-and-other-childcare-facilities

5. Fritz, Cornelius, Emilio Dorigatti, and David Rügamer. "Combining graph neural networks and spatio-temporal disease models to improve the prediction of weekly COVID-19 cases in Germany." Scientific Reports 12.1 (2022): 3930.