## Keynote: Trevor Bailey

### An overview of statistical approaches to structuring complex correlations in multivariate spatial data

##### Trevor Bailey, University of Exeter

This talk gives an overview of statistical approaches to analysing and understanding multivariate spatial data. The need for such analysis arises regularly, not only in the earth sciences (including climate and environmental applications), but also in many other disciplines such as demography, ecology, agriculture, biology, epidemiology, public health and indeed some areas of economics. The defining feature of such data is the availability of a vector of measurements on a set of different and potentially related response variables at each spatial location in the region studied, often together with an associated vector of potential explanatory variables measured at each of these sites. Such multivariate spatial data may exhibit not only correlations between variables at each site, but also spatial autocorrelation within each variable, and spatial cross-correlation between variables at neighbouring sites. Any analysis or modelling must therefore allow for dependency structures that are both complex and inevitably confounded in the observed data. Moreover, if repeat observations are present on the response vector, they often refer to different points in time and add temporal autocorrelations or cross-correlations to the already complex mix of potential correlation structures.

In recent years a range of different statistical approaches has been proposed for handling such complexities and the literature on the subject is now fairly extensive. One broad distinction in these approaches relates to the spatial indexing of the data under investigation. Leaving aside point process data, where the spatial indexing itself is a random set, other forms of spatial data may generally be categorised into that which is continuously indexed across space (conceptually the phenomena under study have values at all points in the region studied) and that for which the spatial indexing is restricted to a fixed discrete set of locations in space (conceptually the phenomena under study only have values on a predefined regular or irregular lattice of locations). The former are often referred to as 'geostatistical' or 'point-level' data (examples being point measurements relating to pollution, atmospheric conditions etc.) whilst the latter are often termed 'areal' or 'lattice' data (examples being disease counts or economic measures in small areas etc.). Whilst concepts and methods overlap, there are also important differences in the broad thrust of the statistical approaches used to analyse and model 'geostatistical' as opposed to 'lattice' data. For example, in the former a direct approach to modelling the joint probability distribution of the responses is typically adopted, whereas in the 'lattice' case this joint distribution is usually formulated indirectly through the specification of the conditional distribution in any area given values in neighbouring areas.
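The contrast between the direct and the conditional specification can be written out for the Gaussian case. The following is only a sketch under stated assumptions: the symbols $\mu$, $\Sigma(\theta)$, $C$, $b_{ij}$ and $\kappa_i$ are introduced here for illustration and do not come from the talk.

$$
\text{geostatistical (direct):}\quad
\mathbf{Y} \sim N\bigl(\boldsymbol{\mu}, \Sigma(\theta)\bigr),
\qquad \Sigma_{ij}(\theta) = C(s_i - s_j;\, \theta)
$$

$$
\text{lattice (conditional):}\quad
Y_i \mid \mathbf{y}_{-i} \sim N\Bigl(\mu_i + \sum_{j \neq i} b_{ij}\,(y_j - \mu_j),\; \kappa_i\Bigr)
$$

In the lattice case the conditionals imply the joint distribution $\mathbf{Y} \sim N\bigl(\boldsymbol{\mu}, (I - B)^{-1} K\bigr)$, with $B = (b_{ij})$ and $K = \operatorname{diag}(\kappa_i)$, provided that $b_{ij}\kappa_j = b_{ji}\kappa_i$ and that $(I - B)^{-1} K$ is positive definite.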

This talk will mostly focus on a review of statistical approaches and methods appropriate to the analysis of multivariate 'geostatistical' data. One line of work in that area seeks simplification of the multivariate structure by preliminary extraction of underlying components or factors that exhibit 'useful' spatial correlation or cross-correlation properties. This approach can be split into two categories: 'manifest variable methods', in which the factors are defined as simple linear combinations of the observed (i.e. 'manifest') variables having particular correlation or spatial autocorrelation/cross-correlation behaviour, and 'latent variable methods', in which the factors are defined as linear combinations of unobserved (i.e. 'latent') variables plus an associated error term. The former category is thus purely descriptive, while the latter involves model formulation and associated inference procedures. Such latent structure models naturally lead into discussion of the more general subject of random field models for point-level multivariate spatial responses. Recent developments in this area typically incorporate spatial structure using mixtures of Gaussian processes in either single-stage or hierarchical models and often involve latent structure. A range of different approaches has been proposed to construct valid but suitably flexible classes of cross-covariances. Inference is now typically implemented in a Bayesian framework via Markov chain Monte Carlo (MCMC) methodology. This latter strand of work culminates in generalised linear models for non-Gaussian multivariate spatial responses. The talk will review key aspects of the above two approaches to the analysis of multivariate 'geostatistical' data. The aim is to provide an overview of the scope of the different approaches and highlight explicit or implicit links between them wherever possible, rather than give a comprehensive comparative review.
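As an illustration of what a "valid" cross-covariance construction can look like, here is a minimal Python sketch of one well-known option, the linear model of coregionalisation. The abstract does not name this construction, so it is an assumption chosen for illustration. The responses are written as linear combinations of independent latent Gaussian processes, which makes the joint covariance positive semi-definite by construction. The function names, the exponential correlation function, the mixing matrix `A` and the length scales are all hypothetical choices.

```python
import numpy as np


def exp_corr(dists, length_scale):
    """Exponential correlation function rho(d) = exp(-d / length_scale)."""
    return np.exp(-dists / length_scale)


def lmc_covariance(coords, A, length_scales):
    """Joint covariance of p variables at n sites under an assumed LMC.

    Model: Y(s) = A w(s), where w_1, ..., w_q are independent,
    unit-variance latent Gaussian processes. The joint covariance is
    sum_k (a_k a_k^T) kron R_k, ordered variable by variable.
    """
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # n x n distance matrix
    p, n = A.shape[0], coords.shape[0]
    cov = np.zeros((p * n, p * n))
    for k, length_scale in enumerate(length_scales):
        a_k = A[:, k]  # loadings of all variables on latent process k
        cov += np.kron(np.outer(a_k, a_k), exp_corr(dists, length_scale))
    return cov


# Illustrative inputs: 30 random sites, 2 variables, 2 latent processes.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(30, 2))
A = np.array([[1.0, 0.0],
              [0.7, 0.5]])
cov = lmc_covariance(coords, A, length_scales=[0.2, 0.5])
```

In this sketch the co-located covariance between the two variables is the corresponding entry of `A @ A.T`, while the differing length scales give each variable its own mix of short-range and long-range spatial dependence.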

Time will not permit much consideration in this talk of how the methods for multivariate 'geostatistical' data relate to those for regularly or irregularly based multivariate 'lattice' data. However, there may be scope for briefly touching on the use of latent structure and conditional autoregressive (CAR) or multivariate conditional autoregressive (MCAR) processes in such applications, with covariance structures based upon simple adjacency properties or parametric functions of spatial separation. Recent forms of such generalised hierarchical CAR models again use Bayesian and MCMC methods.
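To make the idea of a covariance structure "based upon simple adjacency properties" concrete, the following Python sketch builds the precision matrix of a proper CAR model on a small regular grid. It is an illustrative assumption, not the speaker's formulation: the rook-neighbour grid, the parameter names `rho` and `tau`, and the helper functions are all hypothetical.

```python
import numpy as np


def rook_adjacency(nrows, ncols):
    """Binary adjacency matrix for a grid; cells sharing an edge are neighbours."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if c + 1 < ncols:  # neighbour to the right
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < nrows:  # neighbour below
                W[i, i + ncols] = W[i + ncols, i] = 1.0
    return W


def car_precision(W, rho, tau):
    """Precision matrix Q = tau * (D - rho * W) of a proper CAR model.

    D is diagonal with the neighbour counts. With |rho| < 1 and every
    site having at least one neighbour, Q is positive definite.
    """
    D = np.diag(W.sum(axis=1))
    return tau * (D - rho * W)


def conditional_mean(W, x, i, rho):
    """Zero-mean CAR conditional mean at site i: rho times the neighbour average."""
    return rho * (W[i] @ x) / W[i].sum()


W = rook_adjacency(3, 4)
Q = car_precision(W, rho=0.9, tau=2.0)
x = np.arange(12, dtype=float)
m0 = conditional_mean(W, x, 0, rho=0.9)
```

Only the adjacency matrix and two scalar parameters are needed here, which is what makes such conditional specifications attractive for areal data. The implied joint covariance is the inverse of `Q` and is never specified directly.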