Data fusion and modelling are an effective way to add value to existing datasets of sensor measurements. This can be achieved by aggregating datasets or by inferring new data from relationships and patterns within existing datasets. Generic fusion provides data fusion in a way that separates configuration and data from the algorithmic processing itself, allowing algorithms and pre/post-processing to be re-used between datasets. This re-use lowers the cost of developing, configuring and deploying fusion services.
Data fusion and modelling techniques are used to integrate observation data, contextual data and phenomenological models from different sources in order to obtain new environmental information where and when sensor measurements are not available. Observation sensors may include in-situ, airborne and space-borne types, while models may include deterministic and stochastic models.
In addition, data fusion numerical techniques provide a framework for integrating information uncertainties which are generated from sensor measurements and models with various inaccuracies.
In SANY we have developed four distinct types of reusable fusion services:
The following video demonstrations of SANY fusion services are available:
Innovation and impact:
These services have been implemented for environmental decision-support applications by various project partners in the context of the SensorSA. They were then validated in applications across multiple risk domains. These included pilot applications specialising in the prediction of the microbial risk of exceedance in bathing waters in the Gulf of Gdansk (Poland), atmospheric pollution risks and false alarms in the City of Linz (Austria), and underground subsidence risks in the City of Toulon (France).
Causal fusion refers to the indirect prediction of a target variable using a selection of explanatory variables. A number of causal fusion methods have been developed for use within the SANY project. In most cases, historical target and explanatory variables, along with real-time explanatory variables from in-situ sensors held in OGC compliant SOSs or from spatial fusion processes, are accessed via an OGC compliant WPS. The resultant predictions are supplied to an OGC compliant SOS and/or viewed through a web interface. Two types of causal fusion algorithm were used in SANY: multiple linear regression and neural networks.
Linear regression is used to construct a prediction formula for the target variable, given values of explanatory variables, by minimizing the sum of squared errors of linear fitting. Before constructing the linear regression formula, each explanatory variable is tested in order to determine whether a linear relationship to the target variable exists. The target variable is then predicted as a linear combination of the explanatory variables.
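The least-squares fit described above can be illustrated with a minimal sketch. It assumes NumPy and synthetic data; the function names `fit_linear_regression` and `predict_linear` are hypothetical and are not taken from the SANY services. The preliminary test of each explanatory variable for a linear relationship is not shown.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit of y ~ b0 + X @ b. Returns [b0, b1, ..., bk]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    # lstsq minimises the sum of squared errors ||A @ coef - y||^2.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predict the target as a linear combination of the explanatory variables."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]

# Synthetic example (illustrative values only): two explanatory variables.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

coef = fit_linear_regression(X, y)
y_new = predict_linear(coef, np.array([[5.0, 1.0]]))
```

Because the synthetic target is an exact linear function of the inputs, the fitted coefficients recover the intercept and slopes used to generate it.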
Linear regression is one of the most widely used modelling methods because of its effectiveness and completeness. Although the majority of processes are nonlinear in nature, many of them are well-approximated by linear models.
Linear regression estimates the unknown parameters and assesses whether each of them is statistically significant, which often has a clear interpretation for the scientific question at hand. It also assesses whether the model as a whole is statistically significant. The resulting model can be used to predict the target variable together with confidence intervals.
Neural networks are mathematical structures which are analogous to biological neural networks. The artificial neurons are set in layers and interconnected with each other. The neural networks are capable of processing non-linear statistical data and modelling complex relationships between inputs and outputs.
The most basic radial basis network consists of three separate layers. The input layer holds the explanatory variables. The second layer is a hidden layer of high dimension. The output layer is the response of the network. The network topology is determined by the number of hidden units. In this application the network has a single response.
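The three-layer structure can be sketched as follows. This is an illustration only, assuming NumPy, Gaussian basis functions with fixed centres and a common width, and an output layer fitted by least squares; the source does not state how the SANY networks were trained, and all names here are hypothetical.

```python
import numpy as np

def rbf_design(X, centres, width):
    """Hidden layer: one Gaussian unit per centre, evaluated for each input row."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf(X, y, centres, width):
    """Fit the linear output layer (hidden-unit weights plus a bias) by least squares."""
    H = np.column_stack([rbf_design(X, centres, width), np.ones(len(X))])
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

def rbf_predict(X, centres, width, w):
    """Single response: weighted sum of the hidden-unit activations plus the bias."""
    return rbf_design(X, centres, width) @ w[:-1] + w[-1]

# Synthetic non-linear relationship between one explanatory variable and the target.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3.0, 3.0, size=(200, 1))
y_train = np.sin(X_train[:, 0])

# The number of hidden units (here 10) determines the network topology.
centres = np.linspace(-3.0, 3.0, 10).reshape(-1, 1)
w = train_rbf(X_train, y_train, centres, width=1.0)

X_test = np.linspace(-2.5, 2.5, 20).reshape(-1, 1)
y_pred = rbf_predict(X_test, centres, 1.0, w)
```

Fixing the centres in advance keeps training a linear problem; other choices (for example centres selected by clustering) are equally plausible.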
Neural Networks are generally considered a ‘black box’ approach since the model parameters are hard to interpret in terms of physical meanings.
Spatial data fusion services provide spatial trends of environmental parameters using observation data which are collated from a network of in situ sensors. This leads to the prediction of environmental parameters in areas where sensing is not available. The computation and analysis of spatial data uncertainties can also lead to identifying the areas where new sensor observations are required. Three types of spatial fusion algorithms have been developed and tested in SANY:
Kriging
Kriging is a method of spatial interpolation that predicts values of an environmental parameter from observations of the same parameter at a finite number of sensor locations. The spatial predictions are weighted averages of the observed parameter values, with weights that depend on the distances between the sensor locations and the prediction point. The weights in Kriging are computed so that the variance of the prediction error is minimised. In this sense, Kriging is often called Optimal Interpolation.
The dependency of the interpolation weights on the distances between sensors is expressed through a variogram. The Kriging variogram essentially describes the variance of the difference between two distinct spatial observations. Realistic modelling of the variogram should be based on reasonably accurate observations and a good understanding of the dominant environmental processes that influence the spatial and temporal trends of the environmental parameter under study. This is of paramount importance for good Kriging results.
The numerical procedures in Kriging additionally involve the determination of measures of uncertainty when estimating environmental parameters in a spatial domain of interest. The approach leads to a good assessment of how observation sensors should be spatially distributed for achieving minimum uncertainty in spatial fusion.
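As an illustration of how the variogram determines both the weights and the uncertainty measure, the following is a minimal ordinary Kriging sketch. It assumes NumPy, an exponential variogram model with made-up parameters, and hypothetical function names; it does not include the SANY extensions and is not the project's implementation.

```python
import numpy as np

def exponential_variogram(h, nugget=0.0, sill=1.0, range_=1.0):
    """Assumed variogram model: variance of the difference at separation h."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + (sill - nugget) * (1.0 - np.exp(-h / range_))
    return np.where(h > 0, gamma, 0.0)  # gamma(0) = 0 by definition

def ordinary_kriging(coords, values, target, variogram):
    """Return (estimate, kriging variance, weights) at a single target point."""
    n = len(values)
    # Pairwise sensor distances and sensor-to-target distances.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    d0 = np.linalg.norm(coords - target, axis=1)
    # Kriging system with a Lagrange multiplier forcing the weights to sum to 1.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(d)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(d0)
    sol = np.linalg.solve(A, b)
    weights, mu = sol[:n], sol[n]
    estimate = weights @ values        # weighted average of the observations
    variance = weights @ b[:n] + mu    # minimised prediction-error variance
    return estimate, variance, weights

# Four sensors on a unit square (illustrative values), prediction at the centre.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([1.0, 2.0, 3.0, 4.0])
est, var, wts = ordinary_kriging(coords, values, np.array([0.5, 0.5]),
                                 exponential_variogram)
```

Evaluating the returned variance over a grid of target points gives a map of where the interpolation is least certain, which is the basis for deciding where additional sensors would help most.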
To support Kriging we have added elevation correction, periodic variable support (e.g. wind direction), automated variogram selection and multi-region variogram support.
Bayesian Maximum Entropy
Data fusion methods based upon Bayesian Maximum Entropy (BME) are able to consider soft sensor data, e.g. the sensor value lies in an interval, and additional phenomenological knowledge in the form of models. The results are statistics encompassing the uncertainty of the spatial/temporal interpolation given the uncertainty of the available information.
The overall BME fusion method is structured in three stages:
If the general knowledge G comprises the mean and covariance, and if the site-specific knowledge S includes only hard data, then the BME estimate coincides with the simple Kriging estimate. Similarly, if G is limited to the variogram and if S includes only hard data, then the BME estimate coincides with the ordinary Kriging estimate.
When applying the BME method in the SensorSA, the knowledge S is represented as an observation collection described with the O&M model, with uncertainty information encoded in UncertML. The map resulting from the posterior stage is represented as a coverage with associated uncertainty information.
Socio-economic spatial correlation
This spatial correlation algorithm fuses information from an economic impact database about buildings, cracking and the potential economic impact of this cracking with ground displacement sensor data in the Barcelona region. Each identified building is correlated with the nearest ground displacement measurement. A displacement threshold limit, determined via temporal analysis of the region's historical correlation between displacements and eventual cracking, is used to flag buildings at different likelihoods of cracking. The output is a list of buildings, the spatially correlated displacement value for each building, economic information such as tax and rental income, and a 'likely cracking' warning flag based on the ground displacement threshold limit.
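The nearest-measurement matching and threshold flagging can be sketched as below. This is a simplified illustration, assuming NumPy, planar coordinates, a single threshold rather than several likelihood levels, and invented values; the function name and data layout are hypothetical and the economic attributes are omitted.

```python
import numpy as np

def flag_buildings(building_xy, sensor_xy, displacement, threshold):
    """Match each building to its nearest displacement measurement and flag it."""
    # Distance from every building to every displacement measurement point.
    d = np.linalg.norm(building_xy[:, None, :] - sensor_xy[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    matched = displacement[nearest]
    # 'Likely cracking' when the displacement magnitude reaches the threshold.
    flags = np.abs(matched) >= threshold
    return matched, flags

# Illustrative data: coordinates in metres, displacements in millimetres.
building_xy = np.array([[0.0, 0.0], [10.0, 10.0]])
sensor_xy = np.array([[1.0, 0.0], [9.0, 9.0], [5.0, 5.0]])
displacement = np.array([-2.0, -12.0, -6.0])

matched, flags = flag_buildings(building_xy, sensor_xy, displacement, threshold=10.0)
```

In practice the threshold would come from the temporal analysis of historical displacements and cracking described above, not from a fixed constant.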
Temporal fusion can be used to predict the target variable directly from past observations of the target variable itself. The essential difference between temporal fusion and causal fusion is that temporal fusion takes the internal structure of the data into account. In the SensorSA, time-series data from in-situ sensors are obtained from SOS instances. The resultant predictions from the temporal fusion service are supplied to an OGC compliant SOS instance via a ‘virtual sensor’ controlled by an SPS instance.
Methods for time series analysis are often divided into two domains: the time domain and the frequency domain. The frequency-domain approach is more suited to exploratory analysis. The time-domain approach is discussed here. A time series usually contains some typical patterns:
Apart from the above regular patterns, an irregular component in the time series reflects non-systematic movements in the process.
The regular patterns can be identified through exploratory analysis or empirical knowledge of the process. At this stage, one must decide the order of the trend (i.e. whether it is a random walk or a local linear trend), whether a seasonal component exists and what its period is, and the order of the autoregressive and moving average components.
Two temporal fusion algorithms have been investigated in SANY: state-space modelling and Kalman filters.
Once data patterns are identified, models for time series can be formed using an autoregressive integrated moving average model or state-space form.
The state-space form has enormous power to handle a wide range of time series models.
The basic structures such as trend and seasonal cycles are expressed explicitly in the model and are easy to interpret. The state-space form consists of a measurement equation and a transition equation. The transition equation contains the dynamics of the system under investigation and generates state variables. The measurement equation relates observable variables to state variables.
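For illustration, a linear Gaussian state-space form can be written as follows. The notation (y_t, alpha_t, Z, T, H, Q) is a common textbook convention assumed here, not taken from the SANY documentation, and the system matrices are taken as time-invariant for simplicity.

```latex
% Measurement equation: relates the observation y_t to the state vector \alpha_t
y_t = Z \alpha_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, H)

% Transition equation: dynamics of the state vector
\alpha_{t+1} = T \alpha_t + \eta_t, \qquad \eta_t \sim N(0, Q)

% Example: local linear trend, with level \mu_t and slope \nu_t
\alpha_t = \begin{pmatrix} \mu_t \\ \nu_t \end{pmatrix}, \quad
Z = \begin{pmatrix} 1 & 0 \end{pmatrix}, \quad
T = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad
Q = \begin{pmatrix} \sigma_\xi^2 & 0 \\ 0 & \sigma_\zeta^2 \end{pmatrix}, \quad
H = \sigma_\varepsilon^2
```

Setting the slope variance and the initial slope to zero reduces this example to a random walk plus noise, which shows how the trend order is encoded explicitly in the model matrices.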
After the time series has been modelled and put in state-space form, the Kalman filter algorithm may be used to produce predictions and smoothed estimates of the state vector.
The Kalman filter is an important algorithm in many applications since it facilitates online estimation and enables the estimation and prediction of the state vector to be continually updated as new observations become available.
The Kalman filter is derived on the assumption that the disturbance and initial state vector are normally distributed. It gives optimal estimation of the state vector in the sense that it minimizes the mean square error within the class of linear estimators. It consists of two steps: prediction and update. The prediction step predicts the state variable and the prediction error to the next time step using the transition equation. The update step modifies the prediction once the observation at the current time step becomes available.
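The prediction and update steps can be sketched for the simplest case, a local level (random walk plus noise) model with a scalar state. This is a minimal illustration assuming NumPy, known noise variances and a diffuse-like initial variance; the function name and parameter values are hypothetical and do not reflect the SANY implementation.

```python
import numpy as np

def kalman_filter_local_level(y, sigma2_eps, sigma2_eta, a0=0.0, p0=1e7):
    """Filter y_t = level_t + eps_t, level_t = level_{t-1} + eta_t.

    Returns the filtered state means and variances at each time step.
    Missing observations (NaN) skip the update step.
    """
    y = np.asarray(y, dtype=float)
    a_filt = np.empty(len(y))
    p_filt = np.empty(len(y))
    a, p = a0, p0
    for t, obs in enumerate(y):
        # Prediction step: propagate state and its error variance one step ahead.
        a_pred = a
        p_pred = p + sigma2_eta
        if np.isnan(obs):
            # No observation available: keep the prediction.
            a, p = a_pred, p_pred
        else:
            # Update step: correct the prediction with the new observation.
            v = obs - a_pred            # one-step prediction error
            f = p_pred + sigma2_eps     # variance of the prediction error
            k = p_pred / f              # Kalman gain
            a = a_pred + k * v
            p = (1.0 - k) * p_pred
        a_filt[t], p_filt[t] = a, p
    return a_filt, p_filt

# Illustrative series with one missing observation.
y = np.array([5.0, 5.0, np.nan, 5.0, 5.0])
a_filt, p_filt = kalman_filter_local_level(y, sigma2_eps=1.0, sigma2_eta=0.1)
```

When an observation is missing, only the prediction step runs, so the state variance grows until the next observation arrives.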
The Kalman filter also facilitates maximum likelihood estimation of the unknown parameters in the model. It enables the likelihood function to be calculated via the prediction error decomposition. The maximum likelihood estimation can be carried out by direct numerical optimisation or by an Expectation Maximization (EM) algorithm. The EM algorithm takes a simple form compared with direct numerical optimisation, and it never decreases the likelihood from one iteration to the next. The EM algorithm also tolerates missing observations and has a natural procedure to adjust the estimators.
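For a univariate series of n observations, the prediction error decomposition of the Gaussian log-likelihood can be written as below. The symbols are assumptions following a common textbook convention: v_t is the one-step prediction error produced by the filter, F_t its variance, and theta the vector of unknown parameters.

```latex
\log L(\theta) = -\frac{n}{2} \log 2\pi
  - \frac{1}{2} \sum_{t=1}^{n} \left( \log F_t + \frac{v_t^2}{F_t} \right),
\qquad
v_t = y_t - \hat{y}_{t \mid t-1}, \quad
F_t = \operatorname{Var}(v_t)
```

Both v_t and F_t are by-products of the Kalman filter recursion, so the likelihood for a given theta is available after a single pass of the filter through the data.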