Developing an Optimal Spatial Predictive Model for Seabed Sand Content Using Machine Learning, Geostatistics, and Their Hybrid Methods

Seabed sediment predictions at regional and national scales in Australia are mainly based on bathymetry-related variables due to the lack of backscatter-derived data. In this study, we applied random forests (RFs), hybrid methods of RF and geostatistics, and generalized boosted regression modelling (GBM), to seabed sand content point data and acoustic multibeam data and their derived variables, to develop an accurate model to predict seabed sand content at a local scale. We also addressed relevant issues with variable selection. It was found that: (1) backscatter-related variables are more important than bathymetry-related variables for sand predictive modelling; (2) the inclusion of highly correlated predictors can improve predictive accuracy; (3) the rank orders of averaged variable importance (AVI) and accuracy contribution change with input predictors for RF and are not necessarily matched; (4) a knowledge-informed AVI method (KIAVI2) is recommended for RF; (5) the hybrid methods and their averaging can significantly improve predictive accuracy and are recommended; (6) relationships between sand and predictors are non-linear; and (7) variable selection methods for GBM need further study. Accuracy-improved predictions of sand content are generated at high resolution, which provide important baseline information for environmental management and conservation. Citation: Li, J.; Siwabessy, J.; Huang, Z.; Nichol, S. Developing an Optimal Spatial Predictive Model for Seabed Sand Content Using Machine Learning, Geostatistics, and Their Hybrid Methods. Geosciences 2019, 9, 180. https://doi.org/10.3390/geosciences9040180

A new R package for spatial predictive modelling: spm

The accuracy of spatially continuous environmental data, usually generated from point samples using spatial prediction methods, is crucial for evidence-informed environmental management and conservation. Improving the accuracy by identifying the most accurate methods is essential, but also challenging, since the accuracy is often data specific and affected by multiple factors. Recently developed hybrid methods that combine machine learning and geostatistics have shown their advantages in spatial predictive modelling in environmental sciences and have significantly improved predictive accuracy. An R package, ‘spm: Spatial Predictive Modelling’, has been developed to introduce these methods and has recently been released for R users. It not only introduces the hybrid methods for improving predictive accuracy, but can also be used to improve modelling efficiency. This presentation will briefly introduce the developmental history of the novel hybrid geostatistical and machine learning methods in spm. It will introduce spm by covering: 1) spatial predictive methods, 2) new hybrid methods of geostatistical and machine learning methods, 3) assessment of predictive accuracy, 4) applications of spatial predictive models, and 5) relevant functions in spm. It will then demonstrate how to apply some functions in spm to relevant datasets and show the resultant improvement in predictive accuracy and modelling efficiency. Although spm is applied to data in environmental sciences in this presentation, it can be applied to data in other relevant disciplines. Presentation at the 2018 useR! conference

A Critical Review of Spatial Predictive Modeling Process in Environmental Sciences with Reproducible Examples in R

Spatial predictive methods are increasingly being used to generate predictions across various disciplines in environmental sciences. Accuracy of the predictions is critical as they form the basis for environmental management and conservation. Therefore, improving the accuracy by selecting an appropriate method and then developing the most accurate predictive model(s) is essential. However, it is challenging to select an appropriate method and find the most accurate predictive model for a given dataset due to many aspects and multiple factors involved in the modeling process. Many previous studies considered only a portion of these aspects and factors, often leading to sub-optimal or even misleading predictive models. This study evaluates a spatial predictive modeling process, and identifies nine major components for spatial predictive modeling. Each of these nine components is then reviewed, and guidelines for selecting and applying relevant components and developing accurate predictive models are provided. Finally, reproducible examples using spm, an R package, are provided to demonstrate how to select and develop predictive models using machine learning, geostatistics, and their hybrid methods according to predictive accuracy for spatial predictive modeling; reproducible examples are also provided to generate and visualize spatial predictions in environmental sciences. Citation: Li, J. A Critical Review of Spatial Predictive Modeling Process in Environmental Sciences with Reproducible Examples in R. Appl. Sci. 2019, 9, 2048. https://doi.org/10.3390/app9102048

Assessing the accuracy of predictive models for numerical data: Not r nor r2, why not? Then what?

Assessing the accuracy of predictive models is critical because predictive models have been increasingly used across various disciplines and predictive accuracy determines the quality of resultant predictions. Pearson product-moment correlation coefficient (r) and the coefficient of determination (r2) are among the most widely used measures for assessing predictive models for numerical data, although they are argued to be biased, insufficient and misleading. In this study, geometrical graphs were used to illustrate the quantities used in the calculation of r and r2, and simulations were used to demonstrate the behaviour of r and r2 and to compare three accuracy measures under various scenarios. Relevant confusion about r and r2 has been clarified. The calculation of r and r2 is not based on the differences between the predicted and observed values. The existing error measures suffer various limitations and are unable to quantify accuracy. Variance explained by predictive models based on cross-validation (VEcv) is free of these limitations and is a reliable accuracy measure. Legates and McCabe’s efficiency (E1) is also an alternative accuracy measure. The r and r2 do not measure the accuracy and are incorrect accuracy measures. The existing error measures suffer limitations. VEcv and E1 are recommended for assessing the accuracy. The applications of these accuracy measures would encourage accuracy-improved predictive models to be developed to generate predictions for evidence-informed decision-making. Citation: Li J (2017) Assessing the accuracy of predictive models for numerical data: Not r nor r2, why not? Then what? PLoS ONE 12(8): e0183250. https://doi.org/10.1371/journal.pone.0183250
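
The contrast between correlation and accuracy described in this abstract can be illustrated with a minimal sketch. It assumes the usual definitions, VEcv = (1 - SSE/SST) x 100 and E1 = (1 - sum|obs - pred| / sum|obs - mean(obs)|) x 100, where the predictions would normally come from cross-validation. The function names and the toy data are hypothetical and are not taken from the paper or from any package.

```python
from statistics import mean


def vecv(observed, predicted):
    """Variance explained (%): 100 * (1 - SSE / SST).

    In practice `predicted` should hold cross-validated predictions.
    """
    obs_mean = mean(observed)
    ss_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - obs_mean) ** 2 for o in observed)
    return (1 - ss_error / ss_total) * 100


def e1(observed, predicted):
    """Legates and McCabe's efficiency (%), based on absolute differences."""
    obs_mean = mean(observed)
    abs_error = sum(abs(o - p) for o, p in zip(observed, predicted))
    abs_total = sum(abs(o - obs_mean) for o in observed)
    return (1 - abs_error / abs_total) * 100


def pearson_r(x, y):
    """Pearson correlation; note that it never uses the differences x - y."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical toy data: predictions are a linear function of the
# observations, so they correlate perfectly but are far from them.
observed = [10.0, 20.0, 30.0, 40.0, 50.0]
biased = [2 * o + 5 for o in observed]

r = pearson_r(observed, biased)
print(f"r    = {r:.3f}")
print(f"r2   = {r ** 2:.3f}")
print(f"VEcv = {vecv(observed, biased):.1f}%")
print(f"E1   = {e1(observed, biased):.1f}%")
```

In this toy case r and r2 are at their maximum while VEcv and E1 are negative, meaning the predictions are worse than simply using the mean of the observations.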

Assessing the accuracy of spatial predictive models in the environmental sciences

Spatial predictive models have been increasingly employed to generate spatial predictions for environmental management and conservation, in parallel with advances in data acquisition, data processing and computing capabilities. The accuracy of predictive models and their predictions is crucial to evidence-informed decision making and policy. However, the accuracy of predictive models in general is unknown and is often assessed using error measures or even correlation measures. In this study, we clarified relevant issues about variance explained for predictive models (VEcv), established the relationships between commonly used predictive error measures like root mean square error (RMSE) and VEcv, unified these measures under VEcv, discovered that VEcv is independent of unit/scale and data variation, quantified the relationships between these error measures and data variation, and quantified the relationship between relative root mean square error (RRMSE) and relative mean absolute error (RMAE). We then assessed the performance of predictive models in the environmental sciences based on about 300 previously published applications and classified the predictive models based on their performance. This study provided a tool to directly compare the accuracy of predictive models for data with different unit/scale and variation, and established a cross-disciplinary context and benchmark for assessing predictive models in environmental sciences and other disciplines. Recommendations for future studies were provided to objectively assess the performance of predictive models and make the accuracy of predictive models for different disciplines directly comparable. Abstract presented at the 23rd Australian Statistical Conference 2016
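
The unit independence of VEcv mentioned above can be checked with a small sketch. It assumes VEcv = (1 - SSE/SST) x 100 and RMSE = sqrt(SSE/n), from which VEcv = (1 - n x RMSE^2 / SST) x 100 follows algebraically. This conversion is derived here from those assumed definitions and is not quoted from the abstract; the function names and the data are hypothetical.

```python
from statistics import mean


def rmse(observed, predicted):
    """Root mean square error, in the units of the data."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5


def vecv(observed, predicted):
    """Variance explained (%): 100 * (1 - SSE / SST)."""
    obs_mean = mean(observed)
    ss_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - obs_mean) ** 2 for o in observed)
    return (1 - ss_error / ss_total) * 100


def vecv_from_rmse(rmse_value, observed):
    """Recover VEcv from a reported RMSE and the observed values.

    Uses SSE = n * RMSE**2, which follows from the RMSE definition above.
    """
    n = len(observed)
    obs_mean = mean(observed)
    ss_total = sum((o - obs_mean) ** 2 for o in observed)
    return (1 - n * rmse_value ** 2 / ss_total) * 100


# Hypothetical observations and predictions in metres, then in centimetres.
observed_m = [12.0, 15.0, 11.0, 18.0, 14.0]
predicted_m = [13.0, 14.0, 12.0, 16.0, 15.0]
observed_cm = [100 * v for v in observed_m]
predicted_cm = [100 * v for v in predicted_m]

print("RMSE (m): ", rmse(observed_m, predicted_m))
print("RMSE (cm):", rmse(observed_cm, predicted_cm))
print("VEcv (m): ", vecv(observed_m, predicted_m))
print("VEcv (cm):", vecv(observed_cm, predicted_cm))
```

RMSE scales with the unit of measurement, whereas VEcv is the same in both unit systems, which is why it can be compared across datasets with different units.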

Application of random forest, generalised linear model and their hybrid methods with geostatistical techniques to count data: Predicting sponge species richness

Spatial distribution of sponge species richness (SSR) and its relationship with environment are important for marine ecosystem management, but they are either unavailable or unknown. Hence we applied random forest (RF), generalised linear model (GLM) and their hybrid methods with geostatistical techniques to SSR data by addressing relevant issues with variable selection and model selection. It was found that: 1) of five variable selection methods, one is suitable for selecting optimal RF predictive models; 2) traditional model selection methods are unsuitable for identifying GLM predictive models and joint application of RF and AIC can select accuracy-improved models; 3) highly correlated predictors may improve RF predictive accuracy; 4) hybrid methods for RF can accurately predict count data; and 5) effects of model averaging are method-dependent. This study depicted the non-linear relationships of SSR and predictors, generated spatial distribution of SSR with high accuracy and revealed the association of high SSR with hard seabed features. Citation: Jin Li, Belinda Alvarez, Justy Siwabessy, Maggie Tran, Zhi Huang, Rachel Przeslawski, Lynda Radke, Floyd Howard, Scott Nichol, Application of random forest, generalised linear model and their hybrid methods with geostatistical techniques to count data: Predicting sponge species richness, Environmental Modelling & Software, Volume 97, 2017, Pages 112-129, https://doi.org/10.1016/j.envsoft.2017.07.016