# The development of GAP probability maps

Continuous coverage of geospatial information is very useful for developing predictive water quality models. The first step is an initial selection of geospatial data that could be correlated with geogenic groundwater contamination (the predictor, or independent, variables). This is a critical step that requires an understanding of the processes causing contaminant release, as described above. To aid in deciding which geospatial layers to use, a summary of release processes and the data typically associated with them has been compiled for arsenic (Table A) and fluoride (Table B).

The second step is to select the area for which a map is to be developed and to divide it into square pixels. The pixel size depends on the resolution of the available data (1 km² in GAP). The average of the measured concentrations within each pixel is used for model calibration. The measured data (dependent variable) are then analysed iteratively against the predictor variables to find the degree of positive or negative correlation between them and to determine the model coefficients.
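
The per-pixel averaging described above can be sketched as follows. This is a minimal illustration, not GAP's implementation: it assumes sample coordinates are already in a projected coordinate system measured in metres, and the function name `average_by_pixel` and the sample values are hypothetical.

```python
from collections import defaultdict
from math import floor


def average_by_pixel(samples, pixel_size_m=1000.0):
    """Average measured concentrations within each square grid pixel.

    samples: iterable of (x_m, y_m, concentration) tuples, with coordinates
             in metres (assumed projected coordinate system).
    Returns a dict mapping (column, row) pixel indices to the mean concentration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, concentration in samples:
        # Integer pixel index along each axis; 1000 m sides give 1 km² pixels.
        key = (floor(x / pixel_size_m), floor(y / pixel_size_m))
        sums[key] += concentration
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical measurements: two fall in the same pixel, one in its neighbour.
samples = [(120.0, 250.0, 4.0), (900.0, 10.0, 16.0), (1500.0, 300.0, 50.0)]
pixel_means = average_by_pixel(samples)
```

Each pixel then contributes a single calibration value, regardless of how many wells it contains.
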
It is important to have a range of measured data values that encompasses a comparable proportion of high and low values. This spread of data is essential, since the model will be able to predict only across this same value range. In the case of logistic regression, where the dependent variable is taken to be either high or low (1 or 0), the cut-off between the two is commonly chosen to be the contaminant concentration limit determined by the authorities (e.g. WHO) as being acceptable for human consumption.
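
As a rough illustration of the logistic-regression setup, the sketch below converts concentrations to high/low labels at a cut-off and fits a one-predictor model by plain gradient descent. The fitting procedure actually used in GAP is not specified here, so the optimiser, the function names, and the data are all assumptions. The cut-off shown is the WHO guideline value for arsenic in drinking water (10 µg/L), and the predictor is assumed to be standardised.

```python
import math

WHO_ARSENIC_LIMIT_UG_L = 10.0  # WHO guideline value for arsenic in drinking water


def to_binary(concentrations, cutoff):
    """Label each value 1 (high) if it exceeds the cut-off, else 0 (low)."""
    return [1 if c > cutoff else 0 for c in concentrations]


def predict_probability(weights, x):
    """Probability of a 'high' value; weights[0] is the intercept."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.1, n_iter=5000):
    """Fit logistic-regression coefficients by batch gradient descent."""
    weights = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            err = predict_probability(weights, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad)]
    return weights


# Hypothetical data: one standardised predictor and arsenic in µg/L.
X = [[-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [0.2], [-0.2]]
arsenic = [2.0, 3.0, 12.0, 5.0, 8.0, 25.0, 40.0, 15.0, 4.0]
labels = to_binary(arsenic, WHO_ARSENIC_LIMIT_UG_L)
weights = fit_logistic(X, labels)
```

The sign of each fitted coefficient indicates whether the predictor is positively or negatively associated with exceeding the cut-off.
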
The same is true for the independent variables. For example, if an independent variable has the same value for all of the data points being modelled, it cannot explain any of the variance in the data, since the variable itself does not vary. To establish a correlation, the measured data points must therefore cover a broad range of values of each independent variable. This may mean targeting a groundwater sampling campaign at regions that differ in, for example, geology or soil type, and not necessarily at regions where high arsenic levels are expected.
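
A simple pre-modelling check for predictors that do not vary could look like the sketch below. The predictor names and values are hypothetical and serve only to illustrate the point.

```python
def constant_predictors(rows, names):
    """Return the names of predictors that take a single value across all rows."""
    return [
        name
        for j, name in enumerate(names)
        if len({row[j] for row in rows}) <= 1
    ]


# Hypothetical predictor values extracted at three sampling locations.
names = ["soil_ph", "clay_fraction", "geology_class"]
rows = [
    (7.1, 0.30, "alluvium"),
    (6.4, 0.30, "alluvium"),
    (8.0, 0.30, "alluvium"),
]
flagged = constant_predictors(rows, names)
```

Here `clay_fraction` and `geology_class` would be flagged, suggesting that further sampling should target areas with different soils and geology.
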
A rule of thumb for the minimum size of the dataset (of the dependent variable) to be modelled is a ratio of at least 10 cases to every independent variable. Here, “cases” refers to the number of data values in the less frequent class, that is, the smaller of the number of high (1) and low (0) values in the dataset. For example, with three independent variables and a dataset of 60% high values and 40% low values, the dataset should contain at least 10 × 3 / 0.4 = 75 samples (REF).
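
The rule of thumb can be written as a small helper. The function name and signature are hypothetical; the arithmetic follows the worked example in the text.

```python
import math


def minimum_sample_size(n_predictors, fraction_high, cases_per_predictor=10):
    """Smallest dataset size giving the required cases in the minority class."""
    if not 0.0 < fraction_high < 1.0:
        raise ValueError("fraction_high must lie strictly between 0 and 1")
    # "Cases" are counted in the less frequent of the high and low classes.
    minority_fraction = min(fraction_high, 1.0 - fraction_high)
    required = cases_per_predictor * n_predictors / minority_fraction
    # Round first to absorb floating-point noise, then round up to whole samples.
    return math.ceil(round(required, 9))


# Three predictors, 60% high and 40% low values, as in the example above.
n_required = minimum_sample_size(3, 0.6)
```

The more unbalanced the dataset, the larger the total number of samples needed to reach the same number of minority-class cases.
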

Note that elevated dissolved fluoride concentrations can occur in groundwater under both low- and high-pH conditions because dissolved calcium concentrations are limited; calcium would otherwise control dissolved fluoride concentrations through the precipitation of fluorite (CaF₂(s)).

## References on geospatial modelling

Amini, M., K. C. Abbaspour, M. Berg, L. Winkel, S. J. Hug, E. Hoehn, H. Yang, and C. A. Johnson. "Statistical Modeling of Global Geogenic Arsenic Contamination in Groundwater." Environmental Science and Technology 42, no. 10 (2008a): 3669-75.

Amini, M., K. Mueller, K. C. Abbaspour, T. Rosenberg, M. Afyuni, K. N. Møller, M. Sarr, and C. A. Johnson. "Statistical Modeling of Global Geogenic Fluoride Contamination in Groundwaters." Environmental Science and Technology 42, no. 10 (2008b): 3662-68.

Winkel, L., M. Berg, M. Amini, S. J. Hug, and C. A. Johnson. "Predicting Groundwater Arsenic Contamination in Southeast Asia from Surface Parameters." Nature Geoscience 1, no. 8 (2008): 536-42.

Winkel, L., M. Berg, C. Stengel, and T. Rosenberg. "Hydrogeological Survey Assessing Arsenic and Other Groundwater Contaminants in the Lowlands of Sumatra, Indonesia." Applied Geochemistry 23, no. 11 (2008): 3019-28.

Amini, M., K. Abbaspour, and C. A. Johnson. "A Comparison of Different Rule-Based Statistical Models for Modeling Geogenic Groundwater Contamination." Environmental Modelling & Software 25 (2010): 1650-57.

Berg, M., L. Winkel, M. Amini, S. J. Hug, and C. A. Johnson. "Delineating Areas of Groundwater Arsenic Contamination from Surface Parameters and Geology at Depth." In Arsenic in Geosphere and Human Health - As2010, edited by J. S. Jean, J. Bundschuh and J. Bhattacharya, 79-81. Leiden, Netherlands: CRC Press/Balkema, 2010.

Winkel, L., P. T. K. Trang, V. M. Lan, C. Stengel, M. Amini, N. T. Ha, P. H. Viet, and M. Berg. "Arsenic Pollution of Groundwater in Vietnam Exacerbated by Deep Aquifer Exploitation for More Than a Century." PNAS 108, no. 4 (2011): 1246-51.

Rodriguez-Lado, L., G. Sun, M. Berg, Q. Zhang, H. Xue, Q. Zheng, and C. A. Johnson. "Groundwater Arsenic Contamination Throughout China." Science 341, no. 6148 (2013): 866-68.

Ravenscroft, P. "Predicting the global extent of arsenic pollution of groundwater and its potential impact on human health." UNICEF, New York (2007).