By the end of this practical, you should feel comfortable:
k
changes smoothsgam.check
and rqgam.check
obs_exp
library(Distance)
## Loading required package: mrds
## This is mrds 2.2.3
## Built: R 4.0.2; ; 2020-08-01 10:33:56 UTC; unix
##
## Attaching package: 'Distance'
## The following object is masked from 'package:mrds':
##
## create.bins
library(dsm)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
## Loading required package: numDeriv
## This is dsm 2.3.0
## Built: R 4.0.2; ; 2020-07-16 23:56:50 UTC; unix
library(ggplot2)
library(knitr)
Load the data and the fitted dsm
objects from the previous exercises:
load("spermwhale.RData")
load("dsms.RData")
load("df-models.RData")
We’re going to check the DSMs that we fitted in the previous practical, then save those that we think are good!
k
changes smoothsFirst checking that k
is big enough, we should really do this during model fitting, but we’ve separated this up for the practical exercises.
First look at the text output of gam.check
, are the values of k'
for your models close to the edf
in the outputted table. Here’s a silly example where I’ve deliberately set k
too small:
dsm_k_check_eg <- dsm(count ~ s(Depth, k=4),
df_hn, segs, obs,
family=tw())
## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!
gam.check(dsm_k_check_eg)
##
## Method: REML Optimizer: outer newton
## full convergence after 8 iterations.
## Gradient range [-1.66079e-07,1.171305e-07]
## (score 392.2823 & scale 5.216272).
## Hessian positive definite, eigenvalue range [0.8702242,302.9225].
## Model rank = 4 / 4
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Depth) 3.0 2.9 0.78 0.18
Note here again I’m using a “chunk” option to suppress the plots printed by gam.check
Generally if the EDF is close to the value of k
you supplied, it is worth doubling k
and refitting to see what happens. You can always switch back to the smaller k
if there is little difference. The ?choose.k
manual page can offer some guidance.
Continuing with that example, if we double k
:
dsm_k_check_eg <- dsm(count ~ s(Depth, k=8),
df_hn, segs, obs,
family=tw())
## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!
gam.check(dsm_k_check_eg)
##
## Method: REML Optimizer: outer newton
## full convergence after 8 iterations.
## Gradient range [-5.675949e-08,4.752562e-08]
## (score 392.2128 & scale 5.216315).
## Hessian positive definite, eigenvalue range [1.313321,300.6316].
## Model rank = 8 / 8
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Depth) 7.00 4.47 0.78 0.2
We get something much more reasonable. Doubling again
dsm_k_check_eg <- dsm(count ~ s(Depth, k=16),
df_hn, segs, obs,
family=tw())
## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!
gam.check(dsm_k_check_eg)
##
## Method: REML Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-4.245788e-07,3.205948e-07]
## (score 392.2614 & scale 5.213115).
## Hessian positive definite, eigenvalue range [1.440887,300.5744].
## Model rank = 16 / 16
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Depth) 15.00 4.83 0.78 0.19
We see almost no improvement, so might refit the second model here with k=8
.
Check to see if any of your models need this treatment.
Now following from the slides, let’s try plotting residuals vs. covariates. We can use these diagnostics (along with those in the following section) to assess how well the model has accounted for the structure in the data. This involves a little data manipulation and a little thought about how to chop up the covariate values.
I’ll show an example here using the not very good model we fitted above, so we can see what happens when things go wrong:
dsm_k_low <- dsm(count ~ s(Depth, k=4),
df_hn, segs, obs,
family=tw())
## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!
# get the data used to fit the model
resid_dat <- dsm_k_low$data
# get the residuals
resid_dat$resids <- residuals(dsm_k_low)
# now let's chop x into 20 chunks
resid_dat$x_chop <- cut(resid_dat$x,
seq(min(resid_dat$x), max(resid_dat$x), len=20))
# and plot the boxplot
boxplot(resids~x_chop, data=resid_dat, ylab="Deviance residuals", xlab="x")
This seems to show that something is going on in the far east of the data – there’s a wide range of residuals there. Note that for our data the most easterly points are also North (since our data is somewhat diagonal in shape, with respect to the compass). So let’s see if this is an issue with x
alone (East-West) or y
(North-Southness) is an issue too.
# now let's chop y into 20 chunks
resid_dat$y_chop <- cut(resid_dat$y,
seq(min(resid_dat$y), max(resid_dat$y), len=20))
# and plot the boxplot
boxplot(resids~y_chop, data=resid_dat, ylab="Deviance residuals", xlab="y")
We can also do this chopping with Depth
(the only covariate in this model):
# now let's chop Depth into 20 chunks
resid_dat$Depth_chop <- cut(resid_dat$Depth,
seq(min(resid_dat$Depth), max(resid_dat$Depth), len=20))
boxplot(resids~Depth_chop, data=resid_dat, ylab="Deviance residuals", xlab="Depth")
Now we see something more pathological, there is pattern in these residuals.
Let’s make the same plot for a model with an spatial smooth in it:
dsm_better <- dsm(count ~ s(x,y) + s(Depth),
df_hn, segs, obs,
family=tw())
## Warning in make.data(response, ddf.obj, segment.data, observation.data, : Some
## observations are outside of detection function truncation!
# get the data from the larger k model
resid_dat <- dsm_better$data
# get the residuals
resid_dat$resids <- residuals(dsm_better)
# now let's chop Depth into 20 chunks
resid_dat$Depth_chop <- cut(resid_dat$Depth,
seq(min(resid_dat$Depth), max(resid_dat$Depth), len=20))
boxplot(resids~Depth_chop, data=resid_dat, ylab="Deviance residuals", xlab="Depth")
This looks better (note the change in vertical axis scale).
Try this out with your models, using the covariates in the model and perhaps x
and y
too. You may want to change the number of chunks (via the len
argument to cut()
).
We can also fit a model to the residuals to see if there is residual structure left in the model. The idea here is that we know that the model won’t fit perfectly everywhere, but is the difference between model and data (the residuals) systematic. Going back to our “bad” model above, we can see what happens when things go wrong:
# get the data used to fit the model
resid_dat <- dsm_k_low$data
# get the residuals
resid_dat$resids <- residuals(dsm_k_low)
# fit the model with the same predictors, but larger k
resid_model_depth <- gam(resids ~ s(Depth, k=40),
data=resid_dat,
family=gaussian(), method="REML")
Note that we used the gam
function from mgcv
(which is what happens inside dsm
anyway), to fit our residual model. The important things to note are:
gam()
in place of dsm()
.resids
, the column we added above.family=gaussian()
.method="REML"
which happens automatically in dsm()
, and makes sure the best method is used for fitting the smooths.We want to look at the EDFs associated with the terms in the model:
summary(resid_model_depth)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## resids ~ s(Depth, k = 40)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.54782 0.03492 -15.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Depth) 3.304 4.14 8.315 1.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0344 Deviance explained = 3.77%
## -REML = 1422 Scale est. = 1.1571 n = 949
We can see that there’s still a bit of pattern there, but not much (you can also try using plot()
to investigate this).
Now trying that with the better model:
# get the data used to fit the model
resid_dat_better <- dsm_better$data
# get the residuals
resid_dat_better$resids <- residuals(dsm_better)
# fit the model with the same predictors, but larger k
resid_model_better <- gam(resids ~ s(Depth, bs="ts", k=40) + s(x,y, bs="ts", k=100),
data=resid_dat_better,
family=gaussian(), method="REML")
Now again looking at the EDFs:
summary(resid_model_better)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## resids ~ s(Depth, bs = "ts", k = 40) + s(x, y, bs = "ts", k = 100)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.45018 0.03071 -14.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Depth) 2.435 39 0.550 6.79e-06 ***
## s(x,y) 1.028 99 0.045 0.0255 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0296 Deviance explained = 3.32%
## -REML = 1302.3 Scale est. = 0.8952 n = 949
We see that there’s not much going on in this model. Again you can use plot()
here, remembering to compare the vertical axis limits and look at the width of the confidence bands.
Try this out with your model(s). What do you think is missing?
rqgam.check
We can use rqgam.check()
to look at the residual check plots for this model. Here we’re looking for pattern in the plots that would indicate that there is unmodelled structure. Ideally the plot should look like a messy blob of points.
rqgam.check(dsm_better)
## Loading required namespace: tweedie
Try this out with your models and see if there is any pattern in the randomised quantile residuals.
obs_exp
To compare the observed vs. expected counts from the model, we need to aggregate the data at some level. We can use obs_exp()
to do this. For continuous covariates in the detection function or spatial model, we just need to specify the cut points. For example:
obs_exp(dsm_better, "Depth", c(0, 1000, 2000, 3000, 4000, 6000))
## (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03] (4e+03,6e+03]
## Observed 4.000000 52.53333 139.1667 35.00000 8.000000
## Expected 5.163824 49.92310 126.8073 38.65673 8.134937
Try this out for your models.
Now save the models that you’d like to use to predict with later (that have good check results!): I recommend saving a few models so you can compare the results later.
# add your models here
#save(dsm_nb_xy_ms,
# file="dsms-checked.RData")