## Aims By the end of this practical, you should feel comfortable: - Looking at how changing `k` changes smooths - Plotting residuals vs. covariates - Fitting models to the residuals - Using `gam.check` and `rqgam.check` - Using `obs_exp` ## Load data and packages ```{r load-packages-03} library(Distance) library(dsm) library(ggplot2) library(knitr) ``` Load the data and the fitted `dsm` objects from the previous exercises: ```{r load-data-03} load("spermwhale.RData") load("dsms.RData") load("df-models.RData") ``` We're going to check the DSMs that we fitted in the previous practical, then save those that we think are good! --- **Pre-model fitting** As we did in the previous exercise, we must remove the observations from the spatial data that we excluded when we fitted the detection function, i.e. those observations at distances greater than the truncation. ```{r truncate-obs-03} obs <- obs[obs$distance <= df_hn$ddf$meta.data$width, ] ``` --- ## Looking at how changing `k` changes smooths First checking that `k` is big enough, we should really do this during model fitting, but we've separated this up for the practical exercises. First look at the text output of `gam.check`, are the values of `k'` for your models close to the `edf` in the outputted table. Here's a silly example where I've deliberately set `k` too small: ```{r k-check-example, fig.show='hide'} dsm_k_check_eg <- dsm(count ~ s(Depth, k=4), df_hn, segs, obs, family=tw()) gam.check(dsm_k_check_eg) ``` Generally if the EDF is close to the value of `k` you supplied, it is worth doubling `k` and refitting to see what happens. You can always switch back to the smaller `k` if there is little difference. The `?choose.k` manual page can offer some guidance. Continuing with that example, if we double `k`: ```{r k-check-example-double-k, fig.show='hide'} dsm_k_check_eg <- dsm(count ~ s(Depth, k=8), df_hn, segs, obs, family=tw()) gam.check(dsm_k_check_eg) ``` We get something much more reasonable. Doubling again ```{r k-check-example-double-double-k, fig.show='hide'} dsm_k_check_eg <- dsm(count ~ s(Depth, k=16), df_hn, segs, obs, family=tw()) gam.check(dsm_k_check_eg) ``` We see almost no improvement, so might refit the second model here with `k=8`. Check to see if any of your models need this treatment. --- **Checking `k` size** I saved the following models last time: `dsm_nb_xy_ms_p1_4`, `dsm_tw_xy_ms`, `dsm_nb_noxy_ms_p1_3`, `dsm_nb_xy`, `dsm_tw_xy`, `dsm_nb_xy_ms_ae` and `dsm_tw_xy_ms_ae`. I'll run `gam.check` on each. Note that I've used the `fig.show='hide'` "chunk" option in RMarkdown to suppress the plots here and save space ```{r check-k-saved, fig.show='hide'} gam.check(dsm_nb_xy_ms_p1_4) gam.check(dsm_tw_xy_ms) gam.check(dsm_nb_noxy_ms_p1_3) gam.check(dsm_nb_xy) gam.check(dsm_tw_xy) gam.check(dsm_nb_xy_ms_ae) gam.check(dsm_tw_xy_ms_ae) ``` (That might have taken a while to run!) Sifting through these results, most of the cases were the $p$-value is significant (at any level beyond 0.1) it's the case that the EDF is far from $k^\prime$. The only models I am worried about are `dsm_nb_xy` and `dsm_tw_xy` where the $p$-value is singificant *and* $k^\prime$ is "near" the EDF (it's not that near but is is over half, so let's check it out). For the Tweedie model we set `k=25` and got an EDF of 14.5, so we'll double `k` to 50: ```{r refit-tw-xy, fig.show='hide'} dsm_tw_xy <- dsm(count~s(x,y, bs="ts", k=50), df_hn, segs, obs, family=tw()) summary(dsm_tw_xy) gam.check(dsm_tw_xy) ``` Okay that's a moderate jump there! We can see now the EDF is less than half of $k^\prime$ **but** as we can see the $p$-value is still significant. It's not likely that increasing $k$ will do anything here. As I said in the lecture, we need to look at $k^\prime$, the EDF and the $p$-value here to see what's going on. We can do the same thing with the negative binomial model, there the EDF was 13.9 when `k=25`. ```{r refit-nb-xy, fig.show='hide'} dsm_nb_xy <- dsm(count~s(x,y, bs="ts", k=50), df_hn, segs, obs, family=nb()) summary(dsm_nb_xy) gam.check(dsm_nb_xy) ``` Again we see a wee jump in the EDF but the $p$-value is still significant. For the same reasons as above, we're not going to worry too much about that. ## Plotting residuals vs. covariates Now following from the slides, let's try plotting residuals vs. covariates. We can use these diagnostics (along with those in the following section) to assess how well the model has accounted for the structure in the data. This involves a little data manipulation and a little thought about how to chop up the covariate values. I'll show an example here using the not very good model we fitted above, so we can see what happens when things go wrong: ```{r resid-covar} dsm_k_low <- dsm(count ~ s(Depth, k=4), df_hn, segs, obs, family=tw()) # get the data used to fit the model resid_dat <- dsm_k_low$data # get the residuals resid_dat$resids <- residuals(dsm_k_low) # now let's chop x into 20 chunks resid_dat$x_chop <- cut(resid_dat$x, seq(min(resid_dat$x), max(resid_dat$x), len=20)) # and plot the boxplot boxplot(resids~x_chop, data=resid_dat, ylab="Deviance residuals", xlab="x") ``` This seems to show that something is going on in the far east of the data -- there's a wide range of residuals there. Note that for our data the most easterly points are also North (since our data is somewhat diagonal in shape, with respect to the compass). So let's see if this is an issue with `x` alone (East-West) or `y` (North-Southness) is an issue too. ```{r resid-covar-y} # now let's chop y into 20 chunks resid_dat$y_chop <- cut(resid_dat$y, seq(min(resid_dat$y), max(resid_dat$y), len=20)) # and plot the boxplot boxplot(resids~y_chop, data=resid_dat, ylab="Deviance residuals", xlab="y") ``` We can also do this chopping with `Depth` (the only covariate in this model): ```{r resid-covar-depth} # now let's chop Depth into 20 chunks resid_dat$Depth_chop <- cut(resid_dat$Depth, seq(min(resid_dat$Depth), max(resid_dat$Depth), len=20)) boxplot(resids~Depth_chop, data=resid_dat, ylab="Deviance residuals", xlab="Depth") ``` Now we see something more pathological, there is pattern in these residuals. Let's make the same plot for a model with an spatial smooth in it: ```{r resid-covar-depth-good} dsm_better <- dsm(count ~ s(x,y) + s(Depth), df_hn, segs, obs, family=tw()) # get the data from the larger k model resid_dat <- dsm_better$data # get the residuals resid_dat$resids <- residuals(dsm_better) # now let's chop Depth into 20 chunks resid_dat$Depth_chop <- cut(resid_dat$Depth, seq(min(resid_dat$Depth), max(resid_dat$Depth), len=20)) boxplot(resids~Depth_chop, data=resid_dat, ylab="Deviance residuals", xlab="Depth") ``` This looks better (note the change in vertical axis scale). Try this out with your models, using the covariates in the model and perhaps `x` and `y` too. You may want to change the number of chunks (via the `len` argument to `cut()`). --- **Looking for patterns in the residuals** Trying this out with the models without smooths of space in them (the habitat models): ```{r resid-covar-hab} # get the data from the larger k model resid_dat <- dsm_nb_noxy_ms_p1_3$data # get the residuals resid_dat$resids <- residuals(dsm_nb_noxy_ms_p1_3) # now let's chop x into 20 chunks resid_dat$x_chop <- cut(resid_dat$x, seq(min(resid_dat$x), max(resid_dat$x), len=20)) # now let's chop y into 20 chunks resid_dat$y_chop <- cut(resid_dat$y, seq(min(resid_dat$y), max(resid_dat$y), len=20)) par(mfrow=c(1,2)) boxplot(resids~x_chop, data=resid_dat, ylab="Deviance residuals", xlab="x", las=2) boxplot(resids~y_chop, data=resid_dat, ylab="Deviance residuals", xlab="y", las=2) ``` Comparing this to the model with space in it: ```{r resid-covar-xy} # get the data from the larger k model resid_dat <- dsm_nb_xy_ms_p1_4$data # get the residuals resid_dat$resids <- residuals(dsm_nb_xy_ms_p1_4) # now let's chop x into 20 chunks resid_dat$x_chop <- cut(resid_dat$x, seq(min(resid_dat$x), max(resid_dat$x), len=20)) # now let's chop y into 20 chunks resid_dat$y_chop <- cut(resid_dat$y, seq(min(resid_dat$y), max(resid_dat$y), len=20)) par(mfrow=c(1,2)) boxplot(resids~x_chop, data=resid_dat, ylab="Deviance residuals", xlab="x", las=2) boxplot(resids~y_chop, data=resid_dat, ylab="Deviance residuals", xlab="y", las=2) ``` Neither are perfect, there is some pattern in them but we don't see a really tight pattern as we do in the pathological versions above. --- ## Fitting models to the residuals We can also fit a model to the residuals to see if there is residual structure left in the model. The idea here is that we know that the model won't fit perfectly everywhere, but is the difference between model and data (the residuals) systematic. Going back to our "bad" model above, we can see what happens when things go wrong: ```{r resid-fit} # get the data used to fit the model resid_dat <- dsm_k_low$data # get the residuals resid_dat$resids <- residuals(dsm_k_low) # fit the model with the same predictors, but larger k resid_model_depth <- gam(resids ~ s(Depth, k=40), data=resid_dat, family=gaussian(), method="REML") ``` Note that we used the `gam` function from `mgcv` (which is what happens inside `dsm` anyway), to fit our residual model. The important things to note are: - `gam()` in place of `dsm()`. - The response is `resids`, the column we added above. - Theoretically we expect the deviance residuals to be distributed normally, so we set `family=gaussian()`. - Need to set `method="REML"` which happens automatically in `dsm()`, and makes sure the best method is used for fitting the smooths. We want to look at the EDFs associated with the terms in the model: ```{r resid-fit-summaries} summary(resid_model_depth) ``` We can see that there's still a bit of pattern there, but not much (you can also try using `plot()` to investigate this). Now trying that with the better model: ```{r resid-fit-good} # get the data used to fit the model resid_dat_better <- dsm_better$data # get the residuals resid_dat_better$resids <- residuals(dsm_better) # fit the model with the same predictors, but larger k resid_model_better <- gam(resids ~ s(Depth, bs="ts", k=40) + s(x,y, bs="ts", k=100), data=resid_dat_better, family=gaussian(), method="REML") ``` Now again looking at the EDFs: ```{r resid-fit-good-summary} summary(resid_model_better) ``` We see that there's not much going on in this model. Again you can use `plot()` here, remembering to compare the vertical axis limits and look at the width of the confidence bands. Try this out with your model(s). What do you think is missing? --- **Fitting to the residuals** Trying this out again with `dsm_nb_noxy_ms_p1_3` (habitat model) and `dsm_nb_xy_ms_p1_4` (model with space included). I'll start with the habitat model, where we didn't include `s(x,y)` from the start. Let's fit a model to the residuals that includes all of the possible covariates, to see if we're missing anything... ```{r resid-covar-fit-hab} # make a template data frame from the fitted data resid_dat_hab <- dsm_nb_noxy_ms_p1_3$data # get the residuals resid_dat_hab$resids <- residuals(dsm_nb_noxy_ms_p1_3) # fit the model with all predictors, but larger k resid_model_hab <- gam(resids ~ s(x,y, bs="ts", k=60) + s(Depth, bs="ts", k=20) + s(DistToCAS, bs="ts", k=20) + s(SST, bs="ts", k=20) + s(EKE, bs="ts", k=20) + s(NPP, bs="ts", k=20), data=resid_dat_hab, family=gaussian(), method="REML") summary(resid_model_hab) ``` Most of these EDFs are very very small, that's good! It means we've explained most of what's going on in the data using this model. We can see this a little better by plotting the model terms: ```{r} plot(resid_model_hab, scheme=2, pages=1) ``` We can see that the EDF of around 1 for `s(x,y)` here is giving a slight gradient in the spatial smooth of the residuals, but no complex pattern (it's a plane through the space). Depth might cause use a bit more concern. Let's try refitting the model, increasing the `k` for `s(Depth)` and `s(EKE)`: ```{r hab-refit} dsm_nb_noxy_ms_refit <- dsm(count~s(Depth, bs="ts", k=20) + #s(DistToCAS, bs="ts") + # 1 #s(SST, bs="ts") + # 2 s(EKE, bs="ts", k=20) + s(NPP, bs="ts"), df_hn, segs, obs, family=nb()) summary(dsm_nb_noxy_ms_refit) ``` We don't see any change here, so it's not that `k` is the issue. More likely there's just not enough information in the data to account for the variation that's there. For example `NPP` is a rough guide to net primary productivity (it's a measure of epipelagic micronekton, much smaller than the squid we usually see sperm whales eat), so this may not be "close enough" to the prey that sperm whales are interested in. Trying the same thing for the model that has a smooth of space in it... ```{r resid-covar-fit-xy} # make a template data frame from the fitted data resid_dat_xy <- dsm_nb_xy_ms_p1_4$data # get the residuals resid_dat_xy$resids <- residuals(dsm_nb_xy_ms_p1_4) # fit the model with the same predictors, but larger k resid_model_xy <- gam(resids ~ s(x,y, bs="ts", k=60) + s(Depth, bs="ts", k=20) + s(DistToCAS, bs="ts", k=20) + s(SST, bs="ts", k=20) + s(EKE, bs="ts", k=20) + s(NPP, bs="ts", k=20), data=resid_dat_xy, family=gaussian(), method="REML") summary(resid_model_xy) ``` Again we see that space and `EKE` have small effects (close to EDF of 1) and `Depth` has a slightly larger one. Looking at a plot of these effects: ```{r} plot(resid_model_xy, scheme=2, pages=1) ``` We see very very similar results to the above. Let's try refitting that model, but increasing `k` for `s(x,y)`, `s(Depth)` and `s(EKE)`: ```{r} dsm_nb_xy_ms_refit <- dsm(count~ #s(x,y, bs="ts", k=60) + # 1 s(Depth, bs="ts", k=20) + #s(DistToCAS, bs="ts") + # 2 #s(SST, bs="ts") + # 3 s(EKE, bs="ts", k=20) + s(NPP, bs="ts"), df_hn, segs, obs, family=nb()) summary(dsm_nb_xy_ms_refit) ``` So! We've now ended-up with the same model as `dsm_nb_noxy_ms_refit`! This again shows that we need to look at multiple outputs, adjust and see what's going on. Let's do the residual plot again for the terms in the model: ```{r resid-covar-refit} # get the data from the larger k model resid_dat <- dsm_nb_xy_ms_refit$data # get the residuals resid_dat$resids <- residuals(dsm_nb_xy_ms_refit) # now let's chop Depth into 20 chunks resid_dat$Depth_chop <- cut(resid_dat$Depth, seq(min(resid_dat$Depth), max(resid_dat$Depth), len=20)) # now let's chop EKE into 20 chunks resid_dat$EKE_chop <- cut(resid_dat$EKE, seq(min(resid_dat$EKE), max(resid_dat$EKE), len=20)) # now let's chop NPP into 20 chunks resid_dat$NPP_chop <- cut(resid_dat$NPP, seq(min(resid_dat$NPP), max(resid_dat$NPP), len=20)) par(mfrow=c(1,3)) boxplot(resids~Depth_chop, data=resid_dat, ylab="Deviance residuals", xlab="Depth", las=2) boxplot(resids~EKE_chop, data=resid_dat, ylab="Deviance residuals", xlab="EKE", las=2) boxplot(resids~NPP_chop, data=resid_dat, ylab="Deviance residuals", xlab="NPP", las=2) ``` We can see there's still something going on with `Depth`, but as I said above, this may be down to still not having a good covariate for prey. Finally we'll do the same thing with the Tweedie model with space in it `dsm_tw_xy_ms`. First a reminder of what's in that model: ```{r tw-summ} summary(dsm_tw_xy_ms) ``` Now doing the same thing again and refitting to the residuals: ```{r resid-covar-xy-tw} # make a template data frame from the fitted data resid_dat_xy <- dsm_tw_xy_ms$data # get the residuals resid_dat_xy$resids <- residuals(dsm_tw_xy_ms) # fit the model with all predictors, but larger k resid_model_xy <- gam(resids ~ s(x,y, bs="ts", k=60) + s(Depth, bs="ts", k=20) + s(DistToCAS, bs="ts", k=20) + s(SST, bs="ts", k=20) + s(EKE, bs="ts", k=20) + s(NPP, bs="ts", k=20), data=resid_dat_xy, family=gaussian(), method="REML") summary(resid_model_xy) ``` Again, most of these EDFs are very very small. Let's again plot the model terms: ```{r tw-resid-plot} plot(resid_model_xy, scheme=2, pages=1) ``` Looking at the `summary` results and the plot, I think we need to increase `k` for `s(x,y)`, `s(Depth)` and `s(DistToCAS)` (which wasn't in our model!): ```{r tw-xy-refit} dsm_tw_xy_ms_refit <- dsm(count~s(x,y, bs="ts", k=60)+ s(Depth, bs="ts", k=20) + #s(DistToCAS, bs="ts", k=20) + # 1 #s(SST, bs="ts") + # 2 s(EKE, bs="ts"),# + #s(NPP, bs="ts"), df_hn, segs, obs, family=tw()) summary(dsm_tw_xy_ms_refit) ``` So we're back to where we started! In this case increasing `k` for `DistToCAS` didn't have any impact on our model selection in the end. --- ## Using `rqgam.check` We can use `rqgam.check()` to look at the residual check plots for this model. Here we're looking for pattern in the plots that would indicate that there is unmodelled structure. Ideally the plot should look like a messy blob of points. ```{r rqcheck} rqgam.check(dsm_better) ``` Try this out with your models and see if there is any pattern in the randomised quantile residuals. --- **Checking randomized quantile residuals** Let's compare between negative binomial and Tweedie models. First negative binomial, as refitted above: ```{r rqhab-nb} rqgam.check(dsm_nb_noxy_ms_refit) ``` Then Tweedie: ```{r rqxy-tw} rqgam.check(dsm_tw_xy_ms) ``` Both of these seem fine, not pathological like those in the lecture slides. --- ## Using `obs_exp` To compare the observed vs. expected counts from the model, we need to aggregate the data at some level. We can use `obs_exp()` to do this. For continuous covariates in the detection function or spatial moel, we just need to specify the cutpoints. For example: ```{r oe-depth-bad, echo=TRUE} obs_exp(dsm_better, "Depth", c(0, 1000, 2000, 3000, 4000, 6000)) ``` Try this out for your models. --- **Looking at observed vs. expected counts** Again trying this for our negative binomial habitat model using the `Depth` covariate: ```{r oe-nb, echo=TRUE} obs_exp(dsm_nb_noxy_ms_refit, "Depth", c(0, 1000, 2000, 3000, 4000, 6000)) ``` This looks pretty good! We can try with `EKE` and `NPP` too: ```{r oe-nb-eke, echo=TRUE} # for EKE the majority of the values are between 0 and # 0.05, so we'll put more bins there and make bins larger for larger values -- see hist(segs$EKE) obs_exp(dsm_nb_noxy_ms_refit, "EKE", c(0, 0.01, 0.025, 0.05, 0.1, 0.4, 0.7)) ``` ```{r oe-nb-npp, echo=TRUE} # we can get the default breaks we have from hist() npp_breaks <- hist(segs$NPP, plot=FALSE)$breaks obs_exp(dsm_nb_noxy_ms_refit, "NPP", npp_breaks) ``` Again, these seem pretty reasonable! Doing the same thing for the Tweedie model `dsm_tw_xy_ms`, again using the default histogram breaks for `x` and `y`: ```{r oe-tw, echo=TRUE} x_breaks <- hist(segs$x, plot=FALSE)$breaks obs_exp(dsm_tw_xy_ms, "x", x_breaks) y_breaks <- hist(segs$y, plot=FALSE)$breaks obs_exp(dsm_tw_xy_ms, "y", y_breaks) obs_exp(dsm_tw_xy_ms, "Depth", c(0, 1000, 2000, 3000, 4000, 6000)) # again need to be careful with EKE obs_exp(dsm_tw_xy_ms, "EKE", c(0, 0.01, 0.025, 0.05, 0.1, 0.4, 0.7)) ``` The `EKE` and `Depth` results look good, but the spatial terms look less good. That's just because we're taking marginals of a 2 dimensional distribution. We might do better looking at "areas of interest" or other covariates here. For example we can look at `DistToCAS` even though it's not in the model: ```{r tw_disttocas} # again need to inspect hist(segs$DistToCAS) first obs_exp(dsm_tw_xy_ms, "DistToCAS", c(-1e-10, 1000, 2000, 5000, 10000, 350000)) ``` This looks pretty good! --- ## Saving models Now save the models that you'd like to use to predict with later (that have good check results!): I recommend saving a few models so you can compare the results later. ```{r save-models-03} # add your models here save(dsm_nb_noxy_ms_refit, dsm_tw_xy_ms, file="dsms-checked.RData") ```