Model checking

“perhaps the most important part of applied statistical modelling”

Simon Wood

Model checking

Checking \( \neq \) validation!
As with detection function, checking is important
Want to know the model conforms to assumptions
What assumptions should we check?

What to check

Convergence
Basis size
Residuals

Convergence

Fitting the GAM involves an optimization
By default this is REstricted Maximum Likelihood (REML) score
Sometimes this can go wrong
R will warn you!

A model that converges

gam.check(dsm_tw_xy_depth)


Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-3.468176e-05,1.090937e-05]
(score 374.7249 & scale 4.172176).
Hessian positive definite, eigenvalue range [1.179219,301.267].
Model rank =  39 / 39 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

            k'   edf k-index p-value    
s(x,y)   29.00 11.11    0.65  <2e-16 ***
s(Depth)  9.00  3.84    0.81    0.33    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A bad model

Error in while (mean(ldxx/(ldxx + ldss)) > 0.4) { :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In sqrt(w) : NaNs produced
Error in while (mean(ldxx/(ldxx + ldss)) > 0.4) { :
  missing value where TRUE/FALSE needed

This is rare

The Folk Theorem of Statistical Computing

“most statistical computational problems are due not to the algorithm being used but rather the model itself”

Andrew Gelman

Basis size

Basis size (k)

Set k per term
e.g. s(x, k=10) or s(x, y, k=100)
Penalty removes “extra” wigglyness
- up to a point!
(But computation is slower with bigger k)

Checking basis size

gam.check(dsm_x_tw)


Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-3.08755e-06,4.928064e-07]
(score 409.936 & scale 6.041307).
Hessian positive definite, eigenvalue range [0.7645492,302.127].
Model rank =  10 / 10 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

       k'  edf k-index p-value
s(x) 9.00 4.96    0.76    0.44

Increasing basis size

dsm_x_tw_k <- dsm(count~s(x, k=20), ddf.obj=df,
                  segment.data=segs, observation.data=obs,
                  family=tw())
gam.check(dsm_x_tw_k)


Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-2.301238e-08,3.930667e-09]
(score 409.9245 & scale 6.033913).
Hessian positive definite, eigenvalue range [0.7678456,302.0336].
Model rank =  20 / 20 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

        k'   edf k-index p-value
s(x) 19.00  5.25    0.76    0.39

Sometimes basis size isn't the issue...

Generally, double k and see what happens
Didn't increase the EDF much here
Other things can cause low “p-value” and “k-index”
Increasing k can cause problems (nullspace)

k is a maximum

(Usually) Don't need to worry about things being too wiggly
k gives the maximum complexity
Penalty deals with the rest

plot of chunk plotk

Residuals

What are residuals?

Generally residuals = observed value - fitted value
BUT hard to see patterns in these “raw” residuals
Need to standardise \( \Rightarrow \) deviance residuals
Residual sum of squares \( \Rightarrow \) linear model
- deviance \( \Rightarrow \) GAM
Expect these residuals \( \sim N(0,1) \)

Residual checking

plot of chunk gamcheck

Shortcomings

gam.check can be helpful
“Resids vs. linear pred” is victim of artifacts
Need an alternative
“Randomised quanitle residuals” (experimental)
- rqgam.check
- Exactly normal residuals

Randomised quantile residuals

plot of chunk rqgamcheck

Residuals vs. covariates

plot of chunk covar-resids

Residuals vs. covariates (boxplots)

plot of chunk covar-resids-boxplot

Example of "bad" plots

Bad residual check plot from Wood 2006

Example of "bad" plots

Bad residual check plot from Wood 2006

Residual checks

Looking for patterns (not artifacts)
This can be tricky
Need to use a mixture of techniques
Cycle through checks, make changes recheck
Each dataset is different

Summary

Convergence
- Rarely an issue
- Check your thinking about the model
Basis size
- k is a maximum
- Double and see what happens
Residuals
- Deviance and randomised quantile
- check for artifacts
gam.check is your friend