Multivariate smoothing, model selection

# Multivariate smoothing<br/>&<br/>model selection
<div style="position: absolute; bottom: 15px; vertical-align: center; left: 10px">
<img src="images/02-foundation-vertical-white.png" height="200">
</div>

---

# Recap

- How GAMs work
- How to include detection info
- Simple spatial-only models
- How to check those models

---
class: inverse, middle, center

# Univariate models are fun, but...

---

# Ecology is not univariate

- Many variables affect distribution
- Want to model the **right** ones
- Select between possible models
  - Smooth term selection
  - Response distribution
- Large literature on model selection

---
class: inverse, middle, center

# Models with multiple smooths

---

# Adding smooths

- Already know that `+` is our friend
- Can build a big model...

```r
dsm_all <- dsm(count~s(x, y) +
                     s(Depth) +
                     s(DistToCAS) +
                     s(SST) +
                     s(EKE) +
                     s(NPP),
                  ddf.obj=df_hr,
                  segment.data=segs, observation.data=obs,
                  family=tw())
```

---
class: inverse, middle, center

# Now we have a huge model, what do we do?

---

# Term selection

**Stepwise selection** - path dependence

**All possible subsets** - computationally expensive (fishing?)
]
.pull-right[
<img src="images/gnome.jpg">
]

---

# p-values

- Test for *zero effect* of a smooth
- They are **approximate** for GAMs (but useful)
- Reported in `summary`

---

# p-values example

```
## 
## Family: Tweedie(p=1.25) 
## Link function: log 
## 
## Formula:
## count ~ s(x, y) + s(Depth) + s(DistToCAS) + s(SST) + s(EKE) + 
##     s(NPP) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.6369     0.2752     -75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(x,y)       5.236  7.169 1.233  0.2928    
## s(Depth)     3.568  4.439 6.640 1.6e-05 ***
## s(DistToCAS) 1.000  1.000 1.503  0.2205    
## s(SST)       5.927  6.987 2.067  0.0407 *  
## s(EKE)       1.763  2.225 2.577  0.0696 .  
## s(NPP)       2.393  3.068 0.855  0.4680    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.224   Deviance explained = 44.9%
## -REML = 368.97  Scale est. = 3.9334    n = 949
```

---

# Path dependence is an issue here

---

# Term selection during fitting

.pull-left[
- Penalty already removes complexity from model
- What about using it to remove the whole term?
]
.pull-right[
![animation of penalty in action](images/shrinky_dink.gif)
]

---

# Shrinkage approach

.pull-left[
- Basis `s(..., bs="ts")` - thin plate splines *with shrinkage*
- remove the wiggles **then** remove the "linear" bits
- nullspace should be shrunk less than the wiggly part
]
.pull-right[
![animation of penalty in action](images/shrinky_dink_ts.gif)
]

---

# Shrinkage example

```r
dsm_ts_all <- dsm(count~s(x, y, bs="ts") +
                        s(Depth, bs="ts") +
                        s(DistToCAS, bs="ts") +
                        s(SST, bs="ts") +
                        s(EKE, bs="ts") +
                        s(NPP, bs="ts"),
                  ddf.obj=df_hr,
                  segment.data=segs, observation.data=obs,
                  family=tw())
```

---

# Model with no shrinkage

![](dsm4-multiple-smooths_files/figure-html/smooth-no-shrinkage-1.png)
---

# ... with shrinkage

![](dsm4-multiple-smooths_files/figure-html/smooth-shrinkage-1.png)

---

# Shrinkage in action

```r
summary(dsm_ts_all)
```

```
## 
## Family: Tweedie(p=1.277) 
## Link function: log 
## 
## Formula:
## count ~ s(x, y, bs = "ts") + s(Depth, bs = "ts") + s(DistToCAS, 
##     bs = "ts") + s(SST, bs = "ts") + s(EKE, bs = "ts") + s(NPP, 
##     bs = "ts") + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -20.260      0.234  -86.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df     F  p-value    
## s(x,y)       1.8875209     29 0.705 3.56e-06 ***
## s(Depth)     3.6794182      9 4.811 2.15e-10 ***
## s(DistToCAS) 0.0000934      9 0.000   0.6797    
## s(SST)       0.3826654      9 0.063   0.2160    
## s(EKE)       0.8196256      9 0.499   0.0178 *  
## s(NPP)       0.0003570      9 0.000   0.8359    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.11   Deviance explained = 35.1%
## -REML = 385.04  Scale est. = 4.5486    n = 949
```

---

# Alternative: extra penalty

.pull-left[
- `dsm(..., select=TRUE)` - extra penalty
- no assumption of how much to shrink the nullspace
- "no assumption" might not be such a good idea?
]
.pull-right[
![animation of penalty in action](images/shrinky_dink_select.gif)
]

---

# Extra penalty example

```r
dsm_sel <- dsm(count~s(x, y) +
                     s(Depth) +
                     s(DistToCAS) +
                     s(SST) +
                     s(EKE) +
                     s(NPP),
                  ddf.obj=df_hr,
                  segment.data=segs, observation.data=obs,
                  family=tw(), select=TRUE)
```

---

# Extra penalty example

```r
summary(dsm_sel)
```

```
## 
## Family: Tweedie(p=1.266) 
## Link function: log 
## 
## Formula:
## count ~ s(x, y) + s(Depth) + s(DistToCAS) + s(SST) + s(EKE) + 
##     s(NPP) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.4285     0.2454  -83.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df     F  p-value    
## s(x,y)       7.694e+00     29 1.272 2.67e-07 ***
## s(Depth)     3.645e+00      9 4.005 3.24e-10 ***
## s(DistToCAS) 1.944e-05      9 0.000   0.7038    
## s(SST)       2.010e-04      9 0.000   0.8216    
## s(EKE)       1.417e+00      9 0.630   0.0127 *  
## s(NPP)       2.318e-04      9 0.000   0.5152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.16   Deviance explained = 40.9%
## -REML =    378  Scale est. = 4.2252    n = 949
```

---

# Extra penalty example

![](dsm4-multiple-smooths_files/figure-html/smooth-sel-1.png)

---

# EDF comparison

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> allterms </th>
   <th style="text-align:right;"> select </th>
   <th style="text-align:right;"> ts </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> s(x,y) </td>
   <td style="text-align:right;"> 5.2361 </td>
   <td style="text-align:right;"> 7.6936 </td>
   <td style="text-align:right;"> 1.8875 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> s(Depth) </td>
   <td style="text-align:right;"> 3.5677 </td>
   <td style="text-align:right;"> 3.6449 </td>
   <td style="text-align:right;"> 3.6794 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> s(DistToCAS) </td>
   <td style="text-align:right;"> 1.0001 </td>
   <td style="text-align:right;"> 0.0000 </td>
   <td style="text-align:right;"> 0.0001 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> s(SST) </td>
   <td style="text-align:right;"> 5.9270 </td>
   <td style="text-align:right;"> 0.0002 </td>
   <td style="text-align:right;"> 0.3827 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> s(EKE) </td>
   <td style="text-align:right;"> 1.7628 </td>
   <td style="text-align:right;"> 1.4174 </td>
   <td style="text-align:right;"> 0.8196 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> s(NPP) </td>
   <td style="text-align:right;"> 2.3929 </td>
   <td style="text-align:right;"> 0.0002 </td>
   <td style="text-align:right;"> 0.0004 </td>
  </tr>
</tbody>
</table>

---

# Double penalty can be slow

- Lots of smoothing parameters to estimate

```r
length(dsm_ts_all$sp)
```

```
## [1] 6
```

```r
length(dsm_sel$sp)
```

```
## [1] 12
```

---

# How do we select smooth terms?

1. Look at EDF
  - Terms with EDF<1 may not be useful (can we remove?)
2. Remove non-significant terms by `$p$`-value
  - Decide on a significance level and use that as a rule

(In some sense leaving "shrunk" terms in is more "consistent" in terms of variance estimation, but can be computationally annoying)

---

# Comparing models

---

# Comparing models

- Usually have >1 option
- How can we pick?
- Even if we have 1 model, is it any good?

---

# Nested vs. non-nested models

- Compare `~s(x)+s(depth)` with `~s(x)`
  - nested models
- What about `s(x) + s(y)` vs. `s(x, y)`
  - don't want to have all these in the model
  - not nested models

---

# Measures of "fit"

- Two listed in `summary`
  - Deviance explained
  - Adjusted `$R^2$` (not useful)
- Deviance is a generalisation of `$R^2$`
- Highest likelihood value (*saturated* model) minus estimated model value
- (These are usually not very high for DSMs)

---

# AIC

- Can get AIC from our model
- Comparison of AIC fine (but not the end of the story)

```r
AIC(dsm_all)
```

```
## [1] 1238.307
```

```r
AIC(dsm_ts_all)
```

```
## [1] 1225.822
```

---

# A quick note about REML scores

- Use REML to select the smoothness
- Can also use the score to do model selection
- **BUT** only compare models with the same fixed effects
  - (i.e., same "linear terms" in the model)
- `$\Rightarrow$` **All terms** must be penalised
  - `bs="ts"` or `select=TRUE`

---

# Selecting between response distributions

---

# Goodness of fit tests

- Q-Q plots
- Closer to the line == better
![](dsm4-multiple-smooths_files/figure-html/gof-qq-1.png)

---

# Tobler's first law of geography

## *"Everything is related to everything else, but near things are more related than distant things"*

### Tobler (1970)

---

# Implications of Tobler's law

![](dsm4-multiple-smooths_files/figure-html/pairrrrs-1.png)

---

## Covariates are not only correlated (linearly)...<br/><br/>...they are also "concurve"

### *"How much can one smooth be approximated by one or more other smooths?"*

---

# Concurvity (model/smooth)

```r
concurvity(dsm_all)
```

```
##                  para    s(x,y)  s(Depth) s(DistToCAS)    s(SST)    s(EKE)
## worst    2.539199e-23 0.9963493 0.9836597    0.9959057 0.9772853 0.7702479
## observed 2.539199e-23 0.9213461 0.8275679    0.9883162 0.6951997 0.6615697
## estimate 2.539199e-23 0.7580838 0.9272203    0.9642030 0.8978412 0.4906765
##             s(NPP)
## worst    0.9727752
## observed 0.8258504
## estimate 0.8694619
```

---

# Concurvity between smooths

```r
concurvity(dsm_all, full=FALSE)$estimate
```

```
##                      para       s(x,y)     s(Depth) s(DistToCAS)
## para         1.000000e+00 4.700364e-26 4.640330e-28 6.317431e-27
## s(x,y)       8.687343e-24 1.000000e+00 9.067347e-01 9.568609e-01
## s(Depth)     1.960563e-25 2.247389e-01 1.000000e+00 2.699392e-01
## s(DistToCAS) 2.964353e-24 4.335154e-01 2.568123e-01 1.000000e+00
## s(SST)       3.614289e-25 5.102860e-01 3.707617e-01 5.107111e-01
## s(EKE)       1.283557e-24 1.220299e-01 1.527425e-01 1.205373e-01
## s(NPP)       2.034284e-25 4.407590e-01 2.067464e-01 2.701934e-01
##                    s(SST)       s(EKE)       s(NPP)
## para         5.042066e-28 3.615073e-27 6.078290e-28
## s(x,y)       7.205518e-01 3.201531e-01 6.821674e-01
## s(Depth)     1.232244e-01 6.422005e-02 1.990567e-01
## s(DistToCAS) 2.554027e-01 1.319306e-01 2.590227e-01
## s(SST)       1.000000e+00 1.735256e-01 7.616800e-01
## s(EKE)       2.410615e-01 1.000000e+00 2.787592e-01
## s(NPP)       7.833972e-01 1.033109e-01 1.000000e+00
```

---

# Visualising concurvity between terms

---
class: inverse, middle, center

# Path dependence

---

# Sensitivity

- General path dependency?
- What if there are highly concurve smooths?
- Is the model is sensitive to them?

---

# What can we do?

- Fit variations excluding smooths
  - Concurve terms that are excluded early on
- Appendix of Winiarski et al (2014) has an example

---

# Sensitivity example

- `s(Depth)` and `s(x, y)` are highly concurve (0.9067)
- Refit removing `Depth` first

```
## # with depth
```

```
##                       edf Ref.df            F      p-value
## s(x,y)       1.887521e+00     29 7.053433e-01 3.564239e-06
## s(Depth)     3.679418e+00      9 4.811277e+00 2.149915e-10
## s(DistToCAS) 9.339573e-05      9 1.709199e-06 6.796698e-01
## s(SST)       3.826654e-01      9 6.305406e-02 2.159885e-01
## s(EKE)       8.196256e-01      9 4.989110e-01 1.775703e-02
## s(NPP)       3.570476e-04      9 1.615127e-06 8.359159e-01
```

```
## # without depth
```

```
##                       edf Ref.df            F      p-value
## s(x,y)       1.377758e+01     29 2.588645e+00 1.164871e-12
## s(DistToCAS) 1.869964e-04      9 2.575100e-06 6.877802e-01
## s(SST)       3.679648e-04      9 6.186926e-06 6.658640e-01
## s(EKE)       8.448440e-01      9 5.669750e-01 1.050438e-02
## s(NPP)       7.993690e-01      9 3.627747e-01 3.229658e-02
```

---

# Comparison of spatial effects

![](dsm4-multiple-smooths_files/figure-html/sensitivity-vis-1.png)

---

# Sensitivity example

- Refit removing `x` and `y`...

```
## # without xy
```

```
##                       edf Ref.df            F      p-value
## s(DistToCAS) 0.0002587354      9 1.804557e-06 8.019754e-01
## s(SST)       3.7878795045      9 1.401853e+00 2.951200e-03
## s(Depth)     3.6560507550      9 4.260082e+00 3.020240e-09
## s(EKE)       0.8089036590      9 4.616991e-01 2.181356e-02
## s(NPP)       2.4578806949      9 1.371706e+00 1.115487e-03
```

```
## # with xy
```

---

# Comparison of depth smooths

![](dsm4-multiple-smooths_files/figure-html/sensitivity-depth-1.png)

---

# Comparing those three models...

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Model </th>
   <th style="text-align:right;"> AIC </th>
   <th style="text-align:right;"> Deviance </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> `all` </td>
   <td style="text-align:right;"> 1225.822 </td>
   <td style="text-align:right;"> 35.10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> `s(x,y)+s(EKE)+s(NPP)` </td>
   <td style="text-align:right;"> 1248.172 </td>
   <td style="text-align:right;"> 34.54 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> `s(SST)+s(Depth)` </td>
   <td style="text-align:right;"> 1228.106 </td>
   <td style="text-align:right;"> 39.28 </td>
  </tr>
</tbody>
</table>

- "Full" model still explains most deviance
- No depth model requires spatial smooth to "mop up" extra variation
- We'll come back to this when we do prediction

---
class: inverse, middle, center

# Recap

---

# Recap

- Adding smooths
- Removing smooths
  - `$p$`-values
  - shrinkage/extra penalties
- Comparing models
- Comparing response distributions
- Sensitivity