From the inbox: How can I get fold assignments from spatialsample?

In my inbox,¹ someone asks:²

I’m using spatial_clustering_cv() from spatialsample to do cross-validation. How can I get separate data frames with each split created by this function?

I think this question is decently common, because a lot of the spatialsample documentation is written assuming that you’re familiar with rsample already, which is often not the case for people working with spatial data. The functions to do this sort of thing live in rsample, and aren’t (currently) re-exported by spatialsample, so it can be hard to find the right function.

First and foremost, let’s assume that you’ve got some object called my_folds created by spatial_clustering_cv():

library(spatialsample)
my_folds <- spatial_clustering_cv(boston_canopy, v = 2)
my_folds

#  2-fold spatial cross-validation 
# A tibble: 2 × 2
  splits            id   
  <list>            <chr>
1 <split [277/405]> Fold1
2 <split [405/277]> Fold2

The my_folds object has a splits column, which is a list column. Each element of that list is a single split, containing your analysis and assessment sets.³ To get a single split, use rsample::get_rsplit():

rsample::get_rsplit(my_folds, 1)

<Analysis/Assess/Total>
<277/405/682>

To get just the analysis data for that fold, use rsample::analysis():

rsample::get_rsplit(my_folds, 1) |>
  rsample::analysis()

Simple feature collection with 277 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 755424.9 ymin: 2935616 xmax: 812069.7 ymax: 2970073
Projected CRS: NAD83 / Massachusetts Mainland (ftUS)
# A tibble: 277 × 19
   grid_id land_area canopy_gain canopy_loss canopy_no_change canopy_area_2014
   <chr>       <dbl>       <dbl>       <dbl>            <dbl>            <dbl>
 1 AB-4      795045.      15323.       3126.           53676.           56802.
 2 AO-9      270153        6187.       1184.           26930.           28114.
 3 V-7       107890.        219.       3612.             240.            3852.
 4 X-4       848558.       8275.       1760.            6872.            8632.
 5 AC-4     2069814.      82201.      50944.          240161.          291104.
 6 AC-15    1175032.      24517.      24010.          111148.          135158.
 7 U-14     2690727.      69780.      51404.          263796.          315201.
 8 AQ-15     453368.      13971.       3401.          343677.          347077.
 9 Q-10      156688.       9237.       3094.           57327.           60421.
10 T-10      215340.      13984.       3947.           59539.           63487.
# ℹ 267 more rows
# ℹ 13 more variables: canopy_area_2019 <dbl>, change_canopy_area <dbl>,
#   change_canopy_percentage <dbl>, canopy_percentage_2014 <dbl>,
#   canopy_percentage_2019 <dbl>, change_canopy_absolute <dbl>,
#   mean_temp_morning <dbl>, mean_temp_evening <dbl>, mean_temp <dbl>,
#   mean_heat_index_morning <dbl>, mean_heat_index_evening <dbl>,
#   mean_heat_index <dbl>, geometry <MULTIPOLYGON [US_survey_foot]>

Similarly, to get just the assessment data for that fold, use rsample::assessment():

rsample::get_rsplit(my_folds, 1) |>
  rsample::assessment()

Simple feature collection with 405 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 739826.9 ymin: 2908294 xmax: 781347.5 ymax: 2959751
Projected CRS: NAD83 / Massachusetts Mainland (ftUS)
# A tibble: 405 × 19
   grid_id land_area canopy_gain canopy_loss canopy_no_change canopy_area_2014
   <chr>       <dbl>       <dbl>       <dbl>            <dbl>            <dbl>
 1 I-33      265813.       8849.      11795.           78677.           90472.
 2 H-10     2691490.      73098.      80362.          345823.          426185.
 3 Q-22     2648089.     122211.     154236.         1026632.         1180868.
 4 P-18     2690726.     110928.     113146.          915137.         1028283.
 5 J-29     2574479.      38069.      15530.         2388638.         2404168.
 6 G-28     2641525.      87024.      39246.         1202528.         1241774.
 7 M-23     2690727.      87621.     124032.          748742.          872774.
 8 M-9      2690727.      52443.      53467.          304239.          357706.
 9 S-15     2690728.      93787.     162118.          478257.          640375.
10 Q-21     2690727.      54712.     101816.         1359305.         1461121.
# ℹ 395 more rows
# ℹ 13 more variables: canopy_area_2019 <dbl>, change_canopy_area <dbl>,
#   change_canopy_percentage <dbl>, canopy_percentage_2014 <dbl>,
#   canopy_percentage_2019 <dbl>, change_canopy_absolute <dbl>,
#   mean_temp_morning <dbl>, mean_temp_evening <dbl>, mean_temp <dbl>,
#   mean_heat_index_morning <dbl>, mean_heat_index_evening <dbl>,
#   mean_heat_index <dbl>, geometry <MULTIPOLYGON [US_survey_foot]>
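
To answer the original question directly, you can repeat this for every fold at once. Here’s a minimal sketch, assuming you want two named lists, one of analysis sets and one of assessment sets. The names analysis_sets and assessment_sets are my own, not anything provided by rsample or spatialsample:

```r
library(spatialsample)

my_folds <- spatial_clustering_cv(boston_canopy, v = 2)

# The splits column is a plain list, so we can map over it directly
analysis_sets <- purrr::map(my_folds$splits, rsample::analysis)
assessment_sets <- purrr::map(my_folds$splits, rsample::assessment)

# Name each element after its fold so it's easy to look up later
names(analysis_sets) <- my_folds$id
names(assessment_sets) <- my_folds$id

# Each element is a separate sf data frame for a single fold
nrow(analysis_sets[["Fold1"]])
nrow(assessment_sets[["Fold1"]])
```

Mapping over my_folds$splits skips get_rsplit() entirely, which is handy when you want every fold rather than one in particular.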

If you’re trying to get your original data, with a column indicating which fold each row belongs to, there’s not a provided function for that. Instead, you can take the assessment set from each split (the assessment set is the fold that those rows are assigned to), add a new column to it with the fold name, and then combine those assessment sets into a single data frame. I do this with purrr::map2():⁴

purrr::map2(
  my_folds$splits, 
  my_folds$id, 
  \(split, id) cbind(rsample::assessment(split), fold_name = id)
) |> 
  dplyr::bind_rows()

Simple feature collection with 682 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 739826.9 ymin: 2908294 xmax: 812069.7 ymax: 2970073
Projected CRS: NAD83 / Massachusetts Mainland (ftUS)
First 10 features:
   grid_id land_area canopy_gain canopy_loss canopy_no_change canopy_area_2014
1     I-33  265813.3    8848.818    11795.11         78676.56         90471.67
2     H-10 2691489.9   73098.168    80361.85        345823.19        426185.04
3     Q-22 2648088.6  122211.269   154236.43       1026631.85       1180868.27
4     P-18 2690726.1  110927.833   113145.85        915137.00       1028282.85
5     J-29 2574478.7   38068.676    15529.73       2388638.19       2404167.92
6     G-28 2641525.3   87024.318    39246.15       1202527.94       1241774.09
7     M-23 2690727.2   87620.730   124031.79        748742.13        872773.92
8      M-9 2690726.6   52443.164    53466.56        304239.49        357706.04
9     S-15 2690727.8   93786.589   162118.16        478257.33        640375.48
10    Q-21 2690727.2   54711.827   101815.82       1359305.11       1461120.93
   canopy_area_2019 change_canopy_area change_canopy_percentage
1          87525.38          -2946.293               -3.2565923
2         418921.35          -7263.685               -1.7043502
3        1148843.12         -32025.158               -2.7120009
4        1026064.83          -2218.014               -0.2157008
5        2426706.87          22538.944                0.9374946
6        1289552.26          47778.164                3.8475730
7         836362.86         -36411.060               -4.1718776
8         356682.65          -1023.393               -0.2860988
9         572043.92         -68331.566              -10.6705469
10       1414016.94         -47103.991               -3.2238256
   canopy_percentage_2014 canopy_percentage_2019 change_canopy_absolute
1                34.03579               32.92739            -1.10840701
2                15.83454               15.56466            -0.26987600
3                44.59323               43.38386            -1.20936883
4                38.21581               38.13338            -0.08243181
5                93.38465               94.26013             0.87547604
6                47.00974               48.81847             1.80873391
7                32.43636               31.08315            -1.35320518
8                13.29403               13.25600            -0.03803406
9                23.79934               21.25982            -2.53951988
10               54.30208               52.55148            -1.75060448
   mean_temp_morning mean_temp_evening mean_temp mean_heat_index_morning
1           74.26247          83.87540  90.85933                75.63458
2           74.64432          84.96917  91.71625                75.86767
3           73.19889          82.29358  89.70302                74.47757
4           73.77269          84.29003  91.26480                75.03802
5           72.26419          79.77278  88.70229                73.65608
6           73.60919          82.80297  90.33156                74.96955
7           74.24167          83.34713  90.41143                75.66013
8           76.74740          84.69933  91.96502                77.91048
9           75.18260          84.85431  92.00132                76.39949
10          73.37669          82.38064  90.59503                74.63029
   mean_heat_index_evening mean_heat_index fold_name
1                 89.71880        96.70939     Fold1
2                 89.88733        96.19667     Fold1
3                 87.34062        95.53811     Fold1
4                 88.93811        96.43569     Fold1
5                 81.32060        95.56059     Fold1
6                 88.47864        96.82653     Fold1
7                 89.23434        96.05418     Fold1
8                 90.02009        96.14348     Fold1
9                 89.91342        96.92160     Fold1
10                86.90021        96.23439     Fold1
                         geometry
1  MULTIPOLYGON (((752945.6 29...
2  MULTIPOLYGON (((751419.1 29...
3  MULTIPOLYGON (((763631.7 29...
4  MULTIPOLYGON (((763122.9 29...
5  MULTIPOLYGON (((753963.4 29...
6  MULTIPOLYGON (((749383.6 29...
7  MULTIPOLYGON (((758543.1 29...
8  MULTIPOLYGON (((758543.1 29...
9  MULTIPOLYGON (((767702.6 29...
10 MULTIPOLYGON (((764649.4 29...
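
Note that the combined data frame above is ordered by fold, not by the row order of the original data. If you need the original order, one alternative is to pull the assessment row indices out of each split and fill in a new column by position. This is only a sketch: it assumes that as.integer() on a split with data = "assessment" returns row positions in the original data (which I believe is what rsample’s method for rsplit objects does), and fold_data and fold_name are just names I picked for illustration:

```r
library(spatialsample)

my_folds <- spatial_clustering_cv(boston_canopy, v = 2)

# Start with an empty fold label for every row of the original data
fold_name <- rep(NA_character_, nrow(boston_canopy))

for (i in seq_len(nrow(my_folds))) {
  # Assumption: these are row positions in the original data frame
  rows <- as.integer(my_folds$splits[[i]], data = "assessment")
  fold_name[rows] <- my_folds$id[[i]]
}

# Attach the labels to a copy of the original data, keeping its row order
fold_data <- boston_canopy
fold_data$fold_name <- fold_name

table(fold_data$fold_name)
```

Because each row lands in exactly one assessment set with this resampling method, every row should end up with a fold label. That wouldn’t hold if you set a buffer, since buffered rows are dropped from both sets.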

I think it would make sense for get_rsplit(), analysis(), and assessment() to get ported over to spatialsample, to make it a bit easier for the folks whose first point-of-entry into tidymodels work is via spatialsample. I’ve got a GitHub issue to remind me to look into that before the package’s next release.

Footnotes

¹ I want to mention that I include a link to Yihui Xie’s excellent blog post in replies to help questions sent via email. I love seeing people use my packages, and I love helping people use them, but I don’t always have the time to give 1:1 help via email. If you post a question somewhere publicly, then other people might give an even better answer; if no one answers in a day or two, then email me the link, so I can answer it publicly and have a link to send the next person with the same question as you. That’s also why I turned this into a blog post – so that I can send others with the same question a pre-written answer!
² Anonymized and heavily paraphrased.
³ Sometimes called training and testing, respectively – rsample uses the analysis/assessment terminology to make it clear that all of this data should be in your training set, and doesn’t touch your final held-out test set.
⁴ This is how spatialsample’s autoplot() methods do it, for instance.