June 6, 2023
I’m using spatial_clustering_cv() from spatialsample to do cross-validation. How can I get separate data frames with each split created by this function?
I think this question is fairly common, because a lot of the spatialsample documentation is written assuming that you’re already familiar with rsample, which is often not the case for people working with spatial data. The functions that do this sort of thing live in rsample and aren’t (currently) re-exported by spatialsample, so it can be hard to find the right one.
First, let’s assume that you’ve got an object called my_folds, created by spatial_clustering_cv():
library(spatialsample)
my_folds <- spatial_clustering_cv(boston_canopy, v = 2)
my_folds
# 2-fold spatial cross-validation
# A tibble: 2 × 2
  splits            id
  <list>            <chr>
1 <split [277/405]> Fold1
2 <split [405/277]> Fold2
The my_folds object should have a splits column, which is a list. Each element of that list contains your analysis and assessment sets.[3] To get a single split, use rsample::get_rsplit().
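A minimal sketch of that call, assuming you want the first fold. The names my_split and same_split are only illustrative, and the seed is there because the clustering step can involve random initialisation:

```r
library(spatialsample)

# Assumption: a seed keeps the fold assignment reproducible between runs
set.seed(123)
my_folds <- spatial_clustering_cv(boston_canopy, v = 2)

# Pull out the first split by its position in the resampling object
my_split <- rsample::get_rsplit(my_folds, 1)

# Indexing the list column directly should give you the same split
same_split <- my_folds$splits[[1]]
```

Either form hands back a single split object that you can pass on to the accessor functions described next.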
To get just the analysis data for that fold, use rsample::analysis():
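Output like the listing below comes from a call along these lines. This is a sketch that assumes the fold of interest is the first one (my_split and analysis_data are illustrative names); the exact row counts depend on how the clusters happen to be drawn.

```r
library(spatialsample)

my_folds <- spatial_clustering_cv(boston_canopy, v = 2)
my_split <- rsample::get_rsplit(my_folds, 1)

# Rows available for model fitting in this fold, returned as an sf object
analysis_data <- rsample::analysis(my_split)
analysis_data
```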
Simple feature collection with 277 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 755424.9 ymin: 2935616 xmax: 812069.7 ymax: 2970073
Projected CRS: NAD83 / Massachusetts Mainland (ftUS)
# A tibble: 277 × 19
grid_id land_area canopy_gain canopy_loss canopy_no_change canopy_area_2014
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB-4 795045. 15323. 3126. 53676. 56802.
2 AO-9 270153 6187. 1184. 26930. 28114.
3 V-7 107890. 219. 3612. 240. 3852.
4 X-4 848558. 8275. 1760. 6872. 8632.
5 AC-4 2069814. 82201. 50944. 240161. 291104.
6 AC-15 1175032. 24517. 24010. 111148. 135158.
7 U-14 2690727. 69780. 51404. 263796. 315201.
8 AQ-15 453368. 13971. 3401. 343677. 347077.
9 Q-10 156688. 9237. 3094. 57327. 60421.
10 T-10 215340. 13984. 3947. 59539. 63487.
# ℹ 267 more rows
# ℹ 13 more variables: canopy_area_2019 <dbl>, change_canopy_area <dbl>,
# change_canopy_percentage <dbl>, canopy_percentage_2014 <dbl>,
# canopy_percentage_2019 <dbl>, change_canopy_absolute <dbl>,
# mean_temp_morning <dbl>, mean_temp_evening <dbl>, mean_temp <dbl>,
# mean_heat_index_morning <dbl>, mean_heat_index_evening <dbl>,
# mean_heat_index <dbl>, geometry <MULTIPOLYGON [US_survey_foot]>
Similarly, to get just the assessment data for that fold, use rsample::assessment():
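Again as a sketch under the same assumptions (first fold, illustrative object names), the call might look like this. The last line checks that the two sets together account for every row of the original data, which should hold here because no buffer is applied:

```r
library(spatialsample)

my_folds <- spatial_clustering_cv(boston_canopy, v = 2)
my_split <- rsample::get_rsplit(my_folds, 1)

# Held-out rows for this fold, returned as an sf object
assessment_data <- rsample::assessment(my_split)
assessment_data

# Analysis and assessment rows together should cover the full data set
n_analysis <- nrow(rsample::analysis(my_split))
n_analysis + nrow(assessment_data) == nrow(boston_canopy)
```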
Simple feature collection with 405 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 739826.9 ymin: 2908294 xmax: 781347.5 ymax: 2959751
Projected CRS: NAD83 / Massachusetts Mainland (ftUS)
# A tibble: 405 × 19
grid_id land_area canopy_gain canopy_loss canopy_no_change canopy_area_2014
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 I-33 265813. 8849. 11795. 78677. 90472.
2 H-10 2691490. 73098. 80362. 345823. 426185.
3 Q-22 2648089. 122211. 154236. 1026632. 1180868.
4 P-18 2690726. 110928. 113146. 915137. 1028283.
5 J-29 2574479. 38069. 15530. 2388638. 2404168.
6 G-28 2641525. 87024. 39246. 1202528. 1241774.
7 M-23 2690727. 87621. 124032. 748742. 872774.
8 M-9 2690727. 52443. 53467. 304239. 357706.
9 S-15 2690728. 93787. 162118. 478257. 640375.
10 Q-21 2690727. 54712. 101816. 1359305. 1461121.
# ℹ 395 more rows
# ℹ 13 more variables: canopy_area_2019 <dbl>, change_canopy_area <dbl>,
# change_canopy_percentage <dbl>, canopy_percentage_2014 <dbl>,
# canopy_percentage_2019 <dbl>, change_canopy_absolute <dbl>,
# mean_temp_morning <dbl>, mean_temp_evening <dbl>, mean_temp <dbl>,
# mean_heat_index_morning <dbl>, mean_heat_index_evening <dbl>,
# mean_heat_index <dbl>, geometry <MULTIPOLYGON [US_survey_foot]>
If you’re trying to get your original data with a column indicating which fold each row belongs to, there isn’t a function provided for that. Instead, take the assessment set from each split (the assessment set is the fold that a row is assigned to), add a column holding the fold name, and then combine those assessment sets into a single data frame. I do this with purrr::map2() and dplyr::bind_rows() [4]:
purrr::map2(
  my_folds$splits,
  my_folds$id,
  # Tag each assessment set with the ID of its fold
  \(split, id) cbind(rsample::assessment(split), fold_name = id)
) |>
  # Stack the per-fold data frames into one
  dplyr::bind_rows()
Simple feature collection with 682 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 739826.9 ymin: 2908294 xmax: 812069.7 ymax: 2970073
Projected CRS: NAD83 / Massachusetts Mainland (ftUS)
First 10 features:
grid_id land_area canopy_gain canopy_loss canopy_no_change canopy_area_2014
1 I-33 265813.3 8848.818 11795.11 78676.56 90471.67
2 H-10 2691489.9 73098.168 80361.85 345823.19 426185.04
3 Q-22 2648088.6 122211.269 154236.43 1026631.85 1180868.27
4 P-18 2690726.1 110927.833 113145.85 915137.00 1028282.85
5 J-29 2574478.7 38068.676 15529.73 2388638.19 2404167.92
6 G-28 2641525.3 87024.318 39246.15 1202527.94 1241774.09
7 M-23 2690727.2 87620.730 124031.79 748742.13 872773.92
8 M-9 2690726.6 52443.164 53466.56 304239.49 357706.04
9 S-15 2690727.8 93786.589 162118.16 478257.33 640375.48
10 Q-21 2690727.2 54711.827 101815.82 1359305.11 1461120.93
canopy_area_2019 change_canopy_area change_canopy_percentage
1 87525.38 -2946.293 -3.2565923
2 418921.35 -7263.685 -1.7043502
3 1148843.12 -32025.158 -2.7120009
4 1026064.83 -2218.014 -0.2157008
5 2426706.87 22538.944 0.9374946
6 1289552.26 47778.164 3.8475730
7 836362.86 -36411.060 -4.1718776
8 356682.65 -1023.393 -0.2860988
9 572043.92 -68331.566 -10.6705469
10 1414016.94 -47103.991 -3.2238256
canopy_percentage_2014 canopy_percentage_2019 change_canopy_absolute
1 34.03579 32.92739 -1.10840701
2 15.83454 15.56466 -0.26987600
3 44.59323 43.38386 -1.20936883
4 38.21581 38.13338 -0.08243181
5 93.38465 94.26013 0.87547604
6 47.00974 48.81847 1.80873391
7 32.43636 31.08315 -1.35320518
8 13.29403 13.25600 -0.03803406
9 23.79934 21.25982 -2.53951988
10 54.30208 52.55148 -1.75060448
mean_temp_morning mean_temp_evening mean_temp mean_heat_index_morning
1 74.26247 83.87540 90.85933 75.63458
2 74.64432 84.96917 91.71625 75.86767
3 73.19889 82.29358 89.70302 74.47757
4 73.77269 84.29003 91.26480 75.03802
5 72.26419 79.77278 88.70229 73.65608
6 73.60919 82.80297 90.33156 74.96955
7 74.24167 83.34713 90.41143 75.66013
8 76.74740 84.69933 91.96502 77.91048
9 75.18260 84.85431 92.00132 76.39949
10 73.37669 82.38064 90.59503 74.63029
mean_heat_index_evening mean_heat_index fold_name
1 89.71880 96.70939 Fold1
2 89.88733 96.19667 Fold1
3 87.34062 95.53811 Fold1
4 88.93811 96.43569 Fold1
5 81.32060 95.56059 Fold1
6 88.47864 96.82653 Fold1
7 89.23434 96.05418 Fold1
8 90.02009 96.14348 Fold1
9 89.91342 96.92160 Fold1
10 86.90021 96.23439 Fold1
geometry
1 MULTIPOLYGON (((752945.6 29...
2 MULTIPOLYGON (((751419.1 29...
3 MULTIPOLYGON (((763631.7 29...
4 MULTIPOLYGON (((763122.9 29...
5 MULTIPOLYGON (((753963.4 29...
6 MULTIPOLYGON (((749383.6 29...
7 MULTIPOLYGON (((758543.1 29...
8 MULTIPOLYGON (((758543.1 29...
9 MULTIPOLYGON (((767702.6 29...
10 MULTIPOLYGON (((764649.4 29...
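If you want to confirm that the combined data frame really covers the original data, a quick check along these lines may help. It is only a sketch: folded is an illustrative name, and the check assumes each row lands in exactly one assessment set, as it does in the two-fold example above.

```r
library(spatialsample)

my_folds <- spatial_clustering_cv(boston_canopy, v = 2)

# Same approach as above: label each assessment set, then stack them
folded <- purrr::map2(
  my_folds$splits,
  my_folds$id,
  \(split, id) cbind(rsample::assessment(split), fold_name = id)
) |>
  dplyr::bind_rows()

# Every original row should appear once across the assessment sets
stopifnot(nrow(folded) == nrow(boston_canopy))

# Number of rows assigned to each fold
table(folded$fold_name)
```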
I think it would make sense for get_rsplit(), analysis(), and assessment() to get ported over to spatialsample, to make things a bit easier for folks whose first point of entry into tidymodels is via spatialsample. I’ve got a GitHub issue to remind me to look into that before the package’s next release.
[1] I want to mention that I include a link to Yihui Xie’s excellent blog post in replies to help questions sent via email. I love seeing people use my packages, and I love helping people use them, but I don’t always have the time to give 1:1 help via email. If you post a question somewhere publicly, then other people might give an even better answer; if no one answers in a day or two, then email me the link, so I can answer it publicly and have a link to send the next person with the same question as you. That’s also why I turned this into a blog post: so that I can send others with the same question a pre-written answer!
[2] Anonymized and heavily paraphrased.
[3] Sometimes called training and testing, respectively. rsample uses the analysis/assessment terminology to make it clear that all of this data should be in your training set, and doesn’t touch your final held-out test set.
[4] This is how spatialsample’s autoplot() methods do it, for instance.