I am helping out with some plots for a paper: simple X versus Y plots, and would like to provide some stats to improve the paper. My statistics knowledge is rusty and I am an R-noob with only a few weeks of Tidyverse under my belt, so I would appreciate some input. Am I on the right track?

The "X"-data are averages of about 50 time series having clear trends of various shapes, mostly convex and/or increasing; i.e. they are autocorrelated. Each time series has in median 45 values. The bootstrap sample histograms of the averages are mostly normal distributed or very close, but not all. Each average is plotted against one Y-value (I.e. only 50 Y-values available). Another complication is that for a handful of the time series, 2-5 averages are computed from the same time series (using different subsets of the values) making these Y-values internally correlated as well. I.e. Only 35-40 time series exist. (It is yearly data with Y-vaues for one or more years at the end)

Using all the data in the bootstrap simulations (stratified BCa-bootstrap with function 'boot' and 'boot.ci' in R, first computing series averages then correlation) gives a suspiciously good confidence interval: correlation 0.78 (0.77, 0.79 ) 95%. See code below!

But sampling from only the final 50 X, Y -pairs gives too poor a result in my mind. 0.78 (0.60, 0.90) 95%. (Code not given)

My questions:

The "X"-data are averages of about 50 time series having clear trends of various shapes, mostly convex and/or increasing; i.e. they are autocorrelated. Each time series has in median 45 values. The bootstrap sample histograms of the averages are mostly normal distributed or very close, but not all. Each average is plotted against one Y-value (I.e. only 50 Y-values available). Another complication is that for a handful of the time series, 2-5 averages are computed from the same time series (using different subsets of the values) making these Y-values internally correlated as well. I.e. Only 35-40 time series exist. (It is yearly data with Y-vaues for one or more years at the end)

Using all the data in the bootstrap simulations (stratified BCa-bootstrap with function 'boot' and 'boot.ci' in R, first computing series averages then correlation) gives a suspiciously good confidence interval: correlation 0.78 (0.77, 0.79 ) 95%. See code below!

But sampling from only the final 50 X, Y -pairs gives too poor a result in my mind. 0.78 (0.60, 0.90) 95%. (Code not given)

Code:

```
testst <-function(tb,i) {
# Generate correlation between yStrat and xStrat.
# arg: tibble with about 2500 rows
tbi <- tb[i, ] # Bootstrap sample
yTbi <- tbi %>% group_by(StratumID) %>%
summarise(yStrat=last(YPrevalence)) # All Y identical per stratum
# Could have used unique as well
# Apply mean to each stratum of bootstrapped draws from stratum
xTbi <- tbi %>% group_by(StratumID) %>%
summarise(xStrat = mean(H, na.rm = TRUE))
# Apply cor to fixed y values against each stratum mean
pears <- cor(x = xTbi$xStrat,
y = yTbi$yStrat,
method = "pearson")
ctest <- cor.test(x = xTbi$xStrat,
y = yTbi$yStrat,
alternative="greater",
method="p" # "p" or "s"
)
return(c(pears, ctest$p.value))
}
# Called on data tibble tbd with ~2000 rows, ~50 strata
tbd$StratumID <- as_factor(tbd$StratumID)
bt <- boot(tbd,testst,R=10000,strata=tbd$StratumID)
bc<-boot.ci(bt,type="bca",index=i) # i=1 or 2
```

- Should I trust the bootstrap result on the complete 2000 + 50 data? I.e. does the bootstrap automatically account for autocorrelation and the dearth of Y-values when computing the confidence interval? If not: how do I compute a more valid confidence interval?
- How do I compute the correct P-value and degrees of freedom using bootstrap (Zero hypothesis: correlation = 0). With 'cor.test' in each bootstrap sample but that way yields only about 50 degrees of freedom.

Last edited: