6  Predictor with Maximum Effect Size

Important Disclaimer

These packages (Note 1) are a one-person project undergoing rapid evolution. Backward compatibility (per Hadley Wickham) is provided as a courtesy rather than a guarantee.

Until further notice, these packages should

  • not be used as a basis for research grant applications,
  • not be cited as an actively maintained tool in a peer-reviewed manuscript,
  • not be used to support or fulfill requirements for pursuing an academic degree.

In addition, work primarily based on these packages (Note 1) should not be presented at academic conferences or similar scholarly venues.

Furthermore, a person’s ability to use these packages (Note 1) does not necessarily imply an understanding of their underlying mechanisms. Accordingly, demonstration of their use alone should not be considered sufficient evidence of expertise, nor should it be credited as a basis for academic promotion or advancement.

These statements do not apply to the contributors (Tip 1) to these packages (Note 1) with respect to their specific contributions.

These statements do not apply when the maintainer of these packages (Note 1), Tingting Zhan, is credited as the first author, the lead author, and/or the corresponding author in a peer-reviewed manuscript, or as the Principal Investigator or Co-Principal Investigator in a research grant application and/or a final research progress report.

These statements are advisory in nature and do not modify or restrict the rights granted under the GNU General Public License https://www.r-project.org/Licenses/.

Note

The examples in Chapter 6 require

library(maxEff)
library(survival)

🚧 Chapter 6 is being rewritten for clearer English. All code here is correct, though. Expected delivery by 2025-12-31.

Chapter 6 documents methods to select, from many candidate numeric predictors, the one to add to a regression model such that the additional predictor has the maximum effect size.

6.1 Compute Aggregated Quantiles

Following Section 5.1, we compute the aggregated quantiles of logKi67 at the patient level.

Ki67q = groupedHyperframe::Ki67 |>
  within.data.frame(expr = {
    x = y = NULL # remove x- and y-coords for non-spatial application
  }) |>
  aggregate2hyper(by = ~ patientID/tissueID) |>
  within(expr = {
    logKi67.q = quantile(logKi67, probs = seq.int(from = .01, to = .99, by = .01))
  }) |>
  aggregate() |>
  within(expr = {
    logKi67.q = logKi67.q |> do.call(what = pmean)
  })
# Hypercolumn(s) logKi67 created!
# Variable(s) tissueID removed from aggregation
A hyper data frame Ki67q with aggregated quantiles
Ki67q |>
  head()
# Hyperframe:
#   Tstage  PFS adj_rad adj_chemo histology  Her2   HR  node  race age patientID   logKi67 logKi67.q
# 1      2 100+   FALSE     FALSE         3  TRUE TRUE  TRUE White  66   PT00037 (anylist) (numeric)
# 2      1   22   FALSE     FALSE         3 FALSE TRUE FALSE Black  42   PT00039 (anylist) (numeric)
# 3      1  99+   FALSE        NA         3 FALSE TRUE FALSE White  60   PT00040 (anylist) (numeric)
# 4      1  99+   FALSE      TRUE         3 FALSE TRUE  TRUE White  53   PT00042 (anylist) (numeric)
# 5      1  112    TRUE      TRUE         3 FALSE TRUE  TRUE White  52   PT00054 (anylist) (numeric)
# 6      4   12    TRUE     FALSE         2  TRUE TRUE  TRUE Black  51   PT00059 (anylist) (numeric)
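The per-tissue quantile vectors above are averaged element-wise into one patient-level vector. A minimal base-R sketch of that idea on simulated data, with an explicit element-wise mean as an illustrative stand-in for pmean (the tissue vectors t1 and t2 are hypothetical):

```r
set.seed(0)
# two hypothetical tissue samples from the same patient
t1 <- rnorm(50L); t2 <- rnorm(80L)
probs <- seq.int(from = .01, to = .99, by = .01)
q1 <- quantile(t1, probs = probs)  # per-tissue quantile vectors
q2 <- quantile(t2, probs = probs)
# element-wise ("parallel") mean across tissues: one vector per patient
q.patient <- (q1 + q2) / 2
length(q.patient)  # 99 aggregated quantiles, named "1%" through "99%"
```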

6.2 Data Partition

Partition into a training (80%) and test (20%) set.

set.seed(32); id = sample.int(n = nrow(Ki67q), size = nrow(Ki67q)*.8)
s0 = Ki67q[id, , drop = FALSE] # training set
s1 = Ki67q[-id, , drop = FALSE] # test set
dim(s0)
# [1] 497  13
dim(s1)
# [1] 125  13

6.3 Starting Model

Let’s consider a starting model of the endpoint PFS with the predictor Tstage on the training set s0. As of spatstat.geom v3.7.3 (GPL (>= 2)), the S3 method as.data.frame.hyperframe() warns on hypercolumns that cannot be converted to a data frame.

m0 = coxph(PFS ~ Tstage, data = s0)
Starting coxph model m0
summary(m0)
# Call:
# coxph(formula = PFS ~ Tstage, data = s0)
# 
#   n= 497, number of events= 91 
# 
#          coef exp(coef) se(coef)    z Pr(>|z|)    
# Tstage 0.6725    1.9591   0.1061 6.34 2.29e-10 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
#        exp(coef) exp(-coef) lower .95 upper .95
# Tstage     1.959     0.5104     1.591     2.412
# 
# Concordance= 0.654  (se = 0.028 )
# Likelihood ratio test= 32.23  on 1 df,   p=1e-08
# Wald test            = 40.2  on 1 df,   p=2e-10
# Score (logrank) test = 43.73  on 1 df,   p=4e-11

6.4 Adding numeric Predictor

a0 = m0 |>
  add_numeric(x = ~ logKi67.q) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of three steps.

First, function add_numeric()

  • considers each “slice” of the numeric-hypercolumn logKi67.q as an additional numeric predictor. Users are encouraged to read the package groupedHyperframe vignette, Appendix Section anylist, for more details;
  • updates the starting model m0 with each one of the additional numeric predictors, respectively;
  • returns an 'add_numeric' object, which inherits from the class 'add_'. This is a listof objects with internal class 'add_numeric_' (Chapter 14).

Second, the S3 method sort_by.add_() sorts the input by the absolute values of the regression coefficients, i.e., effect size effsize, of the additional numeric predictors.

Lastly, the S3 method utils:::head.default() chooses the top n elements from the input.
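Conceptually, the pipeline refits the starting model with each candidate predictor in turn and ranks the candidates by absolute regression coefficient. A minimal sketch of that idea on simulated data (the data frame d, the candidates cand, and the variable names are all hypothetical, not maxEff internals):

```r
library(survival)

set.seed(1)
n <- 200L
d <- data.frame(
  time = rexp(n), status = rbinom(n, size = 1L, prob = .7),
  Tstage = sample(1:4, size = n, replace = TRUE)
)
# three hypothetical candidate numeric predictors
cand <- data.frame(q05 = rnorm(n), q06 = rnorm(n), q07 = rnorm(n))

# effect size of each candidate when added to the starting model
effsize <- vapply(cand, FUN.VALUE = 0, FUN = \(x.) {
  m <- coxph(Surv(time, status) ~ Tstage + x., data = cbind(d, x. = x.))
  unname(coef(m)['x.'])
})
# rank candidates by absolute effect size; keep the top 2
names(sort(abs(effsize), decreasing = TRUE))[1:2]
```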

The S3 method print.add_numeric() displays the calls to the selected numeric predictors.

a0
# logKi67.q["6%"]
# logKi67.q["5%"]

The S3 method predict.add_numeric() applies the starting model structure in m0, together with the additional numeric predictors selected in a0, to the test set s1. The author categorizes this functionality under the S3 generic predict(), instead of update(), to emphasize that this “update” is performed on a newdata, even though the workhorse function is the S3 method update.default().
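The idea of refitting a fixed model structure on new data can be seen with update() alone; a sketch using the survival package’s built-in lung data (the row split is arbitrary, for illustration only):

```r
library(survival)
# fit on one subset of rows, then refit the identical structure on
# held-out rows: update.default() re-evaluates the original call
# with the new `data` argument
m <- coxph(Surv(time, status) ~ age, data = lung[1:100, ])
m.new <- update(m, data = lung[101:228, ])
```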

a1 = a0 |> 
  predict(newdata = s1)
Predicted models a1
a1
# logKi67.q["6%"] :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#          coef exp(coef) se(coef)     z      p
# Tstage 0.5900    1.8041   0.1959 3.012 0.0026
# x.     0.6071    1.8352   0.8687 0.699 0.4846
# 
# Likelihood ratio test=9.23  on 2 df, p=0.009926
# n= 125, number of events= 27 
# 
# logKi67.q["5%"] :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#          coef exp(coef) se(coef)     z       p
# Tstage 0.5886    1.8015   0.1961 3.002 0.00269
# x.     0.6276    1.8732   0.8746 0.718 0.47299
# 
# Likelihood ratio test=9.25  on 2 df, p=0.009794
# n= 125, number of events= 27

6.5 Adding logical Predictor

6.5.1 Naive Practice

b0 = m0 |>
  add_dummy(x = ~ logKi67.q) |>
  subset(subset = p1 > .05 & p1 < .95) |> 
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of four steps.

First, function add_dummy()

  • dichotomizes (Chapter 32) each “slice” of the numeric-hypercolumn logKi67.q into a logical variable;
  • removes the duplicated logical variables;
  • updates the starting model m0 with each one of the additional logical predictors, respectively;
  • returns an 'add_dummy' object, which inherits from the class 'add_'. This is a listof objects with internal class 'add_dummy_' (Chapter 13).

Second, the S3 method subset.add_dummy() subsets the input by the balance of the partition induced by each additional logical predictor (here, p1 strictly between .05 and .95).

Third, the S3 method sort_by.add_() sorts the input by the absolute value of regression coefficients, i.e., effect size effsize, of the additional logical predictor.

Lastly, the S3 method utils:::head.default() chooses the top n elements from the input.
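The dichotomize-then-check-balance steps can be sketched in base R on a simulated “slice” (the cutpoint below is an illustrative stand-in; in the real pipeline node1(), Chapter 32, learns it from a tree):

```r
set.seed(2)
x <- rnorm(100L)                  # one hypothetical "slice" of logKi67.q
cut <- unname(quantile(x, .9))    # illustrative cutpoint, NOT node1()
z <- x >= cut                     # the dichotomized logical predictor
p1 <- mean(z)                     # balance: proportion of TRUE
p1 > .05 & p1 < .95               # would it survive subset(subset = p1 > .05 & p1 < .95)?
```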

The S3 method print.add_dummy() displays the calls to the selected logical predictors.

b0
# logKi67.q["95%"]>=6.76825904668107
# logKi67.q["94%"]>=6.72093005757954

The S3 method predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by b0, to the test set s1.

b1 = b0 |> 
  predict(newdata = s1)
Predicted models b1
b1
# logKi67.q["95%"]>=6.76825904668107 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#           coef exp(coef) se(coef)      z       p
# Tstage  0.6287    1.8753   0.1922  3.271 0.00107
# x.TRUE -0.2630    0.7687   0.4693 -0.560 0.57515
# 
# Likelihood ratio test=9.05  on 2 df, p=0.01086
# n= 125, number of events= 27 
# 
# logKi67.q["94%"]>=6.72093005757954 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#            coef exp(coef) se(coef)      z       p
# Tstage  0.61820   1.85558  0.19162  3.226 0.00125
# x.TRUE -0.09157   0.91249  0.46920 -0.195 0.84526
# 
# Likelihood ratio test=8.78  on 2 df, p=0.01238
# n= 125, number of events= 27

6.5.2 via Repeated Partitions

set.seed(83); c0 = m0 |> 
  add_dummy_partition(~ logKi67.q, times = 20L, p = .8) |>
  subset(subset = p1 > .15 & p1 < .85) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)

The pipeline above consists of four steps.

First, function add_dummy_partition() creates a dichotomizing rule for each additional numeric predictor in the following steps.

  1. Generates twenty (20L) partitions of the training set s0, using either caret::createDataPartition() or statusPartition() (Chapter 53), at an .8 versus .2 (= 1 − .8) ratio.
  2. For the \(i\)-th partition \((i=1,\cdots,20)\) of the training set s0,
    • creates a dichotomizing rule of the numeric predictor in the \(i\)-th training-subset of s0 using the function node1() (Chapter 32);
    • applies such dichotomizing rule to the numeric predictor in the \(i\)-th test-subset of s0;
    • updates the starting model m0 with the \(i\)-th test-subset of s0, as well as the logical predictor from the previous step;
    • records the estimated regression coefficients, i.e., effect size effsize, of the logical predictor.
  3. Selects the dichotomizing rule based on the partition with the median effect size of the logical predictor in their own, specific test-subset of s0.
  4. Returns an 'add_dummy' object.

Second, the S3 method subset.add_dummy() subsets the input by the balance of the partition induced by each additional logical predictor (here, p1 strictly between .15 and .85), evaluated in its own, specific test-subset of s0.

Third, the S3 method sort_by.add_() sorts the input by the absolute value of regression coefficients, i.e., effect size effsize, of the additional logical predictor in their own, specific test-subset of s0.

Lastly, the S3 method utils:::head.default() chooses the top n elements from the input.
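The repeated-partition scheme can be sketched on simulated data. Everything below is hypothetical: the cutpoint is a plain median rather than node1() (Chapter 32), the partitions use sample.int() rather than caret::createDataPartition() or statusPartition(), and the starting model is simplified to the candidate predictor alone:

```r
library(survival)
set.seed(3)
n <- 200L
d <- data.frame(time = rexp(n), status = rbinom(n, size = 1L, prob = .7),
                x = rnorm(n))  # hypothetical candidate predictor
times <- 20L
eff <- cuts <- numeric(times)
for (i in seq_len(times)) {
  id <- sample.int(n, size = n * .8)       # i-th 80/20 partition
  cuts[i] <- median(d$x[id])               # rule learned on the i-th training-subset
  m <- coxph(Surv(time, status) ~ I(x >= cuts[i]), data = d[-id, ])
  eff[i] <- unname(coef(m))                # effect size in the i-th test-subset
}
# keep the rule from the partition with the median effect size
cuts[which.min(abs(eff - median(eff)))]
```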

Similar to Section 6.5.1, the S3 method print.add_dummy() displays the calls to the selected logical predictors.

c0
# logKi67.q["96%"]>=6.8055032497215
# logKi67.q["98%"]>=7.25821545225641

Similar to Section 6.5.1, the S3 method predict.add_dummy() applies the starting model structure in m0, as well as the additional logical predictors selected by c0, to the test set s1.

c1 = c0 |> 
  predict(newdata = s1)
Predicted models c1
c1
# logKi67.q["96%"]>=6.8055032497215 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#           coef exp(coef) se(coef)      z       p
# Tstage  0.6275    1.8730   0.1924  3.261 0.00111
# x.TRUE -0.3212    0.7253   0.4695 -0.684 0.49392
# 
# Likelihood ratio test=9.19  on 2 df, p=0.01012
# n= 125, number of events= 27 
# 
# logKi67.q["98%"]>=7.25821545225641 :
# Call:
# coxph(formula = PFS ~ Tstage + x., data = newd)
# 
#           coef exp(coef) se(coef)      z        p
# Tstage  0.6346    1.8862   0.1915  3.313 0.000923
# x.TRUE -0.6196    0.5382   0.3935 -1.575 0.115309
# 
# Likelihood ratio test=11.19  on 2 df, p=0.00372
# n= 125, number of events= 27