Title: | Simultaneous Threshold Interaction Modeling Algorithm |
---|---|
Description: | Regression trunk model estimation proposed by Dusseldorp and Meulman (2004) <doi:10.1007/bf02295641> and Dusseldorp, Conversano, Van Os (2010) <doi:10.1198/jcgs.2010.06089>, integrating a regression tree and a multiple regression model. |
Authors: | Elise Dusseldorp [aut, cre, cph], Claudio Conversano [aut, cph], Cor Ninaber [ctb], Kristof Meers [ctb], Peter Neufeglise [trl], Juan Claramunt [ctb] |
Maintainer: | Elise Dusseldorp <[email protected]> |
License: | GPL-2 |
Version: | 1.2.4 |
Built: | 2025-01-25 03:18:01 UTC |
Source: | https://github.com/cran/stima |
This package enables you to estimate a regression trunk model. The core function is stima, which is also the name of the algorithm. The default model is a regression trunk model, which integrates a regression tree and a multiple regression model. A classification trunk model is under development.
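As a rough illustration of what integrating a regression tree and a multiple regression model can look like, the following base R sketch builds a trunk-like model by hand on the built-in mtcars data. The split variables and thresholds (medians of wt and hp) are arbitrary assumptions chosen for illustration; stima searches for the splits itself, and this sketch does not call the package.

```r
# Illustrative sketch only: a hand-built "trunk-like" model in base R.
# The splits below are arbitrary assumptions, not splits found by stima.
dat <- mtcars[, c("mpg", "wt", "hp")]

# Hypothetical first split on wt, second split on hp within the heavier cars
thr_wt <- median(dat$wt)
heavy  <- dat$wt > thr_wt
thr_hp <- median(dat$hp[heavy])

# Three regions: R1 = lighter cars (reference), R2 and R3 = heavier cars split on hp
dat$R2 <- as.numeric(heavy & dat$hp <= thr_hp)
dat$R3 <- as.numeric(heavy & dat$hp >  thr_hp)

# Multiple regression with main effects only
fit_main  <- lm(mpg ~ wt + hp, data = dat)
# Main effects plus region indicators (the tree part of the model)
fit_trunk <- lm(mpg ~ wt + hp + R2 + R3, data = dat)

summary(fit_main)$r.squared
summary(fit_trunk)$r.squared
```

Because the main-effects model is nested in the second model, the second R-squared cannot be smaller; whether the gain is worth the extra terms is what cross-validation and pruning are meant to decide.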
Package: | stima |
Type: | Package |
Version: | 1.1 |
Date: | 2013-11-08 |
License: | GPL-2 |
LazyLoad: | yes |
The most important functions are stima and stima.control.
Elise Dusseldorp, Peter Neufeglise, and Claudio Conversano, with contributions of Kristof Meers and Cor Ninaber
Maintainer: [email protected]
Dusseldorp, E., Conversano, C., and Van Os, B.J. (2010). Combining an additive and tree-based regression model simultaneously: STIMA. Journal of Computational and Graphical Statistics, 19(3), 514-530.
The response is the median value of owner-occupied homes measured for each of 506 census tracts in the Boston area.
data(boston)
A data frame with 506 observations on the following 16 variables.
c.medv
numeric response variable: median value of owner-occupied homes measured in 1000's USD
chas
a factor with levels "lontano" and "vicino", indicating whether the tract bounds the Charles River (= "lontano") or not
long
a numeric variable: longitude
latid
a numeric variable: latitude of census tract
crim
a numeric variable: per capita crime rate per town
zn
a numeric variable: proportion of residential land zoned for lots over 25,000 sq.ft.
indus
a numeric variable: proportion of non-retail business acres per town
nox
a numeric variable: nitric oxides concentration (parts per 10 million)
rm
a numeric variable: average number of rooms per dwelling
age
a numeric variable: proportion of owner-occupied units built prior to 1940
dis
a numeric variable: weighted distances to five Boston employment centers
rad
a numeric variable: index of accessibility to radial highways
tax
a numeric variable: full-value property-tax rate per 10,000 USD
ptratio
a numeric variable: pupil-teacher ratio by town
b
a numeric variable: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat
a numeric variable: percentage lower status of the population
Statlib website: http://lib.stat.cmu.edu/datasets
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. J. Environ. Economics & Management, 5, 81-102.
A dataset with information on background characteristics and salary of 473 employees.
data(employee)
A data frame with 473 observations on the following 9 variables:
salary
a numeric variable, used as response variable: current salary in US dollars
age
a numeric variable: age in years
edu
a numeric variable: educational level in years
startsal
a numeric variable: beginning salary in US dollars
jobtime
a numeric variable: months since hire
prevexp
a numeric variable: previous work experience in months
minority
a factor variable: minority classification with levels min, indicating minority, and no_min, indicating no minority
gender
a factor variable: gender type with levels f, indicating female, and m, indicating male
jobcat
a factor variable: type of job with levels Clerical, Custodial, and manager
This is an example dataset from the statistical software program SPSS, Version 20.0. If you use this dataset, refer to IBM Corp. (2011), see references. The dataset is used as a benchmark dataset in Dusseldorp, Conversano, and Van Os (2010).
IBM Corp. (2011). IBM SPSS Statistics for Windows, Version 20.0. Armonk, NY: IBM Corp.
Dusseldorp, E., Conversano, C., and Van Os, B.J. (2010). Combining an additive and tree-based regression model simultaneously: STIMA. Journal of Computational and Graphical Statistics, 19(3), 514-530.
Produces a plot of a regression trunk.
## S3 method for class 'rt'
plot(x, digits = 2, ...)
x |
an object of class rt. |
digits |
number of decimal places used in the plot. Default value is 2. |
... |
additional arguments to be passed. |
The output is a plot of a regression trunk. Exception: If the first splitting predictor is categorical with more than 2 categories, the output will be multiple plots: for each category one plot of a regression trunk.
The number of digits of the mean value displayed in each node can be adjusted using the command options(digits = ..) before the plot command.
Known bug: If a splitting variable (not the first one) in the regression trunk is categorical, the values of the categories are not displayed in the plot.
stima, stima.control, summary.rt
data(employee)
fit1 <- stima(employee, 2, first = 3, vfold = 0)
# adjust the number of decimal places used in the plot
plot(fit1, digits = 1)
# categorical first split
fit2 <- stima(employee, 3, first = 9, vfold = 0)
plot(fit2)
# click on the plot to see the next one
# for each category of variable "jobcat" the subtree is shown in a separate plot
Determines the optimally pruned size of the regression trunk by applying the c*standard error rule to the results from the cross-validation procedure.
## S3 method for class 'rt'
prune(tree, data, c.par = NULL, ...)
tree |
a tree of class rt. |
data |
the dataset that was used to create the regression trunk. |
c.par |
the pruning parameter (c) that will be used in the c*SE rule. In the default option, the pruning function uses the best value of c, as recommended by Dusseldorp, Conversano & Van Os (2010). This best value depends on the sample size of the included dataset. |
... |
additional arguments to be passed. |
The function returns the pruned regression trunk and the corresponding regression trunk model. The output is an object of class rt. If the pruning rule results in the root node, no object is returned.
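A c*SE rule of the usual form picks the smallest model whose cross-validated error lies within c standard errors of the minimum. The sketch below shows that selection step in base R. The helper select_size, the column names, and all numbers are hypothetical, and the exact computation inside prune may differ.

```r
# Illustrative sketch of a c*SE pruning rule; all numbers are hypothetical.
# Assumes the rows of `cv` are ordered from the smallest to the largest trunk.
select_size <- function(cv, c_par) {
  best   <- which.min(cv$cv_error)                    # size with the lowest CV error
  cutoff <- cv$cv_error[best] + c_par * cv$se[best]   # tolerance band above the minimum
  cv$nsplit[min(which(cv$cv_error <= cutoff))]        # smallest size inside the band
}

cv <- data.frame(
  nsplit   = 0:4,
  cv_error = c(0.60, 0.48, 0.42, 0.41, 0.44),
  se       = c(0.05, 0.04, 0.04, 0.04, 0.04)
)

select_size(cv, c_par = 0)    # no tolerance: the size with the minimum CV error
select_size(cv, c_par = 0.5)  # half a standard error of tolerance: a smaller trunk
```

A larger c trades a little cross-validated accuracy for a simpler trunk, which is why the recommended value can depend on the sample size.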
Dusseldorp, E., Conversano, C., and Van Os, B.J. (2010). Combining an additive and tree-based regression model simultaneously: STIMA. Journal of Computational and Graphical Statistics, 19(3), 514-530.
# Example with employee data
data(employee)
# a regression trunk with a maximum of three splits is grown
# variable used for the first split (edu) is third variable in the dataset
# twofold cross-validation is performed to save time in the example,
# tenfold cross-validation is recommended
emprt1 <- stima(employee, 3, first = 3, vfold = 2)
summary(emprt1)
# prune the regression trunk
emprt1_pr <- prune(emprt1, data = employee)
This function fits a regression trunk model (default option) using the simultaneous threshold interaction modeling algorithm. The algorithm fits a regression tree and a multiple regression model simultaneously.
stima(data, maxsplit, model = "regtrunk", first = NULL, vfold = 10, CV = 1, Save = FALSE, control = NULL, printoutput = TRUE)
data |
a data frame with one continuous response variable and multiple predictors (categorical or continuous). IMPORTANT: The first column is treated as the response variable, the remaining columns as predictors. |
maxsplit |
the maximum number of splits. |
model |
the default model is a regression trunk model. The classification trunk model is under development. |
first |
the column number in the data frame of the predictor that is used for the first split of the regression trunk. The default option automatically selects the predictor for the first split. |
vfold |
the number of sets to be used in the cross-validation. The default value is 10, which means 10-fold cross-validation. If |
CV |
the number of times the cross-validation procedure is performed. The default is once. If |
Save |
if |
control |
options controlling details of the algorithm. For default options see stima.control. |
printoutput |
if TRUE, output will be printed while running the function. |
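Since the first column of data is treated as the response and first is given as a column number, it can help to reorder the data frame and look up column positions before fitting. The following base R sketch uses the built-in mtcars data purely for illustration; the choice of hp as response and wt as first splitting predictor is an arbitrary assumption, and the sketch stops before any call to stima.

```r
# Illustrative sketch: put the response in column 1 and look up a column number.
dat  <- mtcars
resp <- "hp"                                      # hypothetical response variable

# Move the response to the first column, keeping the other columns in order
dat <- dat[, c(resp, setdiff(names(dat), resp))]

# Column number of a hypothetical first splitting predictor
first_col <- match("wt", names(dat))

names(dat)[1]
first_col
```

Looking the position up by name avoids off-by-one mistakes after columns have been moved.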
an object of class rt, which is a list containing at least the following components:
call |
the matched call. |
trunk |
the fitted regression trunk. MeanResponse is the mean response value of the observations in that particular node (this is not the predicted response value). |
splitsequence |
the number of the nodes that are split. |
goffull |
goodness-of-fit estimates of the full regression trunk model estimated after 1 split through the model estimated after the maximum number of splits. |
full |
the estimated full regression trunk model after the maximum number of splits. Coefficient = estimated unstandardized regression coefficient; Std. Coef. = standardized regression coefficient. |
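The relation between an unstandardized and a standardized regression coefficient can be checked in base R. The sketch below uses lm on the built-in mtcars data rather than a fitted trunk; how stima standardizes the indicator variables for the terminal nodes is not stated here, so this only illustrates the general definition for numeric predictors.

```r
# Illustrative sketch: unstandardized vs. standardized coefficients with lm.
fit <- lm(mpg ~ wt + hp, data = mtcars)
b   <- coef(fit)[-1]                               # unstandardized slopes

# Standardized slope = slope * sd(predictor) / sd(response)
sds   <- sapply(mtcars[, names(b)], sd)
std_b <- b * sds / sd(mtcars$mpg)

# Same values from a regression on z-scored variables
z     <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
fit_z <- lm(mpg ~ wt + hp, data = z)

std_b
coef(fit_z)[-1]
```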
Dusseldorp, E. & Meulman, J. J. (2004). The regression trunk approach to discover treatment covariate interaction. Psychometrika, 69, 355-374.
Dusseldorp, E., Conversano, C., and Van Os, B.J. (2010). Combining an additive and tree-based regression model simultaneously: STIMA. Journal of Computational and Graphical Statistics, 19(3), 514-530.
stima.control, summary.rt, prune.rt, plot.rt, and help("stima-package")
# Example with Boston Housing dataset from paper in JCGS
data(boston)
# grow a full regression trunk with automatic first split selection
# and maximum number of splits = 10, with: bostonrt <- stima(boston, 10)
# NB. This analysis will take a long time (about one hour)
# inspect the output with: summary(bostonrt)
# prune the tree with: prune(bostonrt, data = boston)
# the pruned regression trunk has 7 splits
# to save time in the example, we select the splitting candidates beforehand,
# and we grow a tree with a maximum of 4 splits:
contr <- stima.control(predtrunk = c(8, 9, 16))
bostonrt_pr <- stima(boston, 4, first = 16, vfold = 0, Save = TRUE, control = contr)
summary(bostonrt_pr)
# inspect the coefficients of the final regression trunk model
round(bostonrt_pr$full, digits = 2)
# inspect the new data including the indicator variables referring
# to the terminal nodes
bostonrt_pr$newdata
The output consists of various parameters that control aspects of the simultaneous threshold interaction modeling algorithm.
stima.control(minbucket = NULL, crit = "f2", mincrit = 0.001, predtrunk = NULL, ref = 1, sel = "none", ksel = 2, predsel = NULL, cvvec = NULL, seed = 3)
minbucket |
the minimum number of observations in a terminal node. The default is the square root of the total sample size. |
crit |
the type of statistic to be used in the partitioning criterion. The default for the regression trunk model is the effect size |
mincrit |
the minimum node deviance before growing stops. |
predtrunk |
a row vector that indicates the column numbers in the data frame of the predictors that can be used in the regression trunk. The default action uses all predictors as available splitting candidates. NB: a column number cannot be 1, because the first column is the response variable. |
ref |
a number referring to the region of the regression trunk that will be used as reference category in the regression trunk model. The default value is 1, referring to R1. |
sel |
if |
ksel |
the multiple of the number of degrees of freedom used for the penalty in the backward selection procedure. The default value is 2, which gives the genuine AIC: |
predsel |
row vector that indicates the column numbers in the |
cvvec |
index vector for the rows of the dataframe that will be used in each cross-validation set. The default option is a random division into |
seed |
an integer between 0 and 1023 that will be used in set.seed(). The default value equals 3. |
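One common way to build a reproducible index vector for v-fold cross-validation is to shuffle balanced fold labels after fixing the seed. The base R sketch below shows that construction for a sample of 473 rows (the size of the employee data) and 10 folds. Whether cvvec expects exactly this format, and how stima builds its own default division, are assumptions not confirmed by this page.

```r
# Illustrative sketch: a reproducible, balanced fold index vector.
n     <- 473   # number of rows, as in the employee data
vfold <- 10    # number of cross-validation sets

set.seed(3)    # fixing the seed makes the random division repeatable
cvvec <- sample(rep(seq_len(vfold), length.out = n))

# Fold sizes differ by at most one observation
table(cvvec)
```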
a list containing the parameters.
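The role of a penalty multiplier such as ksel can be illustrated with base R's step function, whose k argument is likewise the multiple of the degrees of freedom used in the penalty (k = 2 gives AIC, k = log(n) gives BIC). The sketch below runs backward selection on the built-in mtcars data; whether stima relies on step internally is an assumption, and the model formula is chosen only for illustration.

```r
# Illustrative sketch: backward selection with two penalty multipliers.
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)
n    <- nrow(mtcars)

# k = 2 corresponds to AIC; k = log(n) corresponds to BIC (a heavier penalty)
fit_aic <- step(full, direction = "backward", k = 2,      trace = 0)
fit_bic <- step(full, direction = "backward", k = log(n), trace = 0)

formula(fit_aic)
formula(fit_bic)
```

A larger multiplier penalizes each extra term more heavily, so the selected model tends to keep fewer predictors.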
Dusseldorp, E., Conversano, C., and Van Os, B.J. (2010). Combining an additive and tree-based regression model simultaneously: STIMA. Journal of Computational and Graphical Statistics, 19(3), 514-530.
stima, summary.rt, plot.rt, prune.rt
# Adjust the stopping rule to a minimum of 5 observations in a terminal node
data(employee)
contr1 <- stima.control(minbucket = 5)
# Adjust the seed used to create an index vector for the 10-fold cross-validation
# With seed = 3, the result equals the one reported in the online Appendix D of
# the paper in the Journal of Computational and Graphical Statistics
# NB. To save time in the example, the splitting candidates of the regression
# trunk (i.e., edu and jobtime) are selected with predtrunk = c(3, 5),
# where 3 and 5 denote the column numbers in the dataset
contr2 <- stima.control(sel = "backward", seed = 3, predtrunk = c(3, 5))
emprt2 <- stima(employee, 2, first = 3, control = contr2)
summary(emprt2)
# Apply a manual selection of predictors to be used in the pruned model
contr3 <- stima.control(sel = "manual", predsel = c(2, 3, 4, 5, 6, 8))
summary method for class “rt” (i.e., a regression trunk)
## S3 method for class 'rt'
summary(object, digits = 3, ...)
object |
an object of class rt. |
digits |
the number of decimals to be used in the output. |
... |
Additional arguments to be passed |
The function summary.rt returns the goodness-of-fit summary of the estimated regression trunk model, using the components “goffull” and, if available, “gofsel”.
full |
goodness-of-fit estimates of the full regression trunk model estimated after 1 split through the model estimated after the maximum number of splits. |
selected |
goodness-of-fit estimates of the selected regression trunk model (if applicable). |