RefineMod.Rmd
This vignette describes the functions in the RefineMod package and how to use them. The package provides functions to refine and optimize linear regression models. A model can be reduced to only those predictors that are statistically significant in explaining the response variable. Linear regression models can also be compared on performance metrics such as RMSE, R2 and MAE.
There are currently three functions in this package:
sig_pred
lm_significant
comp_mods
library(RefineMod)
library(datateachr) #Dependency
sig_pred finds, among the input predictors, all those that are statistically significant in a linear regression model.
sig_pred(data, res, preds = NULL, p = 0.01, verbose = FALSE, ...)
data
a data frame object containing the variables to be used as response and predictors in the model.
res
a character vector of length 1 that matches the name of the response variable column in data. The response variable must be numeric.
preds
a character vector naming the predictor variables in data. The default is NULL, in which case the function starts from all variables in data other than the response variable. This lets you either supply your own predictors or use every column except the response.
p
a numeric value that denotes the threshold for selecting the predictors based on their statistical significance in building the model. Default p-value threshold is 0.01.
verbose
a logical value denoting whether or not to print progress messages while the function runs. Default is FALSE.
...
additional arguments to be passed to the inner lm() function calls. Refer to the documentation of lm() for more details on those arguments
sig_mod <- sig_pred(cancer_sample[,-2], res = "radius_mean")
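The way res and preds could be turned into a model formula can be sketched with base R alone. The helper below, build_formula, is hypothetical and not part of RefineMod. It only assumes the behaviour described above: res names a single numeric column, and preds = NULL means every other column is used as a predictor. It uses the built-in mtcars data so that it runs without datateachr.

```r
# Hypothetical helper (not part of RefineMod): build an additive formula
# from a response name and an optional character vector of predictors.
build_formula <- function(data, res, preds = NULL) {
  stopifnot(is.character(res), length(res) == 1, res %in% names(data))
  stopifnot(is.numeric(data[[res]]))  # response must be numeric
  if (is.null(preds)) {
    # Assumed default: all columns except the response
    preds <- setdiff(names(data), res)
  }
  stats::reformulate(preds, response = res)
}

form_all <- build_formula(mtcars, res = "mpg")
form_sub <- build_formula(mtcars, res = "mpg", preds = c("cyl", "wt"))

# The resulting formula can be passed straight to lm()
fit_sub <- lm(form_sub, data = mtcars)
```

Building the formula from character vectors keeps the interface independent of non-standard evaluation, which is presumably why res and preds are given as strings.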
lm_significant reduces a multiple linear regression model to only those predictor variables that are statistically significant in the model. The function currently supports only additive linear regression models.
lm_significant(
data,
res,
preds = NULL,
p = 0.01,
verbose = TRUE,
all = FALSE,
...
)
data
a data frame object containing the variables to be used as response and predictors in the model.
res
a character vector of length 1 that matches the name of the response variable column in data. The response variable must be numeric.
preds
a character vector naming the predictor variables in data. The default is NULL, in which case the function starts from all variables in data other than the response variable. This lets you either supply your own predictors or use every column except the response.
p
a numeric value that denotes the threshold for selecting the predictors based on their statistical significance in building the model. Default p-value threshold is 0.01.
verbose
a logical value denoting whether or not to print progress messages as the function is being run. Default is TRUE
all
if TRUE, the function returns a list with two lm model objects: the first is the original fit with all the input predictors, and the second is the model with only the significant predictors. The default is FALSE, which returns a single lm object with only the significant predictors.
...
additional arguments to be passed to the inner lm() function calls. Refer to the documentation of lm() for more details on those arguments
#Only the optimized model
sig_mod1 <- lm_significant(cancer_sample[,-2], res = "radius_mean")
#>
#> Response Variable: radius_mean
#>
#> Input Predictors: ID texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave_points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave_points_worst symmetry_worst fractal_dimension_worst
#>
#> Fitting a linear model
#>
#> Optimization of Predictors
#> ....
#>
#> Final Optimization...
#>
#>
#> Final Optimized Predictors: perimeter_mean compactness_mean radius_worst area_worst concavity_mean perimeter_worst compactness_worst
summary(sig_mod1)
#>
#> Call:
#> stats::lm(formula = form1, data = data)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -0.43905 -0.03010 -0.00420  0.03113  0.27932
#>
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)
#> (Intercept)        4.401e-01  3.107e-02  14.166  < 2e-16 ***
#> perimeter_mean     1.487e-01  5.931e-04 250.709  < 2e-16 ***
#> compactness_mean  -3.926e+00  1.559e-01 -25.182  < 2e-16 ***
#> radius_worst       1.462e-01  7.390e-03  19.778  < 2e-16 ***
#> area_worst        -1.897e-04  3.205e-05  -5.918 5.68e-09 ***
#> concavity_mean    -6.334e-01  9.568e-02  -6.620 8.39e-11 ***
#> perimeter_worst   -1.671e-02  1.020e-03 -16.381  < 2e-16 ***
#> compactness_worst  2.310e-01  4.123e-02   5.602 3.32e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.06754 on 561 degrees of freedom
#> Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
#> F-statistic: 2.208e+05 on 7 and 561 DF, p-value: < 2.2e-16
#Both the original model and optimized model
sig_mod2 <- lm_significant(mtcars, res = "mpg", all = TRUE)
#>
#> Response Variable: mpg
#>
#> Input Predictors: cyl disp hp drat wt qsec vs am gear carb
#>
#> Fitting a linear model
#>
#> Optimization of Predictors
#> .
#>
#> Final Optimization...
#>
#>
#> Final Optimized Predictors: cyl wt
sig_mod2
#> $`All Predictors`
#>
#> Call:
#> stats::lm(formula = form, data = data)
#>
#> Coefficients:
#> (Intercept)          cyl         disp           hp         drat           wt
#>    12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530
#>        qsec           vs           am         gear         carb
#>     0.82104      0.31776      2.52023      0.65541     -0.19942
#>
#>
#> $`Opt Predictors`
#>
#> Call:
#> stats::lm(formula = form1, data = data)
#>
#> Coefficients:
#> (Intercept)          cyl           wt
#>      39.686       -1.508       -3.191
summary(sig_mod2$`Opt Predictors`)
#>
#> Call:
#> stats::lm(formula = form1, data = data)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -4.2893 -1.5512 -0.4684  1.5743  6.1004
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
#> cyl          -1.5078     0.4147  -3.636 0.001064 **
#> wt           -3.1910     0.7569  -4.216 0.000222 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.568 on 29 degrees of freedom
#> Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
#> F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
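To check by hand that the retained predictors pass the threshold p, the p-values can be read from the coefficient matrix of the model summary. The sketch below uses only base R and refits the reduced mtcars model from the output above directly with lm(). It illustrates the significance check, not how lm_significant arrives at its selection; the names fit, pvals and threshold are chosen here for illustration.

```r
# Refit the reduced model from the output above using base R only
fit <- lm(mpg ~ cyl + wt, data = mtcars)

# Coefficient matrix: one row per term, with a "Pr(>|t|)" column
coefs <- summary(fit)$coefficients
pvals <- coefs[-1, "Pr(>|t|)"]  # drop the intercept row

threshold <- 0.01  # same value as the default of p
significant <- names(pvals)[pvals < threshold]
significant
#> [1] "cyl" "wt"
```

Both p-values (0.001064 and 0.000222 in the summary above) fall below 0.01, so both predictors are kept.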
comp_mods takes one or more linear regression models and compares their RMSE, R2 and MAE, computed either on the training data or on new input data.
comp_mods(mod1, ..., newdata = NULL)
mod1
an lm model object, as returned by lm().
...
additional lm model objects to compare.
newdata
a data frame of new data, other than the data used for training the models, on which to evaluate performance. The default is NULL, in which case model performance is evaluated on the training data.
train <- mtcars[1:20,]
test <- mtcars[21:30,]
mod1 <- lm(mpg~wt, train)
mod2 <- lm(mpg~cyl, train)
mod3 <- lm(mpg~wt+cyl, train)
mod4 <- lm(mpg~carb, train)
comp_mods(mod1, mod2, mod3, mod4, newdata = test)
#>            RMSE     Rsquared      MAE
#> model1 7.558188 0.0276731943 6.090093
#> model2 8.893606 0.0006549141 7.565000
#> model3 8.315105 0.0044296323 6.741422
#> model4 9.741102 0.1946252582 7.860596
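For reference, the three metrics can be computed by hand with base R. The function perf_metrics below is a hypothetical sketch, not the implementation of comp_mods. It assumes that Rsquared means the squared correlation between observed and predicted values, which this vignette does not state, so its numbers are not guaranteed to match the table above.

```r
# Hypothetical helper (not part of RefineMod): RMSE, R-squared and MAE
# for one lm model, on the training data or on newdata.
perf_metrics <- function(model, newdata = NULL) {
  res_name <- all.vars(formula(model))[1]  # name of the response variable
  if (is.null(newdata)) {
    obs  <- model.response(model.frame(model))
    pred <- fitted(model)
  } else {
    obs  <- newdata[[res_name]]
    pred <- predict(model, newdata = newdata)
  }
  c(RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,  # assumption: squared correlation
    MAE      = mean(abs(obs - pred)))
}

train <- mtcars[1:20, ]
test  <- mtcars[21:30, ]
mods  <- list(model1 = lm(mpg ~ wt, train),
              model2 = lm(mpg ~ cyl, train))

# One row per model, one column per metric
perf <- t(sapply(mods, perf_metrics, newdata = test))
perf
```

Evaluating on held-out rows rather than the training data gives a less optimistic picture of each model, which is the purpose of the newdata argument.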