Title: Accurate Generalized Linear Model
Description: Provides functions to fit Accurate Generalized Linear Model (AGLM) models, visualize them, and predict for new data. AGLM is defined as a regularized GLM which applies a sort of feature transformations using a discretization of numerical features and specific coding methodologies of dummy variables. For more information on AGLM, see Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020) <https://www.institutdesactuaires.com/global/gene/link.php?doc_id=16273&fg=1>.
Authors: Kenji Kondo [aut, cre, cph], Kazuhisa Takahashi [ctb], Hikari Banno [ctb]
Maintainer: Kenji Kondo <[email protected]>
License: GPL-2
Version: 0.4.0
Built: 2025-03-08 02:45:15 UTC
Source: https://github.com/kkondo1981/aglm
Provides functions to fit Accurate Generalized Linear Model (AGLM) models, visualize them, and predict for new data. AGLM is defined as a regularized GLM which applies a sort of feature transformations using a discretization of numerical features and specific coding methodologies of dummy variables. For more information on AGLM, see Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020).
The collection of functions provided by the aglm package has almost the same structure as the famous glmnet package, so users familiar with the glmnet package will be able to handle it easily. In fact, this structure is reasonable from an implementation point of view, because what the aglm package does is to apply appropriate transformations to the given data and pass it to the glmnet package as a backend.

The aglm package provides three different fitting functions, depending on how users want to handle the hyper-parameters of AGLM models.

Because AGLM is based on a regularized GLM, the regularization term of the loss function can be expressed as follows:
\[
R(\lbrace \beta_{jk} \rbrace; \lambda, \alpha)
= \lambda \left\lbrace
(1 - \alpha)\sum_{j=1}^{p} \sum_{k=1}^{m_j}|\beta_{jk}|^2 + \alpha \sum_{j=1}^{p} \sum_{k=1}^{m_j} |\beta_{jk}|
\right\rbrace,
\]
where \(\beta_{jk}\) is the k-th coefficient of the auxiliary variables for the j-th column of the data, \(\alpha\) is a weight which controls how the L1 and L2 regularization terms are mixed, and \(\lambda\) determines the strength of the regularization.

Searching for the hyper-parameters \(\alpha\) and \(\lambda\) is often useful to get better results, but is usually time-consuming. That's why the aglm package provides three fitting functions with different strategies for specifying the hyper-parameters, as follows:
aglm: A basic fitting function with given \(\alpha\) and \(\lambda\) value(s).
cv.aglm: A fitting function with given \(\alpha\) and cross-validation for \(\lambda\).
cva.aglm: A fitting function with cross-validation for both \(\alpha\) and \(\lambda\).
Generally speaking, setting an appropriate \(\lambda\) is often important to get meaningful results, and using cv.aglm() with the default \(\alpha = 1\) (LASSO) is usually enough. Since cva.aglm() is much more time-consuming than cv.aglm(), it is better to use it only if particularly better results are needed (see the sketch below).
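For orientation, here is a minimal sketch of the three entry points. It is not one of the package's own examples; it simply assumes the Boston data from MASS, as used in the examples later in this manual:

library(MASS) # For the Boston data
library(aglm)
x <- Boston[-14] # All columns except medv.
y <- Boston$medv
model <- aglm(x, y, alpha=1) # Fixed alpha, a whole path of lambda values.
model.cv <- cv.aglm(x, y, alpha=1) # Fixed alpha, lambda chosen by cross-validation.
cva_result <- cva.aglm(x, y) # Cross-validation for both alpha and lambda (slowest).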
The following S4 classes are defined to store results of the fitting functions.
AccurateGLM-class: A class for results of aglm() and cv.aglm().
CVA_AccurateGLM-class: A class for results of cva.aglm().
Users can use the models obtained from the fitting functions in various ways, by passing them to the following functions:
predict: Make predictions for new data
plot: Plot contribution of each variable and residuals
print: Display textual information of the model
coef: Get coefficients
deviance: Get deviance
residuals: Get residuals of various types
We emphasize that plot() is particularly useful for understanding the fitted model, because it presents a visual representation of how the variables in the original data are used by the model.
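Continuing the sketch above, a typical post-fit workflow (again not one of the package's own examples) might look like:

lambda <- model.cv@lambda.min # The lambda chosen by cross-validation.
print(model.cv) # Textual summary of the fitted model.
coefs <- coef(model.cv, s=lambda) # Coefficients at the chosen lambda.
plot(model.cv, s=lambda, resid=TRUE) # Per-variable contributions and residuals.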
The following functions are basically for internal use, but are exported as utility functions for convenience.
Functions for creating feature vectors: getUDummyMatForOneVec(), getODummyMatForOneVec(), getLVarMatForOneVec().
Functions for binning: createEqualFreqBins(), createEqualWidthBins(), executeBinning().
Kenji Kondo,
Kazuhisa Takahashi and Hikari Banno (worked on L-Variable related features)
Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020). AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques. Actuarial Colloquium Paris 2020. https://www.institutdesactuaires.com/global/gene/link.php?doc_id=16273&fg=1
Class for results of aglm() and cv.aglm()
backend_models: The fitted backend glmnet model.
vars_info: A list, each element of which holds information on one variable.
lambda, cvm, cvsd, cvup, cvlo, nzero, name, lambda.min, lambda.1se, fit.preval, foldid: Same as in the result of cv.glmnet.
call: An object of class call, corresponding to the function call when this AccurateGLM object was created.
Kenji Kondo
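Because AccurateGLM is an S4 class, its slots are read with the @ operator. A minimal sketch (not one of the package's own examples), assuming the Boston data from MASS:

library(MASS) # For the Boston data
library(aglm)
model <- cv.aglm(Boston[-14], Boston$medv)
model@lambda.min # The lambda with minimum cross-validation loss.
model@lambda.1se # The largest lambda within one standard error of the minimum.
length(model@vars_info) # One entry of variable information per column of x.
model@call # The call that created this object.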
A basic fitting function with given \(\alpha\) and \(\lambda\) value(s). See aglm-package for more details on \(\alpha\) and \(\lambda\).
aglm(
  x,
  y,
  qualitative_vars_UD_only = NULL,
  qualitative_vars_both = NULL,
  qualitative_vars_OD_only = NULL,
  quantitative_vars = NULL,
  use_LVar = FALSE,
  extrapolation = "default",
  add_linear_columns = TRUE,
  add_OD_columns_of_qualitatives = TRUE,
  add_interaction_columns = FALSE,
  OD_type_of_quantitatives = "C",
  nbin.max = NULL,
  bins_list = NULL,
  bins_names = NULL,
  family = c("gaussian", "binomial", "poisson"),
  ...
)
x: A design matrix. Usually a data.frame object is expected. How each column is encoded into dummy variables and linear columns is determined automatically from its class; if you need to change the default behavior, use the options qualitative_vars_UD_only, qualitative_vars_both, qualitative_vars_OD_only, and quantitative_vars below.
y: A response variable.
qualitative_vars_UD_only: Used to change the default behavior of aglm for the specified variables, which are treated as qualitative variables represented by U-dummies only.
qualitative_vars_both: Same as qualitative_vars_UD_only, except that the specified variables are represented by both U-dummies and O-dummies.
qualitative_vars_OD_only: Same as qualitative_vars_UD_only, except that the specified variables are represented by O-dummies only.
quantitative_vars: Same as qualitative_vars_UD_only, except that the specified variables are treated as quantitative variables.
use_LVar: Set to TRUE to represent quantitative variables by L-variables. By default FALSE, and O-dummies are used.
extrapolation: Used to control the values of the linear combination for quantitative variables outside the range where the data exists. By default, values of the linear combination outside the data range are extended based on the slope at the edges of the region where the data exists. You can set extrapolation="flat" to get constant values outside the data range instead.
add_linear_columns: By default, for quantitative variables, aglm adds linear columns in addition to their discretized dummies. Set to FALSE to omit the linear columns.
add_OD_columns_of_qualitatives: Set to FALSE if O-dummies of ordered qualitative variables are not needed (the default is TRUE).
add_interaction_columns: If this parameter is set to TRUE, aglm adds interaction columns between pairs of variables. The default is FALSE.
OD_type_of_quantitatives: Used to control the shape of linear combinations obtained by O-dummies for quantitative variables (deprecated).
nbin.max: An integer representing the maximum number of bins used when aglm discretizes quantitative variables.
bins_list: Used to set custom bins for variables with O-dummies.
bins_names: Used to set custom bins for variables with O-dummies.
family: A string representing the type of the error distribution: "gaussian", "binomial", or "poisson".
...: Other arguments are passed directly to the backend glmnet() call.
A model object fitted to the data. Functions such as predict and plot can be applied to the returned object. See AccurateGLM-class for more details.
Kenji Kondo,
Kazuhisa Takahashi and Hikari Banno (worked on L-Variable related features)
Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020). AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques. Actuarial Colloquium Paris 2020. https://www.institutdesactuaires.com/global/gene/link.php?doc_id=16273&fg=1
#################### Gaussian case ####################
library(MASS) # For Boston
library(aglm)

## Read data
xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.

## Split data into train and test
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/4)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[-ncol(xy)]
y <- train$y
newx <- test[-ncol(xy)]
y_true <- test$y

## Fit the model
model <- aglm(x, y) # alpha=1 (the default value)

## Predict for various alpha and lambda
lambda <- 0.1
y_pred <- predict(model, newx=newx, s=lambda)
rmse <- sqrt(mean((y_true - y_pred)^2))
cat(sprintf("RMSE for lambda=%.2f: %.5f \n\n", lambda, rmse))

lambda <- 1.0
y_pred <- predict(model, newx=newx, s=lambda)
rmse <- sqrt(mean((y_true - y_pred)^2))
cat(sprintf("RMSE for lambda=%.2f: %.5f \n\n", lambda, rmse))

alpha <- 0
model <- aglm(x, y, alpha=alpha)
lambda <- 0.1
y_pred <- predict(model, newx=newx, s=lambda)
rmse <- sqrt(mean((y_true - y_pred)^2))
cat(sprintf("RMSE for alpha=%.2f and lambda=%.2f: %.5f \n\n", alpha, lambda, rmse))

#################### Binomial case ####################
library(aglm)
library(faraway)

## Read data
xy <- nes96

## Split data into train and test
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/5)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[, c("popul", "TVnews", "selfLR", "ClinLR", "DoleLR", "PID", "age", "educ", "income")]
y <- train$vote
newx <- test[, c("popul", "TVnews", "selfLR", "ClinLR", "DoleLR", "PID", "age", "educ", "income")]

## Fit the model
model <- aglm(x, y, family="binomial")

## Make the confusion matrix
lambda <- 0.1
y_true <- test$vote
y_pred <- levels(y_true)[as.integer(predict(model, newx, s=lambda, type="class"))]
print(table(y_true, y_pred))

#################### use_LVar and extrapolation ####################
library(MASS) # For Boston
library(aglm)

## Randomly created train and test data
set.seed(2021)
sd <- 0.2
x <- 2 * runif(1000) + 1
f <- function(x){x^3 - 6 * x^2 + 13 * x}
y <- f(x) + rnorm(1000, sd = sd)
xy <- data.frame(x=x, y=y)
x_test <- seq(0.75, 3.25, length.out=101)
y_test <- f(x_test) + rnorm(101, sd=sd)
xy_test <- data.frame(x=x_test, y=y_test)

## Plot
nbin.max <- 10
models <- c(cv.aglm(x, y, use_LVar=FALSE, extrapolation="default", nbin.max=nbin.max),
            cv.aglm(x, y, use_LVar=FALSE, extrapolation="flat", nbin.max=nbin.max),
            cv.aglm(x, y, use_LVar=TRUE, extrapolation="default", nbin.max=nbin.max),
            cv.aglm(x, y, use_LVar=TRUE, extrapolation="flat", nbin.max=nbin.max))
titles <- c("O-Dummies with extrapolation=\"default\"",
            "O-Dummies with extrapolation=\"flat\"",
            "L-Variables with extrapolation=\"default\"",
            "L-Variables with extrapolation=\"flat\"")
par.old <- par(mfrow=c(2, 2))
for (i in 1:4) {
  model <- models[[i]]
  title <- titles[[i]]
  pred <- predict(model, newx=x_test, s=model@lambda.min, type="response")
  plot(x_test, y_test, pch=20, col="grey", main=title)
  lines(x_test, f(x_test), lty="dashed", lwd=2) # the theoretical line
  lines(x_test, pred, col="blue", lwd=3) # the smoothed line by the model
}
par(par.old)
S4 class for input
vars_info: A list, each element of which holds information on one variable.
data: The original data.
Get coefficients
## S3 method for class 'AccurateGLM'
coef(object, index = NULL, name = NULL, s = NULL, exact = FALSE, ...)
object: A model object obtained from aglm() or cv.aglm().
index: An integer value representing the index of the variable whose coefficients are required.
name: A string representing the name of the variable whose coefficients are required. Note that index and name should not be set at the same time.
s: Same as in coef.glmnet.
exact: Same as in coef.glmnet.
...: Other arguments are passed directly to the backend coef.glmnet() call.
If index or name is given, the function returns a list with one or a combination of the following fields, consisting of the coefficients related to the specified variable:
coef.linear: A coefficient of the linear term, if any.
coef.OD: Coefficients of O-dummies, if any.
coef.UD: Coefficients of U-dummies, if any.
coef.LV: Coefficients of L-variables, if any.
If neither index nor name is given, the function returns the entire coefficients corresponding to the internal design matrix.
Kenji Kondo
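A minimal sketch of both calling styles; it is not one of the package's own examples, and "age" below is just one column of the Boston data, used purely for illustration:

library(MASS) # For the Boston data
library(aglm)
model <- aglm(Boston[-14], Boston$medv)
coef(model, name="age", s=0.1) # A list of coef.linear, coef.OD, ... for "age".
coef(model, index=7, s=0.1) # The same variable, selected by its column index.
all_coefs <- coef(model, s=0.1) # The entire coefficients of the design matrix.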
Create bins (equal frequency binning)
createEqualFreqBins(x_vec, nbin.max)

x_vec: A numeric vector, whose quantiles are used as breaks.
nbin.max: The maximum number of bins.

A numeric vector representing the breaks obtained by binning. Note that the number of bins is equal to min(nbin.max, length(x_vec)).
Kenji Kondo
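A minimal usage sketch (not one of the package's own examples):

library(aglm)
set.seed(1)
x_vec <- rnorm(1000)
breaks <- createEqualFreqBins(x_vec, nbin.max=10)
print(breaks) # Break points taken from quantiles of x_vec.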
Create bins (equal width binning)
createEqualWidthBins(left, right, nbin)

left: The leftmost value of the interval to be binned.
right: The rightmost value of the interval to be binned.
nbin: The number of bins.

A numeric vector representing the breaks obtained by binning.
Kenji Kondo
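A minimal usage sketch (not one of the package's own examples):

library(aglm)
breaks <- createEqualWidthBins(left=0, right=10, nbin=5)
print(breaks) # Equally spaced break points over the interval [0, 10].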
A fitting function with a given \(\alpha\) and cross-validation for \(\lambda\). See aglm-package for more details on \(\alpha\) and \(\lambda\).
cv.aglm(
  x,
  y,
  qualitative_vars_UD_only = NULL,
  qualitative_vars_both = NULL,
  qualitative_vars_OD_only = NULL,
  quantitative_vars = NULL,
  use_LVar = FALSE,
  extrapolation = "default",
  add_linear_columns = TRUE,
  add_OD_columns_of_qualitatives = TRUE,
  add_interaction_columns = FALSE,
  OD_type_of_quantitatives = "C",
  nbin.max = NULL,
  bins_list = NULL,
  bins_names = NULL,
  family = c("gaussian", "binomial", "poisson"),
  keep = FALSE,
  ...
)
x: A design matrix. See aglm for more details.
y: A response variable.
qualitative_vars_UD_only: Same as in aglm.
qualitative_vars_both: Same as in aglm.
qualitative_vars_OD_only: Same as in aglm.
quantitative_vars: Same as in aglm.
use_LVar: Same as in aglm.
extrapolation: Same as in aglm.
add_linear_columns: Same as in aglm.
add_OD_columns_of_qualitatives: Same as in aglm.
add_interaction_columns: Same as in aglm.
OD_type_of_quantitatives: Same as in aglm.
nbin.max: Same as in aglm.
bins_list: Same as in aglm.
bins_names: Same as in aglm.
family: Same as in aglm.
keep: Set to TRUE if the fit.preval and foldid slots are needed in the returned value, as for keep in cv.glmnet.
...: Other arguments are passed directly to the backend cv.glmnet() call.
A model object fitted to the data, with cross-validation results. Functions such as predict and plot can be applied to the returned object, same as with the result of aglm(). See AccurateGLM-class for more details.
Kenji Kondo,
Kazuhisa Takahashi and Hikari Banno (worked on L-Variable related features)
Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020). AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques. Actuarial Colloquium Paris 2020. https://www.institutdesactuaires.com/global/gene/link.php?doc_id=16273&fg=1
#################### Cross-validation for lambda ####################
library(aglm)
library(faraway)

## Read data
xy <- nes96

## Split data into train and test
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/5)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[, c("popul", "TVnews", "selfLR", "ClinLR", "DoleLR", "PID", "age", "educ", "income")]
y <- train$vote
newx <- test[, c("popul", "TVnews", "selfLR", "ClinLR", "DoleLR", "PID", "age", "educ", "income")]

# NOTE: The code below takes considerable time, so run it when you have time.

## Fit the model
model <- cv.aglm(x, y, family="binomial")

## Make the confusion matrix
lambda <- model@lambda.min
y_true <- test$vote
y_pred <- levels(y_true)[as.integer(predict(model, newx, s=lambda, type="class"))]
cat(sprintf("Confusion matrix for lambda=%.5f:\n", lambda))
print(table(y_true, y_pred))
Class for results of cva.aglm()
models_list: A list consisting of the cv.glmnet() results for all \(\alpha\) values.
alpha: Same as in cv.aglm.
nfolds: Same as in cv.aglm.
alpha.min.index: The index of alpha.min in the vector alpha.
alpha.min: The \(\alpha\) value achieving the minimum loss among all the values of alpha.
lambda.min: The \(\lambda\) value achieving the minimum loss when \(\alpha\) is equal to alpha.min.
call: An object of class call, corresponding to the function call when this CVA_AccurateGLM object was created.
Kenji Kondo
A fitting function with cross-validation for both \(\alpha\) and \(\lambda\). See aglm-package for more details on \(\alpha\) and \(\lambda\).
cva.aglm(
  x,
  y,
  alpha = seq(0, 1, len = 11)^3,
  nfolds = 10,
  foldid = NULL,
  parallel.alpha = FALSE,
  ...
)
x: A design matrix. See aglm for more details.
y: A response variable.
alpha: A numeric vector representing the \(\alpha\) values to be examined in cross-validation.
nfolds: An integer value representing the number of folds.
foldid: An integer vector with the same length as observations. Each element should take a value from 1 to nfolds, identifying the fold to which the observation belongs.
parallel.alpha: (not used yet)
...: Other arguments are passed directly to cv.aglm().
An object storing fitted models and information of cross-validation. See CVA_AccurateGLM-class for more details.
Kenji Kondo,
Kazuhisa Takahashi and Hikari Banno (worked on L-Variable related features)
Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020). AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques. Actuarial Colloquium Paris 2020. https://www.institutdesactuaires.com/global/gene/link.php?doc_id=16273&fg=1
#################### Cross-validation for alpha and lambda ####################
library(aglm)
library(faraway)

## Read data
xy <- nes96

## Split data into train and test
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/5)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[, c("popul", "TVnews", "selfLR", "ClinLR", "DoleLR", "PID", "age", "educ", "income")]
y <- train$vote
newx <- test[, c("popul", "TVnews", "selfLR", "ClinLR", "DoleLR", "PID", "age", "educ", "income")]

# NOTE: The code below takes considerable time, so run it when you have time.

## Fit the model
cva_result <- cva.aglm(x, y, family="binomial")
alpha <- cva_result@alpha.min
lambda <- cva_result@lambda.min
mod_idx <- cva_result@alpha.min.index
model <- cva_result@models_list[[mod_idx]]

## Make the confusion matrix
y_true <- test$vote
y_pred <- levels(y_true)[as.integer(predict(model, newx, s=lambda, type="class"))]
cat(sprintf("Confusion matrix for alpha=%.5f and lambda=%.5f:\n", alpha, lambda))
print(table(y_true, y_pred))
Get deviance
## S3 method for class 'AccurateGLM'
deviance(object, ...)

object: A model object obtained from aglm() or cv.aglm().
...: Other arguments are passed directly to the backend deviance.glmnet() call.

The value of deviance extracted from the given object.
Kenji Kondo
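A minimal usage sketch (not one of the package's own examples), assuming the Boston data from MASS:

library(MASS) # For the Boston data
library(aglm)
model <- aglm(Boston[-14], Boston$medv)
dev <- deviance(model) # Deviance extracted from the backend glmnet fit.
print(dev)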
Bin the data into given bins
executeBinning(x_vec, breaks = NULL, nbin.max = 100, method = "freq")

x_vec: The data to be binned.
breaks: A numeric vector representing the breaks of the bins (if NULL, breaks are calculated automatically from x_vec).
nbin.max: The maximum number of bins (used only if breaks=NULL).
method: The binning method used when breaks are calculated automatically: "freq" for equal-frequency binning or "width" for equal-width binning.

A list with the following fields:
labels: An integer vector with the same length as x_vec, where labels[i]==k means the i-th element of x_vec is in the k-th bin.
breaks: The breaks of the bins used for binning.
Kenji Kondo
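A minimal usage sketch (not one of the package's own examples):

library(aglm)
set.seed(1)
x_vec <- runif(100)
binned <- executeBinning(x_vec, nbin.max=5, method="freq")
binned$breaks # The break points used for binning.
table(binned$labels) # How many observations fall into each bin.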
Create an L-variable matrix for one variable
getLVarMatForOneVec(x_vec, breaks = NULL, nbin.max = 100, only_info = FALSE)

x_vec: A numeric vector representing the original variable.
breaks: A numeric vector representing the breaks of the bins (if NULL, breaks are calculated automatically from x_vec).
nbin.max: The maximum number of bins (used only if breaks=NULL).
only_info: If TRUE, the L-variable matrix itself is not created and only the binning information is returned.

A list with the following fields:
breaks: Same as input.
dummy_mat: The created L-variable matrix (only if only_info=FALSE).
Kenji Kondo
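A minimal usage sketch (not one of the package's own examples):

library(aglm)
x_vec <- seq(0, 1, length.out=20)
lv <- getLVarMatForOneVec(x_vec, nbin.max=5)
lv$breaks # The breaks determined from x_vec.
dim(lv$dummy_mat) # One row per observation, one column per L-variable.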
Create an O-dummy matrix for one variable
getODummyMatForOneVec(
  x_vec,
  breaks = NULL,
  nbin.max = 100,
  only_info = FALSE,
  dummy_type = NULL
)

x_vec: A numeric vector representing the original variable.
breaks: A numeric vector representing the breaks of the bins (if NULL, breaks are calculated automatically from x_vec).
nbin.max: The maximum number of bins (used only if breaks=NULL).
only_info: If TRUE, the O-dummy matrix itself is not created and only the binning information is returned.
dummy_type: Used to control the shape of linear combinations obtained by O-dummies for quantitative variables (deprecated).

A list with the following fields:
breaks: Same as input.
dummy_mat: The created O-dummy matrix (only if only_info=FALSE).
Kenji Kondo
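A minimal usage sketch (not one of the package's own examples):

library(aglm)
x_vec <- seq(0, 1, length.out=20)
od <- getODummyMatForOneVec(x_vec, nbin.max=5)
od$breaks # The breaks determined from x_vec.
dim(od$dummy_mat) # One row per observation, one column per O-dummy.
info <- getODummyMatForOneVec(x_vec, nbin.max=5, only_info=TRUE) # Breaks only.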
Create a U-dummy matrix for one variable
getUDummyMatForOneVec(
  x_vec,
  levels = NULL,
  drop_last = TRUE,
  only_info = FALSE
)

x_vec: A vector representing the original variable. The class of x_vec should be one of integer, character, or factor.
levels: A character vector representing the values of x_vec to be coded as dummies.
drop_last: If TRUE, the dummy variable for the last level is dropped to avoid multicollinearity.
only_info: If TRUE, the U-dummy matrix itself is not created and only the information (levels and drop_last) is returned.

A list with the following fields:
levels: Same as input.
drop_last: Same as input.
dummy_mat: The created U-dummy matrix (only if only_info=FALSE).
Kenji Kondo
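A minimal usage sketch (not one of the package's own examples):

library(aglm)
x_vec <- factor(c("a", "b", "c", "a", "b", "b"))
ud <- getUDummyMatForOneVec(x_vec)
ud$levels # The levels coded as dummies.
dim(ud$dummy_mat) # With drop_last=TRUE, one column less than the number of levels.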
Plot contribution of each variable and residuals
## S3 method for class 'AccurateGLM'
plot(
  x,
  vars = NULL,
  verbose = TRUE,
  s = NULL,
  resid = FALSE,
  smooth_resid = TRUE,
  smooth_resid_fun = NULL,
  ask = TRUE,
  layout = c(2, 2),
  only_plot = FALSE,
  main = "",
  add_rug = FALSE,
  ...
)
x: A model object obtained from aglm() or cv.aglm().
vars: Used to specify the variables to be plotted. By default (NULL), all the variables are plotted; an integer vector of indices or a character vector of names can also be given.
verbose: Set to FALSE to suppress messages during plotting.
s: A numeric value specifying the \(\lambda\) value at which the model is plotted.
resid: Used to display residuals in the plots. Set to TRUE to plot working residuals; a character value specifying a residual type, as in residuals(), may also be given.
smooth_resid: Used to display smoothing lines of residuals for quantitative variables. Set to FALSE to turn them off.
smooth_resid_fun: Set if users need custom smoothing functions.
ask: By default, plot() waits for user input before drawing each new page. Set to FALSE to draw all pages without pausing.
layout: Plotting multiple variables on each page is allowed. To achieve this, set it to a pair of integers indicating the numbers of rows and columns, respectively.
only_plot: Set to TRUE if only the plots themselves, without auxiliary outputs, are needed.
main: Used to specify the title of the plot.
add_rug: Set to TRUE to add rug plots of the data.
...: Other arguments are currently not used and just discarded.
No return value, called for side effects.
Kenji Kondo,
Kazuhisa Takahashi and Hikari Banno (worked on L-Variable related features)
Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020). AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques. Actuarial Colloquium Paris 2020. https://www.institutdesactuaires.com/global/gene/link.php?doc_id=16273&fg=1
#################### using plot() and predict() ####################
library(MASS) # For Boston
library(aglm)

## Read data
xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.

## Split data into train and test
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/4)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[-ncol(xy)]
y <- train$y
newx <- test[-ncol(xy)]
y_true <- test$y

## With the result of aglm()
model <- aglm(x, y)
lambda <- 0.1
plot(model, s=lambda, resid=TRUE, add_rug=TRUE, verbose=FALSE, layout=c(3, 3))
y_pred <- predict(model, newx=newx, s=lambda)
plot(y_true, y_pred)

## With the result of cv.aglm()
model <- cv.aglm(x, y)
lambda <- model@lambda.min
plot(model, s=lambda, resid=TRUE, add_rug=TRUE, verbose=FALSE, layout=c(3, 3))
y_pred <- predict(model, newx=newx, s=lambda)
plot(y_true, y_pred)
Make predictions for new data
## S3 method for class 'AccurateGLM'
predict(
  object,
  newx = NULL,
  s = NULL,
  type = c("link", "response", "coefficients", "nonzero", "class"),
  exact = FALSE,
  newoffset,
  ...
)
object: A model object obtained from aglm() or cv.aglm().
newx: A design matrix for new data. See the description of x in aglm for more details.
s: Same as in predict.glmnet.
type: Same as in predict.glmnet.
exact: Same as in predict.glmnet.
newoffset: Same as in predict.glmnet.
...: Other arguments are passed directly to the backend predict.glmnet() call.

The returned object depends on type. See predict.glmnet for more details.
Kenji Kondo,
Kazuhisa Takahashi and Hikari Banno (worked on L-Variable related features)
Suguru Fujita, Toyoto Tanaka, Kenji Kondo and Hirokazu Iwasawa (2020). AGLM: A Hybrid Modeling Method of GLM and Data Science Techniques. Actuarial Colloquium Paris 2020. https://www.institutdesactuaires.com/global/gene/link.php?doc_id=16273&fg=1
#################### using plot() and predict() ####################
library(MASS) # For Boston
library(aglm)

## Read data
xy <- Boston # xy is a data.frame to be processed.
colnames(xy)[ncol(xy)] <- "y" # Let medv be the objective variable, y.

## Split data into train and test
n <- nrow(xy) # Sample size.
set.seed(2018) # For reproducibility.
test.id <- sample(n, round(n/4)) # ID numbers for test data.
test <- xy[test.id,] # test is the data.frame for testing.
train <- xy[-test.id,] # train is the data.frame for training.
x <- train[-ncol(xy)]
y <- train$y
newx <- test[-ncol(xy)]
y_true <- test$y

## With the result of aglm()
model <- aglm(x, y)
lambda <- 0.1
plot(model, s=lambda, resid=TRUE, add_rug=TRUE, verbose=FALSE, layout=c(3, 3))
y_pred <- predict(model, newx=newx, s=lambda)
plot(y_true, y_pred)

## With the result of cv.aglm()
model <- cv.aglm(x, y)
lambda <- model@lambda.min
plot(model, s=lambda, resid=TRUE, add_rug=TRUE, verbose=FALSE, layout=c(3, 3))
y_pred <- predict(model, newx=newx, s=lambda)
plot(y_true, y_pred)
Display textual information of the model
## S3 method for class 'AccurateGLM'
print(x, digits = max(3, getOption("digits") - 3), ...)

x: A model object obtained from aglm() or cv.aglm().
digits: Used to control the number of significant digits in the printout.
...: Other arguments are passed directly to the backend print call.
No return value, called for side effects.
Kenji Kondo
Get residuals of various types
## S3 method for class 'AccurateGLM'
residuals(
  object,
  x = NULL,
  y = NULL,
  offset = NULL,
  weights = NULL,
  type = c("working", "pearson", "deviance"),
  s = NULL,
  ...
)

object: A model object obtained from aglm() or cv.aglm().
x: A design matrix. If not given, the design matrix used for fitting is used.
y: A response variable. If not given, the response used for fitting is used.
offset: An offset value. If not given, the offset used for fitting is used.
weights: Sample weights. If not given, the weights used for fitting are used.
type: A string representing the type of residuals: "working", "pearson", or "deviance".
s: A numeric value specifying the \(\lambda\) value at which residuals are calculated.
...: Other arguments are currently not used and just discarded.
A numeric vector representing calculated residuals.
Kenji Kondo
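A minimal sketch of the three residual types (not one of the package's own examples), assuming the Boston data from MASS:

library(MASS) # For the Boston data
library(aglm)
model <- cv.aglm(Boston[-14], Boston$medv)
lambda <- model@lambda.min
r.w <- residuals(model, type="working", s=lambda)
r.p <- residuals(model, type="pearson", s=lambda)
r.d <- residuals(model, type="deviance", s=lambda)
summary(r.d) # Distribution of deviance residuals at the chosen lambda.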