Channel: ListenData

SAS : Calculate AUC of Validation Data

This tutorial explains how to calculate the area under the curve (AUC) of a validation sample. The AUC of the validation sample is calculated by applying the coefficients (estimates) derived from the training sample to the validation sample. This process is called scoring. The detailed explanation is given below.

Steps of calculating AUC of validation data

1. Split the data into two parts: 70% training and 30% validation. A 60/40 or 80/20 split also works.

2. Run a logistic regression model on the training sample.

3. Note the coefficients (estimates) of the significant variables from the model run in Step 2.

4. Apply the following equation to calculate the predicted probability in the validation sample.

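Logistic Regression Equation
p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))
In this case, b0 is the intercept and b1...bk are the coefficients derived from the training sample (Step 2); x1...xk are the corresponding predictor values in the validation sample.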

5. Calculate the area under the curve (AUC) from the probability scores derived in Step 4.
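Step 1 can be sketched with PROC SURVEYSELECT. This is only a minimal illustration: the input dataset name ADMISSION and the seed value are assumptions, not part of the original example. The OUTALL option keeps every row and adds a Selected flag (1 = sampled), which is then used to route rows to the TRAINING and VALIDATION datasets used in the rest of this tutorial.

```sas
/* Assumption: the full data is in a dataset called ADMISSION (hypothetical name) */
proc surveyselect data=admission out=split
     method=srs       /* simple random sampling without replacement */
     samprate=0.7     /* 70% of rows are flagged for training */
     seed=12345       /* arbitrary seed so the split is reproducible */
     outall;          /* keep all rows and add the Selected flag */
run;

/* Route flagged rows to TRAINING and the rest to VALIDATION */
data training validation;
    set split;
    if Selected = 1 then output training;
    else output validation;
    drop Selected;
run;
```

Changing SAMPRATE= to 0.6 or 0.8 gives the 60/40 or 80/20 splits mentioned in Step 1.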


Method I : PROC LOGISTIC to calculate AUC of Validation
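/* Fit the model on the training sample and save the final estimates in COEFF */
Proc Logistic Data = training outest=coeff descending;
class rank / param = ref;
Model admit = gre gpa rank / selection = stepwise slstay=0.15 slentry=0.15 stb;
Run;

/* Score the validation sample with the saved estimates (MAXITER=0 means no refitting) */
Proc Logistic Data = validation inest=coeff descending;
class rank / param = ref;
Model admit = gpa rank / MAXITER=0;
Run;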
The OUTEST= option in the PROC LOGISTIC statement stores the final parameter estimates in a SAS dataset, here named COEFF. Because we ran stepwise selection, the insignificant variable GRE is dropped from the final model.

The INEST= option in the PROC LOGISTIC statement supplies the final parameter estimates calculated from the training dataset as the starting values.

The MAXITER= option in the MODEL statement specifies the maximum number of iterations to perform. The combination of DATA=validation data, INEST=final estimates from training data, and MAXITER=0 causes PROC LOGISTIC to score, not refit, the validation data.
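Before running the second PROC LOGISTIC, it can help to look at what INEST= will actually read. A minimal sketch, assuming the training step above has already created the COEFF dataset:

```sas
/* Print the estimates saved by OUTEST= in the training run */
proc print data=coeff noobs;
run;
```

The printed row should contain the intercept and one column per model term, which makes it easy to confirm which variables survived the stepwise selection.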

Important Points
  1. Use only the variables that were significant in the training data in the MODEL statement of PROC LOGISTIC DATA = VALIDATION. In the above example, the GRE variable was not included.
  2. The parameter estimates (coefficients) reported by both PROC LOGISTIC programs remain the same, because the second program only scores the data and does not rebuild the model.
  3. Ignore the warning 'Convergence was not attained in 0 iteration' when running the second PROC LOGISTIC program.
AUC - Validation

The c-statistic in the PROC LOGISTIC output is the AUC (area under the curve). In this case, it is 0.607.
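To capture the c-statistic in a dataset instead of reading it off the listing, one option is ODS OUTPUT. The sketch below assumes the ODS table name Association, which is the name PROC LOGISTIC gives to its "Association of Predicted Probabilities and Observed Responses" table; the output dataset name ASSOC_VAL is hypothetical.

```sas
/* Save the association table from the validation scoring run */
ods output Association=assoc_val;
proc logistic data=validation inest=coeff descending;
    class rank / param=ref;
    model admit = gpa rank / maxiter=0;
run;

/* The row labelled c holds the AUC of the validation sample */
proc print data=assoc_val noobs;
run;
```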

Method II : PROC NPAR1WAY

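In this method, we use the Wilcoxon rank-sum statistic to calculate the AUC of the validation sample. First, we score the VALIDATION sample with the SCORE statement, which writes the predicted probability of admit=1 to the variable P_1 in the dataset VALPRED.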
Proc Logistic Data = training descending;
class rank / param = ref;
Model admit = gre gpa rank / selection = stepwise slstay=0.15 slentry=0.15 stb;
score data=validation out = valpred;
Run;

ods select none;
ods output WilcoxonScores=WilcoxonScore;
proc npar1way wilcoxon data= valpred ;
where admit^=.;
class admit;
var  p_1;
run;
ods select all;

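data AUC;
set WilcoxonScore end=eof;
/* v1 and v2 both start at 1 and keep their values across rows */
retain v1 v2 1;
/* First class level: distance between the observed and expected rank sums */
if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
/* Running product of the class sizes, i.e. n0 * n1 after the last row */
v2=N*v2;
if eof then do;
d=v1/v2;
/* Gini = 2*AUC - 1, so Gini = 2*d and AUC = d + 0.5 */
Gini=d * 2;    AUC=d+0.5;
put AUC=  GINI=;
keep AUC Gini;
output;
end;
run;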
The above program returns an AUC of 0.6062.
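The reason the data step works can be written out as a short derivation. The symbols here are introduced only for illustration: $n_1$ and $n_0$ are the numbers of observations in the two classes of admit, $S_1$ is the Wilcoxon rank sum of P_1 for one class (SumOfScores), and $E[S_1]$ is its expected value under no association (ExpectedSum).

```latex
\begin{aligned}
E[S_1] &= \frac{n_1 (n_0 + n_1 + 1)}{2}, \qquad
U_1 = S_1 - \frac{n_1 (n_1 + 1)}{2}, \\
S_1 - E[S_1] &= U_1 - \frac{n_0 n_1}{2}, \\
\mathrm{AUC} &= \frac{U_1}{n_0 n_1}
  = \frac{1}{2} + \frac{S_1 - E[S_1]}{n_0 n_1}, \\
\mathrm{Gini} &= 2\,\mathrm{AUC} - 1 = \frac{2\,(S_1 - E[S_1])}{n_0 n_1}.
\end{aligned}
```

In the program, v1 is $|S_1 - E[S_1]|$ and v2 is $n_0 n_1$, so d = v1/v2 and AUC = d + 0.5. Because of the absolute value, the result is always at least 0.5, which implicitly assumes the model ranks the validation cases no worse than chance.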
