Question

Data Description:
See the data dictionary for a full description of the data.

Assignment Instructions:
For this assignment we will fit a multiple logistic regression model for a binary response variable to the credit_approval data set using PROC LOGISTIC, and assess its predictive accuracy. We will compare the predictive performance of our multiple logistic regression model to the predictive performance of a pre-specified model. You should use the same data formatting and dummy variables that you developed in Assignment #5.

The assignment has been broken down into pieces to help you achieve the overall objective. Throughout the assignment you will find code snippets to help you with a particular SAS syntax. These code snippets are not complete pieces of SAS code. You will need to embed reformulated code snippets within a SAS data step, or within your SAS program if the code snippet is a SAS PROC.

Split the Sample: In order to assess the predictive accuracy of this classification model we will employ a statistical methodology called cross-validation by splitting the sample data into a 70/30 split, which we will respectively refer to as the training and testing samples. Split the data by generating a uniform random variable using the statement u=uniform(123); in a SAS data step. If (u<0.7) then assign the observation to the training data set, else assign the observation to the testing data set.
We will estimate the model on the training data set and assess its predictive accuracy on the testing data set.

    data temp;
        set mydata.credit_approval;

        * Flag the observations as training/testing;
        * Since we set the seed value to 123, we will get the same set of
          random numbers every time and we will all get the same set of random
          numbers;
        u=uniform(123);
        if (u<0.7) then train=1; else train=0;

        if A16='+' then Y=1;
        else if A16='-' then Y=0;
        else Y=.;

        * Create a response indicator based on the training/testing split;
        if (train=1) then Y_train=Y; else Y_train=.;

        /* DEFINE ALL OF YOUR DUMMY VARIABLES HERE */

        * Delete the observations with missing values;
        if (A1='?') or (A4='?') or (A5='?') or (A6='?') or (A7='?')
            or (A2=.) or (A3=.) or (A8=.) or (A11=.) or (A14=.) or (A15=.)
            then delete;
    run;

In order for all of us to get the same answer we must follow the outline of the above data step.

Fit the Model: Find the optimal model using backward variable selection. You will need to include an output statement in PROC LOGISTIC in order to output the model scores, i.e. the probability that Y=1.

    proc logistic data=temp descending;
        model Y_train = A2 A3 A8 A11 A14 A15
                        A1_b A4_u A5_g
                        A6_aa A6_c A6_cc A6_ff A6_i A6_k A6_m A6_q A6_w A6_x
                        A7_bb A7_ff A7_h A7_v
                        A9_t A10_t A12_t A13_g / selection=backward;
        output out=model_data pred=yhat;
    run;

We will refer to the model selected through this backward variable selection procedure as Model #1. Your report should include the backward selection summary table, the parameter estimates, the goodness-of-fit statistics, and a discussion of these results.

In addition to the optimal model that you will define, your manager wants you to fit this particular model. We will refer to this model as Model #2.

    proc logistic data=temp descending;
        model Y_train = A9_t A2 A3;
        output out=model_data2 pred=yhat;
    run;

We want to compare the predictive merits of the two models and suggest that one model be used instead of the other model.
We will begin this analysis by discussing the in-sample (training sample) model fit, including the parameter estimates and the goodness-of-fit statistics automatically produced by SAS.

Note that by using the variable Y_train as the response variable, SAS will automatically fit the model on the training data set, but SAS will also score the testing data set with the out-of-sample predicted values (yhat where (train=0)). This is the standard approach to scoring out-of-sample observations required by SAS on all of the newer SAS linear model PROCs.

Assessing the Predictive Accuracy: While the goodness-of-fit statistics provide an insight into the in-sample predictive accuracy of the fitted model, we must always assess the out-of-sample predictive accuracy of our model in order to guard against overfitting, hence the need for some type of cross-validation. We will assess the predictive accuracy of our model by creating a lift chart (also known as a cumulative gains chart) and computing the respective Kolmogorov-Smirnov test statistic as a model comparison measure.
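The scoring behavior described above can be checked directly. The following is a minimal sketch, assuming the Model #2 output data set model_data2 (with the variables train and yhat) has already been created by the PROC LOGISTIC step given in this assignment. It is only a sanity check and is not part of the required code.

```sas
* Sketch: confirm that both samples received a predicted probability;
* Rows with train=0 have a missing Y_train but should still carry yhat;
proc means data=model_data2 n nmiss mean min max;
    class train;
    var yhat;
run;
```

If the testing rows were scored as described, the N column for train=0 should be nonzero and NMISS should be zero for both groups.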
Here is how we compute the in-sample lift chart for Model #2.

    proc logistic data=temp descending;
        model Y_train = A9_t A2 A3;
        output out=model_data2 pred=yhat;
    run;

    * The descending option assigns the highest model scores to the lowest
      score_decile;
    proc rank data=model_data2 out=training_scores descending groups=10;
        var yhat;
        ranks score_decile;
        where train=1;
    run;

    * To create the lift chart run this exact code;
    proc means data=training_scores sum;
        class score_decile;
        var Y;
        output out=pm_out sum(Y)=Y_Sum;
    run;

    proc print data=pm_out; run;

    data lift_chart;
        set pm_out (where=(_type_=1));
        by _type_;
        Nobs=_freq_;
        score_decile = score_decile+1;
        if first._type_ then do;
            cum_obs=Nobs;
            model_pred=Y_Sum;
        end;
        else do;
            cum_obs=cum_obs+Nobs;
            model_pred=model_pred+Y_Sum;
        end;
        retain cum_obs model_pred;
        * 201 represents the number of successes;
        * This value will need to be changed with different samples;
        pred_rate=model_pred/201;
        base_rate=score_decile*0.1;
        lift = pred_rate-base_rate;
        drop _freq_ _type_;
    run;

    proc print data=lift_chart; run;

    ods graphics on;
    axis1 label=(angle=90 '% Captured from Target Population');
    axis2 label=('Total Population');
    legend1 label=(color=black height=1 '')
            value=(color=black height=1 'Model #2' 'Random Guess');
    title 'Model #2: In-Sample Lift Chart';
    symbol1 color=green interpol=join w=2 value=dot height=1;
    symbol2 color=black interpol=join w=2 value=dot height=1;
    proc gplot data=lift_chart;
        plot pred_rate*base_rate base_rate*base_rate / overlay
             legend=legend1 vaxis=axis1 haxis=axis2;
    run; quit;
    ods graphics off;

These SAS commands will produce the following lift chart table and a lift chart plot.

    Obs  score_decile  Y_Sum  Nobs  cum_obs  model_pred  pred_rate  base_rate     lift
      1             1     42    45       45          42    0.20896        0.1  0.10896
      2             2     35    45       90          77    0.38308        0.2  0.18308
      3             3     35    45      135         112    0.55721        0.3  0.25721
      4             4     36    45      180         148    0.73632        0.4  0.33632
      5             5     33    45      225         181    0.90050        0.5  0.40050
      6             6      9    45      270         190    0.94527        0.6  0.34527
      7             7      3    45      315         193    0.96020        0.7  0.26020
      8             8      3    45      360         196    0.97512        0.8  0.17512
      9             9      0    45      405         196    0.97512        0.9  0.07512
     10            10      5    45      450         201    1.00000        1.0  0.00000

You can find the scaling factor
201 by using PROC FREQ. The value 201 is the scaling factor for the in-sample lift chart. The out-of-sample lift chart will have a different scaling factor.

    proc freq data=temp;
        tables train*Y;
    run;

You will produce the lift chart (table) and a plot of the lift chart for both models and for both the training and testing data sets, and display all four tables and graphs in your report. From this, with a calculator or Excel, you can calculate both the Lift and the Kolmogorov-Smirnov (KS) statistic.

- The Lift for any given bucket is defined as the response rate of the bucket DIVIDED by the theoretical random response rate of the bucket. For example, if there are 10 buckets, the theoretical response rate of any bucket is 10%. If bucket 6 had a response rate of 25%, then the lift in that bucket is 25%/10% = 2.5. The overall lift for the model is the MAXIMUM Lift value for all of the buckets. For more information on lift charts see: http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html

- The Kolmogorov-Smirnov (KS) statistic for any given bucket is defined as the cumulative response rate of the bucket MINUS the theoretical cumulative random response rate. For example, if there are 10 buckets, it is assumed that a random model would have 10% response in each bucket. By the time you got to bucket 6, you should have 60% of your responders. However, if your model had already identified 80% of the responders by bucket 6, then your KS for bucket 6 would be 80% - 60% = 20%. The KS for the overall model is the MAXIMUM KS value for all of the buckets.

Assignment Document:
All assignment reports should conform to the standards and style of the report template provided to you. Results should be presented and discussed in an organized manner, with the discussion in close proximity to the results. The report should not contain unnecessary results or information.

The "Results" section of this report should contain two subsections: "In-Sample Results" and "Out-of-Sample Results".
The "In-Sample Results" section will contain the two fitted logistic regression models with their model parameter values, the output of a model selection procedure, a lift chart table and graph for each model for the training data set, and a discussion of the models and their goodness-of-fit statistics. The "Out-of-Sample Results" section should contain a lift chart table and graph for each model for the testing data set, with a discussion of their predictive accuracy and a recommendation for one model over the other model. The document should be submitted in pdf format.
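The Lift and KS definitions given earlier can also be computed in SAS instead of with a calculator. The following is a minimal sketch, assuming the lift_chart data set produced by the lift chart code in this assignment and the in-sample scaling factor of 201. The data set name lift_ks and the variables bucket_rate, bucket_lift, and ks are illustrative names that do not appear in the assignment, and the sketch reads "response rate of the bucket" as the share of all successes that falls in that decile.

```sas
* Sketch: per-decile Lift and KS from the lift_chart table;
data lift_ks;
    set lift_chart;
    * Share of all successes captured in this decile alone;
    * 201 is the in-sample number of successes and must change per sample;
    bucket_rate = Y_Sum/201;
    * Lift: bucket share divided by the 10 percent expected at random;
    bucket_lift = bucket_rate/0.1;
    * KS: cumulative share captured minus cumulative random share;
    ks = pred_rate - base_rate;
run;

* The overall Lift and KS are the maxima over the ten deciles;
proc means data=lift_ks max;
    var bucket_lift ks;
run;
```

Applied to the in-sample Model #2 table, this reading gives a maximum Lift of about 2.09 in decile 1 (42/201 divided by 0.1) and a maximum KS of about 0.40 in decile 5. Note that the ks variable here is the same quantity the assignment code stores in the column named lift.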
Paper #62186 | Written 18-Jul-2015