Data Analysis for Decision Making
Assignment: Build a scorecard from scratch
Deadline to submit: November 19, 2023, at 11:59 PM (Paris time)
The main goal of this assignment is to write a report that explains the approach you took to build a scorecard on your own. We will use the dataset “dataassignment.xlsx”. In this dataset, a bank has collected information about payment incidents for a lending portfolio of 2850 French borrowers. Our goal is to predict the probability that a borrower experiences a payment default on his/her credit, conditional on his/her characteristics. The variable we want to predict is “incident”, which is equal to “Yes” if the borrower has experienced a payment default and “No” if the borrower has not.
Using Chapter 2 on probability, we want to estimate P(incident = “Yes”) using the concept of probability with multiple conditioning events, as done for instance in Example 10 of Chapter 2.
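As an illustration of conditioning on several events at once, the following Python sketch estimates P(incident = “Yes”) as a relative frequency among the borrowers matching a given profile. It is not Example 10 of Chapter 2 and does not replace the SPSS workflow expected in the report. The twelve records and the helper name conditional_default_rate are invented for illustration and are not taken from “dataassignment.xlsx”.

```python
# Toy records, invented for illustration only (NOT from dataassignment.xlsx).
# Each tuple is (family, credcard, incident).
raw = [
    ("Single", "No", "Yes"), ("Single", "No", "Yes"),
    ("Single", "No", "No"), ("Single", "No", "Yes"),
    ("Single", "Yes", "No"), ("Single", "Yes", "Yes"),
    ("Married", "No", "No"), ("Married", "No", "Yes"),
    ("Married", "Yes", "No"), ("Married", "Yes", "No"),
    ("Married", "Yes", "No"), ("Married", "Yes", "Yes"),
]
records = [
    {"family": f, "credcard": c, "incident": i} for f, c, i in raw
]


def conditional_default_rate(rows, **conditions):
    """Estimate P(incident = "Yes" | conditions) as a relative frequency.

    `conditions` maps variable names to required values, for example
    family="Single", credcard="No". With no conditions, the function
    returns the unconditional default rate.
    """
    matching = [
        r for r in rows
        if all(r[name] == value for name, value in conditions.items())
    ]
    if not matching:
        return None  # the conditioning event never occurs in the sample
    defaults = sum(1 for r in matching if r["incident"] == "Yes")
    return defaults / len(matching)


# Unconditional rate, then rates with one and two conditioning events.
print(conditional_default_rate(records))
print(conditional_default_rate(records, family="Single"))
print(conditional_default_rate(records, family="Single", credcard="No"))
```

Each additional conditioning event shrinks the group of matching borrowers, so estimates based on very few observations should be read with caution.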
To predict the variable “incident”, you have 8 variables at your disposal:
• “income”: monthly income of the borrower (in euros).
• “duration”: duration of the borrower’s credit (in months).
• “amount”: amount of the borrower’s credit (in euros).
• “family”: “Single” if the borrower is single, “Married” if the borrower is married.
• “seniority”: number of months the borrower has been a customer of the bank.
• “credcard”: “Yes” if the borrower owns a credit card, “No” otherwise.
• “age”: age of the borrower (in years).
• “depbirth”: “Department 1”, “Department 2”, or “Department 3”, according to the borrower’s department of birth.
To help you construct the scorecard, have a look at Examples 10 and 11 of Chapter 2, which will be very useful.
In this project, I expect you to:
1. Perform a univariate analysis of the variables to get a better idea of the data. (See Chapter 1 for this step.)
a. Give the list of numerical variables and categorical variables in the dataset.
b. Give a table displaying some numerical descriptive measures for the numerical variables (such as the mean, standard deviation, skewness, kurtosis, …).
c. Give frequency tables for the categorical variables.
2. Bin the numerical variables into categorical variables using the binning method of your choice. (See Chapter 2 for this step.) Note that it is also possible to perform the binning step after the variable selection of step 3. Note also that you may consider several binning methods and compare them in order to select the one that maximizes the dependence between the conditioning variables and the variable “incident”.
3. Select a subset of 3 or 4 variables (out of the 8 variables) that will be used to predict the variable “incident”. (See Chapter 5 for this step.)
a. You should retain the variables that display a significant dependence with the variable “incident”. To that end, use the statistical tests of Chapter 5 (Advanced statistical tests):
i. Pearson test of correlation if you are analyzing the dependence between two numerical variables.
ii. Chi-square test of independence if you are analyzing the dependence between two categorical variables.
iii. One-way ANOVA test if you are analyzing the dependence between one numerical variable and one categorical variable.
b. Then, out of those selected variables, select the ones that are most highly correlated/associated with the variable “incident”, using the measures of correlation/association of Chapter 5 (Advanced statistical tests): Cramér’s V or the Pearson correlation coefficient.
4. Build n-way contingency tables and calculate the probability that the borrower experiences a payment default, conditional on the variables selected in step 3. (See Example 10 of Chapter 2 for an example.) At the end of this step, you must display the probabilities of payment default in a prediction table.
5. Build and display your final scorecard. (See Example 11 of Chapter 2 for this step.) It would be interesting to give some examples of how non-expert bank staff, such as bank advisors, can use it. It would also be an added value to identify from the scorecard the main drivers of the payment defaults experienced by borrowers.
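The comparison of binning methods mentioned in step 2 and the association measure of step 3.b can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the method of the course chapters and not a substitute for the SPSS outputs expected in the report. The income values, the incident labels, and the helper names equal_frequency_bins and cramers_v are hypothetical. Cramér’s V is computed from the chi-square statistic as sqrt(chi2 / (n * (min(r, c) - 1))), where r and c are the numbers of rows and columns of the contingency table.

```python
from collections import Counter
from math import sqrt


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 by rank.

    Bins hold roughly equal counts. Tied values may land in
    different bins, which a real analysis should handle explicitly.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def cramers_v(x, y):
    """Cramér's V between two categorical sequences of equal length."""
    n = len(x)
    joint = Counter(zip(x, y))   # observed cell counts
    row_totals = Counter(x)
    col_totals = Counter(y)
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Hypothetical toy sample (NOT from dataassignment.xlsx).
income = [900, 1200, 1500, 1800, 2500, 3200, 4000, 5200]
incident = ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"]

# Compare two candidate binnings of "income" by their association
# with "incident".
for n_bins in (2, 4):
    binned = equal_frequency_bins(income, n_bins)
    print(n_bins, round(cramers_v(binned, incident), 4))
```

A larger V for one binning only suggests a stronger association in the sample. With few observations per bin, the chi-square test of independence from step 3.a should still be used to check that the dependence is significant.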
A report of about 10 pages is expected, explaining your approach and displaying the important results. Clarity, completeness, innovation potential, use of data, technical details, and overall presentation will be evaluated.
The assignment submission must include a written report of about 10 pages, with screenshots of the SPSS outputs to support your explanations.