Exploring Machine Learning: Credit Card Default Prediction

29 November 2023

Introduction

In this mini-project, I applied the knowledge gained from the Machine Learning course at Seattle Pacific University to a fascinating problem: predicting credit card defaults. The focus was on using the skills acquired during the course to analyze and model the Default of Credit Card Clients Dataset.

Click here for full code

Prediction Problem Overview

The primary objective was to address a binary classification problem: predicting whether a credit card client would default in the next month (1=yes, 0=no). The dataset comprised 30,000 examples with 24 features, offering a diverse range of information including credit amount, gender, education, marital status, age, and repayment history.
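To make the setup concrete, here is a minimal loading sketch with pandas; the file name and the target-column rename are assumptions and may differ from the full code linked above.

```python
import pandas as pd

# Assumed local CSV export of the UCI "Default of Credit Card Clients" dataset.
# The file name and the original target-column name are hypothetical here.
df = pd.read_csv("default_of_credit_card_clients.csv")
df = df.rename(columns={"default payment next month": "DEFAULT"})

print(df.shape)  # expected: 30,000 rows
```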

Data Observations

Key Insights:

1. Default Rate: Approximately 22.32% of clients defaulted, so the classes are noticeably imbalanced (the check is shown in the snippet after this list).

2. Data Size: With 24 features and 30,000 examples, handling the dataset efficiently was itself a challenge.

3. Feature Types: The dataset included both numerical and categorical features, requiring thoughtful preprocessing.
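The quick checks behind these observations look roughly like this, reusing the df loaded above:

```python
# Class balance: roughly 22% of clients defaulted.
print(df["DEFAULT"].value_counts(normalize=True))

# Mix of numeric amounts and integer-coded categorical columns (SEX, EDUCATION, MARRIAGE).
print(df.dtypes.value_counts())
```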

Feature Engineering

To enhance the predictive power of the models, I engineered several new features; a sketch of the calculations follows this list. Key additions included:

1. Credit Utilization: Calculating the ratio of total bill amounts to the credit limit, offering insights into financial stress.

2. Total Months in Delay: Counting months with repayment delay to create a new feature.

3. Average Delay Duration: Calculating the mean of positive delay values for each client.
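A rough sketch of these engineered features is shown below, assuming the standard UCI column names (LIMIT_BAL, BILL_AMT1–BILL_AMT6, and the repayment-status columns PAY_0 and PAY_2–PAY_6); the exact formulas in the full code may differ slightly.

```python
bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
pay_status_cols = ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]  # the UCI release has no PAY_1

# 1. Credit utilization: total billed amount relative to the credit limit.
df["CREDIT_UTILIZATION"] = df[bill_cols].sum(axis=1) / df["LIMIT_BAL"]

# 2. Total months in delay: months where the repayment-status code is positive (payment delayed).
delays = df[pay_status_cols]
df["TOTAL_MONTHS_DELAYED"] = (delays > 0).sum(axis=1)

# 3. Average delay duration: mean of the positive delay values, 0 for clients who were never late.
df["AVG_DELAY"] = delays.where(delays > 0).mean(axis=1).fillna(0)
```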

Preprocessing and Transformations

Recognizing the importance of data preprocessing, I took several steps (a condensed pipeline sketch follows this list):

1. Handling Missing Values: Verified that there were no placeholder values such as 'Unknown' or 'N/A'.

2. Imputing Columns: Handled any NaN values with appropriate imputation.

3. Encoding Categorical Features: Used one-hot encoding for 'SEX', 'EDUCATION', and 'MARRIAGE' to prevent ordinal misinterpretation.

4. Scaling Numeric Features: Applied scaling to numeric features such as 'LIMIT_BAL', 'AGE', 'PAY_AMT1', and 'BILL_AMT1'.
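A condensed sketch of this preprocessing as a scikit-learn ColumnTransformer follows; the column lists are abbreviated, and the choice of StandardScaler and median imputation is an assumption.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = ["SEX", "EDUCATION", "MARRIAGE"]
numeric_features = ["LIMIT_BAL", "AGE", "PAY_AMT1", "BILL_AMT1",
                    "CREDIT_UTILIZATION", "TOTAL_MONTHS_DELAYED", "AVG_DELAY"]  # abbreviated list

preprocessor = ColumnTransformer(
    transformers=[
        # Impute any missing values, then scale the numeric columns.
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric_features),
        # One-hot encode the integer-coded categorical columns so no ordering is implied.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ],
    remainder="passthrough",  # remaining PAY_*/BILL_AMT*/PAY_AMT* columns; the full code scales these too
)
```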

Baseline Model

The initial exploration involved fitting a simple scikit-learn baseline model, which achieved an accuracy of 78.36%. This baseline served as a reference point for the model comparisons that follow.
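The exact baseline estimator isn't named here, but scikit-learn's standard choice is DummyClassifier, so a minimal sketch under that assumption (with an assumed train/test split) looks like this:

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X = df.drop(columns=["ID", "DEFAULT"])  # "ID" is the UCI row identifier (assumed present)
y = df["DEFAULT"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts "no default"
baseline.fit(X_train, y_train)
print(baseline.score(X_train, y_train))  # tracks the majority-class proportion, around 0.78
```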

Model Selection and Evaluation

Linear SVM Model

For a deeper analysis, I trained a Linear SVM model, tuning its hyperparameters with RandomizedSearchCV. The model achieved a mean cross-validation accuracy of 81.42%, indicating stable generalization across folds.
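A sketch of the tuning setup, assuming a LinearSVC wrapped in the preprocessing pipeline from above and a search over the regularization parameter C (the actual search space isn't spelled out here):

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

svm_pipe = make_pipeline(preprocessor, LinearSVC(max_iter=10000))

# Assumed search space: C sampled log-uniformly.
param_distributions = {"linearsvc__C": loguniform(1e-3, 1e3)}
search = RandomizedSearchCV(svm_pipe, param_distributions, n_iter=20, cv=5,
                            scoring="accuracy", random_state=42, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```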

Other Models Explored

Apart from the Linear SVM, I also evaluated Decision Tree, Random Forest, and k-Nearest Neighbors models (a cross-validation comparison sketch follows below). Notable observations:

Random Forest: Outperformed other models with a mean accuracy of 82.02% and low standard deviation.

Overfitting/Underfitting: Linear SVM and Random Forest generalized well, while Decision Tree and kNN were more sensitive to overfitting.
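The comparison itself can be sketched with cross_validate, again reusing the preprocessor; the hyperparameters shown are defaults rather than the tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "kNN": KNeighborsClassifier(),
}

for name, model in models.items():
    pipe = make_pipeline(preprocessor, model)
    scores = cross_validate(pipe, X_train, y_train, cv=5, return_train_score=True)
    print(f"{name}: train={scores['train_score'].mean():.4f} "
          f"cv={scores['test_score'].mean():.4f} (+/- {scores['test_score'].std():.4f})")
```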

Results on Test Set

The Random Forest model, chosen for its balance of accuracy and speed, was evaluated on the test set. The test accuracy (81.84%) aligned closely with the mean cross-validation score, affirming the model's generalization to unseen data.
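Roughly, the final evaluation looks like this, reusing the pipeline pieces from the earlier sketches (the tuned Random Forest hyperparameters are not shown):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

rf_pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state=42))
rf_pipe.fit(X_train, y_train)
print("Test accuracy:", rf_pipe.score(X_test, y_test))
```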

Feature Importance

Key features influencing predictions included repayment history (PAY_0), credit utilization, age, and demographic factors.
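One way to read off these importances from the fitted pipeline (this relies on a recent scikit-learn so that every step supports get_feature_names_out):

```python
import pandas as pd

# Map the forest's impurity-based importances back to the transformed feature names.
feature_names = rf_pipe[:-1].get_feature_names_out()
importances = pd.Series(rf_pipe[-1].feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```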

Concluding Remarks

The Random Forest model emerged as the top performer, exhibiting a cross-validation accuracy of 82.02%. The test set results (81.84%) validated the model's reliability on unseen data.

Takeaway

One significant takeaway from this project was the pivotal role of data transformation. Feature engineering, handling missing values, and appropriate preprocessing not only improved model performance but also gave me a much firmer grasp of the dataset itself. That understanding carries over to work beyond machine learning, such as data analytics.

Future Directions

Considering the success of Random Forest, future exploration could involve:

Ensemble Methods: Implementing stacking or boosting for enhanced performance (see the sketch after this list).

Feature Engineering: Exploring additional features to glean more insights.

Different Algorithms: Trying more complex algorithms or delving into deep learning models for intricate pattern recognition.
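As an illustration of the ensemble direction, here is a minimal stacking sketch with scikit-learn's StackingClassifier; the base estimators are illustrative choices, not results from this project.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("svm", LinearSVC(max_iter=10000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack_pipe = make_pipeline(preprocessor, stack)
stack_pipe.fit(X_train, y_train)
print("Stacking test accuracy:", stack_pipe.score(X_test, y_test))
```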

In conclusion, this project underscored the importance of both theoretical knowledge and hands-on application, emphasizing that effective data transformation is a key driver in the success of machine learning models.