Is there a relationship between credit scores and demographics? The purpose of this project is to determine which groups (young people, minorities, people living in certain locations, etc.) should be taught more about credit scores and their importance. Credit scores matter because they determine what kinds of credit you can get, such as credit cards and loans, and how much credit you are allowed. One obstacle I face, although not a physical one, is that using demographic data could introduce biases that affect my results. As we learned in our ethics class, these biases can have effects around the world, so it is important that the original data does not include them, or else the model will replicate them.
Credit scores are determined by payment history (35%), amounts owed (30%), length of credit history (15%), new inquiries (10%), and credit mix (10%). Although only five factors go into a score, there is a lot of variation within each factor. Everyone has a different line of credit and credit utilization, and the number of sources of credit also varies from person to person. FICO determines these scores, and credit bureaus such as Equifax and TransUnion are responsible for recording changes and reporting them. This affects essentially everyone in the United States, from young people fresh out of high school to CEOs to those who have just retired.
In addition, the five factors that go into a score can be greatly affected by people's circumstances, including where they live, what their job is, and the environment they grew up in. This is mainly what I want to determine in my project: who suffers the most under this credit score structure, and what aspects of their lives cause this to happen? Are young people not taught enough in school, so that they have to learn the hard way? Are people living in certain areas of a city unable to easily access all of the services available? Are there biases against certain races ingrained in society? Hopefully my results can answer these questions so that we have a better idea of whom to help educate.
Although I could not find a data source that provided both personal data and credit scores, I was able to find a dataset with credit approval as the target variable. This means that the target is categorical rather than numeric, as I had originally planned. It is a relatively small dataset with only 690 observations, but it was the most detailed one I could find. There are 15 features, including gender, age, ethnicity, and employment. In total, that is 9 categorical and 6 numeric variables. I got this data from the University of California Irvine Machine Learning Repository; unfortunately, their method of collecting the data is confidential.
In this project, I used a boosted trees model, very similar to the one we used with the Titanic dataset, to classify each individual as approved or not approved for various forms of credit. I first had to do some preprocessing, transforming the categorical features with one-hot encoding. For the boosted trees model, I used 60 trees with a max depth of 5, as I found this gave me the best results.
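To make the one-hot encoding step concrete, here is a minimal sketch in plain Python. It is not the code used in this project: `one_hot_encode` is a hypothetical helper, and the column names and values are made up for illustration rather than taken from the actual dataset. Each categorical column is replaced by one 0/1 column per category, while numeric columns are left unchanged.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per observed category."""
    # Collect the sorted set of categories seen in each categorical column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if col in categories:
                # One indicator column per category: 1 if it matches, else 0.
                for cat in categories[col]:
                    new_row[f"{col}_{cat}"] = 1 if value == cat else 0
            else:
                # Numeric columns pass through untouched.
                new_row[col] = value
        encoded.append(new_row)
    return encoded


# Hypothetical applicants; these values are invented for illustration only.
applicants = [
    {"age": 22.0, "employed": "yes", "marital_status": "single"},
    {"age": 45.5, "employed": "no", "marital_status": "married"},
    {"age": 31.0, "employed": "yes", "marital_status": "married"},
]

encoded = one_hot_encode(applicants, ["employed", "marital_status"])
print(encoded[0])
```

After encoding, every feature is numeric, which is the form a tree-based model expects as input.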
I found that after training, my model was able to produce an accuracy of about 83.1%. On the left are three charts. The first shows the feature contributions for individual 121 from the validation set, as I felt it did a good job of highlighting the importance of the various features. As you can see, the two features that are most detrimental to the result are debt and having prior defaults. This individual is also young, unmarried, Hispanic, and not a citizen, all of which also pushed the prediction toward no approval. What did help was being employed, but interestingly, it did not help very much. This could be because students and retirees counterbalance those who are employed. I also thought it was interesting that education, in this case being a high school graduate, did not have much of an effect for this individual. Looking at the importance of the features as a whole, we can see some similar trends. Prior default, age, citizenship, and debt are the most significant features. Features like zip code did not seem to have much of an effect in this case and were not shown in the chart.
For this project, I would ideally want a dataset that included actual credit scores and not simply credit approval. That way, I would be able to not only gauge how big a contribution various features make, but also determine how much the credit score is actually affected. A large contribution to a small negative effect is not the same as a large contribution to a large negative effect. Aside from having a better dataset, I would also like to have many more observations, preferably at least several thousand. This way, I would be able to attain better accuracy and a smaller loss. Lastly, with such a dataset, I would still want to show feature contribution and importance, but instead of using a boosted trees classifier, I would use a regression model. This would be similar to what we did with the MPG dataset, and I would use a linear activation to predict the credit score.
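The regression idea above can be sketched with a single output unit that uses a linear activation, meaning the prediction is just the weighted sum of the inputs and can take any numeric value instead of a class label. This is only a rough illustration in plain Python, not the model I would actually build: `train_linear_unit` and `predict` are hypothetical helpers, and the toy data is made up and noise-free, not real credit scores.

```python
def train_linear_unit(features, targets, learning_rate=0.1, epochs=5000):
    """Fit y = w . x + b by full-batch gradient descent on mean squared error."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            # Linear activation: the output is the weighted sum itself.
            prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = prediction - y
            # Accumulate the gradient of the mean squared error.
            for j in range(n_features):
                grad_w[j] += 2 * error * x[j] / n_samples
            grad_b += 2 * error / n_samples
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict(weights, bias, x):
    """Return the numeric prediction for one feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Made-up toy data generated from target = 2*x1 - 3*x2 + 1 (no noise).
toy_features = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
    (1.0, 1.0), (0.5, 0.25), (0.25, 0.75),
]
toy_targets = [2 * x1 - 3 * x2 + 1 for x1, x2 in toy_features]

weights, bias = train_linear_unit(toy_features, toy_targets)
print(weights, bias)
```

With a numeric target like this, each learned weight indicates roughly how far the prediction moves per unit change in a feature, which is the kind of effect size that an approve-or-deny label cannot show.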