Lending Club Data Analysis
Essay by Ashutosh Lall • December 1, 2018 • Research Paper • 2,623 Words (11 Pages) • 1,249 Views
- Dr. Soper
ISDS 577: Project Phase – 1
Presented by:
Ashutosh Lall
Gurleen Kaur
Tanisha Munshi
Wenjie Li
Contents
Introduction
A. Data
B. Research questions
C. Analytical methods
D. Figures and tables
E. References
F. Data Visualizations
Introduction
Lending Club is a peer-to-peer lending company based in the United States, in which investors provide funds for potential borrowers and earn a profit depending on the risk they take.
Lending Club enables borrowers to create unsecured personal loans between $1,000 and $40,000. The standard loan period is three years. Investors can search and browse the loan listings on the Lending Club website and select the loans they want to invest in based on the information supplied about the borrower, the loan amount, the loan grade, and the loan purpose. Investors make money from interest. Lending Club makes money by charging borrowers an origination fee and investors a service fee.
A. Data
1. Where will you get the data for your project?
The data for our project is sourced from:
https://www.kaggle.com/wendykan/lending-club-loan-data/home
These files contain complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information.
2. Are there any legal or privacy concerns associated with your group possessing these data? If so, what measures will you take to protect your data?
There are no legal or privacy concerns associated with possessing this data, since it was obtained from an open data source, i.e., Kaggle.
3. In what format(s) will your data be provided to you?
The data is available to us in CSV (comma-separated values) format. We will read this CSV file in Python and convert it to .xlsx format so that we can create visualizations in Tableau.
4. Will it be necessary to combine two or more datasets for your project? If so, how do you plan to accomplish this?
For our project, we aim to build a model that predicts whether a loan is good or bad based on various factors. Our current dataset includes complete loan information from 2007 Q1 through 2015 Q4. For prediction, we will extract the 2016 Q1 data from Lending Club’s official website and feed it to our prediction model.
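The good/bad target described above could be derived from the loan status along the following lines. This is only a minimal sketch: the set of statuses treated as bad (BAD_STATUSES) and the helper name loan_condition() are assumptions made for illustration, not the project's definition. The 1 = bad, 0 = good coding follows the loan_condition variable listed under question 6.

```python
# Statuses counted as "bad" are an ASSUMPTION for illustration only;
# the project may draw the line differently.
BAD_STATUSES = {
    "Charged Off",
    "Default",
    "Late (31-120 days)",
}


def loan_condition(loan_status: str) -> int:
    """Return 1 for a bad loan and 0 for a good loan."""
    return 1 if loan_status.strip() in BAD_STATUSES else 0


# Toy input, not taken from the real dataset.
statuses = ["Fully Paid", "Current", "Charged Off", "Late (31-120 days)"]
labels = [loan_condition(s) for s in statuses]
print(labels)
```

Keeping the rule in one small function means the same labelling can be applied unchanged to the 2007-2015 training data and to the 2016 Q1 data used for prediction.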
5. Will it be necessary to clean your data in any way? If so, what approach do you plan to use to clean your data?
To perform efficient data analysis, our data will need to be cleaned, since it has missing values and irrelevant variables. The following steps will be taken to clean the data:
-Handling missing values: Our dataset contains 75 variables, and many of them have missing values. We will therefore remove the variables with 25% or more missing data, since they will not contribute to our analysis.
-Removing irrelevant variables: After removing the variables with 25% or more missing data, we plan to walk through the remaining variables slice by slice, keeping what we need and dropping the rest. Certain variables in our dataset, like id and member_id, do not have any predictive power. Other variables, like url, emp_title, desc, and title, are descriptive in nature and will also be removed.
-Creating dummy variables and reducing categories: After removing irrelevant variables, we will create dummy variables for categorical variables such as grade, subgrade, home_ownership, purpose, initial_list_status, application_type, and emp_length_int. Because the categories of every variable except emp_length_int are well defined in our dataset, we have decided to keep all of them. For emp_length_int, we will reduce the categories to three, represented by the dummy variables if_emp_short_term and if_emp_mid_term (the remaining category serves as the baseline).
-Correlation analysis: After creating dummy variables and reducing categories, we will check the correlations between the variables and remove strongly correlated ones to get a more accurate result.
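The cleaning steps above can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions, not the project's actual code: the 0.9 correlation cutoff, the helper names, and the toy rows (including the made-up mostly_missing column) are all hypothetical. Only the 25% missing-data rule and the list of irrelevant variables come from the text.

```python
import math

MISSING_THRESHOLD = 0.25  # from the text: drop variables with >= 25% missing
CORR_THRESHOLD = 0.9      # ASSUMED cutoff for a "strong" correlation
IRRELEVANT = {"id", "member_id", "url", "emp_title", "desc", "title"}


def is_missing(value):
    return value is None or value == ""


def drop_sparse_and_irrelevant(rows):
    """Drop irrelevant columns and columns with >= 25% missing values."""
    keep = []
    for col in rows[0]:
        if col in IRRELEVANT:
            continue
        share = sum(is_missing(r[col]) for r in rows) / len(rows)
        if share < MISSING_THRESHOLD:
            keep.append(col)
    return [{c: r[c] for c in keep} for r in rows]


def add_dummies(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        out.append(new)
    return out


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(rows, numeric_cols):
    """Keep a numeric column only if it is not strongly correlated
    with a column that has already been kept."""
    kept = []
    for col in numeric_cols:
        values = [r[col] for r in rows]
        if all(abs(pearson(values, [r[k] for r in rows])) < CORR_THRESHOLD
               for k in kept):
            kept.append(col)
    dropped = set(numeric_cols) - set(kept)
    return [{k: v for k, v in r.items() if k not in dropped} for r in rows]


# Toy rows, invented for illustration (not real Lending Club records).
rows = [
    {"id": 1, "loan_amnt": 1000.0, "funded_amnt": 1000.0, "int_rate": 7.5,
     "grade": "A", "desc": "", "mostly_missing": None},
    {"id": 2, "loan_amnt": 5000.0, "funded_amnt": 5000.0, "int_rate": 15.0,
     "grade": "C", "desc": "", "mostly_missing": None},
    {"id": 3, "loan_amnt": 9000.0, "funded_amnt": 8900.0, "int_rate": 9.0,
     "grade": "B", "desc": "x", "mostly_missing": 5},
    {"id": 4, "loan_amnt": 12000.0, "funded_amnt": 12000.0, "int_rate": 11.0,
     "grade": "A", "desc": "", "mostly_missing": None},
]

cleaned = drop_sparse_and_irrelevant(rows)
cleaned = add_dummies(cleaned, "grade")
cleaned = drop_correlated(cleaned, ["loan_amnt", "funded_amnt", "int_rate"])
print(list(cleaned[0].keys()))
```

In this toy run, funded_amnt is removed because it moves almost in lockstep with loan_amnt, while int_rate is kept. Which variable of a correlated pair to keep is a judgment call; the sketch simply keeps whichever comes first in the list.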
6. Which attributes/variables will your dataset contain?
After data pre-processing, our dataset will contain the following variables:
Variable name | Description |
loan_amnt | The listed amount of the loan applied for by the borrower. |
funded_amnt_inv | The total amount committed by the investors for that loan. |
term | The number of payments on the loan (36 or 60 months). |
int_rate | Interest Rate on the loan. |
installment | The monthly payment owed by the borrower if the loan originates. |
grade | LC assigned loan grade. |
subgrade | LC assigned loan subgrade. |
home_ownership | The home ownership status provided by the borrower during registration. |
annual_inc | The self-reported annual income provided by the borrower. |
issue_d | The month in which the loan was funded. |
purpose | A category provided by the borrower for the loan request. |
addr_state | The state provided by the borrower in the loan application. |
dti | Borrower’s debt-to-income ratio. |
delinq_2yrs | Number of 30+ days past due incidences of delinquency in the borrower’s credit line for the past 2 years. |
earliest_cr_line | The month the borrower’s earliest reported credit line was opened. |
inq_last_6mnths | The number of inquiries in past 6 months. |
open_acc | The number of open accounts in the borrower’s credit line. |
pub_rec | Number of derogatory public records. |
revol_bal | Total credit revolving balance. |
revol_util | Revolving line utilization rate. |
total_acc | The total number of credit lines in the borrower’s credit file. |
initial_list_status | The initial listing status of the loan. |
collections_12_mnths_ex_med | Number of collections in 12 months excluding medical collections. |
application_type | Indicates whether the loan application is individual or joint. |
acc_now_delinq | The number of accounts in which the borrower is now delinquent. |
tot_coll_amnt | Total collection amounts ever owed. |
tot_cur_bal | Total current balance of all accounts. |
total_rev_hi_lim | Total revolving high credit/credit limit. |
loan_condition | Target variable which represents the status of the loan (1 = bad, 0 = good). |
emp_length_int | Employment length of the borrower. |
...
...