Lending Club Data Analysis

Essay by   •  December 1, 2018  •  Research Paper  •  2,623 Words (11 Pages)  •  1,561 Views

Lending Club Data Analysis

-    Dr. Soper

ISDS 577: Project Phase – 1

Presented by:

Ashutosh Lall

Gurleen Kaur

Tanisha Munshi

Wenjie Li

Contents

Introduction

A. Data

B. Research questions

C. Analytical methods

D. Figures and tables

E. References

F. Data Visualizations

Introduction

Lending Club is a peer-to-peer lending company based in the United States, in which investors provide funds to borrowers and earn a return that depends on the risk they take.

Lending Club enables borrowers to take out unsecured personal loans between $1,000 and $40,000. The standard loan period is three years. Investors can browse the loan listings on the Lending Club website and select the loans they want to invest in, based on the information supplied about the borrower, the loan amount, the loan grade, and the loan purpose. Investors make money from interest. Lending Club makes money by charging borrowers an origination fee and investors a service fee.

A. Data

  1. Where will you get the data for your project?

The data for our project is sourced from:

https://www.kaggle.com/wendykan/lending-club-loan-data/home

These files contain complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information.

2.  Are there any legal or privacy concerns associated with your group possessing these data? If so, what measures will you take to protect your data?

There are no legal or privacy concerns associated with possessing this data, since it was obtained from an open data source (Kaggle).

3.  In what format(s) will your data be provided to you?

The data is available to us in CSV (comma-separated values) format. We will read this CSV file into Python for analysis and convert it to .xlsx format so that Tableau can be used to create visualizations.
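The conversion described above could be sketched roughly as follows, assuming pandas is used for the Python step. The helper names and the file names in the usage note are hypothetical, and writing .xlsx from pandas additionally assumes that an Excel engine such as openpyxl is installed. The size check is there because a single .xlsx worksheet cannot hold more than 1,048,576 rows or 16,384 columns.

```python
import pandas as pd

# Hard limits of a single .xlsx worksheet.
XLSX_MAX_ROWS = 1_048_576  # the header row counts toward this limit
XLSX_MAX_COLS = 16_384


def fits_in_xlsx(df: pd.DataFrame) -> bool:
    """Return True if the frame plus one header row fits in one worksheet."""
    n_rows, n_cols = df.shape
    return n_rows + 1 <= XLSX_MAX_ROWS and n_cols <= XLSX_MAX_COLS


def csv_to_xlsx(csv_path: str, xlsx_path: str) -> pd.DataFrame:
    """Read the loan CSV and write an .xlsx copy for use in Tableau."""
    df = pd.read_csv(csv_path, low_memory=False)
    if not fits_in_xlsx(df):
        raise ValueError("data exceeds the size limits of one .xlsx worksheet")
    # Requires an Excel writer engine (for example openpyxl) to be installed.
    df.to_excel(xlsx_path, index=False)
    return df
```

A call might look like `loans = csv_to_xlsx("loan.csv", "loan.xlsx")`, where both file names are placeholders.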

4.  Will it be necessary to combine two or more datasets for your project? If so, how do you plan to accomplish this?

For our project, we aim to build a model that predicts whether a loan is good or bad based on various factors. Our current dataset includes complete loan information from 2007 Q1 through 2015 Q4. For prediction, we will extract the 2016 Q1 data from Lending Club’s official website and feed it to our prediction model.
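One practical detail when feeding a later quarter to a model built on earlier data is that both feature tables need the same columns in the same order. A minimal sketch of that alignment, assuming pandas and using small made-up frames in place of the real 2007-2015 and 2016 Q1 data (the function name and the example columns are hypothetical):

```python
import pandas as pd


def align_to_training(train_X: pd.DataFrame, new_X: pd.DataFrame) -> pd.DataFrame:
    """Give the new data exactly the training columns, in the same order.

    Columns missing from the new data (for example a dummy variable for a
    category that does not occur in the new quarter) are filled with 0.
    Columns that the training data does not have are dropped.
    """
    return new_X.reindex(columns=train_X.columns, fill_value=0)


# Made-up stand-ins for the training features and the new quarter.
train_X = pd.DataFrame({"loan_amnt": [5000, 12000],
                        "grade_B": [1, 0],
                        "grade_C": [0, 1]})
new_X = pd.DataFrame({"grade_B": [1], "loan_amnt": [8000], "extra_col": [3]})

aligned = align_to_training(train_X, new_X)
print(list(aligned.columns))
```

The aligned frame can then be passed to whatever model was fitted on `train_X`; the choice of model is left open here.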

5.   Will it be necessary to clean your data in any way? If so, what approach do you plan to use to clean your data?

To perform efficient data analysis, our data will need to be cleaned, since it has missing values and irrelevant variables. The following steps will be taken to clean the data:

-Handling missing values: Our dataset contains 75 variables, and many of them have missing values. We will therefore remove the variables for which 25% or more of the data is missing, since they are too incomplete to contribute reliably to our analysis.

-Removing irrelevant variables: After removing the variables with 25% or more missing data, we plan to walk through the remaining variables group by group, keeping what we need and dropping the rest. Certain variables in our dataset, such as id and member_id, have no predictive power. Other variables that are descriptive in nature, such as url, emp_title, desc, and title, will also be removed.

-Creating dummy variables and reducing categories: After removing irrelevant variables, we will create dummy variables for categorical variables such as grade, sub_grade, home_ownership, purpose, initial_list_status, application_type, and emp_length_int. Because the categories of every variable except emp_length_int are well defined in our dataset, we will keep all of them. For emp_length_int, we will reduce the categories to three, with dummy variables such as if_emp_short_term and if_emp_mid_term.

-Correlation analysis: After creating dummy variables and reducing categories, we will check the correlation between the variables and, for each pair of strongly correlated variables, remove one of them to reduce redundancy and obtain more reliable results.
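The four cleaning steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the project's actual code: it assumes pandas and NumPy, the function names are hypothetical, the employment-length cut points (under 3 years and 3 to under 7 years) and the 0.9 correlation threshold are placeholders that the text does not specify, and the column names are taken from the lists above.

```python
import numpy as np
import pandas as pd

IRRELEVANT = ["id", "member_id", "url", "emp_title", "desc", "title"]
CATEGORICAL = ["grade", "sub_grade", "home_ownership", "purpose",
               "initial_list_status", "application_type"]


def drop_sparse_columns(df, max_missing=0.25):
    """Drop every column whose share of missing values is max_missing or more."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share < max_missing]


def add_emp_length_dummies(df, short_max=3, mid_max=7):
    """Replace emp_length_int with two dummy variables.

    The cut points are assumptions. Rows that fall in neither bucket
    (longer employment, or a missing value) get 0 in both dummies.
    """
    out = df.copy()
    years = out.pop("emp_length_int")
    out["if_emp_short_term"] = (years < short_max).astype(int)
    out["if_emp_mid_term"] = ((years >= short_max) & (years < mid_max)).astype(int)
    return out


def drop_correlated(df, threshold=0.9, target="loan_condition"):
    """For each strongly correlated pair of features, drop the later column."""
    features = df.select_dtypes(include="number").drop(columns=[target],
                                                       errors="ignore")
    corr = features.corr().abs()
    # Keep only the upper triangle so that each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def clean_loans(df, target="loan_condition"):
    """Apply the four cleaning steps in the order described above."""
    df = drop_sparse_columns(df)
    df = df.drop(columns=IRRELEVANT, errors="ignore")
    if "emp_length_int" in df.columns:
        df = add_emp_length_dummies(df)
    present = [c for c in CATEGORICAL if c in df.columns]
    df = pd.get_dummies(df, columns=present, dtype=int)
    return drop_correlated(df, target=target)


# Tiny made-up frame to exercise the pipeline.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "desc": [None, None, None, "x"],       # 75% missing
    "loan_amnt": [1000, 2000, 3000, 4000],
    "installment": [100, 200, 300, 400],   # perfectly correlated with loan_amnt
    "grade": ["A", "B", "A", "B"],
    "emp_length_int": [1.0, 5.0, 10.0, 2.0],
    "loan_condition": [0, 1, 0, 1],
})
clean = clean_loans(raw)
print(list(clean.columns))
```

In this toy example the correlation step also removes one of the two grade dummies, because keeping a dummy for every category of a two-category variable makes the pair perfectly (negatively) correlated. The target column is excluded from the correlation check so that it is never dropped.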

6.   Which attributes/variables will your dataset contain?

After data pre-processing, our dataset will contain the following variables:

Variable name                Description
loan_amnt                    The listed amount of the loan applied for by the borrower.
funded_amnt_inv              The total amount committed by the investors for that loan.
term                         The number of payments on the loan (36 or 60 months).
int_rate                     Interest rate on the loan.
installment                  The monthly payment owed by the borrower if the loan originates.
grade                        LC assigned loan grade.
subgrade                     LC assigned loan subgrade.
home_ownership               The home ownership status provided by the borrower during registration.
annual_inc                   The self-reported annual income provided by the borrower.
issue_d                      The month in which the loan was funded.
purpose                      A category provided by the borrower for the loan request.
addr_state                   The state provided by the borrower in the loan application.
dti                          Borrower’s debt-to-income ratio.
delinq_2yrs                  Number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years.
earliest_cr_line             The month the borrower’s earliest reported credit line was opened.
inq_last_6mnths              The number of inquiries in the past 6 months.
open_acc                     The number of open credit lines in the borrower’s credit file.
pub_rec                      Number of derogatory public records.
revol_bal                    Total credit revolving balance.
revol_util                   Revolving line utilization rate.
total_acc                    The total number of credit lines in the borrower’s credit file.
initial_list_status          The initial listing status of the loan.
collections_12_mnths_ex_med  Number of collections in 12 months excluding medical collections.
application_type             Indicates whether the loan application is individual or joint.
acc_now_delinq               The number of accounts on which the borrower is now delinquent.
tot_coll_amnt                Total collection amounts ever owed.
tot_cur_bal                  Total current balance of all accounts.
total_rev_hi_lim             Total revolving high credit/credit limit.
loan_condition               Target variable which represents the status of the loan (1 = bad, 0 = good).
emp_length_int               Employment length of the borrower.

...

...
