Statistic
Essay by Easoncyc • March 9, 2019 • Exam • 1,452 Words (6 Pages) • 658 Views
ETC3250 Report
Group members: Youcheng Cao, Jingying Wang, Yi Yu
Introduction:
The data set we were given is divided into two parts: a training set and a test set. It comes with a list of variables, including age, job and marital. The purpose of the project is to build a model that predicts the probability that a client will subscribe to a bank term deposit on the basis of these predictors. Our process consists of accessing the data, cleaning it, fitting it to different models, comparing the models and choosing the best one.
Methodology:
First, we analyse the data by visualising them. The graphs below show that almost all of the variables have an obvious relationship with the probability that a client will subscribe to a bank term deposit (namely,
By looking at the percentage bar chart related to
[pic 1][pic 2][pic 3][pic 4][pic 5]
Second, we need to clean the data and select the data that are actually useful for the model. The main steps are as follows.
(1) Finding the correlations between variables
This is a very important step because we cannot reach the right conclusions in the statistical analysis if we fail to do so. To check for correlated predictors, we draw a graph of the correlations between the variables. The graph shows that two of the variables are strongly correlated: when one changes, the other changes with it to a large extent. This indicates that the two variables bring largely the same information to the model, so we decide to remove one of them.
[pic 6]
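The correlation check in step (1) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the column names `a`, `b`, `c` and the helper `drop_highly_correlated` are hypothetical, and the 0.9 threshold is an arbitrary choice, not a value taken from the report.

```python
import numpy as np

def drop_highly_correlated(X, names, threshold=0.9):
    """Keep a column only if its absolute correlation with every
    column already kept is below `threshold` (assumed cut-off)."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    keep = []
    for j, name in enumerate(names):
        if all(abs(corr[j, names.index(k)]) < threshold for k in keep):
            keep.append(name)
    return keep

# Synthetic stand-in data: b is almost a rescaled copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.01, size=200)
c = rng.normal(size=200)
X = np.column_stack([a, b, c])

kept = drop_highly_correlated(X, ["a", "b", "c"])
print(kept)
```

Which member of a correlated pair to drop is a judgment call; this sketch simply keeps whichever column appears first.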
(2) Selecting data using principal component analysis and k-means
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA gives us a lower-dimensional picture of the data, a projection viewed from its most informative viewpoint, which makes it easier for us to spot outliers. We also use k-means to further select the data that are useful to us. k-means clustering partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. To choose the most appropriate value of k, we use the Davies–Bouldin index (DBI), a metric for evaluating clustering algorithms in which lower values indicate better-separated clusters. On this basis, we found that the optimum value of k is 8. In the graph below, the purple dots on the left side lie far from the PCA line, so we treat them as outliers and delete them.
[pic 7]
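A minimal sketch of step (2), assuming scikit-learn is available. The data here are synthetic (three well-separated groups), so the selected k will not match the value of 8 reported above; the candidate range of k and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three well-separated groups in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0] * 5, [10] * 5, [-10, 10, -10, 10, -10]], dtype=float)
X = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])

# Standardise, then project onto the first two principal components for plotting.
X_scaled = StandardScaler().fit_transform(X)
scores_2d = PCA(n_components=2).fit_transform(X_scaled)

# Fit k-means for each candidate k and record the Davies-Bouldin index.
db_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    db_by_k[k] = davies_bouldin_score(X_scaled, labels)

# Lower Davies-Bouldin values indicate better-separated clusters.
best_k = min(db_by_k, key=db_by_k.get)
print(best_k, db_by_k)
```

Plotting `scores_2d` coloured by the cluster labels for `best_k` gives the kind of picture used above to pick out points that sit far from the rest.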
(3) Deleting the data with abnormal attributes.
For the variable marital, some observations have the value “unknown”, which indicates missing values for that variable. There are various ways to deal with missing values: we can either fill them in or delete the affected observations. Our data set is very large, so filling the missing values with the mean or some other special value could cause significant deviation. In addition, our data set does not contain many missing values. Therefore, we simply delete the observations that contain the value “unknown”.
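Deleting the observations marked “unknown” might look like the sketch below, assuming pandas. The toy rows and the choice of which columns to check are assumptions for illustration; only the variable names age, job and marital come from the report.

```python
import pandas as pd

# Toy stand-in for the training set.
df = pd.DataFrame({
    "age": [30, 45, 52, 28],
    "marital": ["married", "unknown", "single", "divorced"],
    "job": ["admin.", "technician", "unknown", "services"],
})

# Columns in which "unknown" encodes a missing value (assumed list).
categorical = ["marital", "job"]

# Flag every row with "unknown" in at least one of those columns, then drop it.
has_unknown = df[categorical].eq("unknown").any(axis=1)
clean = df.loc[~has_unknown].reset_index(drop=True)

print(f"dropped {int(has_unknown.sum())} of {len(df)} rows")
```

Reporting how many rows are dropped makes it easy to confirm that the missing share is small before committing to deletion.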
(4) Discretization
We visualise all the data by drawing graphs so that we can find the relationships between the variables and the factor, and the correlations between the variables. For instance, we found that we can regard years 0, 4, 6 and 9 in
...
...