Data preparation is an essential activity in data analysis. Several statistical methods are commonly used in data analysis, including linear models, logistic regression, k-means clustering and decision trees. In this blog we explain data preparation and each of these statistical methods.
Data preparation
Data preparation, also known as data pre-processing, is the manipulation of data into a form suitable for further analysis and processing. Many different tasks are involved in this process, and they cannot be fully automated. Most data preparation activities are tedious, routine and time consuming; by some estimates, around 60% to 80% of the time spent on a data mining project goes into data preparation. Data preparation is essential for the success of a data mining project, since it improves both the quality of the data and the quality of the data mining results. The process involves several steps, such as logging the data in, checking the data for accuracy, entering the data into the computer and transforming the data.
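As a minimal sketch of what these steps can look like in R (the data frame, values and column names below are purely illustrative assumptions):

```r
# Hypothetical raw data: the values and column names are invented for illustration
survey <- data.frame(
  age    = c(25, 31, NA, 47, 132),                         # 132 is an obvious entry error
  income = c("52000", "61000", "48000", "75000", "58000")  # stored as text
)

# Check the data for accuracy: flag impossible ages as missing
survey$age[!is.na(survey$age) & survey$age > 120] <- NA

# Transform the data: convert income from text to numbers
survey$income <- as.numeric(as.character(survey$income))

# Keep only complete records for further analysis
survey_clean <- na.omit(survey)
```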
Linear model
A linear model, or linear regression, is a statistical approach for modelling the relationship between a scalar dependent variable and one or more explanatory variables. There are two cases of linear regression:
- simple linear regression, where there is one explanatory variable
- multiple linear regression, where there are two or more explanatory variables.
In linear regression, the data are modelled using linear predictor functions, and the unknown model parameters are estimated from the data; such models are called linear models.
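As a small, hedged sketch in base R (the sales and advertising variables are simulated purely for illustration), a simple linear regression can be fitted with lm():

```r
# Simulated data for a simple linear regression (one explanatory variable)
set.seed(1)
advertising <- runif(100, min = 0, max = 50)
sales <- 20 + 3 * advertising + rnorm(100, sd = 5)
df <- data.frame(sales, advertising)

# Fit the linear model: sales as a linear function of advertising
fit <- lm(sales ~ advertising, data = df)

# The estimated intercept and slope (the unknown model parameters)
coef(fit)
summary(fit)   # standard errors, R-squared and significance tests
```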
Predictions
To generate predictions and residuals from a fitted model, we use the rxPredict() function, which works on various types of models. Its key arguments include the fitted model object (modelObject), the data set to score (data) and an optional output data set (outData), as sketched below.
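A minimal sketch, assuming the RevoScaleR package is available (it ships with Microsoft R Client / Machine Learning Server); the rxLinMod() model and the toy data are our own additions, used only so there is a model to score:

```r
library(RevoScaleR)   # provides rxLinMod() and rxPredict()

# Toy data and a linear model to score (illustrative only)
df <- data.frame(x = 1:20)
df$y <- 2 * df$x + rnorm(20)
model <- rxLinMod(y ~ x, data = df)

# Generate predictions and residuals from the fitted model:
#   modelObject      - the fitted model
#   data             - the data set to score
#   computeResiduals - also return residuals alongside the predictions
pred <- rxPredict(modelObject = model, data = df, computeResiduals = TRUE)
head(pred)
```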
Logistic regression
Next we move on to logistic regression, also known as the logit model. It is used to model binary outcome variables: the log odds of the outcome are expressed as a linear combination of the predictor variables. To inspect the fitted model in R, we can use the summary command on the logit model:
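A minimal sketch using base R's glm() (the pass/fail data below are simulated purely for illustration):

```r
# Simulated binary-outcome data: did a student pass, given hours studied?
set.seed(2)
hours  <- runif(200, min = 0, max = 10)
passed <- rbinom(200, size = 1, prob = plogis(-2 + 0.5 * hours))
df <- data.frame(hours, passed)

# Fit the logit model: log odds of passing as a linear function of hours
logit_fit <- glm(passed ~ hours, data = df, family = binomial(link = "logit"))

# Extract the output: coefficients, standard errors, z values and p-values
summary(logit_fit)
```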
Logit models need larger sample sizes because they rely on maximum likelihood estimation. In some situations where there are only very few cases, models for dichotomous outcomes can instead be estimated with exact logistic regression. It is also hard to estimate a logit model when the outcome is rare, and this does not depend on the size of the data set.
k-means
Now we move to k-means clustering, a popular cluster analysis method in data analysis. Using the rxKmeans() function, we can easily run k-means on a data set and build a good visualisation of the clusters.
The first thing to know about the rxKmeans function is its syntax, which includes arguments such as formula, data, outFile and numClusters. The core arguments are explained below, followed by a short sketch of how the function is called.
- formula - specifies the variables supplied to the clustering algorithm
- data - the data set in which the variables in the formula are found
- outFile - the data set to which the cluster IDs are written
- numClusters - the number of clusters (k) to estimate
- algorithm - an additional argument that controls how the k-means computation is carried out
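A minimal sketch, again assuming the RevoScaleR package is available; the choice of the built-in iris data and of two sepal measurements is ours, purely for illustration:

```r
library(RevoScaleR)   # provides rxKmeans()

# Cluster the built-in iris data on two numeric variables into k = 3 groups
clusters <- rxKmeans(~ Sepal.Length + Sepal.Width,
                     data = iris,
                     numClusters = 3)

# Print the fitted clustering result (cluster centres and sizes)
clusters
```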
Decision tree
Another method widely used in data mining is the decision tree. It is a critical tool that helps analysts work with large data sets, and it also makes it easy to build clear visualisations of the decision rules used to predict a categorical or continuous outcome. Below is a brief sketch of the syntax used to grow a tree:
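The post does not name a specific tree function, so the sketch below assumes RevoScaleR's rxDTree(), consistent with the other rx* functions used above (base R's rpart() uses a very similar formula interface); the iris example is ours:

```r
library(RevoScaleR)   # provides rxDTree()

# Grow a classification tree predicting the categorical outcome Species
tree <- rxDTree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris,
                maxDepth = 3)   # limit the depth to keep the decision rules readable

# Inspect the decision rules of the fitted tree
tree
```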
We would like to thank our mentor, Dr Shah Jahan Miah, for the guidance and support provided during these activities, which motivated us to compile our ideas and experience in this blog.
Authors:
Aleksandar Jankulov s4518571
Komal Bhalla s4531062
Haoyang Di s3812986