Friday, 15 September 2017

Data Exploration in Predictive Analysis - Romika (s4574112), Nan (4561246)

Data Exploration

In today’s technology-savvy environment, predictive analysis has paramount importance. One of my mentors advised me to spend significant time on data exploration, data manipulation and data representation in order to learn predictive analysis. All three facets are crucial parts of the analysis, and it is worth digging deeper into each of them. In this blog, I will discuss Data Exploration in detail. Data Exploration is an approach to understanding a dataset and its characteristics. These characteristics can be:
1)      Completeness
2)      Correctness
3)      Size
4)      Consistency
5)      Uniqueness
6)      Relationships with other datasets.
All of the above characteristics make a dataset ready for Data Exploration. Usually, Data Exploration is conducted through manual or automated processes such as data profiling or data visualisation, which give the analyst an initial view into the data and an understanding of its key characteristics.
Manual drill-down or filtering of the data is done to find anomalies or recognise patterns. Often, scripting languages are used to get these insights out of datasets; these could be SQL, R, Excel or other tools available in the market.
All these activities are aimed at creating a clear mental model of the data in the mind of the analyst, from which the analyst can proceed to further analysis. Here, I would like to introduce a new term: Data Quality. Data quality work is also done during exploration; it includes removing unusable parts of the data and correcting poorly formatted data.

Why is Data Exploration required?
Data Exploration has a crucial role across the analytics industry, whether in supply chain, insurance, real estate, healthcare management, banking or business organisations more generally. Data exploration is done in every part of a business.

And data is everything for organisations. Organisations set up their analytics teams to find out new facts about business entities such as:
·         Customer
·         Employees
·         Products
·         Management
·         Sales
For example, Data Analytics teams try to find out about:
Customer: customer retention, new customers, existing customers.
Employees: employee innovation programmes, employee retention, training and task allocation.
Products: product demand, product supply in the market, production.
Management: existing use of resources, better utilisation of resources.
Sales: existing sales, drops in sales, and setting sales targets.
These findings are also known as knowledge discovery from data, and they are based on data mining, SQL or R. Data Exploration is the key that enables such discoveries, whereas other techniques, such as data manipulation and data visualisation, polish the datasets.

After acknowledging the importance of data exploration, the first question that comes to mind is which tools we should use to explore data. Let’s talk in detail about existing tools. First of all, data exploration faces many challenges:
·         Largely due to the rise of big data, data is not only big but also highly diverse in terms of structure.
·         Much of the data is coming from new sources and is new to the organisation.
·         Given the size, diversity, and newness of today’s big data, you need data exploration tools built for exploring a wide range of both new big data and enterprise data.

Features of Tools: -
Search technology for exploring diverse data types: data exploration should be as easy as using Google, and search technology satisfies that requirement.
High ease of use for user productivity: the tool should be easy to use so that it conveys its message and users can be highly productive.
Short time to use, short time to business value: a data exploration capability with high ease of use enables a wide range of users to get acquainted with data quickly, yet keep digging deeper over time for new business facts and the opportunities they expose.
Query capabilities in support of data exploration: technical users, in particular, depend on query capabilities to find just the right data and to structure the result set of a query so it is ready for immediate use in analytic applications.
Support for all major data platforms, ranging from relational databases to Hadoop: a modern data exploration tool needs to go where the data lives. Given the expanded range of data types now in play in analytics, it takes multiple platforms to manage diverse data in its original form.
Tools Available in Market
·         Trifacta – a data preparation and analysis platform
·         Paxata – self-service data preparation software
·         Alteryx – data blending and advanced data analytics software
·         IBM InfoSphere Information Analyzer – a data profiling tool
·         Microsoft Power BI - interactive visualization and data analysis tool
·         OpenRefine - a standalone open source desktop application for data clean-up and data transformation.
·         Tableau software – interactive data visualization software.

While we are discussing Data Exploration, predictive analysis comes to mind as well. How is predictive analysis different from data exploration?

Predictive analysis is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between or among variables, and then to validate the findings by applying the detected patterns to a new subset of the data. The goal of predictive analysis is prediction, or finding meaningful information in the discovered patterns. This process involves three stages:
1)   Data Exploration: in this initial stage, data is explored from various sources, be it relational databases, structured data, unstructured data, weblogs or big data.
2)   Model Building: in this stage, patterns are recognised, systematic relationships among the variables are detected, and the accuracy of the results is assessed after applying methods such as regression, correlation, standard deviation, mean, median or mode.
3)   Deployment: in this last stage, after a meaningful pattern or a clear picture has been detected, predictions are made based on the findings of stage two.


From the above description, we can simply differentiate Data Exploration and predictive analysis. Let’s take another example and comprehend it in more detail.
In the data exploration phase, you go through the dataset at hand and understand the different types of variables, try to identify any trends in the data, understand missing values and outlier values, understand the distribution of the data, and so on. Classification can be done during the data exploration stage; however, it would be a very basic one: you would use the values of each variable independently and bucket them to classify the data points.
To differentiate data exploration and predictive analysis, let’s say that in a KFC restaurant, managers want to find out sales in a certain area for a certain age group. To get that data, the analytics team will explore the data using validated and verified variables from the given dataset. The variables would be:
1)      MemberID
2)      Age group
3)      Ordered Meal
4)      Total Price
5)      Postcode
6)      Date

Out of this dataset, the analytics team will do exploration and shrink the dataset down by region and age group. They will perform pattern recognition and look for correlations among the given variables using statistical and mathematical techniques. After the dataset has been shrunk down, predictive analysis comes to a conclusion; based on this analysis, analysts can forecast or predict the sales of a certain meal, for a certain region and age group, at a certain time of the year.
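As a rough illustration of this shrink-down, here is a minimal Python sketch over a tiny made-up set of order records (all member IDs, prices, postcodes and dates are hypothetical, not real KFC data):

```python
from statistics import mean

# Hypothetical order records: (member_id, age_group, meal, total_price, postcode, date)
orders = [
    (1, "18-25", "Zinger Box",   12.95, "3046", "2017-06-03"),
    (2, "26-35", "Family Feast", 34.95, "3046", "2017-06-10"),
    (3, "18-25", "Zinger Box",   12.95, "3056", "2017-06-17"),
    (4, "18-25", "Wicked Wings", 10.95, "3046", "2017-07-01"),
]

# Exploration: shrink the dataset down to one region and one age group
subset = [o for o in orders if o[4] == "3046" and o[1] == "18-25"]

# Simple aggregates the analytics team might compute on the subset
total_sales = sum(o[3] for o in subset)
avg_order = mean(o[3] for o in subset)
print(round(total_sales, 2), round(avg_order, 2))  # 23.9 11.95
```

A real team would of course do this over a database with SQL or R rather than Python lists, but the filtering-then-aggregating shape of the work is the same.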
Here are some key points about Data Exploration based on the above scenario:
1. Data Exploration is about describing the data through statistical and visualisation techniques. We explore data to bring important aspects of it into focus for further analysis.
2. Data Exploration is essentially the first step in any data analysis.
3. Data Exploration is a prerequisite for predictive analytics.
Data Exploration Using R
R is an emerging language and is widely used for handling big data. For any predictive model, it is necessary to use some language to get accurate values, and R is a preferred language for providing insights into data during data exploration. I am going to present some hypothetical examples (the rx* functions below come from the RevoScaleR package) to give an idea of the language.
Function to Import Data
rxImport(inData = spssFileName, outFile = xdfFileName, overwrite = TRUE, reportProgress = 0)
Function for Summarizing
rxSummary(~score:ageGroup, data=DF,transforms = list(ageGroup = cut(age, seq(0, 30, 10))))
Function for Cube
rxCube(score~ageGroup, data=DF, transforms=list(ageGroup = cut(age, seq(0,30,10))))
Function for Importing Text Data
RxTextData(file, stringsAsFactors = FALSE, colClasses = NULL, colInfo = NULL)
Function for Histogram
## Numeric Variables
rxHistogram(~ Close, data = djiXdf)
## Categorical Variable:
rxHistogram(~ DayOfWeek, data = djiXdf)
## Panel
rxHistogram(~ Close | DayOfWeek, data = djiXdf)
## Numeric Variables with a frequency weighting:
rxHistogram(~ Close, data = djiXdf, fweights = "Volume")

Techniques used to explore data
1. Steps of Data Exploration and Preparation
2. Missing Value Treatment
§  Why is missing value treatment required?
§  Why does data have missing values?
§  Which methods treat missing values?
3. Techniques of Outlier Detection and Treatment
§  What is an outlier?
§  What are the types of outliers?
§  What are the causes of outliers?
§  What is the impact of outliers on a dataset?
§  How do we detect outliers?
§  How do we remove outliers?

Steps of Data Exploration and Preparation
This is the very first technique and includes the collection and cleansing of data; 70% of the job is to collect the data and make it uniform. In this stage, we usually do three analyses:
·         Variable analysis: the predictor (input) and target (output) variables are decided.
·         Univariate analysis: categorical and continuous variables are analysed one at a time.
·         Bi-variate analysis: analysis is performed based on association and disassociation among different variables.
After these preliminary analyses, the training dataset is ready. On this training dataset we now perform further treatments, such as missing value and outlier treatment.
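As a small sketch of the univariate and bi-variate steps, using made-up age and spend values and Python's standard statistics module (the Pearson correlation is computed by hand so nothing beyond the basics is assumed):

```python
from statistics import mean, median, stdev

# Hypothetical continuous variables: customer age and spend per visit
age   = [22, 25, 31, 38, 44, 52]
spend = [15, 18, 22, 30, 33, 41]

# Univariate analysis: summarise each variable on its own
age_mean, age_median, age_sd = mean(age), median(age), stdev(age)
print(round(age_mean, 2), age_median, round(age_sd, 2))

# Bi-variate analysis: Pearson correlation measures the association
n = len(age)
cov = sum((x - mean(age)) * (y - mean(spend)) for x, y in zip(age, spend)) / (n - 1)
r = cov / (stdev(age) * stdev(spend))
print(round(r, 3))  # close to 1: strong positive association
```

A correlation near +1 or -1 suggests a strong linear association worth carrying forward into model building; values near 0 suggest the pair is uninteresting for a linear model.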
Missing value treatment
Missing data in the training data can reduce the accuracy of a model and lead to biased predictions. So, missing value treatment is vital in the early stages of data exploration.
Table with missing information

Name    Gender    Postcode    Funds
Steve   M         3046        Y
Rose    F         3056        Y
Mary              3046        N
James             3023        Y
Pete    M         3056        Y

Result from missing values

Gender    Members    %Funds
F         1          100
M         2          100

Table with complete information

Name    Gender    Postcode    Funds
Steve   M         3046        Y
Rose    F         3056        Y
Mary    F         3046        N
James   M         3023        Y
Pete    M         3056        Y

Result from complete values

Gender    Members    %Funds
F         2          50
M         3          100


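The two result tables above can be reproduced with a short sketch (the records mirror the tables; the helper `funds_pct` is a hypothetical name introduced here for illustration):

```python
# Records as in the tables above: (name, gender, funds); None marks a missing gender
people = [
    ("Steve", "M", "Y"),
    ("Rose",  "F", "Y"),
    ("Mary",  None, "N"),
    ("James", None, "Y"),
    ("Pete",  "M", "Y"),
]

def funds_pct(records, gender):
    """Return (group size, percentage of the group with funds) for one gender."""
    group = [r for r in records if r[1] == gender]
    funded = sum(1 for r in group if r[2] == "Y")
    return len(group), round(100 * funded / len(group))

# With genders missing, both groups look 100% funded
print(funds_pct(people, "F"), funds_pct(people, "M"))  # (1, 100) (2, 100)

# With the complete data (Mary is F, James is M) the picture changes
complete = [("Mary", "F", "N") if r[0] == "Mary"
            else ("James", "M", "Y") if r[0] == "James" else r
            for r in people]
print(funds_pct(complete, "F"), funds_pct(complete, "M"))  # (2, 50) (3, 100)
```

The female funding rate drops from 100% to 50% once the missing genders are filled in, which is exactly the bias the tables illustrate.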
From the above example, we can clearly see that data with missing values cannot yield accurate results. But why do we get data with missing values, nulls and empty columns? Let’s discuss it in detail.
Reasons for missing values
We have seen the significance of treating missing values in datasets. Missing values may arise at two stages: data extraction and data collection.
Data Extraction: it is quite possible that we get missing values during data extraction, so we should always check the data at this stage for anomalies and compare it with the data at the source level. At this stage, we can easily detect and rectify the problems.
Data Collection: errors or anomalies that occur at data collection time are difficult to rectify. Errors at this stage are usually categorized into four types:
·         Missing completely at random: each observation in the dataset has an equal chance of having a missing value.
·         Missing at random: the missing ratio varies for each variable. For example, there may be one ratio of missing values for female records and a different ratio for male records. In this scenario, the results can be misleading and the issue is difficult to fix.
·         Missing values in dependent variables: this is the case when the predictors have all their values but the dependent variable is missing, so we cannot find an accurate result. This happens frequently in healthcare datasets. Because of such situations no analysis can be done and, in short, the data exploration is inaccurate.
·         Missing that depends on the missing value itself: this is the case when the probability of a missing value is directly correlated with the value itself. For example, people with higher or lower incomes are likely not to respond about their earnings.
Methods to treat missing values
Deletion: this includes listwise deletion and pairwise deletion.
Listwise deletion

Name    Gender    Postcode    Funds
Steve   M         3046        Y
Rose    F         3056        Y
Mary              3046        N
James             3023        Y
Pete    M         3056        Y

In listwise deletion, every row containing any missing value (here Mary and James) is dropped entirely before analysis.

Pairwise deletion

Name    Gender    Postcode    Funds
Steve   M         3046        Y
Rose    F         3056
Mary              3046        N
James             3023        Y
        M         3056        Y

In pairwise deletion, only the missing cells are excluded from the calculations that need them; the remaining values in those rows are still used.




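The difference between the two deletion strategies can be sketched as follows, using records shaped like the tables above (None marks a missing value; this Python fragment is an illustration, not part of the original dataset):

```python
# Hypothetical records: (name, gender, postcode, funds); None marks a missing value
rows = [
    ("Steve", "M", 3046, "Y"),
    ("Rose",  "F", 3056, None),
    ("Mary",  None, 3046, "N"),
    ("James", None, 3023, "Y"),
    ("Pete",  "M", 3056, "Y"),
]

# Listwise deletion: drop every row that has any missing value
listwise = [r for r in rows if None not in r]
print([r[0] for r in listwise])  # ['Steve', 'Pete']

# Pairwise deletion: drop rows only for the variables a given analysis needs,
# so a postcode/funds analysis can still use four of the five rows
postcode_funds = [(r[2], r[3]) for r in rows if r[2] is not None and r[3] is not None]
print(len(postcode_funds))  # 4
```

Pairwise deletion keeps more of the data for each individual calculation, at the cost of different calculations being based on different subsets of rows.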
Prediction model: this is the most widely used and most sophisticated approach to handling missing values. In this approach, a model is built on the non-missing data and the closest predicted value is filled in for each missing value. It is based on probability.
  1. KNN Imputation: in this method, the missing value is imputed using similar attribute values. The similarity of two records is determined using a distance function. The method has certain advantages and disadvantages.
    • Pros:
      • k-nearest neighbour can predict both qualitative and quantitative attributes
      • A separate predictive model for each attribute with missing data is not required
      • Attributes with multiple missing values can be easily treated
      • The correlation structure of the data is taken into consideration
    • Cons:
      • The KNN algorithm is very time-consuming on large databases, as it searches the whole dataset for the most similar instances.
      • The choice of k is critical: a higher value of k includes records that are significantly different from what we need, whereas a lower value of k may miss significant records.
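A minimal sketch of KNN imputation, assuming a single numeric attribute and using absolute difference as the distance function (all records and values are made up):

```python
# Hypothetical known records as (age, income); we impute income for age 24
known = [(22, 30.0), (25, 32.0), (40, 55.0), (45, 60.0)]
query_age = 24
k = 2

# Distance function: absolute age difference (one-dimensional Euclidean)
neighbours = sorted(known, key=lambda r: abs(r[0] - query_age))[:k]

# Impute the missing income as the mean of the k nearest incomes
imputed = sum(r[1] for r in neighbours) / k
print(imputed)  # 31.0
```

With more attributes, the distance function would combine them (e.g. Euclidean distance over scaled features), which is also where the cost of scanning the whole dataset for each missing value comes from.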

Outlier detection and handling
Outlier is a term commonly used by analysts and refers to data points that differ markedly from the rest of the dataset. In simple words, if a dataset states the income of a certain group and two people’s incomes are far too high, those values may lead to inaccurate results. In this situation, the way to get an accurate result for the whole dataset is to remove those two incomes from it.
Reasons for Outliers
Each time we come across outliers, we should find out why they appear in the dataset; the cause of an outlier determines how to confront it. We broadly classify the causes into two categories:

Ø  Non-natural or artificial
Ø  Natural
Data entry errors: human errors are very common during data collection. For example, if a person entering someone’s income mistakenly adds an extra zero, it can make a huge difference to accuracy and results.
Measurement errors: another type of error concerns measurement. For instance, a measurement taken on a faulty machine can lead to inaccurate results.
Experimental errors: while experimenting with new things, people may record bizarre values that can also lead to imprecise results.
Natural outliers: when a few values are completely different from the rest but still genuinely belong to the dataset, they are known as natural outliers.
How to Remove Outliers
The simple idea behind removing outliers is to treat them as a separate segment, so that they do not skew the analysis of the whole dataset.
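One common way to separate that segment is the interquartile-range rule; this is only a sketch on made-up incomes (in thousands), not the only approach:

```python
from statistics import quantiles

incomes = [42, 45, 47, 50, 52, 55, 58, 60, 400, 520]  # two suspiciously high values

# Quartiles and the 1.5*IQR "fences"
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Treat points outside the fences as a separate segment
outliers = [x for x in incomes if x < low or x > high]
clean    = [x for x in incomes if low <= x <= high]
print(outliers)  # [400, 520]
```

Here the two extreme incomes end up in their own segment and the remaining values can be analysed without being skewed; whether the segment is then dropped, capped or analysed separately depends on why the outliers arose.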

In synopsis, Data Exploration is the initial step in finding insights in data; to achieve this goal we use R, and the techniques above are quite helpful.

Submitted To:            Dr. Shah, Victoria University, Melbourne.
Submitted By:            Romika Mehta (s4574112), Nan Wu (4561246)

Assignment 2(A):     Predictive Analysis - Semester 2, 2017
