Data Exploration
In today’s technology-savvy environment, predictive analysis is of paramount importance. One of my mentors advised me to spend significant time on data exploration, data manipulation and data representation in order to learn predictive analysis. All three facets are crucial parts of analysis, and it is worth digging deeper into each of them.
In this blog, I will discuss Data Exploration in detail. Data Exploration is an approach to understanding datasets and their characteristics. These characteristics can be:
1) Completeness
2) Correctness
3) Size
4) Consistency
5) Uniqueness
6) Relationships with other files
All of the above characteristics make data ready for Data Exploration. Usually, Data Exploration is conducted through manual or automated processes such as data profiling or data visualisation, to give the analyst an initial view of the data and an understanding of its key characteristics.
Manual drill-down or filtering of the data is carried out to find anomalies or recognise patterns. Often, scripting languages or tools such as SQL, R or Excel are used to extract these insights from datasets.
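As a rough illustration of this kind of drill-down, here is a small base-R sketch that filters and aggregates a made-up sales table to surface an obviously unusual value; the data frame and its column names are purely hypothetical.

# Hypothetical sales data used only for illustration
sales <- data.frame(
  region = c("North", "South", "North", "East", "South"),
  amount = c(120, 150, 8000, 135, 140)   # 8000 looks anomalous
)

# Quick overview of the distribution
summary(sales$amount)

# Drill down: rows whose amount is far above the median
subset(sales, amount > 10 * median(amount))

# Pattern check: average amount per region
aggregate(amount ~ region, data = sales, FUN = mean)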
All these activities are aimed at creating a clear mental model and understanding of the data in the analyst's mind, from which the analyst can move on to further analysis. Here I would like to introduce a new term: Data Quality. Data quality work is also done during exploration; it includes removing unusable parts of the data and correcting poorly formatted data.
Why is Data Exploration required?
Data Exploration plays a crucial role in the analytics industry, whether it is the supply chain market, insurance market, real estate market, healthcare management, banking sector, business organisations or many more. Data exploration is done in every part of a business.
And data is everything for organisations. Organisations are stepping up their analytics teams and trying to find out new facts about business entities such as:
· Customers
· Employees
· Products
· Management
· Sales
For example, data analytics teams try to find out about:
Customers: customer retention, new customers, existing customers.
Employees: employee innovation programmes, employee retention, training and task allocation.
Products: product demand, product supply in the market, production.
Management: existing use of resources, better utilisation of resources.
Sales: existing sales, drops in sales, and setting sales targets.
These findings are also known as knowledge discovery from data, and they are based on data mining, SQL or R. Data Exploration is the key to enabling such discoveries, whereas other techniques, such as data manipulation and data visualisation, are there to polish the datasets.
After acknowledging the importance of data exploration, the first question that comes to mind is which tools to use to explore data. Before discussing the existing tools in detail, note that data exploration faces several challenges:
· Largely due to the rise of big data, data is not only big but also highly diverse in terms of structure.
· Data is increasingly coming from new sources.
· Given the size, diversity and newness of today's big data, you need data exploration tools built for exploring a wide range of both new big data and enterprise data.
Features of Tools: -
Search technology for exploring diverse data types: data exploration should be as easy as a Google search, and search technology satisfies this requirement.
High ease of use for user productivity: the tool should be easy to use so that it conveys its message clearly and the user can be highly productive.
Short time to use, short time to business value: a data exploration capability with high ease of use provides a wide range of opportunities to get acquainted with data quickly, yet keep digging deeper over time for new business facts and the opportunities they expose.
Query capabilities in support of data exploration: technical users, in particular, depend on query capabilities to find just the right data and to structure the result set of a query so it is ready for immediate use in analytic applications.
Support for all major data platforms, ranging from relational databases to Hadoop: a modern data exploration tool needs to go where the data lives. Given the expanded range of data types now in play in analytics, it takes multiple platforms to manage diverse data in its original form.
Tools Available in the Market
· IBM InfoSphere Information Analyzer – a data profiling tool
· OpenRefine – a standalone open-source desktop application for data clean-up and data transformation.
While we are discussing Data Exploration, predictive analysis also comes to mind. How is predictive analysis different from data exploration?
Predictive analysis is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between or among variables, and then to validate the findings by applying the detected patterns to a new subset of the data. The goal of predictive analysis is prediction, or extracting meaningful information from the patterns found. This process involves three stages (a small R sketch illustrating them follows the list):
1) Data Exploration: in this initial stage, data is explored from various sources, be it a relational database, structured or unstructured data, weblogs or big data.
2) Model Building: in this stage, patterns are recognised, the systematic relationships among variables are detected, and the accuracy of the results is assessed after applying methods such as regression, correlation, standard deviation, mean, median or mode.
3) Deployment: in this last stage, after a meaningful pattern or a clearer picture has been detected, predictions are made for certain aspects based on the findings of stage two.
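To make the three stages a little more concrete, here is a minimal base-R sketch: a made-up dataset is explored, a simple linear regression is built, and the fitted model is then "deployed" to predict on new records. The variables and the choice of lm() are illustrative assumptions, not a prescribed method.

set.seed(42)

# Stage 1: Data Exploration - a small hypothetical dataset
df <- data.frame(age = sample(18:60, 100, replace = TRUE))
df$spend <- 20 + 1.5 * df$age + rnorm(100, sd = 10)
summary(df)              # ranges and distributions
plot(df$age, df$spend)   # visual check for a trend

# Stage 2: Model Building - fit and assess a simple regression
model <- lm(spend ~ age, data = df)
summary(model)           # coefficients and accuracy of the fit

# Stage 3: Deployment - predict for new, unseen records
new_customers <- data.frame(age = c(25, 40, 55))
predict(model, newdata = new_customers)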
From the above scenario, we can easily differentiate Data Exploration from predictive analysis. Let's take another example and comprehend it in more detail.
In the data exploration phase, you go through the dataset at hand and understand the different types of variables, try to identify any trends in the data, understand missing and outlier values, understand the distribution of the data, and so on. Classification can be done during the data exploration stage, but it would be a very rudimentary one: you would use the values of each variable independently and bucket them to classify the data points.
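The sketch below shows, on a hypothetical data frame, the kind of quick checks described above: variable types, missing values, distribution, and a crude bucketing of one variable to classify the data points.

# Hypothetical dataset with a missing value and a suspicious income
df <- data.frame(
  age    = c(23, 35, NA, 41, 29, 95),
  income = c(40, 52, 47, 60, 45, 300)   # in $000s
)

str(df)              # types of the variables
summary(df)          # distributions, min/max, NA counts
colSums(is.na(df))   # missing values per column
boxplot(df$income)   # visual outlier check

# Bucket one variable independently to classify the data points
cut(df$age, breaks = c(0, 30, 50, Inf),
    labels = c("young", "middle", "senior"))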
To take an example differentiating data exploration and predictive analysis, say the managers of a KFC restaurant want to find out sales in a certain area for a certain age group. To get that data, the analytics team will explore the data using validated or verified variables from the given dataset. The variables would be:
1) MemberID
2) Age group
3) Ordered meal
4) Total price
5) Postcode
6) Date
From this dataset the analytics team will explore and shrink the data down to the region and age group of interest, and will carry out pattern recognition and correlation among the given variables using statistical and mathematical techniques. After the dataset has been shrunk down, predictive analysis draws its conclusions; based on this analysis, the team can forecast or predict the sales of a certain meal for a certain region and age group at a certain time of the year.
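As a rough sketch of that workflow, the base-R code below builds a small made-up orders table with the variables listed above, shrinks it down to one postcode and age group, and summarises sales for that segment; the menu items, prices and postcodes are all invented for illustration.

# Hypothetical KFC-style order data using the variables listed above
orders <- data.frame(
  MemberID   = 1:8,
  AgeGroup   = c("18-25", "26-35", "18-25", "36-45",
                 "26-35", "18-25", "36-45", "26-35"),
  Meal       = c("Zinger", "Bucket", "Zinger", "Wrap",
                 "Bucket", "Zinger", "Wrap", "Bucket"),
  TotalPrice = c(12.5, 34.0, 11.0, 9.5, 36.5, 13.0, 10.0, 33.0),
  Postcode   = c(3046, 3056, 3046, 3023, 3056, 3046, 3023, 3056),
  Date       = as.Date("2017-07-01") + 0:7
)

# Exploration: shrink the dataset down to one region and age group
segment <- subset(orders, Postcode == 3046 & AgeGroup == "18-25")

# Summarise sales for that segment by meal
aggregate(TotalPrice ~ Meal, data = segment, FUN = sum)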
Here are some key points about Data Exploration based on the above scenario:
1. Data Exploration is about describing the data through statistical and visualisation techniques. We explore data to bring its important aspects into focus for further analysis.
2. Data Exploration is essentially the first step in any data analysis.
3. Data Exploration is a prerequisite for predictive analytics.
Data Exploration Using R
R
is an emerging language and is basically used for handling big data. For any predictive
models, it is necessary to use some language to get some accurate values. And
Language R is preferred language to provide insights of data during data
exploration. I am going to represent some hypothetical example of data to
provide an idea regarding Language R
Function to Import Data
rxImport(inData = spssFileName, outFile = xdfFileName, overwrite = TRUE, reportProgress = 0)
Function for Summarizing
rxSummary(~ score:ageGroup, data = DF, transforms = list(ageGroup = cut(age, seq(0, 30, 10))))
Function for Cube
rxCube(score ~ ageGroup, data = DF, transforms = list(ageGroup = cut(age, seq(0, 30, 10))))
Function for Importing Text Data
RxTextData(file, stringsAsFactors = FALSE, colClasses = NULL, colInfo = NULL)
Function for Histogram
## Numeric variable
rxHistogram(~ Close, data = djiXdf)
## Categorical variable
rxHistogram(~ DayOfWeek, data = djiXdf)
## Panel by a categorical variable
rxHistogram(~ Close | DayOfWeek, data = djiXdf)
## Numeric variable with a frequency weighting
rxHistogram(~ Close, data = djiXdf, fweights = "Volume")
Techniques Used to Explore Data
1. Steps of Data Exploration and Preparation
2. Missing Value Treatment
· Why is missing value treatment required?
· Why does data have missing values?
· Which methods are used to treat missing values?
3. Techniques of Outlier Detection and Treatment
· What is an outlier?
· What are the types of outliers?
· What are the causes of outliers?
· What is the impact of outliers on a dataset?
· How are outliers detected?
· How are outliers removed?
Steps of Data Exploration and Preparation
This is the very first technique and it covers the collection and cleansing of data; roughly 70% of the job is collecting the data and making it uniform. At this stage, we usually perform three kinds of analysis:
· Variable identification: the predictor (input) and target (output) variables are decided.
· Univariate analysis: analysis is performed separately on categorical and continuous variables.
· Bi-variate analysis: analysis is performed based on the association and disassociation among different variables.
After these preliminary analyses, the training dataset is ready (a small R sketch follows below). On this training dataset we then perform further treatments such as missing value and outlier handling.
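A minimal base-R sketch of these univariate and bi-variate checks on a hypothetical training set (the variable names are invented) might look like this:

# Hypothetical training data: one categorical and two continuous variables
train <- data.frame(
  gender = factor(c("M", "F", "F", "M", "M", "F")),
  age    = c(25, 32, 47, 51, 38, 29),
  spend  = c(200, 340, 500, 520, 410, 310)
)

# Univariate analysis
table(train$gender)    # frequencies of a categorical variable
summary(train$age)     # distribution of a continuous variable
hist(train$spend)

# Bi-variate analysis: association between variables
cor(train$age, train$spend)            # continuous vs continuous
boxplot(spend ~ gender, data = train)  # continuous vs categorical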
Missing value treatment
Missing data in the training set can reduce the accuracy of a model and lead to biased predictions. So, missing value treatment is vital in the early stages of data exploration.
Table with missing information

Name  | Gender | Postcode | Funds
Steve | M      | 3046     | Y
Rose  | F      | 3056     | Y
Mary  |        | 3046     | N
James |        | 3023     | Y
Pete  | M      | 3056     | Y

Result from the table with missing values (rows with unknown gender are dropped)

Gender | Members | % with funds
F      | 1       | 100
M      | 2       | 100
Table with complete information

Name  | Gender | Postcode | Funds
Steve | M      | 3046     | Y
Rose  | F      | 3056     | Y
Mary  | F      | 3046     | N
James | M      | 3023     | Y
Pete  | M      | 3056     | Y

Result from the complete table

Gender | Members | % with funds
F      | 2       | 50
M      | 3       | 100
From the above example we can clearly see that missing values prevent accurate results: with the genders of Mary and James missing, both groups appear to hold funds 100% of the time, whereas the complete data shows only 50% for females. But why do we get data with missing values, nulls and empty columns? Let's discuss this in detail.
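The effect shown in the two tables can be reproduced with a few lines of base R; the sketch below computes the percentage of members holding funds per gender, first ignoring the rows where gender is missing and then after the genders have been filled in.

# Recreate the small funds table, with Gender missing for Mary and James
funds <- data.frame(
  Name   = c("Steve", "Rose", "Mary", "James", "Pete"),
  Gender = c("M", "F", NA, NA, "M"),
  Funds  = c("Y", "Y", "N", "Y", "Y")
)

pct_yes <- function(x) round(100 * mean(x == "Y"))

# Using only the rows with a known gender (misleading: 100% for both)
aggregate(Funds ~ Gender, data = funds, FUN = pct_yes)

# After the missing genders are filled in
funds$Gender[funds$Name == "Mary"]  <- "F"
funds$Gender[funds$Name == "James"] <- "M"
aggregate(Funds ~ Gender, data = funds, FUN = pct_yes)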
Reasons for missing values
We have seen the significance of treating missing values in datasets. Missing values can arise at two stages: data extraction and data collection.
Data Extraction: it is quite possible that we introduce missing values during data extraction, so we should always check the data at this stage for anomalies and compare it with the data at the source. At this stage we can easily detect and rectify problems.
Data Collection: errors or anomalies that occur at data collection time are harder to rectify. Errors occurring at this stage are usually categorised into four types:
· Missing completely at random: every observation has the same chance of having a missing value, regardless of the other variables.
· Missing at random: values are missing at random, but the missing ratio varies from variable to variable. For example, the proportion of missing values may be different for female records than for male records. In this scenario the results can be misleading and the issue is difficult to fix.
· Missing values on dependent variables: this is the case when the predictors have all their values but the dependent variable is missing, so we cannot find an accurate result. This happens frequently in healthcare datasets: one variable has all its inputs but the dependent variable is missing. Because of such situations no reliable analysis can be done and, in short, the data exploration is inaccurate.
· Missingness that depends on the missing value itself: the probability of a value being missing is directly correlated with the value itself. For example, people with a higher or lower income are likely not to respond to questions about their earnings.
Methods to treat missing values
Deletion: -this concludes deletion as List
wise documents and pair wise deletion.
Listwise deletion removes every record that has any missing value, so only complete rows remain:

Name  | Gender | Postcode | Funds
Steve | M      | 3046     | Y
Rose  | F      | 3056     | Y
Pete  | M      | 3056     | Y

Pairwise deletion keeps all records; each individual analysis simply uses whichever values are available:

Name  | Gender | Postcode | Funds
Steve | M      | 3046     | Y
Rose  | F      | 3056     | Y
Mary  |        | 3046     | N
James |        | 3023     | Y
Pete  | M      | 3056     | Y
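A small base-R illustration of the two deletion strategies on a hypothetical data frame: na.omit() performs listwise deletion, while cor() with use = "pairwise.complete.obs" is a common example of pairwise deletion.

# Hypothetical data with scattered missing values
df <- data.frame(
  age    = c(23, 35, NA, 41, 29),
  income = c(40, NA, 47, 60, 45),
  spend  = c(10, 12, 11, NA, 9)
)

# Listwise deletion: drop every row that has any missing value
na.omit(df)

# Pairwise deletion: each correlation uses whichever pairs are complete
cor(df, use = "pairwise.complete.obs")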
Prediction model: this is the most widely used and most sophisticated approach to handling missing values. A model is built on the records that have no missing values and is then used to predict (fill in) the closest plausible value for each missing entry. It is based on probability.
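Here is a hedged base-R sketch of this idea: a simple linear model is fitted on the records where income is present and its predictions are used to fill in the missing values. The variables and the choice of lm() are assumptions made purely for illustration.

# Hypothetical data: income is missing for some records
df <- data.frame(
  age    = c(23, 35, 28, 41, 29, 52),
  income = c(40, 55, NA, 63, NA, 70)
)

# Fit a model on the complete records only
fit <- lm(income ~ age, data = df, subset = !is.na(income))

# Impute: fill each missing income with the model's prediction
miss <- is.na(df$income)
df$income[miss] <- predict(fit, newdata = df[miss, ])
df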
KNN Imputation: in this method of imputation, a missing value is imputed using the values of similar records, where the similarity of two records is determined using a distance function. The approach has certain advantages and disadvantages.
Pros:
· k-nearest neighbours can predict both qualitative and quantitative attributes
· Creating a predictive model for each attribute with missing data is not required
· Attributes with multiple missing values can be easily treated
· The correlation structure of the data is taken into consideration
Cons:
· The KNN algorithm is very time-consuming when analysing a large database, because it searches through the whole dataset looking for the most similar instances.
· The choice of the k value is critical: a higher value of k would include neighbours that are significantly different from what we need, whereas a lower value of k means missing out on significant neighbours.
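For illustration only, here is a toy base-R implementation of KNN imputation; in practice a dedicated package would normally be used, and the data frame, column names and k = 2 below are assumptions made for the example.

# Toy KNN imputation: fill a missing value using the k nearest rows,
# with distance measured on the other, fully observed columns
df <- data.frame(
  age    = c(23, 25, 40, 42, 30),
  height = c(170, 172, 180, 182, 175),
  income = c(40, 42, 65, NA, 50)
)
k <- 2
target <- which(is.na(df$income))    # rows with a missing income
donors <- which(!is.na(df$income))   # rows that can supply a value

for (i in target) {
  # Euclidean distance from row i to every donor row
  d <- sqrt((df$age[donors] - df$age[i])^2 +
            (df$height[donors] - df$height[i])^2)
  nn <- donors[order(d)][1:k]          # the k most similar donors
  df$income[i] <- mean(df$income[nn])  # impute with their average
}
df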
Outlier detection and handling
Outlier is a term commonly used by analysts to refer to data points that differ markedly from the rest of the dataset. In simple words, if a dataset states the income of a certain group and two people's incomes are way too high, these values may lead to inaccurate results. In this situation, one way to get an accurate result for the rest of the dataset is to set these two incomes aside.
Reasons for Outliers
Whenever we come across outliers, we should understand why they appear in the dataset, because the cause of an outlier determines how to confront it. We broadly classify the causes into two categories:
· Non-natural (artificial)
· Natural
Data entry errors: human errors are very common during data collection. For example, if the person entering someone's income mistakenly types an extra zero, it can make a huge difference to the accuracy of the results.
Measurement errors: another type of error relates to measurement. For instance, a measurement taken on the wrong machine can lead to inaccurate results.
Experimental errors: while experimenting with new things, people may come across bizarre values, which can also lead to imprecise results.
Natural errors: when a few values are genuinely very different from the rest but still belong to the same dataset, they are known as natural outliers.
How to Remove Outliers
The simple idea behind removing outliers is to treat them as a separate segment so that they do not distort the results for the rest of the dataset.
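One common way of separating such values is the 1.5 * IQR rule; the base-R sketch below flags extreme incomes in a made-up vector and sets them aside as their own segment.

# Hypothetical income data with two extreme values
income <- c(45, 52, 48, 61, 55, 58, 400, 50, 47, 950)

# Flag outliers with the 1.5 * IQR rule
q <- quantile(income, c(0.25, 0.75))
spread <- q[2] - q[1]
is_outlier <- income < q[1] - 1.5 * spread | income > q[2] + 1.5 * spread

income[is_outlier]            # the values treated as a separate segment
summary(income[!is_outlier])  # the remaining, cleaner data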
In synopsis, Data Exploration is the initial step in finding insights in data, and to achieve this goal R and the techniques discussed above are quite helpful.
Submitted to: Dr. Shah, Victoria University, Melbourne.
Submitted by: Romika Mehta (s4574112), Nan Wu (4561246)
Assignment 2(A): Predictive Analysis, Semester 2, 2017



