One of the most conspicuous trends in today’s world is big data. Thanks to
the proliferation of advanced technology, huge amounts of data are now
available in systems, and this volume makes the data very hard to analyse.
To overcome this problem, analysts use data exploration, a technique that
helps in finding relevant information. Data exploration is used by data
consumers to obtain informative material for better analysis and to form
sound conclusions from the information gathered. According to Burnham &
Anderson (2002), data exploration is an indispensable and advanced procedure
for analysing data. It is mainly used in data warehouses. Because data comes
from different systems and sources, it can appear in different formats. Many
customers report that the sheer volume of data makes it very hard to get
correct information from a database, which creates problems for the
decision-making process. Relevant data is needed for tasks such as
statistical reporting, trend spotting and pattern spotting. Data exploration
is the process of gathering such relevant data.
Techniques of data exploration
There are two main methodologies or techniques used to retrieve relevant
data from large and unorganised databases: the manual method and the
automatic method. Data exploration is the first step in data analysis and
typically involves summarising the main characteristics of a data set. The
manual method is what is usually meant by the term data exploration. For
instance, analysts commonly use data visualisation software for data
exploration because it allows users to quickly and simply view most of the
relevant features of their data set. By using visualisation software, users
can identify variables that are likely to have interesting observations with
the help of scatter plots or bar charts. Users can see whether two or more
variables correlate and determine whether they are good candidates for
further in-depth analysis. Many tools are available on the market for
visualisation, such as the R programming language and SAP BusinessObjects
Cloud. The manual method can be applied to data of any type or size but,
because of its manual nature, it is most practical for smaller data sets.
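The idea of checking whether variables correlate can also be expressed numerically. The following is a minimal Python sketch, not a method taken from the sources cited here. It computes the Pearson correlation coefficient for every pair of variables and keeps the pairs above a chosen cut-off. The data set, the variable names, the `screen_pairs` helper and the 0.8 threshold are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data set: each key is a variable, each list its observations.
data = {
    "ad_spend": [10, 20, 30, 40, 50],
    "sales": [12, 24, 33, 46, 55],
    "returns": [5, 3, 6, 2, 4],
}


def screen_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold, strongest first."""
    hits = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            hits.append((a, b, round(r, 3)))
    return sorted(hits, key=lambda hit: -abs(hit[2]))


# Pairs worth a closer look, for example with a scatter plot.
candidates = screen_pairs(data)
print(candidates)
```

A pair that passes the threshold is only a candidate for further analysis. A scatter plot is still needed to see whether the relationship is really linear or is driven by a few unusual points.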
The second is the automatic method, which is also known as data mining. It
generally refers to gathering relevant data from large databases, whereas
data exploration refers to a data user being able to find his or her way
through large amounts of data in order to gather the necessary information
(Keim 2001).
R is a programming language that is mainly used in predictive analysis. It
is an open-source scripting language that helps in data exploration. It was
developed by Ross Ihaka and Robert Gentleman and released as open source in
1995. It is mainly used in data exploration to obtain relevant information.
Another tool available is SAP BusinessObjects Cloud, with which we can
explore data in a better way. For example, in the diagram below we use this
tool to explore how many refugees are coming to Australia from various
countries.
Source: Historical Statistics 2016, Refugee Council of Australia
Difference between data exploration and predictive analysis.
Data exploration is the first step in any data analysis. It involves
summarising the main characteristics of a database or dataset. It is mostly
done in statistical software of varying sophistication, depending on the
complexity of the dataset. It can also be conducted using visual analytics
tools, as discussed above.
Predictive analysis falls under advanced analytics and is used to make
predictions about unknown events that might unfold in the future. It uses a
host of different software, paired with techniques such as artificial
intelligence (AI) and machine learning, to analyse existing data and make
predictions about its future course. The difference is that data exploration
uncovers the complex and often unseen relationships between measurable
variables, while predictive analysis uses those variables to offer possible
future outcomes for them (Keim 2001).
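The contrast can be made concrete with a small Python sketch. The first half only describes data that already exists. The second half fits a straight line by ordinary least squares and extrapolates one step ahead. Simple linear regression is used here as a stand-in for the richer AI and machine learning techniques mentioned above. The observations, the `fit_line` helper and the variable names are assumptions for illustration only.

```python
from statistics import mean, stdev

# Hypothetical yearly observations; not taken from any source in this post.
years = [1, 2, 3, 4, 5]
values = [10.0, 12.0, 14.0, 16.0, 18.0]

# Exploration: summarise what is already in the data.
summary = {
    "mean": mean(values),
    "stdev": stdev(values),
    "min": min(values),
    "max": max(values),
}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Prediction: use the fitted line to estimate a value that is not yet observed.
intercept, slope = fit_line(years, values)
forecast_year_6 = intercept + slope * 6

print(summary)
print(forecast_year_6)
```

The summary says nothing about year 6, while the fitted line says nothing about the spread of the existing values. In practice the exploration step comes first, because it shows whether a straight line is a reasonable model at all.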
Impacts of exploration of data
· Understand the data
Understanding the data is an important part of data exploration, and it
raises questions such as: how many fields are available, what types of data
are represented, and what units are used in the data?
· Organize and subset the database
After understanding the data, the next main step in data exploration is to
organise it. Two popular tools are sort and filter, which make it possible
to arrange or narrow down the data before making decisions about models.
Filtering makes it possible to investigate a large database and extract what
interests us.
· Examine individual variables and their distributions
In this step we can order the values of a numerical variable from lowest to
highest. A common way to summarise the distribution of such a variable is
the histogram.
· Calculate summary measures for individual variables
After examining individual variables and their distributions, the next step
is to calculate summary measures. Excel provides useful functions for
investigating individual variables, and it can also be used to identify or
count specific values of a variable.
· Examine relationships among variables
Graphical methods make it possible to track relationships between variables.
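The first four activities above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than a prescribed workflow. The records, the field names (`region`, `year`, `units`) and the bin width of 100 are assumptions.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records; field names and values are illustrative only.
rows = [
    {"region": "north", "year": 2014, "units": 120},
    {"region": "south", "year": 2014, "units": 300},
    {"region": "north", "year": 2015, "units": 150},
    {"region": "east", "year": 2015, "units": 90},
    {"region": "south", "year": 2015, "units": 310},
    {"region": "east", "year": 2016, "units": 400},
]

# 1. Understand the data: which fields exist and what type each one holds.
fields = {name: type(value).__name__ for name, value in rows[0].items()}

# 2. Organise and subset: keep one year, then sort by units, largest first.
subset = sorted(
    (row for row in rows if row["year"] == 2015),
    key=lambda row: row["units"],
    reverse=True,
)

# 3. Distribution: a text histogram counting values per bin of width 100.
histogram = Counter((row["units"] // 100) * 100 for row in rows)
for bin_start in sorted(histogram):
    print(f"{bin_start:>4}-{bin_start + 99:<4} {'#' * histogram[bin_start]}")

# 4. Summary measures for one variable, plus a count per category.
units = [row["units"] for row in rows]
summary = {"count": len(units), "mean": mean(units), "median": median(units)}
rows_per_region = Counter(row["region"] for row in rows)

print(fields)
print([row["units"] for row in subset])
print(summary)
print(rows_per_region)
```

The same steps map onto spreadsheet features: column headers and formats for step 1, sort and filter for step 2, a histogram chart for step 3, and counting and averaging functions for step 4.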
Steps of Data Exploration and Preparation
Building a predictive model involves many steps to understand, clean and
prepare the data:
3. Bi-variate analysis: Bi-variate analysis finds out the relationship
between two variables. Here, we look for association and disassociation
between variables at a pre-defined significance level. We can perform
bi-variate analysis for any combination of categorical and continuous
variables. The combination can be Categorical & Categorical, Categorical &
Continuous, or Continuous & Continuous. Different methods are used to tackle
these combinations during the analysis process.
4. Missing values treatment: Missing data in the training data set can
reduce the power or fit of a model, or can lead to a biased model, because
the behaviour of a variable and its relationship with other variables have
not been analysed correctly. This can lead to wrong predictions or
classifications.
5. Outlier treatment: Outlier is a term commonly used by analysts and data
scientists. Outliers need close attention, otherwise they can result in
wildly wrong estimations. For example, suppose we do customer profiling and
find that the average annual income of customers is $0.8 million, but two
customers have annual incomes of $4 million and $4.2 million. The annual
income of these two customers is much higher than that of the rest of the
population, so they are treated as outliers.
6. Variable transformation: Transformation refers to the replacement of a
variable by a function of that variable. For instance, replacing a variable
x by its square root, cube root or logarithm is a transformation. In other
words, transformation is a process that changes the distribution of a
variable or its relationship with other variables.
7. Variable creation: Variable creation is the process of generating new
variables or features based on existing variables. For example, say we have
a date (dd-mm-yy) as an input variable in a data set. We can generate new
variables such as day, month, year, week and weekday that may have a better
relationship with the target variable. This step is used to highlight hidden
relationships in a variable (Zuur 2010).
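Steps 4 to 7 can be illustrated with a short Python sketch that uses only the standard library. It shows one possible treatment for each step, chosen for simplicity: median imputation for missing values, the 1.5 x IQR rule for outliers, a natural logarithm as the transformation, and date parts as new variables. None of these choices is prescribed by the sources above. The income list is hypothetical, loosely modelled on the customer example in step 5, and the date string is likewise an assumed input.

```python
import math
from datetime import datetime
from statistics import median, quantiles

# Hypothetical annual incomes in $ million; None marks a missing value.
incomes = [0.6, 0.7, None, 0.8, 0.75, 0.9, 0.65, 4.0, 4.2, 0.85]

# Step 4, missing values: fill gaps with the median of the observed values.
observed = [value for value in incomes if value is not None]
fill_value = median(observed)
imputed = [fill_value if value is None else value for value in incomes]

# Step 5, outliers: flag values outside 1.5 * IQR beyond the quartiles.
q1, _, q3 = quantiles(imputed, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in imputed if v < lower_fence or v > upper_fence]

# Step 6, transformation: a log transform pulls in the long right tail.
log_incomes = [math.log(value) for value in imputed]

# Step 7, variable creation: derive new features from a dd-mm-yy date.
stamp = datetime.strptime("15-09-17", "%d-%m-%y")
date_features = {
    "day": stamp.day,
    "month": stamp.month,
    "year": stamp.year,
    "week": stamp.isocalendar()[1],  # ISO week number
    "weekday": stamp.weekday(),      # Monday is 0, Sunday is 6
}

print(fill_value)
print(outliers)
print(date_features)
```

Whether a flagged value should be removed, capped or kept depends on the analysis. The sketch only identifies candidates, and the median is used for imputation because it is less affected by the two very large incomes than the mean would be.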
Submitted by: Harbinder Kaur Bhullar (4510860)
Savita Rani (4491997)
Submitted to: Dr Shah Miah
Date: 15 September 2017
References:
· Burnham, K.P. & Anderson, D.R. (2002) Model Selection and Multimodel
Inference: A Practical Information-Theoretic Approach, 2nd edn. Springer,
New York.
· http://www.learn.geekinterview.com/data-warehouse/data-analysis/what-is-data-exploration.html
· https://www.quora.com/What-is-the-difference-between-data-exploration-and-predictive-analytics
· Keim, D. (2001) Visual exploration of large databases. Communications of
the ACM, vol. 44, no. 8, pp. 38–44.

