Friday, 15 September 2017

Meaning of data exploration and its techniques

One of the most conspicuous trends in today’s world is big data. Thanks to the proliferation of advanced technology, huge amounts of data are available in systems, and this volume makes the data very hard to analyse. To overcome this concern, analysts use data exploration, a technique that helps in finding relevant information.
Data exploration is used by data consumers to draw informative insights from data and to form sound analysis from the information gathered. According to Burnham & Anderson (2002), data exploration is an indispensable and advanced procedure for analysing data. It is mainly used in data warehouses. Because data comes from different systems and sources, it can appear in different formats. Many customers report that the sheer volume of data makes it very hard to get correct information from the database, which creates problems for decision making. Relevant data is needed for tasks such as statistical reporting, trend spotting and pattern spotting. Data exploration is the process of gathering such relevant data.
Techniques of data exploration
There are two main methodologies or techniques used to retrieve relevant data from a large and unorganised database: the manual method and the automatic method. The manual method is also called data exploration. It is the first step in data analysis and typically involves summarising the main characteristics of a data set. For instance, analysts commonly use data visualisation software for data exploration because it allows users to quickly and simply view most of the relevant features of their data set. Using visualisation software, users can identify variables that are likely to yield interesting observations with the help of scatter plots or bar charts. Users can see whether two or more variables correlate and determine whether they are good candidates for further in-depth analysis. Many software tools are available on the market for visualisation, such as R and SAP BusinessObjects Cloud. This methodology can be applied to data of any type or size but, because of its manual nature, it is most practical for smaller data sets.
The second is the automatic method, which is also known as data mining. It generally refers to gathering relevant data from large databases, whereas data exploration refers to a data user finding his or her way through large amounts of data in order to gather the necessary information (Keim 2001).
R is a programming language that is mainly used in predictive analysis. It is an open-source scripting language that helps in data exploration. It was developed by Ross Ihaka and Robert Gentleman and released as open-source software in 1995. It is widely used in data exploration for getting relevant information.
Another available tool is SAP BusinessObjects Cloud, with which we can explore data more effectively. For example, the diagram below uses this tool to explore how many refugees come to Australia from various countries.


Source: Historical Statistics 2016, Refugee Council of Australia
Difference between data exploration and predictive analysis
Data exploration is the first step in any data analysis. It involves summarising the main characteristics of a database or data set. It is mostly done in statistical software of varying sophistication, depending on the complexity of the data set. It can also be conducted using visual analytics tools, as discussed above.
Predictive analysis falls under advanced analytics and is used to make predictions about unknown events that might unfold in the future. It uses a host of different software and pairs it with techniques such as artificial intelligence (AI) and machine learning to analyse existing data and predict its future course. The difference is that data exploration uncovers the complex and often unseen relationships between measurable variables, while predictive analysis derives possible future outcomes for the variables from the variables themselves (Keim 2001).
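The contrast can be illustrated with a minimal Python sketch. The yearly figures below are invented purely for illustration and are not taken from any source, and the helper names pearson and fit_line are hypothetical. The first calculation only describes a relationship that is already in the data (exploration); the second uses a simple least-squares line, chosen here as one of the simplest possible predictive models, to estimate a value that has not been observed yet (prediction).

```python
from statistics import mean

# Hypothetical yearly figures, invented purely for illustration.
years = [2012, 2013, 2014, 2015, 2016]
arrivals = [100, 120, 138, 161, 180]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    slope = cov / var_x
    return slope, my - slope * mx


# Exploration: describe the relationship already present in the data.
r = pearson(years, arrivals)

# Prediction: use that relationship to estimate an unseen future value.
slope, intercept = fit_line(years, arrivals)
forecast_2017 = slope * 2017 + intercept

print(f"correlation = {r:.3f}")
print(f"forecast for 2017 = {forecast_2017:.1f}")
```

A correlation close to 1 tells us only that the two variables move together in the observed years. The forecast goes a step further and assumes the same trend continues, which is exactly the extra assumption that predictive analysis makes.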
Impacts of exploration of data
·         Understand the data
Understanding the data is an important part of data exploration, and it raises several questions: how many fields are available, what types of data are represented, and what units are used in the data?
·         Organize and subset the database
After understanding the data, the next main step in data exploration is to organise it. Two popular tools are sort and filter, which make it possible to sort or filter the data before making decisions about models. Filtering makes it possible to investigate a large database and extract what interests us.
·         Examine individual variables and their distributions
Here we can order the values of a numerical variable from lowest to highest. A common way to summarise their distribution is the histogram.
·         Calculate summary measures for individual variables
After examining individual variables and their distributions, the next step is to calculate summary measures. Excel provides useful functions for investigating individual variables, and it can also be used to identify or count specific values.
·         Examine relationships among variables
Graphical methods make it possible to track relationships between variables.
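The five activities above can be sketched in a few lines of Python using only the standard library. The records, field names and bin width below are assumptions invented for illustration; in practice the same steps would usually be carried out in Excel, R or a visualisation tool on real data.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical records, invented for illustration only.
records = [
    {"country": "A", "year": 2015, "arrivals": 420},
    {"country": "B", "year": 2015, "arrivals": 150},
    {"country": "A", "year": 2016, "arrivals": 510},
    {"country": "C", "year": 2016, "arrivals": 90},
    {"country": "B", "year": 2016, "arrivals": 230},
]

# 1. Understand the data: which fields exist and what type each one holds.
fields = {name: type(value).__name__ for name, value in records[0].items()}

# 2. Organise and subset: sort by a numeric field, then filter on a condition.
by_arrivals = sorted(records, key=lambda row: row["arrivals"])
only_2016 = [row for row in records if row["year"] == 2016]

# 3. Distribution of one variable: count values per fixed-width bin.
bin_width = 200
histogram = Counter(
    (row["arrivals"] // bin_width) * bin_width for row in records
)

# 4. Summary measures for one variable.
values = [row["arrivals"] for row in records]
summary = {
    "min": min(values),
    "max": max(values),
    "mean": mean(values),
    "median": median(values),
    "stdev": stdev(values),
}

# 5. Relationship between a categorical and a numeric variable:
#    total arrivals per country.
totals = Counter()
for row in records:
    totals[row["country"]] += row["arrivals"]

print("fields:", fields)
print("smallest record:", by_arrivals[0])
print("records in 2016:", len(only_2016))
for start in sorted(histogram):
    # A simple text histogram: one '#' per record in the bin.
    print(f"{start}-{start + bin_width - 1}: {'#' * histogram[start]}")
print("summary:", summary)
print("totals per country:", dict(totals))
```

The text histogram stands in for a chart here. With a plotting library the same bin counts could be drawn as bars, and the per-country totals as a bar chart.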

 

Steps of Data Exploration and Preparation

Building a predictive model involves several steps to understand, clean and prepare the data:
1.      Variable identification: In this step we identify the input and output variables. For instance, suppose we want to predict whether students will play cricket or not. Here we need to identify the predictor variables, the target variable, and the data type and category of each variable.


2.      Univariate analysis: We explore variables one by one. The method used for univariate analysis depends on whether the variable is categorical or continuous.
3.      Bi-variate analysis: Bi-variate analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables: Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to handle each combination during analysis.
4.      Missing values treatment: Missing data in the training data set can reduce the power or fit of a model, or can lead to a biased model, because the behaviour of a variable and its relationship with other variables have not been analysed correctly. This can lead to wrong predictions or classifications.
5.      Outlier treatment: Outlier is a term commonly used by analysts and data scientists, because outliers need close attention; otherwise they can result in wildly wrong estimates. For example, suppose we do customer profiling and find that the average annual income of customers is $0.8 million, but two customers have annual incomes of $4 million and $4.2 million. The annual income of these two customers is much higher than that of the rest of the population, so they are outliers.
6.      Variable transformation: Transformation refers to the replacement of a variable by a function of that variable. For instance, replacing a variable x by its square root, cube root or logarithm is a transformation. In other words, transformation is a process that changes the distribution of a variable or its relationship with others.
7.      Variable creation: Variable creation is the process of generating new variables or features based on existing variables. For example, say we have a date (dd-mm-yy) as an input variable in a data set. We can generate new variables such as day, month, year, week and weekday that may have a better relationship with the target variable. This step is used to highlight hidden relationships in a variable (Zuur et al. 2010).
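Steps 4 to 7 can be sketched in Python with the standard library. Everything in this example is an assumption made for illustration: the customer rows are invented, median imputation is only one of several ways to treat missing values, and the 1.5 × IQR rule is one common convention for flagging outliers rather than a method prescribed above.

```python
import math
from datetime import datetime
from statistics import median, quantiles

# Hypothetical customer rows (income in $ million); None marks a missing value.
rows = [
    {"joined": "03-01-17", "income": 0.7},
    {"joined": "15-02-17", "income": None},
    {"joined": "20-02-17", "income": 0.9},
    {"joined": "01-03-17", "income": 0.8},
    {"joined": "11-03-17", "income": 4.2},
    {"joined": "25-03-17", "income": 0.6},
    {"joined": "02-04-17", "income": 1.0},
    {"joined": "19-04-17", "income": 0.75},
]

# Step 4, missing values: fill each gap with the median of the known incomes.
known = [row["income"] for row in rows if row["income"] is not None]
fill_value = median(known)
for row in rows:
    if row["income"] is None:
        row["income"] = fill_value

# Step 5, outliers: flag incomes further than 1.5 * IQR from the quartiles.
incomes = [row["income"] for row in rows]
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in incomes if x < low or x > high]

# Step 6, transformation: the logarithm compresses the long right tail.
for row in rows:
    row["log_income"] = math.log(row["income"])

# Step 7, variable creation: derive new fields from a dd-mm-yy date string.
for row in rows:
    joined = datetime.strptime(row["joined"], "%d-%m-%y")
    row["day"] = joined.day
    row["month"] = joined.month
    row["year"] = joined.year
    row["weekday"] = joined.weekday()  # Monday is 0, Sunday is 6

print("imputed with:", fill_value)
print("outliers:", outliers)
print("example row:", rows[4])
```

The median is used for imputation because, unlike the mean, it is not pulled upwards by the very high income in the sample. Whether a flagged outlier should be removed, capped or kept is a separate decision that depends on the analysis.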













Submitted by- Harbinder Kaur Bhullar(4510860)
Savita Rani (4491997)
Submitted to- Dr Shah Miah
Date-15 September 2017       





References:
·         Burnham, K.P. & Anderson, D.R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd edn. Springer, New York.
·         http://www.learn.geekinterview.com/data-warehouse/data-analysis/what-is-data-exploration.html
·         https://www.quora.com/What-is-the-difference-between-data-exploration-and-predictive-analytics
·         Keim, D. (2001) Visual exploration of large databases. Communications of the ACM, 44(8), pp. 38–44.
·         Zuur, A.F., Ieno, E.N. & Elphick, C.S. (2010) A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1(1), pp. 3–14.
