Friday, 15 September 2017

Data Exploration in Predictive Analytics

By: Shehan Rodrigo, Piyumi Gunamuthu & Samson Akinyemi
 (Master's students at Victoria University, Melbourne)

Data is everywhere, but its value depends on how well we extract knowledge from it. Visual tools and other analytical techniques support data exploration for the meaningful discovery of information. This blog post was written as part of an assignment for the unit Predictive Analytics at Victoria University, Melbourne. We would like to thank our lecturer, Dr. Shah Miah, who guided us in understanding the value and scope of predictive analytics.

Data analytics opens up many opportunities for businesses by providing deeper and more meaningful insights, which strengthen the decision-making process and improve the customer experience. But today most companies are overwhelmed by the variety and volume of the data they hold, and they struggle to analyze, interpret and visualize it in a meaningful way. To overcome these issues, organizations are moving towards predictive analytics, data discovery methods and visualization tools. To use these tools properly, it is very important to understand the nature of the data and to feed it to each tool in the form that tool requires.

What is Data Exploration?           
Data exploration is the first step in data analysis and involves summarizing the main characteristics of a data set. It is usually carried out with visual analytics tools such as SAP Lumira, SAP Analytics Cloud and Tableau, and it can also be done with more advanced statistical software such as R. These tools let users quickly and simply view the most relevant features of their data set. By displaying data graphically, for example through scatter plots or bar charts, users can see whether two or more variables correlate and decide whether they are good candidates for further in-depth analysis.
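To make "summarizing the main characteristics" and "seeing whether two variables correlate" a little more concrete, here is a minimal sketch. It is written in Python with only the standard library rather than in any of the tools named above, and the two columns (flight distance and delay) hold made-up values that exist purely for illustration. The function names summarize and pearson are our own.

```python
import math
import statistics


def summarize(values):
    """Return the main characteristics of one numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, invented for this example only.
distance_km = [300, 650, 900, 1200, 1500]
delay_min = [5, 9, 8, 15, 18]

print(summarize(delay_min))
print(round(pearson(distance_km, delay_min), 3))
```

A coefficient close to 1 or -1 suggests the pair of variables is worth a closer look, while a value near 0 suggests there is no linear relationship to explore further.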

What is R?
Next, we would like to discuss some aspects of the data exploration tools mentioned above, starting with R. R is an open-source programming and scripting language for statistical computing, predictive analytics and data visualization. It allows academic statisticians and others with sophisticated programming skills to perform complex statistical analysis and display the results in a multitude of visual graphics. The name "R" is derived from the first letter of the first names of its two developers, Ross Ihaka and Robert Gentleman, who developed it at the University of Auckland in the 1990s. R provides a wide variety of statistical and graphical techniques, such as linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering, and it is highly extensible. One of R’s strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where needed.

Examples of Data Exploration Using Different Tools

Below is a screenshot from a tutorial session in which we practiced the R programming language by generating a histogram to explore data on airline delays. Revolution R Enterprise, the R distribution used in the tutorial, provides parallel external memory algorithms that compute results on data stored in the XDF file format. The R language itself includes conditionals, loops, user-defined recursive functions and input and output facilities. R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both online in a number of formats and in hard copy. R also helps to move, manage, munge and merge data to retrieve useful insights.

Source: https://campus.datacamp.com/courses/big-data-revolution-r-enterprise-tutorial/chapter-1-introduction-1?ex=6
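For readers who want to see what a histogram does underneath the plot, the following is a rough sketch of the binning step. It is written in Python rather than R and does not reproduce the tutorial's code. The delay values are made up for illustration, and the function names histogram and render are our own.

```python
def histogram(values, bin_width):
    """Count how many values fall into each bin [lower, lower + bin_width)."""
    counts = {}
    for value in values:
        # Floor division also places negative values (early arrivals)
        # in the correct bin, e.g. -5 falls into the bin starting at -15.
        lower = (value // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


def render(counts, bin_width):
    """Draw the histogram as text, one row of '#' marks per bin."""
    lines = []
    for lower, n in counts.items():
        lines.append(f"{lower:>4} to {lower + bin_width:>4} | {'#' * n}")
    return "\n".join(lines)


# Hypothetical arrival delays in minutes, invented for this example only.
arrival_delays = [-5, 0, 3, 12, 14, 15, 31, 47, 8, -2]

counts = histogram(arrival_delays, 15)
print(render(counts, 15))
```

The choice of bin width matters: bins that are too wide hide the shape of the distribution, while bins that are too narrow make it look noisy.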

Similarly, other data visualization tools and software make it possible to explore data in a more meaningful way. Below is another example, in which we explored how refugees were spread across the world in 2015. We used the SAP Analytics Cloud visualization software to explore data that we obtained from the World Bank. We first inserted the data into an Excel sheet and then uploaded it to the online software. This helped us to easily identify the regions with the most refugees and to rank the countries. Done manually, this output would have taken hours to generate, but with the visualization software it was easy to explore the story behind the data. At a glance, we can identify which country has the highest refugee population and where it is located. The tool saves time and makes the analysis easier.

Data obtained from: data.worldbank.org/indicator/SM.POP.REFG.OR
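The ranking step that the visualization software performed for us can be sketched by hand as follows. This is only an illustration in Python with the standard library, not what SAP Analytics Cloud does internally. The column names country and refugees_2015, the placeholder country names and all the numbers are assumptions made up for the example, and the real World Bank export is laid out differently.

```python
import csv
import io

# Hypothetical, simplified data; not real World Bank figures.
SAMPLE_CSV = """country,refugees_2015
Country A,1200
Country B,
Country C,45000
Country D,300
"""


def rank_by_value(csv_text, name_col, value_col, top_n=3):
    """Return the top_n (name, value) pairs, largest value first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row[value_col].strip()
        if not raw:
            # Skip rows with a missing value instead of treating them as zero.
            continue
        rows.append((row[name_col], int(raw)))
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:top_n]


top = rank_by_value(SAMPLE_CSV, "country", "refugees_2015")
for position, (country, value) in enumerate(top, start=1):
    print(f"{position}. {country}: {value}")
```

Skipping empty cells is a deliberate choice here: counting a missing figure as zero would push that country to the bottom of the ranking for the wrong reason.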

As another example, the visualization below was generated for a previous research study. The study was conducted to present data sets in a visualization report that gives a better and simpler insight into illicit drug usage in Australia, using the SAP BusinessObjects Lumira tool. We found the tool very useful, with many options for analysing and visualizing data. Even a beginner can easily produce outputs with it. The visualization below reflects the popularity of different types of drugs in Australia across two different sets of years.

Data obtained from: https://www.aihw.gov.au/reports-statistics/behaviours-risk-factors/illicit-use-of-drugs/data

From all these examples, we can conclude that proper data exploration methods allow analysts and other interested parties to explore the stories behind their data sets.

Impact of Data Exploration in Predictive Analytics
It is also important to discuss the impact of data exploration on predictive analytics. The quality of your data inputs decides the quality of your output. So, once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here. According to researchers’ estimates, data exploration, cleaning and preparation can take up to 70% of your total project time. Analysts put so much effort into these tasks because the result is a high-quality, accurate output. Organizations allocate large budgets to these tasks because they expect a reliable analysis from the process, which is important for making decisions and predicting the organization's future in order to achieve a competitive advantage in the industry. Most organizations plan for the future and set goals and targets based on the outputs generated through this process.

Challenges Related to Data Exploration 
Different studies have also identified challenges related to data exploration. User interaction with the database at various levels of request is still a challenge. The ability to steer the database automatically with minimal query formulation is a task that requires a high level of technical know-how, including the skilled use of programming instructions and their integration.

In their overview of data exploration techniques, Idreos et al. (2015) highlighted the challenge experienced by data analysts at the user interaction layer, suggesting that we still lack declarative “exploration” languages to present and reason about popular navigational idioms. Such languages could facilitate custom optimizations, such as user-driven prefetching, reusing past or in-progress query results and customizing visualization tools. Other future directions include processing past user interaction histories to predict exploration trajectories and identify interesting exploration patterns. Similarly, at the database system layer there are numerous opportunities to reconsider fundamental assumptions about data and storage patterns and how they can be driven dynamically by high-level requests.

The overall aim of data exploration is to achieve data navigation systems that automatically steer users towards interesting data. Visualization tools should inherently support exploration by helping to identify patterns, driven fully by the exploration paths that users take. Exploration methods should be able to provide answers instantly, even if those answers are not complete, and they should eventually lead users towards interesting stories.


Used Resources

Idreos, S., Papaemmanouil, O. and Chaudhuri, S., 2015, May. Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 277-281). ACM.

Linoff, G.S. and Berry, M.J., 2011. Data mining techniques: for marketing, sales, and customer relationship management. John Wiley & Sons.

Matloff, N., 2011. The art of R programming: A tour of statistical software design. No Starch Press.

Ohlhorst, F.J., 2012. Big data analytics: turning big data into big money. John Wiley & Sons.
