Monday, 17 September 2018

DATA MINING


Data mining is the process of discovering patterns in large data sets, involving methods at the intersection of database systems, machine learning and statistics. It is an interdisciplinary sub-field of computer science with the overall goal of extracting information from a data set and transforming it into an understandable structure for further use. Data mining is the analysis step of the KDD, or "knowledge discovery in databases", process. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, visualization, post-processing of discovered structures, and online updating.

The term "data mining" is in fact a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of data itself. The term is also often applied loosely to any form of large-scale data or information processing, as well as to any application of computer decision support, including artificial intelligence and business intelligence. The book Data Mining: Practical Machine Learning Tools and Techniques with Java was originally to be named simply Practical Machine Learning, and the term data mining was only added for marketing reasons. Often the more general terms data analysis and analytics, or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records, unusual records, and dependencies. This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For instance, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results from a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps.
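As a minimal illustration of the "groups of data records" task mentioned above, the following base-R sketch clusters the built-in iris data set with k-means; the data set and column choices are ours, purely for illustration.

# Minimal clustering sketch: group records with k-means,
# a common technique for discovering groups of data records.
data(iris)
measurements <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

set.seed(42)                         # reproducible cluster assignments
fit <- kmeans(measurements, centers = 3)

# Compare the discovered groups with the known species labels
table(Cluster = fit$cluster, Species = iris$Species)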

by Hari & Arpan

Friday, 15 September 2017

Predictive Analytics



The technology that amazes us by predicting future outcomes and trends, by extracting information and creating patterns from existing historical data sets, is known as Predictive Analytics. It is a branch of advanced analytics used to make predictions about unknown future events. Predictive Analytics uses techniques from data mining, statistical modelling, machine learning and artificial intelligence to analyze current data and make predictions about future outcomes.

This blog post has been written as part of an assignment for the Predictive Analytics unit studied at Victoria University. Getting practical hands-on experience with SAP products at the university has been the best part of the learning experience. We are very grateful to our lecturer, Dr Shah Miah, who helped us understand the value and scope of Predictive Analytics.

As competition grows in the market, organizations accumulate heaps of data in their databases; such data is also known as Big Data. This Big Data, made up of historical data sets, can be used to produce valuable information about the organization and its processes. The information is extracted with the help of data mining tools, and a predictive model is built. These models help companies make better decisions.

In our task, we discussed: why are more and more organizations turning to Predictive Analytics?
To increase their bottom line and competitive advantage, more and more organizations are turning to predictive analytics because of growing volumes and types of data, greater interest in using data to produce valuable insights, faster and cheaper computers, easier-to-use software, tougher economic conditions, and a need for competitive differentiation.

The next question that we touched on is: who is using it?
Various industries use Predictive Analytics to reduce risk, optimize their operations and increase revenue, for example banking and financial services, manufacturing, healthcare and production. Even governments are getting the benefits of Predictive Analytics for insightful data reports.

How does this technology work?
Predictive Analytics models use known results to develop a model that can be used to predict values for new or different data. Modeling provides results in the form of predictions that represent the probability of a target variable (for example, revenue) based on the estimated significance of a set of input variables. Two different types of models, classification models and regression models, are used to predict future outcomes.

What are the different types of models?
The first is the classification model, essentially a Boolean approach (0 or 1) in which you receive one of two answers, such as yes or no. For instance, you might classify whether someone is likely to leave, whether they will respond to a solicitation, or whether they are a good or bad credit risk. The model results are normally in the form of 0 or 1. The second is the regression model, which predicts a number: for example, how much revenue a customer will generate over the next year, or the number of months before a component will fail on a machine.
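As a minimal sketch of the two model types, the following base-R example fits a classification model (logistic regression on a 0/1 outcome) and a regression model (linear regression on a numeric outcome) using the built-in mtcars data set; the variable choices are ours, purely for illustration.

data(mtcars)

# Classification model: predict a 0/1 outcome (automatic vs manual transmission)
clf <- glm(am ~ hp + wt, data = mtcars, family = binomial)
head(predict(clf, type = "response"))   # probabilities, thresholded into yes/no

# Regression model: predict a number (fuel efficiency in mpg)
reg <- lm(mpg ~ hp + wt, data = mtcars)
head(predict(reg))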

What is the importance of Predictive Analytics?
As already mentioned, more and more organizations are turning to Predictive Analytics to increase their competitive advantage, and detecting fraud, optimizing marketing campaigns, improving operations and reducing risk are some of its most common uses. In fraud detection, combining multiple analytics methods can improve pattern detection and prevent criminal behaviour. In marketing campaigns, predictive analytics is used to determine customer responses or purchases and to promote cross-sell opportunities; it helps businesses attract, retain and grow their most profitable customers. In improving operations, many companies use predictive models to forecast inventory and manage resources; hotels, for example, try to predict the number of guests on any given night to maximize occupancy and increase revenue. This helps organizations function more efficiently. Finally, in reducing risk, credit scores, which assess a buyer's likelihood of default, are a well-known example of predictive analytics: a credit score is a number generated by a predictive model that incorporates all data relevant to a person's creditworthiness. Other risk-related uses include insurance claims and collections.
Hence, it can be inferred that Predictive Analytics has become crucial not only in business but also for many other purposes. Several techniques and models help organizations obtain insights from their data. However, it is essential that companies interested in implementing Predictive Analytics develop deeper technical skills, so that they can adapt the solution to their specific needs and requirements and realize greater profits.
Dhruv Arora / Daniel Velasco


   




Data Exploration in Predictive Analytics

By: Shehan Rodrigo, Piyumi Gunamuthu & Samson Akinyemi
(Masters students at Victoria University Melbourne)

Data is everywhere, but we can measure the value of data only by the way we extract knowledge from it. Visual tools and other analytical techniques support data exploration for the meaningful discovery of information. This blog post has been written as part of an assignment for the Predictive Analytics unit studied at Victoria University Melbourne. We would like to thank our lecturer, Dr Shah Miah, who guided us in understanding the value and scope of Predictive Analytics.

Data analytics adds many opportunities for businesses by providing deeper and more meaningful insights, enabling them to strengthen their decision-making processes and improve the customer experience. Today, however, most companies are overwhelmed by the variety and amount of data they hold and struggle to analyze, interpret and visualize it in a meaningful manner. To overcome these issues, organizations are moving towards predictive analytics, data discovery methods and visualization tools. To handle these analytics and visualization tools properly, it is very important to identify the nature of the data and feed it correctly into each tool in the required manner.

What is Data Exploration?
Data exploration can be identified as the first step in data analysis; it basically involves summarizing the main characteristics of a data set. Data exploration is usually carried out with visual analytics tools such as SAP Lumira, SAP Analytics Cloud and Tableau, and can also be done with more advanced statistical software such as R. These tools allow users to quickly and simply view the most relevant features of their data set by displaying data graphically, for example through scatter plots or bar charts. Users can see whether two or more variables correlate and determine whether they are good candidates for further in-depth analysis.
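As a minimal base-R sketch of this kind of first look, the following uses the built-in mtcars data set (our choice, purely for illustration) to plot two variables and check whether they correlate.

data(mtcars)

# Quick overview of a few variables
summary(mtcars[, c("mpg", "wt", "hp")])

# Scatter plot to eyeball whether two variables correlate
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# A correlation coefficient backs up the visual impression
cor(mtcars$wt, mtcars$mpg)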

What is R?
Further, we would like to discuss some aspects of the above-mentioned data exploration tools. Starting with R: it is a programming language, often described as an open-source scripting language, for predictive analytics and data visualization. R allows academic statisticians and others with sophisticated programming skills to perform complex statistical analysis and display the results in any of a multitude of visual graphics. The name "R" is derived from the first letter of the names of its two developers, Ross Ihaka and Robert Gentleman, who were associated with the University of Auckland in 1995. R provides a wide variety of statistical and graphical techniques, such as linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering, and it is highly extensible. One of R's strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where needed.

Examples of data exploration using different tools

Below is a snipped image from a tutorial session in which we practiced the R programming language, generating a histogram to explore data about airline delays. The tutorial used R's parallel external-memory algorithms, which compute results on data stored in the XDF file format. R includes conditionals, loops, user-defined recursive functions, and input and output facilities. It has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both online in a number of formats and in hard copy. R also helps to move, manage, munge and merge data to retrieve useful insights.

Source: https://campus.datacamp.com/courses/big-data-revolution-r-enterprise-tutorial/chapter-1-introduction-1?ex=6
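For readers who want to reproduce this kind of exploration, here is a hedged sketch in the RevoScaleR style used by that tutorial (rxImport and rxHistogram are RevoScaleR functions; the file and column names are hypothetical placeholders).

library(RevoScaleR)   # ships with Microsoft R / Revolution R Enterprise

# Hypothetical file names, for illustration only
airlineCsv <- "AirlineDelays.csv"
airlineXdf <- "AirlineDelays.xdf"

# Convert the CSV to the XDF external-memory format once...
rxImport(inData = airlineCsv, outFile = airlineXdf, overwrite = TRUE)

# ...then draw an external-memory histogram of arrival delays
rxHistogram(~ ArrDelay, data = airlineXdf)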

Similarly, other data visualization tools and software provide further ways to explore data meaningfully. Below is another example, in which we explored how refugees were spread across the world in 2015. Here we used the SAP Analytics Cloud visualization software to explore data obtained from the World Bank: we first inserted the data into an Excel sheet and then uploaded it to the online software. The tool makes it easy to identify the regions where the most refugees are scattered and to rank those countries. Done manually, it would take hours to generate this output, but with the visualization software it was easy to explore the story behind the data. At a glance, we can identify which country has the highest refugee population and its geographical location, which saves time and helps with analysis.

Data obtained from: data.worldbank.org/indicator/SM.POP.REFG.OR

As another example, below is a visualization output we generated for a previous research study. That study was conducted to represent data sets in a visualization report giving a better and simplified insight into illicit drug usage in Australia, using the SAP BusinessObjects Lumira tool. We found the tool very useful, with many options for analysing and visualizing data; even a beginner can easily produce outputs with it. The visualization below reflects the popularity of different types of drugs in Australia across two different year ranges.

Data obtained from: https://www.aihw.gov.au/reports-statistics/behaviours-risk-factors/illicit-use-of-drugs/data

From all these examples, we can see that proper data exploration methods allow analysts and other interested parties to explore the stories behind raw data sets.

Impact of Data Exploration in Predictive Analytics
Moreover, it is important to discuss the impact of data exploration on predictive analytics. The quality of your data inputs determines the quality of your output, so once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here. Researchers estimate that data exploration, cleaning and preparation can take up to 70% of total project time. Analysts put such great effort into these tasks because they yield a high-quality, accurate output. Organizations allocate large budgets to them because they expect the process to produce a sound analysis, which is important for making decisions and predicting the future of the organization in order to achieve competitive advantage in the industry. Most organizations gain the ability to plan for the future and to set goals and targets based on the data outputs generated through this process.

Challenges Related to Data Exploration
Furthermore, different studies have identified challenges related to data exploration. User interaction with the database at various levels of request is still a challenge. The ability to automatically steer the database with minimal query formulation is a task that requires a high level of technical know-how, including versed use of programming instructions and their integration.

In their overview of data exploration techniques, Idreos et al. (2015) highlight the challenge data analysts experience at the user-interaction layer, suggesting that we still lack declarative "exploration" languages with which to present and reason about popular navigational idioms. Such languages could facilitate custom, user-driven optimizations, such as reusing past or in-progress query results and customizing visualization tools. Other future directions include processing past user-interaction histories to predict exploration trajectories and identify interesting exploration patterns. Similarly, at the database-system layer there are numerous opportunities to reconsider fundamental assumptions about data and storage patterns, and how they can be driven dynamically by high-level requests.

The overall idea of data exploration is to build data navigation systems that automatically steer users towards interesting data. Visualization tools should inherently support exploration, helping to identify patterns that are fully driven by the exploration paths users take. Exploration methods should be able to provide answers instantly, even if they are incomplete, while still eventually leading users towards interesting stories.


Used Resources

Idreos, S., Papaemmanouil, O. and Chaudhuri, S., 2015, May. Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 277-281). ACM.

Linoff, G.S. and Berry, M.J., 2011. Data mining techniques: for marketing, sales, and customer relationship management. John Wiley & Sons.

Matloff, N., 2011. The art of R programming: A tour of statistical software design. No Starch Press.

Ohlhorst, F.J., 2012. Big data analytics: turning big data into big money. John Wiley & Sons.

Data Exploration in Predictive Analysis – Romika Mehta (s4574112) & Nan Wu (4561246)

Data Exploration

In today's technology-savvy environment, predictive analysis has paramount importance. One of my mentors advised me to spend significant time on data exploration, data manipulation and data representation in order to learn predictive analysis. All three facets are crucial parts of analysis, and it is worth digging deeper into each section of predictive analysis. In this blog, I will discuss data exploration in detail. Data Exploration is an approach to understanding datasets and their characteristics. These characteristics can be:
1)      Completeness
2)      Correctness
3)      Size
4)      Consistency
5)      Uniqueness
6)      Relationships with other files.
All the above key features make data ready for exploration. Usually, data exploration is conducted by manual or automated processes, such as data profiling or data visualisation, to give the analyst an initial view into the data and an understanding of its key characteristics.
Manual drill-down or filtering of the data is carried out to find anomalies or to recognize patterns, and scripting languages are often used to get these insights out of datasets; these could be SQL, R, Excel or other tools available on the market.
All these activities are aimed at creating a clear mental model and some understanding of the data in the analyst's mind, from which the analyst can proceed to further analysis. Here I would like to introduce a new term: data quality. Assessing data quality is also done during exploration; it includes removing unusable parts of the data and correcting poorly formatted data.
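As a minimal data-profiling sketch in base R, the following checks several of the characteristics listed above (completeness, size, uniqueness) on the built-in airquality data set; the data set choice is ours, purely for illustration.

data(airquality)

str(airquality)              # size and types of the variables
summary(airquality)          # ranges, quartiles and NA counts per column
colSums(is.na(airquality))   # completeness: missing values per column
sum(duplicated(airquality))  # uniqueness: number of duplicated rows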

Why is Data Exploration required?
Data exploration plays a crucial role across the analytics industry, whether in the supply chain market, the insurance market, the real estate market, healthcare management, banking, business organisations or many more. Data exploration is done in every part of a business.

And data is everything for organisations. Organisations set up analytics teams and try to find out new facts about business entities such as:
·         Customers
·         Employees
·         Products
·         Management
·         Sales
For example, data analytics teams try to find out about:
Customers: customer retention, new customers, existing customers.
Employees: employee innovation programmes, employee retention, training and task allocation.
Products: product demand, product supply in the market, production.
Management: existing use of resources, better utilisation of resources.
Sales: existing sales, drops in sales, and setting targets to achieve sales.
These findings are also known as knowledge discovery from data, and they are based on data mining, SQL or R. Data exploration is the key that enables such discoveries, whereas other techniques, such as data manipulation and data visualisation, serve to polish the data sets.

Having acknowledged the importance of data exploration, the first question that comes to mind is which tools we should use to explore data. Let's talk in detail about existing tools. First of all, data exploration faces many challenges:
·         Largely due to the rise of big data, data is not only big but also highly diverse in terms of structure.
·         Data is coming from new sources, and today's big data is distinguished by its size, diversity and newness.
·         You therefore need data exploration tools built for exploring a wide range of both new big data and enterprise data.

Features of tools:
Search technology for exploring diverse data types. Data exploration should be as easy as Google, and search technology satisfies that requirement.
High ease of use for user productivity. The tool should be easy enough that it conveys its message and users can be highly productive.
Short time to use, short time to business value. A data exploration capability with high ease of use enables a wide range of users to get acquainted with data quickly, yet keep digging deeper over time for new business facts and the opportunities they expose.
Query capabilities in support of data exploration. Technical users, in particular, depend on query capabilities to find just the right data and to structure the result set of a query so it is ready for immediate use in analytic applications.
Support for all major data platforms, ranging from relational databases to Hadoop. A modern data exploration tool needs to go where the data lives; given the expanded range of data types now in play in analytics, it takes multiple platforms to manage diverse data in its original form.
Tools Available in Market
·         Trifacta – a data preparation and analysis platform
·         Paxata – self-service data preparation software
·         Alteryx – data blending and advanced data analytics software
·         IBM InfoSphere Information Analyzer – a data profiling tool
·         Microsoft Power BI – interactive visualization and data analysis tool
·         OpenRefine – a standalone open-source desktop application for data clean-up and transformation
·         Tableau – interactive data visualization software

While we are discussing data exploration, predictive analysis comes to mind as well: how is predictive analysis different from data exploration?

Predictive analysis is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The goal of predictive analysis is prediction, or finding meaningful information in the distinguished patterns. This process involves three stages:
1)   Data Exploration: in this initial stage, data is explored from various sources, be it a relational database, structured data, unstructured data, weblogs or big data.
2)   Model Building: in this stage, patterns are recognized, systematic relationships among the variables are detected, and the accuracy of the results is checked after applying various methods such as regression, correlation, standard deviation, mean, median or mode.
3)   Deployment: in this last stage, after a meaningful pattern or a clear picture has been detected, predictions are made for certain aspects based on the findings of stage two. A compact sketch of the three stages follows below.
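Here is a compact base-R sketch of the three stages, using the built-in mtcars data set; the data set and variables are our illustrative choices, not a prescribed workflow.

data(mtcars)

# Stage 1 - Data Exploration: inspect the variables of interest
summary(mtcars[, c("mpg", "wt", "hp")])

# Stage 2 - Model Building: detect the relationship between the variables
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)$r.squared             # how well the detected pattern fits

# Stage 3 - Deployment: predict for new, unseen inputs
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 200))
predict(model, newdata = new_cars)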


From the above stages, we can begin to differentiate data exploration and predictive analysis. Let's take another example and comprehend it in more detail.
In the data exploration phase, you go through the dataset at hand and understand the different types of variables, try to identify any trends in the data, understand missing values and outlier values, understand the distribution of the data, and so on. Classification can be done during the data exploration stage; however, it would be a very basic one: you would use the values of each variable independently and bucket them to classify the data points.
To differentiate data exploration and predictive analysis with an example, let's say the managers of a KFC restaurant want to find out sales in a certain area for a certain age group. To get that data, the analytics team will explore the data using validated or verified variables from the given data set. The variables could be:
1)      MemberID
2)      Age group
3)      Ordered meal
4)      Total price
5)      Postcode
6)      Date

Out of this dataset, the analytics team will do exploration and shrink the dataset down to the region and age group of interest. They will perform pattern recognition and look for correlations among the given variables using statistical and mathematical techniques. After the dataset has been shrunk down, predictive analysis comes to a conclusion; based on these analyses, analysts can forecast or predict the sales of a certain meal to a certain age group in a certain region at a certain time of the year. A sketch of this kind of exploration follows below.
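Here is a minimal base-R sketch of that exploration step; the sales data frame is hypothetical, invented purely to mirror the variables listed above.

# Hypothetical KFC-style sales records (invented for illustration)
sales <- data.frame(
  MemberID = 1:8,
  AgeGroup = c("18-25", "18-25", "26-35", "26-35", "18-25", "36-45", "18-25", "26-35"),
  Meal     = c("Zinger", "Wings", "Zinger", "Burger", "Zinger", "Wings", "Zinger", "Burger"),
  Price    = c(12, 9, 12, 10, 12, 9, 12, 10),
  Postcode = c(3046, 3046, 3056, 3046, 3046, 3023, 3046, 3056)
)

# Exploration: shrink the dataset to one region, then summarize sales
# by age group and meal - the input a predictive model would build on
region <- subset(sales, Postcode == 3046)
aggregate(Price ~ AgeGroup + Meal, data = region, FUN = sum)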
Based on the above scenario, here are some key points about data exploration:
1. Data exploration is about describing the data through statistical and visualization techniques. We explore data to bring its important aspects into focus for further analysis.
2. Data exploration is essentially the first step in any data analysis.
3. Data exploration is a prerequisite for predictive analytics.
Data Exploration Using R
R is an emerging language that is widely used for handling big data. For any predictive model, it is necessary to use some language to get accurate values, and R is a preferred language for providing insights into data during exploration. Below are some hypothetical examples to give an idea of the language (the rx* functions shown come from the RevoScaleR package shipped with Microsoft R / Revolution R Enterprise).
Function to import data (here from an SPSS file) into the XDF format:
rxImport(inData = spssFileName, outFile = xdfFileName, overwrite = TRUE, reportProgress = 0)
Function for summarizing a variable by group:
rxSummary(~ score:ageGroup, data = DF, transforms = list(ageGroup = cut(age, seq(0, 30, 10))))
Function for cross-tabulated summaries (cubes):
rxCube(score ~ ageGroup, data = DF, transforms = list(ageGroup = cut(age, seq(0, 30, 10))))
Function for describing a text data source:
RxTextData(file, stringsAsFactors = FALSE, colClasses = NULL, colInfo = NULL)
Function for Histogram
## Numeric Variables
rxHistogram(~ Close, data = djiXdf)
## Categorical Variable:
rxHistogram(~ DayOfWeek, data = djiXdf)
## Panel
rxHistogram(~ Close | DayOfWeek, data = djiXdf)
## Numeric Variables with a frequency weighting:
rxHistogram(~ Close, data = djiXdf, fweights = "Volume")

Techniques used to explore data
1. Steps of data exploration and preparation
2. Missing value treatment
§  Why is missing value treatment required?
§  Why does data have missing values?
§  Which methods are used to treat missing values?
3. Techniques of outlier detection and treatment
§  What is an outlier?
§  What are the types of outliers?
§  What are the causes of outliers?
§  What is the impact of outliers on a dataset?
§  How do we detect outliers?
§  How do we remove outliers?

Steps of Data Exploration and Preparation
This is the very first technique and includes the collection and cleansing of data; 70% of the job is to collect the data and make it uniform. At this stage, we usually perform three kinds of analysis:
·         Variable identification: the predictor (input) and target (output) variables are decided.
·         Univariate analysis: analysis based on categorical and continuous variables, considered one at a time.
·         Bi-variate analysis: analysis of the association and disassociation between different variables (see the sketch below).
After these preliminary analyses, the training dataset is ready. On this training dataset we now perform further treatments, such as handling missing values and outliers.
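As a small illustration of the univariate and bi-variate steps, here is a base-R sketch on the built-in mtcars data set; the variable choices are ours, purely for illustration.

data(mtcars)

# Univariate analysis: one variable at a time
summary(mtcars$mpg)                  # continuous variable
table(mtcars$cyl)                    # categorical variable

# Bi-variate analysis: associations between pairs of variables
cor(mtcars$mpg, mtcars$wt)                        # continuous vs continuous
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)   # continuous vs categorical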
Missing value treatment
Missing data in the training data can reduce the accuracy of a model and lead to biased predictions, so missing value treatment is vital in the early stages of data exploration.
Table with missing information:

Name     Gender    Postcode    Funds
Steve    M         3046        Y
Rose     F         3056        Y
Mary               3046        N
James              3023        Y
Pete     M         3056        Y

Result computed from the data with missing values:

Gender    Funds    %Funds
F         1        100
M         2        100


Table with complete information:

Name     Gender    Postcode    Funds
Steve    M         3046        Y
Rose     F         3056        Y
Mary     F         3046        N
James    M         3023        Y
Pete     M         3056        Y

Result computed from the complete data:

Gender    Funds    %Funds
F         2        50
M         3        100

From the above example, we can clearly see that data with missing values cannot produce accurate results: with the missing genders dropped, both groups appear to be 100% funded, whereas the complete data shows that only 50% of females hold funds. The sketch below reproduces this distortion.
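This can be reproduced with a few lines of base R; the data frame simply re-enters the example records from the tables above.

# The example records, with the two missing Gender values as NA
funds <- data.frame(
  Name     = c("Steve", "Rose", "Mary", "James", "Pete"),
  Gender   = c("M", "F", NA, NA, "M"),
  Postcode = c(3046, 3056, 3046, 3023, 3056),
  Funds    = c("Y", "Y", "N", "Y", "Y")
)

# Group summary with NA rows silently dropped: both groups look 100% funded
aggregate(Funds ~ Gender, data = funds, FUN = function(x) mean(x == "Y") * 100)

# The same summary once the missing genders are completed (Mary = F, James = M)
funds$Gender[is.na(funds$Gender)] <- c("F", "M")
aggregate(Funds ~ Gender, data = funds, FUN = function(x) mean(x == "Y") * 100)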
But why do we get data with missing values, nulls and empty columns? Let's discuss this in detail.
Reasons for missing values
We have seen the significance of treating missing values in datasets. Missing values may arise at two stages: data extraction and data collection.
Data extraction: it is quite possible to get missing values during data extraction, so we should always check the data at this stage for anomalies and compare it with the data at the source level. At this stage, problems can easily be detected and rectified.
Data collection: errors or anomalies that occur at data collection time are difficult to rectify. Errors occurring at this stage are usually categorized into four types:
·         Missing completely at random: each variable in the dataset has an equal chance of having missing values.
·         Missing at random: values are missing at random, but the missing ratio varies from variable to variable. For example, values may be missing for females at one rate and for males at a different rate. In this scenario, the results will be inaccurate and the issue is difficult to fix.
·         Missing values in dependent variables: this is the case when the predictors have all their values but the dependent variable is missing, so we cannot find an accurate result. This happens frequently in healthcare datasets: one variable has all its inputs, but the dependent variables are missing. In such situations no analysis can be done and, in short, data exploration is inaccurate.
·         Missing values that depend on the missing value itself: this is the case when the probability of a value being missing is directly correlated with the value itself. For example, people with higher or lower incomes are likely to give no response about their earnings.
Methods to treat missing values
Deletion: this includes listwise deletion and pairwise deletion.
Listwise deletion removes an entire record from the analysis if any single value in it is missing. In the table below, Mary and James would be dropped completely because their Gender is missing:

Name     Gender    Postcode    Funds
Steve    M         3046        Y
Rose     F         3056        Y
Mary               3046        N
James              3023        Y
Pete     M         3056        Y

Pairwise deletion, by contrast, uses all the values that are available for each individual analysis. In the table below, the last record is missing a Name and Rose is missing a Funds value, yet both records can still contribute to any analysis that does not involve the missing field:

Name     Gender    Postcode    Funds
Steve    M         3046        Y
Rose     F         3056
Mary               3046        N
James              3023        Y
         M         3056        Y
Prediction model: this is a widely used and more sophisticated way to handle missing values. In this approach, a model is built over the complete records and used to estimate and fill in the closest value for each missing entry; it is based on probability.
  1. KNN imputation: in this method, a missing value is imputed using the values of the most similar records, where the similarity of two records is determined using a distance function. It has certain advantages and disadvantages, and a sketch follows below.
    • Pros:
      • k-nearest neighbours can predict both qualitative and quantitative attributes
      • Creating a predictive model for each attribute with missing data is not required
      • Attributes with multiple missing values can be easily treated
      • The correlation structure of the data is taken into consideration
    • Cons:
      • The KNN algorithm is very time-consuming when analysing a large database, since it searches through the whole dataset looking for the most similar instances
      • The choice of k is critical: a higher value of k would include attributes that are significantly different from what we need, whereas a lower value of k implies missing out significant attributes
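Here is a minimal sketch of KNN imputation, assuming the third-party VIM package is installed (base R has no built-in KNN imputer); the data set and column choices are ours, purely for illustration.

# KNN imputation with the VIM package: install.packages("VIM")
library(VIM)

data(airquality)                     # built-in data set with missing values
colSums(is.na(airquality))           # inspect the missingness first

# Impute missing Ozone and Solar.R values from the 5 most similar records
completed <- kNN(airquality, variable = c("Ozone", "Solar.R"), k = 5)

# kNN appends indicator columns; the original columns are now complete
colSums(is.na(completed[, names(airquality)]))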

Outlier detection and handling
"Outlier" is a term commonly used by analysts; it refers to data that differs dramatically from the rest of the dataset. In simple words, if a dataset records the incomes of a certain group and two people's incomes are far higher than the rest, they may lead to inaccurate results. In this situation, the way to get an accurate result for the whole dataset is to treat those two incomes separately, for example by removing them from the dataset.
Reasons for outliers
Each time we come across outliers, we should establish why they appear in the dataset, because the cause of an outlier determines how to confront it. We broadly classify outliers into two categories:

Ø  Non-natural (artificial)
Ø  Natural
Data entry errors: human errors are very common during data collection. For example, if a person entering someone's income adds an extra zero by mistake, it can make a huge difference to accuracy and results.
Measurement errors: another type of error concerns measurement; for instance, a measurement taken on the wrong machine can lead to inaccurate results.
Experimental errors: while experimenting with new things, people may come across bizarre values, which can also lead to imprecise results.
Natural outliers: when a few values are completely different but still genuinely belong to the dataset, they are known as natural outliers.
How to remove outliers
The simple idea behind removing outliers is to treat them as a separate segment, so that they do not distort the whole dataset with very different values. A sketch of one common detection rule follows below.
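As a minimal base-R sketch, the conventional 1.5 * IQR boxplot rule can flag such values; the income figures are invented purely for illustration.

# Flag outliers with the conventional 1.5 * IQR boxplot rule
incomes <- c(42, 45, 47, 50, 52, 55, 58, 60, 400, 520)   # two suspicious values

q     <- quantile(incomes, c(0.25, 0.75))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

incomes[incomes < lower | incomes > upper]    # 400 and 520 are flagged

# Treat them as a separate segment rather than letting them skew the analysis
clean <- incomes[incomes >= lower & incomes <= upper]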

In synopsis, data exploration is the initial step in finding insights in data, and to achieve these goals we use R together with the techniques described above, which are quite helpful.

Submitted to:            Dr. Shah Miah, Victoria University, Melbourne
Submitted by:            Romika Mehta (s4574112) and Nan Wu (4561246)
Assignment 2(A):     Predictive Analysis – Semester 2, 2017