Sunday, 20 September 2015

Data Analysis in R

Data preparation is an essential activity in data analysis. Several statistical methods are commonly used in data analysis, including linear models, logistic regression, k-means clustering and decision trees. In this blog we explain these methods.

Data preparation

Data preparation, also known as data pre-processing, is the manipulation of data into a form suitable for further analysis and processing. Many different tasks are involved in this process, and they cannot be fully automated; most data preparation activities are tedious, routine and time-consuming. By some estimates, around 60% to 80% of the time spent on a data mining project goes to data preparation. It is essential to the success of such projects: careful preparation improves both the quality of the data and the quality of the data mining results. The process involves several steps, such as checking the data for accuracy, logging the data in, entering it into the computer and transforming it.
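A minimal sketch of these preparation steps in R (the data frame and column names here are hypothetical, for illustration only):

```r
# Hypothetical raw data with typical quality problems
raw <- data.frame(age = c(25, 41, -3, 38, NA),
                  income = c(50000, 62000, 58000, NA, 71000))

# Check the data for accuracy: flag impossible values as missing
raw$age[which(raw$age < 0)] <- NA

# Remove incomplete records
clean <- na.omit(raw)

# Transform the data: standardise a numeric column
clean$income_scaled <- scale(clean$income)
```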

Linear model

Linear modelling, or linear regression, is a statistical approach for modelling the relationship between a scalar dependent variable and one or more explanatory variables. There are two cases of linear regression:
  • simple linear regression, where there is one explanatory variable
  • multiple linear regression, where there are two or more explanatory variables.
In linear regression, the data are modelled using linear predictor functions, and the unknown model parameters are estimated from the data. Such models are called linear models.
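As a brief illustration, both cases can be fitted in base R with lm(); the data and variable names below are made up for the example:

```r
# Hypothetical data: one response and two explanatory variables
d <- data.frame(y  = c(2.1, 3.9, 6.2, 7.8, 10.1),
                x1 = 1:5,
                x2 = c(0.5, 0.9, 1.4, 2.2, 2.8))

simple <- lm(y ~ x1, data = d)        # simple linear regression
multi  <- lm(y ~ x1 + x2, data = d)   # multiple linear regression
summary(simple)                       # coefficients, R-squared, etc.
```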


[Image: linear model illustration]

Predictions

rxPredict() is the function used to generate predictions and residuals from various types of models. Some of its key arguments are as follows:


[Image: rxPredict() key arguments]
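A minimal sketch of how rxPredict() might be used with a model fitted by RevoScaleR's rxLinMod(); this assumes the RevoScaleR package is available, and the data set and variable names (salesData, Sales, Price) are hypothetical:

```r
library(RevoScaleR)

# Fit a linear model with RevoScaleR, then score the same data
linMod <- rxLinMod(Sales ~ Price, data = salesData)
scored <- rxPredict(modelObject = linMod,
                    data = salesData,
                    computeResiduals = TRUE)  # also return residuals
head(scored)
```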

Logistic regression

Next we move on to logistic regression, or the logit model, which is used to model binary outcome variables. The log-odds of the outcome are represented as a linear combination of the predictor variables. To extract the output in R, we can use the summary command on the fitted logit model:


[Image: summary() output for a logit model]
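In base R a logit model is fitted with glm() using the binomial family; the outcome and predictors below are hypothetical, chosen only to show the shape of the call:

```r
# Hypothetical binary outcome with two predictors
d <- data.frame(admitted = c(0, 0, 1, 1, 0, 1, 1, 0),
                gpa      = c(2.1, 2.5, 3.6, 3.8, 2.9, 3.4, 3.9, 2.4),
                rank     = c(4, 3, 1, 1, 3, 2, 1, 4))

logit1 <- glm(admitted ~ gpa + rank, data = d, family = binomial)
summary(logit1)     # log-odds coefficients, std. errors, p-values
exp(coef(logit1))   # coefficients converted to odds ratios
```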

Logit models need larger sample sizes than linear models because they rely on maximum likelihood estimation. In some situations with a dichotomous outcome and only a few cases, exact logistic regression can be used instead. It is hard to estimate a logit model when the outcome is rare, and this does not depend on the overall size of the data set.

k-means

Now we move to k-means clustering, a popular cluster analysis method in data analysis. Using the rxKmeans() function, we can easily create a good visualisation of a k-means clustering on a given data set.


[Image: k-means clustering visualisation]

The first thing to know about the rxKmeans function is its syntax, which includes the arguments formula, data, outFile, numClusters and so on. The core arguments are:
  • formula - specifies the variables supplied to the clustering algorithm.
  • data - the data set in which the formula's variables are to be found.
  • outFile - the data set to which the cluster IDs are written.
  • numClusters - the number of clusters (k) to estimate.
  • algorithm - together with further arguments, controls how the k clusters are computed.
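A minimal sketch of the same idea using base R's kmeans() as a stand-in for rxKmeans() (the arguments map closely: the input matrix plays the role of the formula/data pair, and centers plays the role of numClusters); the data are simulated:

```r
set.seed(42)  # make the clusters reproducible

# Simulated two-dimensional data with three well-separated groups
m <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))

km <- kmeans(m, centers = 3)   # k = 3 clusters
km$centers                     # estimated cluster centres
plot(m, col = km$cluster)      # quick visualisation, coloured by cluster
```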

Decision tree

Another technique popular in data mining is the decision tree. It is a critical tool that helps analysts work with large data sets, and it lends itself to good visualisation: the resulting plot lays out the decision rules for predicting a categorical or continuous outcome. Here is a brief look at the syntax used to grow a tree:

[Image: decision tree syntax]
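One common way to grow and plot such a tree is with the rpart package; a minimal sketch, using R's built-in iris data set for illustration:

```r
library(rpart)

# Grow a classification tree predicting species from the measurements
tree1 <- rpart(Species ~ ., data = iris, method = "class")

printcp(tree1)             # complexity table, useful for pruning decisions
plot(tree1); text(tree1)   # visualise the decision rules
```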

We would like to thank our mentor, Dr Shah Jahan Miah, for providing guidance and support during the activities, which has motivated us to compile our ideas and experience in this blog.
Authors:
Aleksandar Jankulov   s4518571
Komal Bhalla               s4531062
Haoyang Di                 s3812986

Saturday, 19 September 2015

Introduction to Predictive Analytics



by Bakhshash Kaur Saini, Mekete Berhanie & Nuwan Ramawickrama

If the phrase “predictive analytics” makes you think of your local fortune teller, we have got it wrong somewhere. Predictive analytics is a process of identifying patterns and relationships that may hold in the future by analysing existing data. It is not fortune telling, but it can become a factor that decides the fortune of your business. With new trends such as the IoT (Internet of Things), big data and one-to-one marketing, the marketplace has changed drastically. Niche players such as Uber and Airbnb are booming because of the recent burst of technological growth. As technology advances every second, the quick changes in the market have created huge value for knowledge about the future. Even though we would like to know the future, there is no proven way of looking into it. That is where predictive analytics comes into the picture, with its ability to predict market trends and patterns using existing data.

Predictive analytics is heavily based on statistical applications and mathematical modelling. The need for predictive modelling and analysis has made “data scientist” one of the most talked-about positions of the 21st century. The HBR article “Data Scientist: The Sexiest Job of the 21st Century” by Thomas H. Davenport and D.J. Patil discusses how important this job can be, and how data scientists are becoming key players in organisations.
The popular writer Bernard Marr, in one of his articles, discusses the skill set one should possess to be successful in the data science field. The qualities are as follows.
  • Multidisciplinary. It is not only maths PhDs who succeed; you can come from various other fields.
  • Business savvy. Regardless of your degree, you must understand business.
  • Analytical. Naturally analytical, able to spot patterns.
  • Good at visual communication. Able to produce the right graph at the right time.
  • Versed in computer science. Familiarity with Hadoop, R, Java, Python, etc. is in high demand.
  • Creative. Creative enough to find answers to questions with the existing data.
  • Able to add significant value to data.
  • A storyteller. Able to craft a story that creates value for the organisation.
Anyone interested in a future in predictive analytics should develop the qualities above. To get the process started, this blog contains a brief tutorial on R basics, built around a case study on social determinants of health and finding value in existing data.


A bit about the Case study

SDOH: SOCIAL DETERMINANTS OF HEALTH

According to the WHO (World Health Organization), people die young or have poor health partly based on where they live and what they do (WHO 2015). The WHO has identified nine social determinants of health that affect people of low socioeconomic status, who have less access to them (WHO 2015). The nine social determinants of health are:

  • Social gradient: mortality rates are higher in communities with poor socioeconomic status.
  • Stress, at work or in life generally.
  • Early childhood development, including conditions such as alcohol or drug use during pregnancy.
  • Social exclusion, such as racism.
  • Unemployment.
  • Social support networks.
  • Addiction.
  • Availability of healthy food.
  • Availability of transportation.

SDOH DATA FROM ADELAIDE UNIVERSITY 

Socioeconomic data was extracted and transformed for predictive analytics on social health. The data can be acquired from the University of Adelaide at www.publichealth.gov.au (Adelaide 2015).
The data covers the demography and socioeconomic status of the Australian population across the states, broken down by local government area (LGA). The values are percentages of the population, so they are scaled before statistical analysis.
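The scaling mentioned above can be done in R with scale(), which centres each column and divides it by its standard deviation; the column names below are hypothetical stand-ins for the LGA-level variables:

```r
# Hypothetical LGA-level percentages
sdoh <- data.frame(Obese      = c(18, 25, 31, 22),
                   Unemployed = c(4.2, 6.8, 9.1, 5.5))

sdoh_scaled <- scale(sdoh)          # each column now has mean 0, sd 1
round(colMeans(sdoh_scaled), 10)    # column means are approximately zero
```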




R Basics

R is a versatile statistical computing package commonly used in predictive analytics. In this section we look at a few basic functions that can get you started in R.

Uploading data into R

There are a few ways to upload data. The most common and easiest is the read.csv command. It can be used as follows:
Convert your data file into a “.csv” file.
Enter the following in your R console. Start by creating an object with any name (e.g. data1), followed by the file-loading command.

data1<-read.csv(file.choose(), header=T)

The “file.choose” command is an easy way to browse for the required file without specifying the actual file path: it opens a new window in which you can find the file you need.

Exploration

To start exploring we can generate a summary of the data set.

summary(data1)

A similar command, head(), shows you the first six entries of each variable.

head(data1)

To find out what type of data a variable holds, you can run the following commands (replace YourVariable with the name of a column in your data):

class(data1)
class(data1$YourVariable)





Basic Visualisations with R



Scatter plot

To make a scatter plot you can use the plot command in the following manner. 

plot(
  data1$Obese, data1$Smokers,   # the x and y variables you are looking at
  xlab="Obese People %",        # x-axis label
  ylab="Smokers %",             # y-axis label
  main="Obesity vs Smoking"     # main title, shown at the top of the plot
)


To show the mean in your scatter plot:
mean.ob <- mean(data1$Obese)                               # mean of the required variable
plot(Obese ~ Full.time.Education.at.age.16, data=data1)    # creates the scatter plot
abline(h=mean.ob)                                          # draws a horizontal line at the mean


Linear regression
model1 <- lm(Obese ~ Full.time.Education.at.age.16, data=data1)   # creates a linear model
model1                                                            # prints the model
abline(model1, col="red")                                         # adds the regression line to the plot

Multiple linear regression
model2 <- lm(Obese ~ Full.time.Education.at.age.16 + Unemployed, data=data1)


3D scatter plots
The following commands can be used to create 3D scatter plots:
install.packages("scatterplot3d")   # install the required package
require(scatterplot3d)              # load the library
scatterplot3d(data1[3:5])           # create the 3D scatter plot

Interactive 3D scatter plot

library("rgl")            # load the libraries, or install them first if you don't have them
library("RColorBrewer")
plot3d(data1$Obese, data1$Smokers, data1$Unemployed,
       xlab="Obesity", ylab="Smokers", zlab="Unemployed",
       col=brewer.pal(3,"Dark2"), size=8)


Basic predictions
You can make basic predictions using the linear model created earlier:
predict(model1)   # R predicts the outcomes using the linear model
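To predict for values that are not in the original data, predict() also accepts a newdata argument; the value 40 below is purely illustrative:

```r
# Predict obesity for a hypothetical LGA where 40% of people stayed in
# full-time education at age 16
predict(model1, newdata = data.frame(Full.time.Education.at.age.16 = 40))
```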

Thursday, 17 September 2015

Tutorial for Educational use

Predictive Analytics
Written by Amandeep Kaur Sekhon and Ruhi Saini

As postgraduate students at Victoria University, we got the chance to work with the R tool for predictive analytics under the guidance of our respected professor, Shah Miah, who set this blog as part of our assessment. Having learnt to visualise data in Business Analytics, we have now learnt to use the R tool in Predictive Analytics.

Predictive analytics is the branch of advanced analytics used to make forecasts about unknown future events. It draws on many techniques from data mining, statistics, modelling, machine learning and artificial intelligence to analyse current data and make predictions about the future.

Different aspects of predictive analytics:
·         Data exploration
·         Data manipulation
·         Data representation

We decided to investigate data on vehicle impoundments over the last few years and to draw some conclusions from that data for the coming years. The number of vehicle impoundment cases is considerable, and it varies across age groups.
Vehicle impoundment is the legal procedure of placing a vehicle into an impound lot, a holding area for cars until they are either returned to the owner's possession or sold for the benefit of the impounding agency. Vehicles may be impounded by government agencies (usually municipalities) when:
·         There are unresolved parking violations of a certain age, possibly above a total fine threshold
·         In certain cases, the vehicle is parked in violation of a parking ordinance (a “tow-away zone”)
·         The vehicle's registrant has certain unresolved moving violations
·         The vehicle is collected as evidence of a potential crime (e.g. a felony or drug smuggling)
·         In some jurisdictions, as part of the repossession of a vehicle by a lessor or lender
For the purpose of analysing the data, we used vehicle impoundment data for Victoria. We visited numerous sites, such as the Australian Bureau of Statistics, the Parliament of Australia and police sites. Gathering information is usually the challenging part, and here a specific form of data was required. In the end we successfully collected the data from the Victoria Police site, cleansed it according to our requirements and uploaded it to the R tool.
Working with R
·         First of all, we need to download R. Different packages are available, and one can download and install them according to the memory available; there are different packages for Windows and Mac OS X.
·         Then the cleansed data needs to be loaded into R. The command that can be used to load it is pred <- read.table("file with the specific path", header=T, sep=",") (use header=F if the file has no header row).

·         Then we can execute a number of commands to represent and explore the data in different ways. Some examples are shown in the screenshots below:





fix(pred)   # opens the data in the R Data Editor

Predicted data for the years 2015, 2016 and 2017
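Forecasts like these can be produced by fitting a linear trend to the historical counts and extrapolating; the figures below are hypothetical, not the actual Victoria Police data:

```r
# Hypothetical yearly impoundment counts
pred <- data.frame(Year = 2010:2014,
                   Impoundments = c(5200, 5650, 6100, 6480, 6900))

trend <- lm(Impoundments ~ Year, data = pred)            # fit a linear trend
predict(trend, newdata = data.frame(Year = 2015:2017))   # forecast the next three years
```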

In this way, working with R and using R commands and functions, data can be represented, explored and analysed in an efficient way.
Many thanks to

Mr. Shah Miah



Predictive Analytics Blog : A Move towards the Future (Tutorial/Assignment Material for Educational Use only)

The amazing technology through which you can predict the future outcomes and trends by extracting information and creating patterns from the existing historical data sets is known as Predictive Analytics.

This blog post has been written as part of the assignment for Predictive Analytics studied at Victoria University. We are very grateful to our lecturer Dr Shah Miah, who helped us understand the value and scope of Predictive Analytics.

Competition in the market is rising, and there are loads of data sitting in databases, also known as big data. This big data comprises historical data sets and can be used to produce valuable information about a company and its operations and processes. The information is extracted with the help of data mining tools, and a predictive model is built. The model is trained on this historical data; when a prediction needs to be made for new real-time data, the data is run through the model and it predicts the outcomes. These outcomes about the future help us make better-informed decisions.

A 2014 report found that the top five things predictive analytics is used for are to:
  • Identify trends.
  • Understand customers.
  • Improve business performance.
  • Drive strategic decision making.
  • Predict behaviour.

Predictive analytics is used in almost every field: fraud detection and security, marketing, operations and risk. The most common areas of application are credit cards, banking and financial services, government and the public sector, health care providers, manufacturing, media and entertainment, oil, gas and utility companies, retail, sports franchises, health insurers and insurance companies.

We decided to base our predictive analytics study on a healthcare example. According to most journals, obesity is considered one of the nemeses of the Australian economy. The prevalence of obesity across Australia has been increasing in recent years and is expected to continue this trend until a proper solution is found. We collected data from the Australian Bureau of Statistics and some health journals, which gave us the percentage of the Australian population that is obese. The screenshot below shows the data from the year 2000 to 2015 at five-year intervals.

We consolidated the given data into a flat file and used the history to calculate predictions of obesity prevalence in 2020 and 2025. We used the mean and variance of the historical figures to find the trend in the increase of obesity over the years. This trend was then applied to the existing data to fill in the predicted percentage increase that would happen in the future. The screenshot below, of the Excel file with the consolidated data, shows the final figures: the percentage increase that would probably occur if the problem is not taken care of.
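The gap analysis described above amounts to averaging the increase per five-year step and carrying it forward; a sketch with hypothetical prevalence percentages (not the actual ABS figures):

```r
# Hypothetical obesity prevalence (%) at five-year intervals
years <- c(2000, 2005, 2010, 2015)
obese <- c(19.8, 21.7, 24.6, 27.5)

step <- mean(diff(obese))                         # average increase per five years
forecast <- obese[length(obese)] + step * (1:2)   # carry the trend forward
data.frame(Year = c(2020, 2025), Obese = forecast)
```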


What we show here is a form of data representation: presenting the data in the desired format for study purposes. As data analysts, we have to appreciate the importance of collecting data from the past in order to study the patterns that develop in particular fields.


In our case study, predictive analytics is used for research purposes. With the predicted figures above, the government can act on the various factors that cause this sickness among the young people of Australia, concentrating on particular areas or suburbs and creating awareness centres to curb this economic nemesis. The same type of study can be applied in many other fields. The sales industry can build an individual portfolio for each customer, mining their data history for buying-behaviour patterns in order to sell them desirable products. Another example is the human resources industry, which can plan its operations to meet a company's workforce requirements ten years in advance rather than thinking only of immediate prospects.

To conclude, this approach to prediction can also be used in many other cases to forecast the future outcomes of a study or business. The values it produces are not real and perfect, but approximations of the future. It is fair to say that predictive analytics is the future of business intelligence.

Thank You.

Pranav Gulati
Karthik Nagaraj