
When and why to standardize a variable

This tutorial explains when, why and how to standardize a variable in statistical modeling. Variable standardization is one of the most important concepts in predictive modeling and is a common preprocessing step when building a predictive model. Standardization is also called normalization or scaling.

Standardization / Scaling

The concept of standardization comes into the picture when continuous independent variables are measured on different scales, which means they do not contribute equally to the analysis. For example, suppose we are performing a customer segmentation analysis in which we are trying to group customers based on their homogenous (similar) attributes. A variable called 'transaction amount' that ranges between $100 and $10000 carries more weight than a variable such as number of transactions, which generally ranges between 0 and 30. Hence, it is required to transform the data to comparable scales. The idea is to rescale the original variables to have equal ranges and/or variances.


Methods of Standardization / Normalization

There are four main methods of standardization. They are as follows -

1. Z score

Z score standardization is one of the most popular methods to normalize data. In this case, we rescale the original variable to have a mean of zero and a standard deviation of one.
Z = (x - mean(x)) / sd(x)
Mathematically, the scaled value is calculated by subtracting the mean of the original variable from the raw value and then dividing it by the standard deviation of the original variable.

R Code : Standardize a variable using Z-score
# Creating a sample data
set.seed(123)
X =data.frame(k1 = sample(100:1000,1000, replace=TRUE),
              k2 = sample(10:100,1000, replace=TRUE))
X.scaled = scale(X, center= TRUE, scale=TRUE)
In the scale() function, center = TRUE implies subtracting the column mean from each value of the variable, and scale = TRUE implies dividing each centered column by its standard deviation.

Check Mean and Variance of Standardized Variable
colMeans(X.scaled)
Result : 0 for both k1 and k2
var(X.scaled)
Result : 1 for both k1 and k2

Interpretation
A value of 1 implies that the case is one standard deviation above the mean, while a value of -1 indicates that the case is one standard deviation below the mean.
Important Point
The standardized values do not lie in a fixed interval; they can be any real number.
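For instance, a quick check on the X.scaled object created above (a minimal sketch, base R only):
range(X.scaled[, "k1"])
# For this roughly uniform sample the scaled values stay within about +/-1.7,
# but for skewed or heavy-tailed data values far beyond +/-3 are possible.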

2. Min-Max Scaling

It is also called 0-1 scaling because the standardized value using this method lies between 0 and 1.

The formula is shown below -
(x - min(x)) / (max(x) - min(x))
This method makes the range of every variable equal (0 to 1), while the means and standard deviations remain different.
library(dplyr)
mins= as.integer(summarise_all(X, min))
rng = as.integer(summarise_all(X, function(x) diff(range(x))))
X.scaled = data.frame(scale(X, center= mins, scale=rng))
Check Min and Max of standardized variables
summarise_all(X.scaled, funs(min, max))
k1_min k2_min k1_max k2_max
 0      0      1      1

3. Standard Deviation Method

In this method, we divide each value by the standard deviation. The idea is to have equal variance, but different means and ranges. Formula : x/stdev(x)
X.scaled = data.frame(scale(X, center= FALSE , scale=apply(X, 2, sd, na.rm = TRUE)))
Check Equal Variance
summarise_all(X.scaled, var)
Result : 1 for both the variables

4. Range Method

In this method, we divide each value by the range. Formula : x / (max(x) - min(x)). In this case, the means and variances of the variables still differ, but every variable's range (max minus min) becomes exactly 1, so the variables span comparable widths.
library(dplyr)
rng = as.integer(summarise_all(X, function(x) diff(range(x))))
X.scaled = data.frame(scale(X, center= FALSE, scale=rng))
summarise_all(X.scaled, var)
var(k1) = 0.08614377
var(k2) = 0.08833861

What is Centering?

Centering means subtracting a constant value from every value of a variable. The constant can be the average, the minimum or the maximum; most of the time, the average is used.
X=sample(1:100,1000, replace=TRUE)
scale(X,center = TRUE, scale=FALSE)
By default, the scale() function with center = TRUE subtracts the mean from every value of a variable.


When is it important to standardize variables?

1. It is important to standardize variables before running Cluster Analysis. It is because cluster analysis techniques depend on the concept of measuring the distance between the different observations we're trying to cluster. If a variable is measured at a higher scale than the other variables, then whatever measure we use will be overly influenced by that variable.

2. Prior to Principal Component Analysis, it is critical to standardize variables, because PCA gives more weight to variables that have higher variances than to variables with very low variances. In effect, the results of the analysis would depend on the units of measurement used for each variable. Standardizing the raw values gives every variable equal variance, so no variable dominates simply because of its measurement scale.

3. It is required to standardize variables before using k-nearest neighbors with a Euclidean distance measure, so that all variables contribute equally to the distance.

4. SVM kernel methods are based on distances, so it is required to scale variables prior to running the final Support Vector Machine (SVM) model.

5. It is necessary to standardize variables before using Lasso and Ridge regression. Lasso regression puts a constraint on the size of the coefficients, and the penalty each coefficient receives depends on the scale of its variable. Centering the variables also means there is no longer an intercept. This applies equally to ridge regression.

6. In regression analysis, we can rank the importance of independent variables by the absolute value of their standardized coefficients, in descending order.

7. In regression analysis, when an interaction term is created from two variables that are not centered on 0, some amount of collinearity is induced. Centering the variables first addresses this potential problem. In simple terms, with non-centered variables, when X1 is large, X1*X2 is also large in absolute terms irrespective of X2, so X1 and X1*X2 end up correlated.

8. In regression analysis, it is also helpful to center a variable when you include power terms such as X², since centering reduces the collinearity between X and X² (see the R sketch after this list).
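The following small R sketch (illustrative only, not part of the original example) shows how centering shrinks the correlation between a predictor and its interaction or squared term:
# Simulated predictors on arbitrary, non-centered scales
set.seed(1)
x1 <- rnorm(1000, mean = 50, sd = 5)
x2 <- rnorm(1000, mean = 30, sd = 3)
cor(x1, x1 * x2)      # markedly positive: the raw interaction tracks x1's magnitude
cor(x1, x1^2)         # very close to 1
# Center first, then build the interaction and the square
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
cor(x1c, x1c * x2c)   # near 0
cor(x1c, x1c^2)       # near 0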

When it is not required to standardize variables

1. If you think the performance of a linear regression model will improve when you standardize the variables, that is incorrect. Standardization does not change the RMSE, R-squared, adjusted R-squared, or the p-values of the coefficients. See the detailed R script below; it shows that standardization does not affect model performance at all.

Without Standardization -
# Create Sample Data
set.seed(123)
train <- data.frame(X1=sample(1:100,1000, replace=TRUE),
                    X2=1e2*sample(1:500,1000, replace=TRUE),
                    X3=1e-2*sample(1:100,1000, replace=TRUE))
train$y <- with(train,2*X1 + 3*1e-2*X2 - 5*1e2*X3 + 1 + rnorm(1000,sd=10))

#Fit linear regression model
fit  <- lm(y~X1+X2+X3,train)
summary(fit)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  1.216e+00  9.732e-01    1.25    0.212  
X1           1.984e+00  1.089e-02  182.19   <2e-16 ***
X2           3.000e-02  2.188e-05 1371.21   <2e-16 ***
X3          -4.990e+02  1.070e+00 -466.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.849 on 996 degrees of freedom
Multiple R-squared:  0.9995,Adjusted R-squared:  0.9995
F-statistic: 6.799e+05 on 3 and 996 DF,  p-value: < 2.2e-16
# create test dataset
test <- data.frame(X1=sample(-5:5,100,replace=TRUE),
                   X2=1e2*sample(-5:5,100, replace=TRUE),
                   X3=1e-2*sample(-5:5,100, replace=TRUE))
# predict y based on test data without standardization
pred   <- predict(fit,newdata=test)
With Standardization
# Standardize predictors
means   <- sapply(train[,1:3],mean)
stdev <- sapply(train[,1:3],sd)
train.scaled <- as.data.frame(scale(train[,1:3],center=means,scale=stdev))
head(train.scaled)
train.scaled$y <- train$y
# Check mean and Variance of Standardized Variables
library(dplyr)
summarise_at(train.scaled, vars(X1,X2,X3), funs(round(mean(.),4)))
summarise_at(train.scaled, vars(X1,X2,X3), var)
Result : Mean is 0 and Variance is 1 for all the standardized variables
#Fit Scaled Data
fit.scaled <- lm(y ~ X1 + X2 + X3, train.scaled)
summary(fit.scaled)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  598.4244     0.3114  1921.4   <2e-16 ***
X1            57.0331     0.3130   182.2   <2e-16 ***
X2           428.6441     0.3126  1371.2   <2e-16 ***
X3          -145.8587     0.3129  -466.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.849 on 996 degrees of freedom
Multiple R-squared:  0.9995,Adjusted R-squared:  0.9995
F-statistic: 6.799e+05 on 3 and 996 DF,  p-value: < 2.2e-16

Compare Coefficients, R-Squared and Adjusted R-Squared
The coefficient values are not the same when we run the regression with and without standardizing the independent variables. This does not mean the model is affected by scaling. The values differ because the coefficients are slopes - how much the target variable changes when an independent variable changes by 1 unit - and after standardization one 'unit' is one standard deviation. In other words, standardization simply rescales the slopes. The multiple R-squared and adjusted R-squared values are exactly the same.
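A quick check, using the fit, fit.scaled and stdev objects created in the scripts above, makes the relationship explicit: each standardized slope equals the unstandardized slope multiplied by the standard deviation of its predictor.
# Standardized slope = unstandardized slope * sd(X)
coef(fit)[-1] * stdev
coef(fit.scaled)[-1]
# Both commands print (approximately) the same three values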

How to standardize validation / test dataset

To standardize the validation and test datasets, we use the mean and standard deviation of the independent variables from the training data and apply them to the test dataset using the Z-score formula. See the formula below -
Z = (X_test - Xbar_training) / Stdev_training 
R Script - Standardize Test Data
test.scaled <- as.data.frame(scale(test,center=means,scale=stdev))
head(test.scaled)
Compare Prediction and RMSE
# predict y based on new data scaled, with fit from scaled dataset
pred.scaled   <- predict(fit.scaled,newdata=test.scaled)
# Compare Prediction - unscaled vs. scaled fit
all.equal(pred,pred.scaled)
# RMSE on train data with un-scaled fit
pred_train   <- predict(fit,newdata=train)
rmse <- sqrt(mean((train$y - pred_train)^2))
# RMSE on train data with scaled fit
pred_train.scaled   <- predict(fit.scaled,newdata=train.scaled)
rmse.scaled <- sqrt(mean((train$y - pred_train.scaled)^2))
# Compare RMSE
all.equal(rmse,rmse.scaled)
RMSE is the same in both cases (9.829196), because RMSE is on the scale of Y (the target variable), which we did not standardize. The predictions are also unchanged.

Interpretation of Standardized Regression Coefficient
Most modern statistical software automatically produces standardized regression coefficients, which are an important metric for ranking predictors. Their interpretation is slightly different from unstandardized estimates: a standardized coefficient is the number of standard deviations Y changes for a one standard deviation increase in X.
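If your software does not report them, standardized coefficients can also be computed from an unstandardized fit. A minimal sketch using the train data and fit object from the example above (the formula beta_std = beta * sd(x) / sd(y) is the standard one; the object names are those defined earlier):
# Fully standardized coefficients: beta_j * sd(x_j) / sd(y)
std_coef <- coef(fit)[-1] * sapply(train[, c("X1", "X2", "X3")], sd) / sd(train$y)
# Rank predictors by absolute standardized coefficient
sort(abs(std_coef), decreasing = TRUE)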

Correlation with or without Centering / Standardization

The correlation score does not change if you perform correlation analysis on centered and uncentered data.
X=sample(1:100,1000, replace=TRUE)
Y=1e2*sample(1:500,1000, replace=TRUE)
cor(X,Y)
cor(X-mean(X), Y-mean(Y))
Standardization after missing imputation and outlier treatment
Centering and scaling should be done after imputing missing values, because imputation can change the correct center and scale to use. Similarly, outlier treatment should be done prior to standardization.
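A tiny illustration of why the order matters (toy numbers, not from the original post):
x <- c(2, 4, NA, 8)
mean(x, na.rm = TRUE)   # 4.67 - the center you would get before imputation
x[is.na(x)] <- 0        # impute the missing value with 0 (for illustration)
mean(x)                 # 3.5  - the center that should actually be used for scaling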

Standardize Binary (Dummy) Variables
  1. Standardizing binary variables makes their interpretation vague, since a binary variable cannot be increased by one standard deviation. The simplest solution is not to standardize binary variables but to code them as 0/1, and then standardize all other continuous variables by dividing them by two standard deviations. This puts them on approximately the same scale: the standard deviation of both types of variable will be roughly 0.5 (see the R sketch after this list).
  2. Some researchers favor standardizing binary variables, as it puts all predictors on the same scale. This is standard practice in penalized regression (lasso); in that case, the interpretation of the coefficients is sacrificed.
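A minimal sketch of the 'two standard deviations' idea from point 1 (simulated data, not from the original post):
set.seed(1)
dummy <- rbinom(1000, 1, 0.5)                # 0/1 variable; its sd is close to 0.5
cont  <- rnorm(1000, mean = 100, sd = 20)    # continuous variable on a larger scale
cont_scaled <- (cont - mean(cont)) / (2 * sd(cont))
sd(dummy)        # roughly 0.5
sd(cont_scaled)  # exactly 0.5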

Standardization and Tree Algorithms and Logistic Regression
Standardization does not affect the predictions of logistic regression, decision trees, or other ensemble techniques such as random forest and gradient boosting.

How to Integrate R with PHP

This tutorial explains how to integrate R with PHP.

Online reporting tools have gained popularity in recent years, and there is a growing demand to implement advanced analytics in them. Advanced analytics helps solve various organizational problems such as retaining existing customers, acquiring new customers, and increasing customer satisfaction.

PHP is one of the most popular programming languages for developing websites and online reporting tools. It has rich functionality for writing business logic, but it is not effective when it comes to data science and machine learning. In the field of data science, R dominates in popularity among statisticians and data scientists, with more than 10,000 packages.

How to make PHP communicate with R

There are times when you want to showcase the output of an R program, such as charts built from user-submitted data on a web page. In that case you might want your PHP-based web application to communicate with the R script.

When it comes to PHP, it has a very useful function called exec(). It lets you execute an external program. We will use this very function to execute the R script you created; the script generates the graph, and we then show the graph on our web page.

The exec function can be used on both the Linux and Windows environments.

  1. On Linux, it runs the command you set, along with the arguments you specify, in a shell.
  2. On Windows, it runs the command you provide, along with the arguments you specify, via CMD.

I will walk you through the process of integrating the R code with PHP web page with code and explanation. 

Let’s first create a PHP based web form:

index.php:
<html>
  <head>
    <title>PHP and R Integration Sample</title>
  </head>
  <body>
    <div id="r-output" style="width: 100%; padding: 25px;">
    <?php
      // Execute the R script from within the PHP code.
      // It generates the output as a test.png image.
      exec("Rscript sample.R");
    ?>
    <img src="test.png?ver1.1" alt="R Graph" />
    </div>
  </body>
</html>
Now save the file as index.php under your /htdocs/PROJECT-NAME/index.php.

Let’s create a sample chart using R code.

Write the following code and save it as sample.R file.
x <- rnorm(6,0,1)
png(filename="test.png", width=500, height=500)
hist(x, col="red")
dev.off()
Histogram
rnorm(6, 0, 1) generates 6 random values with mean 0 and standard deviation 1. The dev.off() command closes the graphics device; once the chart is created, it is saved as the test.png file.
The only downside of this code is that it creates the same test.png file for every incoming request. This means that if you are creating charts based on user-specified inputs, each request overwrites the single test.png file.
Let’s understand the code

As specified earlier, exec("Rscript sample.R"); executes the R script, which in turn generates the test.png graph image.

In the very next line we use the HTML <img /> tag to display the image generated by the R program. The src="test.png?ver1.1" query string (ver1.1) is used to invalidate the browser cache and force the new image to be downloaded from the server.

All modern browsers support caching. You might have noticed that some websites load much faster on repeat visits; this is because browsers cache images and other static resources for a brief period of time.

How to serve concurrent requests?

sample2.R
args <- commandArgs(TRUE)
cols <- as.integer(args[1])   # number of random values to draw
fname <- args[2]              # output file name (without extension)
x <- rnorm(cols, 0, 1)
fname <- paste(fname, "png", sep = ".")
png(filename = fname, width = 500, height = 500)
hist(x, col = "red")
dev.off()
Index.php
<html>
  <head>
    <title>PHP and R Integration Sample</title>
  </head>
  <body>
    <div id="r-output" style="width: 100%; padding: 25px;">
    <?php
      // Execute the R script from within the PHP code.
      // A random file name avoids overwriting the image on concurrent requests.
      $filename = "samplefile" . rand(1, 100);
      exec("Rscript sample2.R 6 " . $filename);
    ?>
    <img src="<?php echo $filename; ?>.png?ver1.1" alt="R Graph" />
    </div>
  </body>
</html>
This eliminates the need to reuse the same test.png file name. Here $filename starts with "samplefile" followed by a random number appended at the end; rand(min, max) generates a random integer in that range.

This fixes the file-overwriting issue, and you will be able to handle concurrent requests, serving each one with its own unique image(s).

You might need to take care of removing old files. If you are on a Linux machine, you can set up a cron job that finds and deletes chart image files older than 24 hours.

Here is the code to find and remove files:

Delete.php
<?php
// Set the path to your chart image directory
$dir = "images/temp/";
// Loop through all the chart png files inside the directory
foreach (glob($dir . "*.png") as $file) {
    // Delete the file if it is more than 24 hours old
    if (filemtime($file) < time() - 86400) {
        unlink($file);
    }
}
?>

Conclusion

Making PHP communicate with R and showcase the result is very simple. You only need to understand the exec() function and a little PHP if you want to delete the residual files/images generated by your R program.

Author Bio
This article was originally written by Darshan Joshi; Deepanshu later gave the post its final touches. Darshan is a programming enthusiast who loves to help developers in every possible way. He is the founder of AlphansoTech, a web application development company. You can connect with him on Twitter and LinkedIn.

Check number of observations in SAS dataset

This post explains how to determine the number of observations in a SAS dataset. We often need to check whether a SAS dataset is empty or not; in a macro, we generally tell SAS to go to the next iteration only when the dataset is non-empty. In this post, we will see various methods to count the number of rows (records) in a SAS table.

Method 1 : Proc SQL Count (Not Efficient)

In the example below, we will use CARS dataset from SASHELP library. This dataset contains 428 observations and 15 columns.

The easiest method is to use count(*) in Proc SQL. It returns all rows (missing plus non-missing rows) in a dataset.
proc sql;
 select count(*) as N from sashelp.cars;
quit;
Result : 428

In case you want to store it in a macro variable, you can use INTO : keyword.
proc sql noprint;
 select count(*) into :N from sashelp.cars;
quit;

%put &N;
This will print the number of records in SAS log. Check log after running the above program.

Is it an efficient method?
No, it is not efficient at all. It does not use the metadata of the SAS dataset; instead it reads through every record (row), which takes a long time on big SAS tables. However, it is a simple and handy trick to calculate the number of rows in a SAS dataset.

Method 2 : Descriptor Portion (Efficient)

Before getting into detail, we need to understand the descriptor portion and how it works -

A SAS dataset consists of the following two portions -
  1. Descriptor portion. It contains information about the dataset: its name, the number of observations and variables, the creation date, and the engine type.
  2. Data portion. It stores the data values.
This method is one of the most efficient ways to count observations in a SAS table, as it uses the metadata and does not read through the dataset.
data _NULL_;
 if 0 then set sashelp.cars nobs=n;
 put "no. of observations = " n;
 stop;
run;
Explanation
  1. The 'if 0' condition is never true, so the SET statement never executes at run time. The IF-THEN statement is used only at compile time to pull the header information of the dataset and hand it to the compiler so the variables are added to the PDV.
  2. NOBS= is a SET statement option that stores the number of observations of the dataset (here SASHELP.CARS) in the variable you name.
  3. NOBS=n therefore puts the record count into the variable n.
  4. The STOP statement prevents the data step from looping endlessly.
Like the first method, we can keep it in a macro variable. See the implementation below -
data _NULL_;
 if 0 then set sashelp.cars nobs=n;
 call symputx('totobs',n);
 stop;
run;
%put no. of observations = &totobs;
SAS log : no. of observations = 428

CALL SYMPUT is one of the methods to create a SAS macro variable in a data step. Here we have used the newer routine CALL SYMPUTX, which left-justifies the value and strips leading and trailing blanks from a numeric value. If you want to stick to the old-style CALL SYMPUT, you can write it as below -
call symput('totobs',left(n));

Method 3 : Proc SQL Dictionary Method (Efficient)

Like the second method, this one uses the metadata of a dataset, via PROC SQL and DICTIONARY.TABLES.
proc sql noprint;
 select nobs into :totobs separated by '' from dictionary.tables
 where libname='SASHELP' and memname='CARS';
quit;
%put total records = &totobs.;
Proc SQL Dictionary.Tables

It is an efficient method, as it does not look at each value of the dataset to determine the count. LIBNAME= refers to the name of the library in which the data is stored, and MEMNAME= refers to the SAS table (dataset). The separated by '' is used in this case to left-align the numeric value.

Method 4 : Macro Language Method (Efficient)

This method also uses the metadata, but via the macro language using DATA step functions. The OPEN function opens a dataset, and the ATTRN function returns the value of a numeric attribute of a SAS dataset; with the NOBS argument it returns the number of observations. Finally, the opened dataset is closed with the CLOSE function.
%macro totobs(mydata);
    %let mydataID=%sysfunc(OPEN(&mydata.,IN));
    %let NOBS=%sysfunc(ATTRN(&mydataID,NOBS));
    %let RC=%sysfunc(CLOSE(&mydataID));
    &NOBS
%mend;
%put %totobs(sashelp.cars);

SAS : Check if a table is empty

Suppose you only need to check whether a table is empty or not. You can use the same logic as explained above: if the returned value is 0, write 'empty dataset' to the log; otherwise, report the number of records.
data _NULL_;
if 0 then set sashelp.cars nobs=n;
if n = 0 then put 'empty dataset';
else put 'Not empty. Total records=' n;
stop;
run;
Result : Not Empty. Total records = 428

Let's create a blank dataset to check the above code. The following program returns an empty dataset, as the condition 1 = 2 is never met.
proc sql noprint;
create table temp as
select * from sashelp.cars
where 1 = 2;
quit;
Try it yourself!

Let's wrap the above code in a SAS macro
%macro emptydataset (inputdata=);
data _NULL_;
 if 0 then set &inputdata. nobs=n;
 call symputx('totobs',n);
 stop;
run;
%if &totobs. = 0 %then %put Empty dataset;
%else %do;
%put TotalObs=&totobs;
%end;
%mend;
%emptydataset(inputdata=sashelp.cars);
Result : TotalObs=428

%emptydataset(inputdata=work.temp);
Result : Empty dataset

If you find it difficult to memorize the SAS code of the descriptor portion method, you can use the code below.
data _NULL_;
set sashelp.cars nobs=N;
if _N_ = 2 then stop;
put N;
run;
SAS log
It reads only first two observations from the dataset. See log above.

Python Tutorial for Beginners: Learn in 3 Days

This tutorial helps you get started with Python. It's a step-by-step practical guide to learning Python by example. Python is an open source language widely used as a high-level, general-purpose programming language, and it has gained great popularity in the data science world. As the data science domain grows, IBM recently predicted that demand for data science professionals would rise by more than 25% by 2020. In the PyPL Popularity of Programming Language index, Python ranked second with a 14 percent share, and in the advanced and predictive analytics market it is ranked among the top 3 programming languages.
Learn Python : Tutorial for Beginners

Table of Contents
  1. Getting Started with Python
    • Python 2.7 vs. 3.6
    • Python for Data Science
    • How to install Python?
    • Spyder Shortcut keys
    • Basic programs in Python
    • Comparison, Logical and Assignment Operators
  2. Data Structures and Conditional Statements
    • Python Data Structures
    • Python Conditional Statements
  3. Python Libraries
    • List of popular packages (comparison with R)
    • Popular python commands
    • How to import a package
  4. Data Manipulation using Pandas
    • Pandas Data Structures - Series and DataFrame
    • Important Pandas Functions (vs. R functions)
    • Examples - Data analysis with Pandas
  5. Data Science with Python
    • Logistic Regression
    • Decision Tree
    • Random Forest
    • Grid Search - Hyper Parameter Tuning
    • Cross Validation
    • Preprocessing Steps

Python 2.7 vs 3.6

Google yields thousands of articles on this topic. Some bloggers are opposed to Python 2.7 and some are in favor of it. If you filter your search to recent articles only (late 2016 onwards), you will see that the majority of bloggers favor Python 3.6. See the following reasons to support Python 3.6.

1. The official end-of-support date for Python 2.7 is the year 2020. After that there will be no support from the community, so it does not make sense to start learning 2.7 today.

2. Python 3.6 supports 95% of top 360 python packages and almost 100% of top packages for data science.

What's new in Python 3.6

It is cleaner and faster, and it is a language for the future. It fixed major issues of the Python 2 series. Python 3 was first released in 2008, and robust versions of the Python 3 series have been released for 9 years.

Key Takeaway
You should go for Python 3.6. In terms of learning Python, there are no major differences between Python 2.7 and 3.6, and moving between them requires only a few adjustments. Your focus should be on learning Python as a language.

Python for Data Science

Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development, etc. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. In some areas, such as the number of libraries for statistical analysis, R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of many data scientists.

There are several reasons to learn Python. Some of them are as follows -
  1. Python runs well in automating various steps of a predictive model. 
  2. Python has awesome robust libraries for machine learning, natural language processing, deep learning, big data and artificial Intelligence. 
  3. Python wins over R when it comes to deploying machine learning models in production.
  4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
  5. Python has a great online community support.
Do you know these sites are developed in Python?
  1. YouTube
  2. Instagram
  3. Reddit
  4. Dropbox
  5. Disqus

How to Install Python

There are two ways to download and install Python
  1. Download Anaconda. It comes with Python software along with preinstalled popular libraries.
  2. Download Python from its official website. You have to manually install libraries.

Recommended : Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

Coding Environments

Anaconda comes with two popular IDEs :
  1. Jupyter (Ipython) Notebook
  2. Spyder
Spyder. It is like RStudio for Python. It provides a user-friendly environment for writing Python code. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs and a console to check each and every line of code. Under the 'Variable explorer', you can inspect the data objects and functions you have created. I highly recommend Spyder!
Spyder - Python Coding Environment
Jupyter (Ipython) Notebook

Jupyter is the equivalent of R Markdown / knitr in R. It is useful when you need to present your work to others or create a step-by-step project report, as it can combine code, output, text, and graphics.

Spyder Shortcut Keys

The following is a list of some useful Spyder shortcut keys that make you more productive.
  1. Press F5 to run the entire script
  2. Press F9 to run selection or line 
  3. Press Ctrl + 1 to comment / uncomment
  4. Place the cursor at a function name and press Ctrl + I to see the documentation of the function
  5. Run %reset -f to clean workspace
  6. Ctrl + Left click on object to see source code 
  7. Ctrl+Enter executes the current cell.
  8. Shift+Enter executes the current cell and advances the cursor to the next cell

List of arithmetic operators with examples

Arithmetic Operator | Operation            | Example
+                   | Addition             | 10 + 2 = 12
-                   | Subtraction          | 10 - 2 = 8
*                   | Multiplication       | 10 * 2 = 20
/                   | Division             | 10 / 2 = 5.0
%                   | Modulus (Remainder)  | 10 % 3 = 1
**                  | Power                | 10 ** 2 = 100
//                  | Floor Division       | 17 // 3 = 5
(x + (d-1)) // d    | Ceiling              | (17 + (3-1)) // 3 = 6

Basic Programs

Example 1
#Basics
x = 10
y = 3
print("10 divided by 3 is", x/y)
print("remainder after 10 divided by 3 is", x%y)
Result :
10 divided by 3 is 3.3333333333333335
remainder after 10 divided by 3 is 1

Example 2
x = 100
x > 80 and x <= 95
Out[45]: False
x > 35 or x < 60
Out[46]: True
Operator | Description                           | Example
>        | Greater than                          | 5 > 3 returns True
<        | Less than                             | 5 < 3 returns False
>=       | Greater than or equal to              | 5 >= 3 returns True
<=       | Less than or equal to                 | 5 <= 3 returns False
==       | Equal to                              | 5 == 3 returns False
!=       | Not equal to                          | 5 != 3 returns True
and      | Check both conditions                 | x > 18 and x <= 35
or       | True if at least one condition holds  | x > 35 or x < 60
not      | Opposite of the condition             | not(x > 7)

Assignment Operators

It is used to assign a value to the declared variable. For e.g. x += 25 means x = x +25.
x = 100
y = 10
x += y
print(x)
110
In this case, x+=y implies x=x+y which is x = 100 + 10.
Similarly, you can use x-=y, x*=y and x /=y

Python Data Structure

In every programming language, it is important to understand the data structures. Following are some data structures used in Python.

1. List

It is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples of lists below. The first is an integer list, the second a string list containing only string values, and the third a mixed list containing integer, string and float values.
  1. x = [1, 2, 3, 4, 5]
  2. y = ['A', 'O', 'G', 'M']
  3. z = ['A', 4, 5.1, 'M']
Get List Item

We can extract list items using indexes. Indexes start from 0 and end at (number of elements - 1).
x = [1, 2, 3, 4, 5]
x[0]
Out[68]: 1

x[1]
Out[69]: 2

x[4]
Out[70]: 5

x[-1]
Out[71]: 5

x[-2]
Out[72]: 4

x[0] picks the first element of the list. A negative index tells Python to count from the right, so x[-1] selects the last element of the list.

You can select multiple elements from a list using the following method
x[:3] returns [1, 2, 3]

2. Tuple

A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows -
  1. A tuple cannot be changed once created, whereas a list can be modified.
  2. A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ].
Examples
K = (1,2,3)
City = ('Delhi','Mumbai','Bangalore')
Perform for loop on Tuple
for i in City:
    print(i)
Delhi
Mumbai
Bangalore

Functions

Like print(), you can create your own custom functions, also called user-defined functions. They help you automate repetitive tasks and reuse code more easily.

Rules to define a function
  1. A function starts with the def keyword followed by the function name and ( )
  2. The function body starts after a colon (:) and is indented
  3. The return keyword ends the function and returns the value of the expression that follows it.
def sum_fun(a, b):
    result = a + b
    return result 
z = sum_fun(10, 15)
Result : z = 25

Suppose you want python to assume 0 as default value if no value is specified for parameter b.
def sum_fun(a, b=0):
    result = a + b
    return result
z = sum_fun(10)
In the above function, b is set to 0 if no value is provided for it. This does not mean b can take no value other than 0; you can still call it as z = sum_fun(10, 15).

Conditional Statements (if else)

Conditional statements are commonly used in coding. An IF-ELSE statement can be read as: "if a condition holds true, execute something; else execute something else".

Note : The if and else statements end with a colon :

Example
k = 27
if k%5 == 0:
  print('Multiple of 5')
else:
  print('Not a Multiple of 5')
Result : Not a Multiple of 5

Popular python packages for Data Analysis & Visualization

Some of the leading packages in Python along with equivalent libraries in R are as follows-
  1. pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
  2. NumPy. For numerical computing. It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly analogous to the Rcpp package in R, which eliminates the limitation of slow speed in R.
  3. SciPy. For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on NumPy.
  4. Scikit-learn. A collection of machine learning algorithms. It is built on NumPy and SciPy. It can perform all the techniques that can be done in R using the glm, knn, randomForest, rpart and e1071 packages.
  5. Matplotlib. For data visualization. It's the leading package for graphics in Python. It is equivalent to the ggplot2 package in R.
  6. Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, impute missing values, run statistical tests and export table output to HTML.
  7. pandasql. It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
Most of the above packages come preinstalled with Anaconda.
    Comparison of Python and R Packages by Data Mining Task

Task                | Python Package            | R Package
IDE                 | Rodeo / Spyder            | RStudio
Data Manipulation   | pandas                    | dplyr and reshape2
Machine Learning    | Scikit-learn              | glm, knn, randomForest, rpart, e1071
Data Visualization  | ggplot + seaborn + bokeh  | ggplot2
Character Functions | Built-in functions        | stringr
Reproducibility     | Jupyter                   | Knitr
SQL Queries         | pandasql                  | sqldf
Working with Dates  | datetime                  | lubridate
Web Scraping        | beautifulsoup             | rvest

    Popular Python Commands

    The commands below would help you to install and update new and existing packages. Let's say, you want to install / uninstall pandas package.

    Install Package
    !pip install pandas

    Uninstall Package
    !pip uninstall pandas

    Show Information about Installed Package
    !pip show pandas

    List of Installed Packages
    !pip list

    Upgrade a package
    !pip install --upgrade pandas

      How to import a package

      There are multiple ways to import a package in Python. It is important to understand the difference between these styles.

1. import pandas as pd
It imports the package pandas under the alias pd. The DataFrame function from pandas is then called as pd.DataFrame.

2. import pandas
It imports the package without an alias, so the DataFrame function must be called with the full package name: pandas.DataFrame.

3. from pandas import *
It imports the whole package, and the DataFrame function is called simply by typing DataFrame. This can create confusion when the same function name exists in more than one package.

      Pandas Data Structures : Series and DataFrame

      In pandas package, there are two data structures - series and dataframe. These structures are explained below in detail -
      1. Series is a one-dimensional array. You can access individual elements of a series using position. It's similar to vector in R.
      In the example below, we are generating 5 random values.
import numpy as np
import pandas as pd
s1 = pd.Series(np.random.randn(5))
s1
0   -2.412015
1   -0.451752
2    1.174207
3    0.766348
4   -0.361815
dtype: float64

      Extract first and second value

      You can get a particular element of a series using index value. See the examples below -

      s1[0]
      -2.412015
      s1[1]
      -0.451752
      s1[:3]
      0   -2.412015
      1 -0.451752
      2 1.174207

      2. DataFrame

It is equivalent to data.frame in R. It is a 2-dimensional data structure that can store data of different types such as characters, integers, floating point values and factors. Those who are well-versed in MS Excel can think of a data frame as an Excel spreadsheet.

      Comparison of Data Type in Python and Pandas

      The following table shows how Python and pandas package stores data.

Data Type                          | Pandas     | Standard Python
Character variable                 | object     | string
Categorical variable               | category   | -
Numeric variable without decimals  | int64      | int
Numeric variable with decimals     | float64    | float
Date-time variable                 | datetime64 | -

      Important Pandas Functions

The table below compares pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorise pandas functions and is very handy information for programmers who are new to Python. It includes solutions for most of the frequently used data exploration tasks.

Function                           | R                         | Python (pandas package)
Installing a package               | install.packages('name')  | !pip install name
Loading a package                  | library(name)             | import name as other_name
Checking working directory         | getwd()                   | import os; os.getcwd()
Setting working directory          | setwd()                   | os.chdir()
List files in a directory          | dir()                     | os.listdir()
Remove an object                   | rm('name')                | del object
Select variables                   | select(df, x1, x2)        | df[['x1', 'x2']]
Drop variables                     | select(df, -(x1:x2))      | df.drop(['x1', 'x2'], axis = 1)
Filter data                        | filter(df, x1 >= 100)     | df.query('x1 >= 100')
Structure of a data frame          | str(df)                   | df.info()
Summarize data frame               | summary(df)               | df.describe()
Get row names of data frame "df"   | rownames(df)              | df.index
Get column names                   | colnames(df)              | df.columns
View top N rows                    | head(df, N)               | df.head(N)
View bottom N rows                 | tail(df, N)               | df.tail(N)
Get dimension of data frame        | dim(df)                   | df.shape
Get number of rows                 | nrow(df)                  | df.shape[0]
Get number of columns              | ncol(df)                  | df.shape[1]
Length of data frame               | length(df)                | len(df)
Get random 3 rows from data frame  | sample_n(df, 3)           | df.sample(n=3)
Get random 10% rows                | sample_frac(df, 0.1)      | df.sample(frac=0.1)
Check missing values               | is.na(df$x)               | pd.isnull(df.x)
Sorting                            | arrange(df, x1, x2)       | df.sort_values(['x1', 'x2'])
Rename variables                   | rename(df, newvar = x1)   | df.rename(columns={'x1': 'newvar'})


      Data Manipulation with pandas - Examples

      1. Import Required Packages

      You can import required packages using import statement. In the syntax below, we are asking Python to import numpy and pandas package. The 'as' is used to alias package name.
      import numpy as np
      import pandas as pd

      2. Build DataFrame

      We can build dataframe using DataFrame() function of pandas package.
      mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
              'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
              'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
      df = pd.DataFrame(mydata)
       In this dataframe, we have three variables - productcode, sales, cost.
      Sample DataFrame

      To import data from CSV file


      You can use read_csv() function from pandas package to get data into python from CSV file.
      mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
      Make sure you use double backslash when specifying path of CSV file. Alternatively, you can use forward slash to mention file path inside read_csv() function.

      Detailed Tutorial : Import Data in Python

      3. To see number of rows and columns

      You can run the command below to find out number of rows and columns.
      df.shape
       Result : (6, 3). It means 6 rows and 3 columns.

      4. To view first 3 rows

      The df.head(N) function can be used to check out first some N rows.
      df.head(3)
     cost productcode   sales
0  1020.0          AA  1010.0
1  1625.2          AA  1025.2
2  1204.0          AA  1404.2

      5. Select or Drop Variables

      To keep a single variable, you can write in any of the following three methods -
      df.productcode
      df["productcode"]
      df.loc[: , "productcode"]
      To select variable by column position, you can use df.iloc function. In the example below, we are selecting second column. Column Index starts from 0. Hence, 1 refers to second column.
      df.iloc[: , 1]
We can keep multiple variables by specifying the desired variables inside [ ]. We can also make use of the df.loc[] indexer.
      df[["productcode", "cost"]]
      df.loc[ : , ["productcode", "cost"]]

      Drop Variable

      We can remove variables by using df.drop() function. See the example below -
      df2 = df.drop(['sales'], axis = 1)

      6. To summarize data frame

      To summarize or explore data, you can submit the command below.
      df.describe()
              cost       sales
count     6.000000     6.00000
mean   1166.150000  1242.65000
std     237.926793   230.46669
min    1003.700000  1010.00000
25%    1020.000000  1058.90000
50%    1072.000000  1205.85000
75%    1184.000000  1366.07500
max    1625.200000  1604.80000

      To summarise all the character variables, you can use the following script.
      df.describe(include=['O'])
      Similarly, you can use df.describe(include=['float64']) to view summary of all the numeric variables with decimals.

      To select only a particular variable, you can write the following code -
      df.productcode.describe()
      OR
      df["productcode"].describe()
count      6
unique     2
top       BB
freq       3
Name: productcode, dtype: object

      7. To calculate summary statistics

      We can manually find out summary statistics such as count, mean, median by using commands below
      df.sales.mean()
      df.sales.median()
      df.sales.count()
      df.sales.min()
      df.sales.max()

      8. Filter Data

      Suppose you are asked to apply condition - productcode is equal to "AA" and sales greater than or equal to 1250.
      df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
      It can also be written like :
      df1 = df.query('(productcode == "AA") & (sales >= 1250)')
      In the second query, we do not need to specify DataFrame along with variable name.

      9. Sort Data

In the code below, we arrange the data in ascending order by sales.
      df.sort_values(['sales'])

      10.  Group By : Summary by Grouping Variable

      Like SQL GROUP BY, you want to summarize continuous variable by classification variable. In this case, we are calculating average sale and cost by product code.
      df.groupby(df.productcode).mean()
                    cost        sales
productcode
AA           1283.066667  1146.466667
BB           1049.233333  1338.833333
Instead of summarising multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.
      df["sales"].groupby(df.productcode).mean()

      11. Define Categorical Variable

      Let's create a classification variable - id which contains only 3 unique values - 1/2/3.
      df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
      Let's define as a categorical variable.
      We can use astype() function to make id as a categorical variable.
      df0.id = df0["id"].astype('category')
      Summarize this classification variable to check descriptive statistics.
      df0.describe()
        id
count    7
unique   3
top      2
freq     3

      Frequency Distribution

You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
      df['productcode'].value_counts()
BB    3
AA    3

      12. Generate Histogram

A histogram is one of the methods to check the distribution of a continuous variable. In the figure shown below, there are two values for the variable 'sales' in the range 1000-1100; in the remaining intervals there is only a single value. In this case, there are only a handful of values, but if you have a large dataset you can plot a histogram to identify outliers in a continuous variable.
      df['sales'].hist()
      Histogram

      13. BoxPlot

      Boxplot is a method to visualize continuous or numeric variable. It shows minimum, Q1, Q2, Q3, IQR, maximum value in a single graph.
      df.boxplot(column='sales')
      BoxPlot

      Data Science using Python - Examples

      In this section, we cover how to perform data mining and machine learning algorithms with Python. sklearn is the most frequently used library for running data mining and machine learning algorithms. We will also cover statsmodels library for regression techniques. statsmodels library generates formattable output which can be used further in project report and presentation.

1. Import the required libraries

      Import the following libraries before reading or exploring data
      #Import required libraries
      import pandas as pd
      import statsmodels.api as sm
      import numpy as np

      2. Download and import data into Python

Using pandas, we can easily read data from the web into Python.
      # Read data from web
      df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
Variable | Type        | Description
gre      | Continuous  | Graduate Record Exam score
gpa      | Continuous  | Grade Point Average
rank     | Categorical | Prestige of the undergraduate institution
admit    | Binary      | Admission into graduate school

      The binary variable admit is a target variable.

      3. Explore Data

      Let's explore data. We'll answer the following questions -
      1. How many rows and columns in the data file?
      2. What are the distribution of variables?
      3. Check if any outlier(s)
      4. If outlier(s), treat them
      5. Check if any missing value(s)
      6. Impute Missing values (if any)
      # See no. of rows and columns
      df.shape
      Result : 400 rows and 4 columns

In the code below, we rename the variable rank to 'position', as rank is already the name of a pandas method.
      # rename rank column
      df = df.rename(columns={'rank': 'position'}) 
      Summarize and plot all the columns.
      # Summarize
      df.describe()
      # plot all of the columns
      df.hist()
      Categorical variable Analysis

      It is important to check the frequency distribution of categorical variable. It helps to answer the question whether data is skewed.
      # Summarize
      df.position.value_counts(ascending=True)
1     61
4     67
3    121
2    151

      Generating Crosstab 

By looking at the cross-tabulation report, we can check whether we have enough events for each unique value of the categorical variable.
      pd.crosstab(df['admit'], df['position'])
position   1   2   3   4
admit
0         28  97  93  55
1         33  54  28  12

      Number of Missing Values

      We can write a simple loop to figure out the number of blank values in all variables in a dataset.
      for i in list(df.columns) :
          k = sum(pd.isnull(df[i]))
          print(i, k)
      In this case, there are no missing values in the dataset.

      4. Logistic Regression Model

Logistic regression is a special type of regression where the target variable is categorical and the independent variables can be discrete or continuous. In this post, we demonstrate only binary logistic regression, where the target variable takes just two values. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.

      In python, we can write R-style model formula y ~ x1 + x2 + x3 using  patsy and statsmodels libraries. In the formula, we need to define variable 'position' as a categorical variable by mentioning it inside capital C(). You can also define reference category using reference= option.
      #Reference Category
      from patsy import dmatrices, Treatment
      y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')
It returns two datasets - X and y. The dataset 'y' contains the variable admit, which is the target variable. The other dataset 'X' contains the Intercept (constant value), the dummy variables for position, and gre and gpa. Since 4 is set as the reference category, it is coded 0 on all three dummy variables. See the sample below -
P    P_1  P_2  P_3
3     0    0    1
3     0    0    1
1     1    0    0
4     0    0    0
4     0    0    0
2     0    1    0


      Split Data into two parts

80% of the data goes to the training dataset, which is used for building the model, and 20% goes to the test dataset, which is used for validating the model.
      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      Build Logistic Regression Model

By default, regression without the formula style does not include an intercept. In our case, the Intercept column was already added to X_train by dmatrices, so it is used as a predictor.
      #Fit Logit model
      logit = sm.Logit(y_train, X_train)
      result = logit.fit()

      #Summary of Logistic regression model
      result.summary()
      result.params
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  320
Model:                          Logit   Df Residuals:                      315
Method:                           MLE   Df Model:                            4
Date:                Sat, 20 May 2017   Pseudo R-squ.:                 0.03399
Time:                        19:57:24   Log-Likelihood:                -193.49
converged:                       True   LL-Null:                       -200.30
                                        LLR p-value:                  0.008627
=======================================================================================
                       coef    std err          z      P>|z|     [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
C(position)[T.1]     1.4933      0.440      3.392      0.001        0.630     2.356
C(position)[T.2]     0.6771      0.373      1.813      0.070       -0.055     1.409
C(position)[T.3]     0.1071      0.410      0.261      0.794       -0.696     0.910
gre                  0.0005      0.001      0.442      0.659       -0.002     0.003
gpa                  0.4613      0.214     -2.152      0.031       -0.881    -0.041
=======================================================================================

      Confusion Matrix and Odd Ratio

The odds ratio is the exponential of the parameter estimates.
      #Confusion Matrix
      result.pred_table()
      #Odd Ratio
      np.exp(result.params)

      Prediction on Test Data
In this step, we take the estimates of the logit model built on the training data and apply them to the test data.
      #prediction on test data
      y_pred = result.predict(X_test)

      Calculate Area under Curve (ROC)
# AUC on test data
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.6763

      Calculate Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, [1 if p > 0.5 else 0 for p in y_pred])

      Decision Tree Model

Decision trees can have either a continuous or a categorical target variable. When the target is continuous, the tree is called a regression tree; when it is categorical, it is called a classification tree. At each step the algorithm selects the variable that best splits the data, using criteria such as Gini, entropy / information gain (as in C4.5) or chi-square. Decision trees have several advantages: they are simple to use and easy to understand, require very few data preparation steps, can handle mixed data (both categorical and continuous variables), and are very fast.

      #Drop Intercept from predictors for tree algorithms
      X_train = X_train.drop(['Intercept'], axis = 1)
      X_test = X_test.drop(['Intercept'], axis = 1)

      #Decision Tree
      from sklearn.tree import DecisionTreeClassifier
      model_tree = DecisionTreeClassifier(max_depth=7)

      #Fit the model:
      model_tree.fit(X_train,y_train)

      #Make predictions on test set
      predictions_tree = model_tree.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.664

      Important Note
Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select the best hyper-parameters by using the grid search tuning technique.

      Random Forest Model

A decision tree has the limitation of overfitting, which means it does not generalize well and is very sensitive to small changes in the training data. Random forest overcomes this problem by growing a large number of trees on randomised (bootstrapped) samples of the data and selecting a random subset of variables to grow each tree. It is a more robust algorithm than a single decision tree and one of the most popular machine learning algorithms: it is commonly used in data science competitions, regularly ranked among the top algorithms, and has become a part of every data science toolkit.

      #Random Forest
      from sklearn.ensemble import RandomForestClassifier
      model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

      #Fit the model:
      target = y_train['admit']
      model_rf.fit(X_train,target)

      #Make predictions on test set
      predictions_rf = model_rf.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      #Variable Importance
      importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
      print(importances)
      importances.plot.bar()

      Result : AUC = 0.6974

      Grid Search - Hyper Parameters Tuning

The sklearn library makes hyper-parameter tuning very easy. Grid search is a strategy for selecting the best parameters for an algorithm; in scikit-learn these are passed as arguments to the constructor of the estimator classes, for example max_features in random forest or alpha for lasso.

      from sklearn.model_selection import GridSearchCV
      rf = RandomForestClassifier()
      target = y_train['admit']

      param_grid = {
      'n_estimators': [100, 200, 300],
      'max_features': ['sqrt', 3, 4]
      }

      CV_rfc = GridSearchCV(estimator=rf , param_grid=param_grid, cv= 5, scoring='roc_auc')
      CV_rfc.fit(X_train,target)

      #Parameters with Scores
CV_rfc.cv_results_

      #Best Parameters
      CV_rfc.best_params_
      CV_rfc.best_estimator_

      #Make predictions on test set
      predictions_rf = CV_rfc.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      Cross Validation
      # Cross Validation
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_predict,cross_val_score
      target = y['admit']
      prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
      #AUC
      cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')
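The cross-validated probabilities stored in prediction_logit above are not scored in the original code. One way to turn them into a single AUC value (a small addition, assuming the same objects as above):

#AUC from the cross-validated probabilities
from sklearn.metrics import roc_auc_score
print(roc_auc_score(target, prediction_logit[:, 1]))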

      Data Mining : PreProcessing Steps

1. The machine learning package sklearn requires all categorical variables to be in numeric form. Hence, we need to convert all character/categorical variables to numeric. This can be accomplished with the following script, which uses sklearn's built-in LabelEncoder.

      from sklearn.preprocessing import LabelEncoder
def ConverttoNumeric(df):
    # select character / categorical columns
    cols = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for i in cols:
        try:
            df[i] = le.fit_transform(df[i])
        except:
            print('Error in Variable :' + i)
    return df

      ConverttoNumeric(mydf)
      Encoding

      2. Impute Missing Values

Imputing missing values is an important step of predictive modeling. In many algorithms, if missing values are not filled, the complete row is dropped. If the data contain a lot of missing values, this can lead to a huge loss of data. There are multiple ways to impute missing values; common techniques are to replace a missing value with the mean, median or zero. It makes sense to replace a missing value with 0 when 0 is meaningful, for example a flag indicating whether a customer holds a credit card product.
      # fill missing values with 0
      df['var1'] = df['var1'].fillna(0)
      # fill missing values with mean
      df['var1'] = df['var1'].fillna(df['var1'].mean())
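Median imputation is mentioned above but not shown; the analogous pandas call would be (assuming the same df and var1 as in the examples above):

# fill missing values with median
df['var1'] = df['var1'].fillna(df['var1'].median())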

      3. Outlier Treatment

      There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows -

1. Cap extreme values at the 95th / 99th percentile, depending on the distribution (a capping sketch is shown after the log example below)
2. Apply a log transformation to the variable. See below the implementation of log transformation in Python.

      import numpy as np
      df['var1'] = np.log(df['var1'])
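A minimal sketch of method 1 (percentile capping), assuming the same df and var1 as above; the 1st and 99th percentiles are used here purely as an example:

# cap (winsorize) var1 at the 1st and 99th percentiles
lower, upper = df['var1'].quantile([0.01, 0.99])
df['var1'] = df['var1'].clip(lower=lower, upper=upper)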
      Next Steps

Practice, practice and practice. Download free public data sets from the Kaggle / UCLA websites, play around with the data, generate insights with the pandas package and build statistical models using the sklearn package. I hope you found this tutorial helpful. I tried to cover all the important topics a beginner must know about Python. After completing this tutorial, you can confidently say you know how to program in Python and how to implement machine learning algorithms using the sklearn package.

      Feature Selection : Select Important Variables with Boruta Package

This article explains how to select important variables using the boruta package in R. Variable selection is an important step in a predictive modeling project. It is also called 'feature selection'. Every private and public agency has started tracking data and collecting information on various attributes, which results in access to far too many potential predictors for a predictive model. But not every variable is important for predicting a particular outcome, so it is essential to identify the important variables and remove the redundant ones. Before building a predictive model, the exact list of important variables that yields an accurate and robust model is generally not known.

      Why Variable Selection is important?
1. Removing a redundant variable helps to improve accuracy. Similarly, including a relevant variable has a positive effect on model accuracy.
2. Too many variables might result in overfitting, which means the model is not able to generalize the pattern.
3. Too many variables slow down computation, which in turn requires more memory and hardware.

      Why Boruta Package?

There are a lot of packages for feature selection in R. The obvious question is "What makes the boruta package so special?" See the following reasons to use the boruta package for feature selection.
1. It works well for both classification and regression problems.
2. It takes into account multi-variable relationships.
3. It is an improvement on the random forest variable importance measure, which is a very popular method for variable selection.
4. It follows an all-relevant variable selection method: it considers all features that are relevant to the outcome variable. Most other variable selection algorithms follow a minimal-optimal method, relying on a small subset of features that yields a minimal error on a chosen classifier.
5. It can handle interactions between variables.
6. It can deal with the fluctuating nature of a random forest importance measure.
      Boruta Package

      Basic Idea of Boruta Algorithm
Shuffle the values of the predictors, join these shuffled copies to the original predictors and build a random forest on the merged dataset. Then compare the importance of the original variables with that of the randomised (shadow) variables. Only variables whose importance is higher than that of the randomised variables are considered important.

      How Boruta Algorithm Works

Follow the steps below to understand the algorithm (a minimal sketch of the idea is given after the list) -
1. Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using the existing variables.
2. Shuffle the values of the added duplicate copies to remove their correlation with the target variable. These are called shadow features or permuted copies.
3. Combine the original variables with their shuffled copies.
4. Run a random forest classifier on the combined dataset and compute a variable importance measure (the default is Mean Decrease Accuracy) for each variable, where higher means more important.
5. Compute the Z score, i.e. the mean accuracy loss divided by the standard deviation of the accuracy loss.
6. Find the maximum Z score among the shadow attributes (MZSA).
7. Tag variables as 'unimportant' when their importance is significantly lower than MZSA, and permanently remove them from the process.
8. Tag variables as 'important' when their importance is significantly higher than MZSA.
9. Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either 'unimportant' or 'important', whichever comes first.
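The article itself uses the R Boruta package below. Purely to illustrate the shadow-feature idea in the steps above, here is a minimal, single-iteration sketch in Python; it uses scikit-learn's impurity-based importances rather than Boruta's mean-decrease-accuracy Z scores, and it assumes X is a DataFrame of numeric predictors and y is the target:

# Simplified illustration of the shadow-feature idea (not the Boruta package itself)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(123)
shadow = X.apply(lambda col: rng.permutation(col.values))   # shuffled copy of each predictor
shadow.columns = ['shadow_' + c for c in X.columns]
combined = pd.concat([X, shadow], axis=1)                   # originals + shadow features

rf = RandomForestClassifier(n_estimators=500, random_state=123).fit(combined, y)
imp = pd.Series(rf.feature_importances_, index=combined.columns)

max_shadow = imp[shadow.columns].max()                      # analogue of MZSA
important = imp[X.columns][imp[X.columns] > max_shadow].index.tolist()
print(important)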


      Difference between Boruta and Random Forest Importance Measure

When I first learnt this algorithm, the question 'RF importance measure vs. Boruta' puzzled me for hours. After reading a lot about it, I figured out the exact difference between these two variable selection approaches.

In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation, and it is used as the importance measure for all variables. However, this Z score cannot be used directly as a measure of variable importance, because it is not directly related to the statistical significance of that importance. To work around this problem, the boruta package runs random forest on both the original and the randomised (shadow) attributes and computes the importance of all variables. Since the whole process depends on permuted copies, the random permutation procedure is repeated to get statistically robust results.


      Is Boruta a solution for all?
The answer is NO. You need to test other algorithms as well. It is not possible to judge the best algorithm without knowing the data and its assumptions. Since Boruta is an improvement on the random forest variable importance measure, it should work well most of the time.

      What is shuffled feature or permuted copies?

It simply means changing the order of the values of a variable. See the practical example below -
      set.seed(123)
      mydata = data.frame(var1 = 1 : 6, var2=runif(6))
      shuffle = data.frame(apply(mydata,2,sample))
      head(cbind(mydata, shuffle))

      Original Shuffled
      var1 var2 var1 var2
      1 1 0.2875775 4 0.9404673
      2 2 0.7883051 5 0.4089769
      3 3 0.4089769 3 0.2875775
      4 4 0.8830174 2 0.0455565
      5 5 0.9404673 6 0.8830174
      6 6 0.0455565 1 0.7883051

      R : Feature Selection with Boruta Package

      1. Get Data into R

      The read.csv() function is used to read data from CSV and import it into R environment.
      #Read data
      df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
      2. List of variables
      #Column Names
      names(df)
Result : "admit" "gre" "gpa" "rank"

      3. Define categorical variables
      df$admit = as.factor(df$admit)
      df$rank = as.factor(df$rank)

      4. Explore Data
#Summarize Data
summary(df)
 admit       gre             gpa           rank
 0:273   Min.   :220.0   Min.   :2.260   1: 61
 1:127   1st Qu.:520.0   1st Qu.:3.130   2:151
         Median :580.0   Median :3.395   3:121
         Mean   :587.7   Mean   :3.390   4: 67
         3rd Qu.:660.0   3rd Qu.:3.670
         Max.   :800.0   Max.   :4.000

#Check number of missing values
sapply(df, function(y) sum(is.na(y)))

      No missing values in the dataframe df.

      Handle Missing Values
In this dataset, we have no missing values. If missing values exist in your dataset, you need to impute them before running the boruta package.
      5. Run Boruta Algorithm
      #Install and load Boruta package
      install.packages("Boruta")
      library(Boruta)

      # Run Boruta Algorithm
      set.seed(456)
      boruta <- Boruta(admit~., data = df, doTrace = 2)
      print(boruta)
      plot(boruta)

      Boruta performed 9 iterations in 4.870027 secs.
      3 attributes confirmed important: gpa, gre, rank;
      No attributes deemed unimportant.

It shows that all three variables are considered important and none is tagged 'unimportant'. The plot() function shows box plots of all the attributes plus the minimum, mean and maximum shadow scores. Attributes with green box plots are confirmed important, red box plots indicate rejected attributes, and yellow box plots indicate tentative attributes.
Tentative attributes have importance scores so close to their best shadow attribute that Boruta is unable to make a decision within the default number of random forest runs.
      Box Plot - Variable Selection
As you can see above, the label of shadowMean is not displayed because it got truncated due to insufficient space. To fix this problem, run the following program.
      plot(boruta, xlab = "", xaxt = "n")
      k <-lapply(1:ncol(boruta$ImpHistory),function(i)
        boruta$ImpHistory[is.finite(boruta$ImpHistory[,i]),i])
      names(k) <- colnames(boruta$ImpHistory)
      Labels <- sort(sapply(k,median))
      axis(side = 1,las=2,labels = names(Labels),
             at = 1:ncol(boruta$ImpHistory), cex.axis = 0.7)

      Let's add some irrelevant data to our original dataset

This is to check whether the boruta package is able to identify unimportant variables. In the following program, we create duplicate copies of the original 3 predictors and then randomise the order of values within these copies.
      #Add some random permuted data
      set.seed(777)
      df.new<-data.frame(df,apply(df[,-1],2,sample))
      names(df.new)[5:7]<-paste("Random",1:3,sep="")
      df.new$Random1 = as.numeric(as.character(df.new$Random1))
      df.new$Random2 = as.numeric(as.character(df.new$Random2))
      > head(df.new)
      admit gre gpa rank Random1 Random2 Random3
      1 0 380 3.61 3 600 3.76 4
      2 1 660 3.67 3 660 3.30 4
      3 1 800 4.00 1 700 3.37 2
      4 1 640 3.19 4 620 3.33 3
      5 0 520 2.93 4 600 3.04 2
      6 1 760 3.00 2 520 3.64 4

      Run Boruta Algorithm
      set.seed(456)
      boruta2 <- Boruta(admit~., data = df.new, doTrace = 1)
      print(boruta2)
      plot(boruta2)
      Boruta performed 55 iterations in 21.79995 secs.
      3 attributes confirmed important: gpa, gre, rank;
      3 attributes confirmed unimportant: Random1, Random2, Random3;

The irrelevant variables we added to the dataset came out as unimportant as per the boruta algorithm.
      > attStats(boruta2)
      meanImp medianImp minImp maxImp normHits decision
      gre 5.56458881 5.80124786 2.347609 8.410490 0.90909091 Confirmed
      gpa 9.66289180 9.37140347 6.818527 13.405592 1.00000000 Confirmed
      rank 10.16762154 10.22875211 6.173894 15.235444 1.00000000 Confirmed
      Random1 0.05986751 0.18360283 -1.281078 2.219137 0.00000000 Rejected
      Random2 1.15927054 1.35728128 -2.779228 3.816915 0.29090909 Rejected
      Random3 0.05281551 -0.02874847 -3.126645 3.219810 0.05454545 Rejected

      To save a final list of important variables in a vector, use getSelectedAttributes() function.
      #See list of finalvars
      finalvars = getSelectedAttributes(boruta2, withTentative = F)
      [1] "gre" "gpa" "rank"

In case you get tentative attributes in your dataset, you need to treat them; in this dataset we did not get any. The following function compares the median Z score of each tentative variable with the median Z score of the best shadow attribute and then decides whether the attribute should be confirmed or rejected.
      Tentative.boruta <- TentativeRoughFix(boruta2)
      List of parameters used in Boruta
      1. maxRuns: maximal number of random forest runs. Default is 100.
      2. doTrace: It refers to verbosity level. 0 means no tracing. 1 means reporting attribute decision as soon as it is cleared. 2 means all of 1 plus reporting each iteration. Default is 0.
      3. getImp : function used to obtain attribute importance. The default is getImpRfZ, which runs random forest from the ranger package and gathers Z-scores of mean decrease accuracy measure.
      4. holdHistory: The full history of importance runs is stored if set to TRUE (Default).

        Compare Boruta with RFE Algorithm

        In caret, there is a variable selection algorithm called recursive feature elimination (RFE). It is also called backward selection. A brief explanation of the algorithm is given below -
        1. Fit the model using all independent variables.
        2. Calculate variable importance of all the variables.
        3. Each independent variable is ranked using its importance to the model.
4. Drop the weakest (worst-ranked) variable, build a model using the remaining variables and calculate model accuracy.
5. Repeat step 4 until all variables have been used.
6. Variables are then ranked according to when they were dropped.
7. For regression, RMSE and R-Squared are used as the metrics. For classification, it is 'Accuracy' and 'Kappa'.
In the code below, we are building a random forest model within the RFE algorithm. The option rfFuncs specifies random forest.
        library(caret)
        library(randomForest)
        set.seed(456)
        control <- rfeControl(functions=rfFuncs, method="cv", number=10)
        rfe <- rfe(df.new[,2:7], df.new[,1], rfeControl=control)
        print(rfe, top=10)
        plot(rfe, type=c("g", "o"), cex = 1.0)
        predictors(rfe)
        head(rfe$resample, 10)
        Outer resampling method: Cross-Validated (10 fold) 

        Resampling performance over subset size:

        Variables Accuracy Kappa AccuracySD KappaSD Selected
        4 0.6477 0.1053 0.07009 0.1665
        6 0.7076 0.2301 0.06285 0.1580 *

        The top 6 variables (out of 6):
        gpa, rank, gre, Random2, Random3, Random1

        RFE - Variable Selection
In this case, the RFE algorithm retained all the variables based on model accuracy. Compared to RFE, the final variables from Boruta make more sense in terms of interpretation. It all depends on the data and the distribution of its variables. As analysts, we should explore both techniques and see which one works better for the dataset at hand. There are many packages in R for variable selection, and every technique has pros and cons.

        The following functions can be used for model fitting in RFE selections
        1. linear regression (lmFuncs)
        2. random forests (rfFuncs)
        3. naive Bayes (nbFuncs)
        4. bagged trees (treebagFuncs)

        Does Boruta handle multicollinearity?

Multicollinearity means high correlation between independent variables. The absence of severe multicollinearity is an important assumption in linear and logistic regression models; when it is present, the coefficient estimates become unstable, with inflated standard errors. Let's check whether the boruta algorithm takes care of it. We create some sample data with 3 predictors x1-x3 and a target variable y.
        set.seed(123)
        x1 <- runif(500)
        x2 <- rnorm(500)
x3 <- x2 + rnorm(500, sd = 0.5)
        y <- x3 + runif(500) 
        cor(x2,x3)
        [1] 0.8981247

        The correlation of variables x2 and x3 is very high (close to 0.9). It means they are highly correlated. 
        mydata = data.frame(x1,x2,x3)
        Boruta(mydata, y)
        Boruta performed 9 iterations in 7.088029 secs.
         2 attributes confirmed important: x2, x3;
         1 attributes confirmed unimportant: x1;

Boruta considered both highly correlated variables to be important. This implies it does not treat collinearity while selecting important variables. This is because of the way the algorithm works.

        Important points related to Boruta
1. Impute missing values - Make sure missing or blank values are filled in before running the boruta algorithm.
2. Collinearity - It is important to handle collinearity after getting the important variables from boruta.
3. Slow speed - It is slower than other traditional feature selection algorithms.

        SAS, R & Predictive Modeling Training in Delhi NCR

        SAS Training in Delhi NCR
        We are starting our first classroom training batch from July 15, 2017 in Delhi NCR (Delhi / Gurgaon). We are offering courses on SAS , R and Predictive Modeling.
        1. Practical SAS Programming - Learning SAS by Case Studies
        2. Predictive Modeling with SAS - Modeling with Hands-on Examples plus Domain Knowledge
3. Data Science using R - Practical Data Science Course (Incl. R Programming, Data Science and Domain Knowledge)

        Practical SAS Programming
        Rs 20,000
        • Base and Advanced SAS Programming
        • Classroom Training + Videos
        •          Live Projects                  
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Limited Seats Available


        Venue : To be decided
        Weekend Classes

        Predictive Modeling using SAS
        Rs 25,000
        • Predictive Modeling + Intro to SAS Programming
        • Classroom Training + Videos
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Limited Seats Available


        Venue : To be decided
        Weekend Classes

        R Programming + Data Science with R 
        Rs 30,000
        • R Programming + Predictive Modeling
        • Classroom Training + Videos
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Limited Seats Available


        Venue : To be decided
        Weekend Classes


        Combo Deals - Spend Less, Learn More
        Get Rs 10,000 off on registration for any of the two courses
        Get Rs 20,000 off on registration for all of the three courses
        **Offer ends 15th June, 2017

        How We are different from other institutes?
Here are some of the features of ListenData that make us better than other training institutes.
        1. Explain Advanced Analytics and Machine Learning Algorithms in Simple English. We make classes more logical and understandable than just telling concepts.
        2. Practical Application of Techniques using Real-world Datasets. No sample or cleaned dataset.
        3. Domain Knowledge - It is the most important element of a predictive modeling project. People who lack in domain knowledge find it difficult to crack interviews in spite of having knowledge of predictive modeling.
        4. Hands-on Model Development and Validation Experience
        5. Strategies to implement predictive model
        6. New algorithms to solve problems efficiently
        7. Explain complex topics via visual lessons

        Who should do these courses?
        These courses are ideal for candidates who want to make a career in analytics.
        1. Any candidate pursuing graduation / post graduation or already graduate can apply for this course. No particular specialization is required prior to applying for these courses. You can be from any educational background like Engineering, Economics, Statistics, Mathematics, Commerce, Business Management, Operational Research etc.
        2. Anyone who is planning a career shift to analytics. It does not matter if you are a network engineer or financial analyst. You can go ahead with these courses as they do not require any prior knowledge of programming or statistics.

        The decline of SAS Jobs and rise of R?
I have been working in SAS for close to 7 years and have worked with 4 organizations (instability in career! :D). Whenever I look for a job change, I do not see any decline of SAS jobs in the market. It is a big hit in banks, insurance, telecom and pharmaceutical companies. SAS is still a world leader in advanced analytics. It is one of the most sought-after skills in the job market. Learning SAS will help you scale up your skills, which in turn boosts your career. List of Companies using SAS

        At the same time, R has gained popularity. It is a language of choice for data scientists. It makes advanced statistical techniques and machine learning algorithms easy to implement. It is being used as a primary tool in IT, ecommerce, startups, HR, service and product based companies and secondary tool in banks, insurance and telecom companies. List of Companies using R

        Final Comment - You should not get into language wars and should focus on learning both the languages as jobs are evolving very fast.

        Is my registration fees refundable?
It would be automatically adjusted against the total fees. In other words, you pay Rs 1000 less than the total fees. In case you want to opt out of the course, you can ask for a refund within 7 days of registration.

        Any Questions?
        Please feel free to write me at deepanshu.bhalla@outlook.com OR Join me on linkedin

        SAS & R Training



        Curriculum - Practical SAS Programming
        1. Introduction to SAS
        2. How SAS works
        3. Import Raw Data Files - Basics
        4. Import Raw Data Files - Special Cases
        5. Importing / Exporting Data with Procedures
        6. Exploring Data - Various Methods
        7. Data Subsetting
        8. Data Manipulation - Basics
        9. Data Manipulation - Intermediate
        10. Data Manipulation - Advanced
        11. Do Loops and Arrays
        12. Merging Data
        13. Appending Data
        14. Character & Numeric Functions
        15. Date Functions
        16. Reporting - Creating tabular reports
        17. Proc SQL - Part I
        18. Proc SQL - Part II
        19. Proc SQL - Part III
        20. SAS Macros - Basics
        21. SAS Macros - Intermediate
        22. SAS Macros - Advanced
        23. SAS Macros - Debugging Tips
        24. Efficient SAS Programming Tips
        25. Connect to Databases using SAS
        26. Interview Tips - Scenario Based Questions
        27. Live Project




        Curriculum - Predictive Modeling using SAS
        1. Introduction to Statistics & Modeling
        2. Marketing Analytics : Applications
        3. Predictive Modeling in Financial Services Industry
        4. Predictive Modeling in HR
        5. SAS Programming - Basics
        6. SAS Programming - Intermediate
        7. Descriptive Statistics with SAS
        8. Hypothesis Testing with SAS
        9. Correlation Analysis with SAS
        10. Steps of Predictive Modeling
        11. Data Preparation in Predictive Modeling
        12. Variable Selection Methods in Predictive Modeling
        13. Segmentation - Introduction
        14. Segmentation - Cluster Analysis : Theory
        15. Segmentation - Cluster Analysis : Data Preparation
        16. Segmentation - Cluster Analysis : k-means and Hierarchical
        17. Segmentation - Cluster Analysis : Cluster Performance
        18. Principal Component Analysis (PCA) - Theory
        19. Running and Understanding PCA with SAS
        20. Linear Regression - Theory
        21. Linear Regression - Assumptions and Treatment
        22. Linear Regression - Important Metrics
        23. Linear Regression - Variable Selection Methods
        24. Linear Regression - Model Development
        25. Linear Regression - Model Validation
        26. Linear Regression - Model Performance
        27. Linear Regression - Model Scoring
        28. Linear Regression - Model Implementation
        29. Logistic Regression - Theory
        30. Logistic Regression - Assumptions and Treatment
        31. Logistic Regression - Important Metrics
        32. Logistic Regression - Variable Selection Methods
        33. Logistic Regression - Model Development
        34. Logistic Regression - Model Validation
        35. Logistic Regression - Model Performance
        36. Logistic Regression - Model Implementation
        37. Decision Tree - How it works
        38. Decision Tree - Model Development
        39. Decision Tree - Model Validation
        40. Decision Tree - Model Performance
        41. Decision Tree - Model Implementation
        42. Time Series Forecasting - Theory
        43. Time Series Analysis with SAS
        44. Special Cases - Handle rare event model
        45. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        46. Case Studies - Customer Segmentation
        47. Case Studies - Probability of Default
        48. Case Studies - HR Drivers Analysis
        49. Case Studies - Sales Forecasting
        50. Case Studies - Time Series Forecasting
        51. Interview Tips - Common Interview Questions



        Curriculum - R Programming + Data Science with R
        1. Introduction to R
        2. Introduction to RStudio
        3. Data Structures in R
        4. Importing / Exporting Data in R
        5. Data Exploration
        6. Data Manipulation with dplyr package - Basics
        7. Data Manipulation with dplyr package - Intermediate
        8. Data Manipulation with dplyr package - Advanced
        9. Character and Numeric Functions in R
        10. Data & Time Functions in R
        11. Data Visualization in R
        12. Loops in R (Apply Family of Functions & For Loop)
        13. R Functions - Part I
        14. R Functions - Part II
        15. Introduction to Data Science
        16. Marketing Analytics : Applications
        17. Predictive Modeling in Financial Services Industry
        18. Predictive Modeling in HR
        19. Hypothesis Testing with R
        20. Correlation Analysis with R
        21. Steps of Predictive Modeling
        22. Data Preparation in Predictive Modeling
        23. Variable Selection Methods in Predictive Modeling
        24. Segmentation - Introduction
        25. Segmentation - Cluster Analysis : Theory
        26. Segmentation - Cluster Analysis : Data Preparation
        27. Segmentation - Cluster Analysis : k-means and Hierarchical
        28. Segmentation - Cluster Analysis : Cluster Performance
        29. Principal Component Analysis (PCA) - Theory
        30. Running and Understanding PCA with R
        31. Linear Regression - Theory
        32. Linear Regression - Assumptions and Treatment
        33. Linear Regression - Important Metrics
        34. Linear Regression - Variable Selection Methods
        35. Linear Regression - Model Development
        36. Linear Regression - Model Validation
        37. Linear Regression - Model Performance
        38. Linear Regression - Model Scoring
        39. Linear Regression - Model Implementation
        40. Logistic Regression - Theory
        41. Logistic Regression - Assumptions and Treatment
        42. Logistic Regression - Important Metrics
        43. Logistic Regression - Variable Selection Methods
        44. Logistic Regression - Model Development
        45. Logistic Regression - Model Validation
        46. Logistic Regression - Model Performance
        47. Logistic Regression - Model Implementation
        48. Decision Tree - How it works
        49. Decision Tree - Model Development
        50. Decision Tree - Model Validation
        51. Decision Tree - Model Performance
        52. Decision Tree - Model Implementation
        53. Machine Learning - Basics
        54. Random Forest - How it works
        55. Random Forest vs. Decision Tree
        56. Random Forest - Model Development and Validation
        57. Time Series Forecasting - Theory
        58. Time Series Analysis with R
        59. Special Cases - Handle rare event model
        60. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        61. Case Studies - Customer Segmentation
        62. Case Studies - Probability of Default
        63. Case Studies - HR Drivers Analysis
        64. Case Studies - Sales Forecasting
        65. Case Studies - Time Series Forecasting
        66. Interview Tips - Common Interview Questions

        Online SAS, R & Predictive Modeling Training

        First Online Training Batch
        We are starting our first Instructor-led online training batch from July 15, 2017. We are offering courses on SAS , R and Predictive Modeling. In this program you will get an access to live lectures plus pre-recorded videos from any part of the world via web conference mode.
        1. Practical SAS Programming - Learning SAS by Case Studies
        2. Predictive Modeling with SAS - Modeling with Hands-on Examples plus Domain Knowledge
3. Data Science using R - Practical Data Science Course (Incl. R Programming, Data Science and Domain Knowledge)

        4. Batch : 15th July, Saturday       Mode : Live Instructor-led


        Practical SAS Programming
        Rs 20,000
        (325 USD)
        Special Price : Rs 16,000 (260USD) till 15th June
        • Base and Advanced SAS Programming
        • Instructor-led live class + Pre-recorded videos
        •          Live Projects                  
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Duration : 10 Weeks
        • Limited Seats Available
        • Weekend Classes
        • Money Back Guarantee
        Indian Users

        All Users (Except India)


        Predictive Modeling using SAS
        Rs 25,000
        (400 USD)
        Special Price : Rs 20,000 (320USD) till 15th June
        • Predictive Modeling + Intro to SAS Programming
        • Instructor-led live class + Pre-recorded videos
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Duration : 12 Weeks
        • Limited Seats Available
        • Weekend Classes
        • Money Back Guarantee
        Indian Users

        All Users (Except India)


        R Programming + Data Science with R 
        Rs 30,000
        (480 USD)
        Special Price : Rs 24,000 (380USD) till 15th June
        • R Programming + Predictive Modeling
        • Instructor-led live class + Pre-recorded videos
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Duration : 13 Weeks
        • Limited Seats Available
        • Weekend Classes
        • Money Back Guarantee
        Indian Users

        All Users (Except India)



        Combo Deals - Spend Less, Learn More
        Pay only Rs 50,000 on purchase of all three courses
        Pay only Rs 40,000 on purchase of any of two courses
        **Offer applies on registration before 15th June, 2017

        When do I need to pay full course fee?
Now you just need to pay Rs 1,000 (15 USD) to reserve your seat. Hit the 'Pay & Reserve your seat' button above in the table. You can pay the remaining balance any time until 15th July, 2017 to get access to the live lectures and pre-recorded videos.
        Money Back Guarantee : If you do not like our training, you can ask for 100% course fees refund after your first live session.

        How We are different from other institutes?
Here are some of the features of ListenData that make us better than other training institutes.
        1. Explain Advanced Statistical and Machine Learning Algorithms in Simple English. We make classes more logical and understandable than just telling concepts.
        2. Practical Application of Techniques using Real-world Datasets. No sample or cleaned dataset.
        3. Domain Knowledge - It is the most important element of a predictive modeling project. People who lack in domain knowledge find it difficult to crack interviews in spite of having knowledge of predictive modeling.
        4. Hands-on Model Development and Validation
        5. Strategies to implement predictive model
        6. New algorithms to solve problems efficiently
        7. Explain complex topics via visual lessons

        Who should do these courses?
        These courses are ideal for candidates who want to make a career in analytics.
        1. Any candidate pursuing graduation / post graduation or already graduate can apply for this course. No particular specialization is required prior to applying for these courses. You can be from any educational background like Engineering, Economics, Statistics, Mathematics, Commerce, Business Management, Operational Research etc.
        2. Anyone who is planning a career shift to analytics. It does not matter if you are a network engineer or financial analyst. You can go ahead with these courses as they do not require any prior knowledge of programming or statistics.

Every training institute promises a job. Why should I trust you?
        Let's be honest! It's a universal fact that no college or training institute can provide 100% job guarantee. If they are claiming 100% job guarantee, they are luring learners by false promises. Even IITs do not hit 100% score. Some Facts - Only 66% of IITians landed a job offer via campus recruitment in 2016-17, as against 79% in 2015-16 and 78% in 2014-15, according to HRD ministry.

        Let me list down the common reasons why people don't get jobs in analytics industry even after completing training from some colleges / institutes -
        1. No hands-on experience
        2. No domain knowledge
        3. No theoretical knowledge of statistical concepts
        4. Poor analytical skill
        The objective of this program is to cover the above first three points in detail. In addition we provide job placement assistance to all students.

        The decline of SAS Jobs and rise of R?
I have been working in SAS for close to 7 years and have worked with 4 organizations (instability in career! :D). Whenever I look for a job change, I do not see any decline of SAS jobs in the market. Don't trust me? Go to job portals and search 'SAS'! List of Companies using SAS. It is a big hit in banks, insurance, telecom and pharmaceutical companies. SAS is still a world leader in advanced analytics and has over 40,000 customers worldwide. It has consistently been tagged a 'leader' in advanced analytics platforms in the Gartner 2015 and 2016 reports. It is one of the most sought-after skills in the job market. Learning SAS will help you scale up your skills, which in turn boosts your career.

        At the same time, R has gained popularity. It is a language of choice for data scientists. It makes advanced statistical techniques and machine learning algorithms easy to implement. It is being used as a primary tool in IT, ecommerce, startups, HR, service and product based companies and secondary tool in banks, insurance and telecom companies. List of Companies using R

        Final Comment - You should not get into language wars and should focus on learning both the languages as jobs are evolving very fast. Companies prefer candidates who know both SAS & R.

        In case if I miss any live session?
        Every class is recorded. We will provide you recording of every session.

        I never studied Programming or Statistics during graduation. Can I still apply for this course?
        Yes, these courses are designed to keep in mind the needs of non-programmers/non-statisticians. Only prerequisite is hard work and zeal for learning.

        Is my registration fees refundable?
100% refundable. In case you want to opt out of the course, you can ask for a 100% refund within 7 days of registration. If you want to continue, it would be automatically adjusted against the total fees. In other words, you pay Rs 1000 (15 USD) less than the total fees.

        Any Questions?
        Please feel free to write me at deepanshu.bhalla@outlook.com OR Join me on linkedin

        Predictive Modeling using SAS & R




        Curriculum - Practical SAS Programming
        1. Introduction to SAS
        2. How SAS works
        3. Import Raw Data Files - Basics
        4. Import Raw Data Files - Special Cases
        5. Importing / Exporting Data with Procedures
        6. Exploring Data - Various Methods
        7. Data Subsetting
        8. Data Manipulation - Basics
        9. Data Manipulation - Intermediate
        10. Data Manipulation - Advanced
        11. Do Loops and Arrays
        12. Merging Data
        13. Appending Data
        14. Character & Numeric Functions
        15. Date Functions
        16. Reporting - Creating tabular reports
        17. Proc SQL - Part I
        18. Proc SQL - Part II
        19. Proc SQL - Part III
        20. SAS Macros - Basics
        21. SAS Macros - Intermediate
        22. SAS Macros - Advanced
        23. SAS Macros - Debugging Tips
        24. Efficient SAS Programming Tips
        25. Connect to Databases using SAS
        26. Interview Tips - Scenario Based Questions
        27. Live Project




        Curriculum - Predictive Modeling using SAS
        1. Introduction to Statistics & Modeling
        2. Marketing Analytics : Applications
        3. Predictive Modeling in Financial Services Industry
        4. Predictive Modeling in HR
        5. SAS Programming - Basics
        6. SAS Programming - Intermediate
        7. Descriptive Statistics with SAS
        8. Hypothesis Testing with SAS
        9. Correlation Analysis with SAS
        10. Steps of Predictive Modeling
        11. Data Preparation in Predictive Modeling
        12. Variable Selection Methods in Predictive Modeling
        13. Segmentation - Introduction
        14. Segmentation - Cluster Analysis : Theory
        15. Segmentation - Cluster Analysis : Data Preparation
        16. Segmentation - Cluster Analysis : k-means and Hierarchical
        17. Segmentation - Cluster Analysis : Cluster Performance
        18. Principal Component Analysis (PCA) - Theory
        19. Running and Understanding PCA with SAS
        20. Linear Regression - Theory
        21. Linear Regression - Assumptions and Treatment
        22. Linear Regression - Important Metrics
        23. Linear Regression - Variable Selection Methods
        24. Linear Regression - Model Development
        25. Linear Regression - Model Validation
        26. Linear Regression - Model Performance
        27. Linear Regression - Model Scoring
        28. Linear Regression - Model Implementation
        29. Logistic Regression - Theory
        30. Logistic Regression - Assumptions and Treatment
        31. Logistic Regression - Important Metrics
        32. Logistic Regression - Variable Selection Methods
        33. Logistic Regression - Model Development
        34. Logistic Regression - Model Validation
        35. Logistic Regression - Model Performance
        36. Logistic Regression - Model Implementation
        37. Decision Tree - How it works
        38. Decision Tree - Model Development
        39. Decision Tree - Model Validation
        40. Decision Tree - Model Performance
        41. Decision Tree - Model Implementation
        42. Time Series Forecasting - Theory
        43. Time Series Analysis with SAS
        44. Special Cases - Handle rare event model
        45. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        46. Case Studies - Customer Segmentation
        47. Case Studies - Probability of Default
        48. Case Studies - HR Drivers Analysis
        49. Case Studies - Sales Forecasting
        50. Case Studies - Time Series Forecasting
        51. Interview Tips - Common Interview Questions



        Curriculum - R Programming + Data Science with R
        1. Introduction to R
        2. Introduction to RStudio
        3. Data Structures in R
        4. Importing / Exporting Data in R
        5. Data Exploration
        6. Data Manipulation with dplyr package - Basics
        7. Data Manipulation with dplyr package - Intermediate
        8. Data Manipulation with dplyr package - Advanced
        9. Character and Numeric Functions in R
        10. Data & Time Functions in R
        11. Data Visualization in R
        12. Loops in R (Apply Family of Functions & For Loop)
        13. R Functions - Part I
        14. R Functions - Part II
        15. Introduction to Data Science
        16. Marketing Analytics : Applications
        17. Predictive Modeling in Financial Services Industry
        18. Predictive Modeling in HR
        19. Hypothesis Testing with R
        20. Correlation Analysis with R
        21. Steps of Predictive Modeling
        22. Data Preparation in Predictive Modeling
        23. Variable Selection Methods in Predictive Modeling
        24. Segmentation - Introduction
        25. Segmentation - Cluster Analysis : Theory
        26. Segmentation - Cluster Analysis : Data Preparation
        27. Segmentation - Cluster Analysis : k-means and Hierarchical
        28. Segmentation - Cluster Analysis : Cluster Performance
        29. Principal Component Analysis (PCA) - Theory
        30. Running and Understanding PCA with R
        31. Linear Regression - Theory
        32. Linear Regression - Assumptions and Treatment
        33. Linear Regression - Important Metrics
        34. Linear Regression - Variable Selection Methods
        35. Linear Regression - Model Development
        36. Linear Regression - Model Validation
        37. Linear Regression - Model Performance
        38. Linear Regression - Model Scoring
        39. Linear Regression - Model Implementation
        40. Logistic Regression - Theory
        41. Logistic Regression - Assumptions and Treatment
        42. Logistic Regression - Important Metrics
        43. Logistic Regression - Variable Selection Methods
        44. Logistic Regression - Model Development
        45. Logistic Regression - Model Validation
        46. Logistic Regression - Model Performance
        47. Logistic Regression - Model Implementation
        48. Decision Tree - How it works
        49. Decision Tree - Model Development
        50. Decision Tree - Model Validation
        51. Decision Tree - Model Performance
        52. Decision Tree - Model Implementation
        53. Machine Learning - Basics
        54. Random Forest - How it works
        55. Random Forest vs. Decision Tree
        56. Random Forest - Model Development
        57. Random Forest - Model Validation
        58. Random Forest - How it works
        59. Gradient Boosting - How it works
        60. Gradient Boosting - Model Development
        61. Gradient Boosting - Model Validation
        62. Support Vector Machine - How it works
        63. Support Vector Machine - Model Development
        64. Support Vector Machine - Model Validation
        65. Ensemble Stacking / Blending
        66. Time Series Forecasting - Theory
        67. Time Series Analysis with R
        68. Special Cases - Handle rare event model
        69. Text Mining Basics & Applications
        70. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        71. Case Studies - Customer Segmentation
        72. Case Studies - Probability of Default
        73. Case Studies - HR Drivers Analysis
        74. Case Studies - Sales Forecasting
        75. Case Studies - Time Series Forecasting
        76. Interview Tips - Common Interview Questions

        Python Data Structures

This post explains the data structures used in Python. It is essential to understand the data structures of a programming language. In Python, there are several built-in data structures. They are as follows :
        1. strings
        2. lists
        3. tuples
        4. dictionaries
        5. sets

        Python Data Structures

        1. Strings

        Python String is a sequence of characters.

        How to create a string in Python

You can create a Python string using single or double quotes.
        mystring = "Hello Python3.6"
        print(mystring)
        Output:
        Hello Python3.6

Can I use multiple (triple) single or double quotes to define a string?

The answer is yes. See the examples below -

        Multiple Single Quotes
        mystring = '''Hello Python3.6'''
        print(mystring)
        Output:
        Hello Python3.6
        Multiple Double Quotes
        mystring = """Hello Python3.6"""
        print(mystring)
        Output:
        Hello Python3.6

        How to include quotes within a string?
mystring = 'Hello"Python"'
        print(mystring)
        Output:
        Hello"Python"

        How to extract Nth letter or word?

        You can use the syntax below to get first letter.
        mystring = 'Hi How are you?'
        mystring[0]
        Output
        'H'
mystring[0] refers to the first letter, as indexing in Python starts from 0. Similarly, mystring[1] refers to the second letter.

To pull the last letter, you can use -1 as the index.
        mystring[-1]

To get the first word
mystring.split(' ')[0]
Output : Hi

        How it works -

1. mystring.split(' ') tells Python to use a space as the delimiter.

        Output : ['Hi', 'How', 'are', 'you?']

2. mystring.split(' ')[0] tells Python to pick the first word of the string.


        2. List

Unlike a string, a list can contain different types of objects such as integers, floats, strings etc.
1. x = [142, 124, 234, 345, 465]
2. y = ['A', 'C', 'E', 'M']
3. z = ['AA', 44, 5.1, 'KK']

        Get List Item

We can extract a list item using indexes. The index starts at 0 and ends at (number of elements - 1).
        k = [124, 225, 305, 246, 259]
        k[0]
        k[1]
        k[-1]
        k[0]
        124

        k[1]
        225

        k[-1]
        259

        Explanation :
k[0] picks the first element from the list. The negative sign tells Python to count from the right, so k[-1] selects the last element of the list.

To select multiple elements from a list, you can use slicing :
        k[:3] returns [124, 225, 305]
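A few more slice patterns on the same list, in case k[:3] above is the first slice you have seen (this is standard Python slicing, added here as an illustration):
k[1:4]   # [225, 305, 246] - elements at index 1, 2 and 3
k[-2:]   # [246, 259]      - last two elements
k[::2]   # [124, 305, 259] - every second element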

        Add 5 to each element of a list

In the program below, the len() function is used to count the number of elements in a list; in this case it returns 5. With the help of the range() function, range(5) returns 0, 1, 2, 3, 4.
        x = [1, 2, 3, 4, 5]
        for i in range(len(x)):
            x[i] = x[i] + 5
        print(x)
        [6, 7, 8, 9, 10]

        It can also be written like this -
        for i in range(len(x)):
           x[i] += 5
        print(x)
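The same result can also be produced with a list comprehension, which is the more idiomatic way to build a new list in Python (shown here only as an alternative to the loop above):
x = [1, 2, 3, 4, 5]
x = [i + 5 for i in x]
print(x)
[6, 7, 8, 9, 10]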

        Combine / Join two lists

The '+' operator concatenates two lists.
        X = [1, 2, 3]
        Y = [4, 5, 6]
        Z = X + Y
        print(Z)
        [1, 2, 3, 4, 5, 6]

Sum of values of two lists
X = [1, 2, 3]
Y = [4, 5, 6]
import numpy as np
Z = np.add(X, Y)
print(Z)
[5 7 9]
Similarly, you can use np.multiply(X, Y) to multiply the values of two lists element-wise.

        Repeat List N times

The '*' operator repeats a list N times.
        X = [1, 2, 3]
        Z = X * 3
        print(Z)
        [1, 2, 3, 1, 2, 3, 1, 2, 3]

Note : The '+' (join) and '*' (repeat) operators shown above also work for lists of strings.
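For example (a small illustration added here, using the same operators):
A = ['AA', 'BB']
B = ['CC']
print(A + B)
['AA', 'BB', 'CC']
print(A * 2)
['AA', 'BB', 'AA', 'BB']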

        Modify / Replace a list item

Suppose you need to replace the third value with a different value.
X = [1, 2, 3]
X[2] = 5
print(X)
[1, 2, 5]

        Add / Remove a list item

We can add a list item by using the append() method.
        X = ['AA', 'BB', 'CC']
        X.append('DD')
        print(X)
        Result : ['AA', 'BB', 'CC', 'DD']

Similarly, we can remove a list item by using the remove() method.
        X = ['AA', 'BB', 'CC']
        X.remove('BB')
        print(X)
        Result : ['AA', 'CC']

        Sort list
        k = [124, 225, 305, 246, 259]
        k.sort()
        print(k)
        Output : [124, 225, 246, 259, 305]


        3. Tuple

Like a list, a tuple can also contain mixed data. But a tuple cannot be changed or altered once created, whereas a list can be modified. Another difference is that a tuple is created inside parentheses ( ), whereas a list is created inside square brackets [ ].

        Examples
        mytuple = (123,223,323)
        City = ('Delhi','Mumbai','Bangalore')
        Perform for loop on Tuple
        for i in City:
            print(i)
        Delhi
        Mumbai
        Bangalore

        Tuple cannot be altered

        Run the following command and check error
        X = (1, 2, 3)
        X[2]=5
        TypeError:'tuple' object does not support item assignment
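If you do need to change a value, the usual workaround (standard Python, not from the original text) is to convert the tuple to a list, modify it, and convert it back:
X = (1, 2, 3)
Y = list(X)
Y[2] = 5
X = tuple(Y)
print(X)
(1, 2, 5)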

        4. Dictionary

It works like an address book: you can find the address of a person by searching for their name. In this example, the name of a person is the key and the address is the value. It is important to note that keys must be unique while values need not be. If a key is duplicated, you cannot reliably find the values associated with it. Keys can be of any immutable data type such as strings, numbers, or tuples.

        Create a dictionary

        It is defined in curly braces {}. Each key is followed by a colon (:) and then values.
        teams = {'Dave' : 'team A',
                 'Tim' : 'team B',
                 'Babita' : 'team C',
                 'Sam' : 'team B',
                 'Ravi' : 'team C'
                }

        Find Values
        teams['Sam']
        Output :'team B'

        Delete an item
        del teams['Ravi']

        Add an item
        teams['Deep'] = 'team B'
        Output :
        {'Babita': 'team C',
         'Dave': 'team A',
         'Deep': 'team B',
         'Sam': 'team B',
         'Tim': 'team B'}

        5. Sets

Sets are unordered collections of unique objects.
        X = set(['A', 'B', 'C'])

        Q. Does 'A' exist in set X?
        'A' in X
        Result : True

        Q. Does 'D' exist in set X?
        'D' in X
        Result : False

        Q. How to add 'D' in set X?
        X.add('D')
        Q. How to remove 'C' from set X?
        X.remove('C')
        Q. How to create a copy of set X?
        Y = X.copy()
        Q. Which items are common in both sets X and Y?
        Y & X

        Predictive Modeling using SAS & R Online Training

        First Online Training Batch
The next instructor-led online training batch will commence on October 8, 2017. We are offering courses on SAS, R and Predictive Modeling. In this program you will get access to live lectures plus recorded videos from any part of the world via web conference mode. You can also chat or even ask your questions verbally over VoIP in real time to get your doubts cleared.
        1. Practical SAS Programming - Learning SAS by Case Studies
        2. Predictive Modeling with SAS - Modeling with Hands-on Examples plus Domain Knowledge
        3. Data Science using R - Practical Data Science Course (Incld. R Programming, Data Science and Domain Knowledge)

        4. Batch : 8th October, Sunday     Mode : Live Instructor-led


        Practical SAS Programming
        Rs 20,000
        ($325)
        • Special Price : Get 10% off till 25th Sept, 2017
        • Base and Advanced SAS Programming
        • Instructor-led live class + Recorded videos
        • Duration : 8 Weeks (100 hours)
        • Live Projects + Scenario-Based Questions
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        Indian Users
        All Users (Except India)


        Predictive Modeling using SAS
        Rs 25,000
        ($400)
        • Special Price : Get 10% off till 25th Sept, 2017
        • Predictive Modeling with SAS                    
        • Instructor-led live class + Recorded videos
        • Duration : 8 - 10 Weeks (100 hours)
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        Indian Users
        All Users (Except India)


        R Programming + Data Science with R 
        Rs 30,000
        ($480)
        • Special Price : Get 10% off till 25th Sept, 2017
        • R Programming + Predictive Modeling with R
        • Instructor-led live class + Recorded videos
        • Duration : 10-12 Weeks (120 hours)
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        Indian Users
        All Users (Except India)



        Combo Deals - Spend Less, Learn More
        Pay only Rs 35,000 ($600) on purchase of 'Practical SAS Programming' and 'Predictive Modeling with SAS' courses
        Offer expires on 25th September,2017
        Enroll Now - Indian Users 
        Enroll Now - All Users (Except India)

        What is Instructor-led live program?
It is an interactive training program. Learners get access to live lectures via webinar mode and can chat or even ask their questions verbally over VoIP in real time to get their doubts cleared. You can also go through the video recording if you miss a class.

        Money Back Guarantee?
If you do not like our training, you can ask for a 100% course fee refund after your first live session. No questions asked refund policy!

        What is the duration of these programs?
These are weekend programmes comprising 100-130 hours. Classes will be held every Saturday and Sunday. The course duration is as follows -
        1. Practical SAS Programming - 100 hours (At least 50 hours live training + 5 hours video based training + ~60 hours of Practice and Self Study)
        2. Predictive Modeling with SAS - 100 hours (Includes hours of Video based training and Practice and Self Study)
        3. Data Science with R - 120 hours (At least 60 hours live training + 7 hours video based training + ~80 hours of Practice and Self Study)

        If I opt for all the 3 courses, will classes be scheduled at the same time?
All classes will be scheduled on weekends but not at the same time; they will run one after another. For example, if class A gets over at 5, the next class will start at 6.

        How We are different from other institutes?
Here are some of the features of ListenData that make us better than other training institutes.
        1. Explain Advanced Statistical and Machine Learning Algorithms in Simple English. We make classes more logical and understandable than just telling concepts.
        2. Practical Application of Techniques using Real-world Datasets. No sample or cleaned dataset.
        3. Domain Knowledge - It is the most important element of a predictive modeling project. People who lack in domain knowledge find it difficult to crack interviews in spite of having knowledge of predictive modeling.
        4. Hands-on Model Development and Validation
        5. Strategies to implement predictive model
        6. New algorithms to solve problems efficiently
        7. Explain complex topics via visual lessons

        Who should do these courses?
These courses are ideal for candidates who want to make a career in analytics.
1. Any candidate pursuing or holding a graduate / post-graduate degree can apply for these courses. No particular specialization is required prior to applying. You can be from any educational background such as Engineering, Economics, Statistics, Mathematics, Commerce, Business Management, Operational Research etc.
2. Anyone who is planning a career shift to analytics. It does not matter if you are a network engineer or a financial analyst. You can go ahead with these courses as they do not require any prior knowledge of programming or statistics.

Every training institute promises jobs. Why should I trust you?
Let's be honest! It's a universal fact that no college or training institute can provide a 100% job guarantee. If they claim a 100% job guarantee, they are luring learners with false promises. Even IITs do not hit a 100% score. Some facts - only 66% of IITians landed a job offer via campus recruitment in 2016-17, as against 79% in 2015-16 and 78% in 2014-15, according to the HRD ministry.

Let me list down the common reasons why people don't get jobs in the analytics industry even after completing training from some colleges / institutes -
        1. No hands-on experience
        2. No domain knowledge
        3. No theoretical knowledge of statistical concepts
        4. Poor analytical skill
The objective of this program is to cover the first three points above in detail. In addition, we provide job placement assistance to all students and will keep you informed about current openings in the analytics industry. We are in constant contact with job consultancy firms and maintain a solid network of analytics professionals.

        The decline of SAS Jobs and rise of R?
I have been working in SAS for close to 7 years and have worked with 4 organizations (instability in career! :D). Whenever I look for a job change, I do not see any decline in SAS jobs in the market. Don't trust me? Go to job portals and search 'SAS'! List of Companies using SAS It is a big hit in banks, insurance, telecom and pharmaceutical companies. SAS is still a world leader in advanced analytics and has over 40,000 customers worldwide. It has consistently been tagged a 'leader' in advanced analytics platforms as per the Gartner 2015 and 2016 reports. It is one of the most sought-after skills in the job market. Learning SAS will help you scale up your skills, which in turn boosts your career.

At the same time, R has gained popularity. It is a language of choice for data scientists. It makes advanced statistical techniques and machine learning algorithms easy to implement. It is used as a primary tool in IT, ecommerce, startups, HR, and service and product based companies, and as a secondary tool in banks, insurance and telecom companies. List of Companies using R

Final Comment - You should not get into language wars; focus on learning both languages, as jobs are evolving very fast. Companies prefer candidates who know both SAS & R.

What if I miss a live session?
Every class is recorded. We will provide you with the recording of every session.

        I never studied Programming or Statistics during graduation. Can I still apply for this course?
Yes, these courses are designed keeping in mind the needs of non-programmers / non-statisticians. The only prerequisites are hard work and a zeal for learning.

        Is my registration fees refundable?
It is 100% refundable. In case you want to opt out of the course for any reason, you can ask for a 100% refund within 7 days of registration. If you want to continue, the registration fee is automatically adjusted against the total fees. In other words, you pay $15 (Rs 1,000) less than the total fees.

        About Instructor
Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. He has worked with companies like Aon, Cognizant, Genpact and RBS, and has handled global clients in various domains like retail and commercial banking, telecom, HR and automotive. He has worked extensively on various data science projects such as Customer Attrition, Customer Lifetime Value Models, Propensity Models, Opinion / Sentiment Mining, Geo Analytics, Credit Risk Scorecards, Portfolio Optimization, Pricing Analytics, Cross-sell / Up-sell Campaign Models, Survey Analytics, Customer Segmentation, Market Benchmarking, Employee Attrition, Employee Engagement etc.

        Any Questions?
        Please feel free to write me at deepanshu.bhalla@outlook.com OR Join me on linkedin

        Predictive Modeling using SAS & R Training




        Curriculum - Practical SAS Programming
        1. Introduction to SAS
        2. How SAS works
        3. Import Raw Data Files - Basics
        4. Import Raw Data Files - Special Cases
        5. Importing / Exporting Data with Procedures
        6. Exploring Data - Various Methods
        7. Data Subsetting
        8. Data Manipulation - Basics
        9. Data Manipulation - Intermediate
        10. Data Manipulation - Advanced
        11. Do Loops and Arrays
        12. Merging Data
        13. Appending Data
        14. Character & Numeric Functions
        15. Date Functions
        16. Reporting - Creating tabular reports
        17. Proc SQL - Part I
        18. Proc SQL - Part II
        19. Proc SQL - Part III
        20. SAS Macros - Basics
        21. SAS Macros - Intermediate
        22. SAS Macros - Advanced
        23. SAS Macros - Debugging Tips
        24. Efficient SAS Programming Tips
        25. Connect to Databases using SAS
        26. Interview Preparation - Scenario Based Questions
        27. Live Project




        Curriculum - Predictive Modeling using SAS
        1. Introduction to Statistics & Modeling
        2. Marketing Analytics : Applications
        3. Predictive Modeling in Financial Services Industry
        4. Predictive Modeling in HR
        5. SAS Programming - Basics
        6. SAS Programming - Intermediate
        7. Descriptive Statistics with SAS
        8. Hypothesis Testing with SAS
        9. Correlation Analysis with SAS
        10. Steps of Predictive Modeling
        11. Data Preparation in Predictive Modeling
        12. Variable Selection Methods in Predictive Modeling
        13. Segmentation - Introduction
        14. Segmentation - Cluster Analysis : Theory
        15. Segmentation - Cluster Analysis : Data Preparation
        16. Segmentation - Cluster Analysis : k-means and Hierarchical
        17. Segmentation - Cluster Analysis : Cluster Performance
        18. Principal Component Analysis (PCA) - Theory
        19. Running and Understanding PCA with SAS
        20. Linear Regression - Theory
        21. Linear Regression - Assumptions and Treatment
        22. Linear Regression - Important Metrics
        23. Linear Regression - Variable Selection Methods
        24. Linear Regression - Model Development
        25. Linear Regression - Model Validation
        26. Linear Regression - Model Performance
        27. Linear Regression - Model Scoring
        28. Linear Regression - Model Implementation
        29. Logistic Regression - Theory
        30. Logistic Regression - Assumptions and Treatment
        31. Logistic Regression - Important Metrics
        32. Logistic Regression - Variable Selection Methods
        33. Logistic Regression - Model Development
        34. Logistic Regression - Model Validation
        35. Logistic Regression - Model Performance
        36. Logistic Regression - Model Implementation
        37. Decision Tree - How it works
        38. Decision Tree - Model Development
        39. Decision Tree - Model Validation
        40. Decision Tree - Model Performance
        41. Decision Tree - Model Implementation
        42. Time Series Forecasting - Theory
        43. Time Series Analysis with SAS
        44. Special Cases - Handle rare event model
        45. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        46. Case Studies - Customer Segmentation
        47. Case Studies - Probability of Default
        48. Case Studies - Employee Attrition
        49. Case Studies - Time Series Forecasting
        50. Interview Tips - Common Interview Questions



        Curriculum - R Programming + Data Science with R
        1. Introduction to R
        2. Introduction to RStudio
        3. Data Structures in R
        4. Importing / Exporting Data in R
        5. Data Exploration
        6. Data Manipulation with dplyr package - Basics
        7. Data Manipulation with dplyr package - Intermediate
        8. Data Manipulation with dplyr package - Advanced
        9. Character and Numeric Functions in R
10. Date & Time Functions in R
        11. Data Visualization in R
        12. Loops in R (Apply Family of Functions & For Loop)
        13. R Functions - Part I
        14. R Functions - Part II
        15. Introduction to Data Science
        16. Marketing Analytics : Applications
        17. Predictive Modeling in Financial Services Industry
        18. Predictive Modeling in HR
        19. Hypothesis Testing with R
        20. Correlation Analysis with R
        21. Steps of Predictive Modeling
        22. Data Preparation in Predictive Modeling
        23. Variable Selection Methods in Predictive Modeling
        24. Segmentation - Introduction
        25. Segmentation - Cluster Analysis : Theory
        26. Segmentation - Cluster Analysis : Data Preparation
        27. Segmentation - Cluster Analysis : k-means and Hierarchical
        28. Segmentation - Cluster Analysis : Cluster Performance
        29. Principal Component Analysis (PCA) - Theory
        30. Running and Understanding PCA with R
        31. Linear Regression - Theory
        32. Linear Regression - Assumptions and Treatment
        33. Linear Regression - Important Metrics
        34. Linear Regression - Variable Selection Methods
        35. Linear Regression - Model Development
        36. Linear Regression - Model Validation
        37. Linear Regression - Model Performance
        38. Linear Regression - Model Scoring
        39. Linear Regression - Model Implementation
        40. Logistic Regression - Theory
        41. Logistic Regression - Assumptions and Treatment
        42. Logistic Regression - Important Metrics
        43. Logistic Regression - Variable Selection Methods
        44. Logistic Regression - Model Development
        45. Logistic Regression - Model Validation
        46. Logistic Regression - Model Performance
        47. Logistic Regression - Model Implementation
        48. Decision Tree - How it works
        49. Decision Tree - Model Development
        50. Decision Tree - Model Validation
        51. Decision Tree - Model Performance
        52. Decision Tree - Model Implementation
        53. Machine Learning - Basics
        54. Random Forest - How it works
        55. Random Forest vs. Decision Tree
        56. Random Forest - Model Development
        57. Random Forest - Model Validation
        58. Random Forest - How it works
        59. Gradient Boosting - How it works
        60. Gradient Boosting - Model Development
        61. Gradient Boosting - Model Validation
        62. Support Vector Machine - How it works
        63. Support Vector Machine - Model Development
        64. Support Vector Machine - Model Validation
        65. Ensemble Stacking / Blending
        66. Time Series Forecasting - Theory
        67. Time Series Analysis with R
        68. Special Cases - Handle rare event model
        69. Text Mining Basics & Applications
        70. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        71. Case Studies - Customer Segmentation
        72. Case Studies - Probability of Default
        73. Case Studies - HR Drivers Analysis
        74. Case Studies - Sales Forecasting
        75. Case Studies - Time Series Forecasting
        76. Interview Tips - Common Interview Questions

        K Nearest Neighbor : Step by Step Tutorial

In this article, we will cover how the K-nearest neighbor (KNN) algorithm works and how to run KNN in R. It is one of the most widely used algorithms for classification problems.

        K-Nearest Neighbor Simplified

        Introduction to K-Nearest Neighbor (KNN)

KNN is a non-parametric supervised learning technique in which we try to classify a data point into a given category with the help of a training set. In simple words, it captures information from all training cases and classifies new cases based on similarity.
Predictions are made for a new instance (x) by searching through the entire training set for the K most similar cases (neighbors) and summarizing the output variable for those K cases. In classification, this is the mode (or most common) class value.

        How KNN algorithm works

Suppose we have the height, weight and T-shirt size of some customers and we need to predict the T-shirt size of a new customer given only his height and weight. The data, including height, weight and T-shirt size, is shown below -

Height (in cms)   Weight (in kgs)   T-Shirt Size
158               58                M
158               59                M
158               63                M
160               59                M
160               60                M
163               60                M
163               61                M
160               64                L
163               64                L
165               61                L
165               62                L
165               65                L
168               62                L
168               63                L
168               66                L
170               63                L
170               64                L
170               68                L


        Step 1 : Calculate Similarity based on distance function

        There are many distance functions but Euclidean is the most commonly used measure. It is mainly used when data is continuous. Manhattan distance is also very common for continuous variables.

        Distance Functions

The idea behind using a distance measure is to compute the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of height and weight.

        New customer named 'Monica' has height 161cm and weight 61kg.

The Euclidean distance between the first observation and the new observation (Monica) is as follows -
=SQRT((161-158)^2+(61-58)^2) = 4.24
Similarly, we calculate the distance of the new case from all the training cases and rank the training cases by distance. The smallest distance value is ranked 1 and considered the nearest neighbor. A minimal R sketch of this step is shown below.
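The snippet below is a minimal base R sketch (not part of the original tutorial) that reproduces this step on the T-shirt data shown above: it computes the Euclidean distance of every training case from Monica (161 cm, 61 kg) and ranks the cases by distance.
# Build the training data shown in the table above
train <- data.frame(
  height = c(158,158,158,160,160,163,163,160,163,165,165,165,168,168,168,170,170,170),
  weight = c(58,59,63,59,60,60,61,64,64,61,62,65,62,63,66,63,64,68),
  size   = c("M","M","M","M","M","M","M","L","L","L","L","L","L","L","L","L","L","L"))

# Euclidean distance of each training case from the new customer (Monica)
train$distance <- sqrt((train$height - 161)^2 + (train$weight - 61)^2)

# Rank 1 = nearest neighbor
train$rank <- rank(train$distance, ties.method = "first")
head(train[order(train$rank), ])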

        Step 2 : Find K-Nearest Neighbors

Let k be 5. The algorithm then searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and sees what categories those 5 customers were in. If 4 of them had 'Medium' T-shirt sizes and 1 had a 'Large' T-shirt size, then your best guess for Monica is 'Medium'. See the calculation shown in the snapshot below -

        Calculate KNN manually

In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange: 'Medium T-shirt size' is shown in blue and 'Large T-shirt size' in orange. The new customer is shown as a yellow circle. Four blue data points and one orange data point are close to the yellow circle, so the prediction for the new case is the blue class, i.e. Medium T-shirt size.
        KNN: Visual Representation


        Assumptions of KNN

        1. Standardization

When independent variables in the training data are measured in different units, it is important to standardize the variables before calculating distance. For example, if one variable is based on height in cms and the other on weight in kgs, then height will influence the distance calculation more. In order to make them comparable we need to standardize them, which can be done by any of the following methods :

        Standardization

        2. Outlier

A low k value is sensitive to outliers, whereas a higher k value is more resilient to outliers as it considers more voters to decide the prediction.

        Why KNN is non-parametric?

Non-parametric means not making any assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. Similarly, in KNN the number of model parameters actually grows with the training data set - you can imagine each training case as a "parameter" in the model.

KNN vs. K-means

Many people get confused between these two techniques - K-means and K-nearest neighbor. Some of the differences are listed below (a minimal R sketch contrasting the two calls follows the list) -
1. K-means is an unsupervised learning technique (no dependent variable) whereas KNN is a supervised learning algorithm (a dependent variable exists)
2. K-means is a clustering technique which tries to split data points into K clusters such that the points in each cluster tend to be near each other, whereas K-nearest neighbor determines the classification of a point by combining the classifications of the K nearest points
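The following minimal R sketch (not from the original article) contrasts the two calls on the built-in iris data: kmeans() is unsupervised and receives no labels, while class::knn() is supervised and requires the training labels.
library(class)

X <- scale(iris[, 1:4])               # features only, standardized

# Unsupervised: kmeans() just looks for 3 clusters in X
km <- kmeans(X, centers = 3)
head(km$cluster)

# Supervised: knn() needs the class labels of the training rows
idx  <- sample(nrow(X), 100)          # simple train/test split
pred <- knn(train = X[idx, ], test = X[-idx, ],
            cl = iris$Species[idx], k = 5)
table(pred, iris$Species[-idx])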

        Can KNN be used for regression?
Yes, K-nearest neighbor can be used for regression. In other words, the K-nearest neighbor algorithm can be applied when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbors, as sketched below.
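As an illustration only (not the article's code), the base R sketch below implements KNN regression for a single predictor by averaging the target values of the k nearest training cases; the data here are made up.
# Predict a continuous target as the mean of its k nearest neighbors
knn_regress <- function(train_x, train_y, new_x, k = 3) {
  d <- abs(train_x - new_x)              # distance on a single predictor
  mean(train_y[order(d)][1:k])           # average of the k nearest targets
}

heights <- c(158, 160, 163, 165, 168, 170)
weights <- c(58, 60, 61, 63, 64, 66)
knn_regress(heights, weights, new_x = 162, k = 3)   # predicted weight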

        Pros and Cons of KNN

        Pros

        1. Easy to understand
        2. No assumptions about data
        3. Can be applied to both classification and regression
        4. Works easily on multi-class problems

        Cons

1. Memory intensive / computationally expensive
2. Sensitive to the scale of the data
3. Does not work well with a rare event (skewed) target variable
4. Struggles when there is a high number of independent variables
        For any given problem, a small value of k will lead to a large variance in predictions. Alternatively, setting  k to a large value may lead to a large model bias.
        How to handle categorical variables in KNN?

Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike regression, create k dummies instead of (k-1). For example, if a categorical variable named "Department" has 5 unique levels / categories, we create 5 dummy variables. Each dummy variable has 1 against its department and 0 otherwise. A minimal R sketch is shown below.
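A minimal R sketch of this idea (assuming a made-up 'Department' column) uses base R's model.matrix(); removing the intercept with '- 1' keeps one 0/1 column per level, i.e. k dummies rather than k-1.
df <- data.frame(Department = c("HR", "Sales", "IT", "Finance", "Ops", "HR"))

# One dummy column per level (no intercept, so no level is dropped)
dummies <- model.matrix(~ Department - 1, data = df)
head(dummies)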

        How to find best K value?

        Cross-validation is a smart way to find out the optimal K value. It estimates the validation error rate by holding out a subset of the training set from the model building process. 

Cross-validation (let's say 10-fold validation) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The misclassification rate is then computed on the 10% validation data. This procedure repeats 10 times, with a different group of observations treated as the validation set each time. It results in 10 estimates of the validation error, which are then averaged out. A quick sketch of this idea is shown below; the caret-based repeated cross-validation used later in this article automates the same process.
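The sketch below is only an illustration of this idea (it is not the caret workflow used later in this article): class::knn.cv() performs leave-one-out cross-validation, so we can compare the cross-validated accuracy across candidate values of k on the built-in iris data.
library(class)

X <- scale(iris[, 1:4])
y <- iris$Species

# Leave-one-out CV accuracy for k = 1, 2, ..., 15
accuracy <- sapply(1:15, function(k) mean(knn.cv(train = X, cl = y, k = k) == y))
which.max(accuracy)    # k with the highest cross-validated accuracy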

        K Nearest Neighbor in R

We are going to use historical data of past win/loss statistics and the corresponding speeches. This dataset comprises 1524 observations on 14 variables. The dependent variable is win/loss, where 1 indicates a win and 0 indicates a loss. The independent variables are:

1. Proportion of words in the speech showing
a. Optimism
b. Pessimism
c. the use of Past
d. the use of Present
e. the use of Future

2. Number of times he/she mentions his/her own party

3. Number of times he/she mentions the opposing parties

4. Some measures indicating the content of the speech showing
a. Openness
b. Conscientiousness
c. Extraversion
d. Agreeableness
e. Neuroticism
f. Emotionality

        Download Link :Data File

        Read Data
        # Read data
        data1 = read.csv("US Presidential Data.csv")
        View(data1)

We read the CSV file with the help of the read.csv command. The first argument is the name of the file. The second argument, header = TRUE (or T), implies that the first row in our CSV file contains the column headings, while header = FALSE (or F) indicates that the data should be read from the first line and does not involve any headings.
        # load library
        library(caret)
        library(e1071)

        # Transforming the dependent variable to a factor
        data1$Win.Loss = as.factor(data1$Win.Loss)
Here we will use the caret package in order to run KNN. Since the dependent variable is numeric here, we need to transform it to a factor using as.factor().
        #Partitioning the data into training and validation data
        set.seed(101)
        index = createDataPartition(data1$Win.Loss, p = 0.7, list = F )
        train = data1[index,]
        validation = data1[-index,]
In order to partition the data into training and validation sets, we use the createDataPartition() function from caret.

First we set the seed to 101 so that the same results can be reproduced. In createDataPartition() the first argument is the dependent variable; p denotes how much data we want in the training set (here we take 70% of the data in the training set and the rest in the validation set); list = F denotes that the indices we obtain should be in the form of a vector.
        # Explore data
        dim(train)
        dim(validation)
        names(train)
        head(train)
        head(validation)
The dimensions of the training and validation sets are checked via dim(). See the first 6 rows of the training dataset -

           Win.Loss   Optimism  Pessimism  PastUsed FutureUsed PresentUsed OwnPartyCount
        1 X1 0.10450450 0.05045045 0.4381443 0.4948454 0.06701031 2
        3 X1 0.11257190 0.04930156 0.4159664 0.5168067 0.06722689 1
        5 X1 0.10582640 0.05172414 0.3342618 0.5821727 0.08356546 3
        7 X1 0.09838275 0.06401617 0.3240741 0.6018519 0.07407407 6
        9 X1 0.10610734 0.04688464 0.3633540 0.5372671 0.09937888 2
        10 X1 0.10066128 0.05951506 0.3554817 0.5382060 0.10631229 1
        OppPartyCount NumericContent Extra Emoti Agree Consc Openn
        1 2 0.001877543 4.041 4.049 3.469 2.450 2.548
        3 1 0.002131163 3.463 4.039 3.284 2.159 2.465
        5 4 0.002229220 4.658 4.023 3.283 2.415 2.836
        7 4 0.002251985 3.727 4.108 3.357 2.128 2.231
        9 5 0.002446440 4.119 4.396 3.661 2.572 2.599
        10 2 0.002107436 3.800 4.501 3.624 2.117 2.154

By default, the levels of the dependent variable in this dataset are "0" and "1". Later, when we do prediction, these levels will be used as variable names, so we need to make them valid variable names.
        # Setting levels for both training and validation data
        levels(train$Win.Loss) <- make.names(levels(factor(train$Win.Loss)))
        levels(validation$Win.Loss) <- make.names(levels(factor(validation$Win.Loss)))
Here we are using the repeated cross-validation method via trainControl(). 'number' denotes the number of folds and 'repeats' is for repeated k-fold cross-validation. In this case, 3 separate 10-fold validations are used.
        # Setting up train controls
        repeats = 3
        numbers = 10
        tunel = 10

        set.seed(1234)
        x = trainControl(method = "repeatedcv",
                         number = numbers,
                         repeats = repeats,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

Using the train() function we run our KNN model; Win.Loss is the dependent variable and the full stop after the tilde denotes that all the other variables are used as independent variables. In 'data =' we pass our training set, 'method =' denotes which technique we want to deploy, and setting preProcess to center and scale tells caret to standardize the independent variables:

center : subtract the mean from the values.
scale : divide the values by the standard deviation.

trControl takes our 'x', which was obtained via trainControl(), and tuneLength is an integer which sets how many candidate values of the tuning parameter (here, k) are evaluated.
        model1 <- train(Win.Loss~. , data = train, method = "knn",
                       preProcess = c("center","scale"),
                       trControl = x,
                       metric = "ROC",
                       tuneLength = tunel)

        # Summary of model
        model1
        plot(model1)
        k-Nearest Neighbors 

        1068 samples
        13 predictor
        2 classes: 'X0', 'X1'

        Pre-processing: centered (13), scaled (13)
        Resampling: Cross-Validated (10 fold, repeated 3 times)
        Summary of sample sizes: 961, 962, 961, 962, 961, 962, ...
        Resampling results across tuning parameters:

        k ROC Sens Spec
        5 0.8440407 0.6910182 0.8382051
        7 0.8537506 0.6847658 0.8520513
        9 0.8575183 0.6712350 0.8525796
        11 0.8588422 0.6545296 0.8592152
        13 0.8585478 0.6560976 0.8556333
        15 0.8570397 0.6432249 0.8648329
        17 0.8547545 0.6448509 0.8627894
        19 0.8520574 0.6336043 0.8632867
        21 0.8484632 0.6215447 0.8627894
        23 0.8453320 0.6071622 0.8658664

        ROC was used to select the optimal model using the largest value.
        The final value used for the model was k = 11.

        Cross Validation : Fine Tuning


Finally, to make predictions on our validation set, we use the predict function, in which the first argument is the fitted model and the second argument is the new data on which we want the predictions.

        # Validation
        valid_pred <- predict(model1,validation, type = "prob")

        #Storing Model Performance Scores
        library(ROCR)
        pred_val <-prediction(valid_pred[,2],validation$Win.Loss)

        # Calculating Area under Curve (AUC)
        perf_val <- performance(pred_val,"auc")
        perf_val

        # Plot AUC
        perf_val <- performance(pred_val, "tpr", "fpr")
        plot(perf_val, col = "green", lwd = 1.5)

        #Calculating KS statistics
        ks <- max(attr(perf_val, "y.values")[[1]] - (attr(perf_val, "x.values")[[1]]))
        ks

The area under the curve (AUC) on the validation dataset is 0.8642.
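As a small follow-up (not part of the original article), hard class predictions can also be summarized with caret's confusionMatrix(); by default, predict() on a caret model returns the predicted classes.
# Class-level evaluation on the validation set
class_pred <- predict(model1, validation)            # predicted classes (X0 / X1)
confusionMatrix(class_pred, validation$Win.Loss)     # accuracy, sensitivity, specificity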

        Pandas Python Tutorial - Learn by Example

Pandas, being one of the most popular packages in Python, is widely used for data manipulation. It is a very powerful and versatile package which makes data cleaning and wrangling much easier and more pleasant.

The Pandas library is a great contribution to the Python community, and it makes Python one of the top programming languages for data science and analytics. It has become the first choice of data analysts and scientists for data analysis and manipulation.

        Data Analysis with Python : Pandas Step by Step Guide

        Why pandas?
It has many functions which are essential for data handling. In short, it can perform the following tasks for you -
        1. Create a structured data set similar to R's data frame and Excel spreadsheet.
        2. Reading data from various sources such as CSV, TXT, XLSX, SQL database, R etc.
        3. Selecting particular rows or columns from data set
        4. Arranging data in ascending or descending order
        5. Filtering data based on some conditions
        6. Summarizing data by classification variable
        7. Reshape data into wide or long format
        8. Time series analysis
        9. Merging and concatenating two datasets
        10. Iterate over the rows of dataset
        11. Writing or Exporting data in CSV or Excel format

        Datasets:

In this tutorial we will use two datasets: 'income' and 'iris'.
1. 'income' data : This data contains the income of various states from 2002 to 2015. The dataset contains 51 observations and 16 variables. Download link
2. 'iris' data : It comprises 150 observations and 5 variables. We have 3 species of flowers (50 flowers for each species), and for all of them the sepal length and width and petal length and width are given. Download link


        Important pandas functions to remember

        The following is a list of common tasks along with pandas functions.
Utility                                    Function
To select a random no. of rows             df.sample(n = 10)
To select a fraction of random rows        df.sample(frac = 0.2)
To rename the variables                    df.rename( )
Selecting a column as index                df.set_index( )
Removing rows or columns                   df.drop( )
Sorting values                             df.sort_values( )
Grouping variables                         df.groupby( )
Filtering                                  df.query( )
Finding the missing values                 df.isnull( )
Dropping the missing values                df.dropna( )
Removing the duplicates                    df.drop_duplicates( )
Creating dummies                           pd.get_dummies( )
Ranking                                    df.rank( )
Cumulative sum                             df.cumsum( )
Quantiles                                  df.quantile( )
Selecting numeric variables                df.select_dtypes( )
Concatenating two dataframes               pd.concat()
Merging on basis of common variable        pd.merge( )

        Importing pandas library

You need to import or load the Pandas library first in order to use it. By "importing a library", we mean loading it into memory so that you can use it. Run the following code to import the pandas library:
        import pandas as pd
        The "pd" is an alias or abbreviation which will be used as a shortcut to access or call pandas functions. To access the functions from pandas library, you just need to type pd.function instead of  pandas.function every time you need to apply it.

        Importing Dataset

To read or import data from a CSV file, you can use the read_csv() function. In the function, you need to specify the file location of your CSV file.
        income = pd.read_csv("C:\\Users\\Hp\\Python\\Basics\\income.csv")
         Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        4 C California 1685349 1675807 1889570 1480280 1735069 1812546

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        4 1487315 1663809 1624509 1639670 1921845 1156536 1388461 1644607
        Knowing the Variable types

        You can use the dataFrameName.dtypes command to extract the information of types of variables stored in the data frame.
        income.dtypes 
        Index    object
        State object
        Y2002 int64
        Y2003 int64
        Y2004 int64
        Y2005 int64
        Y2006 int64
        Y2007 int64
        Y2008 int64
        Y2009 int64
        Y2010 int64
        Y2011 int64
        Y2012 int64
        Y2013 int64
        Y2014 int64
        Y2015 int64
        dtype: object

        Here 'object' means strings or character variables. 'int64' refers to numeric variables (without decimals).

        To see the variable type of one variable (let's say "State") instead of all the variables, you can use the command below -
        income['State'].dtypes
        It returns dtype('O'). In this case, 'O' refers to object i.e. type of variable as character.

        Changing the data types

Y2008 is an integer. Suppose we want to convert it to float (a numeric variable with decimals); we can write:
        income.Y2008 = income.Y2008.astype(float)
        income.dtypes
        Index     object
        State object
        Y2002 int64
        Y2003 int64
        Y2004 int64
        Y2005 int64
        Y2006 int64
        Y2007 int64
        Y2008 float64
        Y2009 int64
        Y2010 int64
        Y2011 int64
        Y2012 int64
        Y2013 int64
        Y2014 int64
        Y2015 int64
        dtype: object

        To view the dimensions or shape of the data
        income.shape
         (51, 16)

        51 is the number of rows and 16 is the number of columns.

        You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R). 
        income.shape[0]
        income.shape[1]

        To view only some of the rows

By default head( ) shows the first 5 rows. If we want to see a specific number of rows, we can mention it in the parentheses. Similarly, the tail( ) function shows the last 5 rows by default.
        income.head()
        income.head(2)
        income.tail()
        income.tail(2)

        Alternatively, the command income[0:5] can be used to fetch first five rows.

        Extract Unique Values

The unique() function shows the unique values (levels or categories) of a column.
        income.Index.unique()
        array(['A', 'C', 'D', ..., 'U', 'V', 'W'], dtype=object)


        The nunique( ) shows the number of unique values.
        income.Index.nunique()
It returns 19 as the Index column contains 19 distinct values.

        Generate Cross Tab

        pd.crosstab( ) is used to create a bivariate frequency distribution. Here the bivariate frequency distribution is between Index and State columns.
        pd.crosstab(income.Index,income.State)

        Creating a frequency distribution

income.Index selects the 'Index' column of the 'income' dataset and value_counts( ) creates a frequency distribution. By default ascending = False, i.e. it shows the 'Index' value with the maximum frequency at the top.
        income.Index.value_counts(ascending = True)
        F    1
        G 1
        U 1
        L 1
        H 1
        P 1
        R 1
        D 2
        T 2
        S 2
        V 2
        K 2
        O 3
        C 3
        I 4
        W 4
        A 4
        M 8
        N 8
        Name: Index, dtype: int64

        To draw the samples
income.sample( ) is used to draw random samples from the dataset containing all the columns. Here n = 5 means we want 5 rows, and frac = 0.1 tells that we want 10 percent of the rows as the sample.
        income.sample(n = 5)
        income.sample(frac = 0.1)
        Selecting only a few of the columns
To select only specific columns we use either the loc[ ] or iloc[ ] commands. The rows or columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.
        income.loc[:,["Index","State","Y2008"]]
        income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns
        #In the above command both Index and Y2008 are included.
        income.iloc[:,0:5]  #Columns from 1 to 5 are included. 6th column not included
The difference between loc and iloc is that loc requires the column (row) names to be selected, while iloc requires the column (row) indices (positions).

        Renaming the variables
We create a dataframe 'data' containing the names of people and their respective zodiac signs.
        data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
        data 
               A          B
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        If all the columns are to be renamed then we can use data.columns and assign the list of new column names.
        #Renaming all the variables.
        data.columns = ['Names','Zodiac Signs']

           Names Zodiac Signs
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.
        #Renaming only some of the variables.
        data.rename(columns = {"Names":"Cust_Name"},inplace = True)
          Cust_Name Zodiac Signs
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
By default in pandas inplace = False, which means that no changes are made to the original dataset. Thus if we wish to alter the original dataset we need to set inplace = True.

Suppose we want to replace only a particular character in the column names; then we can use the str.replace( ) function. For example, replacing "Y" with "Year " in the variable names:
        income.columns = income.columns.str.replace('Y' , 'Year ')
        income.columns
        Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
        'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
        'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],
        dtype='object')

        Setting one column in the data frame as the index
Using set_index("column name") we can set that column as the index, and it is removed from the regular columns.
        income.set_index("Index",inplace = True)
        income.head()
        #Note that the indices have changed and Index column is now no more a column
        income.columns
        income.reset_index(inplace = True)
        income.head()
reset_index( ) restores the default integer indices.

        Removing the columns and rows
To drop a column we use drop( ), where the first argument is a list of columns to be removed. By default axis = 0, which means the operation applies row-wise (to the index). To remove a column we need to set axis = 1.
        income.drop('Index',axis = 1)

        #Alternatively
        income.drop("Index",axis = "columns")
        income.drop(['Index','State'],axis = 1)
        income.drop(0,axis = 0)
        income.drop(0,axis = "index")
        income.drop([0,1,2,3],axis = 0)
Also, inplace = False by default, thus no alterations are made to the original dataset. axis = "columns" and axis = "index" mean that a column or a row (index) should be removed, respectively.

        Sorting the data
        To sort the data sort_values( ) function is deployed. By default inplace = False and ascending = True.
        income.sort_values("State",ascending = False)
        income.sort_values("State",ascending = False,inplace = True)
        income.Y2006.sort_values() 
We have duplicated values of Index, thus we sort the dataframe first by Index and then, within each Index, by Y2002:
        income.sort_values(["Index","Y2002"]) 
Some arithmetic operations
Using eval( ), arithmetic operations on various columns of a dataset can be carried out.
        income["difference"] = income.Y2008-income.Y2009

        #Alternatively
        income["difference2"] = income.eval("Y2008 - Y2009")
        income.head()
          Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        4 C California 1685349 1675807 1889570 1480280 1735069 1812546

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015 \
        0 1945229.0 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826.0 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886.0 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104.0 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        4 1487315.0 1663809 1624509 1639670 1921845 1156536 1388461 1644607

        difference difference2
        0 1056.0 1056.0
        1 115285.0 115285.0
        2 198556.0 198556.0
        3 -440876.0 -440876.0
        4 -176494.0 -176494.0

        income.ratio = income.Y2008/income.Y2009
The above command does not work; to create new columns we need to use square brackets.
We can also use the assign( ) function, but this command does not make changes to the original data as there is no inplace parameter; hence we need to save the result in a new dataframe.
        data = income.assign(ratio = (income.Y2008 / income.Y2009))
        data.head()

        Finding Descriptive Statistics
describe( ) is used to find statistics like the mean, minimum, quartiles etc. for numeric variables.
        income.describe() #for numeric variables
To find the total count, the most frequently occurring string and its frequency, we write include = ['object']
        income.describe(include = ['object'])  #Only for strings / objects
        Mean, median, maximum and minimum can be obtained for a particular column(s) as:
        income.Y2008.mean()
        income.Y2008.median()
        income.Y2008.min()
        income.loc[:,["Y2002","Y2008"]].max()

        Groupby function
To group the data by a categorical variable we use the groupby( ) function, and we can then perform operations on each category.
        income.groupby("Index").Y2008.min()
        income.groupby("Index")["Y2008","Y2010"].max()
The agg( ) function is used to apply several summary functions to a given variable.
        income.groupby("Index").Y2002.agg(["count","min","max","mean"])
        income.groupby("Index")["Y2002","Y2003"].agg(["count","min","max","mean"])
        The following command finds minimum and maximum values for Y2002 and only mean for Y2003
        income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
                  Y2002                 Y2003
        min max mean
        Index
        A 1170302 1742027 1810289.000
        C 1343824 1685349 1595708.000
        D 1111437 1330403 1631207.000
        F 1964626 1964626 1468852.000
        G 1929009 1929009 1541565.000
        H 1461570 1461570 1200280.000
        I 1353210 1776918 1536164.500
        K 1509054 1813878 1369773.000
        L 1584734 1584734 1110625.000
        M 1221316 1983285 1535717.625
        N 1395149 1885081 1382499.625
        O 1173918 1802132 1569934.000
        P 1320191 1320191 1446723.000
        R 1501744 1501744 1942942.000
        S 1159037 1631522 1477072.000
        T 1520591 1811867 1398343.000
        U 1771096 1771096 1195861.000
        V 1134317 1146902 1498122.500
        W 1677347 1977749 1521118.500

        Filtering
        To filter only those rows which have Index as "A" we write:
        income[income.Index == "A"]

        #Alternatively
        income.loc[income.Index == "A",:]
          Index     State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        To select the States having Index as "A":
        income.loc[income.Index == "A","State"]
        income.loc[income.Index == "A",:].State
To filter the rows with Index as "A" and income for 2002 greater than 1500000:
        income.loc[(income.Index == "A") & (income.Y2002 > 1500000),:]
        To filter the rows with index either "A" or "W", we can use isin( ) function:
        income.loc[(income.Index == "A") | (income.Index == "W"),:]

        #Alternatively.
        income.loc[income.Index.isin(["A","W"]),:]
           Index          State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        47 W Washington 1977749 1687136 1199490 1163092 1334864 1621989
        48 W West Virginia 1677347 1380662 1176100 1888948 1922085 1740826
        49 W Wisconsin 1788920 1518578 1289663 1436888 1251678 1721874
        50 W Wyoming 1775190 1498098 1198212 1881688 1750527 1523124

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        47 1545621 1555554 1179331 1150089 1775787 1273834 1387428 1377341
        48 1238174 1539322 1539603 1872519 1462137 1683127 1204344 1198791
        49 1980167 1901394 1648755 1940943 1729177 1510119 1701650 1846238
        50 1587602 1504455 1282142 1881814 1673668 1994022 1204029 1853858
        Alternatively we can use query( ) function and write our filtering criteria:
        income.query('Y2002>1700000 & Y2003 > 1500000')

        Dealing with missing values
We create a new dataframe named 'crops', and to create a NaN value we use np.nan after importing numpy.
        import numpy as np
        mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
                'Yield': [1010, 1025.2, 1404.2, 1251.7],
                'cost' : [102, np.nan, 20, 68]}
        crops = pd.DataFrame(mydata)
        crops
        isnull( ) returns True and notnull( ) returns False if the value is NaN.
        crops.isnull()  #same as is.na in R
        crops.notnull()  #opposite of previous command.
        crops.isnull().sum()  #No. of missing values.
crops.cost.isnull() first subsets 'cost' from the dataframe and then returns a logical vector using isnull()

        crops[crops.cost.isnull()] #shows the rows with NAs.
        crops[crops.cost.isnull()].Crop #shows the rows with NAs in crops.Crop
        crops[crops.cost.notnull()].Crop #shows the rows without NAs in crops.Crop
To drop all the rows which have missing values in any column we use dropna(how = "any"). By default inplace = False. how = "all" means drop a row only if all the elements in that row are missing.

        crops.dropna(how = "any").shape
        crops.dropna(how = "all").shape  
        To remove NaNs if any of 'Yield' or'cost' are missing we use the subset parameter and pass a list:
        crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
        crops.dropna(subset = ['Yield',"cost"],how = 'all').shape
Replacing the missing values in the 'cost' column with "UNKNOWN":
        crops['cost'].fillna(value = "UNKNOWN",inplace = True)
        crops

        Dealing with duplicates
We create a new dataframe comprising items and their respective prices.
        data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
        data
                     Items  Price
        0 TV 10000
        1 Washing Machine 50000
        2 Mobile 20000
        3 TV 10000
        4 TV 10000
        5 Washing Machine 40000
duplicated() returns a logical vector which is True when a duplicated row is encountered.
        data.loc[data.duplicated(),:]
        data.loc[data.duplicated(keep = "first"),:]
By default keep = 'first', i.e. the first occurrence is considered a unique value and its repetitions are considered duplicates.
If keep = "last", the last occurrence is considered a unique value and all its earlier repetitions are considered duplicates.
        data.loc[data.duplicated(keep = "last"),:] #last entries are not there,indices have changed.
        If keep = "False" then it considers all the occurences of the repeated observations as duplicates.
        data.loc[data.duplicated(keep = False),:]  #all the duplicates, including unique are shown.
To drop the duplicates, drop_duplicates is used with default inplace = False; keep = 'first', 'last' or False has the same meaning as in duplicated( ).
        data.drop_duplicates(keep = "first")
        data.drop_duplicates(keep = "last")
        data.drop_duplicates(keep = False,inplace = True)  #by default inplace = False
        data

        Creating dummies
        Now we will consider the iris dataset
        iris = pd.read_csv("C:\\Users\\Hp\\Desktop\\work\\Python\\Basics\\pandas\\iris.csv")
        iris.head()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
        0 5.1 3.5 1.4 0.2 setosa
        1 4.9 3.0 1.4 0.2 setosa
        2 4.7 3.2 1.3 0.2 setosa
        3 4.6 3.1 1.5 0.2 setosa
        4 5.0 3.6 1.4 0.2 setosa
The map( ) function is used to match the values and replace them, automatically creating a new series.
        iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
        iris.head()
To create dummies, get_dummies( ) is used. The argument prefix = "Species" adds the prefix 'Species' to the names of the new columns created.
        pd.get_dummies(iris.Species,prefix = "Species")
        pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1]  #1 is not included
        species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]
        With concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined columnwise.
        iris = pd.concat([iris,species_dummies],axis = 1)
        iris.head()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  \
        0 5.1 3.5 1.4 0.2 setosa
        1 4.9 3.0 1.4 0.2 setosa
        2 4.7 3.2 1.3 0.2 setosa
        3 4.6 3.1 1.5 0.2 setosa
        4 5.0 3.6 1.4 0.2 setosa

        Species_setosa Species_versicolor Species_virginica
        0 1 0 0
        1 1 0 0
        2 1 0 0
        3 1 0 0
        4 1 0 0
It is usual that for a variable with 'n' categories we create 'n-1' dummies; thus, to drop the first dummy column we write drop_first = True
        pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()

        Ranking
         To create a dataframe of all the ranks we use rank( )
        iris.rank() 
        Ranking by a specific variable
        Suppose we want to rank the Sepal.Length for different species in ascending order:
        iris['Rank'] = iris.sort_values(['Sepal.Length'], ascending=[True]).groupby(['Species']).cumcount() + 1
        iris.head( )

        #Alternatively
        iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=1)
        iris.head()

        Calculating the Cumulative sum
        Using cumsum( ) function we can obtain the cumulative sum
        iris['cum_sum'] = iris["Sepal.Length"].cumsum()
        iris.head()
        Cumulative sum by a variable
        To find the cumulative sum of sepal lengths for different species we use groupby( ) and then use cumsum( )
        iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()
        iris.head()

        Calculating the percentiles.
        Various quantiles can be obtained by using quantile( )
        iris.quantile(0.5)
        iris.quantile([0.1,0.2,0.5])
        iris.quantile(0.55)

        if else
        We create a new dataframe of students' name and their respective zodiac signs.
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
After importing numpy we can use np.where. The first argument is the condition to be evaluated, the second argument is the value if the condition is True and the last argument is the value if the condition is False.
        import numpy as np
        students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')
        students
              Names Zodiac Signs flag
        0 John Aquarius yes
        1 Mary Libra no
        2 Henry Gemini yes
        3 Augustus Pisces no
        4 Kenny Virgo no

        Multiple conditions for if else
We create a list of conditions and their respective values when evaluated True, and use np.select, where 'default' is the value used if all the conditions are False.
        conditions = [
            (students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
            (students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
            (students['Zodiac Signs'] == 'Pisces')]
        choices = ['yellow', 'blue', 'purple']
        students['color'] = np.select(conditions, choices, default='black')
        students
              Names Zodiac Signs flag   color
        0 John Aquarius yes yellow
        1 Mary Libra no blue
        2 Henry Gemini yes black
        3 Augustus Pisces no purple
        4 Kenny Virgo no black
When using a simple 'if else' we need to take care of the indentation. Python does not use curly braces for loops and if-else blocks.
        x = 1001   #Take care of the indentation.
        if x%10 == 0:
            print("x is a multiple of 10")
        else:
            print("It is not a multiple of 10")

        Select numeric or categorical columns only
        To include numeric columns we use select_dtypes( ) 
        data1 = iris.select_dtypes(include=[np.number])
        data1.head()
The _get_numeric_data( ) method also provides a utility to select the numeric columns only.
        data3 = iris._get_numeric_data()
        data3.head(3)
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  cum_sum  cumsum2
        0 5.1 3.5 1.4 0.2 5.1 5.1
        1 4.9 3.0 1.4 0.2 10.0 10.0
        2 4.7 3.2 1.3 0.2 14.7 14.7
        For selecting categorical variables
        data4 = iris.select_dtypes(include = ['object'])
        data4.head(2)
         Species
        0 setosa
        1 setosa

        Concatenating
        We create 2 dataframes containing the details of the students:
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
        students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                  'Marks' : [50,81,98,25,35]})
Using the pd.concat( ) function we can join the 2 dataframes:
        data = pd.concat([students,students2])  #by default axis = 0
           Marks     Names Zodiac Signs
        0 NaN John Aquarius
        1 NaN Mary Libra
        2 NaN Henry Gemini
        3 NaN Augustus Pisces
        4 NaN Kenny Virgo
        0 50.0 John NaN
        1 81.0 Mary NaN
        2 98.0 Henry NaN
        3 25.0 Augustus NaN
        4 35.0 Kenny NaN
By default axis = 0, thus the new dataframe is added row-wise. If a column is not present in one of the dataframes, NaNs are created. To join column-wise we set axis = 1
        data = pd.concat([students,students2],axis = 1)
        data
              Names Zodiac Signs  Marks     Names
        0 John Aquarius 50 John
        1 Mary Libra 81 Mary
        2 Henry Gemini 98 Henry
        3 Augustus Pisces 25 Augustus
        4 Kenny Virgo 35 Kenny
        Using append function we can join the dataframes row-wise
        students.append(students2)  #for rows
        Alternatively we can create a dictionary of the two data frames and can use pd.concat to join the dataframes row wise
        classes = {'x': students, 'y': students2}
         result = pd.concat(classes)
        result 
             Marks     Names Zodiac Signs
        x 0 NaN John Aquarius
        1 NaN Mary Libra
        2 NaN Henry Gemini
        3 NaN Augustus Pisces
        4 NaN Kenny Virgo
        y 0 50.0 John NaN
        1 81.0 Mary NaN
        2 98.0 Henry NaN
        3 25.0 Augustus NaN
        4 35.0 Kenny NaN

Merging or joining on the basis of a common variable
We take 2 dataframes with different numbers of observations:
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
        students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                  'Marks' : [50,81,98,25,35]})
Using pd.merge we can join the two dataframes. on = 'Names' specifies that 'Names' is the common variable on the basis of which the dataframes are to be combined.
        result = pd.merge(students, students2, on='Names')  #it only takes intersections
        result
           Names Zodiac Signs  Marks
        0 John Aquarius 50
        1 Mary Libra 81
        2 Henry Gemini 98
By default how = "inner", thus it takes only the common elements in both the dataframes. If you want all the elements from both the dataframes, set how = "outer".
result = pd.merge(students, students2, on='Names', how = "outer")  #it takes the union
        result
              Names Zodiac Signs  Marks
        0 John Aquarius 50.0
        1 Mary Libra 81.0
        2 Henry Gemini 98.0
        3 Maria Capricorn NaN
        4 Augustus NaN 25.0
        5 Kenny NaN 35.0
To keep all the rows of the left dataframe (and only the matching rows of the right one) set how = 'left'
        result = pd.merge(students, students2, on='Names',how = "left")
        result
           Names Zodiac Signs  Marks
        0 John Aquarius 50.0
        1 Mary Libra 81.0
        2 Henry Gemini 98.0
        3 Maria Capricorn NaN
Similarly, how = 'right' keeps all the rows of the right dataframe and only the matching rows of the left one.
        result = pd.merge(students, students2, on='Names',how = "right",indicator = True)
        result
              Names Zodiac Signs  Marks      _merge
        0 John Aquarius 50 both
        1 Mary Libra 81 both
        2 Henry Gemini 98 both
        3 Augustus NaN 25 right_only
        4 Kenny NaN 35 right_only
indicator = True creates a _merge column indicating whether each row is present in both dataframes or only in the left or right dataframe.
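A common use of this _merge column is to keep only the rows that appear in just one of the dataframes (an 'anti join'). Below is a minimal sketch reusing the students and students2 dataframes above; the variable names left_only and right_only are ours, for illustration only.
result = pd.merge(students, students2, on='Names', how='outer', indicator=True)
left_only = result[result['_merge'] == 'left_only']     # rows present only in students
right_only = result[result['_merge'] == 'right_only']   # rows present only in students2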

        Predictive Modeling using SAS & R Online Training

        First Online Training Batch
The next instructor-led online training batch will commence on February 10, 2018. We are offering courses on SAS, R and Predictive Modeling. In this program you will get access to live lectures plus recorded videos from any part of the world via web conference mode. You can also chat or ask your questions verbally over VoIP in real time to get your doubts cleared.
        1. Practical SAS Programming - Learning SAS by Case Studies
        2. Predictive Modeling with SAS - Modeling with Hands-on Examples plus Domain Knowledge
        3. Data Science using R - Practical Data Science Course (Incld. R Programming, Data Science and Domain Knowledge)

Batch : 10th February, Saturday     Mode : Live Instructor-led


        Practical SAS Programming
        Rs 20,000
        ($325)
        • Special Price : Get 10% off till 8th Jan, 2018
        • Base and Advanced SAS Programming
        • Instructor-led live class + Recorded videos
        • Duration : 8 Weeks (100 hours)
        • Timing : 17:00 - 19:00 IST
        • Live Projects + Scenario-Based Questions
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        • Pay Rs1,000 (USD20) now to reserve your seat and the remaining balance 10 days before the batch commencement date
        Indian Users
        All Users (Except India)


        Predictive Modeling using SAS
        Rs 25,000
        ($400)
        • Special Price : Get 10% off till 8th Jan, 2018
        • Predictive Modeling with SAS                    
        • Instructor-led live class + Recorded videos
        • Duration : 8 - 10 Weeks (100 hours)
        • Timing : 19:00 - 21:00 IST
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        • Pay Rs1,000 (USD20) now to reserve your seat and the remaining balance 10 days before the batch commencement date
        Indian Users
        All Users (Except India)


        R Programming + Data Science with R 
        Rs 30,000
        ($480)
        • Special Price : Get 10% off till 8th Jan, 2018
        • R Programming + Predictive Modeling with R
        • Instructor-led live class + Recorded videos
        • Duration : 10-12 Weeks (120 hours)
        • Timing : 21:15 - 23:15 IST
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        • Pay Rs1,000 (USD20) now to reserve your seat and the remaining balance 10 days before the batch commencement date
        Indian Users
        All Users (Except India)



        Combo Deals - Spend Less, Learn More
        Pay only Rs 35,000 ($600) on purchase of 'Practical SAS Programming' and 'Predictive Modeling with SAS' courses
        Offer expires on 8th January,2018
        Enroll Now - Indian Users 
        Enroll Now - All Users (Except India)

        What is Instructor-led live program?
It is an interactive training program. Learners get access to live lectures via webinar mode and can chat or ask their questions verbally over VoIP in real time to get their doubts cleared. You can also go through the video recording if you miss a class.

        Money Back Guarantee?
If you do not like our training, you can ask for a 100% course fee refund after your first live session. No questions asked refund policy!

        What is the duration of these programs?
        These are weekend programmes comprising 100-130 hours. Classes will be held on every Saturday and Sunday.

        SAS Programming : 17:00 - 19:00 IST (07:30AM - 09:30AM  Eastern Daylight Time)
        Predictive Modeling with SAS : 19:00 - 21:00 IST (09:30AM - 11:30AM Eastern Daylight Time)
        Data Science with R : 21:15 - 23:15 IST (11:45AM - 01:45PM Eastern Daylight Time)

        The course duration is as follows -
        1. Practical SAS Programming -100 hours (At least 50 hours live training + 5 hours video based training + ~60 hours of Practice and Self Study)
        2. Predictive Modeling with SAS - 100 hours (Includes hours of Video based training and Practice and Self Study)
        3. Data Science with R - 120 hours (At least 40 hours live training + 7 hours video based training + ~80 hours of Practice and Self Study)

        If I opt for all the 3 courses, will classes be scheduled at the same time?
All classes will be scheduled on weekends but not at the same time; they run one after another. For example, if one class gets over at 5, the next class will start at 5.

        How We are different from other institutes?
        Here are some of the features of ListenData that makes us better than other training institutes.
        1. Explain Advanced Statistical and Machine Learning Algorithms in Simple English. We make classes more logical and understandable than just telling concepts.
        2. Practical Application of Techniques using Real-world Datasets. No sample or cleaned dataset.
3. Domain Knowledge - It is the most important element of a predictive modeling project. People who lack domain knowledge find it difficult to crack interviews in spite of having knowledge of predictive modeling.
        4. Hands-on Model Development and Validation
        5. Strategies to implement predictive model
        6. New algorithms to solve problems efficiently
        7. Explain complex topics via visual lessons

        Who should do these courses?
        These courses are ideal for candidates who want to make a career in analytics.
1. Any candidate pursuing graduation / post-graduation, or already a graduate, can apply for this course. No particular specialization is required prior to applying for these courses. You can be from any educational background like Engineering, Economics, Statistics, Mathematics, Commerce, Business Management, Operational Research etc.
        2. Anyone who is planning a career shift to analytics. It does not matter if you are a network engineer or financial analyst. You can go ahead with these courses as they do not require any prior knowledge of programming or statistics.

Every training institute promises a job. Why should I trust you?
Let's be honest! It's a universal fact that no college or training institute can provide a 100% job guarantee. If they claim a 100% job guarantee, they are luring learners with false promises. Even IITs do not hit a 100% score. Some facts - only 66% of IITians landed a job offer via campus recruitment in 2016-17, as against 79% in 2015-16 and 78% in 2014-15, according to the HRD ministry.

        Let me list down the common reasons why people don't get jobs in analytics industry even after completing training from some colleges / institutes -
        1. No hands-on experience
        2. No domain knowledge
        3. No theoretical knowledge of statistical concepts
        4. Poor analytical skill
The objective of this program is to cover the first three points above in detail. In addition, we provide job placement assistance to all students and will keep you informed about current openings in the analytics industry. We are in constant contact with job consultancy firms and maintain a solid network of analytics professionals.

        The decline of SAS Jobs and rise of R?
I have been working in SAS for close to 7 years and have worked with 4 organizations (instability in career! :D). Whenever I look for a job change, I do not see any decline of SAS jobs in the market. Don't trust me - go to job portals and search 'SAS'! List of Companies using SAS It is a big hit in banks, insurance, telecom and pharmaceutical companies. SAS is still a world leader in advanced analytics and has over 40,000 customers worldwide. It has been tagged a 'leader' consistently in advanced analytics platforms as per the Gartner 2015 and 2016 reports. It is one of the most sought after skills in the job market. Learning SAS will help you scale up your skills, which in turn helps boost your career.

        At the same time, R has gained popularity. It is a language of choice for data scientists. It makes advanced statistical techniques and machine learning algorithms easy to implement. It is being used as a primary tool in IT, ecommerce, startups, HR, service and product based companies and secondary tool in banks, insurance and telecom companies. List of Companies using R

        Final Comment - You should not get into language wars and should focus on learning both the languages as jobs are evolving very fast. Companies prefer candidates who know both SAS & R.

        In case if I miss any live session?
        Every class is recorded. We will provide you recording of every session.

        I never studied Programming or Statistics during graduation. Can I still apply for this course?
        Yes, these courses are designed to keep in mind the needs of non-programmers/non-statisticians. Only prerequisite is hard work and zeal for learning.

        Is my registration fees refundable?
100% refundable. In case you want to opt out of the course for any reason, you can ask for a 100% refund within 7 days of registration. If you want to continue, it is automatically adjusted against the total fees. In other words, you pay $20 (Rs 1000) less than the total fees.

        About Instructor
        Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. He has worked with companies like Aon, Cognizant, Genpact, RBS. He has handled global clients in various domains like retail and commercial banking, Telecom, HR and Automotive. He has worked extensively in various data science projects such as Customer Attrition, Customer Lifetime Value Model, Propensity Model, Opinion / Sentiment Mining, Geo Analytics, Credit risk scorecard, Portfolio Optimization, Pricing Analytics, Cross sell/Up sell campaign models, Survey Analytics, Customer Segmentation, Market Benchmarking, Employee Attrition, Employee Engagement etc.

        Any Questions?
        Please feel free to write me at deepanshu.bhalla@outlook.com OR Join me on linkedin

        Predictive Modeling using SAS & R Training




        Curriculum - Practical SAS Programming
        1. Introduction to SAS
        2. How SAS works
        3. Import Raw Data Files - Basics
        4. Import Raw Data Files - Special Cases
        5. Importing / Exporting Data with Procedures
        6. Exploring Data - Various Methods
        7. Data Subsetting
        8. Data Manipulation - Basics
        9. Data Manipulation - Intermediate
        10. Data Manipulation - Advanced
        11. Do Loops and Arrays
        12. Merging Data
        13. Appending Data
        14. Character & Numeric Functions
        15. Date Functions
        16. Reporting - Creating tabular reports
        17. Proc SQL - Part I
        18. Proc SQL - Part II
        19. Proc SQL - Part III
        20. SAS Macros - Basics
        21. SAS Macros - Intermediate
        22. SAS Macros - Advanced
        23. SAS Macros - Debugging Tips
        24. Efficient SAS Programming Tips
        25. Connect to Databases using SAS
        26. Interview Preparation - Scenario Based Questions
        27. Live Project




        Curriculum - Predictive Modeling using SAS
        1. Introduction to Statistics & Modeling
        2. Marketing Analytics : Applications
        3. Predictive Modeling in Financial Services Industry
        4. Predictive Modeling in HR
        5. SAS Programming - Basics
        6. SAS Programming - Intermediate
        7. Descriptive Statistics with SAS
        8. Hypothesis Testing with SAS
        9. Correlation Analysis with SAS
        10. Steps of Predictive Modeling
        11. Data Preparation in Predictive Modeling
        12. Variable Selection Methods in Predictive Modeling
        13. Segmentation - Introduction
        14. Segmentation - Cluster Analysis : Theory
        15. Segmentation - Cluster Analysis : Data Preparation
        16. Segmentation - Cluster Analysis : k-means and Hierarchical
        17. Segmentation - Cluster Analysis : Cluster Performance
        18. Principal Component Analysis (PCA) - Theory
        19. Running and Understanding PCA with SAS
        20. Linear Regression - Theory
        21. Linear Regression - Assumptions and Treatment
        22. Linear Regression - Important Metrics
        23. Linear Regression - Variable Selection Methods
        24. Linear Regression - Model Development
        25. Linear Regression - Model Validation
        26. Linear Regression - Model Performance
        27. Linear Regression - Model Scoring
        28. Linear Regression - Model Implementation
        29. Logistic Regression - Theory
        30. Logistic Regression - Assumptions and Treatment
        31. Logistic Regression - Important Metrics
        32. Logistic Regression - Variable Selection Methods
        33. Logistic Regression - Model Development
        34. Logistic Regression - Model Validation
        35. Logistic Regression - Model Performance
        36. Logistic Regression - Model Implementation
        37. Decision Tree - How it works
        38. Decision Tree - Model Development
        39. Decision Tree - Model Validation
        40. Decision Tree - Model Performance
        41. Decision Tree - Model Implementation
        42. Time Series Forecasting - Theory
        43. Time Series Analysis with SAS
        44. Special Cases - Handle rare event model
        45. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        46. Case Studies - Customer Segmentation
        47. Case Studies - Probability of Default
        48. Case Studies - Employee Attrition
        49. Case Studies - Time Series Forecasting
        50. Interview Tips - Common Interview Questions



        Curriculum - R Programming + Data Science with R
        1. Introduction to R
        2. Introduction to RStudio
        3. Data Structures in R
        4. Importing / Exporting Data in R
        5. Data Exploration
        6. Data Manipulation with dplyr package - Basics
        7. Data Manipulation with dplyr package - Intermediate
        8. Data Manipulation with dplyr package - Advanced
        9. Character and Numeric Functions in R
10. Date & Time Functions in R
        11. Data Visualization in R
        12. Loops in R (Apply Family of Functions & For Loop)
        13. R Functions - Part I
        14. R Functions - Part II
        15. Introduction to Data Science
        16. Marketing Analytics : Applications
        17. Predictive Modeling in Financial Services Industry
        18. Predictive Modeling in HR
        19. Hypothesis Testing with R
        20. Correlation Analysis with R
        21. Steps of Predictive Modeling
        22. Data Preparation in Predictive Modeling
        23. Variable Selection Methods in Predictive Modeling
        24. Segmentation - Introduction
        25. Segmentation - Cluster Analysis : Theory
        26. Segmentation - Cluster Analysis : Data Preparation
        27. Segmentation - Cluster Analysis : k-means and Hierarchical
        28. Segmentation - Cluster Analysis : Cluster Performance
        29. Principal Component Analysis (PCA) - Theory
        30. Running and Understanding PCA with R
        31. Linear Regression - Theory
        32. Linear Regression - Assumptions and Treatment
        33. Linear Regression - Important Metrics
        34. Linear Regression - Variable Selection Methods
        35. Linear Regression - Model Development
        36. Linear Regression - Model Validation
        37. Linear Regression - Model Performance
        38. Linear Regression - Model Scoring
        39. Linear Regression - Model Implementation
        40. Logistic Regression - Theory
        41. Logistic Regression - Assumptions and Treatment
        42. Logistic Regression - Important Metrics
        43. Logistic Regression - Variable Selection Methods
        44. Logistic Regression - Model Development
        45. Logistic Regression - Model Validation
        46. Logistic Regression - Model Performance
        47. Logistic Regression - Model Implementation
        48. Decision Tree - How it works
        49. Decision Tree - Model Development
        50. Decision Tree - Model Validation
        51. Decision Tree - Model Performance
        52. Decision Tree - Model Implementation
        53. Machine Learning - Basics
        54. Random Forest - How it works
        55. Random Forest vs. Decision Tree
        56. Random Forest - Model Development
        57. Random Forest - Model Validation
        58. Random Forest - How it works
        59. Gradient Boosting - How it works
        60. Gradient Boosting - Model Development
        61. Gradient Boosting - Model Validation
        62. Support Vector Machine - How it works
        63. Support Vector Machine - Model Development
        64. Support Vector Machine - Model Validation
        65. Ensemble Stacking / Blending
        66. Time Series Forecasting - Theory
        67. Time Series Analysis with R
        68. Special Cases - Handle rare event model
        69. Text Mining Basics & Applications
        70. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        71. Case Studies - Customer Segmentation
        72. Case Studies - Probability of Default
        73. Case Studies - HR Drivers Analysis
        74. Case Studies - Sales Forecasting
        75. Case Studies - Text Mining
        76. Interview Tips - Common Interview Questions

        NumPy Tutorial : Step by Step Guide

NumPy (an acronym for 'Numerical Python' or 'Numeric Python') is one of the most essential packages for speedy mathematical computation on arrays and matrices in Python. It is also quite useful while dealing with multi-dimensional data. It is a blessing for integrating C, C++ and FORTRAN tools. It also provides numerous functions for Fourier transforms (FT) and linear algebra.

        Python : Numpy Tutorial

        Why NumPy instead of lists?
One might wonder why one should prefer NumPy arrays when we can create lists having the same data type. If that question rings a bell, the following reasons may convince you (a rough comparison is sketched below):
1. NumPy arrays have contiguous memory allocation, so the same data stored as a list requires more space than it does as an array.
2. They are faster to work with and hence more efficient than lists.
3. They are more convenient to deal with.
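As a rough illustration of points 1 and 2, the sketch below compares the memory footprint and the speed of an element-wise operation for a list versus a NumPy array. Exact numbers depend on your machine and Python version; this is only meant to show the general pattern.
import sys
import timeit
import numpy as np

lst = list(range(1000))
arr = np.arange(1000)

# Memory: the list object plus its element objects vs. the array's data buffer
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)
print(list_bytes, arr.nbytes)

# Speed: element-wise doubling (vectorized arr * 2 vs. a Python loop)
print(timeit.timeit(lambda: [i * 2 for i in lst], number=1000))
print(timeit.timeit(lambda: arr * 2, number=1000))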
NumPy vs. Pandas
Pandas is built on top of NumPy. In other words, NumPy is required by pandas to make it work, so pandas is not an alternative to NumPy. Instead, pandas offers additional methods and a more streamlined way of working with numerical and tabular data in Python.
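To see this relationship in practice, the short sketch below (assuming a reasonably recent pandas version) pulls the underlying NumPy arrays out of a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'k1': [1, 2, 3], 'k2': [4.0, 5.0, 6.0]})

# Each column is backed by a NumPy array ...
print(type(df['k1'].to_numpy()))   # <class 'numpy.ndarray'>

# ... and the whole frame can be viewed as a 2D NumPy array
print(df.to_numpy())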

          Importing numpy

          Firstly you need to import the numpy library. Importing numpy can be done by running the following command:
          import numpy as np
It is a common convention to import numpy with the alias 'np'. If no alias is provided then to access the functions from numpy we would have to write numpy.function; the alias lets us write np.function instead. Some of the common functions of numpy are listed below -

Function : Task
array : Create a numpy array
ndim : Dimension of the array
shape : Size of the array (number of rows and columns)
size : Total number of elements in the array
dtype : Type of elements in the array, i.e., int64, character
reshape : Reshapes the array without changing the original shape
resize : Reshapes the array and also changes the original shape
arange : Create a sequence of numbers in an array
itemsize : Size in bytes of each item
diag : Create a diagonal matrix
vstack : Stacking vertically
hstack : Stacking horizontally

          1D array

          Using numpy an array is created by using np.array:
          a = np.array([15,25,14,78,96])
          a
          print(a)
          a
          Output: array([15, 25, 14, 78, 96])

          print(a)
          Output: [15 25 14 78 96]
Notice that in np.array square brackets are present; the absence of square brackets introduces an error. To print the array we can use print(a).

          Changing the datatype

          np.array( ) has an additional parameter of dtype through which one can define whether the elements are integers or floating points or complex numbers.
          a.dtype
          a = np.array([15,25,14,78,96],dtype = "float")
          a
          a.dtype
Initially the datatype of 'a' was 'int32' (the default integer type can vary by platform), which on modifying becomes 'float64'.

          Creating the sequence of numbers

          If you want to create a sequence of numbers then using np.arange we can get our sequence. To get the sequence of numbers from 20 to 29 we run the following command.
          b = np.arange(start = 20,stop = 30) 
          b
          array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
           In np.arange the end point is always excluded.

          Create an Arithmetic Progression

          np.arange provides an option of step which defines the difference between 2 consecutive numbers. If step is not provided then it takes the value 1 by default.

          Suppose we want to create an arithmetic progression with initial term 20 and common difference 2, upto 30; 30 being excluded.
          c = np.arange(20,30,2)   #30 is excluded.
          array([20, 22, 24, 26, 28])
Keep in mind that in np.arange( ) the stop argument is always excluded.

          Reshaping the arrays

          To reshape the array we can use reshape( ). 
          f = np.arange(101,113)
          f.reshape(3,4)
          f
           array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])

          Note that reshape() does not alter the shape of the original array. Thus to modify the original array we can use resize( ) 
          f.resize(3,4)
          f
          array([[101, 102, 103, 104],
          [105, 106, 107, 108],
          [109, 110, 111, 112]])

If a dimension is given as -1 in a reshape, the other dimension is calculated automatically, provided that the total number of elements in the array is divisible by the given dimension.
          f.reshape(3,-1)
          array([[101, 102, 103, 104],
          [105, 106, 107, 108],
          [109, 110, 111, 112]])

In the above code we only specified that we want 3 rows; Python automatically calculates the size of the other dimension, i.e. 4 columns.

          2D arrays

          A 2D array in numpy can be created in the following manner:
          g = np.array([(10,20,30),(40,50,60)])
          #Alternatively
          g = np.array([[10,20,30],[40,50,60]])
          g
          The dimension, total number of elements and shape can be ascertained by ndim, size and shape respectively:
          g.ndim
          g.size
          g.shape
          g.ndim
          Output: 2

          g.size
          Output: 6

          g.shape
          Output: (2, 3)

          Creating some usual matrices.

          numpy provides the utility to create some usual matrices which are commonly used for linear algebra.
          To create a matrix of all zeros of 2 rows and 4 columns we can use np.zeros( ):
          np.zeros( (2,4) )
          array([[ 0.,  0.,  0.,  0.],
          [ 0., 0., 0., 0.]])
          Here the dtype can also be specified. For a zero matrix the default dtype is 'float'. To change it to integer we write 'dtype = np.int16'
          np.zeros([2,4],dtype=np.int16)       
          array([[0, 0, 0, 0],
          [0, 0, 0, 0]], dtype=int16)
np.empty( ) creates an array without initializing its entries, so it contains whatever arbitrary values happen to be in memory - they are not meaningful random numbers.
          np.empty( (2,3) )       
          array([[  2.16443571e-312,   2.20687562e-312,   2.24931554e-312],
          [ 2.29175545e-312, 2.33419537e-312, 2.37663529e-312]])
Note: The results may vary every time you run np.empty.
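If what is actually needed is a matrix of random numbers between 0 and 1 (rather than uninitialized memory), numpy's random module can be used instead; a small sketch, added here for contrast:
np.random.rand(2, 3)   # 2 x 3 matrix of uniform random numbers in [0, 1)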
To create a matrix of all ones we write np.ones( ). We can create a 3 * 3 matrix of ones by:
          np.ones([3,3])
          array([[ 1.,  1.,  1.],
          [ 1., 1., 1.],
          [ 1., 1., 1.]])
          To create a diagonal matrix we can write np.diag( ). To create a diagonal matrix where the diagonal elements are 14,15,16 and 17 we write:
          np.diag([14,15,16,17])
          array([[14,  0,  0,  0],
          [ 0, 15, 0, 0],
          [ 0, 0, 16, 0],
          [ 0, 0, 0, 17]])
          To create an identity matrix we can use np.eye( ) .
          np.eye(5,dtype = "int")
          array([[1, 0, 0, 0, 0],
          [0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [0, 0, 0, 1, 0],
          [0, 0, 0, 0, 1]])
          By default the datatype in np.eye( ) is 'float' thus we write dtype = "int" to convert it to integers.

          Reshaping 2D arrays:

          To get a flattened 1D array we can use ravel( ) 
          g = np.array([(10,20,30),(40,50,60)])
          g.ravel()  
           array([10, 20, 30, 40, 50, 60])
          To change the shape of 2D array we can use reshape. Writing -1 will calculate the other dimension automatically and does not modify the original array.
          g.reshape(3,-1)  # returns the array with a modified shape
          #It does not modify the original array
          g.shape
           (2, 3)
          Similar to 1D arrays, using resize( ) will modify the shape in the original array.
          g.resize((3,2))
          g   #resize modifies the original array
          array([[10, 20],
          [30, 40],
          [50, 60]])

          Time for some matrix algebra.

          Let us create some arrays A,b and B and they will be used for this section:
          A = np.array([[2,0,1],[4,3,8],[7,6,9]])
          b = np.array([1,101,14])
          B = np.array([[10,20,30],[40,50,60],[70,80,90]])
           In order to get the transpose, trace and inverse we use A.transpose( ) , np.trace( ) and np.linalg.inv( ) respectively.
          A.T             #transpose
          A.transpose()  #transpose
          np.trace(A)  # trace
          np.linalg.inv(A)  #Inverse
          A.transpose()  #transpose
          Output:
          array([[2, 4, 7],
          [0, 3, 6],
          [1, 8, 9]])

          np.trace(A) # trace
          Output: 14

          np.linalg.inv(A) #Inverse
          Output:
          array([[ 0.53846154, -0.15384615, 0.07692308],
          [-0.51282051, -0.28205128, 0.30769231],
          [-0.07692308, 0.30769231, -0.15384615]])
          Note that transpose does not modify the original array.

          Matrix addition and subtraction can be done in the usual way:
          A+B
          A-B
          A+B
          Output:
          array([[12, 20, 31],
          [44, 53, 68],
          [77, 86, 99]])

          A-B
          Output:
          array([[ -8, -20, -29],
          [-36, -47, -52],
          [-63, -74, -81]])
          Matrix multiplication of A and B can be accomplished by A.dot(B). Where A will be the 1st matrix on the left hand side and B will be the second matrix on the right side.
          A.dot(B)
          array([[  90,  120,  150],
          [ 720, 870, 1020],
          [ 940, 1160, 1380]])
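In Python 3.5+ the @ operator performs the same matrix multiplication and is often more readable; a quick sketch using the A and B defined above:
A @ B              # same result as A.dot(B) for 2D arrays
np.matmul(A, B)    # the function form of the @ operator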
          To solve the system of linear equations: Ax = b we use np.linalg.solve( )
          np.linalg.solve(A,b)
          array([-13.92307692, -24.69230769,  28.84615385])
          The eigen values and eigen vectors can be calculated using np.linalg.eig( ) 
          np.linalg.eig(A)
          (array([ 14.0874236 ,   1.62072127,  -1.70814487]),
          array([[-0.06599631, -0.78226966, -0.14996331],
          [-0.59939873, 0.54774477, -0.81748379],
          [-0.7977253 , 0.29669824, 0.55608566]]))
The first array contains the eigenvalues and the second is the matrix of eigenvectors, where each column is the eigenvector corresponding to the respective eigenvalue.
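A quick way to check this interpretation is to verify that A times each eigenvector equals the corresponding eigenvalue times that eigenvector; a minimal sketch:
vals, vecs = np.linalg.eig(A)

# Column i of vecs is the eigenvector for vals[i], so A @ vecs should equal vecs * vals
print(np.allclose(A @ vecs, vecs * vals))   # True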


          Some Mathematics functions

          We can have various trigonometric functions like sin, cosine etc. using numpy:
          B = np.array([[0,-20,36],[40,50,1]])
          np.sin(B)
          array([[ 0.        , -0.91294525, -0.99177885],
          [ 0.74511316, -0.26237485, 0.84147098]])
          The resultant is the matrix of all sin( ) elements.
In order to raise each element to a power we use the ** operator
          B**2
          array([[   0,  400, 1296],
          [1600, 2500, 1]], dtype=int32)
          We get the matrix of the square of all elements of B.
In order to check whether a condition is satisfied by the elements of a matrix we write the criterion. For instance, to check if the elements of B are more than 25 we write:
          B>25
          array([[False, False,  True],
          [ True, True, False]], dtype=bool)
          We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False indicates that the condition is not satisfied.
          In a similar manner np.absolute, np.sqrt and np.exp return the matrices of absolute numbers, square roots and exponentials respectively.
          np.absolute(B)
          np.sqrt(B)
          np.exp(B)
          Now we consider a matrix A of shape 3*3:
          A = np.arange(1,10).reshape(3,3)
          A
          array([[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]])
          To find the sum, minimum, maximum, mean, standard deviation and variance respectively we use the following commands:
          A.sum()
          A.min()
          A.max()
          A.mean()
          A.std()   #Standard deviation
          A.var()  #Variance
          A.sum()
          Output: 45

          A.min()
          Output: 1

          A.max()
          Output: 9

          A.mean()
          Output: 5.0

          A.std() #Standard deviation
          Output: 2.5819888974716112

          A.var()
          Output: 6.666666666666667
          In order to obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively.
          A.argmin()
          A.argmax()
          A.argmin()
          Output: 0

          A.argmax()
          Output: 8
          If we wish to find the above statistics for each row or column then we need to specify the axis:
          A.sum(axis=0)             
          A.mean(axis = 0)
          A.std(axis = 0)
          A.argmin(axis = 0)
          A.sum(axis=0)                 # sum of each column, it will move in downward direction
          Output: array([12, 15, 18])

          A.mean(axis = 0)
          Output: array([ 4., 5., 6.])

          A.std(axis = 0)
          Output: array([ 2.44948974, 2.44948974, 2.44948974])

          A.argmin(axis = 0)
          Output: array([0, 0, 0], dtype=int64)
          By defining axis = 0, calculations will move in downward direction i.e. it will give the statistics for each column.
To find the minimum of each row, and the index of the maximum element in each row, we need to move in the rightwise direction, so we write axis = 1:
          A.min(axis=1)
          A.argmax(axis = 1)
          A.min(axis=1)                  # min of each row, it will move in rightwise direction
          Output: array([1, 4, 7])

          A.argmax(axis = 1)
          Output: array([2, 2, 2], dtype=int64)
             To find the cumulative sum along each row we use cumsum( )
          A.cumsum(axis=1)
          array([[ 1,  3,  6],
          [ 4, 9, 15],
          [ 7, 15, 24]], dtype=int32)

          Creating 3D arrays

          Numpy also provides the facility to create 3D arrays. A 3D array can be created as:
          X = np.array( [[[  1, 2,3],             
                          [ 4, 5, 6]],
                         [[7,8,9],
                          [10,11,12]]])
          X.shape
          X.ndim
          X.size
X contains two 2D arrays, thus the shape is (2, 2, 3). The total number of elements is 12.
          To calculate the sum along a particular axis we use the axis parameter as follows:
          X.sum(axis = 0)
          X.sum(axis = 1)
          X.sum(axis = 2)
          X.sum(axis = 0)
          Output:
          array([[ 8, 10, 12],
          [14, 16, 18]])

          X.sum(axis = 1)
          Output:
          array([[ 5, 7, 9],
          [17, 19, 21]])

          X.sum(axis = 2)
          Output:
          array([[ 6, 15],
          [24, 33]])
          axis = 0 returns the sum of the corresponding elements of each 2D array. axis = 1 returns the sum of elements in each column in each matrix while axis = 2 returns the sum of each row in each matrix.
          X.ravel()
           array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
ravel( ) flattens all the elements into a single 1D array.

          Indexing in arrays

          It is important to note that Python indexing starts from 0. The syntax of indexing is as follows - 
1. x[start:end] : elements from index start through end (the end is excluded)
2. x[start:]       : elements from index start through the end of the array
3. x[:end]        : elements from the beginning through end (the end is excluded)

          If we want to extract 3rd element we write the index as 2 as it starts from 0.
          x = np.arange(10)
          x[2]
          x[2:5]
           x[2]
          Output: 2

          x[2:5]
          Output: array([2, 3, 4])
Note that in x[2:5] the elements from the 2nd index up to the 5th index (exclusive) are selected.
If we want to set every third element from the start up to index 7 (excluding 7) to 123, we write:
          x[:7:3] = 123
          x
           array([123,   1,   2, 123,   4,   5, 123,   7,   8,   9])
          To reverse a given array we write:
          x = np.arange(10)
          x[ : :-1]                                 # reversed x
          array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
          Note that the above command does not modify the original array.
          Consider a 3D array:
          X = np.array( [[[  1, 2,3],           
                          [ 4, 5, 6]],
                         [[7,8,9],
                          [10,11,12]]])
          To extract the 2nd matrix we write:
          X[1,...]                                   # same as X[1,:,:] or X[1]
          array([[ 7,  8,  9],
          [10, 11, 12]])
Remember Python indexing starts from 0, which is why we wrote 1 to extract the 2nd 2D array.
          To extract the first element from all the rows we write:
          X[...,0]                                   # same as X[:,:,0] 
          array([[ 1,  4],
          [ 7, 10]])

          Indexing with Arrays of Indices

          Consider a 1D array.
          x = np.arange(11,35,2)                   
          x
          array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])
          We form a 1D array i which subsets the elements of x as follows:
          i = np.array( [0,1,5,3,7,9 ] )
          x[i]                                    
          array([11, 13, 21, 17, 25, 29])
          In a similar manner we create a 2D array j of indices to subset x.
          j = np.array( [ [ 0, 1], [ 6, 2 ] ] )   
          x[j]                                      
          array([[11, 13],
          [23, 15]])
          Similarly we can create both  i and j as 2D arrays of indices for x
          x = np.arange(15).reshape(3,5)
          x
          i = np.array( [ [0,1],                        # indices for the first dim
                          [2,0] ] )
          j = np.array( [ [1,1],                        # indices for the second dim
                          [2,0] ] )
          To get the ith index in row and jth index for columns we write:
          x[i,j]                                     # i and j must have equal shape
          array([[ 1,  6],
          [12, 0]])
          To extract ith index from 3rd column we write:
          x[i,2]
          array([[ 2,  7],
          [12, 2]])
          For each row if we want to find the jth index we write:
          x[:,j]                                    
          array([[[ 1,  1],
          [ 2, 0]],

          [[ 6, 6],
          [ 7, 5]],

          [[11, 11],
          [12, 10]]])
For each row in turn (1st, 2nd and 3rd), the jth indices are selected, which is why the result has an extra dimension.

          You can also use indexing with arrays to assign the values:
          x = np.arange(10)
          x
          x[[4,5,8,1,2]] = 0
          x
          array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])
          0 is assigned to 4th, 5th, 8th, 1st and 2nd indices of x.
          When the list of indices contains repetitions then it assigns the last value to that index:
          x = np.arange(10)
          x
          x[[4,4,2,3]] = [100,200,300,400]
          x
          array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])
          Notice that for the 5th element(i.e. 4th index) the value assigned is 200, not 100.
Caution: If one uses the += operator on repeated indices, the operation is carried out only once per index.
          x = np.arange(10)
          x[[1,1,1,7,7]]+=1
          x
           array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])
Although indices 1 and 7 are repeated, they are incremented only once (see the sketch below for a way to count every repetition).
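If every repetition should actually be counted, NumPy's unbuffered np.add.at can be used; a small sketch:
x = np.arange(10)
np.add.at(x, [1, 1, 1, 7, 7], 1)   # index 1 is incremented 3 times, index 7 twice
x
array([0, 4, 2, 3, 4, 5, 6, 9, 8, 9])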

          Indexing with Boolean arrays

We create a 2D array and store our condition in b. Wherever the condition is true the result is True, otherwise False.
          a = np.arange(12).reshape(3,4)
          b = a > 4
          b                                    
          array([[False, False, False, False],
          [False, True, True, True],
          [ True, True, True, True]], dtype=bool)
          Note that 'b' is a Boolean with same shape as that of 'a'.
          To select the elements from 'a' which adhere to condition 'b' we write:
          a[b]                                       
          array([ 5,  6,  7,  8,  9, 10, 11])
a[b] returns a 1D array containing the selected elements; 'a' itself is unchanged.
          This property can be very useful in assignments:
          a[b] = 0                                 
          a
          array([[0, 1, 2, 3],
          [4, 0, 0, 0],
          [0, 0, 0, 0]])
          All elements of 'a' higher than 4 become 0
          As done in integer indexing we can use indexing via Booleans:
          Let x be the original matrix and 'y' and 'z' be the arrays of Booleans to select the rows and columns.
          x = np.arange(15).reshape(3,5)
          y = np.array([True,True,False])             # first dim selection
          z = np.array([True,True,False,True,False])       # second dim selection
          We write the x[y,:] which will select only those rows where y is True.
          x[y,:]                                   # selecting rows
          x[y]                                     # same thing
          Writing x[:,z] will select only those columns where z is True.
          x[:,z]                                   # selecting columns
          x[y,:]                                   # selecting rows
          Output:
          array([[0, 1, 2, 3, 4],
          [5, 6, 7, 8, 9]])

          x[y] # same thing
          Output:
          array([[0, 1, 2, 3, 4],
          [5, 6, 7, 8, 9]])

          x[:,z] # selecting columns
          Output:
          array([[ 0, 1, 3],
          [ 5, 6, 8],
          [10, 11, 13]])

          Stacking various arrays

          Let us consider 2 arrays A and B:
          A = np.array([[10,20,30],[40,50,60]])
          B = np.array([[100,200,300],[400,500,600]])
          To join them vertically we use np.vstack( ).
          np.vstack((A,B))   #Stacking vertically
          array([[ 10,  20,  30],
          [ 40, 50, 60],
          [100, 200, 300],
          [400, 500, 600]])
          To join them horizontally we use np.hstack( ).
          np.hstack((A,B))   #Stacking horizontally
          array([[ 10,  20,  30, 100, 200, 300],
          [ 40, 50, 60, 400, 500, 600]])
newaxis helps in transforming a 1D array into a column vector (a 2D array with a single column).
          from numpy import newaxis
          a = np.array([4.,1.])
          b = np.array([2.,8.])
          a[:,newaxis]  
          array([[ 4.],
          [ 1.]])
The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. For 2D inputs such as a[:,newaxis] it is equivalent to hstack:
          np.column_stack((a[:,newaxis],b[:,newaxis]))
          np.hstack((a[:,newaxis],b[:,newaxis])) # same as column_stack
          np.column_stack((a[:,newaxis],b[:,newaxis]))
          Output:
          array([[ 4., 2.],
          [ 1., 8.]])

          np.hstack((a[:,newaxis],b[:,newaxis]))
          Output:
          array([[ 4., 2.],
          [ 1., 8.]])

          Splitting the arrays.

          Consider an array 'z' of 15 elements:
          z = np.arange(1,16)
          Using np.hsplit( ) one can split the arrays
          np.hsplit(z,5)   # Split a into 5 arrays
          [array([1, 2, 3]),
          array([4, 5, 6]),
          array([7, 8, 9]),
          array([10, 11, 12]),
          array([13, 14, 15])]
It splits 'z' into 5 arrays of equal length.
          On passing 2 elements we get:
          np.hsplit(z,(3,5))   
          [array([1, 2, 3]),
          array([4, 5]),
          array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
          It splits 'z' after the third and the fifth element.
          For 2D arrays np.hsplit( ) works as follows:
          A = np.arange(1,31).reshape(3,10)
          A
          np.hsplit(A,5)   # Split a into 5 arrays
          [array([[ 1,  2],
          [11, 12],
          [21, 22]]), array([[ 3, 4],
          [13, 14],
          [23, 24]]), array([[ 5, 6],
          [15, 16],
          [25, 26]]), array([[ 7, 8],
          [17, 18],
          [27, 28]]), array([[ 9, 10],
          [19, 20],
          [29, 30]])]
          In the above command A gets split into 5 arrays of same shape.
          To split after the third and the fifth column we write:
          np.hsplit(A,(3,5))   
          [array([[ 1,  2,  3],
          [11, 12, 13],
          [21, 22, 23]]), array([[ 4, 5],
          [14, 15],
          [24, 25]]), array([[ 6, 7, 8, 9, 10],
          [16, 17, 18, 19, 20],
          [26, 27, 28, 29, 30]])]
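For splitting along rows rather than columns, NumPy also provides np.vsplit( ). A small sketch using a 4 x 3 array; the array C here is ours, not part of the original tutorial:
C = np.arange(1, 13).reshape(4, 3)
np.vsplit(C, 2)    # split into 2 arrays of 2 rows each
[array([[1, 2, 3],
[4, 5, 6]]), array([[ 7,  8,  9],
[10, 11, 12]])]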

          Copying

          Consider an array x
          x = np.arange(1,16)
We assign y as x and then check 'y is x', which returns True - the assignment does not create a new object.
          y = x         
          y is x         
          Let us change the shape of y
          y.shape = 3,5  
          Note that it alters the shape of x
          x.shape
          (3, 5)

          Creating a view of the data:

          Let us store z as a view of x by:
          z = x.view()
          z is x   
          False
          Thus z is not x.
          Changing the shape of z
          z.shape = 5,3                   
Changing the shape of the view does not alter the shape of x
          x.shape
          (3, 5)
          Changing an element in z
          z[0,0] = 1234                      
Note that the value in x also gets altered:
          x
          array([[1234,    2,    3,    4,    5],
          [ 6, 7, 8, 9, 10],
          [ 11, 12, 13, 14, 15]])
Thus changing the shape of a view does not affect the original array, but changing the values of a view does affect the original data.
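When it is unclear whether an array is a view or an independent copy, np.shares_memory( ) and the .base attribute can be used to check; a minimal sketch reusing the x and z arrays above (the copy c is ours, for illustration):
print(np.shares_memory(x, z))   # True  -> z is a view on x's data
c = x.copy()
print(np.shares_memory(x, c))   # False -> c owns its own data
print(z.base is x)              # True  -> a view's base is the original array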


          Creating a copy of the data:
          Now let us create z as a copy of x:
          z = x.copy()                        
          Note that z is not x
          z is x
          Changing the value in z
          z[0,0] = 9999
          No alterations are made in x.
          x
          array([[1234,    2,    3,    4,    5],
          [ 6, 7, 8, 9, 10],
          [ 11, 12, 13, 14, 15]])
pandas may sometimes give a 'SettingWithCopy' warning because it is unable to recognize whether a new dataframe or array (created as a subset of another dataframe or array) is a view or a copy. In such situations the user needs to make it explicit - either modify the original via .loc or take an explicit copy - otherwise the results may not be what was intended. A small sketch is shown below.
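A minimal pandas sketch of the situation described above; the dataframe df and the threshold are made up for illustration:
import pandas as pd

df = pd.DataFrame({'Names': ['John', 'Mary', 'Henry'],
                   'Marks': [50, 81, 98]})

# Chained indexing like df[df['Marks'] > 60]['Marks'] = 100 may trigger a
# SettingWithCopyWarning, because pandas cannot tell if the subset is a view or a copy.

# Safer: modify the original through .loc in a single step ...
df.loc[df['Marks'] > 60, 'Marks'] = 100

# ... or take an explicit copy when an independent dataframe is wanted.
subset = df[df['Marks'] > 60].copy()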

          Data Visualization in R using ggplot2

For the purpose of data visualization, R offers various methods through inbuilt graphics and powerful packages such as ggplot2. The former helps in creating simple graphs while the latter assists in creating customized professional graphs. In this article we will learn how various graphs can be made and altered using the ggplot2 package.
          Data Visualization with R

          What is ggplot2?

ggplot2 is a robust and versatile R package, developed by the well-known R developer Hadley Wickham, for generating aesthetic plots and charts.

ggplot2 stands for "Grammar of Graphics", which is built on the principle that a plot can be split into the following basic parts -
          Plot = data + Aesthetics + Geometry
          1. data refers to a data frame (dataset).
          2. Aesthetics indicates x and y variables. It is also used to tell R how data are displayed in a plot, e.g. color, size and shape of points etc.
          3. Geometry refers to the type of graphics (bar chart, histogram, box plot, line plot, density plot, dot plot etc.)

          Apart from the above three parts, there are other important parts of plot -
          1. Faceting implies the same type of graph can be applied to each subset of the data. For example, for variable gender, creating 2 graphs for male and female.
2. Annotation lets you add text to the plot.
          3. Summary Statistics allows you to add descriptive statistics on a plot.
          4. Scales are used to control x and y axis limits

          Why ggplot2 is better?
• Excellent themes can be created with a single command.
• Its colors are nicer and prettier than the usual graphics.
• It is easy to visualize data with multiple variables.
• It provides a platform to create simple graphs that convey a plethora of information.

          The table below shows common charts along with various important functions used in these charts.
Plot : Important Functions
Scatter Plot : geom_point(), geom_smooth(), stat_smooth()
Bar Chart : geom_bar(), geom_errorbar()
Histogram : geom_histogram(), stat_bin(), position_identity(), position_stack(), position_dodge()
Box Plot : geom_boxplot(), stat_boxplot(), stat_summary()
Line Plot : geom_line(), geom_step(), geom_path(), geom_errorbar()
Pie Chart : coord_polar()

          Datasets

          In this article, we will use three datasets - 'iris' , 'mpg' and 'mtcars' datasets available in R.

1. The 'iris' data comprises 150 observations with 5 variables. We have 3 species of flowers - Setosa, Versicolor and Virginica - and for each of them the sepal length and width and petal length and width are provided.

2. The 'mtcars' data consists of fuel consumption (mpg) and 10 aspects of automobile design and performance for 32 automobiles. In other words, we have 32 observations and 11 different variables:
          1. mpg Miles/(US) gallon
          2. cyl Number of cylinders
          3. disp Displacement (cu.in.)
          4. hp Gross horsepower
          5. drat Rear axle ratio
          6. wt Weight (1000 lbs)
          7. qsec 1/4 mile time
          8. vs V/S
          9. am Transmission (0 = automatic, 1 = manual)
          10. gear Number of forward gears
          11. carb Number of carburetors

          3. The 'mpg' data consists of 234 observations and 11 variables.


          Install and Load Package

First we need to install the package in R by using the command install.packages( ).
          #installing package
          install.packages("ggplot2")
          library(ggplot2)
          Once installation is completed, we need to load the package so that we can use the functions available in the ggplot2 package. To load the package, use command library( )

          Histogram, Density plots and Box plots are used for visualizing a continuous variable.

          Creating Histogram: 

          Firstly we consider the iris data to create histogram and scatter plot.
          # Considering the iris data.
          # Creating a histogram
          ggplot(data  = iris, aes( x = Sepal.Length)) + geom_histogram( )
Here we call the ggplot( ) function, the first argument being the dataset to be used.

1. In aes( ), i.e. the aesthetics, we define which variable will be represented on the x-axis; here we consider 'Sepal.Length'.
2. geom_histogram( ) denotes that we want to plot a histogram.

          Histogram in R

To change the width of the bins in the histogram we can use binwidth in geom_histogram( )
          ggplot(data  = iris, aes(x = Sepal.Length)) + geom_histogram(binwidth=1)

One can also define the number of bins wanted; the binwidth in that case will be adjusted automatically.
ggplot(data = iris , aes(x=Sepal.Length)) + geom_histogram(color="black", fill="white", bins = 10)

          Using  color = "black" and fill = "white" we are denoting the boundary colors and the inside color of the bins respectively.

          How to visualize various groups in histogram

          ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_histogram(fill="white", binwidth = 1)
          Histogram depicting various species


          Creating Density Plot

          Density plot is also used to present the distribution of a continuous variable.
          ggplot(iris, aes( x = Sepal.Length)) + geom_density( )
          geom_density( ) function is for displaying density plot.

          Density Plot

          How to show various groups in density plot

          ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_density( )
          Density Plot by group

          Creating Bar and Column Charts :

Bar and column charts are probably the most common chart types. They are best used to compare different values.

          Now mpg data will be used for creating the following graphics.
          ggplot(mpg, aes(x= class)) + geom_bar() 
          Here we are trying to create a bar plot for number of cars in each class using geom_bar( ).

          Column Chart using ggplot2

          Using coord_flip( ) one can inter-change x and y axis.
          ggplot(mpg, aes(x= class)) + geom_bar() + coord_flip()
          Bar Chart


          How to add or modify Main Title and Axis Labels

          The following functions can be used to add or alter main title and axis labels.
          1. ggtitle("Main title"): Adds a main title above the plot
          2. xlab("X axis label"): Changes the X axis label
          3. ylab("Y axis label"): Changes the Y axis label
          4. labs(title = "Main title", x = "X axis label", y = "Y axis label"): Changes main title and axis labels
          p = ggplot(mpg, aes(x= class)) + geom_bar()
          p + labs(title = "Number of Cars in each type", x = "Type of car", y = "Number of cars")
          Title and Axis Labels

          How to add data labels

          p = ggplot(mpg, aes(x= class)) + geom_bar()
          p = p + labs(title = "Number of Cars in each type", x = "Type of car", y = "Number of cars")
          p + geom_text(stat='count', aes(label=..count..), vjust=-0.25)
          geom_text() is used to add text directly to the plot. vjust is to adjust the position of data labels in bar.

          Add Data Labels in Bar


          How to reorder Bars

Using stat="identity" we can use our derived values instead of counts.
          library(plyr)
          library(dplyr)
          count(mpg,class) %>% arrange(-n) %>%
          mutate(class = factor(class,levels= class)) %>%
          ggplot(aes(x=class, y=n)) + geom_bar(stat="identity")
The above command will first create a frequency distribution for the type of car and then arrange it in descending order using arrange(-n). Then, using mutate( ), we convert the 'class' column to a factor with levels in that order, and finally plot the bar chart using geom_bar( ).

          Change order of bars

Here, the bar for SUV appears first as it has the maximum number of cars. The bars are now ordered by frequency count.

          Showing Mean of Continuous Variable by Categorical Variable

          df = mpg %>% group_by(class) %>% summarise(mean = mean(displ)) %>%
            arrange(-mean) %>% mutate(class = factor(class,levels= class))

          p = ggplot(df, aes(x=class, y=mean)) + geom_bar(stat="identity")
          p + geom_text(aes(label = sprintf("%0.2f", round(mean, digits = 2))),
                        vjust=1.6, color="white", fontface = "bold", size=4)

Now, using the dplyr library, we create a new dataframe 'df' and plot it.
Using group_by we group the data according to the various types of cars, and summarise enables us to find the statistics (here the mean of the 'displ' variable) for each group. To add data labels (with 2 decimal places) we use geom_text( )


          Customized BarPlot

          Creating Stacked Bar Chart

          p <- ggplot(data=mpg, aes(x=class, y=displ, fill=drv))
          p + geom_bar(stat = "identity")

          Stacked BarPlot
          p + geom_bar(stat="identity", position=position_dodge())
          Stacked - Position_dodge

          Creating BoxPlot

          Using geom_boxplot( ) one can create a boxplot.

          To create different boxplots for 'disp' for different levels of x we can define aes(x = cyl, y = disp)
          mtcars$cyl = factor(mtcars$cyl)
          ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot()
          We can see one outlier for 6 cylinders.

          To create a notched boxplot we write notch = TRUE
          ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot(notch = TRUE)

          Notched Boxplot


          Scatter Plot

          A scatterplot is used to graphically represent the relationship between two continuous variables.
          # Creating a scatter plot denoting various species.
          ggplot(data = iris, aes( x = Sepal.Length, y = Sepal.Width,shape = Species, color = Species)) + geom_point()
We plot the points using geom_point( ). In the aesthetics we define that the x axis denotes sepal length and the y axis denotes sepal width; shape = Species and color = Species denote that a different shape and a different color should be used for each particular species of flower.
          Scatter Plot
          Scatter plots are constructed using geom_point( ) 
          # Creating scatter plot for automatic cars denoting different cylinders.
          ggplot(data = subset(mtcars,am == 0),aes(x = mpg,y = disp,colour = factor(cyl))) + geom_point()
Scatter plot denoting various levels of cyl
We use the subset( ) function to select only those cars with am = 0, in other words the automatic cars. We plot displacement against mileage and use a different color for each number of cylinders; factor(cyl) converts the numeric cylinder variable to a factor.
          # Seeing the patterns with the help of geom_smooth.
          ggplot(data = mtcars, aes(x = mpg,y = disp,colour = hp))  + geom_point() + geom_smooth()
In the above command we plot mileage (mpg) against displacement (disp), with the color denoting horsepower (hp). geom_smooth( ) is used to show the overall pattern exhibited by the points.
          In a similar way we can use geom_line( ) to plot another line on the graph:
          # Plotting the horsepower using geom_line
          ggplot(data = mtcars, aes(x = mpg,y = disp,colour = hp))  + geom_point(size = 2.5) + geom_line(aes(y = hp))
Here in geom_point we have added an optional argument size = 2.5 denoting the size of the points. geom_line( ) draws a line; since we supply only y = hp in its aesthetics, it inherits x = mpg from the main call and therefore plots horsepower (hp) against mileage (mpg).

          Modifying the axis labels and appending the title and subtitle.

          #Adding title or changing the labels
          ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + labs(title = "Scatter plot") 
          #Alternatively
          ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot")
          ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot",
                                                                        subtitle = "mtcars data in R")
          Adding title and subtitle to plots
Here labs( ) or ggtitle( ) can be used to give the graph a title. To add both a title and a sub-title we use ggtitle( ), where the first argument is the main title and the second argument is the subtitle.
          a <- ggplot(mtcars,aes(x = mpg, y = disp, color = factor(cyl))) + geom_point()
          a
          #Changing the axis labels.
          a + labs(color = "Cylinders")
          a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement")
We first save our plot to 'a' and then make the alterations.
Note that in the labs command color = "Cylinders" changes the title of our legend.
Using the xlab and ylab commands we can change the x and y axis labels respectively. Here the x axis label is 'Mileage' and the y axis label is 'Displacement'.
          #Combining it all
          a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement") + ggtitle(label = "Scatter plot", subtitle = "mtcars data in R")
           In the above plot we can see that the labels on x axis,y axis and legend have changed; the title and subtitle have been added and the points are colored, distinguishing the number of cylinders.

          Playing with themes

Themes can be used in ggplot2 to change the background, text colors, legend colors and axis text.
First we save our plot to 'b' and then create the visualizations by modifying 'b'. Note that in the aesthetics we have written mpg, disp, which automatically maps mpg to the x axis and disp to the y axis.
          #Changing the themes.
          b <- ggplot(mtcars,aes(mpg,disp)) + geom_point()  + labs(title = "Scatter Plot") 
          #Changing the size and color of the Title and the background color.
          b + theme(plot.title = element_text(color = "blue",size = 17),plot.background = element_rect("orange"))
          Plot background color changed.
We use theme( ) to modify the plot title and background. plot.title is an element_text( ) object in which we specify the color and size of the title, while plot.background is an element_rect( ) object through which we specify the background color of the plot.
ggplot2 also ships with complete themes in which the panel background and grid colors are changed automatically. Some of them are theme_gray, theme_minimal, theme_dark etc.
          b + theme_minimal( )
          We can observe horizontal and vertical lines behind the points. What if we don't need them? This can be achieved via: 
          #Removing the lines from the background.
          b + theme(panel.background = element_blank())

          Setting panel.background = element_blank( ) with no other parameter can remove those lines and color from the panel.
          #Removing the text from x and y axis.
          b + theme(axis.text = element_blank())
          b + theme(axis.text.x = element_blank())
          b + theme(axis.text.y = element_blank())
To remove the text from both axes we can use axis.text = element_blank( ). If we want to remove the text from only one axis then we need to specify it (axis.text.x or axis.text.y).
          Now we save our plot to c and then make the changes.
          #Changing the legend position
          c <- ggplot(mtcars,aes(x = mpg, y = disp, color = hp)) +labs(title = "Scatter Plot") + geom_point()
          c +  theme(legend.position = "top")
          If we want to move the legend then we can specify legend.position as "top" or "bottom" or "left" or "right".
Finally, combining all that we have learnt about themes, we create a plot where the legend is placed at the bottom, the plot title is forest green, the background is yellow and no text is displayed on either axis.
          #Combining everything.
          c + theme(legend.position = "bottom", axis.text = element_blank()) +
  theme(plot.title = element_text(color = "forestgreen", size = 17), plot.background = element_rect("yellow")) 

          Changing the color scales in the legend.

In ggplot2, the default continuous color scale runs from dark blue to light blue. We may wish to customize the scale by changing the colors or adding new ones. This can be done successfully via the scale_color_gradient( ) function.
          c + scale_color_gradient(low = "yellow",high = "red") 
          Suppose we want the colors to vary from yellow to red; yellow denoting the least value and red denoting the highest value; we set low = "yellow" and high = "red". Note that in the legend it takes the scale to be started from 0 and not the minimum value of the series.
          What if we want 3 colors?
          c + scale_color_gradient2(low = "red",mid = "green",high = "blue")
To have 3 colors in the legend we use scale_color_gradient2 with low = "red", mid = "green" and high = "blue"; it shades the lower part of the scale red, the middle part green and the highest part blue.
          c + theme(legend.position = "bottom") + scale_color_gradientn(colours = c("red","forest green","white","blue"))
If we want more than 3 colors in our legend we can use the scale_color_gradientn( ) function, whose colours argument is a vector in which the 1st element gives the color of the 1st part of the scale, the 2nd element the color of the 2nd part, and so on.


          Changing the breaks in the legend.

          It can be seen that the legend for continuous variable starts from 0.
          Suppose we want the breaks to be: 50,125,200,275 and 350, we use seq(50,350,75) where 50 denotes the least number, 350 is the maximum number in the sequence and 75 is the difference between 2 consecutive numbers.
          #Changing the breaks in the legend
          c + scale_color_continuous(name = "horsepower", breaks = seq(50,350,75), labels = paste(seq(50,350,75),"hp"))
           In scale_color_continuous we set the breaks as our desired sequence, and can change the labels if we want. Using paste function our sequence is followed by the word "hp" and name = "horsepower" changes the name of our legend.

          Changing the break points and color scale of the legend together.

          Let us try changing the break points and the colors in the legend together by trial and error.
          #Trial 1 : This one is wrong
          c + scale_color_continuous( breaks = seq(50,350,75)) +
            scale_color_gradient(low = "blue",high = "red") 
          We can refer to trial1 image for the above code which can be found below. Notice that the color scale is blue to red as desired but the breaks have not changed.
          #Trial 2: Next one is wrong.
          c  +  scale_color_gradient(low = "blue",high = "red") +
            scale_color_continuous( breaks = seq(50,350,75))
          trial2 image is the output for the above code. Here the color scale has not changed but the breaks have been created.
          trial1 
          trial2

           What is happening? The reason for this is that we cannot have 2 scale_color functions for a single graph. If there are multiple scale_color_ functions then R overwrites the other scale_color_ functions by the last scale_color_ command it has received.
          In trial 1, scale_color_gradient overwrites the previous scale_color_continuous command. Similarly in trial 2, scale_color_continuous overwrites the previous scale_color_gradient command.

The correct way is to define all the arguments in a single scale_color function.
          c + scale_color_continuous(name = "horsepower", breaks = seq(50,350,75), low = "red", high = "black") + theme(panel.background = element_rect("green"),
           plot.background = element_rect("orange"))
          Here low = "red" and high = "black" are defined in scale_color_continuous function along with the breaks.

          Changing the axis cut points.

          We save our initial plot to 'd'. 
          d <- ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point(aes(color = factor(am)))  +
            xlab("Mileage") + ylab("Displacement") +
            theme(panel.background = element_rect("black") , plot.background = element_rect("pink"))
          To change the axis cut points we use scale_(axisname)_continuous.
          d +  scale_x_continuous(limits = c(2,4)) + scale_y_continuous(limits = c(15,30))
To change the x axis limits we use scale_x_continuous, whose 'limits' argument is a vector giving the lower and upper limits of the axis. Likewise, scale_y_continuous sets the lowest cut-off point of the y axis to 15 and the highest to 30.

          d + scale_x_continuous(limits = c(2,4),breaks = seq(2,4,0.25)) +
            scale_y_continuous(limits = c(15,30),breaks = seq(15,30,3))
We can also add the 'breaks' parameter, which takes a vector specifying the cut-off points of the axis. Here we create the sequence 2, 2.25, 2.5, ..., 4 for the x axis, and for the y axis the sequence is 15, 18, 21, ..., 30.

          Faceting.

          Faceting is a technique which is used to plot the graphs for the data corresponding to various categories of a particular variable. Let us try to understand it via an illustration:


The facet_wrap function is used for faceting; after the tilde (~) sign we define the variable(s) by which we want to split the plot.
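A minimal version of the faceted plot described just below (splitting the mtcars scatter plot by carb) would be -
ggplot(mtcars, aes(mpg, disp)) +  geom_point() +  facet_wrap(~carb)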
          Faceting for carb
          We see that there are 6 categories of "carb". Faceting creates 6 plots between mpg and disp; where the points correspond to the categories.
          We can mention the number of rows we need for faceting.
          # Control the number of rows and columns with nrow and ncol
          ggplot(mtcars, aes(mpg, disp)) +  geom_point() +  facet_wrap(~carb,nrow = 3)
Here the additional parameter nrow = 3 specifies that all the panels should be arranged in 3 rows.

          Faceting using multiple variables.

          Faceting can be done for various combinations of carb and am.  
          # You can facet by multiple variables
          ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(~carb + am)
          #Alternatively
          ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(c("carb","am"))
There are 6 unique 'carb' values and 2 unique 'am' values, so 12 combinations are possible, but we get only 9 panels because the remaining 3 combinations have no observations.
It can be hard to tell which panel corresponds to which level of am and carb when the labels are not provided. Accordingly, we can label the variables.
          # Use the `labeller` option to control how labels are printed:
          ggplot(mtcars, aes(mpg, disp)) +  geom_point() +  facet_wrap(~carb  + am, labeller = "label_both")
          facet_wrap in multiple variables.
R provides the facet_grid( ) function which can be used to facet in two dimensions.
          z <- ggplot(mtcars, aes(mpg, disp)) + geom_point()
          We store our basic plot in 'z' and thus we can make the additions:
          z + facet_grid(. ~ cyl)   #col
          z + facet_grid(cyl ~ .)   #row
          z + facet_grid(gear ~ cyl,labeller = "label_both")  #row and col
          using facet_grid( )
          In facet_grid(.~cyl), it facets the data by 'cyl' and the cylinders are represented in columns. If we want to represent 'cyl' in rows, we write facet_grid(cyl~.). If we want to facet according to 2 variables we write facet_grid(gear~cyl) where gears are represented in rows and 'cyl' are illustrated in columns.


          Adding text to the points.

Using ggplot2 we can also display values / labels next to the points. This can be accomplished using geom_text( ).
          #Adding texts to the points
          ggplot(mtcars, aes(x= mpg,y = disp)) + geom_point() +
            geom_text(aes(label = am))
          In geom_text we provide aes(label = am) which depicts that for all the points the corresponding levels of "am" should be shown.
          In the graph it can be perceived that the labels of 'am' are overlapping with the points. In some situations it may become difficult to read the labels when there are many points. In order to avoid this we use geom_text_repel function in 'ggrepel' library.
          require(ggrepel)
          ggplot(mtcars, aes(x= mpg,y = disp)) + geom_point() +
            geom_text_repel(aes(label = am))
           We load the library ggrepel using require( ) function. If we don't want the text to overlap we use geom_text_repel( ) instead of geom_text( ) of ggplot2 , keeping the argument aes(label = am).
          geom_text_repel

          Linear Regression in Python

          Linear Regression is a supervised statistical technique where we try to estimate the dependent variable with a given set of independent variables. We assume the relationship to be linear and our dependent variable must be continuous in nature.
          Python : Linear Regression
          In the following diagram we can see that as horsepower increases mileage decreases thus we can think to fit linear regression. The red line is the fitted line of regression and the points denote the actual observations.

          The vertical distance between the points and the fitted line (line of best fit) are called errors. The main idea is to fit this line of regression by minimizing the sum of squares of these errors. This is also known as principle of least squares.
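As a rough sketch of this least-squares idea (using made-up numbers, not the dataset analysed later), the intercept and slope that minimize the sum of squared errors can be computed directly with numpy -
import numpy as np

# toy data: y is roughly 2*x plus some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])                 # add an intercept column
beta, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates
print(beta)   # [intercept, slope] minimizing the sum of squared errors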

          Examples:

          • Estimating the price (Y) of a house on the basis of its Area (X1), Number of bedrooms (X2), proximity to market (X3) etc. 
          • Estimating the mileage of a car (Y) on the basis of its displacement (X1), horsepower(X2), number of cylinders(X3), whether it is automatic or manual (X4) etc. 
• Estimating the treatment cost of a patient on the basis of factors like age, weight, past medical history and, if blood reports are available, the information from the blood report.
          Simple Linear Regression Model: In this we try to predict the value of dependent variable (Y) with only one regressor or independent variable(X).

          Multiple Linear Regression Model: Here we try to predict the value of dependent variable (Y) with more than one regressor or independent variables.

          The linear regression model:

y = β0 + β1X1 + β2X2 + ... + βkXk + ε

Here 'y' is the dependent variable to be estimated, X1, ..., Xk are the independent variables, β0, ..., βk are the coefficients to be estimated and ε is the error term.

          Assumptions of linear regression:

          • There must be a linear relationship between the dependent and independent variables.
          • Sample observations are independent.
          • Error terms are normally distributed with mean 0. 
• No multicollinearity - the independent variables should not be highly linearly related to each other; when they are, the situation is called multicollinearity.
          • Error terms are identically and independently distributed. (Independence means absence of autocorrelation).
          • Error terms have constant variance i.e. there is no heteroscedasticity.
          • No outliers are present in the data.


          Important Model Performance Metrics

          Coefficient of determination (R square)

It gives the proportion of variation in Y which can be explained by the independent variables. Mathematically, it is one minus the ratio of the residual sum of squares (RSS) to the total sum of squares (TSS), i.e.
R2 = 1 - RSS/TSS

If our fit is perfect then RSS = 0 and R2 = 1.

If RSS = TSS, i.e. the model explains nothing, then R2 = 0, indicating a poor fit. Thus R2 lies between 0 and 1.
          If the value of R2 is 0.912 then this suggests that 91.2% of the variation in Y can be explained with the help of given explanatory variables in that model. In other words, it explains the proportion of variation in the dependent variable that is explained by the independent variables.

R square alone is not such a good measure:

On addition of a new variable the residual sum of squares can only decrease (or stay the same), thus R square never decreases when a new variable is added to the model. This says nothing about the importance of the variable.
For example, in a model determining the price of a house, suppose we have the variables GDP, inflation rate and area. If we add a new, irrelevant variable such as the number of plane crashes, R square will still increase.

          Adjusted R square:

Adjusted R square is given by:
Adjusted R2 = 1 - [(1 - R2)(n - 1) / (n - k - 1)]

where n is the no. of observations and k is the no. of regressors or predictors.

          Hence adjusted R square will always be less than or equal to R square.

When a variable is added, R square in the numerator increases and so does 'k' in the denominator.
If the variable is actually useful then R square increases by a large amount, enough to compensate for the increase in 'k', and adjusted R square goes up. On the other hand, if the variable is irrelevant then R square does not increase much and adjusted R square will eventually decrease.
          Thus as a general thumb rule if adjusted R square increases when a new variable is added to the model, the variable should remain in the model. If the adjusted R square decreases when the new variable is added then the variable should not remain in the model.

          Why error terms should be normally distributed?

For parameter estimation (i.e. estimating the βi's) we don't need that assumption. But if the errors are not normally distributed, some of the hypothesis tests which we will be doing as part of diagnostics may not be valid.
For example: To check whether a Beta (regression coefficient) is significant or not, we do a T-test. If the errors are not normally distributed, the statistic we derive may not follow a T-distribution, so the diagnostic or hypothesis test is not valid. Similarly, the F-test for linear regression, which checks whether any of the independent variables in a multiple linear regression model is significant, will not be valid.

          Why is expectation of error always zero?

The error term is the deviation between the observed points and the fitted line. The observed points lie both above and below the fitted line, so if we take the average of all the deviations it should be 0 or near 0. The zero conditional mean assumption says that the negative and positive errors cancel out on average, which helps us to estimate the dependent variable precisely.

Why is multicollinearity a problem?

If the Xi's are highly correlated then |X'X| will be close to 0 and hence the inverse of (X'X) will not exist or will have extremely large entries. Mathematically, Var(β̂) = σ²(X'X)⁻¹, which becomes extremely large in the presence of multicollinearity. Long story short, multicollinearity inflates the estimated standard errors of the regression coefficients, which makes some variables statistically insignificant when they should be significant.
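A quick numpy sketch (simulated data, purely for illustration) shows how near-collinearity blows up the diagonal of (X'X)⁻¹, which drives the variance of the coefficient estimates -
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2_ok  = rng.normal(size=100)                    # independent of x1
x2_bad = x1 + rng.normal(scale=0.01, size=100)   # nearly collinear with x1

for x2 in (x2_ok, x2_bad):
    X = np.column_stack([np.ones(100), x1, x2])
    # Var(beta_hat) is proportional to the diagonal of (X'X)^-1
    print(np.round(np.diag(np.linalg.inv(X.T @ X)), 4))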

          How can you detect multicollinearity?

          1. Bunch Map Analysis: By plotting scatter plots between various Xi’ s we can have a visual description of how the variables are related.

          2. Correlation Method: By calculating the correlation coefficients between the variables we can get to know about the extent of multicollinearity in the data.

3. VIF (Variance Inflation Factor) Method: Firstly we fit a model with all the variables and then calculate the variance inflation factor (VIF) for each variable. VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated. The higher the VIF for the ith regressor, the more highly correlated it is with the other variables.

          So what is Variance Inflation Factor?

Variance inflation factor (VIF) for an explanatory variable is given by 1/(1 - R^2), where R^2 is obtained by regressing that particular X on all the remaining explanatory variables. In other words, we take that X as the response variable and the other explanatory variables as predictors.
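As a minimal sketch of this definition (the helper name manual_vif is illustrative; x is assumed to be a pandas DataFrame of predictors like the one used later in this tutorial) -
import statsmodels.api as sma

def manual_vif(x, col):
    # regress the chosen predictor on all the remaining predictors
    y_col = x[col]
    X_rest = sma.add_constant(x.drop(columns=[col]))
    r2 = sma.OLS(y_col, X_rest).fit().rsquared
    return 1.0 / (1.0 - r2)

# e.g. manual_vif(x_train, "Water")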

          Detecting heteroscedasticity!

          1. Graphical Method: Firstly do the regression analysis and then plot the error terms against the predicted values( Yi^). If there is a definite pattern (like linear or quadratic or funnel shaped) obtained from the scatter plot then heteroscedasticity is present.
2. Goldfeld Quandt (GQ) Test: It assumes that the heteroscedastic variance σi2 is positively related to one of the explanatory variables, and that the errors are normally distributed. Thus if heteroscedasticity is present, the variance will be high for large values of X.

          Steps for GQ test:
          1. Order/ rank (ascending) the observations according to the value of Xi beginning with the lowest X value.
          2. Omit ‘c’ central observations and divide the remaining (n-c) observations into 2 groups of (n-c)/2 observations each.
          3. Fit separate OLS regression to both the groups and obtain residual sum of squares (RSS1 and RSS2) for both the groups.
          4. Obtain F = RSS2/ RSS1 
It follows an F distribution with ((n-c)/2 - k) d.f. for both the numerator and the denominator,
where k is the no. of parameters to be estimated including the intercept.
If the errors are homoscedastic then the two residual sums of squares RSS1 and RSS2 turn out to be roughly equal, i.e. F will tend to 1.
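A rough sketch of these steps (assuming y is a pandas Series or single-column DataFrame and x a DataFrame of predictors; the function name and the fraction of central observations dropped are illustrative - later in this tutorial we simply use statsmodels' built-in het_goldfeldquandt) -
import numpy as np
import statsmodels.api as sma

def gq_fstat(y, x, sort_col, drop_frac=0.2):
    # 1. order the observations by the chosen explanatory variable
    order = np.argsort(x[sort_col].values)
    x_s, y_s = x.iloc[order], y.iloc[order]
    n = len(x_s)
    c = int(n * drop_frac)              # 2. omit c central observations
    half = (n - c) // 2
    # 3. fit separate OLS models on the two groups and take their residual sums of squares
    rss1 = sma.OLS(y_s.iloc[:half], sma.add_constant(x_s.iloc[:half])).fit().ssr
    rss2 = sma.OLS(y_s.iloc[-half:], sma.add_constant(x_s.iloc[-half:])).fit().ssr
    # 4. F = RSS2 / RSS1, to be compared against an F distribution
    return rss2 / rss1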

          Dataset used:

We have 1030 observations on 9 variables. We try to estimate the concrete compressive strength (the CMS column in the data) using:

          1. Cement - kg in a m3 mixture
          2. Blast Furnace Slag - kg in a m3 mixture
          3. Fly Ash - kg in a m3 mixture
          4. Water - kg in a m3 mixture
          5. Superplasticizer - kg in a m3 mixture
          6. Coarse Aggregate - kg in a m3 mixture
          7. Fine Aggregate - kg in a m3 mixture
          8. Age - Day (1-365)


          Importing the libraries:

          Numpy, pandas and matplotlib.pyplot are imported with aliases np, pd and plt respectively.
          import numpy as np
          import pandas as pd
          import matplotlib.pyplot as plt


          Loading the data

          We load our data using pd.read_csv( )
          data = pd.read_csv("Concrete_Data.csv")
          Now the data is divided into independent (x) and dependent variables (y)
          x = data.iloc[:,0:8]
          y = data.iloc[:,8:]


          Splitting the data into training and test sets.

Using sklearn we put 80% of our data into the training set and the rest into the test set. Setting random_state gives the same training and test split every time the code is run.
from sklearn.model_selection import train_test_split   # older scikit-learn versions used sklearn.cross_validation
          x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 100) 

          Running linear regression using sklearn:

Using sklearn, linear regression can be carried out with the LinearRegression( ) class. sklearn adds an intercept term to the model automatically.
          from sklearn.linear_model import LinearRegression
          lm = LinearRegression()
          lm = lm.fit(x_train,y_train)   #lm.fit(input,output)
          The coefficients are given by:
          lm.coef_
          array([[ 0.12415357,  0.10366839,  0.093371  , -0.13429401,  0.28804259,
          0.02065756, 0.02563037, 0.11461733]])

          To store coefficients in a data frame along with their respective independent variables -
          coefficients = pd.concat([pd.DataFrame(x_train.columns),pd.DataFrame(np.transpose(lm.coef_))], axis = 1)
          0            Cement  0.124154
          1 Blast 0.103668
          2 Fly Ash 0.093371
          3 Water -0.134294
          4 Superplasticizer 0.288043
          5 CA 0.020658
          6 FA 0.025630
          7 Age 0.114617
          The intercept is:
          lm.intercept_
          array([-34.273527])
          To predict the values of y on the test set we use lm.predict( )
          y_pred = lm.predict(x_test)
          Errors are the difference between observed and predicted values.
          y_error = y_test - y_pred
R square can be obtained using sklearn.metrics:
          from sklearn.metrics import r2_score
          r2_score(y_test,y_pred)
          0.62252008774048395


          Running linear regression using statsmodels:

Note that statsmodels does not add an intercept term automatically, so we need to add a constant column to our model ourselves.
          import statsmodels.api as sma
          X_train = sma.add_constant(x_train) ## let's add an intercept (beta_0) to our model
          X_test = sma.add_constant(x_test) 
Linear regression can be run using the OLS class (from statsmodels.api, imported above as sma):
lm2 = sma.OLS(y_train,X_train).fit()
          The summary of our model can be obtained via:
          lm2.summary()
          """
          OLS Regression Results
          ==============================================================================
          Dep. Variable: CMS R-squared: 0.613
          Model: OLS Adj. R-squared: 0.609
          Method: Least Squares F-statistic: 161.0
          Date: Wed, 03 Jan 2018 Prob (F-statistic): 4.37e-162
          Time: 21:29:10 Log-Likelihood: -3090.4
          No. Observations: 824 AIC: 6199.
          Df Residuals: 815 BIC: 6241.
          Df Model: 8
          Covariance Type: nonrobust
          ====================================================================================
          coef std err t P>|t| [0.025 0.975]
          ------------------------------------------------------------------------------------
          const -34.2735 29.931 -1.145 0.253 -93.025 24.478
          Cement 0.1242 0.010 13.054 0.000 0.105 0.143
          Blast 0.1037 0.011 9.229 0.000 0.082 0.126
          Fly Ash 0.0934 0.014 6.687 0.000 0.066 0.121
          Water -0.1343 0.046 -2.947 0.003 -0.224 -0.045
          Superplasticizer 0.2880 0.102 2.810 0.005 0.087 0.489
          CA 0.0207 0.011 1.966 0.050 2.79e-05 0.041
          FA 0.0256 0.012 2.131 0.033 0.002 0.049
          Age 0.1146 0.006 19.064 0.000 0.103 0.126
          ==============================================================================
          Omnibus: 3.757 Durbin-Watson: 2.033
          Prob(Omnibus): 0.153 Jarque-Bera (JB): 3.762
          Skew: -0.165 Prob(JB): 0.152
          Kurtosis: 2.974 Cond. No. 1.07e+05
          ==============================================================================

          Warnings:
          [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
          [2] The condition number is large, 1.07e+05. This might indicate that there are
          strong multicollinearity or other numerical problems.
          """
          The predicted values for test set are given by:
          y_pred2 = lm2.predict(X_test) 
Note that y_pred and y_pred2 are the same; they are just calculated via different packages.

          Calculate R-Squared and Adjusted R-Squared Manually on Test data

          We can also calculate r-squared and adjusted r-squared via formula without using any package.
          import numpy as np
          y_test = pd.to_numeric(y_test.CMS, errors='coerce')
          RSS = np.sum((y_pred2 - y_test)**2)
          y_mean = np.mean(y_test)
          TSS = np.sum((y_test - y_mean)**2)
          R2 = 1 - RSS/TSS
          R2

          n=X_test.shape[0]
          p=X_test.shape[1] - 1

          adj_rsquared = 1 - (1 - R2) * ((n - 1)/(n-p-1))
          adj_rsquared

          R-Squared : 0.6225
          Adjusted RSquared : 0.60719

          Detecting Outliers:

          Firstly we try to get the studentized residuals using get_influence( ). The studentized residuals are saved in resid_student.
          influence = lm2.get_influence() 
          resid_student = influence.resid_studentized_external
Combining the training set and the residuals we have:
resid = pd.concat([x_train,pd.Series(resid_student,name = "Studentized Residuals")],axis = 1)
resid.head()
   Cement  Blast  Fly Ash  Water  Superplasticizer      CA     FA    Age  \
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28.0
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28.0
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270.0
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365.0
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360.0

Studentized Residuals
0 1.559672
1 -0.917354
2 1.057443
3 0.637504
4 -1.170290
If the absolute value of a studentized residual is more than 3 then that observation is considered an outlier and hence should be removed. We create a logical filter for absolute studentized residuals greater than 3 -
resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:]
     Cement  Blast  Fly Ash  Water  Superplasticizer     CA     FA  Age  \
649 166.8 250.2 0.0 203.5 0.0 975.6 692.6 3.0

Studentized Residuals
649 3.161183
          The index of the outliers are given by ind:
          ind = resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:].index
          ind
           Int64Index([649], dtype='int64')

          Dropping Outlier

          Using the drop( ) function we remove the outlier from our training sets!
          y_train.drop(ind,axis = 0,inplace = True)
x_train.drop(ind,axis = 0,inplace = True)  #Intercept column is not there
          X_train.drop(ind,axis = 0,inplace = True)  #Intercept column is there

          Detecting and Removing Multicollinearity

          We use the statsmodels library to calculate VIF
          from statsmodels.stats.outliers_influence import variance_inflation_factor
          [variance_inflation_factor(x_train.values, j) for j in range(x_train.shape[1])]
          [15.477582601956859,
          3.2696650121931814,
          4.1293255012993439,
          82.210084751631086,
          5.21853674386234,
          85.866945489015535,
          71.816336942930675,
          1.6861600968467656]

          We create a function to remove the collinear variables. We choose a threshold of 5 which means if VIF is more than 5 for a particular variable then that variable will be removed.
          def calculate_vif(x):
              thresh = 5.0
              output = pd.DataFrame()
              k = x.shape[1]
              vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
              for i in range(1,k):
                  print("Iteration no.")
                  print(i)
                  print(vif)
                  a = np.argmax(vif)
                  print("Max VIF is for variable no.:")
                  print(a)
                  if vif[a] <= thresh :
                      break
                  if i == 1 :         
                      output = x.drop(x.columns[a], axis = 1)
                      vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
                  elif i > 1 :
                      output = output.drop(output.columns[a],axis = 1)
                      vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
              return(output)
          train_out = calculate_vif(x_train) 
          Now we view the training set
          train_out.head()
               Cement Blast Fly Ash Superplasticizer Age
          337   275.1    0.0    121.4               9.9   56
          384 516.0 0.0 0.0 8.2 28
          805 393.0 0.0 0.0 0.0 90
          682 183.9 122.6 0.0 0.0 28
          329 246.8 0.0 125.1 12.0 3


          Removing the variables from the test set.

          x_test.head()
          x_test.drop(["Water","CA","FA"],axis = 1,inplace = True)
          x_test.head()
               Cement  Blast  Fly Ash  Superplasticizer  Age
          173 318.8 212.5 0.0 14.3 91
          134 362.6 189.0 0.0 11.6 28
          822 322.0 0.0 0.0 0.0 28
          264 212.0 0.0 124.8 7.8 3
          479 446.0 24.0 79.0 11.6 7

          Running linear regression again on our new training set (without multicollinearity)

import statsmodels.api as sma
train_out = sma.add_constant(train_out) ## let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test)  # Water, CA and FA were already dropped from x_test above
lm2 = sma.OLS(y_train,train_out).fit()
          lm2.summary()
          """
          OLS Regression Results
          ==============================================================================
          Dep. Variable: CMS R-squared: 0.570
          Model: OLS Adj. R-squared: 0.567
          Method: Least Squares F-statistic: 216.3
          Date: Wed, 10 Jan 2018 Prob (F-statistic): 6.88e-147
          Time: 15:14:59 Log-Likelihood: -3128.8
          No. Observations: 823 AIC: 6270.
          Df Residuals: 817 BIC: 6298.
          Df Model: 5
          Covariance Type: nonrobust
          ====================================================================================
          coef std err t P>|t| [0.025 0.975]
          ------------------------------------------------------------------------------------
          const -11.1119 1.915 -5.803 0.000 -14.871 -7.353
          Cement 0.1031 0.005 20.941 0.000 0.093 0.113
          Blast 0.0721 0.006 12.622 0.000 0.061 0.083
          Fly Ash 0.0614 0.009 6.749 0.000 0.044 0.079
          Superplasticizer 0.7519 0.077 9.739 0.000 0.600 0.903
          Age 0.1021 0.006 16.582 0.000 0.090 0.114
          ==============================================================================
          Omnibus: 0.870 Durbin-Watson: 2.090
          Prob(Omnibus): 0.647 Jarque-Bera (JB): 0.945
          Skew: 0.039 Prob(JB): 0.623
          Kurtosis: 2.853 Cond. No. 1.59e+03
          ==============================================================================

          Checking normality of residuals

We use the Shapiro-Wilk test from the scipy library to check the normality of residuals.

          1. Null Hypothesis: The residuals are normally distributed.
          2. Alternative Hypothesis: The residuals are not normally distributed.

          from scipy import stats
          stats.shapiro(lm2.resid)
          (0.9983407258987427, 0.6269884705543518)

Since the p-value is 0.6269, at the 5% level of significance we fail to reject the null hypothesis and can say that the residuals are normally distributed.

          Checking for autocorrelation

To check for the absence of autocorrelation we use the Ljung-Box test.

          1. Null Hypothesis: Autocorrelation is absent.
          2. Alternative Hypothesis: Autocorrelation is present.

          from statsmodels.stats import diagnostic as diag
          diag.acorr_ljungbox(lm2.resid , lags = 1) 
          (array([ 1.97177212]), array([ 0.16025989]))

Since the p-value is 0.1602, we fail to reject the null hypothesis and can say that autocorrelation is absent.

          Checking heteroscedasticity

          Using Goldfeld Quandt we test for heteroscedasticity.

          1. Null Hypothesis: Error terms are homoscedastic
          2. Alternative Hypothesis: Error terms are heteroscedastic.

          import statsmodels.stats.api as sms
          from statsmodels.compat import lzip
          name = ['F statistic', 'p-value']
          test = sms.het_goldfeldquandt(lm2.resid, lm2.model.exog)
          lzip(name, test)
          [('F statistic', 0.9903), ('p-value', 0.539)]

The p-value is 0.539, hence we fail to reject the null hypothesis and can say that the residuals have constant variance. Thus all the assumptions of our linear regression model are satisfied.

          SAS Date Formats and Informats

          This tutorial describes the usage of SAS Date formats and informats. It includes practical real-world data problems related to SAS formats.

          What are Formats and Informats?

An informat tells SAS how to read a variable, whereas a format tells SAS how to display or write the values of a variable.

Informats are mainly used when you read or import data from an external file (Text/Excel/CSV) or read in sample data created with a CARDS/DATALINES statement. They are also used when you create a new variable in a dataset.

          Formats can be used in both Data Steps and PROC Steps whereas Informat can be used only in Data Steps. Let's understand by examples -

          Example 1 - Read Dates in SAS

In the program below, we have used the INFORMATS ddmmyy8. and ddmmyy10. to read dates in SAS. It creates a dataset called sampledata which is stored in the WORK library.

          DATA sampledata;
               INPUT @6 date1 ddmmyy8. @15 date2 ddmmyy10.;
              CARDS;
               30-12-16 30-12-2016
          ;
          RUN;
          The INFORMATS ddmmyy8. is used to read 30-12-16 date and ddmmyy10. to read 30-12-2016 date. In this case, 8 and 10 refers to width of the date.

          The created dataset looks like below -
          Read SAS Date
It returns 20818 because SAS stores dates as the number of days since 1 January 1960. The value itself is not meaningful; you cannot tell which date it is just by looking at it. To display it as a real date, use the FORMAT statement.
          DATA sampledata;
               INPUT @6 date1 ddmmyy8. @15 date2 ddmmyy10.;
               FORMAT date1 ddmmyy8. date2 ddmmyy10.;
          cards;
               30-12-16 30-12-2016
          ;
          RUN;
          Output : SAS Format

          How to read DD-MMM-YY format

          You can use date11. format for both DD-MMM-YY and DD-MMM-YYYY format.
          DATA temp;
               INPUT @6 dt date11.;
               FORMAT dt date11.;
               CARDS;
               10-oct-14
               ;
          PROC PRINT NOOBS;
          RUN;
          Result : 10-OCT-2014

          Example 2 - Display Today's Date

          The today() function can be used to generate current date.
          data _null_;
              dt=today();
              format dt yymmdd10.;
              put dt ;
          run;
Result : It returns 2016-12-30, as 30DEC2016 was the current date when the code was run. It's in YYYY-MM-DD format because we've used the yymmdd10. format; the 10 refers to the width of the date, as 2016-12-30 contains 10 characters. The PUT statement writes the value to the log window.

          To display date in WORD format

          1. Short Word Date Format

          The format date9. returns 30DEC2016.
          format dt date9.;

          2. Complete Word Date Format

          The format WORDDATE. returns DECEMBER 30, 2016. No need to specify width in this format. It automatically adjusts the width depending on the month.
          format dt WORDDATE.;

          3. Including WEEK

          The format WEEKDATE. gives Friday, December 30, 2016
          format dt WEEKDATE.;

          Display DAY / MONTH / YEAR

          In this section, we will see how we can write only day, month, year and weekday.
          data _null_;
          dt=today();
          put "Day :" dt  DAY.;
          put "Month :" dt MONTH.;
          put "YEAR:" dt YEAR.;
          put "WEEKDAY:" dt DOWNAME.;
          run;
          SAS : Date Formats
          We can also use FORMAT in the PUT statement without specifying FORMAT statement explicitly. The DAY. format returned 30, MONTH. format returned 12 and YEAR. format returned 2016. In addition, we have used DOWNAME. format to extract weekday (Friday).

          Other Popular Formats

          Some of the commonly used date formats are listed below -
Format          Result
DDMMYYP10.      30.12.2016
DDMMYYS10.      30/12/2016
MMDDYYP10.      12.30.2016
MMDDYYS10.      12/30/2016
WORDDATX19.     30 DECEMBER 2016

          Endnotes

          Hope you have a better understanding of the difference between SAS Date Formats and Informats and how they are used after completing this tutorial.

          Tutorial : Build Webapp in R using Shiny

          In this tutorial, we will cover how to build shiny app from scratch in R. It includes various examples which would help you to get familiar with shiny package.

Shiny is an R package developed by RStudio that can be used to create interactive web pages with R. In simple words, you can build a web page (online reporting tool) without knowing any web programming languages such as JavaScript / PHP / CSS.

The best part about the shiny package is that you can easily integrate R with a webpage. Suppose you want your web page to run machine learning algorithms like random forest, SVM etc. and display a summary of the model, with the flexibility of letting the user select the inputs. Shiny can do this very easily.
          R : Shiny Tutorial


          Shiny's prominent features

          1. Customizable widgets like sliders, drop down lists, numeric inputs and many more.
          2. Downloading datasets, graphs and tables in various formats.
          3. Uploading files.
          4. Provides utility to create brilliant plots.
          5. In-built functions for viewing data or printing the text or summaries.
          6. Reactive programs which makes data handling easier.
7. Conditional panels which appear only when a particular condition is met.
          8. Works in any R environment (Console R, RGUI for Windows or Mac,  RStudio, etc)
          9. No need to learn another software for online dashboarding
          10. Can style your app with CSS / HTML (Optional)

          Must things in shiny app code

          1. User Interface (UI) : It controls the layout and appearance of various widgets on the web page. Every shiny app requires a user interface which is controlled by ui script. 
          2. Server: It contains the instructions that your computer needs when the user interacts with the app.
Example - You must have seen or created interactive charts in Excel. To make a chart interactive, we use drop downs, list boxes or other user controls. When the user changes a value in a drop down, the chart gets updated.

The UI is responsible for creating these drop downs and list boxes and telling Shiny where to place the user controls and where to place the charts, while the server is responsible for creating the chart and the data in the table.

          Basic layout of UI

User Interface: A simple shiny UI consists of a fluidPage( ) which contains various panels. We can divide the display into two parts named sidebarPanel( ) and mainPanel( ), both of which are placed inside sidebarLayout( ).

          In the following image you can get an idea what is a title panel, sidebar panel and main panel in a shiny app.
          1. Title panel is a place where the title of the app is displayed.
          2. Sidebar panel is where special instructions or widgets (drop down / slider/ checkbox) are displayed to the app user. The sidebar panel appears on the left side of your app by default. You can move it to the right side by changing the position argument in the sidebar layout.
          3. Main panel is the area where all the outputs are generally placed.

          Shiny Elements

          Installing Shiny

          First we need to install shiny package by using command install.packages( ).
          install.packages("shiny")

          Loading Shiny

          Shiny package can be loaded by using library( ).
          library(shiny)

          The first simple shiny app with basic layout

          ui = fluidPage(sidebarLayout(sidebarPanel("Welcome to Shiny App"),
                                       mainPanel("This is main Panel")))
          server = function(input, output) {  }
          shinyApp(ui, server)
          Sidebar Panel and Main Panel 


          Guidelines for beginners to run a shiny app

Step 1 : shinyApp(ui, server) is an in-built function in the shiny package that runs the app with ui and server as the arguments. Select the code and run it. Once you do it successfully, you will find a message like Listening on http://127.0.0.1:4692 in the console.

Step 2 : To create your app you need to save the code as an app.R file; a Run App button will then be displayed in RStudio. Click on it and a new window with your app will appear.
          Shiny App

          Some more panels...

          There are some additional panels which can be added to sidebarPanel and mainPanel depending upon the layout and requirements of the app. Some of them which shall be explained later in this tutorial are:
          Shiny : Panels


          Adding a title to your App!

Using titlePanel( ) one can provide an appropriate title for the app. Note that a comma (,) is placed after titlePanel( ).
          ui =  fluidPage(titlePanel("My favorite novels:"),
                          sidebarLayout(sidebarPanel(),
                                        mainPanel()))
          server = function(input, output) {
          }
          shinyApp(ui, server)
          Title : Shiny App

          Using HTML tags in Shiny

Content can be added to the various panels. To change the appearance of the text - bold, italics, images, different fonts and colors, headings etc. - we can use various HTML helper functions in shiny. Many of them share the name of the corresponding HTML tag:
          HTML Tags

          Creating a hyperlink

A hyperlink can be created using a( ), where the first argument is the text displayed for the link and href contains the URL of the website we want to link to.
          ui =  fluidPage(sidebarLayout(sidebarPanel(
            a("Click here!", href = "http://www.listendata.com/")),
            mainPanel()))
          server = function(input, output) {}
          shinyApp(ui, server)

          Modifying the text presentation using HTML tags.

We create an app containing a list of favorite novels. You can refer to the above-mentioned table of HTML and shiny functions.
          ui =  fluidPage(titlePanel("My favorite novels:"),
                          sidebarLayout(sidebarPanel(
                            ("My favorite novels are:"),
                            br(),
                            h4(strong("The Kiterunner"), "a novel by", em("Khaled Hoseinni")),
                            h3(strong("Jane Eyre"), "a novel by", code("Charolette Bronte")),
                            strong(
                              "The diary of a young girl",
                              "by",
                              span("Anne Frank", style = "color:blue")
                            ),
                            div(strong("A thousand splendid suns"), "by Khaled Hoseinni", style = "color: red")
                          ),
                          mainPanel()))
          server = function(input, output) { }
          shinyApp(ui, server)

          Note that "Charolette Bronte"in the app would be written in a coded style;
          Difference between span( ) and div( ) span( ) wrote "Anne Frank" on the same line with "blue" color.  div( ) is similar to span( ), it is just that it creates the text in a different line. 
          Shiny : HTML Formating

          Introducing widgets

Various widgets are used in shiny to collect user input and drive the outputs. These widgets can be inserted in the ui function (anywhere in the main panel or sidebar panel).
          The most commonly used widgets are:
          Shiny Widgets

          The following image tells how various widgets appear on running an app.

          Shiny Widgets
• 'Buttons' can be created using the actionButton and submitButton widgets
          • Single check box, multiple check box and date inputs are created using checkboxInput, checkboxGroupInput and dateInput respectively.
          • Date range is created using dateRangeInput.

          Most commonly used widgets

All the widgets require an inputId which we use to retrieve their values; this inputId is not visible to the app user. The label is the heading for the widget and is what the user sees when the app is running. To understand this better, we create an app that collects details from the user through the widgets provided by shiny.

          HelpText and TextInput

          ui =  fluidPage(sidebarLayout(
            sidebarPanel(helpText("This questionnaire is subject to privacy."),
           
              textInput(inputId = "name", label = "Enter your name.")
            ),

            mainPanel()

          ))
          server = function(input, output) { }
          shinyApp(ui, server)
          helptext() and Text Input


          helpText( ) creates a disclaimer which will be displayed on the sidebarPanel.


          Adding SliderInput
          ui =  fluidPage(sidebarLayout(
            sidebarPanel(
              helpText("This questionnaire is subject to privacy."),
           
              textInput(inputId = "name", label = "Enter your name."),
              sliderInput(
                inputId = "age",
                label = "What is your age?",
                min = 1,
                max = 100,
                value = 25
              )
            ),

            mainPanel()

          ))
          server = function(input, output) { }
          shinyApp(ui, server)

In sliderInput we set the inputId to "age" and the label displayed in the app to "What is your age?". min = 1 and max = 100 give the minimum and maximum values for the slider, and value = 25 is the default selected value.

          SliderInput

          RadioButtons, NumericInput and CheckBoxInput

          ui =  fluidPage(sidebarLayout(
            sidebarPanel(
              radioButtons(
                inputId = "month",
                label = "In which month are you born?",
                choices = list(
                  "Jan - March" = 1,
                  "April - June" = 2,
                  "July - September" = 3,
                  "October - November" = 4
                ),
                selected = 1
              ),
           
              numericInput(
                inputId = "children_count",
                label = "How many children do you have?",
                value = 2,
                min = 0,
                max = 15
              ),
           
              selectInput(
                inputId  = "smoker",
                label = "Are you a smoker?",
                choices = c("Yes", "No", "I smoke rarely"),
                selected = "Yes"
              ),
           
              checkboxInput(
                inputId = "checkbox",
                label = "Are you a human?",
                value = FALSE
              ),
           
              checkboxGroupInput(
                inputId = "checkbox2",
                label = "2 + 2 = ?",
                choices = list(4, "Four", "IV", "None of the above")
              )
           
            ),

            mainPanel()

          ))
          server = function(input, output) { }
          shinyApp(ui, server)
          Other common Widgets

In the radioButtons and selectInput widgets we define the list of options in the choices parameter. The selected parameter gives the default selected option.

          Using fluidRow

The output of our above app looks a bit odd: everything sits in the sidebar panel and nothing in the main panel. We can make it more polished by removing the mainPanel and arranging the widgets in rows.

We use fluidRow for this. Keep in mind that the width of a row is 12 units, so if the widgets in a row require more than 12 units of width, the last widget will be pushed onto the next row.

          Let us create the above same app using fluidRow.

          Our app creates textInput, sliderInput and radioButtons in one row.

          ui =  fluidPage(
          helpText(
          "This questionnaire is subject to privacy. All the information obtained will be confidential."
          ),


          fluidRow(
          column(4,textInput(inputId = "name", label = "Enter your name.")),


          column(
          4, sliderInput(
          inputId = "age",
          label = "What is your age?",
          min = 1,
          max = 100,
          value = 25
          )
          ),

          column(
          4, radioButtons(
          inputId = "month",
          label = "In which month are you born?",
          choices = list(
          "Jan - March" = 1,
          "April - June" = 2,
          "July - September" = 3,
          "October - November" = 4
          ),

          selected = 1
          )
          )
          ),


          fluidRow(column(
          6, numericInput(
          inputId = "children_count",
          label = "How many children do you have?",
          value = 2,
          min = 0,
          max = 15
          )
          ) )
          )

          server = function(input, output) { }

          shinyApp(ui, server)

          fluidrow

          In column(6,...) 6 denotes the width required by one widget. To move to the next row another fluidRow command is used.

          Time to get some output!

So far we have only been providing inputs to our server function, but note that the server function also has output as an argument. Thus in the ui we can declare various outputs such as textOutput( ), verbatimTextOutput( ), tableOutput( ) and plotOutput( ).
These output functions are defined in the ui and are given an ID (key); using that key we refer to them in the server function.

In the server function we use render* functions to generate the outputs. Some of the most common ones are renderText( ), renderPrint( ), renderTable( ) and renderPlot( ).
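As a minimal sketch of this pairing (the IDs "name" and "greeting" are just illustrative), textOutput( ) declared in the ui is filled by renderText( ) in the server -
ui = fluidPage(textInput(inputId = "name", label = "Enter your name."),
               textOutput("greeting"))
server = function(input, output) {
  output$greeting = renderText({ paste("Hello,", input$name) })
}
shinyApp(ui, server)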

          Dealing with dates

          Using dateInput( ) we can select the dates from our calendar.

          The inputID is "bday", and the label which will be displayed in our app is "Enter your Date of Birth" and by default value is 1st January, 1985.

          The verbatimTextOutput is used in the ui and it will be referred in the server as "text".

          In the server function we use output$text to tell shiny that the following output will be displayed in verbatimTextOutput("text").

          The renderPrint( ) denotes our output to be printed and we get the date of birth printed using input$bday (Recall bday was the inputID in our dateInput).
          ui = fluidPage(dateInput(
            "bday",
            label = h3("Enter your Date of Birth"),
            value = "1985-01-01"
          ),
          verbatimTextOutput("text"))

          server = function(input, output) {
            output$text <- renderPrint({
              paste(input$bday, "was a blessed day!")
           
            })
          }
          shinyApp(ui, server)

          Viewing Data

Here we are using the iris dataset and we want to display the data only for the particular species selected by the user.

Using selectInput( ) we choose the species with inputId "specie". In the main panel we want our output as a table, thus we use tableOutput( ). In the server( ) function, output$data matches tableOutput("data") and renders a table using renderTable.

          ui =  fluidPage(titlePanel("Viewing data"),
                       
                          sidebarLayout(sidebarPanel(
                            selectInput(
                              inputId  = "specie",
                              label = "Select the flower specie",
                              choices = c("setosa", "versicolor", "virginica"),
                              selected = "setosa"
                            )
                          ),
                       
                          mainPanel(tableOutput("data"))))
          server = function(input, output) {
            output$data  = renderTable({
              iris[iris$Species == input$specie, ]
            })
          }
          shinyApp(ui, server)

          Reactivity in Shiny

Shiny apps use a functionality called reactivity, which means that a shiny app will be responsive to changes in inputs. It's similar to MS Excel, where changing one cell has an effect on the whole workbook.

          It is quite useful to define reactive( ) function when there are multiple widgets.

Suppose we have two widgets with inputIds 'a' and 'b', and two reactive functions, say 'X' and 'Y', one for each. Thus if the value in 'a' changes, reactive function 'X' will be updated, and if 'b' changes, reactive function 'Y' will be updated.

If a change is made only in 'a' while 'b' stays the same, then reactive function 'X' will be re-evaluated but 'Y' will be skipped. Hence it saves a lot of time and avoids unnecessary recomputation.
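A minimal sketch of this idea (the inputIds 'a' and 'b' and the calculations are illustrative) -
ui = fluidPage(numericInput("a", "Value of a", value = 1),
               numericInput("b", "Value of b", value = 1),
               verbatimTextOutput("out"))
server = function(input, output) {
  X = reactive({ input$a * 2 })   # re-evaluated only when 'a' changes
  Y = reactive({ input$b + 10 })  # re-evaluated only when 'b' changes
  output$out = renderPrint({ c(X(), Y()) })
}
shinyApp(ui, server)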


          Creating Plots

          Here we want to display the histogram by selecting any one variable in the iris dataset available in R.

          Using plotOutput in main panel we refer to the server function.

In the server function we are using reactive. It means that the value will be recalculated only when input$characterstic changes.

          The output$myplot matches to plotOutput("myplot") and hence draws the histogram using renderPlot( )
          ui =  fluidPage(titlePanel("Creating the plots!"),
                          sidebarLayout(sidebarPanel(
                            selectInput(
                              inputId  = "characterstic",
                              label = "Select the characterstic for which you want the histogram",
                              choices = c("Sepal Length", "Sepal Width" ,
                                          "Petal Length", "Petal Width"),
                              selected = "Sepal Length"
                            )
                          ),
                       
                          mainPanel(plotOutput("myplot"))))
          server = function(input, output) {
            char = reactive({
              switch(
                input$characterstic,
                "Sepal Length" = "Sepal.Length",
                "Sepal Width" = "Sepal.Width",
                "Petal Length" = "Petal.Length",
                "Petal Width" = "Petal.Width"
              )
            })

            output$myplot  = renderPlot({
              hist(
                iris[, char()],
                xlab = input$characterstic,
                main = paste("Histogram of", input$characterstic)
              )
            })

          }
          shinyApp(ui, server)

          Well Panel and Vertical Layout

          Vertical Layout creates a layout in which each element passed in the UI will appear in its own line.
          WellPanel creates a panel with a border and a grey background.

          In the following example we are trying to create an app where we draw the QQ plot for random sample from normal distribution.

          Using the sliders we define the size of the sample. By default it is 500.
          ui = fluidPage(titlePanel("Vertical layout"),
          verticalLayout(wellPanel(
          sliderInput("n", "QQ Plot of normal distribution", 100, 1000,
          value = 500)
          ),
          plotOutput("plot1")))
          server = function(input, output) {
          output$plot1 = renderPlot({
          qqnorm(rnorm(input$n))
          })
          }
          shinyApp(ui, server)

          Creating tabs

          We can create various tabs in shiny where some particular output is displayed in a particular tab. This can be done using tabsetPanel.

          We are creating an app in which the user selects the columns for which he wants the summary and the boxplot.

In the main panel we are creating the tabs. Each tab has a label and the output to be shown.
For instance, the first tab's label is 'Summary' and it will show the verbatimTextOutput, while the other tab has the label 'Boxplot' with the output plotted using plotOutput.
          ui =  fluidPage(titlePanel("Creating the tabs!"),
          sidebarLayout(sidebarPanel(
          radioButtons(
          inputId = "characterstic",
          label = "Select the characterstic for which you want the summary",
          choices = c(
          "Mileage" = "mpg",
          "Displacement" = "disp",
          "Horsepower" = "hp",
          "Rear axle ratio" = "drat",
          "Weight" = "wt"
          ),
          selected = "mpg"
          )
          ),
          mainPanel(tabsetPanel(
          tabPanel("Summary", verbatimTextOutput("mysummary")),
          tabPanel("Boxplot", plotOutput("myplot"))
          ))))

          server = function(input, output) {
          output$mysummary = renderPrint({
          summary(mtcars[, input$characterstic])
          })

          output$myplot = renderPlot({
          boxplot(mtcars[, input$characterstic], main = "Boxplot")
          })
          }
          shinyApp(ui, server)
          Creating tabs in Shiny

          Some more plots...

In this example we are using the VADeaths data. We first select the area (Rural or Urban) and gender (Male or Female) and then plot a barplot showing the death rates for the different age groups.
          ui = fluidPage(
          titlePanel("Death rates by Gender and area"),

          sidebarLayout(
          sidebarPanel(
          selectInput("area", "Choose the area",
          choices = c("Rural", "Urban")),
          br(),
          selectInput("gender", "Choose the gender", choices = c("Male", "Female"))
          ),

          mainPanel(plotOutput("deathrate"))

          )
          )

          server = function(input, output) {
          output$deathrate <- renderPlot({
          a = barplot(VADeaths[, paste(input$area, input$gender)],
          main = "Death Rates in Virginia",
          xlab = "Age Groups")
          text(a,
          y = VADeaths[, paste(input$area, input$gender)] - 2,
          labels = VADeaths[, paste(input$area, input$gender)],
          col = "black")
          })
          }

          shinyApp(ui, server)

          Conditional Panels

Suppose you wish to create outputs only when a particular option is selected or when a particular condition is satisfied. For such a purpose we can use conditional panels, where we define the condition as a JavaScript expression and then define the output or widget that should appear when the condition is met. A simple example of a conditional panel is given below: first we ask how many hours a person sleeps; if someone sleeps for less than 7 hours then he needs more sleep, and if someone sleeps for 9 or more hours then he sleeps a lot.
          ui = fluidPage(
          titlePanel("Conditional Panels"),
          sidebarPanel(
          numericInput("num","How many hours do you sleep?",min = 1,max = 24,value = 6)),
          mainPanel(
          conditionalPanel("input.num < 7","You need more sleep"),
          conditionalPanel("input.num >= 9","You sleep a lot")
          )
          )
          server = function(input,output){

          }
          shinyApp(ui,server)

          Note: The first argument in conditional panel is a JavaScript expression thus we write input.num and not input$num to access the input value of sleeping hours.


          Conditional Panel : Example 2

In the following example we are using the income.csv file. First we ask which variable the user wants to work with and save that column in 'a' using reactive( ). Then, using uiOutput, we insert a widget asking whether the user wants the summary, to view the data, or the histogram. Based on the option selected by the user, we create conditional panels for the summary, viewing the data and plotting the histogram.
          income = read.csv("income.csv", stringsAsFactors = FALSE)

          ui = fluidPage(titlePanel(em("Conditional panels")),
          sidebarLayout(
          sidebarPanel(
          selectInput(
          "Choice1",
          "Select the variable",
          choices = colnames(income)[3:16],
          selected = "Y2008"
          ),
          uiOutput("Out1")
          ),
          mainPanel(
          conditionalPanel("input.Choice2 === 'Summary'", verbatimTextOutput("Out2")),
          conditionalPanel("input.Choice2 === 'View data'", tableOutput("Out3")),
          conditionalPanel("input.Choice2 === 'Histogram'", plotOutput("Out4"))
          )
          ))

          server = function(input, output) {
          a = reactive({
          income[, colnames(income) == input$Choice1]
          })
          output$Out1 = renderUI({
          radioButtons(
          "Choice2",
          "What do you want to do?",
          choices = c("Summary", "View data", "Histogram"),
          selected = "Summary"
          )
          })
          output$Out2 = renderPrint({
          summary(a())
          })
          output$Out3 = renderTable({
          return(a())
          })
          output$Out4 = renderPlot({
          return(hist(a(), main = "Histogram", xlab = input$Choice1))
          })
          }
          shinyApp(ui = ui, server = server)

          Downloading Data

Shiny allows users to download datasets. This can be done by using downloadButton in the UI and downloadHandler in the server. First we select the data using radioButtons and save the chosen dataset using reactive( ) in the server. Then in the UI we create a downloadButton where the first argument is the inputId and the other one is the label. downloadHandler needs two arguments: filename and content. In 'filename' we specify the name under which the file should be saved, and in 'content' we write the dataset into a csv file.
          ui =  fluidPage(titlePanel("Downloading the data"),
          sidebarLayout(sidebarPanel(
          radioButtons(
          "data",
          "Choose a dataset to be downloaded",
          choices = list("airquality", "iris", "sleep"),
          selected = "airquality"
          ),
          downloadButton("down", label = "Download the data.")
          ),
          mainPanel()))

          server = function(input, output) {

          # Reactive value for selected dataset ----
          datasetInput = reactive({
          switch(input$data,
          "airquality" = airquality,
          "iris" = iris,
          "sleep" = sleep)
          })

          # Downloadable csv of selected dataset ----
          output$down = downloadHandler(
          filename = function() {
          paste(input$data, ".csv", sep = "")
          },
          content = function(file) {
          write.csv(datasetInput(), file, row.names = FALSE)
          }
          )

          }
          shinyApp(ui, server)

          Uploading a file

So far we were dealing with datasets built into R. To allow users to upload their own datasets and analyze them, the fileInput function in the UI lets users upload a file. Here we are creating an app to upload files. In fileInput, multiple = F denotes that only one file can be uploaded by the user and accept = ".csv" restricts the type of file which can be uploaded. Then we ask the user whether he wants to view the head of the data or the entire dataset, which is then displayed using renderTable.
          library(shiny)
          ui = fluidPage(titlePanel("Uploading file in Shiny"),
          sidebarLayout(
          sidebarPanel(
          fileInput(
          "myfile",
          "Choose CSV File",
          multiple = F,
          accept = ".csv"
          ),

          checkboxInput("header", "Header", TRUE),

          radioButtons(
          "choice",
          "Display",
          choices = c(Head = "head",
          All = "all"),
          selected = "head"
          )
          ),

          mainPanel(tableOutput("contents"))

          ))
          server = function(input, output) {
          output$contents = renderTable({
          req(input$myfile)

          data = read.csv(input$myfile$datapath,
          header = input$header)

          if (input$choice == "head") {
          return(head(data))
          }
          else {
          return(data)
          }

          })
          }
          shinyApp(ui, server)

          Sharing the app with others

Method I : Sharing the R code: You can share your app with others by sharing your R code. To make it work, users need to have R installed on their system.

Method II : Share your app as a web page: You need to create an account on shinyapps.io and follow the instructions below to share your app.R file.

          Deploying shiny app on shinyapps.io

          First you need to have an account on shinyapps.io.

          Import library rsconnect by using
          library(rsconnect) 
          Then you need to configure the rsconnect package to your account using the code below -
          rsconnect::setAccountInfo(name="<ACCOUNT>", token="<TOKEN>", secret="<SECRET>")
          To deploy the app you can write:
rsconnect::deployApp('Folder path in which your app.R file is saved')
As a result, a web page with your app's link will open.

          Shiny App for Normality

          In this app the user first selects the variable for which he wants to test the normality. Then he is asked whether he wants to check normality via plots or some statistical tests. If the user selects plots then he will be asked whether he wants a Histogram or a QQ-Plot. The link for the shiny app is:  My Shiny App
          ui =  fluidPage(titlePanel("My first App"),
          sidebarLayout(
          sidebarPanel(
          selectInput(
          "varchoice",
          "Choose the variable for which you want to check the normality",
          choices = c("mpg", "disp", "drat", "qsec", "hp", "wt")
          ),
          radioButtons(
          "normchoice",
          "How do you want to check the normality?",
          choices = c("Plots", "Tests"),
          selected = "Plots"
          ),
          conditionalPanel(
          "input.normchoice == 'Plots'",
          selectInput(
          "plotchoice",
          "Choose which plot you want?",
          choices = c("Histogram", "QQ-Plot")
          )
          )


          ),
          mainPanel(
          conditionalPanel("input.normchoice == 'Plots'", plotOutput("myplot")),
          conditionalPanel("input.normchoice == 'Tests'", verbatimTextOutput("mytest"))
          )
          ))
          server = function(input, output) {
          var = reactive({
          mtcars[, input$varchoice]

          })
          output$myplot = renderPlot({
          if (input$plotchoice == "Histogram")
          return(hist(var(), main = "Histogram", xlab = input$varchoice))
          else
          return(qqnorm(var(), main = paste("QQ plot of", input$varchoice)))
          })
          output$mytest = renderPrint({
          shapiro.test(var())
          })
          }

          shinyApp(ui, server)
Following is a clip of how the app will look when the link is opened:
          My First Shiny App

          Web Scraping Matrimonial Website with R

          In this tutorial, we will cover how to extract information from a matrimonial website using R.  We will do web scraping which is a process of converting data available in unstructured format on the website to structured format which can be further used for analysis.

We will use an R package called rvest, which was created by Hadley Wickham. This package simplifies the process of scraping web pages.
Web Scraping in R

          Install the required packages

          To download and install the rvest package, run the following command. We will also use dplyr which is useful for data manipulation tasks.
          install.packages("rvest")
          install.packages("dplyr")

          Load the required Libraries

To load these libraries, run the code below.
          library(rvest)
          library(dplyr)

          Scrape Information from Matrimonial Website

          First we need to understand the structure of URL. See the URLs below.
          https://www.jeevansathi.com/punjabi-brides-girls
          https://www.jeevansathi.com/punjabi-grooms-boys

The first URL takes you to the webpage showing girls' profiles from the Punjabi community, whereas the second URL shows boys' profiles from the Punjabi community.

          We need to split the main URL into different elements so that we can access it. 
          Main_URL = Static_URL + Mother_Tongue + Brides_Grooms
Check out the following R code to see how the main URL is prepared. In the code, you need to provide the following details -
          1. Whether you are looking for girls'/boys' profiles. Type bride to see girls' profiles. Enter groom to check out boys' profiles.
          2. Select Mother Tongue. For example, punjabi, tamil etc.
          # Looking for bride/groom
          Bride_Groom = "bride"
          # Possible Values : bride, groom

          # Select Mother Tongue
          Mother_Tongue = "punjabi"
          # Possible Values
          # punjabi
          # tamil
          # bengali
          # telugu
          # kannada
          # marathi

          # URL
          if (tolower(Bride_Groom) == "bride") {
          html = paste0('https://www.jeevansathi.com/',tolower(Mother_Tongue),'-brides-girls')
          } else {
          html = paste0('https://www.jeevansathi.com/',tolower(Mother_Tongue),'-grooms-boys')
          }
          See the output :
          [1] "https://www.jeevansathi.com/punjabi-brides-girls"

          Extract Profile IDs

          First you need to select parts of an html document using css selectors: html_nodes(). Use SelectorGadget which is a chrome extension available for free. It is the easiest and quickest way to find out which selector pulls the data that you are interested in.

          How to use SelectorGadget : Click on a page element that you would like your selector to match (it will turn green). It will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
          text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
          profileIDs = data.frame(ID = text)
                   ID
          1 ZARX0345
          2 ZZWX5573
          3 ZWVT2173
          4 ZAYZ6100
          5 ZYTS6885
          6 ZXYV9849
          7 TRZ8475
          8 VSA7284
          9 ZXTU1965
          10 ZZSA6877
          11 ZZSZ6545
          12 ZYSW4809
          13 ZARW2199
          14 ZRSY0723
          15 ZXAT2801
          16 ZYXX8818
          17 ZAWA8567
          18 WXZ2147
          19 ZVRT8875
          20 ZWWR9533
          21 ZYXW4043
The basic functions in rvest are very user-friendly and robust. Explanations of these functions are listed below -
1. read_html() : creates an html document from a URL
2. html_nodes() : extracts pieces out of HTML documents
3. html_nodes(".class") : selects nodes based on their CSS class
4. html_nodes("#class") : selects nodes (e.g. <div>, <span>, <pre>) based on their id attribute
5. html_text() : extracts only the text from an HTML tag
6. html_attr() : extracts the contents of a single attribute

          Difference between .class and #class

1. .class targets the following element:
<div class="class"></div>

2. #class targets the following element:
<div id="class"></div>
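As an illustrative sketch (the HTML snippet and selectors below are hypothetical, not taken from the matrimonial site) -
library(rvest)
page = read_html("<div class='profile'><a id='name' href='/p/1'>Asha</a></div>")
page %>% html_nodes(".profile") %>% html_text()    # select by class, returns "Asha"
page %>% html_nodes("#name") %>% html_attr("href") # select by id, extract attribute, returns "/p/1"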

          Get Detailed Information of Profiles

          The following program performs the following tasks -
          1. Loop through profile IDs
          2. Pull information about Age, Height, Qualification etc.
          3. Extract details about appearance
          4. Fetch 'About Me' section of profiles
          # Get Detailed Information
          finaldf = data.frame()
          for (i in 1:length(profileIDs$ID)){
          ID = profileIDs[i,1]
          link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)
          FormattedInfo = data.frame(t(read_html(link) %>% html_nodes(".textTru li") %>%
          html_text()))
          # Final Table
          FormattedInfo = data.frame(ProfileID = ID,
          Description = read_html(link) %>%
          html_nodes("#myinfoView") %>%
          html_text(),
          Appearance = read_html(link) %>%
          html_nodes("#section-lifestyle #appearanceView") %>%
          html_text(),
          FormattedInfo)

          finaldf = bind_rows(finaldf, FormattedInfo)
          }

          # Assign Variable Names
          names(finaldf) = c("ProfileID", "Description", "Appearance", "Age_Height", "Qualification", "Location", "Profession", "Mother Tongue", "Salary", "Religion", "Status", "Has_Children")
          Web Scraping Output
          Web Scraping Output PartII


          Download Display Pic

          To download display pic, you first need to fetch image URL of profile and then hit download.file( ) function to download it. In the script below, you need to provide a profile ID.
          # Download Profile Pic of a particular Profile
          ID = "SWU6683"
          text3 = read_html(html) %>% html_nodes(".vtop") %>% html_attr('src')
          pic = data.frame(cbind(profileIDs, URL = text3[!is.na(text3)]))
          download.file(as.character(pic$URL[match(ID, pic$ID)]), "match.jpg", mode = "wb")
          # File saved as match.jpg

          Disclaimer
We have accessed only publicly available data which does not require login or registration. The purpose is not to cause any damage or to copy the content from the website.
          Other Functions of rvest( )
You can extract, modify and submit forms with html_form(), set_values() and submit_form(). Refer to the case study below -

You can collect google search results by submitting the google search form with a search term. You need to supply the search term. Here, I entered 'Datascience' as the search term.
          library(rvest)
          url       = "http://www.google.com"
          pgsession = html_session(url)           
          pgform    = html_form(pgsession)[[1]]

          # Set search term
          filled_form = set_values(pgform, q="Datascience")
          session = submit_form(pgsession,filled_form)

          # look for headings of first page
          session %>% html_nodes(".g .r a") %>% html_text()
           [1] "Data science - Wikipedia"
          [2] "Data Science Courses | Coursera"
          [3] "Data Science | edX"
          [4] "Data science - Wikipedia"
          [5] "DataScience.com | Enterprise Data Science Platform Provider"
          [6] "Top Data Science Courses Online - Updated February 2018 - Udemy"
          [7] "Data Science vs. Big Data vs. Data Analytics - Simplilearn"
          [8] "What Is Data Science? What is a Data Scientist? What is Analytics?"
          [9] "Online Data Science Courses | Microsoft Professional Program"
          [10] "News for Datascience"
          [11] "Data Science Course - Cognitive Class"

          15 Types of Regression you should know

          Regression techniques are one of the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only 2-3 types of regression which are commonly used in real world. They are linear and logistic regression. But the fact is there are more than 10 types of regression algorithms designed for various types of analysis. Each type has its own significance. Every analyst must know which form of regression to use depending on type of data and distribution.

          Table of Contents
          1. What is Regression Analysis?
          2. Terminologies related to Regression
          3. Types of Regressions
          • Linear Regression
          • Polynomial Regression
          • Logistic Regression
          • Quantile Regression
          • Ridge Regression
          • Lasso Regression
          • ElasticNet Regression
          • Principal Component Regression
          • Partial Least Square Regression
          • Support Vector Regression
          • Ordinal Regression
          • Poisson Regression
          • Negative Binomial Regression
          • Quasi-Poisson Regression
          • Cox Regression
4. How to choose the correct Regression Model?
5. Regression Analysis Simplified


          What is Regression Analysis?

Lets take a simple example : Suppose your manager asked you to predict annual sales. There can be hundreds of factors (drivers) that affect sales. In this case, sales is your dependent variable and the factors affecting sales are the independent variables. Regression analysis would help you to solve this problem.
          In simple words, regression analysis is used to model the relationship between a dependent variable and one or more independent variables.

          It helps us to answer the following questions -
1. Which of the drivers have a significant impact on sales?
2. Which is the most important driver of sales?
3. How do the drivers interact with each other?
4. What would be the annual sales next year?

          Terminologies related to regression analysis

          1. Outliers
Suppose there is an observation in the dataset which has a very high or very low value as compared to the other observations, i.e. it does not seem to belong to the population; such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because it often hampers the results we get.

          2. Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems in ranking variables based on their importance and makes it difficult to select the most important independent variable (factor).

          3. Heteroscedasticity
When the variability of the dependent variable is not equal across values of an independent variable, it is called heteroscedasticity. Example - As one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.
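A quick simulated sketch of this pattern (the numbers are purely illustrative) -
set.seed(1)
income = runif(200, 1000, 10000)
food_spend = 300 + 0.1 * income + rnorm(200, sd = 0.05 * income)  # spread grows with income
plot(income, food_spend, main = "Variability of food spend increases with income")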

          4. Underfitting and Overfitting
When we use unnecessary explanatory variables, it might lead to overfitting. Overfitting means that our algorithm works well on the training set but is unable to perform well on the test set. It is also known as the problem of high variance.

When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. It is also known as the problem of high bias.

In the following diagram we can see that fitting a linear regression (the straight line in fig 1) would underfit the data, i.e. it will lead to large errors even on the training set. The polynomial fit in fig 2 is balanced, i.e. such a fit can work well on both the training and test sets, while the fit in fig 3 will lead to low errors on the training set but will not work well on the test set.
Regression : Underfitting and Overfitting

          Types of Regression

          Every regression technique has some assumptions attached to it which we need to meet before running analysis. These techniques differ in terms of type of dependent and independent variables and distribution.

          1. Linear Regression

It is the simplest form of regression. It is a technique in which the dependent variable is continuous in nature. The relationship between the dependent variable and independent variables is assumed to be linear in nature. We can observe in the plot below a roughly linear relationship between the mileage and displacement of cars. The green points are the actual observations while the black fitted line is the line of regression.

Regression Analysis

          When you have only 1 independent variable and 1 dependent variable, it is called simple linear regression.
          When you have more than 1 independent variable and 1 dependent variable, it is called Multiple linear regression.
          The equation of multiple linear regression is listed below -

y = β0 + β1X1 + β2X2 + ..... + βkXk + ε
Here 'y' is the dependent variable to be estimated, the X's are the independent variables, ε is the error term and the βi's are the regression coefficients.

          Assumptions of linear regression: 
          1. There must be a linear relation between independent and dependent variables. 
          2. There should not be any outliers present. 
          3. No heteroscedasticity 
          4. Sample observations should be independent. 
          5. Error terms should be normally distributed with mean 0 and constant variance. 
          6. Absence of multicollinearity and auto-correlation.

Estimating the parameters
To estimate the regression coefficients βi's we use the principle of least squares, which is to minimize the sum of squares due to the error terms, i.e. minimize
Σ (yi - β0 - β1X1i - ..... - βkXki)²
On solving the above minimization problem mathematically we obtain the regression coefficients as:
β̂ = (X'X)⁻¹X'y

          Interpretation of regression coefficients
          Let us consider an example where the dependent variable is marks obtained by a student and explanatory variables are number of hours studied and no. of classes attended. Suppose on fitting linear regression we got the linear regression as:
          Marks obtained = 5 + 2 (no. of hours studied) + 0.5(no. of classes attended)
Thus we have the regression coefficients 2 and 0.5, which can be interpreted as:
          1. If no. of hours studied and no. of classes are 0 then the student will obtain 5 marks.
          2. Keeping no. of classes attended constant, if student studies for one hour more then he will score 2 more marks in the examination. 
          3. Similarly keeping no. of hours studied constant, if student attends one more class then he will attain 0.5 marks more.

          Linear Regression in R
          We consider the swiss data set for carrying out linear regression in R. We use lm() function in the base package. We try to estimate Fertility with the help of other variables.
          library(datasets)
          model = lm(Fertility ~ .,data = swiss)
          lm_coeff = model$coefficients
          lm_coeff
          summary(model)

          The output we get is:

          > lm_coeff
               (Intercept)      Agriculture      Examination        Education         Catholic 
          66.9151817 -0.1721140 -0.2580082 -0.8709401 0.1041153
          Infant.Mortality
          1.0770481
          > summary(model)

          Call:
          lm(formula = Fertility ~ ., data = swiss)

          Residuals:
          Min 1Q Median 3Q Max
          -15.2743 -5.2617 0.5032 4.1198 15.3213

          Coefficients:
          Estimate Std. Error t value Pr(>|t|)
          (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
          Agriculture -0.17211 0.07030 -2.448 0.01873 *
          Examination -0.25801 0.25388 -1.016 0.31546
          Education -0.87094 0.18303 -4.758 2.43e-05 ***
          Catholic 0.10412 0.03526 2.953 0.00519 **
          Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
          ---
          Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

          Residual standard error: 7.165 on 41 degrees of freedom
          Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
          F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
          Hence we can see that 70% of the variation in Fertility rate can be explained via linear regression.
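Once the model is fitted, we can also score data with predict(); the short sketch below simply compares actual and predicted Fertility for the swiss data used above -
pred = predict(model, newdata = swiss)
head(cbind(Actual = swiss$Fertility, Predicted = pred))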

          2. Polynomial Regression

It is a technique to fit a nonlinear relationship by taking polynomial functions of the independent variable.
In the figure given below, you can see the red curve fits the data better than the green curve. Hence, in situations where the relation between the dependent and independent variable seems to be non-linear, we can deploy Polynomial Regression Models.
Thus a polynomial of degree k in one variable is written as:
y = a + b1x + b2x^2 + ..... + bkx^k
Here we can create new features like
X1 = x, X2 = x^2, ....., Xk = x^k
and can fit linear regression in a similar manner.

In case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e.
X3 = X1 · X2
          Disclaimer: It is to be kept in mind that creating unnecessary extra features or fitting polynomials of higher degree may lead to overfitting.

          Polynomial regression in R:
          We are using poly.csv data for fitting polynomial regression where we try to estimate the Prices of the house given their area.

          Firstly we read the data using read.csv( ) and divide it into the dependent and independent variable
          data = read.csv("poly.csv")
          x = data$Area
          y = data$Price
          In order to compare the results of linear and polynomial regression, firstly we fit linear regression:
          model1 = lm(y ~x)
          model1$fit
          model1$coeff

          The coefficients and predicted values obtained are:
          > model1$fit
          1 2 3 4 5 6 7 8 9 10
          169.0995 178.9081 188.7167 218.1424 223.0467 266.6949 291.7068 296.6111 316.2282 335.8454
          > model1$coeff
          (Intercept) x
          120.05663769 0.09808581
We create a data frame where the new variables are x and x squared.

          new_x = cbind(x,x^2)

          new_x
                   x        
          [1,] 500 250000
          [2,] 600 360000
          [3,] 700 490000
          [4,] 1000 1000000
          [5,] 1050 1102500
          [6,] 1495 2235025
          [7,] 1750 3062500
          [8,] 1800 3240000
          [9,] 2000 4000000
          [10,] 2200 4840000
          Now we fit usual OLS to the new data:
          model2 = lm(y~new_x)
          model2$fit
          model2$coeff

          The fitted values and regression coefficients of polynomial regression are:
          > model2$fit
          1 2 3 4 5 6 7 8 9 10
          122.5388 153.9997 182.6550 251.7872 260.8543 310.6514 314.1467 312.6928 299.8631 275.8110
          > model2$coeff
          (Intercept) new_xx new_x
          -7.684980e+01 4.689175e-01 -1.402805e-04

Using the ggplot2 package, we create a plot to compare the curves fitted by linear and polynomial regression.
          library(ggplot2)
          ggplot(data = data) + geom_point(aes(x = Area,y = Price)) +
          geom_line(aes(x = Area,y = model1$fit),color = "red") +
          geom_line(aes(x = Area,y = model2$fit),color = "blue") +
          theme(panel.background = element_blank())



          3. Logistic Regression

          In logistic regression, the dependent variable is binary in nature (having two categories). Independent variables can be continuous or binary. In multinomial logistic regression, you can have more than two categories in your dependent variable.

Here my model is:
P(Y = 1) = exp(β0 + β1X1 + ..... + βkXk) / (1 + exp(β0 + β1X1 + ..... + βkXk))
which is equivalent to modelling the log-odds: log[P(Y = 1) / (1 - P(Y = 1))] = β0 + β1X1 + ..... + βkXk

          Why don't we use linear regression in this case?
          • The homoscedasticity assumption is violated.
          • Errors are not normally distributed
          • y follows binomial distribution and hence is not normal.

          Examples
• HR Analytics: IT firms recruit a large number of people, but one of the problems they encounter is that many candidates do not join after accepting the job offer. This results in cost over-runs because they have to repeat the entire process again. Now when you get an application, can you actually predict whether that applicant is likely to join the organization (Binary Outcome - Join / Not Join)?

          • Elections: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign and the amount of time spent campaigning negatively.

          Predicting the category of dependent variable for a given vector X of independent variables

Through logistic regression we have
P(Yi = 1) = exp(β0 + β1X1i + ..... + βkXki) / (1 + exp(β0 + β1X1i + ..... + βkXki))

Thus we choose a cut-off probability, say 'p', and if P(Yi = 1) > p then we can say that Yi belongs to class 1, otherwise class 0.

          Interpreting the logistic regression coefficients (Concept of Odds Ratio)
If we take the exponential of a coefficient, we get the odds ratio for that explanatory variable. Suppose the odds ratio is equal to two; then the odds of the event are 2 times greater than the odds of the non-event. Suppose the dependent variable is customer attrition (whether the customer will close the relationship with the company) and the independent variable is citizenship status (National / Expat). If the odds ratio for expats is 3, then the odds of an expat attriting are 3 times greater than the odds of a national attriting.

          Logistic Regression in R:
In this case, we are trying to estimate whether a person will have cancer depending on whether he smokes or not.


          We fit logistic regression with glm( )  function and we set family = "binomial"
          model <- glm(Lung.Cancer..Y.~Smoking..X.,data = data, family = "binomial")
          The predicted probabilities are given by:
          #Predicted Probablities

          model$fitted.values
                  1         2         3         4         5         6         7         8         9 
          0.4545455 0.4545455 0.6428571 0.6428571 0.4545455 0.4545455 0.4545455 0.4545455 0.6428571
          10 11 12 13 14 15 16 17 18
          0.6428571 0.4545455 0.4545455 0.6428571 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571
          19 20 21 22 23 24 25
          0.6428571 0.4545455 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571
          Predicting whether the person will have cancer or not when we choose the cut off probability to be 0.5
          data$prediction <- model$fitted.values>0.5
          > data$prediction
          [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
          [16] FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
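To connect this back to the odds ratio discussion above, exponentiating the fitted coefficients of the model gives the odds ratios -
# Odds ratios of the logistic model fitted above
exp(coef(model))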

          4. Quantile Regression

Quantile regression is an extension of linear regression, and we generally use it when outliers, high skewness or heteroscedasticity exist in the data.

          In linear regression, we predict the mean of the dependent variable for given independent variables. Since mean does not describe the whole distribution, so modeling the mean is not a full description of a relationship between dependent and independent variables. So we can use quantile regression which predicts a quantile (or percentile) for given independent variables.
          The term “quantile” is the same as “percentile”

Basic Idea of Quantile Regression: In quantile regression we try to estimate the quantile of the dependent variable given the values of the X's. Note that the dependent variable should be continuous.

The quantile regression model:
For the qth quantile we have the following regression model:
Q_y(q) = β0(q) + β1(q)X1 + ..... + βk(q)Xk
This seems similar to the linear regression model, but here the objective function we consider to minimize is:
Σ q|yi - xi'β|  (over observations with yi ≥ xi'β)  +  Σ (1 - q)|yi - xi'β|  (over observations with yi < xi'β)
where q is the qth quantile.

If q = 0.5, i.e. if we are interested in the median, then it becomes median regression (or least absolute deviation regression), and substituting q = 0.5 in the above equation we get the objective function as:
0.5 Σ |yi - xi'β|
          Interpreting the coefficients in quantile regression:
Suppose the regression equation for the 25th quantile is:
y = 5.2333 + 700.823 x

It means that for a one unit increase in x, the 25th quantile of y is estimated to increase by 700.823 units.
          Advantages of Quantile over Linear Regression
          • Quite beneficial when heteroscedasticity is present in the data.
          • Robust to outliers
          • Distribution of dependent variable can be described via various quantiles.
          • It is more useful than linear regression when the data is skewed.

          Disclaimer on using quantile regression!
It is to be kept in mind that the coefficients which we get in quantile regression for a particular quantile should differ significantly from those we obtain from linear regression. If that is not the case, then our usage of quantile regression isn't justifiable. This can be checked by comparing the confidence intervals of the regression coefficients obtained from both regressions.

          Quantile Regression in R
          We need to install quantreg package in order to carry out quantile regression.

          install.packages("quantreg")
          library(quantreg)

Using the rq function we estimate the 25th quantile of the Fertility rate in the swiss data. For this we set tau = 0.25.

          model1 = rq(Fertility~.,data = swiss,tau = 0.25)
          summary(model1)
          tau: [1] 0.25

          Coefficients:
          coefficients lower bd upper bd
          (Intercept) 76.63132 2.12518 93.99111
          Agriculture -0.18242 -0.44407 0.10603
          Examination -0.53411 -0.91580 0.63449
          Education -0.82689 -1.25865 -0.50734
          Catholic 0.06116 0.00420 0.22848
          Infant.Mortality 0.69341 -0.10562 2.36095

          Setting tau = 0.5 we run the median regression.
          model2 = rq(Fertility~.,data = swiss,tau = 0.5)
          summary(model2)

          tau: [1] 0.5

          Coefficients:
          coefficients lower bd upper bd
          (Intercept) 63.49087 38.04597 87.66320
          Agriculture -0.20222 -0.32091 -0.05780
          Examination -0.45678 -1.04305 0.34613
          Education -0.79138 -1.25182 -0.06436
          Catholic 0.10385 0.01947 0.15534
          Infant.Mortality 1.45550 0.87146 2.21101

          We can run quantile regression for multiple quantiles in a single plot.
          model3 = rq(Fertility~.,data = swiss, tau = seq(0.05,0.95,by = 0.05))
          quantplot = summary(model3)
          quantplot

          We can check whether our quantile regression results differ from the OLS results using plots.

          plot(quantplot)
          We get the following plot:

Various quantiles are depicted on the X axis. The red central line denotes the estimates of the OLS coefficients and the dotted red lines are the confidence intervals around those OLS coefficients for various quantiles. The black dotted line shows the quantile regression estimates and the gray area is their confidence interval for various quantiles. We can see that for all the variables both regression estimates coincide for most of the quantiles, hence our use of quantile regression is not justifiable for such quantiles. In other words, we want the red and gray regions to overlap as little as possible to justify our use of quantile regression.

          5. Ridge Regression

          It's important to understand the concept of regularization before jumping to ridge regression.

          1. Regularization

Regularization helps to solve the overfitting problem, which means the model performs well on training data but poorly on validation (test) data. Regularization solves this problem by adding a penalty term to the objective function and controlling the model complexity using that penalty term.

          Regularization is generally useful in the following situations:
          1. Large number of variables
2. Low ratio of number of observations to number of variables
          3. High Multi-Collinearity

          2. L1 Loss function or L1 Regularization

In L1 regularization we minimize the objective function by adding a penalty term equal to the sum of the absolute values of the coefficients. This is also known as the least absolute deviations method. Lasso Regression makes use of L1 regularization.

          3. L2 Loss function or L2 Regularization

In L2 regularization we minimize the objective function by adding a penalty term equal to the sum of the squares of the coefficients. Ridge Regression or shrinkage regression makes use of L2 regularization.

          In general, L2 performs better than L1 regularization. L2 is efficient in terms of computation. There is one area where L1 is considered as a preferred option over L2. L1 has in-built feature selection for sparse feature spaces.  For example, you are predicting whether a person is having a brain tumor using more than 20,000 genetic markers (features). It is known that the vast majority of genes have little or no effect on the presence or severity of most diseases.

In the linear regression objective function we try to minimize the sum of squares of errors. In ridge regression (also known as shrinkage regression) we add a constraint on the sum of squares of the regression coefficients. Thus in ridge regression our objective function is:
Minimize  Σ (yi - ŷi)²  +  λ Σ βj²
Here λ is the regularization parameter, which is a non-negative number. Here we do not assume normality in the error terms.

          Very Important Note: 
          We do not regularize the intercept term. The constraint is just on the sum of squares of regression coefficients of X's.
          We can see that ridge regression makes use of L2 regularization.


On solving the above objective function we can get the estimates of β as:
β̂ = (X'X + λI)⁻¹X'y

          How can we choose the regularization parameter λ?

If we choose lambda = 0 then we get back the usual OLS estimates. If lambda is chosen to be very large then it will lead to underfitting. Thus it is highly important to determine a desirable value of lambda. To tackle this issue, we plot the parameter estimates against different values of lambda and select the minimum value of λ after which the parameters tend to stabilize.

          R code for Ridge Regression

          Considering the swiss data set, we create two different datasets, one containing dependent variable and other containing independent variables.
          X = swiss[,-1]
          y = swiss[,1]

          We need to load glmnet library to carry out ridge regression.
          library(glmnet)
Using the cv.glmnet( ) function we can do cross validation. Here alpha = 0 means we are carrying out ridge regression, and lambda is a sequence of values of lambda which will be used for cross validation.
          set.seed(123) #Setting the seed to get similar results.
          model = cv.glmnet(as.matrix(X),y,alpha = 0,lambda = 10^seq(4,-1,-0.1))

          We take the best lambda by using lambda.min and hence get the regression coefficients using predict function.
          best_lambda = model$lambda.min

ridge_coeff = predict(model,s = best_lambda,type = "coefficients")
ridge_coeff
The coefficients obtained using ridge regression are:
          6 x 1 sparse Matrix of class "dgCMatrix"
          1
          (Intercept) 64.92994664
          Agriculture -0.13619967
          Examination -0.31024840
          Education -0.75679979
          Catholic 0.08978917
          Infant.Mortality 1.09527837
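To visualize the stabilization of coefficients against lambda described earlier, we can plot the ridge coefficient paths with glmnet (a sketch reusing the same X and y) -
ridge_path = glmnet(as.matrix(X), y, alpha = 0, lambda = 10^seq(4,-1,-0.1))
plot(ridge_path, xvar = "lambda", label = TRUE)  # coefficients vs log(lambda)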

          6. Lasso Regression
Lasso stands for Least Absolute Shrinkage and Selection Operator. It makes use of the L1 regularization technique in the objective function. Thus the objective function in LASSO regression becomes:
Minimize  Σ (yi - ŷi)²  +  λ Σ |βj|
          λ is the regularization parameter and the intercept term is not regularized. 
          We do not assume that the error terms are normally distributed.
          For the estimates we don't have any specific mathematical formula but we can obtain the estimates using some statistical software.

          Note that lasso regression also needs standardization.

          Advantage of lasso over ridge regression

Lasso regression can perform in-built variable selection as well as parameter shrinkage. While using ridge regression, one may end up getting all the variables but with shrunken parameters.

          R code for Lasso Regression

          Considering the swiss dataset from "datasets" package, we have: 
          #Creating dependent and independent variables.
          X = swiss[,-1]
          y = swiss[,1]
Using cv.glmnet in the glmnet package we do cross validation. For lasso regression we set alpha = 1. By default standardize = TRUE, hence we do not need to standardize the variables separately.
          #Setting the seed for reproducibility
          set.seed(123)
          model = cv.glmnet(as.matrix(X),y,alpha = 1,lambda = 10^seq(4,-1,-0.1))
          #By default standardize = TRUE

We take the best value of lambda by filtering out lambda.min from the model and then get the coefficients using the predict function.
          #Taking the best lambda
          best_lambda = model$lambda.min
lasso_coeff = predict(model,s = best_lambda,type = "coefficients")
lasso_coeff
The lasso coefficients we got are:
          6 x 1 sparse Matrix of class "dgCMatrix"
          1
          (Intercept) 65.46374579
          Agriculture -0.14994107
          Examination -0.24310141
          Education -0.83632674
          Catholic 0.09913931
          Infant.Mortality 1.07238898


          Which one is better - Ridge regression or Lasso regression?

Both ridge regression and lasso regression are designed to deal with multicollinearity.
Ridge regression is computationally more efficient than lasso regression. Either of them can perform better on a given problem, so the best approach is to select the regression model which fits the test set data well.
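A minimal sketch of that approach on the swiss data (the 70/30 split and nfolds = 5 are arbitrary choices) -
set.seed(123)
train_idx = sample(nrow(swiss), round(0.7 * nrow(swiss)))
X_train = as.matrix(swiss[train_idx, -1]);  y_train = swiss[train_idx, 1]
X_test  = as.matrix(swiss[-train_idx, -1]); y_test  = swiss[-train_idx, 1]

ridge_cv = cv.glmnet(X_train, y_train, alpha = 0, nfolds = 5)
lasso_cv = cv.glmnet(X_train, y_train, alpha = 1, nfolds = 5)

rmse = function(actual, pred) sqrt(mean((actual - pred)^2))
rmse(y_test, predict(ridge_cv, newx = X_test, s = "lambda.min"))  # test RMSE for ridge
rmse(y_test, predict(lasso_cv, newx = X_test, s = "lambda.min"))  # test RMSE for lasso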

          7. Elastic Net Regression
          Elastic Net regression is preferred over both ridge and lasso regression when one is dealing with highly correlated independent variables.

          It is a combination of both L1 and L2 regularization.

The objective function in case of Elastic Net Regression is:
Minimize  Σ (yi - ŷi)²  +  λ1 Σ |βj|  +  λ2 Σ βj²
Like ridge and lasso regression, it does not assume normality.

          R code for Elastic Net Regression

Setting a value of alpha between 0 and 1, we can carry out elastic net regression.
          set.seed(123)
          model = cv.glmnet(as.matrix(X),y,alpha = 0.5,lambda = 10^seq(4,-1,-0.1))
          #Taking the best lambda
          best_lambda = model$lambda.min
          en_coeff = predict(model,s = best_lambda,type = "coefficients")
          en_coeff
The coefficients we obtained are:
          6 x 1 sparse Matrix of class "dgCMatrix"
          1
          (Intercept) 65.9826227
          Agriculture -0.1570948
          Examination -0.2581747
          Education -0.8400929
          Catholic 0.0998702
          Infant.Mortality 1.0775714
          8. Principal Components Regression (PCR) 
PCR is a regression technique which is widely used when you have many independent variables or multicollinearity exists in your data. It is divided into 2 steps:
          1. Getting the Principal components
          2. Run regression analysis on principal components
          The most common features of PCR are:
          1. Dimensionality Reduction
          2. Removal of multicollinearity

          Getting the Principal components

          Principal components analysis is a statistical method to extract new features when the original features are highly correlated. We create new features with the help of original features such that the new features are uncorrelated.

Let us consider the first principal component, which is a linear combination of the original features:
U1 = a11X1 + a12X2 + ..... + a1pXp
The weights are chosen such that the first PC has the maximum variance.
Similarly, we can find the second PC, U2, such that it is uncorrelated with U1 and has the second largest variance.
In a similar manner, for 'p' features we can have a maximum of 'p' PCs such that all the PCs are uncorrelated with each other, the first PC has the maximum variance, the second PC has the next largest variance, and so on.

          Drawbacks:

It is to be mentioned that PCR is not a feature selection technique; instead it is a feature extraction technique. Each principal component we obtain is a function of all the features. Hence, on using principal components, one would be unable to explain to what extent each factor is affecting the dependent variable.

          Principal Components Regression in R

We use the longley dataset available in R, which is known for high multicollinearity. We exclude the Year column.
          data1 = longley[,colnames(longley) != "Year"]

View(data1)
This is how some of the observations in our dataset will look:
          We use pls package in order to run PCR.
          install.packages("pls")
          library(pls)

          In PCR we are trying to estimate the number of Employed people; scale  = T denotes that we are standardizing the variables; validation = "CV" denotes applicability of cross-validation.
          pcr_model <- pcr(Employed~., data = data1, scale = TRUE, validation = "CV")
          summary(pcr_model)

          We get the summary as:
          Data:  X dimension: 16 5 
          Y dimension: 16 1
          Fit method: svdpc
          Number of components considered: 5

          VALIDATION: RMSEP
          Cross-validated using 10 random segments.
          (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
          CV 3.627 1.194 1.118 0.5555 0.6514 0.5954
          adjCV 3.627 1.186 1.111 0.5489 0.6381 0.5819

          TRAINING: % variance explained
          1 comps 2 comps 3 comps 4 comps 5 comps
          X 72.19 95.70 99.68 99.98 100.00
          Employed 90.42 91.89 98.32 98.33 98.74

Here RMSEP denotes the root mean squared error of prediction, while 'TRAINING: % variance explained' gives the cumulative % of variance explained by the principal components. We can see that with 3 PCs more than 99% of the variation in the predictors can be explained.
          We can also create a plot depicting the mean squares error for the number of various PCs.
          validationplot(pcr_model,val.type = "MSEP")
By writing val.type = "R2" we can plot the R-square for various numbers of PCs.
          validationplot(pcr_model,val.type = "R2")
If we want to fit PCR with 3 principal components and get the predicted values, we can write:
          pred = predict(pcr_model,data1,ncomp = 3)

          9. Partial Least Squares (PLS) Regression 

It is an alternative to principal component regression when the independent variables are highly correlated. It is also useful when there are a large number of independent variables.

          Difference between PLS and PCR
Both techniques create new independent variables, called components, which are linear combinations of the original predictor variables. However, PCR creates components that explain the observed variability in the predictor variables without considering the response variable at all, whereas PLS takes the dependent variable into account and therefore often fits the dependent variable with fewer components.
          PLS Regression in R
library(plsdepot)
data(vehicles)
# Response is column 13 of 'vehicles'; the remaining columns are predictors
pls.model = plsreg1(vehicles[, c(1:12,14:16)], vehicles[, 13], comps = 3)
          # R-Square
          pls.model$R2
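A couple of other elements of the fitted object are often useful. The sketch below assumes the plsdepot object exposes standardized coefficients as std.coefs and fitted values as y.pred; treat these names as assumptions to verify against your installed version of the package.
# Sketch (assumed plsdepot components): coefficients and observed vs fitted
pls.model$std.coefs
plot(vehicles[, 13], pls.model$y.pred,
     xlab = "Observed", ylab = "Predicted")
abline(a = 0, b = 1, col = "red")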


          10. Support Vector Regression

Support vector regression can handle both linear and non-linear relationships. For non-linear problems, SVM uses kernel functions (such as the polynomial or radial basis kernel) to find the optimal solution.

The main idea of SVR is to minimize error while finding the hyperplane that maximizes the margin of tolerance around the fitted function. The snippet below is an illustrative example using the built-in cars dataset (dist as the response, speed as the predictor); substitute your own variables and data frame.
library(e1071)
svr.model <- svm(dist ~ speed, data = cars)
pred <- predict(svr.model, cars)
plot(cars$speed, cars$dist)
points(cars$speed, pred, col = "red", pch = 4)
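The default kernel and cost parameters are rarely optimal. A minimal tuning sketch with e1071's tune() is shown below; the parameter grid is arbitrary and only for illustration.
# Sketch: grid-search cost and epsilon for the SVR model above
tuned <- tune(svm, dist ~ speed, data = cars,
              ranges = list(cost = 2^(0:4), epsilon = c(0.1, 0.2, 0.5)))
tuned$best.parameters
best.svr <- tuned$best.model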

          11. Ordinal Regression

          Ordinal Regression is used to predict ranked values. In simple words, this type of regression is suitable when dependent variable is ordinal in nature. Example of ordinal variables - Survey responses (1 to 6 scale), patient reaction to drug dose (none, mild, severe).

Why can't we use linear regression when dealing with an ordinal target variable?

Linear regression assumes that changes in the level of the dependent variable mean the same thing throughout its range. For example, the difference in weight between a person who weighs 100 kg and a person who weighs 120 kg is 20 kg, and it has the same meaning as the 20 kg difference between a person who weighs 150 kg and a person who weighs 170 kg. These relationships do not necessarily hold for ordinal variables.
library(ordinal)
# 'wine' ships with the ordinal package; 'rating' is an ordered factor
o.model <- clm(rating ~ ., data = wine)
          summary(o.model)
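Predicted classes can then be obtained with predict(). The sketch below assumes that dropping the response column from newdata makes predict() return the most likely rating for each observation, stored in the fit element of the result.
# Sketch: predicted rating class for each wine
# (response column dropped from newdata so classes are returned)
pred.class <- predict(o.model, newdata = subset(wine, select = -rating),
                      type = "class")
table(observed = wine$rating, predicted = pred.class$fit)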

          12. Poisson Regression

Poisson regression is used when the dependent variable is count data.

          Application of Poisson Regression -
          1. Predicting the number of calls in customer care related to a particular product
          2. Estimating the number of emergency service calls during an event
The dependent variable must meet the following conditions:
1. The dependent variable has a Poisson distribution.
2. Counts cannot be negative.
3. Counts must be whole numbers; the method is not suitable for fractional values.

In the code below, we use the warpbreaks dataset, which records the number of breaks in yarn during weaving. The model includes terms for wool type, tension and the interaction between the two.
          pos.model<-glm(breaks~wool*tension, data = warpbreaks, family=poisson)
          summary(pos.model)
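Before settling on the Poisson model, it is worth checking for overdispersion, which motivates the negative binomial model in the next section. A simple informal check, sketched below, compares the residual deviance to its degrees of freedom; a ratio well above 1 suggests the variance exceeds the mean.
# Informal overdispersion check for the Poisson model
deviance(pos.model) / df.residual(pos.model)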

          13. Negative Binomial Regression

Like Poisson regression, it also deals with count data. How is it different from Poisson regression? Negative binomial regression does not assume that the variance of the counts equals their mean, whereas Poisson regression does.
When the variance of the count data is greater than the mean count, it is a case of overdispersion; when it is smaller, it is a case of under-dispersion.
library(MASS)
# 'quine' (days of school absence) ships with the MASS package
nb.model <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)
          summary(nb.model)

          14. Quasi Poisson Regression

It is an alternative to negative binomial regression and can also be used for overdispersed count data. The two approaches give similar results, but they differ in how the effects of covariates are estimated: the variance of a quasi-Poisson model is a linear function of the mean, while the variance of a negative binomial model is a quadratic function of the mean.
          qs.pos.model <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine,  family = "quasipoisson")
          Quasi-Poisson regression can handle both over-dispersion and under-dispersion.
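To see the practical difference, one can fit the same formula with family = poisson and compare the coefficient standard errors; the quasi-Poisson errors are rescaled by the estimated dispersion. A minimal sketch, fitting the plain Poisson model only for comparison, is shown below.
# Sketch: compare standard errors under Poisson and quasi-Poisson
pos.fit <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine, family = "poisson")
cbind(poisson      = summary(pos.fit)$coefficients[, "Std. Error"],
      quasipoisson = summary(qs.pos.model)$coefficients[, "Std. Error"])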


          15. Cox Regression

          Cox Regression is suitable for time-to-event data. See the examples below -
1. Time from when a customer opens an account until attrition.
          2. Time after cancer treatment until death.
          3. Time from first heart attack to the second.
          Logistic regression uses a binary dependent variable but ignores the timing of events. 
          As well as estimating the time it takes to reach a certain event, survival analysis can also be used to compare time-to-event for multiple groups.

Two targets are set for the survival model:
1. A continuous variable representing the time to event.
2. A binary variable representing whether the event occurred or not.
          library(survival)
          # Lung Cancer Data
          # status:2=death
          lung$SurvObj <- with(lung, Surv(time, status == 2))
          cox.reg <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data =  lung)
          cox.reg
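To visualise the fitted model, the estimated survival curve can be plotted via survfit(); a minimal sketch is given below (by default the curve is evaluated at the mean values of the covariates).
# Sketch: survival curve implied by the fitted Cox model
plot(survfit(cox.reg), xlab = "Days", ylab = "Survival probability")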

          How to choose the correct regression model?
1. If the dependent variable is continuous and the model suffers from collinearity, or there are a lot of independent variables, you can try PCR, PLS, ridge, lasso and elastic net regression. You can select the final model based on adjusted R-squared, RMSE, AIC and BIC.
2. If you are working with count data, you should try Poisson, quasi-Poisson and negative binomial regression.
3. To avoid overfitting, we can use cross-validation to evaluate models used for prediction. Ridge, lasso and elastic net regression can also be used to correct an overfitting issue.
4. Try support vector regression when the relationship is non-linear.

          Run Python from R

This article explains how to call or run Python from R. Both tools have their own advantages and disadvantages, and it is often a good idea to combine the best packages and functions from both. In the data science world, these tools have a large market share in terms of usage: R is mainly known for data analysis, statistical modeling and visualization, while Python is popular for deep learning and natural language processing.

In a recent KDnuggets analytics software poll, Python and R were ranked the top two tools for data science and machine learning. If you really want to boost your career in data science, these are the languages you need to focus on.
          Combine Python and R

          RStudio developed a package called reticulate which provides a medium to run Python packages and functions from R.

          Install and Load Reticulate Package

Run the commands below to install the package and load it into your session.
          # Install reticulate package
          install.packages("reticulate")

          # Load reticulate package
          library(reticulate)

          Check whether Python is available on your system
          py_available()
          It returns TRUE/FALSE. If it is TRUE, it means python is installed on your system.

          Import a python module within R

          You can use the function import( ) to import a particular package or module.
          os <- import("os")
          os$getcwd()
The above program returns the working directory.
          [1] "C:\\Users\\DELL\\Documents"

You can use the listdir() function from the os module to see all the files in the working directory.
          os$listdir()
           [1] ".conda"".gitignore"".httr-oauth"
          [4] ".matplotlib"".RData"".RDataTmp"
          [7] ".Rhistory""1.pdf""12.pdf"
          [10] "122.pdf""124.pdf""13.pdf"
          [13] "1403.2805.pdf""2.pdf""3.pdf"
          [16] "AIR.xlsx""app.r""Apps"
          [19] "articles.csv""Attrition_Telecom.xlsx""AUC.R"


          Install Python Package

          Step 1 : Create a new environment 
          conda_create("r-reticulate")
          Step 2 : Install a package within a conda environment
          conda_install("r-reticulate", "numpy")
          Since numpy is already installed, you don't need to install it again. The above example is just for demonstration.

          Step 3 : Load the package
          numpy <- import("numpy")

          Working with numpy array

          Let's create a sample numpy array
          y <- array(1:4, c(2, 2))
          x <- numpy$array(y)
               [,1] [,2]
          [1,] 1 3
          [2,] 2 4


          Transpose the above array
          numpy$transpose(x)
              [,1] [,2]
          [1,] 1 2
          [2,] 3 4

Eigenvalues and eigenvectors
          numpy$linalg$eig(x)
          [[1]]
          [1] -0.3722813 5.3722813

          [[2]]
          [,1] [,2]
          [1,] -0.9093767 -0.5657675
          [2,] 0.4159736 -0.8245648

          Mathematical Functions
          numpy$sqrt(x)
          numpy$exp(x)

          Working with Python interactively

          You can create an interactive Python console within R session. Objects you create within Python are available to your R session (and vice-versa).

By using the repl_python() function, you can make it interactive. Download the dataset used in the program below.
          repl_python()

          # Load Pandas package
          import pandas as pd

          # Importing Dataset
          travel = pd.read_excel("AIR.xlsx")

          # Number of rows and columns
          travel.shape

          # Select random no. of rows
          travel.sample(n = 10)

          # Group By
          travel.groupby("Year").AIR.mean()

          # Filter
          t = travel.loc[(travel.Month >= 6) & (travel.Year >= 1955),:]

          # Return to R
          exit
          Note : You need to enter exit to return to the R environment.

          How to access objects created in python from R

You can use the py object to access objects created within Python.
          summary(py$t)
In this case, I am using R's summary() function on the dataframe t which was created in Python. Similarly, you can create a line chart using the ggplot2 package.
          # Line chart using ggplot2
          library(ggplot2)
          ggplot(py$t, aes(AIR, Year)) + geom_line()

          How to access objects created in R from Python

          You can use the r object to accomplish this task. 

1. Let's create an object in R
          mydata = head(cars, n=15)
          2. Use the R created object within Python REPL
          repl_python()
          import pandas as pd
          r.mydata.describe()
          pd.isnull(r.mydata.speed)
          exit

          Building Logistic Regression Model using sklearn package

The sklearn package is one of the most popular packages for machine learning in Python. It supports various statistical and machine learning algorithms.
          repl_python()

          # Load libraries
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

          # load the iris datasets
          iris = datasets.load_iris()

          # Developing logit model
          model = LogisticRegression()
          model.fit(iris.data, iris.target)

          # Scoring
          actual = iris.target
          predicted = model.predict(iris.data)

          # Performance Metrics
          print(metrics.classification_report(actual, predicted))
          print(metrics.confusion_matrix(actual, predicted))
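After typing exit to return to R, the objects created above can be pulled back through the py object described earlier; for example, a confusion matrix can be cross-tabulated on the R side as sketched below.
exit

# Back in R: cross-tabulate actual vs predicted classes created in Python
table(py$actual, py$predicted)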

          Other Useful Functions

          To see configuration of python

Run the py_config() command to find the version of Python installed on your system. It also shows details about Anaconda and NumPy.
          py_config()
          python:         C:\Users\DELL\ANACON~1\python.exe
          libpython: C:/Users/DELL/ANACON~1/python36.dll
          pythonhome: C:\Users\DELL\ANACON~1
          version: 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
          Architecture: 64bit
          numpy: C:\Users\DELL\ANACON~1\lib\site-packages\numpy
          numpy_version: 1.14.2


          To check whether a particular package is installed

In the following program, we check whether the pandas package is installed or not.
          py_module_available("pandas")
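A common pattern is to combine this check with conda_install() so that a missing module is installed on demand; a minimal sketch reusing the "r-reticulate" environment created earlier is shown below.
# Install pandas into the conda environment only if it is not available
if (!py_module_available("pandas")) {
  conda_install("r-reticulate", "pandas")
}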