
When and why to standardize a variable

This tutorial explains when, why and how to standardize a variable in statistical modeling. Variable standardization is one of the most important concepts in predictive modeling and is a common preprocessing step when building a predictive model. Standardization is also called normalization or scaling.

Standardization / Scaling

The concept of standardization comes into the picture when continuous independent variables are measured on different scales, which means they do not contribute equally to the analysis. For example, suppose we are performing a customer segmentation analysis in which we are trying to group customers based on their homogenous (similar) attributes. A variable called 'transaction amount' that ranges between $100 and $10000 carries more weight than a variable such as number of transactions, which generally ranges between 0 and 30. Hence, it is required to transform the data to comparable scales. The idea is to rescale the original variables to have equal ranges and/or variances.


Methods of Standardization / Normalization

There are four main methods of standardization. They are as follows -

1. Z score

Z score standardization is one of the most popular methods to normalize data. In this case, we rescale the original variable to have a mean of zero and a standard deviation of one.
Z = (x - mean(x)) / sd(x)
Mathematically, the scaled value is calculated by subtracting the mean of the original variable from the raw value and then dividing it by the standard deviation of the original variable.

R Code : Standardize a variable using Z-score
# Creating a sample data
set.seed(123)
X =data.frame(k1 = sample(100:1000,1000, replace=TRUE),
              k2 = sample(10:100,1000, replace=TRUE))
X.scaled = scale(X, center= TRUE, scale=TRUE)
In the scale() function, center = TRUE implies subtracting the column mean from each value of the variable, and scale = TRUE implies dividing each centered column by its standard deviation.

Check Mean and Variance of Standardized Variable
colMeans(X.scaled)
Result : 0 for both k1 and k2
var(X.scaled)
Result : 1 for both k1 and k2

Interpretation
A value of 1 implies that the case is one standard deviation above the mean, while a value of -1 indicates that the case is one standard deviation below the mean.
Important Point
The standardized values do not lie in a fixed interval; they can be any real number.
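For instance, a quick check on the X.scaled object created above (a minimal sketch, base R only):
range(X.scaled[, "k1"])
# For this roughly uniform sample the scaled values stay within about +/-1.7,
# but for skewed or heavy-tailed data values far beyond +/-3 are possible.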

2. Min-Max Scaling

It is also called 0-1 scaling because the standardized value using this method lies between 0 and 1.

The formula is shown below -
(x - min(x)) / (max(x) - min(x))
This method makes the range of every variable equal (0 to 1), while the means and standard deviations remain different.
library(dplyr)
mins= as.integer(summarise_all(X, min))
rng = as.integer(summarise_all(X, function(x) diff(range(x))))
X.scaled = data.frame(scale(X, center= mins, scale=rng))
Check Min and Max of standardized variables
summarise_all(X.scaled, funs(min, max))
k1_min k2_min k1_max k2_max
 0      0      1      1

3. Standard Deviation Method

In this method, we divide each value by the standard deviation. The idea is to have equal variance, but different means and ranges. Formula : x/stdev(x)
X.scaled = data.frame(scale(X, center= FALSE , scale=apply(X, 2, sd, na.rm = TRUE)))
Check Equal Variance
summarise_all(X.scaled, var)
Result : 1 for both the variables

4. Range Method

In this method, we divide each value by the range. Formula : x / (max(x) - min(x)). In this case, the means and variances of the variables still differ, but every variable's range (max minus min) becomes exactly 1, so the variables span comparable widths.
library(dplyr)
rng = as.integer(summarise_all(X, function(x) diff(range(x))))
X.scaled = data.frame(scale(X, center= FALSE, scale=rng))
summarise_all(X.scaled, var)
var(k1) = 0.08614377
var(k2) = 0.08833861

What is Centering?

Centering means subtracting a constant value from every value of a variable. The constant can be the average, the minimum or the maximum; most of the time, the average is used.
X=sample(1:100,1000, replace=TRUE)
scale(X,center = TRUE, scale=FALSE)
By default, the scale() function with center = TRUE subtracts the mean from every value of a variable.


When is it important to standardize variables?

1. It is important to standardize variables before running Cluster Analysis. It is because cluster analysis techniques depend on the concept of measuring the distance between the different observations we're trying to cluster. If a variable is measured at a higher scale than the other variables, then whatever measure we use will be overly influenced by that variable.

2. Prior to Principal Component Analysis, it is critical to standardize variables, because PCA gives more weight to variables that have higher variances than to variables with very low variances. In effect, the results of the analysis would depend on the units of measurement used for each variable. Standardizing the raw values gives every variable equal variance, so no variable dominates simply because of its measurement scale.

3. It is required to standardize variables before using k-nearest neighbors with a Euclidean distance measure, so that all variables contribute equally to the distance.

4. SVM kernel methods are based on distances, so it is required to scale variables prior to running the final Support Vector Machine (SVM) model.

5. It is necessary to standardize variables before using Lasso and Ridge regression. Lasso regression puts a constraint on the size of the coefficients, and the penalty each coefficient receives depends on the scale of its variable. Centering the variables also means there is no longer an intercept. This applies equally to ridge regression.

6. In regression analysis, we can rank the importance of independent variables by the absolute value of their standardized coefficients, in descending order.

7. In regression analysis, when an interaction term is created from two variables that are not centered on 0, some amount of collinearity is induced. Centering the variables first addresses this potential problem. In simple terms, with non-centered variables, when X1 is large, X1*X2 is also large in absolute terms irrespective of X2, so X1 and X1*X2 end up correlated.

8. In regression analysis, it is also helpful to center a variable when you include power terms such as X², since centering reduces the collinearity between X and X² (see the R sketch after this list).
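The following small R sketch (illustrative only, not part of the original example) shows how centering shrinks the correlation between a predictor and its interaction or squared term:
# Simulated predictors on arbitrary, non-centered scales
set.seed(1)
x1 <- rnorm(1000, mean = 50, sd = 5)
x2 <- rnorm(1000, mean = 30, sd = 3)
cor(x1, x1 * x2)      # markedly positive: the raw interaction tracks x1's magnitude
cor(x1, x1^2)         # very close to 1
# Center first, then build the interaction and the square
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
cor(x1c, x1c * x2c)   # near 0
cor(x1c, x1c^2)       # near 0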

When it is not required to standardize variables

1. If you think the performance of a linear regression model will improve when you standardize the variables, that is incorrect. Standardization does not change the RMSE, R-squared, adjusted R-squared, or the p-values of the coefficients. See the detailed R script below; it shows that standardization does not affect model performance at all.

Without Standardization -
# Create Sample Data
set.seed(123)
train <- data.frame(X1=sample(1:100,1000, replace=TRUE),
                    X2=1e2*sample(1:500,1000, replace=TRUE),
                    X3=1e-2*sample(1:100,1000, replace=TRUE))
train$y <- with(train,2*X1 + 3*1e-2*X2 - 5*1e2*X3 + 1 + rnorm(1000,sd=10))

#Fit linear regression model
fit  <- lm(y~X1+X2+X3,train)
summary(fit)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  1.216e+00  9.732e-01    1.25    0.212  
X1           1.984e+00  1.089e-02  182.19   <2e-16 ***
X2           3.000e-02  2.188e-05 1371.21   <2e-16 ***
X3          -4.990e+02  1.070e+00 -466.21   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.849 on 996 degrees of freedom
Multiple R-squared:  0.9995,Adjusted R-squared:  0.9995
F-statistic: 6.799e+05 on 3 and 996 DF,  p-value: < 2.2e-16
# create test dataset
test <- data.frame(X1=sample(-5:5,100,replace=TRUE),
                   X2=1e2*sample(-5:5,100, replace=TRUE),
                   X3=1e-2*sample(-5:5,100, replace=TRUE))
# predict y based on test data without standardization
pred   <- predict(fit,newdata=test)
With Standardization
# Standardize predictors
means   <- sapply(train[,1:3],mean)
stdev <- sapply(train[,1:3],sd)
train.scaled <- as.data.frame(scale(train[,1:3],center=means,scale=stdev))
head(train.scaled)
train.scaled$y <- train$y
# Check mean and Variance of Standardized Variables
library(dplyr)
summarise_at(train.scaled, vars(X1,X2,X3), funs(round(mean(.),4)))
summarise_at(train.scaled, vars(X1,X2,X3), var)
Result : Mean is 0 and Variance is 1 for all the standardized variables
#Fit Scaled Data
fit.scaled <- lm(y ~ X1 + X2 + X3, train.scaled)
summary(fit.scaled)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  598.4244     0.3114  1921.4   <2e-16 ***
X1            57.0331     0.3130   182.2   <2e-16 ***
X2           428.6441     0.3126  1371.2   <2e-16 ***
X3          -145.8587     0.3129  -466.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.849 on 996 degrees of freedom
Multiple R-squared:  0.9995,Adjusted R-squared:  0.9995
F-statistic: 6.799e+05 on 3 and 996 DF,  p-value: < 2.2e-16

Compare Coefficients, R-Squared and Adjusted R-Squared
The coefficient values are not the same when we run the regression with and without standardizing the independent variables. This does not mean the model is affected by scaling. The values differ because the coefficients are slopes - how much the target variable changes when an independent variable changes by 1 unit - and after standardization one 'unit' is one standard deviation. In other words, standardization simply rescales the slopes. The multiple R-squared and adjusted R-squared values are exactly the same.
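A quick check, using the fit, fit.scaled and stdev objects created in the scripts above, makes the relationship explicit: each standardized slope equals the unstandardized slope multiplied by the standard deviation of its predictor.
# Standardized slope = unstandardized slope * sd(X)
coef(fit)[-1] * stdev
coef(fit.scaled)[-1]
# Both commands print (approximately) the same three values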

How to standardize validation / test dataset

To standardize the validation and test datasets, we use the mean and standard deviation of the independent variables from the training data and apply them to the test dataset using the Z-score formula. See the formula below -
Z = (X_test - Xbar_training) / Stdev_training 
R Script - Standardize Test Data
test.scaled <- as.data.frame(scale(test,center=means,scale=stdev))
head(test.scaled)
Compare Prediction and RMSE
# predict y based on new data scaled, with fit from scaled dataset
pred.scaled   <- predict(fit.scaled,newdata=test.scaled)
# Compare Prediction - unscaled vs. scaled fit
all.equal(pred,pred.scaled)
# RMSE on train data with un-scaled fit
pred_train   <- predict(fit,newdata=train)
rmse <- sqrt(mean((train$y - pred_train)^2))
# RMSE on train data with scaled fit
pred_train.scaled   <- predict(fit.scaled,newdata=train.scaled)
rmse.scaled <- sqrt(mean((train$y - pred_train.scaled)^2))
# Compare RMSE
all.equal(rmse,rmse.scaled)
RMSE is the same in both cases (9.829196), because RMSE is on the scale of Y (the target variable), which we did not standardize. The predictions are also unchanged.

Interpretation of Standardized Regression Coefficient
Most modern statistical software automatically produces standardized regression coefficients, which are an important metric for ranking predictors. Their interpretation is slightly different from unstandardized estimates: a standardized coefficient is the number of standard deviations Y changes for a one standard deviation increase in X.
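If your software does not report them, standardized coefficients can also be computed from an unstandardized fit. A minimal sketch using the train data and fit object from the example above (the formula beta_std = beta * sd(x) / sd(y) is the standard one; the object names are those defined earlier):
# Fully standardized coefficients: beta_j * sd(x_j) / sd(y)
std_coef <- coef(fit)[-1] * sapply(train[, c("X1", "X2", "X3")], sd) / sd(train$y)
# Rank predictors by absolute standardized coefficient
sort(abs(std_coef), decreasing = TRUE)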

Correlation with or without Centering / Standardization

The correlation score does not change if you perform correlation analysis on centered and uncentered data.
X=sample(1:100,1000, replace=TRUE)
Y=1e2*sample(1:500,1000, replace=TRUE)
cor(X,Y)
cor(X-mean(X), Y-mean(Y))
Standardization after missing imputation and outlier treatment
Centering and scaling should be done after imputing missing values, because imputation can change the correct center and scale to use. Similarly, outlier treatment should be done prior to standardization.
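A tiny illustration of why the order matters (toy numbers, not from the original post):
x <- c(2, 4, NA, 8)
mean(x, na.rm = TRUE)   # 4.67 - the center you would get before imputation
x[is.na(x)] <- 0        # impute the missing value with 0 (for illustration)
mean(x)                 # 3.5  - the center that should actually be used for scaling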

Standardize Binary (Dummy) Variables
  1. Standardizing binary variables makes their interpretation vague, since a binary variable cannot be increased by one standard deviation. The simplest solution is not to standardize binary variables but to code them as 0/1, and then standardize all other continuous variables by dividing them by two standard deviations. This puts them on approximately the same scale: the standard deviation of both types of variable will be roughly 0.5 (see the R sketch after this list).
  2. Some researchers favor standardizing binary variables, as it puts all predictors on the same scale. This is standard practice in penalized regression (lasso); in that case, the interpretation of the coefficients is sacrificed.
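A minimal sketch of the 'two standard deviations' idea from point 1 (simulated data, not from the original post):
set.seed(1)
dummy <- rbinom(1000, 1, 0.5)                # 0/1 variable; its sd is close to 0.5
cont  <- rnorm(1000, mean = 100, sd = 20)    # continuous variable on a larger scale
cont_scaled <- (cont - mean(cont)) / (2 * sd(cont))
sd(dummy)        # roughly 0.5
sd(cont_scaled)  # exactly 0.5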

Standardization and Tree Algorithms and Logistic Regression
Standardization does not affect the predictions of logistic regression, decision trees, or other ensemble techniques such as random forest and gradient boosting.

How to Integrate R with PHP

This tutorial explains how to integrate R with PHP.

Online reporting tools have gained popularity in recent years, and there is a growing demand to implement advanced analytics in them. Advanced analytics helps solve various organizational problems such as retaining existing customers, acquiring new customers, and increasing customer satisfaction.

PHP is one of the most popular programming languages for developing websites and online reporting tools. It has rich functionality for writing business logic, but it is not effective when it comes to data science and machine learning. In the field of data science, R dominates in popularity among statisticians and data scientists, with more than 10,000 packages.

How to make PHP communicate with R

There are times when you want to showcase the output of an R program, such as charts built from user-submitted data on a web page. In that case you might want your PHP-based web application to communicate with the R script.

When it comes to PHP, it has a very useful function called exec(). It lets you execute an external program. We will use this very function to execute the R script you created; the script generates the graph, and we then show the graph on our web page.

The exec function can be used on both the Linux and Windows environments.

  1. On Linux, it runs the command you set, along with the arguments you specify, in a shell.
  2. On Windows, it runs the command you provide, along with the arguments you specify, via CMD.

I will walk you through the process of integrating the R code with PHP web page with code and explanation. 

Let’s first create a PHP based web form:

index.php:
<html>
  <head>
    <title>PHP and R Integration Sample</title>
  </head>
  <body>
    <div id="r-output" style="width: 100%; padding: 25px;">
    <?php
      // Execute the R script from within the PHP code.
      // It generates the output as a test.png image.
      exec("Rscript sample.R");
    ?>
    <img src="test.png?ver1.1" alt="R Graph" />
    </div>
  </body>
</html>
Now save the file as index.php under your /htdocs/PROJECT-NAME/index.php.

Let’s create a sample chart using R code.

Write the following code and save it as sample.R file.
x <- rnorm(6,0,1)
png(filename="test.png", width=500, height=500)
hist(x, col="red")
dev.off()
Histogram
rnorm(6, 0, 1) generates 6 random values with mean 0 and standard deviation 1. The dev.off() command closes the graphics device; once the chart is created, it is saved as the test.png file.
The only downside of this code is that it creates the same test.png file for every incoming request. This means that if you are creating charts based on user-specified inputs, each request overwrites the single test.png file.
Let’s understand the code

As specified earlier, exec("Rscript sample.R"); executes the R script, which in turn generates the test.png graph image.

In the very next line we use the HTML <img /> tag to display the image generated by the R program. The src="test.png?ver1.1" query string (ver1.1) is used to invalidate the browser cache and force the new image to be downloaded from the server.

All modern browsers support caching. You might have noticed that some websites load much faster on repeat visits; this is because browsers cache images and other static resources for a brief period of time.

How to serve concurrent requests?

sample2.R
args <- commandArgs(TRUE)
cols <- as.integer(args[1])   # number of random values to draw
fname <- args[2]              # output file name (without extension)
x <- rnorm(cols, 0, 1)
fname <- paste(fname, "png", sep = ".")
png(filename = fname, width = 500, height = 500)
hist(x, col = "red")
dev.off()
Index.php
<html>
  <head>
    <title>PHP and R Integration Sample</title>
  </head>
  <body>
    <div id="r-output" style="width: 100%; padding: 25px;">
    <?php
      // Execute the R script from within the PHP code.
      // A random file name avoids overwriting the image on concurrent requests.
      $filename = "samplefile" . rand(1, 100);
      exec("Rscript sample2.R 6 " . $filename);
    ?>
    <img src="<?php echo $filename; ?>.png?ver1.1" alt="R Graph" />
    </div>
  </body>
</html>
This eliminates the need to reuse the same test.png file name. Here $filename starts with "samplefile" followed by a random number appended at the end; rand(min, max) generates a random integer in that range.

This fixes the file-overwriting issue, and you will be able to handle concurrent requests, serving each one with its own unique image(s).

You might need to take care of removing old files. If you are on a Linux machine, you can set up a cron job that finds and deletes chart image files older than 24 hours.

Here is the code to find and remove files:

Delete.php
<?php
// Set the path to your chart image directory
$dir = "images/temp/";
// Loop through all the chart png files inside the directory
foreach (glob($dir . "*.png") as $file) {
    // Delete the file if it is more than 24 hours old
    if (filemtime($file) < time() - 86400) {
        unlink($file);
    }
}
?>

Conclusion

Making PHP communicate with R and showcase the result is very simple. You only need to understand the exec() function and a little PHP if you want to delete the residual files/images generated by your R program.

Author Bio
This article was originally written by Darshan Joshi; Deepanshu later gave the post its final touches. Darshan is a programming enthusiast who loves to help developers in every possible way. He is the founder of AlphansoTech, a web application development company. You can connect with him on Twitter and LinkedIn.

Check number of observations in SAS dataset

This post explains how to determine the number of observations in a SAS dataset. We often need to check whether a SAS dataset is empty or not; in a macro, we generally tell SAS to go to the next iteration only when the dataset is non-empty. In this post, we will see various methods to count the number of rows (records) in a SAS table.

Method 1 : Proc SQL Count (Not Efficient)

In the example below, we will use CARS dataset from SASHELP library. This dataset contains 428 observations and 15 columns.

The easiest method is to use count(*) in Proc SQL. It returns all rows (missing plus non-missing rows) in a dataset.
proc sql;
 select count(*) as N from sashelp.cars;
quit;
Result : 428

In case you want to store it in a macro variable, you can use INTO : keyword.
proc sql noprint;
 select count(*) into :N from sashelp.cars;
quit;

%put &N;
This will print the number of records in SAS log. Check log after running the above program.

Is it an efficient method?
No, it is not efficient at all. It does not use the metadata of the SAS dataset; instead it reads through every record (row), which takes a long time on big SAS tables. However, it is a simple and handy trick to calculate the number of rows in a SAS dataset.

Method 2 : Descriptor Portion (Efficient)

Before getting into detail, we need to understand the descriptor portion and how it works -

A SAS dataset consists of the following two portions -
  1. Descriptor portion. It contains information about the dataset: its name, the number of observations and variables, the creation date, and the engine type.
  2. Data portion. It stores the data values.
This method is one of the most efficient ways to count observations in a SAS table, as it uses the metadata and does not read through the dataset.
data _NULL_;
 if 0 then set sashelp.cars nobs=n;
 put "no. of observations = " n;
 stop;
run;
Explanation
  1. The 'if 0' condition is never true, so the SET statement never executes at run time. The IF-THEN statement is used only at compile time to pull the header information of the dataset and hand it to the compiler so the variables are added to the PDV.
  2. NOBS= is a SET statement option that stores the number of observations of the dataset (here SASHELP.CARS) in the variable you name.
  3. NOBS=n therefore puts the record count into the variable n.
  4. The STOP statement prevents the data step from looping endlessly.
Like the first method, we can keep it in a macro variable. See the implementation below -
data _NULL_;
 if 0 then set sashelp.cars nobs=n;
 call symputx('totobs',n);
 stop;
run;
%put no. of observations = &totobs;
SAS log : no. of observations = 428

CALL SYMPUT is one of the methods to create a SAS macro variable in a data step. Here we have used the newer routine CALL SYMPUTX, which left-justifies the value and strips leading and trailing blanks from a numeric value. If you want to stick to the old-style CALL SYMPUT, you can write it as below -
call symput('totobs',left(n));

Method 3 : Proc SQL Dictionary Method (Efficient)

Like the second method, this one uses the metadata of a dataset, via PROC SQL and DICTIONARY.TABLES.
proc sql noprint;
 select nobs into :totobs separated by '' from dictionary.tables
 where libname='SASHELP' and memname='CARS';
quit;
%put total records = &totobs.;
Proc SQL Dictionary.Tables

It is an efficient method, as it does not look at each value of the dataset to determine the count. LIBNAME= refers to the name of the library in which the data is stored, and MEMNAME= refers to the SAS table (dataset). The separated by '' is used in this case to left-align the numeric value.

Method 4 : Macro Language Method (Efficient)

This method also uses the metadata, but via the macro language using DATA step functions. The OPEN function opens a dataset, and the ATTRN function returns the value of a numeric attribute of a SAS dataset; with the NOBS argument it returns the number of observations. Finally, the opened dataset is closed with the CLOSE function.
%macro totobs(mydata);
    %let mydataID=%sysfunc(OPEN(&mydata.,IN));
    %let NOBS=%sysfunc(ATTRN(&mydataID,NOBS));
    %let RC=%sysfunc(CLOSE(&mydataID));
    &NOBS
%mend;
%put %totobs(sashelp.cars);

SAS : Check if a table is empty

Suppose you only need to check whether a table is empty or not. You can use the same logic as explained above: if the returned value is 0, write 'empty dataset' to the log; otherwise, report the number of records.
data _NULL_;
if 0 then set sashelp.cars nobs=n;
if n = 0 then put 'empty dataset';
else put 'Not empty. Total records=' n;
stop;
run;
Result : Not Empty. Total records = 428

Let's create a blank dataset to check the above code. The following program returns an empty dataset, as the condition 1 = 2 is never met.
proc sql noprint;
create table temp as
select * from sashelp.cars
where 1 = 2;
quit;
Try it yourself!

Let's wrap the above code in a SAS macro
%macro emptydataset (inputdata=);
data _NULL_;
 if 0 then set &inputdata. nobs=n;
 call symputx('totobs',n);
 stop;
run;
%if &totobs. = 0 %then %put Empty dataset;
%else %do;
%put TotalObs=&totobs;
%end;
%mend;
%emptydataset(inputdata=sashelp.cars);
Result : TotalObs=428

%emptydataset(inputdata=work.temp);
Result : Empty dataset

If you find it difficult to memorize the SAS code of the descriptor portion method, you can use the code below.
data _NULL_;
set sashelp.cars nobs=N;
if _N_ = 2 then stop;
put N;
run;
SAS log
It reads only first two observations from the dataset. See log above.

Python Tutorial for Beginners: Learn in 3 Days

This tutorial helps you get started with Python. It's a step-by-step practical guide to learning Python by example. Python is an open source language widely used as a high-level, general-purpose programming language, and it has gained great popularity in the data science world. As the data science domain grows, IBM recently predicted that demand for data science professionals would rise by more than 25% by 2020. In the PyPL Popularity of Programming Language index, Python ranked second with a 14 percent share, and in the advanced and predictive analytics market it is ranked among the top 3 programming languages.
Learn Python : Tutorial for Beginners

Table of Contents
  1. Getting Started with Python
    • Python 2.7 vs. 3.6
    • Python for Data Science
    • How to install Python?
    • Spyder Shortcut keys
    • Basic programs in Python
    • Comparison, Logical and Assignment Operators
  2. Data Structures and Conditional Statements
    • Python Data Structures
    • Python Conditional Statements
  3. Python Libraries
    • List of popular packages (comparison with R)
    • Popular python commands
    • How to import a package
  4. Data Manipulation using Pandas
    • Pandas Data Structures - Series and DataFrame
    • Important Pandas Functions (vs. R functions)
    • Examples - Data analysis with Pandas
  5. Data Science with Python
    • Logistic Regression
    • Decision Tree
    • Random Forest
    • Grid Search - Hyper Parameter Tuning
    • Cross Validation
    • Preprocessing Steps

Python 2.7 vs 3.6

Google yields thousands of articles on this topic. Some bloggers are opposed to Python 2.7 and some are in favor of it. If you filter your search to recent articles only (late 2016 onwards), you will see that the majority of bloggers favor Python 3.6. See the following reasons to support Python 3.6.

1. The official end-of-support date for Python 2.7 is the year 2020. After that there will be no support from the community, so it does not make sense to start learning 2.7 today.

2. Python 3.6 supports 95% of top 360 python packages and almost 100% of top packages for data science.

What's new in Python 3.6

It is cleaner and faster, and it is a language for the future. It fixed major issues of the Python 2 series. Python 3 was first released in 2008, and robust versions of the Python 3 series have been released for 9 years.

Key Takeaway
You should go for Python 3.6. In terms of learning Python, there are no major differences between Python 2.7 and 3.6, and moving between them requires only a few adjustments. Your focus should be on learning Python as a language.

Python for Data Science

Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development, etc. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. In some areas, such as the number of libraries for statistical analysis, R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of many data scientists.

There are several reasons to learn Python. Some of them are as follows -
  1. Python runs well in automating various steps of a predictive model. 
  2. Python has awesome robust libraries for machine learning, natural language processing, deep learning, big data and artificial Intelligence. 
  3. Python wins over R when it comes to deploying machine learning models in production.
  4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
  5. Python has a great online community support.
Do you know these sites are developed in Python?
  1. YouTube
  2. Instagram
  3. Reddit
  4. Dropbox
  5. Disqus

How to Install Python

There are two ways to download and install Python
  1. Download Anaconda. It comes with Python software along with preinstalled popular libraries.
  2. Download Python from its official website. You have to manually install libraries.

Recommended : Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

Coding Environments

Anaconda comes with two popular IDEs :
  1. Jupyter (Ipython) Notebook
  2. Spyder
Spyder. It is like RStudio for Python. It provides a user-friendly environment for writing Python code. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs and a console to check each and every line of code. Under the 'Variable explorer', you can inspect the data objects and functions you have created. I highly recommend Spyder!
Spyder - Python Coding Environment
Jupyter (Ipython) Notebook

Jupyter is the equivalent of R Markdown / knitr in R. It is useful when you need to present your work to others or create a step-by-step project report, as it can combine code, output, text, and graphics.

Spyder Shortcut Keys

The following is a list of some useful Spyder shortcut keys that make you more productive.
  1. Press F5 to run the entire script
  2. Press F9 to run selection or line 
  3. Press Ctrl + 1 to comment / uncomment
  4. Place the cursor at a function name and press Ctrl + I to see the documentation of the function
  5. Run %reset -f to clean workspace
  6. Ctrl + Left click on object to see source code 
  7. Ctrl+Enter executes the current cell.
  8. Shift+Enter executes the current cell and advances the cursor to the next cell

List of arithmetic operators with examples

Arithmetic Operator | Operation            | Example
+                   | Addition             | 10 + 2 = 12
-                   | Subtraction          | 10 - 2 = 8
*                   | Multiplication       | 10 * 2 = 20
/                   | Division             | 10 / 2 = 5.0
%                   | Modulus (Remainder)  | 10 % 3 = 1
**                  | Power                | 10 ** 2 = 100
//                  | Floor Division       | 17 // 3 = 5
(x + (d-1)) // d    | Ceiling              | (17 + (3-1)) // 3 = 6

Basic Programs

Example 1
#Basics
x = 10
y = 3
print("10 divided by 3 is", x/y)
print("remainder after 10 divided by 3 is", x%y)
Result :
10 divided by 3 is 3.3333333333333335
remainder after 10 divided by 3 is 1

Example 2
x = 100
x > 80 and x <= 95
Out[45]: False
x > 35 or x < 60
Out[46]: True
Operator | Description                           | Example
>        | Greater than                          | 5 > 3 returns True
<        | Less than                             | 5 < 3 returns False
>=       | Greater than or equal to              | 5 >= 3 returns True
<=       | Less than or equal to                 | 5 <= 3 returns False
==       | Equal to                              | 5 == 3 returns False
!=       | Not equal to                          | 5 != 3 returns True
and      | Check both conditions                 | x > 18 and x <= 35
or       | True if at least one condition holds  | x > 35 or x < 60
not      | Opposite of the condition             | not(x > 7)

Assignment Operators

It is used to assign a value to the declared variable. For e.g. x += 25 means x = x +25.
x = 100
y = 10
x += y
print(x)
110
In this case, x+=y implies x=x+y which is x = 100 + 10.
Similarly, you can use x-=y, x*=y and x /=y

Python Data Structure

In every programming language, it is important to understand the data structures. Following are some data structures used in Python.

1. List

It is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples of lists below. The first is an integer list, the second a string list containing only string values, and the third a mixed list containing integer, string and float values.
  1. x = [1, 2, 3, 4, 5]
  2. y = ['A', 'O', 'G', 'M']
  3. z = ['A', 4, 5.1, 'M']
Get List Item

We can extract list items using indexes. Indexes start from 0 and end at (number of elements - 1).
x = [1, 2, 3, 4, 5]
x[0]
Out[68]: 1

x[1]
Out[69]: 2

x[4]
Out[70]: 5

x[-1]
Out[71]: 5

x[-2]
Out[72]: 4

x[0] picks the first element of the list. A negative index tells Python to count from the right, so x[-1] selects the last element of the list.

You can select multiple elements from a list using the following method
x[:3] returns [1, 2, 3]

2. Tuple

A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows -
  1. A tuple cannot be changed once created, whereas a list can be modified.
  2. A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ].
Examples
K = (1,2,3)
City = ('Delhi','Mumbai','Bangalore')
Perform for loop on Tuple
for i in City:
    print(i)
Delhi
Mumbai
Bangalore

Functions

Like print(), you can create your own custom functions, also called user-defined functions. They help you automate repetitive tasks and reuse code more easily.

Rules to define a function
  1. A function starts with the def keyword followed by the function name and ( )
  2. The function body starts after a colon (:) and is indented
  3. The return keyword ends the function and returns the value of the expression that follows it.
def sum_fun(a, b):
    result = a + b
    return result 
z = sum_fun(10, 15)
Result : z = 25

Suppose you want python to assume 0 as default value if no value is specified for parameter b.
def sum_fun(a, b=0):
    result = a + b
    return result
z = sum_fun(10)
In the above function, b is set to 0 if no value is provided for it. This does not mean b can take no value other than 0; you can still call it as z = sum_fun(10, 15).

Conditional Statements (if else)

Conditional statements are commonly used in coding. An IF-ELSE statement can be read as: "if a condition holds true, execute something; else execute something else".

Note : The if and else statements end with a colon :

Example
k = 27
if k%5 == 0:
  print('Multiple of 5')
else:
  print('Not a Multiple of 5')
Result : Not a Multiple of 5

Popular python packages for Data Analysis & Visualization

Some of the leading packages in Python along with equivalent libraries in R are as follows-
  1. pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
  2. NumPy. For numerical computing. It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly analogous to the Rcpp package in R, which eliminates the limitation of slow speed in R.
  3. SciPy. For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on NumPy.
  4. Scikit-learn. A collection of machine learning algorithms. It is built on NumPy and SciPy. It can perform all the techniques that can be done in R using the glm, knn, randomForest, rpart and e1071 packages.
  5. Matplotlib. For data visualization. It's the leading package for graphics in Python. It is equivalent to the ggplot2 package in R.
  6. Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, impute missing values, run statistical tests and export table output to HTML.
  7. pandasql. It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
Most of the above packages come preinstalled with Anaconda.
    Comparison of Python and R Packages by Data Mining Task

Task                | Python Package            | R Package
IDE                 | Rodeo / Spyder            | RStudio
Data Manipulation   | pandas                    | dplyr and reshape2
Machine Learning    | Scikit-learn              | glm, knn, randomForest, rpart, e1071
Data Visualization  | ggplot + seaborn + bokeh  | ggplot2
Character Functions | Built-in functions        | stringr
Reproducibility     | Jupyter                   | Knitr
SQL Queries         | pandasql                  | sqldf
Working with Dates  | datetime                  | lubridate
Web Scraping        | beautifulsoup             | rvest

    Popular Python Commands

    The commands below would help you to install and update new and existing packages. Let's say, you want to install / uninstall pandas package.

    Install Package
    !pip install pandas

    Uninstall Package
    !pip uninstall pandas

    Show Information about Installed Package
    !pip show pandas

    List of Installed Packages
    !pip list

    Upgrade a package
    !pip install --upgrade pandas

      How to import a package

      There are multiple ways to import a package in Python. It is important to understand the difference between these styles.

1. import pandas as pd
It imports the package pandas under the alias pd. The DataFrame function from pandas is then called as pd.DataFrame.

2. import pandas
It imports the package without an alias, so the DataFrame function must be called with the full package name: pandas.DataFrame.

3. from pandas import *
It imports the whole package, and the DataFrame function is called simply by typing DataFrame. This can create confusion when the same function name exists in more than one package.

      Pandas Data Structures : Series and DataFrame

      In pandas package, there are two data structures - series and dataframe. These structures are explained below in detail -
      1. Series is a one-dimensional array. You can access individual elements of a series using position. It's similar to vector in R.
      In the example below, we are generating 5 random values.
import numpy as np
import pandas as pd
s1 = pd.Series(np.random.randn(5))
s1
0   -2.412015
1   -0.451752
2    1.174207
3    0.766348
4   -0.361815
dtype: float64

      Extract first and second value

      You can get a particular element of a series using index value. See the examples below -

      s1[0]
      -2.412015
      s1[1]
      -0.451752
      s1[:3]
      0   -2.412015
      1 -0.451752
      2 1.174207

      2. DataFrame

It is equivalent to data.frame in R. It is a 2-dimensional data structure that can store data of different types such as characters, integers, floating point values and factors. Those who are well-versed in MS Excel can think of a data frame as an Excel spreadsheet.

      Comparison of Data Type in Python and Pandas

      The following table shows how Python and pandas package stores data.

Data Type                          | Pandas     | Standard Python
Character variable                 | object     | string
Categorical variable               | category   | -
Numeric variable without decimals  | int64      | int
Numeric variable with decimals     | float64    | float
Date-time variable                 | datetime64 | -

      Important Pandas Functions

The table below compares pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorise pandas functions and is very handy information for programmers who are new to Python. It includes solutions for most of the frequently used data exploration tasks.

Function                           | R                         | Python (pandas package)
Installing a package               | install.packages('name')  | !pip install name
Loading a package                  | library(name)             | import name as other_name
Checking working directory         | getwd()                   | import os; os.getcwd()
Setting working directory          | setwd()                   | os.chdir()
List files in a directory          | dir()                     | os.listdir()
Remove an object                   | rm('name')                | del object
Select variables                   | select(df, x1, x2)        | df[['x1', 'x2']]
Drop variables                     | select(df, -(x1:x2))      | df.drop(['x1', 'x2'], axis = 1)
Filter data                        | filter(df, x1 >= 100)     | df.query('x1 >= 100')
Structure of a data frame          | str(df)                   | df.info()
Summarize data frame               | summary(df)               | df.describe()
Get row names of data frame "df"   | rownames(df)              | df.index
Get column names                   | colnames(df)              | df.columns
View top N rows                    | head(df, N)               | df.head(N)
View bottom N rows                 | tail(df, N)               | df.tail(N)
Get dimension of data frame        | dim(df)                   | df.shape
Get number of rows                 | nrow(df)                  | df.shape[0]
Get number of columns              | ncol(df)                  | df.shape[1]
Length of data frame               | length(df)                | len(df)
Get random 3 rows from data frame  | sample_n(df, 3)           | df.sample(n=3)
Get random 10% rows                | sample_frac(df, 0.1)      | df.sample(frac=0.1)
Check missing values               | is.na(df$x)               | pd.isnull(df.x)
Sorting                            | arrange(df, x1, x2)       | df.sort_values(['x1', 'x2'])
Rename variables                   | rename(df, newvar = x1)   | df.rename(columns={'x1': 'newvar'})


      Data Manipulation with pandas - Examples

      1. Import Required Packages

      You can import required packages using import statement. In the syntax below, we are asking Python to import numpy and pandas package. The 'as' is used to alias package name.
      import numpy as np
      import pandas as pd

      2. Build DataFrame

      We can build dataframe using DataFrame() function of pandas package.
      mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
              'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
              'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
      df = pd.DataFrame(mydata)
       In this dataframe, we have three variables - productcode, sales, cost.
      Sample DataFrame

      To import data from CSV file


      You can use read_csv() function from pandas package to get data into python from CSV file.
      mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
      Make sure you use double backslash when specifying path of CSV file. Alternatively, you can use forward slash to mention file path inside read_csv() function.

      Detailed Tutorial : Import Data in Python

      3. To see number of rows and columns

      You can run the command below to find out number of rows and columns.
      df.shape
       Result : (6, 3). It means 6 rows and 3 columns.

      4. To view first 3 rows

      The df.head(N) function can be used to check out first some N rows.
      df.head(3)
     cost productcode   sales
0  1020.0          AA  1010.0
1  1625.2          AA  1025.2
2  1204.0          AA  1404.2

      5. Select or Drop Variables

      To keep a single variable, you can write in any of the following three methods -
      df.productcode
      df["productcode"]
      df.loc[: , "productcode"]
      To select variable by column position, you can use df.iloc function. In the example below, we are selecting second column. Column Index starts from 0. Hence, 1 refers to second column.
      df.iloc[: , 1]
We can keep multiple variables by specifying the desired variables inside [ ]. We can also make use of the df.loc[] indexer.
      df[["productcode", "cost"]]
      df.loc[ : , ["productcode", "cost"]]

      Drop Variable

      We can remove variables by using df.drop() function. See the example below -
      df2 = df.drop(['sales'], axis = 1)

      6. To summarize data frame

      To summarize or explore data, you can submit the command below.
      df.describe()
              cost       sales
count     6.000000     6.00000
mean   1166.150000  1242.65000
std     237.926793   230.46669
min    1003.700000  1010.00000
25%    1020.000000  1058.90000
50%    1072.000000  1205.85000
75%    1184.000000  1366.07500
max    1625.200000  1604.80000

      To summarise all the character variables, you can use the following script.
      df.describe(include=['O'])
      Similarly, you can use df.describe(include=['float64']) to view summary of all the numeric variables with decimals.

      To select only a particular variable, you can write the following code -
      df.productcode.describe()
      OR
      df["productcode"].describe()
count      6
unique     2
top       BB
freq       3
Name: productcode, dtype: object

      7. To calculate summary statistics

      We can manually find out summary statistics such as count, mean, median by using commands below
      df.sales.mean()
      df.sales.median()
      df.sales.count()
      df.sales.min()
      df.sales.max()

      8. Filter Data

      Suppose you are asked to apply condition - productcode is equal to "AA" and sales greater than or equal to 1250.
      df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
      It can also be written like :
      df1 = df.query('(productcode == "AA") & (sales >= 1250)')
      In the second query, we do not need to specify DataFrame along with variable name.

      9. Sort Data

In the code below, we arrange the data in ascending order by sales.
      df.sort_values(['sales'])

      10.  Group By : Summary by Grouping Variable

      Like SQL GROUP BY, you want to summarize continuous variable by classification variable. In this case, we are calculating average sale and cost by product code.
      df.groupby(df.productcode).mean()
                    cost        sales
productcode
AA           1283.066667  1146.466667
BB           1049.233333  1338.833333
Instead of summarising multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.
      df["sales"].groupby(df.productcode).mean()

      11. Define Categorical Variable

      Let's create a classification variable - id which contains only 3 unique values - 1/2/3.
      df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
      Let's define as a categorical variable.
      We can use astype() function to make id as a categorical variable.
      df0.id = df0["id"].astype('category')
      Summarize this classification variable to check descriptive statistics.
      df0.describe()
        id
count    7
unique   3
top      2
freq     3

      Frequency Distribution

You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
      df['productcode'].value_counts()
BB    3
AA    3

      12. Generate Histogram

A histogram is one of the methods to check the distribution of a continuous variable. In the figure shown below, there are two values for the variable 'sales' in the range 1000-1100; in the remaining intervals there is only a single value. In this case, there are only a handful of values, but if you have a large dataset you can plot a histogram to identify outliers in a continuous variable.
      df['sales'].hist()
      Histogram

      13. BoxPlot

      Boxplot is a method to visualize continuous or numeric variable. It shows minimum, Q1, Q2, Q3, IQR, maximum value in a single graph.
      df.boxplot(column='sales')
      BoxPlot

      Data Science using Python - Examples

      In this section, we cover how to perform data mining and machine learning algorithms with Python. sklearn is the most frequently used library for running data mining and machine learning algorithms. We will also cover statsmodels library for regression techniques. statsmodels library generates formattable output which can be used further in project report and presentation.

1. Import the required libraries

      Import the following libraries before reading or exploring data
      #Import required libraries
      import pandas as pd
      import statsmodels.api as sm
      import numpy as np

      2. Download and import data into Python

Using pandas, we can easily read data from the web into Python.
      # Read data from web
      df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
Variable | Type        | Description
gre      | Continuous  | Graduate Record Exam score
gpa      | Continuous  | Grade Point Average
rank     | Categorical | Prestige of the undergraduate institution
admit    | Binary      | Admission into graduate school

      The binary variable admit is a target variable.

      3. Explore Data

      Let's explore data. We'll answer the following questions -
      1. How many rows and columns in the data file?
      2. What are the distribution of variables?
      3. Check if any outlier(s)
      4. If outlier(s), treat them
      5. Check if any missing value(s)
      6. Impute Missing values (if any)
      # See no. of rows and columns
      df.shape
      Result : 400 rows and 4 columns

In the code below, we rename the variable rank to 'position', as rank is already the name of a pandas method.
      # rename rank column
      df = df.rename(columns={'rank': 'position'}) 
      Summarize and plot all the columns.
      # Summarize
      df.describe()
      # plot all of the columns
      df.hist()
      Categorical variable Analysis

      It is important to check the frequency distribution of categorical variable. It helps to answer the question whether data is skewed.
      # Summarize
      df.position.value_counts(ascending=True)
1     61
4     67
3    121
2    151

      Generating Crosstab 

By looking at the cross-tabulation report, we can check whether we have enough events for each unique value of the categorical variable.
      pd.crosstab(df['admit'], df['position'])
position   1   2   3   4
admit
0         28  97  93  55
1         33  54  28  12

      Number of Missing Values

      We can write a simple loop to figure out the number of blank values in all variables in a dataset.
      for i in list(df.columns) :
          k = sum(pd.isnull(df[i]))
          print(i, k)
      In this case, there are no missing values in the dataset.

      4. Logistic Regression Model

Logistic regression is a special type of regression where the target variable is categorical and the independent variables can be discrete or continuous. In this post, we demonstrate only binary logistic regression, where the target variable takes just two values. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.

      In python, we can write R-style model formula y ~ x1 + x2 + x3 using  patsy and statsmodels libraries. In the formula, we need to define variable 'position' as a categorical variable by mentioning it inside capital C(). You can also define reference category using reference= option.
      #Reference Category
      from patsy import dmatrices, Treatment
      y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')
It returns two datasets - X and y. The dataset 'y' contains the variable admit, which is the target variable. The other dataset 'X' contains the Intercept (constant value), the dummy variables for position, and gre and gpa. Since 4 is set as the reference category, it is coded 0 on all three dummy variables. See the sample below -
P    P_1  P_2  P_3
3     0    0    1
3     0    0    1
1     1    0    0
4     0    0    0
4     0    0    0
2     0    1    0


      Split Data into two parts

80% of the data goes to the training dataset, which is used for building the model, and 20% goes to the test dataset, which is used for validating the model.
      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      Build Logistic Regression Model

By default, regression without the formula style does not include an intercept. In our case, the Intercept column was already added to X_train by dmatrices, so it is used as a predictor.
      #Fit Logit model
      logit = sm.Logit(y_train, X_train)
      result = logit.fit()

      #Summary of Logistic regression model
      result.summary()
      result.params
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  320
Model:                          Logit   Df Residuals:                      315
Method:                           MLE   Df Model:                            4
Date:                Sat, 20 May 2017   Pseudo R-squ.:                 0.03399
Time:                        19:57:24   Log-Likelihood:                -193.49
converged:                       True   LL-Null:                       -200.30
                                        LLR p-value:                  0.008627
=======================================================================================
                       coef    std err          z      P>|z|     [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
C(position)[T.1]     1.4933      0.440      3.392      0.001        0.630     2.356
C(position)[T.2]     0.6771      0.373      1.813      0.070       -0.055     1.409
C(position)[T.3]     0.1071      0.410      0.261      0.794       -0.696     0.910
gre                  0.0005      0.001      0.442      0.659       -0.002     0.003
gpa                  0.4613      0.214     -2.152      0.031       -0.881    -0.041
=======================================================================================

      Confusion Matrix and Odd Ratio

The odds ratio is the exponential of the parameter estimates.
      #Confusion Matrix
      result.pred_table()
      #Odd Ratio
      np.exp(result.params)

      Prediction on Test Data
In this step, we take the estimates of the logit model built on the training data and apply them to the test data.
      #prediction on test data
      y_pred = result.predict(X_test)

      Calculate Area under Curve (ROC)
# AUC on test data
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.6763

      Calculate Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, [1 if p > 0.5 else 0 for p in y_pred])

      Decision Tree Model

Decision trees can have either a continuous or a categorical target variable. When the target is continuous, the tree is called a regression tree; when it is categorical, it is called a classification tree. At each step the algorithm selects the variable that best splits the data, using criteria such as Gini, entropy / information gain (as in C4.5) or chi-square. Decision trees have several advantages: they are simple to use and easy to understand, require very few data preparation steps, can handle mixed data (both categorical and continuous variables), and are very fast.

      #Drop Intercept from predictors for tree algorithms
      X_train = X_train.drop(['Intercept'], axis = 1)
      X_test = X_test.drop(['Intercept'], axis = 1)

      #Decision Tree
      from sklearn.tree import DecisionTreeClassifier
      model_tree = DecisionTreeClassifier(max_depth=7)

      #Fit the model:
      model_tree.fit(X_train,y_train)

      #Make predictions on test set
      predictions_tree = model_tree.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.664

      Important Note
Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select the best hyper-parameters by using the grid search tuning technique.

      Random Forest Model

A decision tree has the limitation of overfitting, which means it does not generalize well and is very sensitive to small changes in the training data. Random forest overcomes this problem by growing a large number of trees on randomised (bootstrapped) samples of the data and selecting a random subset of variables to grow each tree. It is a more robust algorithm than a single decision tree and one of the most popular machine learning algorithms: it is commonly used in data science competitions, regularly ranked among the top algorithms, and has become a part of every data science toolkit.

      #Random Forest
      from sklearn.ensemble import RandomForestClassifier
      model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

      #Fit the model:
      target = y_train['admit']
      model_rf.fit(X_train,target)

      #Make predictions on test set
      predictions_rf = model_rf.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      #Variable Importance
      importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
      print(importances)
      importances.plot.bar()

      Result : AUC = 0.6974

      Grid Search - Hyper Parameters Tuning

The sklearn library makes hyper-parameter tuning very easy. Grid search is a strategy for selecting the best parameters for an algorithm; in scikit-learn these are passed as arguments to the constructor of the estimator classes, for example max_features in random forest or alpha for lasso.

      from sklearn.model_selection import GridSearchCV
      rf = RandomForestClassifier()
      target = y_train['admit']

      param_grid = {
      'n_estimators': [100, 200, 300],
      'max_features': ['sqrt', 3, 4]
      }

      CV_rfc = GridSearchCV(estimator=rf , param_grid=param_grid, cv= 5, scoring='roc_auc')
      CV_rfc.fit(X_train,target)

      #Parameters with Scores
CV_rfc.cv_results_

      #Best Parameters
      CV_rfc.best_params_
      CV_rfc.best_estimator_

      #Make predictions on test set
      predictions_rf = CV_rfc.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      Cross Validation
      # Cross Validation
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_predict,cross_val_score
      target = y['admit']
      prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
      #AUC
      cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')
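The cross-validated probabilities stored in prediction_logit above are not scored in the original code. One way to turn them into a single AUC value (a small addition, assuming the same objects as above):

#AUC from the cross-validated probabilities
from sklearn.metrics import roc_auc_score
print(roc_auc_score(target, prediction_logit[:, 1]))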

      Data Mining : PreProcessing Steps

1. The machine learning package sklearn requires all categorical variables to be in numeric form. Hence, we need to convert all character/categorical variables to numeric. This can be accomplished with the following script, which uses sklearn's built-in LabelEncoder.

      from sklearn.preprocessing import LabelEncoder
def ConverttoNumeric(df):
    # select character / categorical columns
    cols = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for i in cols:
        try:
            df[i] = le.fit_transform(df[i])
        except:
            print('Error in Variable :' + i)
    return df

      ConverttoNumeric(mydf)
      Encoding

      2. Impute Missing Values

Imputing missing values is an important step of predictive modeling. In many algorithms, if missing values are not filled, the complete row is dropped. If the data contain a lot of missing values, this can lead to a huge loss of data. There are multiple ways to impute missing values; common techniques are to replace a missing value with the mean, median or zero. It makes sense to replace a missing value with 0 when 0 is meaningful, for example a flag indicating whether a customer holds a credit card product.
      # fill missing values with 0
      df['var1'] = df['var1'].fillna(0)
      # fill missing values with mean
      df['var1'] = df['var1'].fillna(df['var1'].mean())
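Median imputation is mentioned above but not shown; the analogous pandas call would be (assuming the same df and var1 as in the examples above):

# fill missing values with median
df['var1'] = df['var1'].fillna(df['var1'].median())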

      3. Outlier Treatment

      There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows -

1. Cap extreme values at the 95th / 99th percentile, depending on the distribution (a capping sketch is shown after the log example below)
2. Apply a log transformation to the variable. See below the implementation of log transformation in Python.

      import numpy as np
      df['var1'] = np.log(df['var1'])
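A minimal sketch of method 1 (percentile capping), assuming the same df and var1 as above; the 1st and 99th percentiles are used here purely as an example:

# cap (winsorize) var1 at the 1st and 99th percentiles
lower, upper = df['var1'].quantile([0.01, 0.99])
df['var1'] = df['var1'].clip(lower=lower, upper=upper)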
      Next Steps

Practice, practice and practice. Download free public data sets from the Kaggle / UCLA websites, play around with the data, generate insights with the pandas package and build statistical models using the sklearn package. I hope you found this tutorial helpful. I tried to cover all the important topics a beginner must know about Python. After completing this tutorial, you can confidently say you know how to program in Python and how to implement machine learning algorithms using the sklearn package.

      Feature Selection : Select Important Variables with Boruta Package

This article explains how to select important variables using the boruta package in R. Variable selection is an important step in a predictive modeling project. It is also called 'feature selection'. Every private and public agency has started tracking data and collecting information on various attributes, which results in access to far too many potential predictors for a predictive model. But not every variable is important for predicting a particular outcome, so it is essential to identify the important variables and remove the redundant ones. Before building a predictive model, the exact list of important variables that yields an accurate and robust model is generally not known.

      Why Variable Selection is important?
1. Removing a redundant variable helps to improve accuracy. Similarly, including a relevant variable has a positive effect on model accuracy.
2. Too many variables might result in overfitting, which means the model is not able to generalize the pattern.
3. Too many variables slow down computation, which in turn requires more memory and hardware.

      Why Boruta Package?

There are a lot of packages for feature selection in R. The obvious question is "What makes the boruta package so special?" See the following reasons to use the boruta package for feature selection.
1. It works well for both classification and regression problems.
2. It takes into account multi-variable relationships.
3. It is an improvement on the random forest variable importance measure, which is a very popular method for variable selection.
4. It follows an all-relevant variable selection method: it considers all features that are relevant to the outcome variable. Most other variable selection algorithms follow a minimal-optimal method, relying on a small subset of features that yields a minimal error on a chosen classifier.
5. It can handle interactions between variables.
6. It can deal with the fluctuating nature of a random forest importance measure.
      Boruta Package

      Basic Idea of Boruta Algorithm
Shuffle the values of the predictors, join these shuffled copies to the original predictors and build a random forest on the merged dataset. Then compare the importance of the original variables with that of the randomised (shadow) variables. Only variables whose importance is higher than that of the randomised variables are considered important.

      How Boruta Algorithm Works

Follow the steps below to understand the algorithm (a minimal sketch of the idea is given after the list) -
1. Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using the existing variables.
2. Shuffle the values of the added duplicate copies to remove their correlation with the target variable. These are called shadow features or permuted copies.
3. Combine the original variables with their shuffled copies.
4. Run a random forest classifier on the combined dataset and compute a variable importance measure (the default is Mean Decrease Accuracy) for each variable, where higher means more important.
5. Compute the Z score, i.e. the mean accuracy loss divided by the standard deviation of the accuracy loss.
6. Find the maximum Z score among the shadow attributes (MZSA).
7. Tag variables as 'unimportant' when their importance is significantly lower than MZSA, and permanently remove them from the process.
8. Tag variables as 'important' when their importance is significantly higher than MZSA.
9. Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either 'unimportant' or 'important', whichever comes first.
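The article itself uses the R Boruta package below. Purely to illustrate the shadow-feature idea in the steps above, here is a minimal, single-iteration sketch in Python; it uses scikit-learn's impurity-based importances rather than Boruta's mean-decrease-accuracy Z scores, and it assumes X is a DataFrame of numeric predictors and y is the target:

# Simplified illustration of the shadow-feature idea (not the Boruta package itself)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(123)
shadow = X.apply(lambda col: rng.permutation(col.values))   # shuffled copy of each predictor
shadow.columns = ['shadow_' + c for c in X.columns]
combined = pd.concat([X, shadow], axis=1)                   # originals + shadow features

rf = RandomForestClassifier(n_estimators=500, random_state=123).fit(combined, y)
imp = pd.Series(rf.feature_importances_, index=combined.columns)

max_shadow = imp[shadow.columns].max()                      # analogue of MZSA
important = imp[X.columns][imp[X.columns] > max_shadow].index.tolist()
print(important)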


      Difference between Boruta and Random Forest Importance Measure

When I first learnt this algorithm, the question 'RF importance measure vs. Boruta' puzzled me for hours. After reading a lot about it, I figured out the exact difference between these two variable selection approaches.

In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation, and it is used as the importance measure for all variables. However, this Z score cannot be used directly as a measure of variable importance, because it is not directly related to the statistical significance of that importance. To work around this problem, the boruta package runs random forest on both the original and the randomised (shadow) attributes and computes the importance of all variables. Since the whole process depends on permuted copies, the random permutation procedure is repeated to get statistically robust results.


      Is Boruta a solution for all?
The answer is NO. You need to test other algorithms as well. It is not possible to judge the best algorithm without knowing the data and its assumptions. Since Boruta is an improvement on the random forest variable importance measure, it should work well most of the time.

      What is shuffled feature or permuted copies?

It simply means changing the order of the values of a variable. See the practical example below -
      set.seed(123)
      mydata = data.frame(var1 = 1 : 6, var2=runif(6))
      shuffle = data.frame(apply(mydata,2,sample))
      head(cbind(mydata, shuffle))

      Original Shuffled
      var1 var2 var1 var2
      1 1 0.2875775 4 0.9404673
      2 2 0.7883051 5 0.4089769
      3 3 0.4089769 3 0.2875775
      4 4 0.8830174 2 0.0455565
      5 5 0.9404673 6 0.8830174
      6 6 0.0455565 1 0.7883051

      R : Feature Selection with Boruta Package

      1. Get Data into R

      The read.csv() function is used to read data from CSV and import it into R environment.
      #Read data
      df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
      2. List of variables
      #Column Names
      names(df)
Result : "admit" "gre" "gpa" "rank"

      3. Define categorical variables
      df$admit = as.factor(df$admit)
      df$rank = as.factor(df$rank)

      4. Explore Data
#Summarize Data
summary(df)
 admit       gre             gpa           rank
 0:273   Min.   :220.0   Min.   :2.260   1: 61
 1:127   1st Qu.:520.0   1st Qu.:3.130   2:151
         Median :580.0   Median :3.395   3:121
         Mean   :587.7   Mean   :3.390   4: 67
         3rd Qu.:660.0   3rd Qu.:3.670
         Max.   :800.0   Max.   :4.000

#Check number of missing values
sapply(df, function(y) sum(is.na(y)))

      No missing values in the dataframe df.

      Handle Missing Values
In this dataset, we have no missing values. If missing values exist in your dataset, you need to impute them before running the boruta package.
      5. Run Boruta Algorithm
      #Install and load Boruta package
      install.packages("Boruta")
      library(Boruta)

      # Run Boruta Algorithm
      set.seed(456)
      boruta <- Boruta(admit~., data = df, doTrace = 2)
      print(boruta)
      plot(boruta)

      Boruta performed 9 iterations in 4.870027 secs.
      3 attributes confirmed important: gpa, gre, rank;
      No attributes deemed unimportant.

It shows that all three variables are considered important and none is tagged 'unimportant'. The plot() function shows box plots of all the attributes plus the minimum, mean and maximum shadow scores. Attributes with green box plots are confirmed important, red box plots indicate rejected attributes, and yellow box plots indicate tentative attributes.
Tentative attributes have importance scores so close to their best shadow attribute that Boruta is unable to make a decision within the default number of random forest runs.
      Box Plot - Variable Selection
As you can see above, the label of shadowMean is not displayed because it got truncated due to insufficient space. To fix this problem, run the following program.
      plot(boruta, xlab = "", xaxt = "n")
      k <-lapply(1:ncol(boruta$ImpHistory),function(i)
        boruta$ImpHistory[is.finite(boruta$ImpHistory[,i]),i])
      names(k) <- colnames(boruta$ImpHistory)
      Labels <- sort(sapply(k,median))
      axis(side = 1,las=2,labels = names(Labels),
             at = 1:ncol(boruta$ImpHistory), cex.axis = 0.7)

      Let's add some irrelevant data to our original dataset

This is to check whether the boruta package is able to identify unimportant variables. In the following program, we create duplicate copies of the original 3 predictors and then randomise the order of values within these copies.
      #Add some random permuted data
      set.seed(777)
      df.new<-data.frame(df,apply(df[,-1],2,sample))
      names(df.new)[5:7]<-paste("Random",1:3,sep="")
      df.new$Random1 = as.numeric(as.character(df.new$Random1))
      df.new$Random2 = as.numeric(as.character(df.new$Random2))
      > head(df.new)
      admit gre gpa rank Random1 Random2 Random3
      1 0 380 3.61 3 600 3.76 4
      2 1 660 3.67 3 660 3.30 4
      3 1 800 4.00 1 700 3.37 2
      4 1 640 3.19 4 620 3.33 3
      5 0 520 2.93 4 600 3.04 2
      6 1 760 3.00 2 520 3.64 4

      Run Boruta Algorithm
      set.seed(456)
      boruta2 <- Boruta(admit~., data = df.new, doTrace = 1)
      print(boruta2)
      plot(boruta2)
      Boruta performed 55 iterations in 21.79995 secs.
      3 attributes confirmed important: gpa, gre, rank;
      3 attributes confirmed unimportant: Random1, Random2, Random3;

The irrelevant variables we added to the dataset came out as unimportant as per the boruta algorithm.
      > attStats(boruta2)
      meanImp medianImp minImp maxImp normHits decision
      gre 5.56458881 5.80124786 2.347609 8.410490 0.90909091 Confirmed
      gpa 9.66289180 9.37140347 6.818527 13.405592 1.00000000 Confirmed
      rank 10.16762154 10.22875211 6.173894 15.235444 1.00000000 Confirmed
      Random1 0.05986751 0.18360283 -1.281078 2.219137 0.00000000 Rejected
      Random2 1.15927054 1.35728128 -2.779228 3.816915 0.29090909 Rejected
      Random3 0.05281551 -0.02874847 -3.126645 3.219810 0.05454545 Rejected

      To save a final list of important variables in a vector, use getSelectedAttributes() function.
      #See list of finalvars
      finalvars = getSelectedAttributes(boruta2, withTentative = F)
      [1] "gre" "gpa" "rank"

In case you get tentative attributes in your dataset, you need to treat them; in this dataset we did not get any. The following function compares the median Z score of each tentative variable with the median Z score of the best shadow attribute and then decides whether the attribute should be confirmed or rejected.
      Tentative.boruta <- TentativeRoughFix(boruta2)
      List of parameters used in Boruta
      1. maxRuns: maximal number of random forest runs. Default is 100.
      2. doTrace: It refers to verbosity level. 0 means no tracing. 1 means reporting attribute decision as soon as it is cleared. 2 means all of 1 plus reporting each iteration. Default is 0.
      3. getImp : function used to obtain attribute importance. The default is getImpRfZ, which runs random forest from the ranger package and gathers Z-scores of mean decrease accuracy measure.
      4. holdHistory: The full history of importance runs is stored if set to TRUE (Default).

        Compare Boruta with RFE Algorithm

        In caret, there is a variable selection algorithm called recursive feature elimination (RFE). It is also called backward selection. A brief explanation of the algorithm is given below -
        1. Fit the model using all independent variables.
        2. Calculate variable importance of all the variables.
        3. Each independent variable is ranked using its importance to the model.
4. Drop the weakest (worst-ranked) variable, build a model using the remaining variables and calculate model accuracy.
5. Repeat step 4 until all variables have been used.
6. Variables are then ranked according to when they were dropped.
7. For regression, RMSE and R-Squared are used as the metrics. For classification, it is 'Accuracy' and 'Kappa'.
In the code below, we are building a random forest model within the RFE algorithm. The option rfFuncs specifies random forest.
        library(caret)
        library(randomForest)
        set.seed(456)
        control <- rfeControl(functions=rfFuncs, method="cv", number=10)
        rfe <- rfe(df.new[,2:7], df.new[,1], rfeControl=control)
        print(rfe, top=10)
        plot(rfe, type=c("g", "o"), cex = 1.0)
        predictors(rfe)
        head(rfe$resample, 10)
        Outer resampling method: Cross-Validated (10 fold) 

        Resampling performance over subset size:

        Variables Accuracy Kappa AccuracySD KappaSD Selected
        4 0.6477 0.1053 0.07009 0.1665
        6 0.7076 0.2301 0.06285 0.1580 *

        The top 6 variables (out of 6):
        gpa, rank, gre, Random2, Random3, Random1

        RFE - Variable Selection
In this case, the RFE algorithm retained all the variables based on model accuracy. Compared to RFE, the final variables from Boruta make more sense in terms of interpretation. It all depends on the data and the distribution of its variables. As analysts, we should explore both techniques and see which one works better for the dataset at hand. There are many packages in R for variable selection, and every technique has pros and cons.

        The following functions can be used for model fitting in RFE selections
        1. linear regression (lmFuncs)
        2. random forests (rfFuncs)
        3. naive Bayes (nbFuncs)
        4. bagged trees (treebagFuncs)

        Does Boruta handle multicollinearity?

Multicollinearity means high correlation between independent variables. The absence of severe multicollinearity is an important assumption in linear and logistic regression models; when it is present, the coefficient estimates become unstable, with inflated standard errors. Let's check whether the boruta algorithm takes care of it. We create some sample data with 3 predictors x1-x3 and a target variable y.
        set.seed(123)
        x1 <- runif(500)
        x2 <- rnorm(500)
x3 <- x2 + rnorm(500, sd = 0.5)
        y <- x3 + runif(500) 
        cor(x2,x3)
        [1] 0.8981247

        The correlation of variables x2 and x3 is very high (close to 0.9). It means they are highly correlated. 
        mydata = data.frame(x1,x2,x3)
        Boruta(mydata, y)
        Boruta performed 9 iterations in 7.088029 secs.
         2 attributes confirmed important: x2, x3;
         1 attributes confirmed unimportant: x1;

Boruta considered both highly correlated variables to be important. This implies it does not treat collinearity while selecting important variables. This is because of the way the algorithm works.

        Important points related to Boruta
1. Impute missing values - Make sure missing or blank values are filled in before running the boruta algorithm.
2. Collinearity - It is important to handle collinearity after getting the important variables from boruta.
3. Slow speed - It is slower than other traditional feature selection algorithms.

        SAS, R & Predictive Modeling Training in Delhi NCR

        SAS Training in Delhi NCR
        We are starting our first classroom training batch from July 15, 2017 in Delhi NCR (Delhi / Gurgaon). We are offering courses on SAS , R and Predictive Modeling.
        1. Practical SAS Programming - Learning SAS by Case Studies
        2. Predictive Modeling with SAS - Modeling with Hands-on Examples plus Domain Knowledge
3. Data Science using R - Practical Data Science Course (Incl. R Programming, Data Science and Domain Knowledge)

        Practical SAS Programming
        Rs 20,000
        • Base and Advanced SAS Programming
        • Classroom Training + Videos
        •          Live Projects                  
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Limited Seats Available


        Venue : To be decided
        Weekend Classes

        Predictive Modeling using SAS
        Rs 25,000
        • Predictive Modeling + Intro to SAS Programming
        • Classroom Training + Videos
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Limited Seats Available


        Venue : To be decided
        Weekend Classes

        R Programming + Data Science with R 
        Rs 30,000
        • R Programming + Predictive Modeling
        • Classroom Training + Videos
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Limited Seats Available


        Venue : To be decided
        Weekend Classes


        Combo Deals - Spend Less, Learn More
        Get Rs 10,000 off on registration for any of the two courses
        Get Rs 20,000 off on registration for all of the three courses
        **Offer ends 15th June, 2017

        How We are different from other institutes?
Here are some of the features of ListenData that make us better than other training institutes.
        1. Explain Advanced Analytics and Machine Learning Algorithms in Simple English. We make classes more logical and understandable than just telling concepts.
        2. Practical Application of Techniques using Real-world Datasets. No sample or cleaned dataset.
        3. Domain Knowledge - It is the most important element of a predictive modeling project. People who lack in domain knowledge find it difficult to crack interviews in spite of having knowledge of predictive modeling.
        4. Hands-on Model Development and Validation Experience
        5. Strategies to implement predictive model
        6. New algorithms to solve problems efficiently
        7. Explain complex topics via visual lessons

        Who should do these courses?
        These courses are ideal for candidates who want to make a career in analytics.
        1. Any candidate pursuing graduation / post graduation or already graduate can apply for this course. No particular specialization is required prior to applying for these courses. You can be from any educational background like Engineering, Economics, Statistics, Mathematics, Commerce, Business Management, Operational Research etc.
        2. Anyone who is planning a career shift to analytics. It does not matter if you are a network engineer or financial analyst. You can go ahead with these courses as they do not require any prior knowledge of programming or statistics.

        The decline of SAS Jobs and rise of R?
I have been working in SAS for close to 7 years and have worked with 4 organizations (instability in career! :D). Whenever I look for a job change, I do not see any decline of SAS jobs in the market. It is a big hit in banks, insurance, telecom and pharmaceutical companies. SAS is still a world leader in advanced analytics. It is one of the most sought-after skills in the job market. Learning SAS will help you scale up your skills, which in turn boosts your career. List of Companies using SAS

        At the same time, R has gained popularity. It is a language of choice for data scientists. It makes advanced statistical techniques and machine learning algorithms easy to implement. It is being used as a primary tool in IT, ecommerce, startups, HR, service and product based companies and secondary tool in banks, insurance and telecom companies. List of Companies using R

        Final Comment - You should not get into language wars and should focus on learning both the languages as jobs are evolving very fast.

        Is my registration fees refundable?
It would be automatically adjusted against the total fees. In other words, you pay Rs 1000 less than the total fees. In case you want to opt out of the course, you can ask for a refund within 7 days of registration.

        Any Questions?
        Please feel free to write me at deepanshu.bhalla@outlook.com OR Join me on linkedin

        SAS & R Training



        Curriculum - Practical SAS Programming
        1. Introduction to SAS
        2. How SAS works
        3. Import Raw Data Files - Basics
        4. Import Raw Data Files - Special Cases
        5. Importing / Exporting Data with Procedures
        6. Exploring Data - Various Methods
        7. Data Subsetting
        8. Data Manipulation - Basics
        9. Data Manipulation - Intermediate
        10. Data Manipulation - Advanced
        11. Do Loops and Arrays
        12. Merging Data
        13. Appending Data
        14. Character & Numeric Functions
        15. Date Functions
        16. Reporting - Creating tabular reports
        17. Proc SQL - Part I
        18. Proc SQL - Part II
        19. Proc SQL - Part III
        20. SAS Macros - Basics
        21. SAS Macros - Intermediate
        22. SAS Macros - Advanced
        23. SAS Macros - Debugging Tips
        24. Efficient SAS Programming Tips
        25. Connect to Databases using SAS
        26. Interview Tips - Scenario Based Questions
        27. Live Project




        Curriculum - Predictive Modeling using SAS
        1. Introduction to Statistics & Modeling
        2. Marketing Analytics : Applications
        3. Predictive Modeling in Financial Services Industry
        4. Predictive Modeling in HR
        5. SAS Programming - Basics
        6. SAS Programming - Intermediate
        7. Descriptive Statistics with SAS
        8. Hypothesis Testing with SAS
        9. Correlation Analysis with SAS
        10. Steps of Predictive Modeling
        11. Data Preparation in Predictive Modeling
        12. Variable Selection Methods in Predictive Modeling
        13. Segmentation - Introduction
        14. Segmentation - Cluster Analysis : Theory
        15. Segmentation - Cluster Analysis : Data Preparation
        16. Segmentation - Cluster Analysis : k-means and Hierarchical
        17. Segmentation - Cluster Analysis : Cluster Performance
        18. Principal Component Analysis (PCA) - Theory
        19. Running and Understanding PCA with SAS
        20. Linear Regression - Theory
        21. Linear Regression - Assumptions and Treatment
        22. Linear Regression - Important Metrics
        23. Linear Regression - Variable Selection Methods
        24. Linear Regression - Model Development
        25. Linear Regression - Model Validation
        26. Linear Regression - Model Performance
        27. Linear Regression - Model Scoring
        28. Linear Regression - Model Implementation
        29. Logistic Regression - Theory
        30. Logistic Regression - Assumptions and Treatment
        31. Logistic Regression - Important Metrics
        32. Logistic Regression - Variable Selection Methods
        33. Logistic Regression - Model Development
        34. Logistic Regression - Model Validation
        35. Logistic Regression - Model Performance
        36. Logistic Regression - Model Implementation
        37. Decision Tree - How it works
        38. Decision Tree - Model Development
        39. Decision Tree - Model Validation
        40. Decision Tree - Model Performance
        41. Decision Tree - Model Implementation
        42. Time Series Forecasting - Theory
        43. Time Series Analysis with SAS
        44. Special Cases - Handle rare event model
        45. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        46. Case Studies - Customer Segmentation
        47. Case Studies - Probability of Default
        48. Case Studies - HR Drivers Analysis
        49. Case Studies - Sales Forecasting
        50. Case Studies - Time Series Forecasting
        51. Interview Tips - Common Interview Questions



        Curriculum - R Programming + Data Science with R
        1. Introduction to R
        2. Introduction to RStudio
        3. Data Structures in R
        4. Importing / Exporting Data in R
        5. Data Exploration
        6. Data Manipulation with dplyr package - Basics
        7. Data Manipulation with dplyr package - Intermediate
        8. Data Manipulation with dplyr package - Advanced
        9. Character and Numeric Functions in R
        10. Data & Time Functions in R
        11. Data Visualization in R
        12. Loops in R (Apply Family of Functions & For Loop)
        13. R Functions - Part I
        14. R Functions - Part II
        15. Introduction to Data Science
        16. Marketing Analytics : Applications
        17. Predictive Modeling in Financial Services Industry
        18. Predictive Modeling in HR
        19. Hypothesis Testing with R
        20. Correlation Analysis with R
        21. Steps of Predictive Modeling
        22. Data Preparation in Predictive Modeling
        23. Variable Selection Methods in Predictive Modeling
        24. Segmentation - Introduction
        25. Segmentation - Cluster Analysis : Theory
        26. Segmentation - Cluster Analysis : Data Preparation
        27. Segmentation - Cluster Analysis : k-means and Hierarchical
        28. Segmentation - Cluster Analysis : Cluster Performance
        29. Principal Component Analysis (PCA) - Theory
        30. Running and Understanding PCA with R
        31. Linear Regression - Theory
        32. Linear Regression - Assumptions and Treatment
        33. Linear Regression - Important Metrics
        34. Linear Regression - Variable Selection Methods
        35. Linear Regression - Model Development
        36. Linear Regression - Model Validation
        37. Linear Regression - Model Performance
        38. Linear Regression - Model Scoring
        39. Linear Regression - Model Implementation
        40. Logistic Regression - Theory
        41. Logistic Regression - Assumptions and Treatment
        42. Logistic Regression - Important Metrics
        43. Logistic Regression - Variable Selection Methods
        44. Logistic Regression - Model Development
        45. Logistic Regression - Model Validation
        46. Logistic Regression - Model Performance
        47. Logistic Regression - Model Implementation
        48. Decision Tree - How it works
        49. Decision Tree - Model Development
        50. Decision Tree - Model Validation
        51. Decision Tree - Model Performance
        52. Decision Tree - Model Implementation
        53. Machine Learning - Basics
        54. Random Forest - How it works
        55. Random Forest vs. Decision Tree
        56. Random Forest - Model Development and Validation
        57. Time Series Forecasting - Theory
        58. Time Series Analysis with R
        59. Special Cases - Handle rare event model
        60. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        61. Case Studies - Customer Segmentation
        62. Case Studies - Probability of Default
        63. Case Studies - HR Drivers Analysis
        64. Case Studies - Sales Forecasting
        65. Case Studies - Time Series Forecasting
        66. Interview Tips - Common Interview Questions

        Online SAS, R & Predictive Modeling Training

        First Online Training Batch
        We are starting our first Instructor-led online training batch from July 15, 2017. We are offering courses on SAS , R and Predictive Modeling. In this program you will get an access to live lectures plus pre-recorded videos from any part of the world via web conference mode.
        1. Practical SAS Programming - Learning SAS by Case Studies
        2. Predictive Modeling with SAS - Modeling with Hands-on Examples plus Domain Knowledge
3. Data Science using R - Practical Data Science Course (Incl. R Programming, Data Science and Domain Knowledge)

        4. Batch : 15th July, Saturday       Mode : Live Instructor-led


        Practical SAS Programming
        Rs 20,000
        (325 USD)
        Special Price : Rs 16,000 (260USD) till 15th June
        • Base and Advanced SAS Programming
        • Instructor-led live class + Pre-recorded videos
        •          Live Projects                  
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Duration : 10 Weeks
        • Limited Seats Available
        • Weekend Classes
        • Money Back Guarantee
        Indian Users

        All Users (Except India)


        Predictive Modeling using SAS
        Rs 25,000
        (400 USD)
        Special Price : Rs 20,000 (320USD) till 15th June
        • Predictive Modeling + Intro to SAS Programming
        • Instructor-led live class + Pre-recorded videos
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Duration : 12 Weeks
        • Limited Seats Available
        • Weekend Classes
        • Money Back Guarantee
        Indian Users

        All Users (Except India)


        R Programming + Data Science with R 
        Rs 30,000
        (480 USD)
        Special Price : Rs 24,000 (380USD) till 15th June
        • R Programming + Predictive Modeling
        • Instructor-led live class + Pre-recorded videos
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Duration : 13 Weeks
        • Limited Seats Available
        • Weekend Classes
        • Money Back Guarantee
        Indian Users

        All Users (Except India)



        Combo Deals - Spend Less, Learn More
        Pay only Rs 50,000 on purchase of all three courses
        Pay only Rs 40,000 on purchase of any of two courses
        **Offer applies on registration before 15th June, 2017

        When do I need to pay full course fee?
Now you just need to pay Rs 1,000 (15 USD) to reserve your seat. Hit the 'Pay & Reserve your seat' button above in the table. You can pay the remaining balance any time until 15th July, 2017 to get access to the live lectures and pre-recorded videos.
        Money Back Guarantee : If you do not like our training, you can ask for 100% course fees refund after your first live session.

        How We are different from other institutes?
Here are some of the features of ListenData that make us better than other training institutes.
        1. Explain Advanced Statistical and Machine Learning Algorithms in Simple English. We make classes more logical and understandable than just telling concepts.
        2. Practical Application of Techniques using Real-world Datasets. No sample or cleaned dataset.
        3. Domain Knowledge - It is the most important element of a predictive modeling project. People who lack in domain knowledge find it difficult to crack interviews in spite of having knowledge of predictive modeling.
        4. Hands-on Model Development and Validation
        5. Strategies to implement predictive model
        6. New algorithms to solve problems efficiently
        7. Explain complex topics via visual lessons

        Who should do these courses?
        These courses are ideal for candidates who want to make a career in analytics.
        1. Any candidate pursuing graduation / post graduation or already graduate can apply for this course. No particular specialization is required prior to applying for these courses. You can be from any educational background like Engineering, Economics, Statistics, Mathematics, Commerce, Business Management, Operational Research etc.
        2. Anyone who is planning a career shift to analytics. It does not matter if you are a network engineer or financial analyst. You can go ahead with these courses as they do not require any prior knowledge of programming or statistics.

Every training institute promises a job. Why should I trust you?
        Let's be honest! It's a universal fact that no college or training institute can provide 100% job guarantee. If they are claiming 100% job guarantee, they are luring learners by false promises. Even IITs do not hit 100% score. Some Facts - Only 66% of IITians landed a job offer via campus recruitment in 2016-17, as against 79% in 2015-16 and 78% in 2014-15, according to HRD ministry.

        Let me list down the common reasons why people don't get jobs in analytics industry even after completing training from some colleges / institutes -
        1. No hands-on experience
        2. No domain knowledge
        3. No theoretical knowledge of statistical concepts
        4. Poor analytical skill
        The objective of this program is to cover the above first three points in detail. In addition we provide job placement assistance to all students.

        The decline of SAS Jobs and rise of R?
I have been working in SAS for close to 7 years and have worked with 4 organizations (instability in career! :D). Whenever I look for a job change, I do not see any decline of SAS jobs in the market. Don't trust me? Go to job portals and search 'SAS'! List of Companies using SAS. It is a big hit in banks, insurance, telecom and pharmaceutical companies. SAS is still a world leader in advanced analytics and has over 40,000 customers worldwide. It has consistently been tagged a 'leader' in advanced analytics platforms in the Gartner 2015 and 2016 reports. It is one of the most sought-after skills in the job market. Learning SAS will help you scale up your skills, which in turn boosts your career.

        At the same time, R has gained popularity. It is a language of choice for data scientists. It makes advanced statistical techniques and machine learning algorithms easy to implement. It is being used as a primary tool in IT, ecommerce, startups, HR, service and product based companies and secondary tool in banks, insurance and telecom companies. List of Companies using R

        Final Comment - You should not get into language wars and should focus on learning both the languages as jobs are evolving very fast. Companies prefer candidates who know both SAS & R.

        In case if I miss any live session?
        Every class is recorded. We will provide you recording of every session.

        I never studied Programming or Statistics during graduation. Can I still apply for this course?
        Yes, these courses are designed to keep in mind the needs of non-programmers/non-statisticians. Only prerequisite is hard work and zeal for learning.

        Is my registration fees refundable?
100% refundable. In case you want to opt out of the course, you can ask for a 100% refund within 7 days of registration. If you want to continue, it would be automatically adjusted against the total fees. In other words, you pay Rs 1000 (15 USD) less than the total fees.

        Any Questions?
        Please feel free to write me at deepanshu.bhalla@outlook.com OR Join me on linkedin

        Predictive Modeling using SAS & R




        Curriculum - Practical SAS Programming
        1. Introduction to SAS
        2. How SAS works
        3. Import Raw Data Files - Basics
        4. Import Raw Data Files - Special Cases
        5. Importing / Exporting Data with Procedures
        6. Exploring Data - Various Methods
        7. Data Subsetting
        8. Data Manipulation - Basics
        9. Data Manipulation - Intermediate
        10. Data Manipulation - Advanced
        11. Do Loops and Arrays
        12. Merging Data
        13. Appending Data
        14. Character & Numeric Functions
        15. Date Functions
        16. Reporting - Creating tabular reports
        17. Proc SQL - Part I
        18. Proc SQL - Part II
        19. Proc SQL - Part III
        20. SAS Macros - Basics
        21. SAS Macros - Intermediate
        22. SAS Macros - Advanced
        23. SAS Macros - Debugging Tips
        24. Efficient SAS Programming Tips
        25. Connect to Databases using SAS
        26. Interview Tips - Scenario Based Questions
        27. Live Project




        Curriculum - Predictive Modeling using SAS
        1. Introduction to Statistics & Modeling
        2. Marketing Analytics : Applications
        3. Predictive Modeling in Financial Services Industry
        4. Predictive Modeling in HR
        5. SAS Programming - Basics
        6. SAS Programming - Intermediate
        7. Descriptive Statistics with SAS
        8. Hypothesis Testing with SAS
        9. Correlation Analysis with SAS
        10. Steps of Predictive Modeling
        11. Data Preparation in Predictive Modeling
        12. Variable Selection Methods in Predictive Modeling
        13. Segmentation - Introduction
        14. Segmentation - Cluster Analysis : Theory
        15. Segmentation - Cluster Analysis : Data Preparation
        16. Segmentation - Cluster Analysis : k-means and Hierarchical
        17. Segmentation - Cluster Analysis : Cluster Performance
        18. Principal Component Analysis (PCA) - Theory
        19. Running and Understanding PCA with SAS
        20. Linear Regression - Theory
        21. Linear Regression - Assumptions and Treatment
        22. Linear Regression - Important Metrics
        23. Linear Regression - Variable Selection Methods
        24. Linear Regression - Model Development
        25. Linear Regression - Model Validation
        26. Linear Regression - Model Performance
        27. Linear Regression - Model Scoring
        28. Linear Regression - Model Implementation
        29. Logistic Regression - Theory
        30. Logistic Regression - Assumptions and Treatment
        31. Logistic Regression - Important Metrics
        32. Logistic Regression - Variable Selection Methods
        33. Logistic Regression - Model Development
        34. Logistic Regression - Model Validation
        35. Logistic Regression - Model Performance
        36. Logistic Regression - Model Implementation
        37. Decision Tree - How it works
        38. Decision Tree - Model Development
        39. Decision Tree - Model Validation
        40. Decision Tree - Model Performance
        41. Decision Tree - Model Implementation
        42. Time Series Forecasting - Theory
        43. Time Series Analysis with SAS
        44. Special Cases - Handle rare event model
        45. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        46. Case Studies - Customer Segmentation
        47. Case Studies - Probability of Default
        48. Case Studies - HR Drivers Analysis
        49. Case Studies - Sales Forecasting
        50. Case Studies - Time Series Forecasting
        51. Interview Tips - Common Interview Questions



        Curriculum - R Programming + Data Science with R
        1. Introduction to R
        2. Introduction to RStudio
        3. Data Structures in R
        4. Importing / Exporting Data in R
        5. Data Exploration
        6. Data Manipulation with dplyr package - Basics
        7. Data Manipulation with dplyr package - Intermediate
        8. Data Manipulation with dplyr package - Advanced
        9. Character and Numeric Functions in R
        10. Data & Time Functions in R
        11. Data Visualization in R
        12. Loops in R (Apply Family of Functions & For Loop)
        13. R Functions - Part I
        14. R Functions - Part II
        15. Introduction to Data Science
        16. Marketing Analytics : Applications
        17. Predictive Modeling in Financial Services Industry
        18. Predictive Modeling in HR
        19. Hypothesis Testing with R
        20. Correlation Analysis with R
        21. Steps of Predictive Modeling
        22. Data Preparation in Predictive Modeling
        23. Variable Selection Methods in Predictive Modeling
        24. Segmentation - Introduction
        25. Segmentation - Cluster Analysis : Theory
        26. Segmentation - Cluster Analysis : Data Preparation
        27. Segmentation - Cluster Analysis : k-means and Hierarchical
        28. Segmentation - Cluster Analysis : Cluster Performance
        29. Principal Component Analysis (PCA) - Theory
        30. Running and Understanding PCA with R
        31. Linear Regression - Theory
        32. Linear Regression - Assumptions and Treatment
        33. Linear Regression - Important Metrics
        34. Linear Regression - Variable Selection Methods
        35. Linear Regression - Model Development
        36. Linear Regression - Model Validation
        37. Linear Regression - Model Performance
        38. Linear Regression - Model Scoring
        39. Linear Regression - Model Implementation
        40. Logistic Regression - Theory
        41. Logistic Regression - Assumptions and Treatment
        42. Logistic Regression - Important Metrics
        43. Logistic Regression - Variable Selection Methods
        44. Logistic Regression - Model Development
        45. Logistic Regression - Model Validation
        46. Logistic Regression - Model Performance
        47. Logistic Regression - Model Implementation
        48. Decision Tree - How it works
        49. Decision Tree - Model Development
        50. Decision Tree - Model Validation
        51. Decision Tree - Model Performance
        52. Decision Tree - Model Implementation
        53. Machine Learning - Basics
        54. Random Forest - How it works
        55. Random Forest vs. Decision Tree
        56. Random Forest - Model Development
        57. Random Forest - Model Validation
        58. Random Forest - How it works
        59. Gradient Boosting - How it works
        60. Gradient Boosting - Model Development
        61. Gradient Boosting - Model Validation
        62. Support Vector Machine - How it works
        63. Support Vector Machine - Model Development
        64. Support Vector Machine - Model Validation
        65. Ensemble Stacking / Blending
        66. Time Series Forecasting - Theory
        67. Time Series Analysis with R
        68. Special Cases - Handle rare event model
        69. Text Mining Basics & Applications
        70. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        71. Case Studies - Customer Segmentation
        72. Case Studies - Probability of Default
        73. Case Studies - HR Drivers Analysis
        74. Case Studies - Sales Forecasting
        75. Case Studies - Time Series Forecasting
        76. Interview Tips - Common Interview Questions

        Python Data Structures

This post explains the data structures used in Python. It is essential to understand the data structures of a programming language. In Python, there are several built-in data structures. They are as follows :
        1. strings
        2. lists
        3. tuples
        4. dictionaries
        5. sets

        Python Data Structures

        1. Strings

        Python String is a sequence of characters.

        How to create a string in Python

You can create a Python string using single or double quotes.
        mystring = "Hello Python3.6"
        print(mystring)
        Output:
        Hello Python3.6

Can I use multiple (triple) single or double quotes to define a string?

The answer is yes. See the examples below -

        Multiple Single Quotes
        mystring = '''Hello Python3.6'''
        print(mystring)
        Output:
        Hello Python3.6
        Multiple Double Quotes
        mystring = """Hello Python3.6"""
        print(mystring)
        Output:
        Hello Python3.6

        How to include quotes within a string?
mystring = 'Hello"Python"'
        print(mystring)
        Output:
        Hello"Python"

        How to extract Nth letter or word?

        You can use the syntax below to get first letter.
        mystring = 'Hi How are you?'
        mystring[0]
        Output
        'H'
mystring[0] refers to the first letter, as indexing in Python starts from 0. Similarly, mystring[1] refers to the second letter.

To pull the last letter, you can use -1 as the index.
        mystring[-1]

To get the first word
mystring.split(' ')[0]
Output : Hi

        How it works -

1. mystring.split(' ') tells Python to use a space as the delimiter.

        Output : ['Hi', 'How', 'are', 'you?']

2. mystring.split(' ')[0] tells Python to pick the first word of the string.


        2. List

Unlike a string, a list can contain different types of objects such as integers, floats, strings etc.
1. x = [142, 124, 234, 345, 465]
2. y = ['A', 'C', 'E', 'M']
3. z = ['AA', 44, 5.1, 'KK']

        Get List Item

We can extract a list item using indexes. The index starts at 0 and ends at (number of elements - 1).
        k = [124, 225, 305, 246, 259]
        k[0]
        k[1]
        k[-1]
        k[0]
        124

        k[1]
        225

        k[-1]
        259

        Explanation :
k[0] picks the first element from the list. The negative sign tells Python to count from the right, so k[-1] selects the last element of the list.

To select multiple elements from a list, you can use slicing :
        k[:3] returns [124, 225, 305]
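A few more slice patterns on the same list, in case k[:3] above is the first slice you have seen (this is standard Python slicing, added here as an illustration):
k[1:4]   # [225, 305, 246] - elements at index 1, 2 and 3
k[-2:]   # [246, 259]      - last two elements
k[::2]   # [124, 305, 259] - every second element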

        Add 5 to each element of a list

In the program below, the len() function is used to count the number of elements in a list; in this case it returns 5. With the help of the range() function, range(5) returns 0, 1, 2, 3, 4.
        x = [1, 2, 3, 4, 5]
        for i in range(len(x)):
            x[i] = x[i] + 5
        print(x)
        [6, 7, 8, 9, 10]

        It can also be written like this -
        for i in range(len(x)):
           x[i] += 5
        print(x)
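The same result can also be produced with a list comprehension, which is the more idiomatic way to build a new list in Python (shown here only as an alternative to the loop above):
x = [1, 2, 3, 4, 5]
x = [i + 5 for i in x]
print(x)
[6, 7, 8, 9, 10]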

        Combine / Join two lists

The '+' operator concatenates two lists.
        X = [1, 2, 3]
        Y = [4, 5, 6]
        Z = X + Y
        print(Z)
        [1, 2, 3, 4, 5, 6]

Sum of values of two lists
X = [1, 2, 3]
Y = [4, 5, 6]
import numpy as np
Z = np.add(X, Y)
print(Z)
[5 7 9]
Similarly, you can use np.multiply(X, Y) to multiply the values of two lists element-wise.

        Repeat List N times

The '*' operator repeats a list N times.
        X = [1, 2, 3]
        Z = X * 3
        print(Z)
        [1, 2, 3, 1, 2, 3, 1, 2, 3]

Note : The '+' (join) and '*' (repeat) operators shown above also work for lists of strings.
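For example (a small illustration added here, using the same operators):
A = ['AA', 'BB']
B = ['CC']
print(A + B)
['AA', 'BB', 'CC']
print(A * 2)
['AA', 'BB', 'AA', 'BB']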

        Modify / Replace a list item

Suppose you need to replace the third value with a different value.
X = [1, 2, 3]
X[2] = 5
print(X)
[1, 2, 5]

        Add / Remove a list item

We can add a list item by using the append() method.
        X = ['AA', 'BB', 'CC']
        X.append('DD')
        print(X)
        Result : ['AA', 'BB', 'CC', 'DD']

Similarly, we can remove a list item by using the remove() method.
        X = ['AA', 'BB', 'CC']
        X.remove('BB')
        print(X)
        Result : ['AA', 'CC']

        Sort list
        k = [124, 225, 305, 246, 259]
        k.sort()
        print(k)
        Output : [124, 225, 246, 259, 305]


        3. Tuple

Like a list, a tuple can also contain mixed data. But a tuple cannot be changed or altered once created, whereas a list can be modified. Another difference is that a tuple is created inside parentheses ( ), whereas a list is created inside square brackets [ ].

        Examples
        mytuple = (123,223,323)
        City = ('Delhi','Mumbai','Bangalore')
        Perform for loop on Tuple
        for i in City:
            print(i)
        Delhi
        Mumbai
        Bangalore

        Tuple cannot be altered

        Run the following command and check error
        X = (1, 2, 3)
        X[2]=5
        TypeError:'tuple' object does not support item assignment
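If you do need to change a value, the usual workaround (standard Python, not from the original text) is to convert the tuple to a list, modify it, and convert it back:
X = (1, 2, 3)
Y = list(X)
Y[2] = 5
X = tuple(Y)
print(X)
(1, 2, 5)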

        4. Dictionary

It works like an address book: you can find the address of a person by searching for their name. In this example, the name of a person is the key and the address is the value. It is important to note that keys must be unique while values need not be. If a key is duplicated, you cannot reliably find the values associated with it. Keys can be of any immutable data type such as strings, numbers, or tuples.

        Create a dictionary

        It is defined in curly braces {}. Each key is followed by a colon (:) and then values.
        teams = {'Dave' : 'team A',
                 'Tim' : 'team B',
                 'Babita' : 'team C',
                 'Sam' : 'team B',
                 'Ravi' : 'team C'
                }

        Find Values
        teams['Sam']
        Output :'team B'

        Delete an item
        del teams['Ravi']

        Add an item
        teams['Deep'] = 'team B'
        Output :
        {'Babita': 'team C',
         'Dave': 'team A',
         'Deep': 'team B',
         'Sam': 'team B',
         'Tim': 'team B'}

        5. Sets

Sets are unordered collections of unique objects.
        X = set(['A', 'B', 'C'])

        Q. Does 'A' exist in set X?
        'A' in X
        Result : True

        Q. Does 'D' exist in set X?
        'D' in X
        Result : False

        Q. How to add 'D' in set X?
        X.add('D')
        Q. How to remove 'C' from set X?
        X.remove('C')
        Q. How to create a copy of set X?
        Y = X.copy()
        Q. Which items are common in both sets X and Y?
        Y & X

        Predictive Modeling using SAS & R Online Training

        First Online Training Batch
The next instructor-led online training batch will commence on October 8, 2017. We are offering courses on SAS, R and Predictive Modeling. In this program you will get access to live lectures plus recorded videos from any part of the world via web conference mode. You can also chat or even ask your questions verbally over VoIP in real time to get your doubts cleared.
        1. Practical SAS Programming - Learning SAS by Case Studies
        2. Predictive Modeling with SAS - Modeling with Hands-on Examples plus Domain Knowledge
        3. Data Science using R - Practical Data Science Course (Incld. R Programming, Data Science and Domain Knowledge)

        4. Batch : 8th October, Sunday     Mode : Live Instructor-led


        Practical SAS Programming
        Rs 20,000
        ($325)
        • Special Price : Get 10% off till 25th Sept, 2017
        • Base and Advanced SAS Programming
        • Instructor-led live class + Recorded videos
        • Duration : 8 Weeks (100 hours)
        • Live Projects + Scenario-Based Questions
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        Indian Users
        All Users (Except India)


        Predictive Modeling using SAS
        Rs 25,000
        ($400)
        • Special Price : Get 10% off till 25th Sept, 2017
        • Predictive Modeling with SAS                    
        • Instructor-led live class + Recorded videos
        • Duration : 8 - 10 Weeks (100 hours)
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        Indian Users
        All Users (Except India)


        R Programming + Data Science with R 
        Rs 30,000
        ($480)
        • Special Price : Get 10% off till 25th Sept, 2017
        • R Programming + Predictive Modeling with R
        • Instructor-led live class + Recorded videos
        • Duration : 10-12 Weeks (120 hours)
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        Indian Users
        All Users (Except India)



        Combo Deals - Spend Less, Learn More
        Pay only Rs 35,000 ($600) on purchase of 'Practical SAS Programming' and 'Predictive Modeling with SAS' courses
        Offer expires on 25th September,2017
        Enroll Now - Indian Users 
        Enroll Now - All Users (Except India)

        What is Instructor-led live program?
It is an interactive training program. Learners get access to live lectures via webinar mode and can chat or even ask their questions verbally over VoIP in real time to get their doubts cleared. You can also go through the video recording if you miss a class.

        Money Back Guarantee?
If you do not like our training, you can ask for a 100% course fee refund after your first live session. No questions asked refund policy!

        What is the duration of these programs?
These are weekend programmes comprising 100-130 hours. Classes will be held every Saturday and Sunday. The course duration is as follows -
        1. Practical SAS Programming - 100 hours (At least 50 hours live training + 5 hours video based training + ~60 hours of Practice and Self Study)
        2. Predictive Modeling with SAS - 100 hours (Includes hours of Video based training and Practice and Self Study)
        3. Data Science with R - 120 hours (At least 60 hours live training + 7 hours video based training + ~80 hours of Practice and Self Study)

        If I opt for all the 3 courses, will classes be scheduled at the same time?
All classes will be scheduled on weekends but not at the same time; they will run one after another. For example, if class A gets over at 5, the next class will start at 6.

        How We are different from other institutes?
Here are some of the features of ListenData that make us better than other training institutes.
        1. Explain Advanced Statistical and Machine Learning Algorithms in Simple English. We make classes more logical and understandable than just telling concepts.
        2. Practical Application of Techniques using Real-world Datasets. No sample or cleaned dataset.
        3. Domain Knowledge - It is the most important element of a predictive modeling project. People who lack in domain knowledge find it difficult to crack interviews in spite of having knowledge of predictive modeling.
        4. Hands-on Model Development and Validation
        5. Strategies to implement predictive model
        6. New algorithms to solve problems efficiently
        7. Explain complex topics via visual lessons

        Who should do these courses?
These courses are ideal for candidates who want to make a career in analytics.
1. Any candidate pursuing or holding a graduate / post-graduate degree can apply for these courses. No particular specialization is required prior to applying. You can be from any educational background such as Engineering, Economics, Statistics, Mathematics, Commerce, Business Management, Operational Research etc.
2. Anyone who is planning a career shift to analytics. It does not matter if you are a network engineer or a financial analyst. You can go ahead with these courses as they do not require any prior knowledge of programming or statistics.

Every training institute promises jobs. Why should I trust you?
Let's be honest! It's a universal fact that no college or training institute can provide a 100% job guarantee. If they claim a 100% job guarantee, they are luring learners with false promises. Even IITs do not hit a 100% score. Some facts - only 66% of IITians landed a job offer via campus recruitment in 2016-17, as against 79% in 2015-16 and 78% in 2014-15, according to the HRD ministry.

Let me list down the common reasons why people don't get jobs in the analytics industry even after completing training from some colleges / institutes -
        1. No hands-on experience
        2. No domain knowledge
        3. No theoretical knowledge of statistical concepts
        4. Poor analytical skill
The objective of this program is to cover the first three points above in detail. In addition, we provide job placement assistance to all students and will keep you informed about current openings in the analytics industry. We are in constant contact with job consultancy firms and maintain a solid network of analytics professionals.

        The decline of SAS Jobs and rise of R?
I have been working in SAS for close to 7 years and have worked with 4 organizations (instability in career! :D). Whenever I look for a job change, I do not see any decline in SAS jobs in the market. Don't trust me? Go to job portals and search 'SAS'! List of Companies using SAS It is a big hit in banks, insurance, telecom and pharmaceutical companies. SAS is still a world leader in advanced analytics and has over 40,000 customers worldwide. It has consistently been tagged a 'leader' in advanced analytics platforms as per the Gartner 2015 and 2016 reports. It is one of the most sought-after skills in the job market. Learning SAS will help you scale up your skills, which in turn boosts your career.

At the same time, R has gained popularity. It is a language of choice for data scientists. It makes advanced statistical techniques and machine learning algorithms easy to implement. It is used as a primary tool in IT, ecommerce, startups, HR, and service and product based companies, and as a secondary tool in banks, insurance and telecom companies. List of Companies using R

Final Comment - You should not get into language wars; focus on learning both languages, as jobs are evolving very fast. Companies prefer candidates who know both SAS & R.

What if I miss a live session?
Every class is recorded. We will provide you with the recording of every session.

        I never studied Programming or Statistics during graduation. Can I still apply for this course?
Yes, these courses are designed keeping in mind the needs of non-programmers / non-statisticians. The only prerequisites are hard work and a zeal for learning.

        Is my registration fees refundable?
It is 100% refundable. In case you want to opt out of the course for any reason, you can ask for a 100% refund within 7 days of registration. If you want to continue, the registration fee is automatically adjusted against the total fees. In other words, you pay $15 (Rs 1,000) less than the total fees.

        About Instructor
Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. He has worked with companies like Aon, Cognizant, Genpact and RBS, and has handled global clients in various domains like retail and commercial banking, telecom, HR and automotive. He has worked extensively on various data science projects such as Customer Attrition, Customer Lifetime Value Models, Propensity Models, Opinion / Sentiment Mining, Geo Analytics, Credit Risk Scorecards, Portfolio Optimization, Pricing Analytics, Cross-sell / Up-sell Campaign Models, Survey Analytics, Customer Segmentation, Market Benchmarking, Employee Attrition, Employee Engagement etc.

        Any Questions?
        Please feel free to write me at deepanshu.bhalla@outlook.com OR Join me on linkedin

        Predictive Modeling using SAS & R Training




        Curriculum - Practical SAS Programming
        1. Introduction to SAS
        2. How SAS works
        3. Import Raw Data Files - Basics
        4. Import Raw Data Files - Special Cases
        5. Importing / Exporting Data with Procedures
        6. Exploring Data - Various Methods
        7. Data Subsetting
        8. Data Manipulation - Basics
        9. Data Manipulation - Intermediate
        10. Data Manipulation - Advanced
        11. Do Loops and Arrays
        12. Merging Data
        13. Appending Data
        14. Character & Numeric Functions
        15. Date Functions
        16. Reporting - Creating tabular reports
        17. Proc SQL - Part I
        18. Proc SQL - Part II
        19. Proc SQL - Part III
        20. SAS Macros - Basics
        21. SAS Macros - Intermediate
        22. SAS Macros - Advanced
        23. SAS Macros - Debugging Tips
        24. Efficient SAS Programming Tips
        25. Connect to Databases using SAS
        26. Interview Preparation - Scenario Based Questions
        27. Live Project




        Curriculum - Predictive Modeling using SAS
        1. Introduction to Statistics & Modeling
        2. Marketing Analytics : Applications
        3. Predictive Modeling in Financial Services Industry
        4. Predictive Modeling in HR
        5. SAS Programming - Basics
        6. SAS Programming - Intermediate
        7. Descriptive Statistics with SAS
        8. Hypothesis Testing with SAS
        9. Correlation Analysis with SAS
        10. Steps of Predictive Modeling
        11. Data Preparation in Predictive Modeling
        12. Variable Selection Methods in Predictive Modeling
        13. Segmentation - Introduction
        14. Segmentation - Cluster Analysis : Theory
        15. Segmentation - Cluster Analysis : Data Preparation
        16. Segmentation - Cluster Analysis : k-means and Hierarchical
        17. Segmentation - Cluster Analysis : Cluster Performance
        18. Principal Component Analysis (PCA) - Theory
        19. Running and Understanding PCA with SAS
        20. Linear Regression - Theory
        21. Linear Regression - Assumptions and Treatment
        22. Linear Regression - Important Metrics
        23. Linear Regression - Variable Selection Methods
        24. Linear Regression - Model Development
        25. Linear Regression - Model Validation
        26. Linear Regression - Model Performance
        27. Linear Regression - Model Scoring
        28. Linear Regression - Model Implementation
        29. Logistic Regression - Theory
        30. Logistic Regression - Assumptions and Treatment
        31. Logistic Regression - Important Metrics
        32. Logistic Regression - Variable Selection Methods
        33. Logistic Regression - Model Development
        34. Logistic Regression - Model Validation
        35. Logistic Regression - Model Performance
        36. Logistic Regression - Model Implementation
        37. Decision Tree - How it works
        38. Decision Tree - Model Development
        39. Decision Tree - Model Validation
        40. Decision Tree - Model Performance
        41. Decision Tree - Model Implementation
        42. Time Series Forecasting - Theory
        43. Time Series Analysis with SAS
        44. Special Cases - Handle rare event model
        45. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        46. Case Studies - Customer Segmentation
        47. Case Studies - Probability of Default
        48. Case Studies - Employee Attrition
        49. Case Studies - Time Series Forecasting
        50. Interview Tips - Common Interview Questions



        Curriculum - R Programming + Data Science with R
        1. Introduction to R
        2. Introduction to RStudio
        3. Data Structures in R
        4. Importing / Exporting Data in R
        5. Data Exploration
        6. Data Manipulation with dplyr package - Basics
        7. Data Manipulation with dplyr package - Intermediate
        8. Data Manipulation with dplyr package - Advanced
        9. Character and Numeric Functions in R
10. Date & Time Functions in R
        11. Data Visualization in R
        12. Loops in R (Apply Family of Functions & For Loop)
        13. R Functions - Part I
        14. R Functions - Part II
        15. Introduction to Data Science
        16. Marketing Analytics : Applications
        17. Predictive Modeling in Financial Services Industry
        18. Predictive Modeling in HR
        19. Hypothesis Testing with R
        20. Correlation Analysis with R
        21. Steps of Predictive Modeling
        22. Data Preparation in Predictive Modeling
        23. Variable Selection Methods in Predictive Modeling
        24. Segmentation - Introduction
        25. Segmentation - Cluster Analysis : Theory
        26. Segmentation - Cluster Analysis : Data Preparation
        27. Segmentation - Cluster Analysis : k-means and Hierarchical
        28. Segmentation - Cluster Analysis : Cluster Performance
        29. Principal Component Analysis (PCA) - Theory
        30. Running and Understanding PCA with R
        31. Linear Regression - Theory
        32. Linear Regression - Assumptions and Treatment
        33. Linear Regression - Important Metrics
        34. Linear Regression - Variable Selection Methods
        35. Linear Regression - Model Development
        36. Linear Regression - Model Validation
        37. Linear Regression - Model Performance
        38. Linear Regression - Model Scoring
        39. Linear Regression - Model Implementation
        40. Logistic Regression - Theory
        41. Logistic Regression - Assumptions and Treatment
        42. Logistic Regression - Important Metrics
        43. Logistic Regression - Variable Selection Methods
        44. Logistic Regression - Model Development
        45. Logistic Regression - Model Validation
        46. Logistic Regression - Model Performance
        47. Logistic Regression - Model Implementation
        48. Decision Tree - How it works
        49. Decision Tree - Model Development
        50. Decision Tree - Model Validation
        51. Decision Tree - Model Performance
        52. Decision Tree - Model Implementation
        53. Machine Learning - Basics
        54. Random Forest - How it works
        55. Random Forest vs. Decision Tree
        56. Random Forest - Model Development
        57. Random Forest - Model Validation
        58. Random Forest - How it works
        59. Gradient Boosting - How it works
        60. Gradient Boosting - Model Development
        61. Gradient Boosting - Model Validation
        62. Support Vector Machine - How it works
        63. Support Vector Machine - Model Development
        64. Support Vector Machine - Model Validation
        65. Ensemble Stacking / Blending
        66. Time Series Forecasting - Theory
        67. Time Series Analysis with R
        68. Special Cases - Handle rare event model
        69. Text Mining Basics & Applications
        70. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        71. Case Studies - Customer Segmentation
        72. Case Studies - Probability of Default
        73. Case Studies - HR Drivers Analysis
        74. Case Studies - Sales Forecasting
        75. Case Studies - Time Series Forecasting
        76. Interview Tips - Common Interview Questions

        K Nearest Neighbor : Step by Step Tutorial

In this article, we will cover how the K-nearest neighbor (KNN) algorithm works and how to run KNN in R. It is one of the most widely used algorithms for classification problems.

        K-Nearest Neighbor Simplified

        Introduction to K-Nearest Neighbor (KNN)

KNN is a non-parametric supervised learning technique in which we try to classify a data point into a given category with the help of a training set. In simple words, it captures information from all training cases and classifies new cases based on similarity.
Predictions are made for a new instance (x) by searching through the entire training set for the K most similar cases (neighbors) and summarizing the output variable for those K cases. In classification, this is the mode (or most common) class value.

        How KNN algorithm works

Suppose we have the height, weight and T-shirt size of some customers and we need to predict the T-shirt size of a new customer given only his height and weight. The data, including height, weight and T-shirt size, is shown below -

Height (in cms)   Weight (in kgs)   T-Shirt Size
158               58                M
158               59                M
158               63                M
160               59                M
160               60                M
163               60                M
163               61                M
160               64                L
163               64                L
165               61                L
165               62                L
165               65                L
168               62                L
168               63                L
168               66                L
170               63                L
170               64                L
170               68                L


        Step 1 : Calculate Similarity based on distance function

        There are many distance functions but Euclidean is the most commonly used measure. It is mainly used when data is continuous. Manhattan distance is also very common for continuous variables.

        Distance Functions

The idea behind using a distance measure is to compute the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of height and weight.

        New customer named 'Monica' has height 161cm and weight 61kg.

The Euclidean distance between the first observation and the new observation (Monica) is as follows -
=SQRT((161-158)^2+(61-58)^2) = 4.24
Similarly, we calculate the distance of the new case from all the training cases and rank the training cases by distance. The smallest distance value is ranked 1 and considered the nearest neighbor. A minimal R sketch of this step is shown below.
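The snippet below is a minimal base R sketch (not part of the original tutorial) that reproduces this step on the T-shirt data shown above: it computes the Euclidean distance of every training case from Monica (161 cm, 61 kg) and ranks the cases by distance.
# Build the training data shown in the table above
train <- data.frame(
  height = c(158,158,158,160,160,163,163,160,163,165,165,165,168,168,168,170,170,170),
  weight = c(58,59,63,59,60,60,61,64,64,61,62,65,62,63,66,63,64,68),
  size   = c("M","M","M","M","M","M","M","L","L","L","L","L","L","L","L","L","L","L"))

# Euclidean distance of each training case from the new customer (Monica)
train$distance <- sqrt((train$height - 161)^2 + (train$weight - 61)^2)

# Rank 1 = nearest neighbor
train$rank <- rank(train$distance, ties.method = "first")
head(train[order(train$rank), ])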

        Step 2 : Find K-Nearest Neighbors

Let k be 5. The algorithm then searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and sees what categories those 5 customers were in. If 4 of them had 'Medium' T-shirt sizes and 1 had a 'Large' T-shirt size, then your best guess for Monica is 'Medium'. See the calculation shown in the snapshot below -

        Calculate KNN manually

In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange: 'Medium T-shirt size' is shown in blue and 'Large T-shirt size' in orange. The new customer is shown as a yellow circle. Four blue data points and one orange data point are close to the yellow circle, so the prediction for the new case is the blue class, i.e. Medium T-shirt size.
        KNN: Visual Representation


        Assumptions of KNN

        1. Standardization

When independent variables in the training data are measured in different units, it is important to standardize the variables before calculating distance. For example, if one variable is based on height in cms and the other on weight in kgs, then height will influence the distance calculation more. In order to make them comparable we need to standardize them, which can be done by any of the following methods :

        Standardization

        2. Outlier

A low k value is sensitive to outliers, whereas a higher k value is more resilient to outliers as it considers more voters to decide the prediction.

        Why KNN is non-parametric?

Non-parametric means not making any assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. Similarly, in KNN the number of model parameters actually grows with the training data set - you can imagine each training case as a "parameter" in the model.

KNN vs. K-means

Many people get confused between these two techniques - K-means and K-nearest neighbor. Some of the differences are listed below (a minimal R sketch contrasting the two calls follows the list) -
1. K-means is an unsupervised learning technique (no dependent variable) whereas KNN is a supervised learning algorithm (a dependent variable exists)
2. K-means is a clustering technique which tries to split data points into K clusters such that the points in each cluster tend to be near each other, whereas K-nearest neighbor determines the classification of a point by combining the classifications of the K nearest points
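The following minimal R sketch (not from the original article) contrasts the two calls on the built-in iris data: kmeans() is unsupervised and receives no labels, while class::knn() is supervised and requires the training labels.
library(class)

X <- scale(iris[, 1:4])               # features only, standardized

# Unsupervised: kmeans() just looks for 3 clusters in X
km <- kmeans(X, centers = 3)
head(km$cluster)

# Supervised: knn() needs the class labels of the training rows
idx  <- sample(nrow(X), 100)          # simple train/test split
pred <- knn(train = X[idx, ], test = X[-idx, ],
            cl = iris$Species[idx], k = 5)
table(pred, iris$Species[-idx])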

        Can KNN be used for regression?
Yes, K-nearest neighbor can be used for regression. In other words, the K-nearest neighbor algorithm can be applied when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbors, as sketched below.
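As an illustration only (not the article's code), the base R sketch below implements KNN regression for a single predictor by averaging the target values of the k nearest training cases; the data here are made up.
# Predict a continuous target as the mean of its k nearest neighbors
knn_regress <- function(train_x, train_y, new_x, k = 3) {
  d <- abs(train_x - new_x)              # distance on a single predictor
  mean(train_y[order(d)][1:k])           # average of the k nearest targets
}

heights <- c(158, 160, 163, 165, 168, 170)
weights <- c(58, 60, 61, 63, 64, 66)
knn_regress(heights, weights, new_x = 162, k = 3)   # predicted weight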

        Pros and Cons of KNN

        Pros

        1. Easy to understand
        2. No assumptions about data
        3. Can be applied to both classification and regression
        4. Works easily on multi-class problems

        Cons

1. Memory intensive / computationally expensive
2. Sensitive to the scale of the data
3. Does not work well with a rare event (skewed) target variable
4. Struggles when there is a high number of independent variables
        For any given problem, a small value of k will lead to a large variance in predictions. Alternatively, setting  k to a large value may lead to a large model bias.
        How to handle categorical variables in KNN?

Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike regression, create k dummies instead of (k-1). For example, if a categorical variable named "Department" has 5 unique levels / categories, we create 5 dummy variables. Each dummy variable has 1 against its department and 0 otherwise. A minimal R sketch is shown below.
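A minimal R sketch of this idea (assuming a made-up 'Department' column) uses base R's model.matrix(); removing the intercept with '- 1' keeps one 0/1 column per level, i.e. k dummies rather than k-1.
df <- data.frame(Department = c("HR", "Sales", "IT", "Finance", "Ops", "HR"))

# One dummy column per level (no intercept, so no level is dropped)
dummies <- model.matrix(~ Department - 1, data = df)
head(dummies)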

        How to find best K value?

        Cross-validation is a smart way to find out the optimal K value. It estimates the validation error rate by holding out a subset of the training set from the model building process. 

Cross-validation (let's say 10-fold validation) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The misclassification rate is then computed on the 10% validation data. This procedure repeats 10 times, with a different group of observations treated as the validation set each time. It results in 10 estimates of the validation error, which are then averaged out. A quick sketch of this idea is shown below; the caret-based repeated cross-validation used later in this article automates the same process.
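The sketch below is only an illustration of this idea (it is not the caret workflow used later in this article): class::knn.cv() performs leave-one-out cross-validation, so we can compare the cross-validated accuracy across candidate values of k on the built-in iris data.
library(class)

X <- scale(iris[, 1:4])
y <- iris$Species

# Leave-one-out CV accuracy for k = 1, 2, ..., 15
accuracy <- sapply(1:15, function(k) mean(knn.cv(train = X, cl = y, k = k) == y))
which.max(accuracy)    # k with the highest cross-validated accuracy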

        K Nearest Neighbor in R

We are going to use historical data of past win/loss statistics and the corresponding speeches. This dataset comprises 1524 observations on 14 variables. The dependent variable is win/loss, where 1 indicates a win and 0 indicates a loss. The independent variables are:

1. Proportion of words in the speech showing
a. Optimism
b. Pessimism
c. the use of Past
d. the use of Present
e. the use of Future

2. Number of times he/she mentions his/her own party

3. Number of times he/she mentions the opposing parties

4. Some measures indicating the content of the speech showing
a. Openness
b. Conscientiousness
c. Extraversion
d. Agreeableness
e. Neuroticism
f. Emotionality

        Download Link :Data File

        Read Data
        # Read data
        data1 = read.csv("US Presidential Data.csv")
        View(data1)

We read the CSV file with the help of the read.csv command. The first argument is the name of the file. The second argument, header = TRUE (or T), implies that the first row in our CSV file contains the column headings, while header = FALSE (or F) indicates that the data should be read from the first line and does not involve any headings.
        # load library
        library(caret)
        library(e1071)

        # Transforming the dependent variable to a factor
        data1$Win.Loss = as.factor(data1$Win.Loss)
Here we will use the caret package in order to run KNN. Since the dependent variable is numeric here, we need to transform it to a factor using as.factor().
        #Partitioning the data into training and validation data
        set.seed(101)
        index = createDataPartition(data1$Win.Loss, p = 0.7, list = F )
        train = data1[index,]
        validation = data1[-index,]
In order to partition the data into training and validation sets, we use the createDataPartition() function from caret.

First we set the seed to 101 so that the same results can be reproduced. In createDataPartition() the first argument is the dependent variable; p denotes how much data we want in the training set (here we take 70% of the data in the training set and the rest in the validation set); list = F denotes that the indices we obtain should be in the form of a vector.
        # Explore data
        dim(train)
        dim(validation)
        names(train)
        head(train)
        head(validation)
The dimensions of the training and validation sets are checked via dim(). See the first 6 rows of the training dataset -

           Win.Loss   Optimism  Pessimism  PastUsed FutureUsed PresentUsed OwnPartyCount
        1 X1 0.10450450 0.05045045 0.4381443 0.4948454 0.06701031 2
        3 X1 0.11257190 0.04930156 0.4159664 0.5168067 0.06722689 1
        5 X1 0.10582640 0.05172414 0.3342618 0.5821727 0.08356546 3
        7 X1 0.09838275 0.06401617 0.3240741 0.6018519 0.07407407 6
        9 X1 0.10610734 0.04688464 0.3633540 0.5372671 0.09937888 2
        10 X1 0.10066128 0.05951506 0.3554817 0.5382060 0.10631229 1
        OppPartyCount NumericContent Extra Emoti Agree Consc Openn
        1 2 0.001877543 4.041 4.049 3.469 2.450 2.548
        3 1 0.002131163 3.463 4.039 3.284 2.159 2.465
        5 4 0.002229220 4.658 4.023 3.283 2.415 2.836
        7 4 0.002251985 3.727 4.108 3.357 2.128 2.231
        9 5 0.002446440 4.119 4.396 3.661 2.572 2.599
        10 2 0.002107436 3.800 4.501 3.624 2.117 2.154

By default, the levels of the dependent variable in this dataset are "0" and "1". Later, when we do prediction, these levels will be used as variable names, so we need to make them valid variable names.
        # Setting levels for both training and validation data
        levels(train$Win.Loss) <- make.names(levels(factor(train$Win.Loss)))
        levels(validation$Win.Loss) <- make.names(levels(factor(validation$Win.Loss)))
Here we are using the repeated cross-validation method via trainControl(). 'number' denotes the number of folds and 'repeats' is for repeated k-fold cross-validation. In this case, 3 separate 10-fold validations are used.
        # Setting up train controls
        repeats = 3
        numbers = 10
        tunel = 10

        set.seed(1234)
        x = trainControl(method = "repeatedcv",
                         number = numbers,
                         repeats = repeats,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

Using the train() function we run our KNN model; Win.Loss is the dependent variable and the full stop after the tilde denotes that all the other variables are used as independent variables. In 'data =' we pass our training set, 'method =' denotes which technique we want to deploy, and setting preProcess to center and scale tells caret to standardize the independent variables:

center : subtract the mean from the values.
scale : divide the values by the standard deviation.

trControl takes our 'x', which was obtained via trainControl(), and tuneLength is an integer which sets how many candidate values of the tuning parameter (here, k) are evaluated.
        model1 <- train(Win.Loss~. , data = train, method = "knn",
                       preProcess = c("center","scale"),
                       trControl = x,
                       metric = "ROC",
                       tuneLength = tunel)

        # Summary of model
        model1
        plot(model1)
        k-Nearest Neighbors 

        1068 samples
        13 predictor
        2 classes: 'X0', 'X1'

        Pre-processing: centered (13), scaled (13)
        Resampling: Cross-Validated (10 fold, repeated 3 times)
        Summary of sample sizes: 961, 962, 961, 962, 961, 962, ...
        Resampling results across tuning parameters:

        k ROC Sens Spec
        5 0.8440407 0.6910182 0.8382051
        7 0.8537506 0.6847658 0.8520513
        9 0.8575183 0.6712350 0.8525796
        11 0.8588422 0.6545296 0.8592152
        13 0.8585478 0.6560976 0.8556333
        15 0.8570397 0.6432249 0.8648329
        17 0.8547545 0.6448509 0.8627894
        19 0.8520574 0.6336043 0.8632867
        21 0.8484632 0.6215447 0.8627894
        23 0.8453320 0.6071622 0.8658664

        ROC was used to select the optimal model using the largest value.
        The final value used for the model was k = 11.

        Cross Validation : Fine Tuning


Finally, to make predictions on our validation set, we use the predict function, in which the first argument is the fitted model and the second argument is the new data on which we want the predictions.

        # Validation
        valid_pred <- predict(model1,validation, type = "prob")

        #Storing Model Performance Scores
        library(ROCR)
        pred_val <-prediction(valid_pred[,2],validation$Win.Loss)

        # Calculating Area under Curve (AUC)
        perf_val <- performance(pred_val,"auc")
        perf_val

        # Plot AUC
        perf_val <- performance(pred_val, "tpr", "fpr")
        plot(perf_val, col = "green", lwd = 1.5)

        #Calculating KS statistics
        ks <- max(attr(perf_val, "y.values")[[1]] - (attr(perf_val, "x.values")[[1]]))
        ks

The area under the curve (AUC) on the validation dataset is 0.8642.
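As a small follow-up (not part of the original article), hard class predictions can also be summarized with caret's confusionMatrix(); by default, predict() on a caret model returns the predicted classes.
# Class-level evaluation on the validation set
class_pred <- predict(model1, validation)            # predicted classes (X0 / X1)
confusionMatrix(class_pred, validation$Win.Loss)     # accuracy, sensitivity, specificity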

        Pandas Python Tutorial - Learn by Example

Pandas, being one of the most popular packages in Python, is widely used for data manipulation. It is a very powerful and versatile package which makes data cleaning and wrangling much easier and more pleasant.

The Pandas library is a great contribution to the Python community, and it makes Python one of the top programming languages for data science and analytics. It has become the first choice of data analysts and scientists for data analysis and manipulation.

        Data Analysis with Python : Pandas Step by Step Guide

        Why pandas?
It has many functions which are essential for data handling. In short, it can perform the following tasks for you -
        1. Create a structured data set similar to R's data frame and Excel spreadsheet.
        2. Reading data from various sources such as CSV, TXT, XLSX, SQL database, R etc.
        3. Selecting particular rows or columns from data set
        4. Arranging data in ascending or descending order
        5. Filtering data based on some conditions
        6. Summarizing data by classification variable
        7. Reshape data into wide or long format
        8. Time series analysis
        9. Merging and concatenating two datasets
        10. Iterate over the rows of dataset
        11. Writing or Exporting data in CSV or Excel format

        Datasets:

In this tutorial we will use two datasets: 'income' and 'iris'.
1. 'income' data : This data contains the income of various states from 2002 to 2015. The dataset contains 51 observations and 16 variables. Download link
2. 'iris' data : It comprises 150 observations and 5 variables. We have 3 species of flowers (50 flowers for each species), and for all of them the sepal length and width and petal length and width are given. Download link


        Important pandas functions to remember

        The following is a list of common tasks along with pandas functions.
Utility                                    Function
To select a random no. of rows             df.sample(n = 10)
To select a fraction of random rows        df.sample(frac = 0.2)
To rename the variables                    df.rename( )
Selecting a column as index                df.set_index( )
Removing rows or columns                   df.drop( )
Sorting values                             df.sort_values( )
Grouping variables                         df.groupby( )
Filtering                                  df.query( )
Finding the missing values                 df.isnull( )
Dropping the missing values                df.dropna( )
Removing the duplicates                    df.drop_duplicates( )
Creating dummies                           pd.get_dummies( )
Ranking                                    df.rank( )
Cumulative sum                             df.cumsum( )
Quantiles                                  df.quantile( )
Selecting numeric variables                df.select_dtypes( )
Concatenating two dataframes               pd.concat()
Merging on basis of common variable        pd.merge( )

        Importing pandas library

You need to import or load the Pandas library first in order to use it. By "importing a library", we mean loading it into memory so that you can use it. Run the following code to import the pandas library:
        import pandas as pd
        The "pd" is an alias or abbreviation which will be used as a shortcut to access or call pandas functions. To access the functions from pandas library, you just need to type pd.function instead of  pandas.function every time you need to apply it.

        Importing Dataset

To read or import data from a CSV file, you can use the read_csv() function. In the function, you need to specify the file location of your CSV file.
        income = pd.read_csv("C:\\Users\\Hp\\Python\\Basics\\income.csv")
         Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        4 C California 1685349 1675807 1889570 1480280 1735069 1812546

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        4 1487315 1663809 1624509 1639670 1921845 1156536 1388461 1644607
        Knowing the Variable types

        You can use the dataFrameName.dtypes command to extract the information of types of variables stored in the data frame.
        income.dtypes 
        Index    object
        State object
        Y2002 int64
        Y2003 int64
        Y2004 int64
        Y2005 int64
        Y2006 int64
        Y2007 int64
        Y2008 int64
        Y2009 int64
        Y2010 int64
        Y2011 int64
        Y2012 int64
        Y2013 int64
        Y2014 int64
        Y2015 int64
        dtype: object

        Here 'object' means strings or character variables. 'int64' refers to numeric variables (without decimals).

        To see the variable type of one variable (let's say "State") instead of all the variables, you can use the command below -
        income['State'].dtypes
        It returns dtype('O'). In this case, 'O' refers to object i.e. type of variable as character.

        Changing the data types

Y2008 is an integer. Suppose we want to convert it to float (a numeric variable with decimals); we can write:
        income.Y2008 = income.Y2008.astype(float)
        income.dtypes
        Index     object
        State object
        Y2002 int64
        Y2003 int64
        Y2004 int64
        Y2005 int64
        Y2006 int64
        Y2007 int64
        Y2008 float64
        Y2009 int64
        Y2010 int64
        Y2011 int64
        Y2012 int64
        Y2013 int64
        Y2014 int64
        Y2015 int64
        dtype: object

        To view the dimensions or shape of the data
        income.shape
         (51, 16)

        51 is the number of rows and 16 is the number of columns.

        You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R). 
        income.shape[0]
        income.shape[1]

        To view only some of the rows

By default head( ) shows the first 5 rows. If we want to see a specific number of rows, we can mention it in the parentheses. Similarly, the tail( ) function shows the last 5 rows by default.
        income.head()
        income.head(2)
        income.tail()
        income.tail(2)

        Alternatively, the command income[0:5] can be used to fetch first five rows.

        Extract Unique Values

The unique() function shows the unique values (levels or categories) of a column.
        income.Index.unique()
        array(['A', 'C', 'D', ..., 'U', 'V', 'W'], dtype=object)


        The nunique( ) shows the number of unique values.
        income.Index.nunique()
It returns 19 as the Index column contains 19 distinct values.

        Generate Cross Tab

        pd.crosstab( ) is used to create a bivariate frequency distribution. Here the bivariate frequency distribution is between Index and State columns.
        pd.crosstab(income.Index,income.State)

        Creating a frequency distribution

income.Index selects the 'Index' column of the 'income' dataset and value_counts( ) creates a frequency distribution. By default ascending = False, i.e. it shows the 'Index' value with the maximum frequency at the top.
        income.Index.value_counts(ascending = True)
        F    1
        G 1
        U 1
        L 1
        H 1
        P 1
        R 1
        D 2
        T 2
        S 2
        V 2
        K 2
        O 3
        C 3
        I 4
        W 4
        A 4
        M 8
        N 8
        Name: Index, dtype: int64

        To draw the samples
income.sample( ) is used to draw random samples from the dataset containing all the columns. Here n = 5 means we want 5 rows, and frac = 0.1 tells that we want 10 percent of the rows as the sample.
        income.sample(n = 5)
        income.sample(frac = 0.1)
        Selecting only a few of the columns
To select only specific columns we use either the loc[ ] or iloc[ ] commands. The rows or columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.
        income.loc[:,["Index","State","Y2008"]]
        income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns
        #In the above command both Index and Y2008 are included.
        income.iloc[:,0:5]  #Columns from 1 to 5 are included. 6th column not included
The difference between loc and iloc is that loc requires the column (row) names to be selected, while iloc requires the column (row) indices (positions).

        Renaming the variables
We create a dataframe 'data' containing the names of people and their respective zodiac signs.
        data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
        data 
               A          B
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        If all the columns are to be renamed then we can use data.columns and assign the list of new column names.
        #Renaming all the variables.
        data.columns = ['Names','Zodiac Signs']

           Names Zodiac Signs
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.
        #Renaming only some of the variables.
        data.rename(columns = {"Names":"Cust_Name"},inplace = True)
          Cust_Name Zodiac Signs
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
By default in pandas inplace = False, which means that no changes are made to the original dataset. Thus if we wish to alter the original dataset we need to set inplace = True.

Suppose we want to replace only a particular character in the column names; then we can use the str.replace( ) function. For example, replacing "Y" with "Year " in the variable names:
        income.columns = income.columns.str.replace('Y' , 'Year ')
        income.columns
        Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
        'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
        'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],
        dtype='object')

        Setting one column in the data frame as the index
Using set_index("column name") we can set that column as the index, and it is removed from the regular columns.
        income.set_index("Index",inplace = True)
        income.head()
        #Note that the indices have changed and Index column is now no more a column
        income.columns
        income.reset_index(inplace = True)
        income.head()
reset_index( ) restores the default integer indices.

        Removing the columns and rows
To drop a column we use drop( ), where the first argument is a list of columns to be removed. By default axis = 0, which means the operation applies row-wise (to the index). To remove a column we need to set axis = 1.
        income.drop('Index',axis = 1)

        #Alternatively
        income.drop("Index",axis = "columns")
        income.drop(['Index','State'],axis = 1)
        income.drop(0,axis = 0)
        income.drop(0,axis = "index")
        income.drop([0,1,2,3],axis = 0)
Also, inplace = False by default, thus no alterations are made to the original dataset. axis = "columns" and axis = "index" mean that a column or a row (index) should be removed, respectively.

        Sorting the data
        To sort the data sort_values( ) function is deployed. By default inplace = False and ascending = True.
        income.sort_values("State",ascending = False)
        income.sort_values("State",ascending = False,inplace = True)
        income.Y2006.sort_values() 
We have duplicated values of Index, thus we sort the dataframe first by Index and then, within each Index, by Y2002:
        income.sort_values(["Index","Y2002"]) 
Some arithmetic operations
Using eval( ), arithmetic operations on various columns of a dataset can be carried out.
        income["difference"] = income.Y2008-income.Y2009

        #Alternatively
        income["difference2"] = income.eval("Y2008 - Y2009")
        income.head()
          Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        4 C California 1685349 1675807 1889570 1480280 1735069 1812546

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015 \
        0 1945229.0 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826.0 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886.0 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104.0 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        4 1487315.0 1663809 1624509 1639670 1921845 1156536 1388461 1644607

        difference difference2
        0 1056.0 1056.0
        1 115285.0 115285.0
        2 198556.0 198556.0
        3 -440876.0 -440876.0
        4 -176494.0 -176494.0

        income.ratio = income.Y2008/income.Y2009
The above command does not work; to create new columns we need to use square brackets.
We can also use the assign( ) function, but this command does not make changes to the original data as there is no inplace parameter; hence we need to save the result in a new dataframe.
        data = income.assign(ratio = (income.Y2008 / income.Y2009))
        data.head()

        Finding Descriptive Statistics
describe( ) is used to find statistics like the mean, minimum, quartiles etc. for numeric variables.
        income.describe() #for numeric variables
To find the total count, the most frequently occurring string and its frequency, we write include = ['object']
        income.describe(include = ['object'])  #Only for strings / objects
        Mean, median, maximum and minimum can be obtained for a particular column(s) as:
        income.Y2008.mean()
        income.Y2008.median()
        income.Y2008.min()
        income.loc[:,["Y2002","Y2008"]].max()

        Groupby function
To group the data by a categorical variable we use the groupby( ) function, and we can then perform operations on each category.
        income.groupby("Index").Y2008.min()
        income.groupby("Index")["Y2008","Y2010"].max()
The agg( ) function is used to apply several summary functions to a given variable.
        income.groupby("Index").Y2002.agg(["count","min","max","mean"])
        income.groupby("Index")["Y2002","Y2003"].agg(["count","min","max","mean"])
        The following command finds minimum and maximum values for Y2002 and only mean for Y2003
        income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
                  Y2002                 Y2003
        min max mean
        Index
        A 1170302 1742027 1810289.000
        C 1343824 1685349 1595708.000
        D 1111437 1330403 1631207.000
        F 1964626 1964626 1468852.000
        G 1929009 1929009 1541565.000
        H 1461570 1461570 1200280.000
        I 1353210 1776918 1536164.500
        K 1509054 1813878 1369773.000
        L 1584734 1584734 1110625.000
        M 1221316 1983285 1535717.625
        N 1395149 1885081 1382499.625
        O 1173918 1802132 1569934.000
        P 1320191 1320191 1446723.000
        R 1501744 1501744 1942942.000
        S 1159037 1631522 1477072.000
        T 1520591 1811867 1398343.000
        U 1771096 1771096 1195861.000
        V 1134317 1146902 1498122.500
        W 1677347 1977749 1521118.500

        Filtering
        To filter only those rows which have Index as "A" we write:
        income[income.Index == "A"]

        #Alternatively
        income.loc[income.Index == "A",:]
          Index     State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        To select the States having Index as "A":
        income.loc[income.Index == "A","State"]
        income.loc[income.Index == "A",:].State
To filter the rows with Index as "A" and income for 2002 greater than 1500000:
        income.loc[(income.Index == "A") & (income.Y2002 > 1500000),:]
        To filter the rows with index either "A" or "W", we can use isin( ) function:
        income.loc[(income.Index == "A") | (income.Index == "W"),:]

        #Alternatively.
        income.loc[income.Index.isin(["A","W"]),:]
           Index          State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        47 W Washington 1977749 1687136 1199490 1163092 1334864 1621989
        48 W West Virginia 1677347 1380662 1176100 1888948 1922085 1740826
        49 W Wisconsin 1788920 1518578 1289663 1436888 1251678 1721874
        50 W Wyoming 1775190 1498098 1198212 1881688 1750527 1523124

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        47 1545621 1555554 1179331 1150089 1775787 1273834 1387428 1377341
        48 1238174 1539322 1539603 1872519 1462137 1683127 1204344 1198791
        49 1980167 1901394 1648755 1940943 1729177 1510119 1701650 1846238
        50 1587602 1504455 1282142 1881814 1673668 1994022 1204029 1853858
        Alternatively we can use query( ) function and write our filtering criteria:
        income.query('Y2002>1700000 & Y2003 > 1500000')

        Dealing with missing values
We create a new dataframe named 'crops', and to create a NaN value we use np.nan after importing numpy.
        import numpy as np
        mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
                'Yield': [1010, 1025.2, 1404.2, 1251.7],
                'cost' : [102, np.nan, 20, 68]}
        crops = pd.DataFrame(mydata)
        crops
        isnull( ) returns True and notnull( ) returns False if the value is NaN.
        crops.isnull()  #same as is.na in R
        crops.notnull()  #opposite of previous command.
        crops.isnull().sum()  #No. of missing values.
crops.cost.isnull() first subsets 'cost' from the dataframe and then returns a logical vector using isnull()

        crops[crops.cost.isnull()] #shows the rows with NAs.
        crops[crops.cost.isnull()].Crop #shows the rows with NAs in crops.Crop
        crops[crops.cost.notnull()].Crop #shows the rows without NAs in crops.Crop
To drop all the rows which have missing values in any column we use dropna(how = "any"). By default inplace = False. how = "all" means drop a row only if all the elements in that row are missing.

        crops.dropna(how = "any").shape
        crops.dropna(how = "all").shape  
        To remove NaNs if any of 'Yield' or'cost' are missing we use the subset parameter and pass a list:
        crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
        crops.dropna(subset = ['Yield',"cost"],how = 'all').shape
Replacing the missing values in the 'cost' column with "UNKNOWN":
        crops['cost'].fillna(value = "UNKNOWN",inplace = True)
        crops

        Dealing with duplicates
We create a new dataframe comprising items and their respective prices.
        data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
        data
                     Items  Price
        0 TV 10000
        1 Washing Machine 50000
        2 Mobile 20000
        3 TV 10000
        4 TV 10000
        5 Washing Machine 40000
duplicated() returns a logical vector which is True when a duplicated row is encountered.
        data.loc[data.duplicated(),:]
        data.loc[data.duplicated(keep = "first"),:]
By default keep = 'first', i.e. the first occurrence is considered a unique value and its repetitions are considered duplicates.
If keep = "last", the last occurrence is considered a unique value and all its earlier repetitions are considered duplicates.
        data.loc[data.duplicated(keep = "last"),:] #last entries are not there,indices have changed.
        If keep = "False" then it considers all the occurences of the repeated observations as duplicates.
        data.loc[data.duplicated(keep = False),:]  #all the duplicates, including unique are shown.
To drop the duplicates, drop_duplicates is used with default inplace = False; keep = 'first', 'last' or False has the same meaning as in duplicated( ).
        data.drop_duplicates(keep = "first")
        data.drop_duplicates(keep = "last")
        data.drop_duplicates(keep = False,inplace = True)  #by default inplace = False
        data

        Creating dummies
        Now we will consider the iris dataset
        iris = pd.read_csv("C:\\Users\\Hp\\Desktop\\work\\Python\\Basics\\pandas\\iris.csv")
        iris.head()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
        0 5.1 3.5 1.4 0.2 setosa
        1 4.9 3.0 1.4 0.2 setosa
        2 4.7 3.2 1.3 0.2 setosa
        3 4.6 3.1 1.5 0.2 setosa
        4 5.0 3.6 1.4 0.2 setosa
The map( ) function is used to match the values and replace them, automatically creating a new series.
        iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
        iris.head()
To create dummies, get_dummies( ) is used. The argument prefix = "Species" adds the prefix 'Species' to the names of the new columns created.
        pd.get_dummies(iris.Species,prefix = "Species")
        pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1]  #1 is not included
        species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]
        With concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined columnwise.
        iris = pd.concat([iris,species_dummies],axis = 1)
        iris.head()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  \
        0 5.1 3.5 1.4 0.2 setosa
        1 4.9 3.0 1.4 0.2 setosa
        2 4.7 3.2 1.3 0.2 setosa
        3 4.6 3.1 1.5 0.2 setosa
        4 5.0 3.6 1.4 0.2 setosa

        Species_setosa Species_versicolor Species_virginica
        0 1 0 0
        1 1 0 0
        2 1 0 0
        3 1 0 0
        4 1 0 0
It is usual that for a variable with 'n' categories we create 'n-1' dummies; thus, to drop the first dummy column we write drop_first = True
        pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()

        Ranking
         To create a dataframe of all the ranks we use rank( )
        iris.rank() 
        Ranking by a specific variable
        Suppose we want to rank the Sepal.Length for different species in ascending order:
        iris['Rank'] = iris.sort_values(['Sepal.Length'], ascending=[True]).groupby(['Species']).cumcount() + 1
        iris.head( )

        #Alternatively
        iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=1)
        iris.head()

        Calculating the Cumulative sum
        Using cumsum( ) function we can obtain the cumulative sum
        iris['cum_sum'] = iris["Sepal.Length"].cumsum()
        iris.head()
        Cumulative sum by a variable
        To find the cumulative sum of sepal lengths for different species we use groupby( ) and then use cumsum( )
        iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()
        iris.head()

        Calculating the percentiles.
        Various quantiles can be obtained by using quantile( )
        iris.quantile(0.5)
        iris.quantile([0.1,0.2,0.5])
        iris.quantile(0.55)

        if else
        We create a new dataframe of students' name and their respective zodiac signs.
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
After importing numpy we can use np.where. The first argument is the condition to be evaluated, the second argument is the value if the condition is True and the last argument is the value if the condition is False.
        import numpy as np
        students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')
        students
              Names Zodiac Signs flag
        0 John Aquarius yes
        1 Mary Libra no
        2 Henry Gemini yes
        3 Augustus Pisces no
        4 Kenny Virgo no

        Multiple conditions for if else
We create a list of conditions and their respective values when evaluated True, and use np.select, where 'default' is the value used if all the conditions are False.
        conditions = [
            (students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
            (students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
            (students['Zodiac Signs'] == 'Pisces')]
        choices = ['yellow', 'blue', 'purple']
        students['color'] = np.select(conditions, choices, default='black')
        students
              Names Zodiac Signs flag   color
        0 John Aquarius yes yellow
        1 Mary Libra no blue
        2 Henry Gemini yes black
        3 Augustus Pisces no purple
        4 Kenny Virgo no black
When using a simple 'if else' we need to take care of the indentation. Python does not use curly braces for loops and if-else blocks.
        x = 1001   #Take care of the indentation.
        if x%10 == 0:
            print("x is a multiple of 10")
        else:
            print("It is not a multiple of 10")

        Select numeric or categorical columns only
        To include numeric columns we use select_dtypes( ) 
        data1 = iris.select_dtypes(include=[np.number])
        data1.head()
The _get_numeric_data( ) method also provides a utility to select the numeric columns only.
        data3 = iris._get_numeric_data()
        data3.head(3)
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  cum_sum  cumsum2
        0 5.1 3.5 1.4 0.2 5.1 5.1
        1 4.9 3.0 1.4 0.2 10.0 10.0
        2 4.7 3.2 1.3 0.2 14.7 14.7
        For selecting categorical variables
        data4 = iris.select_dtypes(include = ['object'])
        data4.head(2)
         Species
        0 setosa
        1 setosa

        Concatenating
        We create 2 dataframes containing the details of the students:
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
        students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                  'Marks' : [50,81,98,25,35]})
Using the pd.concat( ) function we can join the 2 dataframes:
        data = pd.concat([students,students2])  #by default axis = 0
           Marks     Names Zodiac Signs
        0 NaN John Aquarius
        1 NaN Mary Libra
        2 NaN Henry Gemini
        3 NaN Augustus Pisces
        4 NaN Kenny Virgo
        0 50.0 John NaN
        1 81.0 Mary NaN
        2 98.0 Henry NaN
        3 25.0 Augustus NaN
        4 35.0 Kenny NaN
By default axis = 0, thus the new dataframe is added row-wise. If a column is not present in one of the dataframes, NaNs are created. To join column-wise we set axis = 1
        data = pd.concat([students,students2],axis = 1)
        data
              Names Zodiac Signs  Marks     Names
        0 John Aquarius 50 John
        1 Mary Libra 81 Mary
        2 Henry Gemini 98 Henry
        3 Augustus Pisces 25 Augustus
        4 Kenny Virgo 35 Kenny
        Using append function we can join the dataframes row-wise
        students.append(students2)  #for rows
        Alternatively we can create a dictionary of the two data frames and can use pd.concat to join the dataframes row wise
        classes = {'x': students, 'y': students2}
         result = pd.concat(classes)
        result 
             Marks     Names Zodiac Signs
        x 0 NaN John Aquarius
        1 NaN Mary Libra
        2 NaN Henry Gemini
        3 NaN Augustus Pisces
        4 NaN Kenny Virgo
        y 0 50.0 John NaN
        1 81.0 Mary NaN
        2 98.0 Henry NaN
        3 25.0 Augustus NaN
        4 35.0 Kenny NaN

Merging or joining on the basis of a common variable
We take 2 dataframes with different numbers of observations:
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
        students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                  'Marks' : [50,81,98,25,35]})
Using pd.merge we can join the two dataframes. on = 'Names' specifies that 'Names' is the common variable on the basis of which the dataframes are to be combined.
        result = pd.merge(students, students2, on='Names')  #it only takes intersections
        result
           Names Zodiac Signs  Marks
        0 John Aquarius 50
        1 Mary Libra 81
        2 Henry Gemini 98
By default how = "inner", thus it takes only the common elements in both the dataframes. If you want all the elements from both the dataframes, set how = "outer".
result = pd.merge(students, students2, on='Names', how = "outer")  #it takes the union
        result
              Names Zodiac Signs  Marks
        0 John Aquarius 50.0
        1 Mary Libra 81.0
        2 Henry Gemini 98.0
        3 Maria Capricorn NaN
        4 Augustus NaN 25.0
        5 Kenny NaN 35.0
To keep all the rows of the left dataframe (and only the matching rows of the right one) set how = 'left'
        result = pd.merge(students, students2, on='Names',how = "left")
        result
           Names Zodiac Signs  Marks
        0 John Aquarius 50.0
        1 Mary Libra 81.0
        2 Henry Gemini 98.0
        3 Maria Capricorn NaN
Similarly, how = 'right' keeps all the rows of the right dataframe and only the matching rows of the left one.
        result = pd.merge(students, students2, on='Names',how = "right",indicator = True)
        result
              Names Zodiac Signs  Marks      _merge
        0 John Aquarius 50 both
        1 Mary Libra 81 both
        2 Henry Gemini 98 both
        3 Augustus NaN 25 right_only
        4 Kenny NaN 35 right_only
indicator = True creates a _merge column indicating whether each row is present in both dataframes or only in the left or right dataframe.
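A common use of this _merge column is to keep only the rows that appear in just one of the dataframes (an 'anti join'). Below is a minimal sketch reusing the students and students2 dataframes above; the variable names left_only and right_only are ours, for illustration only.
result = pd.merge(students, students2, on='Names', how='outer', indicator=True)
left_only = result[result['_merge'] == 'left_only']     # rows present only in students
right_only = result[result['_merge'] == 'right_only']   # rows present only in students2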

        Predictive Modeling using SAS & R Online Training

        First Online Training Batch
The next instructor-led online training batch will commence on February 10, 2018. We are offering courses on SAS, R and Predictive Modeling. In this program you will get access to live lectures plus recorded videos from any part of the world via web conference mode. You can also chat or ask your questions verbally over VoIP in real time to get your doubts cleared.
        1. Practical SAS Programming - Learning SAS by Case Studies
        2. Predictive Modeling with SAS - Modeling with Hands-on Examples plus Domain Knowledge
        3. Data Science using R - Practical Data Science Course (Incld. R Programming, Data Science and Domain Knowledge)

Batch : 10th February, Saturday     Mode : Live Instructor-led


        Practical SAS Programming
        Rs 20,000
        ($325)
        • Special Price : Get 10% off till 8th Jan, 2018
        • Base and Advanced SAS Programming
        • Instructor-led live class + Recorded videos
        • Duration : 8 Weeks (100 hours)
        • Timing : 17:00 - 19:00 IST
        • Live Projects + Scenario-Based Questions
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        • Pay Rs1,000 (USD20) now to reserve your seat and the remaining balance 10 days before the batch commencement date
        Indian Users
        All Users (Except India)


        Predictive Modeling using SAS
        Rs 25,000
        ($400)
        • Special Price : Get 10% off till 8th Jan, 2018
        • Predictive Modeling with SAS                    
        • Instructor-led live class + Recorded videos
        • Duration : 8 - 10 Weeks (100 hours)
        • Timing : 19:00 - 21:00 IST
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        • Pay Rs1,000 (USD20) now to reserve your seat and the remaining balance 10 days before the batch commencement date
        Indian Users
        All Users (Except India)


        R Programming + Data Science with R 
        Rs 30,000
        ($480)
        • Special Price : Get 10% off till 8th Jan, 2018
        • R Programming + Predictive Modeling with R
        • Instructor-led live class + Recorded videos
        • Duration : 10-12 Weeks (120 hours)
        • Timing : 21:15 - 23:15 IST
        • Live Projects + Domain Knowledge
        • Case Studies
        • Hands-on Examples
        • Weekly Assignments
        • Certification
        • Job Placement Assistance
        • Weekend Classes
        • Money Back Guarantee
        • Pay Rs1,000 (USD20) now to reserve your seat and the remaining balance 10 days before the batch commencement date
        Indian Users
        All Users (Except India)



        Combo Deals - Spend Less, Learn More
        Pay only Rs 35,000 ($600) on purchase of 'Practical SAS Programming' and 'Predictive Modeling with SAS' courses
        Offer expires on 8th January,2018
        Enroll Now - Indian Users 
        Enroll Now - All Users (Except India)

        What is Instructor-led live program?
It is an interactive training program. Learners get access to live lectures via webinar mode and can chat or ask their questions verbally over VoIP in real time to get their doubts cleared. You can also go through the video recording if you miss a class.

        Money Back Guarantee?
If you do not like our training, you can ask for a 100% course fee refund after your first live session. No questions asked refund policy!

        What is the duration of these programs?
        These are weekend programmes comprising 100-130 hours. Classes will be held on every Saturday and Sunday.

        SAS Programming : 17:00 - 19:00 IST (07:30AM - 09:30AM  Eastern Daylight Time)
        Predictive Modeling with SAS : 19:00 - 21:00 IST (09:30AM - 11:30AM Eastern Daylight Time)
        Data Science with R : 21:15 - 23:15 IST (11:45AM - 01:45PM Eastern Daylight Time)

        The course duration is as follows -
        1. Practical SAS Programming -100 hours (At least 50 hours live training + 5 hours video based training + ~60 hours of Practice and Self Study)
        2. Predictive Modeling with SAS - 100 hours (Includes hours of Video based training and Practice and Self Study)
        3. Data Science with R - 120 hours (At least 40 hours live training + 7 hours video based training + ~80 hours of Practice and Self Study)

        If I opt for all the 3 courses, will classes be scheduled at the same time?
All classes will be scheduled on weekends but not at the same time; they run one after another. For example, if one class gets over at 5, the next class will start at 5.

        How We are different from other institutes?
        Here are some of the features of ListenData that makes us better than other training institutes.
        1. Explain Advanced Statistical and Machine Learning Algorithms in Simple English. We make classes more logical and understandable than just telling concepts.
        2. Practical Application of Techniques using Real-world Datasets. No sample or cleaned dataset.
3. Domain Knowledge - It is the most important element of a predictive modeling project. People who lack domain knowledge find it difficult to crack interviews in spite of having knowledge of predictive modeling.
        4. Hands-on Model Development and Validation
        5. Strategies to implement predictive model
        6. New algorithms to solve problems efficiently
        7. Explain complex topics via visual lessons

        Who should do these courses?
        These courses are ideal for candidates who want to make a career in analytics.
1. Any candidate pursuing graduation / post-graduation, or already a graduate, can apply for this course. No particular specialization is required prior to applying for these courses. You can be from any educational background like Engineering, Economics, Statistics, Mathematics, Commerce, Business Management, Operational Research etc.
        2. Anyone who is planning a career shift to analytics. It does not matter if you are a network engineer or financial analyst. You can go ahead with these courses as they do not require any prior knowledge of programming or statistics.

Every training institute promises a job. Why should I trust you?
Let's be honest! It's a universal fact that no college or training institute can provide a 100% job guarantee. If they claim a 100% job guarantee, they are luring learners with false promises. Even IITs do not hit a 100% score. Some facts - only 66% of IITians landed a job offer via campus recruitment in 2016-17, as against 79% in 2015-16 and 78% in 2014-15, according to the HRD ministry.

        Let me list down the common reasons why people don't get jobs in analytics industry even after completing training from some colleges / institutes -
        1. No hands-on experience
        2. No domain knowledge
        3. No theoretical knowledge of statistical concepts
        4. Poor analytical skill
The objective of this program is to cover the first three points above in detail. In addition, we provide job placement assistance to all students and will keep you informed about current openings in the analytics industry. We are in constant contact with job consultancy firms and maintain a solid network of analytics professionals.

        The decline of SAS Jobs and rise of R?
I have been working in SAS for close to 7 years and have worked with 4 organizations (instability in career! :D). Whenever I look for a job change, I do not see any decline of SAS jobs in the market. Don't trust me - go to job portals and search 'SAS'! List of Companies using SAS It is a big hit in banks, insurance, telecom and pharmaceutical companies. SAS is still a world leader in advanced analytics and has over 40,000 customers worldwide. It has been tagged a 'leader' consistently in advanced analytics platforms as per the Gartner 2015 and 2016 reports. It is one of the most sought after skills in the job market. Learning SAS will help you scale up your skills, which in turn helps boost your career.

        At the same time, R has gained popularity. It is a language of choice for data scientists. It makes advanced statistical techniques and machine learning algorithms easy to implement. It is being used as a primary tool in IT, ecommerce, startups, HR, service and product based companies and secondary tool in banks, insurance and telecom companies. List of Companies using R

        Final Comment - You should not get into language wars and should focus on learning both the languages as jobs are evolving very fast. Companies prefer candidates who know both SAS & R.

        In case if I miss any live session?
        Every class is recorded. We will provide you recording of every session.

        I never studied Programming or Statistics during graduation. Can I still apply for this course?
        Yes, these courses are designed to keep in mind the needs of non-programmers/non-statisticians. Only prerequisite is hard work and zeal for learning.

        Is my registration fees refundable?
100% refundable. In case you want to opt out of the course for any reason, you can ask for a 100% refund within 7 days of registration. If you want to continue, it is automatically adjusted against the total fees. In other words, you pay $20 (Rs 1000) less than the total fees.

        About Instructor
        Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. He has worked with companies like Aon, Cognizant, Genpact, RBS. He has handled global clients in various domains like retail and commercial banking, Telecom, HR and Automotive. He has worked extensively in various data science projects such as Customer Attrition, Customer Lifetime Value Model, Propensity Model, Opinion / Sentiment Mining, Geo Analytics, Credit risk scorecard, Portfolio Optimization, Pricing Analytics, Cross sell/Up sell campaign models, Survey Analytics, Customer Segmentation, Market Benchmarking, Employee Attrition, Employee Engagement etc.

        Any Questions?
        Please feel free to write me at deepanshu.bhalla@outlook.com OR Join me on linkedin

        Predictive Modeling using SAS & R Training




        Curriculum - Practical SAS Programming
        1. Introduction to SAS
        2. How SAS works
        3. Import Raw Data Files - Basics
        4. Import Raw Data Files - Special Cases
        5. Importing / Exporting Data with Procedures
        6. Exploring Data - Various Methods
        7. Data Subsetting
        8. Data Manipulation - Basics
        9. Data Manipulation - Intermediate
        10. Data Manipulation - Advanced
        11. Do Loops and Arrays
        12. Merging Data
        13. Appending Data
        14. Character & Numeric Functions
        15. Date Functions
        16. Reporting - Creating tabular reports
        17. Proc SQL - Part I
        18. Proc SQL - Part II
        19. Proc SQL - Part III
        20. SAS Macros - Basics
        21. SAS Macros - Intermediate
        22. SAS Macros - Advanced
        23. SAS Macros - Debugging Tips
        24. Efficient SAS Programming Tips
        25. Connect to Databases using SAS
        26. Interview Preparation - Scenario Based Questions
        27. Live Project




        Curriculum - Predictive Modeling using SAS
        1. Introduction to Statistics & Modeling
        2. Marketing Analytics : Applications
        3. Predictive Modeling in Financial Services Industry
        4. Predictive Modeling in HR
        5. SAS Programming - Basics
        6. SAS Programming - Intermediate
        7. Descriptive Statistics with SAS
        8. Hypothesis Testing with SAS
        9. Correlation Analysis with SAS
        10. Steps of Predictive Modeling
        11. Data Preparation in Predictive Modeling
        12. Variable Selection Methods in Predictive Modeling
        13. Segmentation - Introduction
        14. Segmentation - Cluster Analysis : Theory
        15. Segmentation - Cluster Analysis : Data Preparation
        16. Segmentation - Cluster Analysis : k-means and Hierarchical
        17. Segmentation - Cluster Analysis : Cluster Performance
        18. Principal Component Analysis (PCA) - Theory
        19. Running and Understanding PCA with SAS
        20. Linear Regression - Theory
        21. Linear Regression - Assumptions and Treatment
        22. Linear Regression - Important Metrics
        23. Linear Regression - Variable Selection Methods
        24. Linear Regression - Model Development
        25. Linear Regression - Model Validation
        26. Linear Regression - Model Performance
        27. Linear Regression - Model Scoring
        28. Linear Regression - Model Implementation
        29. Logistic Regression - Theory
        30. Logistic Regression - Assumptions and Treatment
        31. Logistic Regression - Important Metrics
        32. Logistic Regression - Variable Selection Methods
        33. Logistic Regression - Model Development
        34. Logistic Regression - Model Validation
        35. Logistic Regression - Model Performance
        36. Logistic Regression - Model Implementation
        37. Decision Tree - How it works
        38. Decision Tree - Model Development
        39. Decision Tree - Model Validation
        40. Decision Tree - Model Performance
        41. Decision Tree - Model Implementation
        42. Time Series Forecasting - Theory
        43. Time Series Analysis with SAS
        44. Special Cases - Handle rare event model
        45. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        46. Case Studies - Customer Segmentation
        47. Case Studies - Probability of Default
        48. Case Studies - Employee Attrition
        49. Case Studies - Time Series Forecasting
        50. Interview Tips - Common Interview Questions



        Curriculum - R Programming + Data Science with R
        1. Introduction to R
        2. Introduction to RStudio
        3. Data Structures in R
        4. Importing / Exporting Data in R
        5. Data Exploration
        6. Data Manipulation with dplyr package - Basics
        7. Data Manipulation with dplyr package - Intermediate
        8. Data Manipulation with dplyr package - Advanced
        9. Character and Numeric Functions in R
10. Date & Time Functions in R
        11. Data Visualization in R
        12. Loops in R (Apply Family of Functions & For Loop)
        13. R Functions - Part I
        14. R Functions - Part II
        15. Introduction to Data Science
        16. Marketing Analytics : Applications
        17. Predictive Modeling in Financial Services Industry
        18. Predictive Modeling in HR
        19. Hypothesis Testing with R
        20. Correlation Analysis with R
        21. Steps of Predictive Modeling
        22. Data Preparation in Predictive Modeling
        23. Variable Selection Methods in Predictive Modeling
        24. Segmentation - Introduction
        25. Segmentation - Cluster Analysis : Theory
        26. Segmentation - Cluster Analysis : Data Preparation
        27. Segmentation - Cluster Analysis : k-means and Hierarchical
        28. Segmentation - Cluster Analysis : Cluster Performance
        29. Principal Component Analysis (PCA) - Theory
        30. Running and Understanding PCA with R
        31. Linear Regression - Theory
        32. Linear Regression - Assumptions and Treatment
        33. Linear Regression - Important Metrics
        34. Linear Regression - Variable Selection Methods
        35. Linear Regression - Model Development
        36. Linear Regression - Model Validation
        37. Linear Regression - Model Performance
        38. Linear Regression - Model Scoring
        39. Linear Regression - Model Implementation
        40. Logistic Regression - Theory
        41. Logistic Regression - Assumptions and Treatment
        42. Logistic Regression - Important Metrics
        43. Logistic Regression - Variable Selection Methods
        44. Logistic Regression - Model Development
        45. Logistic Regression - Model Validation
        46. Logistic Regression - Model Performance
        47. Logistic Regression - Model Implementation
        48. Decision Tree - How it works
        49. Decision Tree - Model Development
        50. Decision Tree - Model Validation
        51. Decision Tree - Model Performance
        52. Decision Tree - Model Implementation
        53. Machine Learning - Basics
        54. Random Forest - How it works
        55. Random Forest vs. Decision Tree
        56. Random Forest - Model Development
        57. Random Forest - Model Validation
        58. Random Forest - How it works
        59. Gradient Boosting - How it works
        60. Gradient Boosting - Model Development
        61. Gradient Boosting - Model Validation
        62. Support Vector Machine - How it works
        63. Support Vector Machine - Model Development
        64. Support Vector Machine - Model Validation
        65. Ensemble Stacking / Blending
        66. Time Series Forecasting - Theory
        67. Time Series Analysis with R
        68. Special Cases - Handle rare event model
        69. Text Mining Basics & Applications
        70. Case Studies - Attrition / Churn Model (BFSI / Telecom)
        71. Case Studies - Customer Segmentation
        72. Case Studies - Probability of Default
        73. Case Studies - HR Drivers Analysis
        74. Case Studies - Sales Forecasting
        75. Case Studies - Text Mining
        76. Interview Tips - Common Interview Questions

        NumPy Tutorial : Step by Step Guide

NumPy (an acronym for 'Numerical Python' or 'Numeric Python') is one of the most essential packages for speedy mathematical computation on arrays and matrices in Python. It is also quite useful while dealing with multi-dimensional data. It is a blessing for integrating C, C++ and FORTRAN tools. It also provides numerous functions for Fourier transforms (FT) and linear algebra.

        Python : Numpy Tutorial

        Why NumPy instead of lists?
One might wonder why one should prefer NumPy arrays when we can create lists having the same data type. If that question rings a bell, the following reasons may convince you (a rough comparison is sketched below):
1. NumPy arrays have contiguous memory allocation, so the same data stored as a list requires more space than it does as an array.
2. They are faster to work with and hence more efficient than lists.
3. They are more convenient to deal with.
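As a rough illustration of points 1 and 2, the sketch below compares the memory footprint and the speed of an element-wise operation for a list versus a NumPy array. Exact numbers depend on your machine and Python version; this is only meant to show the general pattern.
import sys
import timeit
import numpy as np

lst = list(range(1000))
arr = np.arange(1000)

# Memory: the list object plus its element objects vs. the array's data buffer
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)
print(list_bytes, arr.nbytes)

# Speed: element-wise doubling (vectorized arr * 2 vs. a Python loop)
print(timeit.timeit(lambda: [i * 2 for i in lst], number=1000))
print(timeit.timeit(lambda: arr * 2, number=1000))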
NumPy vs. Pandas
Pandas is built on top of NumPy. In other words, NumPy is required by pandas to make it work, so pandas is not an alternative to NumPy. Instead, pandas offers additional methods and a more streamlined way of working with numerical and tabular data in Python.
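To see this relationship in practice, the short sketch below (assuming a reasonably recent pandas version) pulls the underlying NumPy arrays out of a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'k1': [1, 2, 3], 'k2': [4.0, 5.0, 6.0]})

# Each column is backed by a NumPy array ...
print(type(df['k1'].to_numpy()))   # <class 'numpy.ndarray'>

# ... and the whole frame can be viewed as a 2D NumPy array
print(df.to_numpy())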

          Importing numpy

          Firstly you need to import the numpy library. Importing numpy can be done by running the following command:
          import numpy as np
It is a common convention to import numpy with the alias 'np'. If no alias is provided then to access the functions from numpy we would have to write numpy.function; the alias lets us write np.function instead. Some of the common functions of numpy are listed below -

Function : Task
array : Create a numpy array
ndim : Dimension of the array
shape : Size of the array (number of rows and columns)
size : Total number of elements in the array
dtype : Type of elements in the array, i.e., int64, character
reshape : Reshapes the array without changing the original shape
resize : Reshapes the array and also changes the original shape
arange : Create a sequence of numbers in an array
itemsize : Size in bytes of each item
diag : Create a diagonal matrix
vstack : Stacking vertically
hstack : Stacking horizontally

          1D array

          Using numpy an array is created by using np.array:
          a = np.array([15,25,14,78,96])
          a
          print(a)
          a
          Output: array([15, 25, 14, 78, 96])

          print(a)
          Output: [15 25 14 78 96]
Notice that in np.array square brackets are present; the absence of square brackets introduces an error. To print the array we can use print(a).

          Changing the datatype

          np.array( ) has an additional parameter of dtype through which one can define whether the elements are integers or floating points or complex numbers.
          a.dtype
          a = np.array([15,25,14,78,96],dtype = "float")
          a
          a.dtype
Initially the datatype of 'a' was 'int32' (the default integer type can vary by platform), which on modifying becomes 'float64'.

          Creating the sequence of numbers

          If you want to create a sequence of numbers then using np.arange we can get our sequence. To get the sequence of numbers from 20 to 29 we run the following command.
          b = np.arange(start = 20,stop = 30) 
          b
          array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
           In np.arange the end point is always excluded.

          Create an Arithmetic Progression

          np.arange provides an option of step which defines the difference between 2 consecutive numbers. If step is not provided then it takes the value 1 by default.

          Suppose we want to create an arithmetic progression with initial term 20 and common difference 2, upto 30; 30 being excluded.
          c = np.arange(20,30,2)   #30 is excluded.
          array([20, 22, 24, 26, 28])
Keep in mind that in np.arange( ) the stop argument is always excluded.

          Reshaping the arrays

          To reshape the array we can use reshape( ). 
          f = np.arange(101,113)
          f.reshape(3,4)
          f
           array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])

          Note that reshape() does not alter the shape of the original array. Thus to modify the original array we can use resize( ) 
          f.resize(3,4)
          f
          array([[101, 102, 103, 104],
          [105, 106, 107, 108],
          [109, 110, 111, 112]])

If a dimension is given as -1 in a reshape, the other dimension is calculated automatically, provided that the total number of elements in the array is divisible by the given dimension.
          f.reshape(3,-1)
          array([[101, 102, 103, 104],
          [105, 106, 107, 108],
          [109, 110, 111, 112]])

In the above code we only specified that we want 3 rows; Python automatically calculates the size of the other dimension, i.e. 4 columns.

          2D arrays

          A 2D array in numpy can be created in the following manner:
          g = np.array([(10,20,30),(40,50,60)])
          #Alternatively
          g = np.array([[10,20,30],[40,50,60]])
          g
          The dimension, total number of elements and shape can be ascertained by ndim, size and shape respectively:
          g.ndim
          g.size
          g.shape
          g.ndim
          Output: 2

          g.size
          Output: 6

          g.shape
          Output: (2, 3)

          Creating some usual matrices.

          numpy provides the utility to create some usual matrices which are commonly used for linear algebra.
          To create a matrix of all zeros of 2 rows and 4 columns we can use np.zeros( ):
          np.zeros( (2,4) )
          array([[ 0.,  0.,  0.,  0.],
          [ 0., 0., 0., 0.]])
          Here the dtype can also be specified. For a zero matrix the default dtype is 'float'. To change it to integer we write 'dtype = np.int16'
          np.zeros([2,4],dtype=np.int16)       
          array([[0, 0, 0, 0],
          [0, 0, 0, 0]], dtype=int16)
np.empty( ) creates an array without initializing its entries, so it contains whatever arbitrary values happen to be in memory - they are not meaningful random numbers.
          np.empty( (2,3) )       
          array([[  2.16443571e-312,   2.20687562e-312,   2.24931554e-312],
          [ 2.29175545e-312, 2.33419537e-312, 2.37663529e-312]])
Note: The results may vary every time you run np.empty.
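If what is actually needed is a matrix of random numbers between 0 and 1 (rather than uninitialized memory), numpy's random module can be used instead; a small sketch, added here for contrast:
np.random.rand(2, 3)   # 2 x 3 matrix of uniform random numbers in [0, 1)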
To create a matrix of all ones we write np.ones( ). We can create a 3 * 3 matrix of ones by:
          np.ones([3,3])
          array([[ 1.,  1.,  1.],
          [ 1., 1., 1.],
          [ 1., 1., 1.]])
          To create a diagonal matrix we can write np.diag( ). To create a diagonal matrix where the diagonal elements are 14,15,16 and 17 we write:
          np.diag([14,15,16,17])
          array([[14,  0,  0,  0],
          [ 0, 15, 0, 0],
          [ 0, 0, 16, 0],
          [ 0, 0, 0, 17]])
          To create an identity matrix we can use np.eye( ) .
          np.eye(5,dtype = "int")
          array([[1, 0, 0, 0, 0],
          [0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [0, 0, 0, 1, 0],
          [0, 0, 0, 0, 1]])
          By default the datatype in np.eye( ) is 'float' thus we write dtype = "int" to convert it to integers.

          Reshaping 2D arrays:

          To get a flattened 1D array we can use ravel( ) 
          g = np.array([(10,20,30),(40,50,60)])
          g.ravel()  
           array([10, 20, 30, 40, 50, 60])
          To change the shape of 2D array we can use reshape. Writing -1 will calculate the other dimension automatically and does not modify the original array.
          g.reshape(3,-1)  # returns the array with a modified shape
          #It does not modify the original array
          g.shape
           (2, 3)
          Similar to 1D arrays, using resize( ) will modify the shape in the original array.
          g.resize((3,2))
          g   #resize modifies the original array
          array([[10, 20],
          [30, 40],
          [50, 60]])

          Time for some matrix algebra.

          Let us create some arrays A,b and B and they will be used for this section:
          A = np.array([[2,0,1],[4,3,8],[7,6,9]])
          b = np.array([1,101,14])
          B = np.array([[10,20,30],[40,50,60],[70,80,90]])
           In order to get the transpose, trace and inverse we use A.transpose( ) , np.trace( ) and np.linalg.inv( ) respectively.
          A.T             #transpose
          A.transpose()  #transpose
          np.trace(A)  # trace
          np.linalg.inv(A)  #Inverse
          A.transpose()  #transpose
          Output:
          array([[2, 4, 7],
          [0, 3, 6],
          [1, 8, 9]])

          np.trace(A) # trace
          Output: 14

          np.linalg.inv(A) #Inverse
          Output:
          array([[ 0.53846154, -0.15384615, 0.07692308],
          [-0.51282051, -0.28205128, 0.30769231],
          [-0.07692308, 0.30769231, -0.15384615]])
          Note that transpose does not modify the original array.

          Matrix addition and subtraction can be done in the usual way:
          A+B
          A-B
          A+B
          Output:
          array([[12, 20, 31],
          [44, 53, 68],
          [77, 86, 99]])

          A-B
          Output:
          array([[ -8, -20, -29],
          [-36, -47, -52],
          [-63, -74, -81]])
          Matrix multiplication of A and B can be accomplished by A.dot(B). Where A will be the 1st matrix on the left hand side and B will be the second matrix on the right side.
          A.dot(B)
          array([[  90,  120,  150],
          [ 720, 870, 1020],
          [ 940, 1160, 1380]])
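In Python 3.5+ the @ operator performs the same matrix multiplication and is often more readable; a quick sketch using the A and B defined above:
A @ B              # same result as A.dot(B) for 2D arrays
np.matmul(A, B)    # the function form of the @ operator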
          To solve the system of linear equations: Ax = b we use np.linalg.solve( )
          np.linalg.solve(A,b)
          array([-13.92307692, -24.69230769,  28.84615385])
          The eigen values and eigen vectors can be calculated using np.linalg.eig( ) 
          np.linalg.eig(A)
          (array([ 14.0874236 ,   1.62072127,  -1.70814487]),
          array([[-0.06599631, -0.78226966, -0.14996331],
          [-0.59939873, 0.54774477, -0.81748379],
          [-0.7977253 , 0.29669824, 0.55608566]]))
The first array contains the eigenvalues and the second is the matrix of eigenvectors, where each column is the eigenvector corresponding to the respective eigenvalue.
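A quick way to check this interpretation is to verify that A times each eigenvector equals the corresponding eigenvalue times that eigenvector; a minimal sketch:
vals, vecs = np.linalg.eig(A)

# Column i of vecs is the eigenvector for vals[i], so A @ vecs should equal vecs * vals
print(np.allclose(A @ vecs, vecs * vals))   # True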


          Some Mathematics functions

          We can have various trigonometric functions like sin, cosine etc. using numpy:
          B = np.array([[0,-20,36],[40,50,1]])
          np.sin(B)
          array([[ 0.        , -0.91294525, -0.99177885],
          [ 0.74511316, -0.26237485, 0.84147098]])
          The resultant is the matrix of all sin( ) elements.
In order to raise each element to a power we use the ** operator
          B**2
          array([[   0,  400, 1296],
          [1600, 2500, 1]], dtype=int32)
          We get the matrix of the square of all elements of B.
In order to check whether a condition is satisfied by the elements of a matrix we write the criterion. For instance, to check if the elements of B are more than 25 we write:
          B>25
          array([[False, False,  True],
          [ True, True, False]], dtype=bool)
          We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False indicates that the condition is not satisfied.
          In a similar manner np.absolute, np.sqrt and np.exp return the matrices of absolute numbers, square roots and exponentials respectively.
          np.absolute(B)
          np.sqrt(B)
          np.exp(B)
          Now we consider a matrix A of shape 3*3:
          A = np.arange(1,10).reshape(3,3)
          A
          array([[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]])
          To find the sum, minimum, maximum, mean, standard deviation and variance respectively we use the following commands:
          A.sum()
          A.min()
          A.max()
          A.mean()
          A.std()   #Standard deviation
          A.var()  #Variance
          A.sum()
          Output: 45

          A.min()
          Output: 1

          A.max()
          Output: 9

          A.mean()
          Output: 5.0

          A.std() #Standard deviation
          Output: 2.5819888974716112

          A.var()
          Output: 6.666666666666667
          In order to obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively.
          A.argmin()
          A.argmax()
          A.argmin()
          Output: 0

          A.argmax()
          Output: 8
          If we wish to find the above statistics for each row or column then we need to specify the axis:
          A.sum(axis=0)             
          A.mean(axis = 0)
          A.std(axis = 0)
          A.argmin(axis = 0)
          A.sum(axis=0)                 # sum of each column, it will move in downward direction
          Output: array([12, 15, 18])

          A.mean(axis = 0)
          Output: array([ 4., 5., 6.])

          A.std(axis = 0)
          Output: array([ 2.44948974, 2.44948974, 2.44948974])

          A.argmin(axis = 0)
          Output: array([0, 0, 0], dtype=int64)
          By defining axis = 0, calculations will move in downward direction i.e. it will give the statistics for each column.
To find the minimum of each row, and the index of the maximum element in each row, we need to move in the rightwise direction, so we write axis = 1:
          A.min(axis=1)
          A.argmax(axis = 1)
          A.min(axis=1)                  # min of each row, it will move in rightwise direction
          Output: array([1, 4, 7])

          A.argmax(axis = 1)
          Output: array([2, 2, 2], dtype=int64)
             To find the cumulative sum along each row we use cumsum( )
          A.cumsum(axis=1)
          array([[ 1,  3,  6],
          [ 4, 9, 15],
          [ 7, 15, 24]], dtype=int32)

          Creating 3D arrays

          Numpy also provides the facility to create 3D arrays. A 3D array can be created as:
          X = np.array( [[[  1, 2,3],             
                          [ 4, 5, 6]],
                         [[7,8,9],
                          [10,11,12]]])
          X.shape
          X.ndim
          X.size
X contains two 2D arrays, thus the shape is (2, 2, 3). The total number of elements is 12.
          To calculate the sum along a particular axis we use the axis parameter as follows:
          X.sum(axis = 0)
          X.sum(axis = 1)
          X.sum(axis = 2)
          X.sum(axis = 0)
          Output:
          array([[ 8, 10, 12],
          [14, 16, 18]])

          X.sum(axis = 1)
          Output:
          array([[ 5, 7, 9],
          [17, 19, 21]])

          X.sum(axis = 2)
          Output:
          array([[ 6, 15],
          [24, 33]])
          axis = 0 returns the sum of the corresponding elements of each 2D array. axis = 1 returns the sum of elements in each column in each matrix while axis = 2 returns the sum of each row in each matrix.
          X.ravel()
           array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
ravel( ) flattens all the elements into a single 1D array.

          Indexing in arrays

          It is important to note that Python indexing starts from 0. The syntax of indexing is as follows - 
1. x[start:end] : elements from index start through end (the end is excluded)
2. x[start:]       : elements from index start through the end of the array
3. x[:end]        : elements from the beginning through end (the end is excluded)

          If we want to extract 3rd element we write the index as 2 as it starts from 0.
          x = np.arange(10)
          x[2]
          x[2:5]
           x[2]
          Output: 2

          x[2:5]
          Output: array([2, 3, 4])
Note that in x[2:5] the elements from the 2nd index up to the 5th index (exclusive) are selected.
If we want to set every third element from the start up to index 7 (excluding 7) to 123, we write:
          x[:7:3] = 123
          x
           array([123,   1,   2, 123,   4,   5, 123,   7,   8,   9])
          To reverse a given array we write:
          x = np.arange(10)
          x[ : :-1]                                 # reversed x
          array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
          Note that the above command does not modify the original array.
          Consider a 3D array:
          X = np.array( [[[  1, 2,3],           
                          [ 4, 5, 6]],
                         [[7,8,9],
                          [10,11,12]]])
          To extract the 2nd matrix we write:
          X[1,...]                                   # same as X[1,:,:] or X[1]
          array([[ 7,  8,  9],
          [10, 11, 12]])
Remember Python indexing starts from 0, which is why we wrote 1 to extract the 2nd 2D array.
          To extract the first element from all the rows we write:
          X[...,0]                                   # same as X[:,:,0] 
          array([[ 1,  4],
          [ 7, 10]])

          Indexing with Arrays of Indices

          Consider a 1D array.
          x = np.arange(11,35,2)                   
          x
          array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])
          We form a 1D array i which subsets the elements of x as follows:
          i = np.array( [0,1,5,3,7,9 ] )
          x[i]                                    
          array([11, 13, 21, 17, 25, 29])
          In a similar manner we create a 2D array j of indices to subset x.
          j = np.array( [ [ 0, 1], [ 6, 2 ] ] )   
          x[j]                                      
          array([[11, 13],
          [23, 15]])
          Similarly we can create both  i and j as 2D arrays of indices for x
          x = np.arange(15).reshape(3,5)
          x
          i = np.array( [ [0,1],                        # indices for the first dim
                          [2,0] ] )
          j = np.array( [ [1,1],                        # indices for the second dim
                          [2,0] ] )
          To get the ith index in row and jth index for columns we write:
          x[i,j]                                     # i and j must have equal shape
          array([[ 1,  6],
          [12, 0]])
          To extract ith index from 3rd column we write:
          x[i,2]
          array([[ 2,  7],
          [12, 2]])
          For each row if we want to find the jth index we write:
          x[:,j]                                    
          array([[[ 1,  1],
          [ 2, 0]],

          [[ 6, 6],
          [ 7, 5]],

          [[11, 11],
          [12, 10]]])
For each row in turn (1st, 2nd and 3rd), the jth indices are selected, which is why the result has an extra dimension.

          You can also use indexing with arrays to assign the values:
          x = np.arange(10)
          x
          x[[4,5,8,1,2]] = 0
          x
          array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])
          0 is assigned to 4th, 5th, 8th, 1st and 2nd indices of x.
          When the list of indices contains repetitions then it assigns the last value to that index:
          x = np.arange(10)
          x
          x[[4,4,2,3]] = [100,200,300,400]
          x
          array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])
          Notice that for the 5th element(i.e. 4th index) the value assigned is 200, not 100.
Caution: If one uses the += operator on repeated indices, the operation is carried out only once per index.
          x = np.arange(10)
          x[[1,1,1,7,7]]+=1
          x
           array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])
Although indices 1 and 7 are repeated, they are incremented only once (see the sketch below for a way to count every repetition).
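If every repetition should actually be counted, NumPy's unbuffered np.add.at can be used; a small sketch:
x = np.arange(10)
np.add.at(x, [1, 1, 1, 7, 7], 1)   # index 1 is incremented 3 times, index 7 twice
x
array([0, 4, 2, 3, 4, 5, 6, 9, 8, 9])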

          Indexing with Boolean arrays

We create a 2D array and store our condition in b. Wherever the condition is true the result is True, otherwise False.
          a = np.arange(12).reshape(3,4)
          b = a > 4
          b                                    
          array([[False, False, False, False],
          [False, True, True, True],
          [ True, True, True, True]], dtype=bool)
          Note that 'b' is a Boolean with same shape as that of 'a'.
          To select the elements from 'a' which adhere to condition 'b' we write:
          a[b]                                       
          array([ 5,  6,  7,  8,  9, 10, 11])
a[b] returns a 1D array containing the selected elements; 'a' itself is unchanged.
          This property can be very useful in assignments:
          a[b] = 0                                 
          a
          array([[0, 1, 2, 3],
          [4, 0, 0, 0],
          [0, 0, 0, 0]])
          All elements of 'a' higher than 4 become 0
          As done in integer indexing we can use indexing via Booleans:
          Let x be the original matrix and 'y' and 'z' be the arrays of Booleans to select the rows and columns.
          x = np.arange(15).reshape(3,5)
          y = np.array([True,True,False])             # first dim selection
          z = np.array([True,True,False,True,False])       # second dim selection
          We write the x[y,:] which will select only those rows where y is True.
          x[y,:]                                   # selecting rows
          x[y]                                     # same thing
          Writing x[:,z] will select only those columns where z is True.
          x[:,z]                                   # selecting columns
          x[y,:]                                   # selecting rows
          Output:
          array([[0, 1, 2, 3, 4],
          [5, 6, 7, 8, 9]])

          x[y] # same thing
          Output:
          array([[0, 1, 2, 3, 4],
          [5, 6, 7, 8, 9]])

          x[:,z] # selecting columns
          Output:
          array([[ 0, 1, 3],
          [ 5, 6, 8],
          [10, 11, 13]])

          Stacking various arrays

          Let us consider 2 arrays A and B:
          A = np.array([[10,20,30],[40,50,60]])
          B = np.array([[100,200,300],[400,500,600]])
          To join them vertically we use np.vstack( ).
          np.vstack((A,B))   #Stacking vertically
          array([[ 10,  20,  30],
          [ 40, 50, 60],
          [100, 200, 300],
          [400, 500, 600]])
          To join them horizontally we use np.hstack( ).
          np.hstack((A,B))   #Stacking horizontally
          array([[ 10,  20,  30, 100, 200, 300],
          [ 40, 50, 60, 400, 500, 600]])
newaxis helps in transforming a 1D array into a column vector (a 2D array with a single column).
          from numpy import newaxis
          a = np.array([4.,1.])
          b = np.array([2.,8.])
          a[:,newaxis]  
          array([[ 4.],
          [ 1.]])
The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. For 2D inputs such as a[:,newaxis] it is equivalent to hstack:
          np.column_stack((a[:,newaxis],b[:,newaxis]))
          np.hstack((a[:,newaxis],b[:,newaxis])) # same as column_stack
          np.column_stack((a[:,newaxis],b[:,newaxis]))
          Output:
          array([[ 4., 2.],
          [ 1., 8.]])

          np.hstack((a[:,newaxis],b[:,newaxis]))
          Output:
          array([[ 4., 2.],
          [ 1., 8.]])

          Splitting the arrays.

          Consider an array 'z' of 15 elements:
          z = np.arange(1,16)
          Using np.hsplit( ) one can split the arrays
          np.hsplit(z,5)   # Split a into 5 arrays
          [array([1, 2, 3]),
          array([4, 5, 6]),
          array([7, 8, 9]),
          array([10, 11, 12]),
          array([13, 14, 15])]
It splits 'z' into 5 arrays of equal length.
          On passing 2 elements we get:
          np.hsplit(z,(3,5))   
          [array([1, 2, 3]),
          array([4, 5]),
          array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
          It splits 'z' after the third and the fifth element.
          For 2D arrays np.hsplit( ) works as follows:
          A = np.arange(1,31).reshape(3,10)
          A
          np.hsplit(A,5)   # Split a into 5 arrays
          [array([[ 1,  2],
          [11, 12],
          [21, 22]]), array([[ 3, 4],
          [13, 14],
          [23, 24]]), array([[ 5, 6],
          [15, 16],
          [25, 26]]), array([[ 7, 8],
          [17, 18],
          [27, 28]]), array([[ 9, 10],
          [19, 20],
          [29, 30]])]
          In the above command A gets split into 5 arrays of same shape.
          To split after the third and the fifth column we write:
          np.hsplit(A,(3,5))   
          [array([[ 1,  2,  3],
          [11, 12, 13],
          [21, 22, 23]]), array([[ 4, 5],
          [14, 15],
          [24, 25]]), array([[ 6, 7, 8, 9, 10],
          [16, 17, 18, 19, 20],
          [26, 27, 28, 29, 30]])]
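For splitting along rows rather than columns, NumPy also provides np.vsplit( ). A small sketch using a 4 x 3 array; the array C here is ours, not part of the original tutorial:
C = np.arange(1, 13).reshape(4, 3)
np.vsplit(C, 2)    # split into 2 arrays of 2 rows each
[array([[1, 2, 3],
[4, 5, 6]]), array([[ 7,  8,  9],
[10, 11, 12]])]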

          Copying

          Consider an array x
          x = np.arange(1,16)
We assign y as x and then check 'y is x', which returns True - the assignment does not create a new object.
          y = x         
          y is x         
          Let us change the shape of y
          y.shape = 3,5  
          Note that it alters the shape of x
          x.shape
          (3, 5)

          Creating a view of the data:

          Let us store z as a view of x by:
          z = x.view()
          z is x   
          False
          Thus z is not x.
          Changing the shape of z
          z.shape = 5,3                   
Changing the shape of the view does not alter the shape of x
          x.shape
          (3, 5)
          Changing an element in z
          z[0,0] = 1234                      
Note that the value in x also gets altered:
          x
          array([[1234,    2,    3,    4,    5],
          [ 6, 7, 8, 9, 10],
          [ 11, 12, 13, 14, 15]])
Thus changing the shape of a view does not affect the original array, but changing the values of a view does affect the original data.
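When it is unclear whether an array is a view or an independent copy, np.shares_memory( ) and the .base attribute can be used to check; a minimal sketch reusing the x and z arrays above (the copy c is ours, for illustration):
print(np.shares_memory(x, z))   # True  -> z is a view on x's data
c = x.copy()
print(np.shares_memory(x, c))   # False -> c owns its own data
print(z.base is x)              # True  -> a view's base is the original array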


          Creating a copy of the data:
          Now let us create z as a copy of x:
          z = x.copy()                        
          Note that z is not x
          z is x
          Changing the value in z
          z[0,0] = 9999
          No alterations are made in x.
          x
          array([[1234,    2,    3,    4,    5],
          [ 6, 7, 8, 9, 10],
          [ 11, 12, 13, 14, 15]])
pandas may sometimes give a 'SettingWithCopy' warning because it is unable to recognize whether a new dataframe or array (created as a subset of another dataframe or array) is a view or a copy. In such situations the user needs to make it explicit - either modify the original via .loc or take an explicit copy - otherwise the results may not be what was intended. A small sketch is shown below.
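A minimal pandas sketch of the situation described above; the dataframe df and the threshold are made up for illustration:
import pandas as pd

df = pd.DataFrame({'Names': ['John', 'Mary', 'Henry'],
                   'Marks': [50, 81, 98]})

# Chained indexing like df[df['Marks'] > 60]['Marks'] = 100 may trigger a
# SettingWithCopyWarning, because pandas cannot tell if the subset is a view or a copy.

# Safer: modify the original through .loc in a single step ...
df.loc[df['Marks'] > 60, 'Marks'] = 100

# ... or take an explicit copy when an independent dataframe is wanted.
subset = df[df['Marks'] > 60].copy()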

          Data Visualization in R using ggplot2

For the purpose of data visualization, R offers various methods through inbuilt graphics and powerful packages such as ggplot2. The former helps in creating simple graphs while the latter assists in creating customized professional graphs. In this article we will learn how various graphs can be made and altered using the ggplot2 package.
          Data Visualization with R

          What is ggplot2?

ggplot2 is a robust and versatile R package, developed by the well-known R developer Hadley Wickham, for generating aesthetic plots and charts.

ggplot2 stands for "Grammar of Graphics", which is built on the principle that a plot can be split into the following basic parts -
          Plot = data + Aesthetics + Geometry
          1. data refers to a data frame (dataset).
          2. Aesthetics indicates x and y variables. It is also used to tell R how data are displayed in a plot, e.g. color, size and shape of points etc.
          3. Geometry refers to the type of graphics (bar chart, histogram, box plot, line plot, density plot, dot plot etc.)

          Apart from the above three parts, there are other important parts of plot -
          1. Faceting implies the same type of graph can be applied to each subset of the data. For example, for variable gender, creating 2 graphs for male and female.
2. Annotation lets you add text to the plot.
          3. Summary Statistics allows you to add descriptive statistics on a plot.
          4. Scales are used to control x and y axis limits

          Why ggplot2 is better?
• Excellent themes can be created with a single command.
• Its colors are nicer and prettier than the usual graphics.
• It is easy to visualize data with multiple variables.
• It provides a platform to create simple graphs that convey a plethora of information.

          The table below shows common charts along with various important functions used in these charts.
Plot : Important Functions
Scatter Plot : geom_point(), geom_smooth(), stat_smooth()
Bar Chart : geom_bar(), geom_errorbar()
Histogram : geom_histogram(), stat_bin(), position_identity(), position_stack(), position_dodge()
Box Plot : geom_boxplot(), stat_boxplot(), stat_summary()
Line Plot : geom_line(), geom_step(), geom_path(), geom_errorbar()
Pie Chart : coord_polar()

          Datasets

          In this article, we will use three datasets - 'iris' , 'mpg' and 'mtcars' datasets available in R.

1. The 'iris' data comprises 150 observations with 5 variables. We have 3 species of flowers - Setosa, Versicolor and Virginica - and for each of them the sepal length and width and petal length and width are provided.

2. The 'mtcars' data consists of fuel consumption (mpg) and 10 aspects of automobile design and performance for 32 automobiles. In other words, we have 32 observations and 11 different variables:
          1. mpg Miles/(US) gallon
          2. cyl Number of cylinders
          3. disp Displacement (cu.in.)
          4. hp Gross horsepower
          5. drat Rear axle ratio
          6. wt Weight (1000 lbs)
          7. qsec 1/4 mile time
          8. vs V/S
          9. am Transmission (0 = automatic, 1 = manual)
          10. gear Number of forward gears
          11. carb Number of carburetors

          3. The 'mpg' data consists of 234 observations and 11 variables.


          Install and Load Package

First we need to install the package in R by using the command install.packages( ).
          #installing package
          install.packages("ggplot2")
          library(ggplot2)
          Once installation is completed, we need to load the package so that we can use the functions available in the ggplot2 package. To load the package, use command library( )

          Histogram, Density plots and Box plots are used for visualizing a continuous variable.

          Creating Histogram: 

          Firstly we consider the iris data to create histogram and scatter plot.
          # Considering the iris data.
          # Creating a histogram
          ggplot(data  = iris, aes( x = Sepal.Length)) + geom_histogram( )
Here we call the ggplot( ) function, the first argument being the dataset to be used.

1. In aes( ), i.e. the aesthetics, we define which variable will be represented on the x-axis; here we consider 'Sepal.Length'.
2. geom_histogram( ) denotes that we want to plot a histogram.

          Histogram in R

To change the width of the bins in the histogram we can use binwidth in geom_histogram( )
          ggplot(data  = iris, aes(x = Sepal.Length)) + geom_histogram(binwidth=1)

One can also define the number of bins wanted; the binwidth in that case will be adjusted automatically.
ggplot(data = iris , aes(x=Sepal.Length)) + geom_histogram(color="black", fill="white", bins = 10)

          Using  color = "black" and fill = "white" we are denoting the boundary colors and the inside color of the bins respectively.

          How to visualize various groups in histogram

          ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_histogram(fill="white", binwidth = 1)
          Histogram depicting various species


          Creating Density Plot

          Density plot is also used to present the distribution of a continuous variable.
          ggplot(iris, aes( x = Sepal.Length)) + geom_density( )
          geom_density( ) function is for displaying density plot.

          Density Plot

          How to show various groups in density plot

          ggplot(iris, aes(x=Sepal.Length, color=Species)) + geom_density( )
          Density Plot by group

          Creating Bar and Column Charts :

Bar and column charts are probably the most common chart types. They are best used to compare different values.

          Now mpg data will be used for creating the following graphics.
          ggplot(mpg, aes(x= class)) + geom_bar() 
          Here we are trying to create a bar plot for number of cars in each class using geom_bar( ).

          Column Chart using ggplot2

          Using coord_flip( ) one can inter-change x and y axis.
          ggplot(mpg, aes(x= class)) + geom_bar() + coord_flip()
          Bar Chart


          How to add or modify Main Title and Axis Labels

          The following functions can be used to add or alter main title and axis labels.
          1. ggtitle("Main title"): Adds a main title above the plot
          2. xlab("X axis label"): Changes the X axis label
          3. ylab("Y axis label"): Changes the Y axis label
          4. labs(title = "Main title", x = "X axis label", y = "Y axis label"): Changes main title and axis labels
          p = ggplot(mpg, aes(x= class)) + geom_bar()
          p + labs(title = "Number of Cars in each type", x = "Type of car", y = "Number of cars")
          Title and Axis Labels

          How to add data labels

          p = ggplot(mpg, aes(x= class)) + geom_bar()
          p = p + labs(title = "Number of Cars in each type", x = "Type of car", y = "Number of cars")
          p + geom_text(stat='count', aes(label=..count..), vjust=-0.25)
          geom_text() is used to add text directly to the plot. vjust is to adjust the position of data labels in bar.

          Add Data Labels in Bar


          How to reorder Bars

Using stat="identity" we can use our derived values instead of counts.
          library(plyr)
          library(dplyr)
          count(mpg,class) %>% arrange(-n) %>%
          mutate(class = factor(class,levels= class)) %>%
          ggplot(aes(x=class, y=n)) + geom_bar(stat="identity")
The above command will first create a frequency distribution for the type of car and then arrange it in descending order using arrange(-n). Then, using mutate( ), we convert the 'class' column to a factor with levels in that order, and finally plot the bar chart using geom_bar( ).

          Change order of bars

Here, the bar for SUV appears first as it has the maximum number of cars. The bars are now ordered by frequency count.

          Showing Mean of Continuous Variable by Categorical Variable

          df = mpg %>% group_by(class) %>% summarise(mean = mean(displ)) %>%
            arrange(-mean) %>% mutate(class = factor(class,levels= class))

          p = ggplot(df, aes(x=class, y=mean)) + geom_bar(stat="identity")
          p + geom_text(aes(label = sprintf("%0.2f", round(mean, digits = 2))),
                        vjust=1.6, color="white", fontface = "bold", size=4)

Now, using the dplyr library, we create a new dataframe 'df' and plot it.
Using group_by we group the data according to the various types of cars, and summarise enables us to find the statistics (here the mean of the 'displ' variable) for each group. To add data labels (with 2 decimal places) we use geom_text( )


          Customized BarPlot

          Creating Stacked Bar Chart

          p <- ggplot(data=mpg, aes(x=class, y=displ, fill=drv))
          p + geom_bar(stat = "identity")

          Stacked BarPlot
          p + geom_bar(stat="identity", position=position_dodge())
          Stacked - Position_dodge

          Creating BoxPlot

          Using geom_boxplot( ) one can create a boxplot.

          To create different boxplots for 'disp' for different levels of x we can define aes(x = cyl, y = disp)
          mtcars$cyl = factor(mtcars$cyl)
          ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot()
          We can see one outlier for 6 cylinders.

          To create a notched boxplot we write notch = TRUE
          ggplot(mtcars, aes(x=cyl, y=disp)) + geom_boxplot(notch = TRUE)

          Notched Boxplot


          Scatter Plot

          A scatterplot is used to graphically represent the relationship between two continuous variables.
          # Creating a scatter plot denoting various species.
          ggplot(data = iris, aes( x = Sepal.Length, y = Sepal.Width,shape = Species, color = Species)) + geom_point()
We plot the points using geom_point( ). In the aesthetics we define that the x axis denotes sepal length and the y axis denotes sepal width; shape = Species and color = Species denote that a different shape and a different color should be used for each particular species of flower.
          Scatter Plot
          Scatter plots are constructed using geom_point( ) 
          # Creating scatter plot for automatic cars denoting different cylinders.
          ggplot(data = subset(mtcars,am == 0),aes(x = mpg,y = disp,colour = factor(cyl))) + geom_point()
Scatter plot denoting various levels of cyl
We use the subset( ) function to select only those cars with am = 0, in other words the automatic cars. We plot displacement against mileage and use a different color for each number of cylinders; factor(cyl) converts the numeric cylinder variable to a factor.
          # Seeing the patterns with the help of geom_smooth.
          ggplot(data = mtcars, aes(x = mpg,y = disp,colour = hp))  + geom_point() + geom_smooth()
In the above command we plot mileage (mpg) against displacement (disp), with the color denoting horsepower (hp). geom_smooth( ) is used to show the overall pattern exhibited by the points.
          In a similar way we can use geom_line( ) to plot another line on the graph:
          # Plotting the horsepower using geom_line
          ggplot(data = mtcars, aes(x = mpg,y = disp,colour = hp))  + geom_point(size = 2.5) + geom_line(aes(y = hp))
Here in geom_point we have added an optional argument size = 2.5 denoting the size of the points. geom_line( ) draws a line; since we supply only y = hp in its aesthetics, it inherits x = mpg from the main call and therefore plots horsepower (hp) against mileage (mpg).

          Modifying the axis labels and appending the title and subtitle.

          #Adding title or changing the labels
          ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + labs(title = "Scatter plot") 
          #Alternatively
          ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot")
          ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point() + ggtitle(label = "Scatter plot",
                                                                        subtitle = "mtcars data in R")
          Adding title and subtitle to plots
Here labs( ) or ggtitle( ) can be used to give the graph a title. To add both a title and a sub-title we use ggtitle( ), where the first argument is the main title and the second argument is the subtitle.
          a <- ggplot(mtcars,aes(x = mpg, y = disp, color = factor(cyl))) + geom_point()
          a
          #Changing the axis labels.
          a + labs(color = "Cylinders")
          a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement")
We first save our plot to 'a' and then make the alterations.
Note that in the labs command color = "Cylinders" changes the title of our legend.
Using the xlab and ylab commands we can change the x and y axis labels respectively. Here the x axis label is 'Mileage' and the y axis label is 'Displacement'.
          #Combining it all
          a + labs(color = "Cylinders") + xlab("Mileage") + ylab("Displacement") + ggtitle(label = "Scatter plot", subtitle = "mtcars data in R")
           In the above plot we can see that the labels on x axis,y axis and legend have changed; the title and subtitle have been added and the points are colored, distinguishing the number of cylinders.

          Playing with themes

Themes can be used in ggplot2 to change the background, text colors, legend colors and axis text.
First we save our plot to 'b' and then create the visualizations by modifying 'b'. Note that in the aesthetics we have written mpg, disp, which automatically maps mpg to the x axis and disp to the y axis.
          #Changing the themes.
          b <- ggplot(mtcars,aes(mpg,disp)) + geom_point()  + labs(title = "Scatter Plot") 
          #Changing the size and color of the Title and the background color.
          b + theme(plot.title = element_text(color = "blue",size = 17),plot.background = element_rect("orange"))
          Plot background color changed.
We use theme( ) to modify the plot title and background. plot.title is an element_text( ) object in which we specify the color and size of the title, while plot.background is an element_rect( ) object through which we specify the background color of the plot.
ggplot2 also ships with complete themes in which the panel background and grid colors are changed automatically. Some of them are theme_gray, theme_minimal, theme_dark etc.
          b + theme_minimal( )
          We can observe horizontal and vertical lines behind the points. What if we don't need them? This can be achieved via: 
          #Removing the lines from the background.
          b + theme(panel.background = element_blank())

          Setting panel.background = element_blank( ) with no other parameter can remove those lines and color from the panel.
          #Removing the text from x and y axis.
          b + theme(axis.text = element_blank())
          b + theme(axis.text.x = element_blank())
          b + theme(axis.text.y = element_blank())
To remove the text from both axes we can use axis.text = element_blank( ). If we want to remove the text from only one axis then we need to specify it (axis.text.x or axis.text.y).
          Now we save our plot to c and then make the changes.
          #Changing the legend position
          c <- ggplot(mtcars,aes(x = mpg, y = disp, color = hp)) +labs(title = "Scatter Plot") + geom_point()
          c +  theme(legend.position = "top")
          If we want to move the legend then we can specify legend.position as "top" or "bottom" or "left" or "right".
Finally, combining all that we have learnt about themes, we create a plot where the legend is placed at the bottom, the plot title is forest green, the background is yellow and no text is displayed on either axis.
          #Combining everything.
          c + theme(legend.position = "bottom", axis.text = element_blank()) +
  theme(plot.title = element_text(color = "forestgreen", size = 17), plot.background = element_rect("yellow")) 

          Changing the color scales in the legend.

In ggplot2, the default continuous color scale runs from dark blue to light blue. We may wish to customize the scale by changing the colors or adding new ones. This can be done successfully via the scale_color_gradient( ) function.
          c + scale_color_gradient(low = "yellow",high = "red") 
          Suppose we want the colors to vary from yellow to red; yellow denoting the least value and red denoting the highest value; we set low = "yellow" and high = "red". Note that in the legend it takes the scale to be started from 0 and not the minimum value of the series.
          What if we want 3 colors?
          c + scale_color_gradient2(low = "red",mid = "green",high = "blue")
To have 3 colors in the legend we use scale_color_gradient2 with low = "red", mid = "green" and high = "blue"; it shades the lower part of the scale red, the middle part green and the highest part blue.
          c + theme(legend.position = "bottom") + scale_color_gradientn(colours = c("red","forest green","white","blue"))
If we want more than 3 colors in our legend we can use the scale_color_gradientn( ) function, whose colours argument is a vector in which the 1st element gives the color of the 1st part of the scale, the 2nd element the color of the 2nd part, and so on.


          Changing the breaks in the legend.

          It can be seen that the legend for continuous variable starts from 0.
          Suppose we want the breaks to be: 50,125,200,275 and 350, we use seq(50,350,75) where 50 denotes the least number, 350 is the maximum number in the sequence and 75 is the difference between 2 consecutive numbers.
          #Changing the breaks in the legend
          c + scale_color_continuous(name = "horsepower", breaks = seq(50,350,75), labels = paste(seq(50,350,75),"hp"))
           In scale_color_continuous we set the breaks as our desired sequence, and can change the labels if we want. Using paste function our sequence is followed by the word "hp" and name = "horsepower" changes the name of our legend.

          Changing the break points and color scale of the legend together.

          Let us try changing the break points and the colors in the legend together by trial and error.
          #Trial 1 : This one is wrong
          c + scale_color_continuous( breaks = seq(50,350,75)) +
            scale_color_gradient(low = "blue",high = "red") 
          We can refer to trial1 image for the above code which can be found below. Notice that the color scale is blue to red as desired but the breaks have not changed.
          #Trial 2: Next one is wrong.
          c  +  scale_color_gradient(low = "blue",high = "red") +
            scale_color_continuous( breaks = seq(50,350,75))
          trial2 image is the output for the above code. Here the color scale has not changed but the breaks have been created.
          trial1 
          trial2

           What is happening? The reason for this is that we cannot have 2 scale_color functions for a single graph. If there are multiple scale_color_ functions then R overwrites the other scale_color_ functions by the last scale_color_ command it has received.
          In trial 1, scale_color_gradient overwrites the previous scale_color_continuous command. Similarly in trial 2, scale_color_continuous overwrites the previous scale_color_gradient command.

The correct way is to define all the arguments in a single scale_color function.
          c + scale_color_continuous(name = "horsepower", breaks = seq(50,350,75), low = "red", high = "black") + theme(panel.background = element_rect("green"),
           plot.background = element_rect("orange"))
          Here low = "red" and high = "black" are defined in scale_color_continuous function along with the breaks.

          Changing the axis cut points.

          We save our initial plot to 'd'. 
          d <- ggplot(mtcars,aes(x = mpg,y = disp)) + geom_point(aes(color = factor(am)))  +
            xlab("Mileage") + ylab("Displacement") +
            theme(panel.background = element_rect("black") , plot.background = element_rect("pink"))
          To change the axis cut points we use scale_(axisname)_continuous.
          d +  scale_x_continuous(limits = c(2,4)) + scale_y_continuous(limits = c(15,30))
To change the x axis limits we use scale_x_continuous, whose 'limits' argument is a vector giving the lower and upper limits of the axis. Likewise, scale_y_continuous sets the lowest cut-off point of the y axis to 15 and the highest to 30.

          d + scale_x_continuous(limits = c(2,4),breaks = seq(2,4,0.25)) +
            scale_y_continuous(limits = c(15,30),breaks = seq(15,30,3))
We can also add the 'breaks' parameter, which takes a vector specifying the cut-off points of the axis. Here we create the sequence 2, 2.25, 2.5, ..., 4 for the x axis, and for the y axis the sequence is 15, 18, 21, ..., 30.

          Faceting.

          Faceting is a technique which is used to plot the graphs for the data corresponding to various categories of a particular variable. Let us try to understand it via an illustration:


The facet_wrap function is used for faceting; after the tilde (~) sign we define the variable(s) by which we want to split the plot.
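A minimal version of the faceted plot described just below (splitting the mtcars scatter plot by carb) would be -
ggplot(mtcars, aes(mpg, disp)) +  geom_point() +  facet_wrap(~carb)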
          Faceting for carb
          We see that there are 6 categories of "carb". Faceting creates 6 plots between mpg and disp; where the points correspond to the categories.
          We can mention the number of rows we need for faceting.
          # Control the number of rows and columns with nrow and ncol
          ggplot(mtcars, aes(mpg, disp)) +  geom_point() +  facet_wrap(~carb,nrow = 3)
Here the additional parameter nrow = 3 specifies that all the panels should be arranged in 3 rows.

          Faceting using multiple variables.

          Faceting can be done for various combinations of carb and am.  
          # You can facet by multiple variables
          ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(~carb + am)
          #Alternatively
          ggplot(mtcars, aes(mpg, disp)) + geom_point() + facet_wrap(c("carb","am"))
There are 6 unique 'carb' values and 2 unique 'am' values, so 12 combinations are possible, but we get only 9 panels because the remaining 3 combinations have no observations.
It can be hard to tell which panel corresponds to which level of am and carb when the labels are not provided. Accordingly, we can label the variables.
          # Use the `labeller` option to control how labels are printed:
          ggplot(mtcars, aes(mpg, disp)) +  geom_point() +  facet_wrap(~carb  + am, labeller = "label_both")
          facet_wrap in multiple variables.
R provides the facet_grid( ) function which can be used to facet in two dimensions.
          z <- ggplot(mtcars, aes(mpg, disp)) + geom_point()
          We store our basic plot in 'z' and thus we can make the additions:
          z + facet_grid(. ~ cyl)   #col
          z + facet_grid(cyl ~ .)   #row
          z + facet_grid(gear ~ cyl,labeller = "label_both")  #row and col
          using facet_grid( )
          In facet_grid(.~cyl), it facets the data by 'cyl' and the cylinders are represented in columns. If we want to represent 'cyl' in rows, we write facet_grid(cyl~.). If we want to facet according to 2 variables we write facet_grid(gear~cyl) where gears are represented in rows and 'cyl' are illustrated in columns.


          Adding text to the points.

Using ggplot2 we can also display values / labels next to the points. This can be accomplished using geom_text( ).
          #Adding texts to the points
          ggplot(mtcars, aes(x= mpg,y = disp)) + geom_point() +
            geom_text(aes(label = am))
          In geom_text we provide aes(label = am) which depicts that for all the points the corresponding levels of "am" should be shown.
          In the graph it can be perceived that the labels of 'am' are overlapping with the points. In some situations it may become difficult to read the labels when there are many points. In order to avoid this we use geom_text_repel function in 'ggrepel' library.
          require(ggrepel)
          ggplot(mtcars, aes(x= mpg,y = disp)) + geom_point() +
            geom_text_repel(aes(label = am))
           We load the library ggrepel using require( ) function. If we don't want the text to overlap we use geom_text_repel( ) instead of geom_text( ) of ggplot2 , keeping the argument aes(label = am).
          geom_text_repel

          Linear Regression in Python

          Linear Regression is a supervised statistical technique where we try to estimate the dependent variable with a given set of independent variables. We assume the relationship to be linear and our dependent variable must be continuous in nature.
          Python : Linear Regression
          In the following diagram we can see that as horsepower increases mileage decreases thus we can think to fit linear regression. The red line is the fitted line of regression and the points denote the actual observations.

          The vertical distance between the points and the fitted line (line of best fit) are called errors. The main idea is to fit this line of regression by minimizing the sum of squares of these errors. This is also known as principle of least squares.
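As a rough sketch of this least-squares idea (using made-up numbers, not the dataset analysed later), the intercept and slope that minimize the sum of squared errors can be computed directly with numpy -
import numpy as np

# toy data: y is roughly 2*x plus some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])                 # add an intercept column
beta, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates
print(beta)   # [intercept, slope] minimizing the sum of squared errors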

          Examples:

          • Estimating the price (Y) of a house on the basis of its Area (X1), Number of bedrooms (X2), proximity to market (X3) etc. 
          • Estimating the mileage of a car (Y) on the basis of its displacement (X1), horsepower(X2), number of cylinders(X3), whether it is automatic or manual (X4) etc. 
• Estimating the treatment cost of a patient on the basis of factors like age, weight, past medical history and, if blood reports are available, the information from the blood report.
          Simple Linear Regression Model: In this we try to predict the value of dependent variable (Y) with only one regressor or independent variable(X).

          Multiple Linear Regression Model: Here we try to predict the value of dependent variable (Y) with more than one regressor or independent variables.

          The linear regression model:

y = β0 + β1X1 + β2X2 + ... + βkXk + ε

Here 'y' is the dependent variable to be estimated, X1, ..., Xk are the independent variables, β0, ..., βk are the coefficients to be estimated and ε is the error term.

          Assumptions of linear regression:

          • There must be a linear relationship between the dependent and independent variables.
          • Sample observations are independent.
          • Error terms are normally distributed with mean 0. 
• No multicollinearity - the independent variables should not be highly linearly related to each other; when they are, the situation is called multicollinearity.
          • Error terms are identically and independently distributed. (Independence means absence of autocorrelation).
          • Error terms have constant variance i.e. there is no heteroscedasticity.
          • No outliers are present in the data.


          Important Model Performance Metrics

          Coefficient of determination (R square)

It gives the proportion of variation in Y which can be explained by the independent variables. Mathematically, it is one minus the ratio of the residual sum of squares (RSS) to the total sum of squares (TSS), i.e.
R2 = 1 - RSS/TSS

If our fit is perfect then RSS = 0 and R2 = 1.

If RSS = TSS, i.e. the model explains nothing, then R2 = 0, indicating a poor fit. Thus R2 lies between 0 and 1.
          If the value of R2 is 0.912 then this suggests that 91.2% of the variation in Y can be explained with the help of given explanatory variables in that model. In other words, it explains the proportion of variation in the dependent variable that is explained by the independent variables.

R square alone is not such a good measure:

On addition of a new variable the residual sum of squares can only decrease (or stay the same), thus R square never decreases when a new variable is added to the model. This says nothing about the importance of the variable.
For example, in a model determining the price of a house, suppose we have the variables GDP, inflation rate and area. If we add a new, irrelevant variable such as the number of plane crashes, R square will still increase.

          Adjusted R square:

Adjusted R square is given by:
Adjusted R2 = 1 - [(1 - R2)(n - 1) / (n - k - 1)]

where n is the no. of observations and k is the no. of regressors or predictors.

          Hence adjusted R square will always be less than or equal to R square.

When a variable is added, R square in the numerator increases and so does 'k' in the denominator.
If the variable is actually useful then R square increases by a large amount, enough to compensate for the increase in 'k', and adjusted R square goes up. On the other hand, if the variable is irrelevant then R square does not increase much and adjusted R square will eventually decrease.
          Thus as a general thumb rule if adjusted R square increases when a new variable is added to the model, the variable should remain in the model. If the adjusted R square decreases when the new variable is added then the variable should not remain in the model.

          Why error terms should be normally distributed?

For parameter estimation (i.e. estimating the βi's) we don't need that assumption. But if the errors are not normally distributed, some of the hypothesis tests which we will be doing as part of diagnostics may not be valid.
For example: To check whether a Beta (regression coefficient) is significant or not, we do a T-test. If the errors are not normally distributed, the statistic we derive may not follow a T-distribution, so the diagnostic or hypothesis test is not valid. Similarly, the F-test for linear regression, which checks whether any of the independent variables in a multiple linear regression model is significant, will not be valid.

          Why is expectation of error always zero?

The error term is the deviation between the observed points and the fitted line. The observed points lie both above and below the fitted line, so if we take the average of all the deviations it should be 0 or near 0. The zero conditional mean assumption says that the negative and positive errors cancel out on average, which helps us to estimate the dependent variable precisely.

Why is multicollinearity a problem?

If the Xi's are highly correlated then |X'X| will be close to 0 and hence the inverse of (X'X) will not exist or will have extremely large entries. Mathematically, Var(β̂) = σ²(X'X)⁻¹, which becomes extremely large in the presence of multicollinearity. Long story short, multicollinearity inflates the estimated standard errors of the regression coefficients, which makes some variables statistically insignificant when they should be significant.
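A quick numpy sketch (simulated data, purely for illustration) shows how near-collinearity blows up the diagonal of (X'X)⁻¹, which drives the variance of the coefficient estimates -
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2_ok  = rng.normal(size=100)                    # independent of x1
x2_bad = x1 + rng.normal(scale=0.01, size=100)   # nearly collinear with x1

for x2 in (x2_ok, x2_bad):
    X = np.column_stack([np.ones(100), x1, x2])
    # Var(beta_hat) is proportional to the diagonal of (X'X)^-1
    print(np.round(np.diag(np.linalg.inv(X.T @ X)), 4))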

          How can you detect multicollinearity?

          1. Bunch Map Analysis: By plotting scatter plots between various Xi’ s we can have a visual description of how the variables are related.

          2. Correlation Method: By calculating the correlation coefficients between the variables we can get to know about the extent of multicollinearity in the data.

3. VIF (Variance Inflation Factor) Method: Firstly we fit a model with all the variables and then calculate the variance inflation factor (VIF) for each variable. VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated. The higher the VIF for the ith regressor, the more highly correlated it is with the other variables.

          So what is Variance Inflation Factor?

Variance inflation factor (VIF) for an explanatory variable is given by 1/(1 - R^2), where R^2 is obtained by regressing that particular X on all the remaining explanatory variables. In other words, we take that X as the response variable and the other explanatory variables as predictors.
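As a minimal sketch of this definition (the helper name manual_vif is illustrative; x is assumed to be a pandas DataFrame of predictors like the one used later in this tutorial) -
import statsmodels.api as sma

def manual_vif(x, col):
    # regress the chosen predictor on all the remaining predictors
    y_col = x[col]
    X_rest = sma.add_constant(x.drop(columns=[col]))
    r2 = sma.OLS(y_col, X_rest).fit().rsquared
    return 1.0 / (1.0 - r2)

# e.g. manual_vif(x_train, "Water")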

          Detecting heteroscedasticity!

          1. Graphical Method: Firstly do the regression analysis and then plot the error terms against the predicted values( Yi^). If there is a definite pattern (like linear or quadratic or funnel shaped) obtained from the scatter plot then heteroscedasticity is present.
2. Goldfeld Quandt (GQ) Test: It assumes that the heteroscedastic variance σi2 is positively related to one of the explanatory variables, and that the errors are normally distributed. Thus if heteroscedasticity is present, the variance will be high for large values of X.

          Steps for GQ test:
          1. Order/ rank (ascending) the observations according to the value of Xi beginning with the lowest X value.
          2. Omit ‘c’ central observations and divide the remaining (n-c) observations into 2 groups of (n-c)/2 observations each.
          3. Fit separate OLS regression to both the groups and obtain residual sum of squares (RSS1 and RSS2) for both the groups.
          4. Obtain F = RSS2/ RSS1 
It follows an F distribution with ((n-c)/2 - k) d.f. for both the numerator and the denominator,
where k is the no. of parameters to be estimated including the intercept.
If the errors are homoscedastic then the two residual sums of squares RSS1 and RSS2 turn out to be roughly equal, i.e. F will tend to 1.
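A rough sketch of these steps (assuming y is a pandas Series or single-column DataFrame and x a DataFrame of predictors; the function name and the fraction of central observations dropped are illustrative - later in this tutorial we simply use statsmodels' built-in het_goldfeldquandt) -
import numpy as np
import statsmodels.api as sma

def gq_fstat(y, x, sort_col, drop_frac=0.2):
    # 1. order the observations by the chosen explanatory variable
    order = np.argsort(x[sort_col].values)
    x_s, y_s = x.iloc[order], y.iloc[order]
    n = len(x_s)
    c = int(n * drop_frac)              # 2. omit c central observations
    half = (n - c) // 2
    # 3. fit separate OLS models on the two groups and take their residual sums of squares
    rss1 = sma.OLS(y_s.iloc[:half], sma.add_constant(x_s.iloc[:half])).fit().ssr
    rss2 = sma.OLS(y_s.iloc[-half:], sma.add_constant(x_s.iloc[-half:])).fit().ssr
    # 4. F = RSS2 / RSS1, to be compared against an F distribution
    return rss2 / rss1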

          Dataset used:

We have 1030 observations on 9 variables. We try to estimate the concrete compressive strength (the CMS column in the data) using:

          1. Cement - kg in a m3 mixture
          2. Blast Furnace Slag - kg in a m3 mixture
          3. Fly Ash - kg in a m3 mixture
          4. Water - kg in a m3 mixture
          5. Superplasticizer - kg in a m3 mixture
          6. Coarse Aggregate - kg in a m3 mixture
          7. Fine Aggregate - kg in a m3 mixture
          8. Age - Day (1-365)


          Importing the libraries:

          Numpy, pandas and matplotlib.pyplot are imported with aliases np, pd and plt respectively.
          import numpy as np
          import pandas as pd
          import matplotlib.pyplot as plt


          Loading the data

          We load our data using pd.read_csv( )
          data = pd.read_csv("Concrete_Data.csv")
          Now the data is divided into independent (x) and dependent variables (y)
          x = data.iloc[:,0:8]
          y = data.iloc[:,8:]


          Splitting the data into training and test sets.

Using sklearn we put 80% of our data into the training set and the rest into the test set. Setting random_state gives the same training and test split every time the code is run.
from sklearn.model_selection import train_test_split   # older scikit-learn versions used sklearn.cross_validation
          x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 100) 

          Running linear regression using sklearn:

Using sklearn, linear regression can be carried out with the LinearRegression( ) class. sklearn adds an intercept term to the model automatically.
          from sklearn.linear_model import LinearRegression
          lm = LinearRegression()
          lm = lm.fit(x_train,y_train)   #lm.fit(input,output)
          The coefficients are given by:
          lm.coef_
          array([[ 0.12415357,  0.10366839,  0.093371  , -0.13429401,  0.28804259,
          0.02065756, 0.02563037, 0.11461733]])

          To store coefficients in a data frame along with their respective independent variables -
          coefficients = pd.concat([pd.DataFrame(x_train.columns),pd.DataFrame(np.transpose(lm.coef_))], axis = 1)
          0            Cement  0.124154
          1 Blast 0.103668
          2 Fly Ash 0.093371
          3 Water -0.134294
          4 Superplasticizer 0.288043
          5 CA 0.020658
          6 FA 0.025630
          7 Age 0.114617
          The intercept is:
          lm.intercept_
          array([-34.273527])
          To predict the values of y on the test set we use lm.predict( )
          y_pred = lm.predict(x_test)
          Errors are the difference between observed and predicted values.
          y_error = y_test - y_pred
R square can be obtained using sklearn.metrics:
          from sklearn.metrics import r2_score
          r2_score(y_test,y_pred)
          0.62252008774048395


          Running linear regression using statsmodels:

Note that statsmodels does not add an intercept term automatically, so we need to add a constant column to our model ourselves.
          import statsmodels.api as sma
          X_train = sma.add_constant(x_train) ## let's add an intercept (beta_0) to our model
          X_test = sma.add_constant(x_test) 
Linear regression can be run using the OLS class (from statsmodels.api, imported above as sma):
lm2 = sma.OLS(y_train,X_train).fit()
          The summary of our model can be obtained via:
          lm2.summary()
          """
          OLS Regression Results
          ==============================================================================
          Dep. Variable: CMS R-squared: 0.613
          Model: OLS Adj. R-squared: 0.609
          Method: Least Squares F-statistic: 161.0
          Date: Wed, 03 Jan 2018 Prob (F-statistic): 4.37e-162
          Time: 21:29:10 Log-Likelihood: -3090.4
          No. Observations: 824 AIC: 6199.
          Df Residuals: 815 BIC: 6241.
          Df Model: 8
          Covariance Type: nonrobust
          ====================================================================================
          coef std err t P>|t| [0.025 0.975]
          ------------------------------------------------------------------------------------
          const -34.2735 29.931 -1.145 0.253 -93.025 24.478
          Cement 0.1242 0.010 13.054 0.000 0.105 0.143
          Blast 0.1037 0.011 9.229 0.000 0.082 0.126
          Fly Ash 0.0934 0.014 6.687 0.000 0.066 0.121
          Water -0.1343 0.046 -2.947 0.003 -0.224 -0.045
          Superplasticizer 0.2880 0.102 2.810 0.005 0.087 0.489
          CA 0.0207 0.011 1.966 0.050 2.79e-05 0.041
          FA 0.0256 0.012 2.131 0.033 0.002 0.049
          Age 0.1146 0.006 19.064 0.000 0.103 0.126
          ==============================================================================
          Omnibus: 3.757 Durbin-Watson: 2.033
          Prob(Omnibus): 0.153 Jarque-Bera (JB): 3.762
          Skew: -0.165 Prob(JB): 0.152
          Kurtosis: 2.974 Cond. No. 1.07e+05
          ==============================================================================

          Warnings:
          [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
          [2] The condition number is large, 1.07e+05. This might indicate that there are
          strong multicollinearity or other numerical problems.
          """
          The predicted values for test set are given by:
          y_pred2 = lm2.predict(X_test) 
Note that y_pred and y_pred2 are the same; they are just calculated via different packages.

          Calculate R-Squared and Adjusted R-Squared Manually on Test data

          We can also calculate r-squared and adjusted r-squared via formula without using any package.
          import numpy as np
          y_test = pd.to_numeric(y_test.CMS, errors='coerce')
          RSS = np.sum((y_pred2 - y_test)**2)
          y_mean = np.mean(y_test)
          TSS = np.sum((y_test - y_mean)**2)
          R2 = 1 - RSS/TSS
          R2

          n=X_test.shape[0]
          p=X_test.shape[1] - 1

          adj_rsquared = 1 - (1 - R2) * ((n - 1)/(n-p-1))
          adj_rsquared

          R-Squared : 0.6225
          Adjusted RSquared : 0.60719

          Detecting Outliers:

          Firstly we try to get the studentized residuals using get_influence( ). The studentized residuals are saved in resid_student.
          influence = lm2.get_influence() 
          resid_student = influence.resid_studentized_external
Combining the training set and the residuals we have:
resid = pd.concat([x_train,pd.Series(resid_student,name = "Studentized Residuals")],axis = 1)
resid.head()
   Cement  Blast  Fly Ash  Water  Superplasticizer      CA     FA    Age  \
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28.0
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28.0
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270.0
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365.0
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360.0

Studentized Residuals
0 1.559672
1 -0.917354
2 1.057443
3 0.637504
4 -1.170290
If the absolute value of a studentized residual is more than 3 then that observation is considered an outlier and hence should be removed. We create a logical filter for absolute studentized residuals greater than 3 -
resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:]
     Cement  Blast  Fly Ash  Water  Superplasticizer     CA     FA  Age  \
649 166.8 250.2 0.0 203.5 0.0 975.6 692.6 3.0

Studentized Residuals
649 3.161183
          The index of the outliers are given by ind:
          ind = resid.loc[np.absolute(resid["Studentized Residuals"]) > 3,:].index
          ind
           Int64Index([649], dtype='int64')

          Dropping Outlier

          Using the drop( ) function we remove the outlier from our training sets!
          y_train.drop(ind,axis = 0,inplace = True)
x_train.drop(ind,axis = 0,inplace = True)  #Intercept column is not there
          X_train.drop(ind,axis = 0,inplace = True)  #Intercept column is there

          Detecting and Removing Multicollinearity

          We use the statsmodels library to calculate VIF
          from statsmodels.stats.outliers_influence import variance_inflation_factor
          [variance_inflation_factor(x_train.values, j) for j in range(x_train.shape[1])]
          [15.477582601956859,
          3.2696650121931814,
          4.1293255012993439,
          82.210084751631086,
          5.21853674386234,
          85.866945489015535,
          71.816336942930675,
          1.6861600968467656]

          We create a function to remove the collinear variables. We choose a threshold of 5 which means if VIF is more than 5 for a particular variable then that variable will be removed.
          def calculate_vif(x):
              thresh = 5.0
              output = pd.DataFrame()
              k = x.shape[1]
              vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
              for i in range(1,k):
                  print("Iteration no.")
                  print(i)
                  print(vif)
                  a = np.argmax(vif)
                  print("Max VIF is for variable no.:")
                  print(a)
                  if vif[a] <= thresh :
                      break
                  if i == 1 :         
                      output = x.drop(x.columns[a], axis = 1)
                      vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
                  elif i > 1 :
                      output = output.drop(output.columns[a],axis = 1)
                      vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
              return(output)
          train_out = calculate_vif(x_train) 
          Now we view the training set
          train_out.head()
               Cement Blast Fly Ash Superplasticizer Age
          337   275.1    0.0    121.4               9.9   56
          384 516.0 0.0 0.0 8.2 28
          805 393.0 0.0 0.0 0.0 90
          682 183.9 122.6 0.0 0.0 28
          329 246.8 0.0 125.1 12.0 3


          Removing the variables from the test set.

          x_test.head()
          x_test.drop(["Water","CA","FA"],axis = 1,inplace = True)
          x_test.head()
               Cement  Blast  Fly Ash  Superplasticizer  Age
          173 318.8 212.5 0.0 14.3 91
          134 362.6 189.0 0.0 11.6 28
          822 322.0 0.0 0.0 0.0 28
          264 212.0 0.0 124.8 7.8 3
          479 446.0 24.0 79.0 11.6 7

          Running linear regression again on our new training set (without multicollinearity)

import statsmodels.api as sma
train_out = sma.add_constant(train_out) ## let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test)  # Water, CA and FA were already dropped from x_test above
lm2 = sma.OLS(y_train,train_out).fit()
          lm2.summary()
          """
          OLS Regression Results
          ==============================================================================
          Dep. Variable: CMS R-squared: 0.570
          Model: OLS Adj. R-squared: 0.567
          Method: Least Squares F-statistic: 216.3
          Date: Wed, 10 Jan 2018 Prob (F-statistic): 6.88e-147
          Time: 15:14:59 Log-Likelihood: -3128.8
          No. Observations: 823 AIC: 6270.
          Df Residuals: 817 BIC: 6298.
          Df Model: 5
          Covariance Type: nonrobust
          ====================================================================================
          coef std err t P>|t| [0.025 0.975]
          ------------------------------------------------------------------------------------
          const -11.1119 1.915 -5.803 0.000 -14.871 -7.353
          Cement 0.1031 0.005 20.941 0.000 0.093 0.113
          Blast 0.0721 0.006 12.622 0.000 0.061 0.083
          Fly Ash 0.0614 0.009 6.749 0.000 0.044 0.079
          Superplasticizer 0.7519 0.077 9.739 0.000 0.600 0.903
          Age 0.1021 0.006 16.582 0.000 0.090 0.114
          ==============================================================================
          Omnibus: 0.870 Durbin-Watson: 2.090
          Prob(Omnibus): 0.647 Jarque-Bera (JB): 0.945
          Skew: 0.039 Prob(JB): 0.623
          Kurtosis: 2.853 Cond. No. 1.59e+03
          ==============================================================================

          Checking normality of residuals

We use the Shapiro-Wilk test from the scipy library to check the normality of residuals.

          1. Null Hypothesis: The residuals are normally distributed.
          2. Alternative Hypothesis: The residuals are not normally distributed.

          from scipy import stats
          stats.shapiro(lm2.resid)
          (0.9983407258987427, 0.6269884705543518)

Since the p-value is 0.6269, at the 5% level of significance we fail to reject the null hypothesis and can say that the residuals are normally distributed.

          Checking for autocorrelation

To check for the absence of autocorrelation we use the Ljung-Box test.

          1. Null Hypothesis: Autocorrelation is absent.
          2. Alternative Hypothesis: Autocorrelation is present.

          from statsmodels.stats import diagnostic as diag
          diag.acorr_ljungbox(lm2.resid , lags = 1) 
          (array([ 1.97177212]), array([ 0.16025989]))

Since the p-value is 0.1602, we fail to reject the null hypothesis and can say that autocorrelation is absent.

          Checking heteroscedasticity

          Using Goldfeld Quandt we test for heteroscedasticity.

          1. Null Hypothesis: Error terms are homoscedastic
          2. Alternative Hypothesis: Error terms are heteroscedastic.

          import statsmodels.stats.api as sms
          from statsmodels.compat import lzip
          name = ['F statistic', 'p-value']
          test = sms.het_goldfeldquandt(lm2.resid, lm2.model.exog)
          lzip(name, test)
          [('F statistic', 0.9903), ('p-value', 0.539)]

The p-value is 0.539, hence we fail to reject the null hypothesis and can say that the residuals have constant variance. Thus all the assumptions of our linear regression model are satisfied.

          SAS Date Formats and Informats

          This tutorial describes the usage of SAS Date formats and informats. It includes practical real-world data problems related to SAS formats.

          What are Formats and Informats?

An informat tells SAS how to read a variable, whereas a format tells SAS how to display or write the values of a variable.

Informats are mainly used when you read or import data from an external file (Text/Excel/CSV) or read in sample data created with a CARDS/DATALINES statement. They are also used when you create a new variable in a dataset.

          Formats can be used in both Data Steps and PROC Steps whereas Informat can be used only in Data Steps. Let's understand by examples -

          Example 1 - Read Dates in SAS

In the program below, we have used the INFORMATS ddmmyy8. and ddmmyy10. to read dates in SAS. It creates a dataset called sampledata which is stored in the WORK library.

          DATA sampledata;
               INPUT @6 date1 ddmmyy8. @15 date2 ddmmyy10.;
              CARDS;
               30-12-16 30-12-2016
          ;
          RUN;
          The INFORMATS ddmmyy8. is used to read 30-12-16 date and ddmmyy10. to read 30-12-2016 date. In this case, 8 and 10 refers to width of the date.

          The created dataset looks like below -
          Read SAS Date
It returns 20818 because SAS stores dates as the number of days since 1 January 1960. The value itself is not meaningful; you cannot tell which date it is just by looking at it. To display it as a real date, use the FORMAT statement.
          DATA sampledata;
               INPUT @6 date1 ddmmyy8. @15 date2 ddmmyy10.;
               FORMAT date1 ddmmyy8. date2 ddmmyy10.;
          cards;
               30-12-16 30-12-2016
          ;
          RUN;
          Output : SAS Format

          How to read DD-MMM-YY format

          You can use date11. format for both DD-MMM-YY and DD-MMM-YYYY format.
          DATA temp;
               INPUT @6 dt date11.;
               FORMAT dt date11.;
               CARDS;
               10-oct-14
               ;
          PROC PRINT NOOBS;
          RUN;
          Result : 10-OCT-2014

          Example 2 - Display Today's Date

          The today() function can be used to generate current date.
          data _null_;
              dt=today();
              format dt yymmdd10.;
              put dt ;
          run;
Result : It returns 2016-12-30, as 30DEC2016 was the current date when the code was run. It's in YYYY-MM-DD format because we've used the yymmdd10. format; the 10 refers to the width of the date, as 2016-12-30 contains 10 characters. The PUT statement writes the value to the log window.

          To display date in WORD format

          1. Short Word Date Format

          The format date9. returns 30DEC2016.
          format dt date9.;

          2. Complete Word Date Format

          The format WORDDATE. returns DECEMBER 30, 2016. No need to specify width in this format. It automatically adjusts the width depending on the month.
          format dt WORDDATE.;

          3. Including WEEK

          The format WEEKDATE. gives Friday, December 30, 2016
          format dt WEEKDATE.;

          Display DAY / MONTH / YEAR

          In this section, we will see how we can write only day, month, year and weekday.
          data _null_;
          dt=today();
          put "Day :" dt  DAY.;
          put "Month :" dt MONTH.;
          put "YEAR:" dt YEAR.;
          put "WEEKDAY:" dt DOWNAME.;
          run;
          SAS : Date Formats
          We can also use FORMAT in the PUT statement without specifying FORMAT statement explicitly. The DAY. format returned 30, MONTH. format returned 12 and YEAR. format returned 2016. In addition, we have used DOWNAME. format to extract weekday (Friday).

          Other Popular Formats

          Some of the commonly used date formats are listed below -
Format          Result
DDMMYYP10.      30.12.2016
DDMMYYS10.      30/12/2016
MMDDYYP10.      12.30.2016
MMDDYYS10.      12/30/2016
WORDDATX19.     30 DECEMBER 2016

          Endnotes

          Hope you have a better understanding of the difference between SAS Date Formats and Informats and how they are used after completing this tutorial.

          Tutorial : Build Webapp in R using Shiny

          In this tutorial, we will cover how to build shiny app from scratch in R. It includes various examples which would help you to get familiar with shiny package.

Shiny is an R package developed by RStudio that can be used to create interactive web pages with R. In simple words, you can build a web page (online reporting tool) without knowing any web programming languages such as JavaScript / PHP / CSS.

The best part about the shiny package is that you can easily integrate R with a webpage. Suppose you want your web page to run machine learning algorithms like random forest, SVM etc. and display a summary of the model, with the flexibility of letting the user select the inputs. Shiny can do this very easily.
          R : Shiny Tutorial


          Shiny's prominent features

          1. Customizable widgets like sliders, drop down lists, numeric inputs and many more.
          2. Downloading datasets, graphs and tables in various formats.
          3. Uploading files.
          4. Provides utility to create brilliant plots.
          5. In-built functions for viewing data or printing the text or summaries.
          6. Reactive programs which makes data handling easier.
7. Conditional panels which appear only when a particular condition is met.
          8. Works in any R environment (Console R, RGUI for Windows or Mac,  RStudio, etc)
          9. No need to learn another software for online dashboarding
          10. Can style your app with CSS / HTML (Optional)

          Must things in shiny app code

          1. User Interface (UI) : It controls the layout and appearance of various widgets on the web page. Every shiny app requires a user interface which is controlled by ui script. 
          2. Server: It contains the instructions that your computer needs when the user interacts with the app.
Example - You must have seen or created interactive charts in Excel. To make a chart interactive, we use drop downs, list boxes or other user controls. When the user changes a value in a drop down, the chart gets updated.

The UI is responsible for creating these drop downs and list boxes and telling Shiny where to place the user controls and where to place the charts, while the server is responsible for creating the chart and the data in the table.

          Basic layout of UI

User Interface: A simple shiny UI consists of a fluidPage( ) which contains various panels. We can divide the display into two parts named sidebarPanel( ) and mainPanel( ), both of which are placed inside sidebarLayout( ).

          In the following image you can get an idea what is a title panel, sidebar panel and main panel in a shiny app.
          1. Title panel is a place where the title of the app is displayed.
          2. Sidebar panel is where special instructions or widgets (drop down / slider/ checkbox) are displayed to the app user. The sidebar panel appears on the left side of your app by default. You can move it to the right side by changing the position argument in the sidebar layout.
          3. Main panel is the area where all the outputs are generally placed.

          Shiny Elements

          Installing Shiny

          First we need to install shiny package by using command install.packages( ).
          install.packages("shiny")

          Loading Shiny

          Shiny package can be loaded by using library( ).
          library(shiny)

          The first simple shiny app with basic layout

          ui = fluidPage(sidebarLayout(sidebarPanel("Welcome to Shiny App"),
                                       mainPanel("This is main Panel")))
          server = function(input, output) {  }
          shinyApp(ui, server)
          Sidebar Panel and Main Panel 


          Guidelines for beginners to run a shiny app

Step 1 : shinyApp(ui, server) is an in-built function in the shiny package that runs the app with ui and server as the arguments. Select the code and run it. Once you do it successfully, you will find a message like Listening on http://127.0.0.1:4692 in the console.

Step 2 : To create your app you need to save the code as an app.R file; a Run App button will then be displayed in RStudio. Click on it and a new window with your app will appear.
          Shiny App

          Some more panels...

          There are some additional panels which can be added to sidebarPanel and mainPanel depending upon the layout and requirements of the app. Some of them which shall be explained later in this tutorial are:
          Shiny : Panels


          Adding a title to your App!

Using titlePanel( ) one can provide an appropriate title for the app. Note that a comma (,) is placed after titlePanel( ).
          ui =  fluidPage(titlePanel("My favorite novels:"),
                          sidebarLayout(sidebarPanel(),
                                        mainPanel()))
          server = function(input, output) {
          }
          shinyApp(ui, server)
          Title : Shiny App

          Using HTML tags in Shiny

Content can be added to the various panels. To change the appearance of the text - bold, italics, images, different fonts and colors, headings etc. - we can use various HTML helper functions in shiny. Many of them share the name of the corresponding HTML tag:
          HTML Tags

          Creating a hyperlink

A hyperlink can be created using a( ), where the first argument is the text displayed for the link and href contains the URL of the website we want to link to.
          ui =  fluidPage(sidebarLayout(sidebarPanel(
            a("Click here!", href = "http://www.listendata.com/")),
            mainPanel()))
          server = function(input, output) {}
          shinyApp(ui, server)

          Modifying the text presentation using HTML tags.

We create an app containing a list of favorite novels. You can refer to the above-mentioned table of HTML and shiny functions.
          ui =  fluidPage(titlePanel("My favorite novels:"),
                          sidebarLayout(sidebarPanel(
                            ("My favorite novels are:"),
                            br(),
                            h4(strong("The Kiterunner"), "a novel by", em("Khaled Hoseinni")),
                            h3(strong("Jane Eyre"), "a novel by", code("Charolette Bronte")),
                            strong(
                              "The diary of a young girl",
                              "by",
                              span("Anne Frank", style = "color:blue")
                            ),
                            div(strong("A thousand splendid suns"), "by Khaled Hoseinni", style = "color: red")
                          ),
                          mainPanel()))
          server = function(input, output) { }
          shinyApp(ui, server)

          Note that "Charolette Bronte"in the app would be written in a coded style;
          Difference between span( ) and div( ) span( ) wrote "Anne Frank" on the same line with "blue" color.  div( ) is similar to span( ), it is just that it creates the text in a different line. 
          Shiny : HTML Formating

          Introducing widgets

Various widgets are used in shiny to collect user input and drive the outputs. These widgets can be inserted in the ui function (anywhere in the main panel or sidebar panel).
          The most commonly used widgets are:
          Shiny Widgets

          The following image tells how various widgets appear on running an app.

          Shiny Widgets
• 'Buttons' can be created using the actionButton and submitButton widgets
          • Single check box, multiple check box and date inputs are created using checkboxInput, checkboxGroupInput and dateInput respectively.
          • Date range is created using dateRangeInput.

          Most commonly used widgets

All the widgets require an inputId which we use to retrieve their values; this inputId is not visible to the app user. The label is the heading for the widget and is what the user sees when the app is running. To understand this better, we create an app that collects details from the user through the widgets provided by shiny.

          HelpText and TextInput

          ui =  fluidPage(sidebarLayout(
            sidebarPanel(helpText("This questionnaire is subject to privacy."),
           
              textInput(inputId = "name", label = "Enter your name.")
            ),

            mainPanel()

          ))
          server = function(input, output) { }
          shinyApp(ui, server)
          helptext() and Text Input


          helpText( ) creates a disclaimer which will be displayed on the sidebarPanel.


          Adding SliderInput
          ui =  fluidPage(sidebarLayout(
            sidebarPanel(
              helpText("This questionnaire is subject to privacy."),
           
              textInput(inputId = "name", label = "Enter your name."),
              sliderInput(
                inputId = "age",
                label = "What is your age?",
                min = 1,
                max = 100,
                value = 25
              )
            ),

            mainPanel()

          ))
          server = function(input, output) { }
          shinyApp(ui, server)

In sliderInput we set the inputId to "age" and the label displayed in the app to "What is your age?". min = 1 and max = 100 give the minimum and maximum values for the slider, and value = 25 is the default selected value.

          SliderInput

          RadioButtons, NumericInput and CheckBoxInput

          ui =  fluidPage(sidebarLayout(
            sidebarPanel(
              radioButtons(
                inputId = "month",
                label = "In which month are you born?",
                choices = list(
                  "Jan - March" = 1,
                  "April - June" = 2,
                  "July - September" = 3,
                  "October - November" = 4
                ),
                selected = 1
              ),
           
              numericInput(
                inputId = "children_count",
                label = "How many children do you have?",
                value = 2,
                min = 0,
                max = 15
              ),
           
              selectInput(
                inputId  = "smoker",
                label = "Are you a smoker?",
                choices = c("Yes", "No", "I smoke rarely"),
                selected = "Yes"
              ),
           
              checkboxInput(
                inputId = "checkbox",
                label = "Are you a human?",
                value = FALSE
              ),
           
              checkboxGroupInput(
                inputId = "checkbox2",
                label = "2 + 2 = ?",
                choices = list(4, "Four", "IV", "None of the above")
              )
           
            ),

            mainPanel()

          ))
          server = function(input, output) { }
          shinyApp(ui, server)
          Other common Widgets

In the radioButtons and selectInput widgets we define the list of options in the choices parameter. The selected parameter gives the default selected option.

          Using fluidRow

The output of our above app looks a bit odd: everything sits in the sidebar panel and nothing in the main panel. We can make it more polished by removing the mainPanel and arranging the widgets in rows.

We use fluidRow for this. Keep in mind that the width of a row is 12 units, so if the widgets in a row require more than 12 units of width, the last widget will be pushed onto the next row.

          Let us create the above same app using fluidRow.

          Our app creates textInput, sliderInput and radioButtons in one row.

          ui =  fluidPage(
          helpText(
          "This questionnaire is subject to privacy. All the information obtained will be confidential."
          ),


          fluidRow(
          column(4,textInput(inputId = "name", label = "Enter your name.")),


          column(
          4, sliderInput(
          inputId = "age",
          label = "What is your age?",
          min = 1,
          max = 100,
          value = 25
          )
          ),

          column(
          4, radioButtons(
          inputId = "month",
          label = "In which month are you born?",
          choices = list(
          "Jan - March" = 1,
          "April - June" = 2,
          "July - September" = 3,
          "October - November" = 4
          ),

          selected = 1
          )
          )
          ),


          fluidRow(column(
          6, numericInput(
          inputId = "children_count",
          label = "How many children do you have?",
          value = 2,
          min = 0,
          max = 15
          )
          ) )
          )

          server = function(input, output) { }

          shinyApp(ui, server)

          fluidrow

          In column(6,...) 6 denotes the width required by one widget. To move to the next row another fluidRow command is used.

          Time to get some output!

So far we have only been providing inputs to our server function, but note that the server function also has output as an argument. Thus in the ui we can declare various outputs such as textOutput( ), verbatimTextOutput( ), tableOutput( ) and plotOutput( ).
These output functions are defined in the ui and are given an ID (key); using that key we refer to them in the server function.

In the server function we use render* functions to generate the outputs. Some of the most common ones are renderText( ), renderPrint( ), renderTable( ) and renderPlot( ).
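As a minimal sketch of this pairing (the IDs "name" and "greeting" are just illustrative), textOutput( ) declared in the ui is filled by renderText( ) in the server -
ui = fluidPage(textInput(inputId = "name", label = "Enter your name."),
               textOutput("greeting"))
server = function(input, output) {
  output$greeting = renderText({ paste("Hello,", input$name) })
}
shinyApp(ui, server)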

          Dealing with dates

          Using dateInput( ) we can select the dates from our calendar.

          The inputID is "bday", and the label which will be displayed in our app is "Enter your Date of Birth" and by default value is 1st January, 1985.

          The verbatimTextOutput is used in the ui and it will be referred in the server as "text".

          In the server function we use output$text to tell shiny that the following output will be displayed in verbatimTextOutput("text").

          The renderPrint( ) denotes our output to be printed and we get the date of birth printed using input$bday (Recall bday was the inputID in our dateInput).
          ui = fluidPage(dateInput(
            "bday",
            label = h3("Enter your Date of Birth"),
            value = "1985-01-01"
          ),
          verbatimTextOutput("text"))

          server = function(input, output) {
            output$text <- renderPrint({
              paste(input$bday, "was a blessed day!")
           
            })
          }
          shinyApp(ui, server)

          Viewing Data

Here we are using the iris dataset and we want to display the data only for the particular species selected by the user.

Using selectInput( ) we choose the species with inputId "specie". In the main panel we want our output as a table, thus we use tableOutput( ). In the server( ) function, output$data matches tableOutput("data") and renders a table using renderTable.

          ui =  fluidPage(titlePanel("Viewing data"),
                       
                          sidebarLayout(sidebarPanel(
                            selectInput(
                              inputId  = "specie",
                              label = "Select the flower specie",
                              choices = c("setosa", "versicolor", "virginica"),
                              selected = "setosa"
                            )
                          ),
                       
                          mainPanel(tableOutput("data"))))
          server = function(input, output) {
            output$data  = renderTable({
              iris[iris$Species == input$specie, ]
            })
          }
          shinyApp(ui, server)

          Reactivity in Shiny

Shiny apps use a functionality called reactivity, which means that a shiny app will be responsive to changes in inputs. It's similar to MS Excel, where changing one cell has an effect on the whole workbook.

          It is quite useful to define reactive( ) function when there are multiple widgets.

Suppose we have two widgets with inputIds 'a' and 'b', and two reactive functions, say 'X' and 'Y', one for each. Thus if the value in 'a' changes, reactive function 'X' will be updated, and if 'b' changes, reactive function 'Y' will be updated.

If a change is made only in 'a' while 'b' stays the same, then reactive function 'X' will be re-evaluated but 'Y' will be skipped. Hence it saves a lot of time and avoids unnecessary recomputation.
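A minimal sketch of this idea (the inputIds 'a' and 'b' and the calculations are illustrative) -
ui = fluidPage(numericInput("a", "Value of a", value = 1),
               numericInput("b", "Value of b", value = 1),
               verbatimTextOutput("out"))
server = function(input, output) {
  X = reactive({ input$a * 2 })   # re-evaluated only when 'a' changes
  Y = reactive({ input$b + 10 })  # re-evaluated only when 'b' changes
  output$out = renderPrint({ c(X(), Y()) })
}
shinyApp(ui, server)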


          Creating Plots

          Here we want to display the histogram by selecting any one variable in the iris dataset available in R.

          Using plotOutput in main panel we refer to the server function.

In the server function we are using reactive. It means that the value will be recalculated only when input$characterstic changes.

          The output$myplot matches to plotOutput("myplot") and hence draws the histogram using renderPlot( )
          ui =  fluidPage(titlePanel("Creating the plots!"),
                          sidebarLayout(sidebarPanel(
                            selectInput(
                              inputId  = "characterstic",
                              label = "Select the characterstic for which you want the histogram",
                              choices = c("Sepal Length", "Sepal Width" ,
                                          "Petal Length", "Petal Width"),
                              selected = "Sepal Length"
                            )
                          ),
                       
                          mainPanel(plotOutput("myplot"))))
          server = function(input, output) {
            char = reactive({
              switch(
                input$characterstic,
                "Sepal Length" = "Sepal.Length",
                "Sepal Width" = "Sepal.Width",
                "Petal Length" = "Petal.Length",
                "Petal Width" = "Petal.Width"
              )
            })

            output$myplot  = renderPlot({
              hist(
                iris[, char()],
                xlab = input$characterstic,
                main = paste("Histogram of", input$characterstic)
              )
            })

          }
          shinyApp(ui, server)

          Well Panel and Vertical Layout

          Vertical Layout creates a layout in which each element passed in the UI will appear in its own line.
          WellPanel creates a panel with a border and a grey background.

          In the following example we are trying to create an app where we draw the QQ plot for random sample from normal distribution.

          Using the sliders we define the size of the sample. By default it is 500.
          ui = fluidPage(titlePanel("Vertical layout"),
          verticalLayout(wellPanel(
          sliderInput("n", "QQ Plot of normal distribution", 100, 1000,
          value = 500)
          ),
          plotOutput("plot1")))
          server = function(input, output) {
          output$plot1 = renderPlot({
          qqnorm(rnorm(input$n))
          })
          }
          shinyApp(ui, server)

          Creating tabs

          We can create various tabs in shiny where some particular output is displayed in a particular tab. This can be done using tabsetPanel.

          We are creating an app in which the user selects the columns for which he wants the summary and the boxplot.

In the main panel we are creating the tabs. Each tab has a label and the output to be shown.
For instance, the first tab's label is 'Summary' and it will show the verbatimTextOutput, while the other tab has the label 'Boxplot' with the output plotted using plotOutput.
          ui =  fluidPage(titlePanel("Creating the tabs!"),
          sidebarLayout(sidebarPanel(
          radioButtons(
          inputId = "characterstic",
          label = "Select the characterstic for which you want the summary",
          choices = c(
          "Mileage" = "mpg",
          "Displacement" = "disp",
          "Horsepower" = "hp",
          "Rear axle ratio" = "drat",
          "Weight" = "wt"
          ),
          selected = "mpg"
          )
          ),
          mainPanel(tabsetPanel(
          tabPanel("Summary", verbatimTextOutput("mysummary")),
          tabPanel("Boxplot", plotOutput("myplot"))
          ))))

          server = function(input, output) {
          output$mysummary = renderPrint({
          summary(mtcars[, input$characterstic])
          })

          output$myplot = renderPlot({
          boxplot(mtcars[, input$characterstic], main = "Boxplot")
          })
          }
          shinyApp(ui, server)
          Creating tabs in Shiny

          Some more plots...

In this example we are using the VADeaths data. We first select the area (Rural or Urban) and gender (Male or Female) and then plot a barplot showing the death rates for the different age groups.
          ui = fluidPage(
          titlePanel("Death rates by Gender and area"),

          sidebarLayout(
          sidebarPanel(
          selectInput("area", "Choose the area",
          choices = c("Rural", "Urban")),
          br(),
          selectInput("gender", "Choose the gender", choices = c("Male", "Female"))
          ),

          mainPanel(plotOutput("deathrate"))

          )
          )

          server = function(input, output) {
          output$deathrate <- renderPlot({
          a = barplot(VADeaths[, paste(input$area, input$gender)],
          main = "Death Rates in Virginia",
          xlab = "Age Groups")
          text(a,
          y = VADeaths[, paste(input$area, input$gender)] - 2,
          labels = VADeaths[, paste(input$area, input$gender)],
          col = "black")
          })
          }

          shinyApp(ui, server)

          Conditional Panels

Suppose you wish to create outputs only when a particular option is selected or when a particular condition is satisfied. For such a purpose we can use conditional panels, where we define the condition as a JavaScript expression and then define the output or widget that should appear when the condition is met. A simple example of a conditional panel is given below: first we ask how many hours a person sleeps; if someone sleeps for less than 7 hours then he needs more sleep, and if someone sleeps for 9 or more hours then he sleeps a lot.
          ui = fluidPage(
          titlePanel("Conditional Panels"),
          sidebarPanel(
          numericInput("num","How many hours do you sleep?",min = 1,max = 24,value = 6)),
          mainPanel(
          conditionalPanel("input.num < 7","You need more sleep"),
          conditionalPanel("input.num >= 9","You sleep a lot")
          )
          )
          server = function(input,output){

          }
          shinyApp(ui,server)

          Note: The first argument in conditional panel is a JavaScript expression thus we write input.num and not input$num to access the input value of sleeping hours.


          Conditional Panel : Example 2

In the following example we are using the income.csv file. First we ask which variable the user wants to work with and save that column in 'a' using reactive( ). Then, using uiOutput, we insert a widget asking whether the user wants the summary, to view the data, or the histogram. Based on the option selected by the user, we create conditional panels for the summary, viewing the data and plotting the histogram.
          income = read.csv("income.csv", stringsAsFactors = FALSE)

          ui = fluidPage(titlePanel(em("Conditional panels")),
          sidebarLayout(
          sidebarPanel(
          selectInput(
          "Choice1",
          "Select the variable",
          choices = colnames(income)[3:16],
          selected = "Y2008"
          ),
          uiOutput("Out1")
          ),
          mainPanel(
          conditionalPanel("input.Choice2 === 'Summary'", verbatimTextOutput("Out2")),
          conditionalPanel("input.Choice2 === 'View data'", tableOutput("Out3")),
          conditionalPanel("input.Choice2 === 'Histogram'", plotOutput("Out4"))
          )
          ))

          server = function(input, output) {
          a = reactive({
          income[, colnames(income) == input$Choice1]
          })
          output$Out1 = renderUI({
          radioButtons(
          "Choice2",
          "What do you want to do?",
          choices = c("Summary", "View data", "Histogram"),
          selected = "Summary"
          )
          })
          output$Out2 = renderPrint({
          summary(a())
          })
          output$Out3 = renderTable({
          return(a())
          })
          output$Out4 = renderPlot({
          return(hist(a(), main = "Histogram", xlab = input$Choice1))
          })
          }
          shinyApp(ui = ui, server = server)

          Downloading Data

Shiny allows users to download datasets. This can be done by using downloadButton in the UI and downloadHandler in the server. First we select the data using radioButtons and save the chosen dataset using reactive( ) in the server. Then in the UI we create a downloadButton where the first argument is the inputId and the other one is the label. downloadHandler needs two arguments: filename and content. In 'filename' we specify the name under which the file should be saved, and in 'content' we write the dataset into a csv file.
          ui =  fluidPage(titlePanel("Downloading the data"),
          sidebarLayout(sidebarPanel(
          radioButtons(
          "data",
          "Choose a dataset to be downloaded",
          choices = list("airquality", "iris", "sleep"),
          selected = "airquality"
          ),
          downloadButton("down", label = "Download the data.")
          ),
          mainPanel()))

          server = function(input, output) {

          # Reactive value for selected dataset ----
          datasetInput = reactive({
          switch(input$data,
          "airquality" = airquality,
          "iris" = iris,
          "sleep" = sleep)
          })

          # Downloadable csv of selected dataset ----
          output$down = downloadHandler(
          filename = function() {
          paste(input$data, ".csv", sep = "")
          },
          content = function(file) {
          write.csv(datasetInput(), file, row.names = FALSE)
          }
          )

          }
          shinyApp(ui, server)

          Uploading a file

So far we were dealing with datasets built into R. To allow users to upload their own datasets and analyze them, the fileInput function in the UI lets users upload a file. Here we are creating an app to upload files. In fileInput, multiple = F denotes that only one file can be uploaded by the user and accept = ".csv" restricts the type of file which can be uploaded. Then we ask the user whether he wants to view the head of the data or the entire dataset, which is then displayed using renderTable.
          library(shiny)
          ui = fluidPage(titlePanel("Uploading file in Shiny"),
          sidebarLayout(
          sidebarPanel(
          fileInput(
          "myfile",
          "Choose CSV File",
          multiple = F,
          accept = ".csv"
          ),

          checkboxInput("header", "Header", TRUE),

          radioButtons(
          "choice",
          "Display",
          choices = c(Head = "head",
          All = "all"),
          selected = "head"
          )
          ),

          mainPanel(tableOutput("contents"))

          ))
          server = function(input, output) {
          output$contents = renderTable({
          req(input$myfile)

          data = read.csv(input$myfile$datapath,
          header = input$header)

          if (input$choice == "head") {
          return(head(data))
          }
          else {
          return(data)
          }

          })
          }
          shinyApp(ui, server)

          Sharing the app with others

Method I : Sharing the R code: You can share your app with others by sharing your R code. To make it work, users need to have R installed on their system.

Method II : Share your app as a web page: You need to create an account on shinyapps.io and follow the instructions below to share your app.R file.

          Deploying shiny app on shinyapps.io

          First you need to have an account on shinyapps.io.

          Import library rsconnect by using
          library(rsconnect) 
          Then you need to configure the rsconnect package to your account using the code below -
          rsconnect::setAccountInfo(name="<ACCOUNT>", token="<TOKEN>", secret="<SECRET>")
          To deploy the app you can write:
rsconnect::deployApp('Folder path in which your app.R file is saved')
As a result, a web page with your app's link will open.

          Shiny App for Normality

          In this app the user first selects the variable for which he wants to test the normality. Then he is asked whether he wants to check normality via plots or some statistical tests. If the user selects plots then he will be asked whether he wants a Histogram or a QQ-Plot. The link for the shiny app is:  My Shiny App
          ui =  fluidPage(titlePanel("My first App"),
          sidebarLayout(
          sidebarPanel(
          selectInput(
          "varchoice",
          "Choose the variable for which you want to check the normality",
          choices = c("mpg", "disp", "drat", "qsec", "hp", "wt")
          ),
          radioButtons(
          "normchoice",
          "How do you want to check the normality?",
          choices = c("Plots", "Tests"),
          selected = "Plots"
          ),
          conditionalPanel(
          "input.normchoice == 'Plots'",
          selectInput(
          "plotchoice",
          "Choose which plot you want?",
          choices = c("Histogram", "QQ-Plot")
          )
          )


          ),
          mainPanel(
          conditionalPanel("input.normchoice == 'Plots'", plotOutput("myplot")),
          conditionalPanel("input.normchoice == 'Tests'", verbatimTextOutput("mytest"))
          )
          ))
          server = function(input, output) {
          var = reactive({
          mtcars[, input$varchoice]

          })
          output$myplot = renderPlot({
          if (input$plotchoice == "Histogram")
          return(hist(var(), main = "Histogram", xlab = input$varchoice))
          else
          return(qqnorm(var(), main = paste("QQ plot of", input$varchoice)))
          })
          output$mytest = renderPrint({
          shapiro.test(var())
          })
          }

          shinyApp(ui, server)
Following is a clip of how the app will look when the link is opened:
          My First Shiny App

          Web Scraping Matrimonial Website with R

          In this tutorial, we will cover how to extract information from a matrimonial website using R.  We will do web scraping which is a process of converting data available in unstructured format on the website to structured format which can be further used for analysis.

We will use an R package called rvest, which was created by Hadley Wickham. This package simplifies the process of scraping web pages.
Web Scraping in R

          Install the required packages

          To download and install the rvest package, run the following command. We will also use dplyr which is useful for data manipulation tasks.
          install.packages("rvest")
          install.packages("dplyr")

          Load the required Libraries

To load these libraries, run the code below.
          library(rvest)
          library(dplyr)

          Scrape Information from Matrimonial Website

          First we need to understand the structure of URL. See the URLs below.
          https://www.jeevansathi.com/punjabi-brides-girls
          https://www.jeevansathi.com/punjabi-grooms-boys

The first URL takes you to the webpage showing girls' profiles from the Punjabi community, whereas the second URL shows boys' profiles from the Punjabi community.

          We need to split the main URL into different elements so that we can access it. 
          Main_URL = Static_URL + Mother_Tongue + Brides_Grooms
Check out the following R code to see how the main URL is prepared. In the code, you need to provide the following details -
          1. Whether you are looking for girls'/boys' profiles. Type bride to see girls' profiles. Enter groom to check out boys' profiles.
          2. Select Mother Tongue. For example, punjabi, tamil etc.
          # Looking for bride/groom
          Bride_Groom = "bride"
          # Possible Values : bride, groom

          # Select Mother Tongue
          Mother_Tongue = "punjabi"
          # Possible Values
          # punjabi
          # tamil
          # bengali
          # telugu
          # kannada
          # marathi

          # URL
          if (tolower(Bride_Groom) == "bride") {
          html = paste0('https://www.jeevansathi.com/',tolower(Mother_Tongue),'-brides-girls')
          } else {
          html = paste0('https://www.jeevansathi.com/',tolower(Mother_Tongue),'-grooms-boys')
          }
          See the output :
          [1] "https://www.jeevansathi.com/punjabi-brides-girls"

          Extract Profile IDs

          First you need to select parts of an html document using css selectors: html_nodes(). Use SelectorGadget which is a chrome extension available for free. It is the easiest and quickest way to find out which selector pulls the data that you are interested in.

          How to use SelectorGadget : Click on a page element that you would like your selector to match (it will turn green). It will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
          text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
          profileIDs = data.frame(ID = text)
                   ID
          1 ZARX0345
          2 ZZWX5573
          3 ZWVT2173
          4 ZAYZ6100
          5 ZYTS6885
          6 ZXYV9849
          7 TRZ8475
          8 VSA7284
          9 ZXTU1965
          10 ZZSA6877
          11 ZZSZ6545
          12 ZYSW4809
          13 ZARW2199
          14 ZRSY0723
          15 ZXAT2801
          16 ZYXX8818
          17 ZAWA8567
          18 WXZ2147
          19 ZVRT8875
          20 ZWWR9533
          21 ZYXW4043
The basic functions in rvest are very user-friendly and robust. Explanations of these functions are listed below -
1. read_html() : creates an html document from a URL
2. html_nodes() : extracts pieces out of HTML documents
3. html_nodes(".class") : selects nodes based on their CSS class
4. html_nodes("#class") : selects nodes (e.g. <div>, <span>, <pre>) based on their id attribute
5. html_text() : extracts only the text from an HTML tag
6. html_attr() : extracts the contents of a single attribute

          Difference between .class and #class

1. .class targets the following element:
<div class="class"></div>

2. #class targets the following element:
<div id="class"></div>
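As an illustrative sketch (the HTML snippet and selectors below are hypothetical, not taken from the matrimonial site) -
library(rvest)
page = read_html("<div class='profile'><a id='name' href='/p/1'>Asha</a></div>")
page %>% html_nodes(".profile") %>% html_text()    # select by class, returns "Asha"
page %>% html_nodes("#name") %>% html_attr("href") # select by id, extract attribute, returns "/p/1"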

          Get Detailed Information of Profiles

          The following program performs the following tasks -
          1. Loop through profile IDs
          2. Pull information about Age, Height, Qualification etc.
          3. Extract details about appearance
          4. Fetch 'About Me' section of profiles
          # Get Detailed Information
          finaldf = data.frame()
          for (i in 1:length(profileIDs$ID)){
          ID = profileIDs[i,1]
          link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)
          FormattedInfo = data.frame(t(read_html(link) %>% html_nodes(".textTru li") %>%
          html_text()))
          # Final Table
          FormattedInfo = data.frame(ProfileID = ID,
          Description = read_html(link) %>%
          html_nodes("#myinfoView") %>%
          html_text(),
          Appearance = read_html(link) %>%
          html_nodes("#section-lifestyle #appearanceView") %>%
          html_text(),
          FormattedInfo)

          finaldf = bind_rows(finaldf, FormattedInfo)
          }

          # Assign Variable Names
          names(finaldf) = c("ProfileID", "Description", "Appearance", "Age_Height", "Qualification", "Location", "Profession", "Mother Tongue", "Salary", "Religion", "Status", "Has_Children")
          Web Scraping Output
          Web Scraping Output PartII


          Download Display Pic

          To download display pic, you first need to fetch image URL of profile and then hit download.file( ) function to download it. In the script below, you need to provide a profile ID.
          # Download Profile Pic of a particular Profile
          ID = "SWU6683"
          text3 = read_html(html) %>% html_nodes(".vtop") %>% html_attr('src')
          pic = data.frame(cbind(profileIDs, URL = text3[!is.na(text3)]))
          download.file(as.character(pic$URL[match(ID, pic$ID)]), "match.jpg", mode = "wb")
          # File saved as match.jpg

          Disclaimer
We have accessed only publicly available data which does not require login or registration. The purpose is not to cause any damage or to copy the content from the website.
          Other Functions of rvest( )
You can extract, modify and submit forms with html_form(), set_values() and submit_form(). Refer to the case study below -

You can collect google search results by submitting the google search form with a search term. You need to supply the search term. Here, I entered 'Datascience' as the search term.
          library(rvest)
          url       = "http://www.google.com"
          pgsession = html_session(url)           
          pgform    = html_form(pgsession)[[1]]

          # Set search term
          filled_form = set_values(pgform, q="Datascience")
          session = submit_form(pgsession,filled_form)

          # look for headings of first page
          session %>% html_nodes(".g .r a") %>% html_text()
           [1] "Data science - Wikipedia"
          [2] "Data Science Courses | Coursera"
          [3] "Data Science | edX"
          [4] "Data science - Wikipedia"
          [5] "DataScience.com | Enterprise Data Science Platform Provider"
          [6] "Top Data Science Courses Online - Updated February 2018 - Udemy"
          [7] "Data Science vs. Big Data vs. Data Analytics - Simplilearn"
          [8] "What Is Data Science? What is a Data Scientist? What is Analytics?"
          [9] "Online Data Science Courses | Microsoft Professional Program"
          [10] "News for Datascience"
          [11] "Data Science Course - Cognitive Class"

          15 Types of Regression you should know

          Regression techniques are one of the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only 2-3 types of regression which are commonly used in real world. They are linear and logistic regression. But the fact is there are more than 10 types of regression algorithms designed for various types of analysis. Each type has its own significance. Every analyst must know which form of regression to use depending on type of data and distribution.

          Table of Contents
          1. What is Regression Analysis?
          2. Terminologies related to Regression
          3. Types of Regressions
          • Linear Regression
          • Polynomial Regression
          • Logistic Regression
          • Quantile Regression
          • Ridge Regression
          • Lasso Regression
          • ElasticNet Regression
          • Principal Component Regression
          • Partial Least Square Regression
          • Support Vector Regression
          • Ordinal Regression
          • Poisson Regression
          • Negative Binomial Regression
          • Quasi-Poisson Regression
          • Cox Regression
4. How to choose the correct Regression Model?
5. Regression Analysis Simplified


          What is Regression Analysis?

Lets take a simple example : Suppose your manager asked you to predict annual sales. There can be hundreds of factors (drivers) that affect sales. In this case, sales is your dependent variable and the factors affecting sales are the independent variables. Regression analysis would help you to solve this problem.
          In simple words, regression analysis is used to model the relationship between a dependent variable and one or more independent variables.

          It helps us to answer the following questions -
1. Which of the drivers have a significant impact on sales?
2. Which is the most important driver of sales?
3. How do the drivers interact with each other?
4. What would be the annual sales next year?

          Terminologies related to regression analysis

          1. Outliers
Suppose there is an observation in the dataset which has a very high or very low value as compared to the other observations, i.e. it does not seem to belong to the population; such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because it often hampers the results we get.

          2. Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems in ranking variables based on their importance and makes it difficult to select the most important independent variable (factor).

          3. Heteroscedasticity
When the variability of the dependent variable is not equal across values of an independent variable, it is called heteroscedasticity. Example - As one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.
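A quick simulated sketch of this pattern (the numbers are purely illustrative) -
set.seed(1)
income = runif(200, 1000, 10000)
food_spend = 300 + 0.1 * income + rnorm(200, sd = 0.05 * income)  # spread grows with income
plot(income, food_spend, main = "Variability of food spend increases with income")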

          4. Underfitting and Overfitting
When we use unnecessary explanatory variables, it might lead to overfitting. Overfitting means that our algorithm works well on the training set but is unable to perform well on the test set. It is also known as the problem of high variance.

When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. It is also known as the problem of high bias.

In the following diagram we can see that fitting a linear regression (the straight line in fig 1) would underfit the data, i.e. it will lead to large errors even on the training set. The polynomial fit in fig 2 is balanced, i.e. such a fit can work well on both the training and test sets, while the fit in fig 3 will lead to low errors on the training set but will not work well on the test set.
Regression : Underfitting and Overfitting

          Types of Regression

          Every regression technique has some assumptions attached to it which we need to meet before running analysis. These techniques differ in terms of type of dependent and independent variables and distribution.

          1. Linear Regression

It is the simplest form of regression. It is a technique in which the dependent variable is continuous in nature. The relationship between the dependent variable and independent variables is assumed to be linear in nature. We can observe in the plot below a roughly linear relationship between the mileage and displacement of cars. The green points are the actual observations while the black fitted line is the line of regression.

Regression Analysis

          When you have only 1 independent variable and 1 dependent variable, it is called simple linear regression.
          When you have more than 1 independent variable and 1 dependent variable, it is called Multiple linear regression.
          The equation of multiple linear regression is listed below -

y = β0 + β1X1 + β2X2 + ..... + βkXk + ε
Here 'y' is the dependent variable to be estimated, the X's are the independent variables, ε is the error term and the βi's are the regression coefficients.

          Assumptions of linear regression: 
          1. There must be a linear relation between independent and dependent variables. 
          2. There should not be any outliers present. 
          3. No heteroscedasticity 
          4. Sample observations should be independent. 
          5. Error terms should be normally distributed with mean 0 and constant variance. 
          6. Absence of multicollinearity and auto-correlation.

Estimating the parameters
To estimate the regression coefficients βi's we use the principle of least squares, which is to minimize the sum of squares due to the error terms, i.e. minimize
Σ (yi - β0 - β1X1i - ..... - βkXki)²
On solving the above minimization problem mathematically we obtain the regression coefficients as:
β̂ = (X'X)⁻¹X'y

          Interpretation of regression coefficients
          Let us consider an example where the dependent variable is marks obtained by a student and explanatory variables are number of hours studied and no. of classes attended. Suppose on fitting linear regression we got the linear regression as:
          Marks obtained = 5 + 2 (no. of hours studied) + 0.5(no. of classes attended)
Thus we have the regression coefficients 2 and 0.5, which can be interpreted as:
          1. If no. of hours studied and no. of classes are 0 then the student will obtain 5 marks.
          2. Keeping no. of classes attended constant, if student studies for one hour more then he will score 2 more marks in the examination. 
          3. Similarly keeping no. of hours studied constant, if student attends one more class then he will attain 0.5 marks more.

          Linear Regression in R
          We consider the swiss data set for carrying out linear regression in R. We use lm() function in the base package. We try to estimate Fertility with the help of other variables.
          library(datasets)
          model = lm(Fertility ~ .,data = swiss)
          lm_coeff = model$coefficients
          lm_coeff
          summary(model)

          The output we get is:

          > lm_coeff
               (Intercept)      Agriculture      Examination        Education         Catholic 
          66.9151817 -0.1721140 -0.2580082 -0.8709401 0.1041153
          Infant.Mortality
          1.0770481
          > summary(model)

          Call:
          lm(formula = Fertility ~ ., data = swiss)

          Residuals:
          Min 1Q Median 3Q Max
          -15.2743 -5.2617 0.5032 4.1198 15.3213

          Coefficients:
          Estimate Std. Error t value Pr(>|t|)
          (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
          Agriculture -0.17211 0.07030 -2.448 0.01873 *
          Examination -0.25801 0.25388 -1.016 0.31546
          Education -0.87094 0.18303 -4.758 2.43e-05 ***
          Catholic 0.10412 0.03526 2.953 0.00519 **
          Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
          ---
          Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

          Residual standard error: 7.165 on 41 degrees of freedom
          Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
          F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
          Hence we can see that 70% of the variation in Fertility rate can be explained via linear regression.
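Once the model is fitted, we can also score data with predict(); the short sketch below simply compares actual and predicted Fertility for the swiss data used above -
pred = predict(model, newdata = swiss)
head(cbind(Actual = swiss$Fertility, Predicted = pred))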

          2. Polynomial Regression

It is a technique to fit a nonlinear relationship by taking polynomial functions of the independent variable.
In the figure given below, you can see the red curve fits the data better than the green curve. Hence, in situations where the relation between the dependent and independent variable seems to be non-linear, we can deploy Polynomial Regression Models.
Thus a polynomial of degree k in one variable is written as:
y = a + b1x + b2x^2 + ..... + bkx^k
Here we can create new features like
X1 = x, X2 = x^2, ....., Xk = x^k
and can fit linear regression in a similar manner.

In case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e.
X3 = X1 · X2
          Disclaimer: It is to be kept in mind that creating unnecessary extra features or fitting polynomials of higher degree may lead to overfitting.

          Polynomial regression in R:
          We are using poly.csv data for fitting polynomial regression where we try to estimate the Prices of the house given their area.

          Firstly we read the data using read.csv( ) and divide it into the dependent and independent variable
          data = read.csv("poly.csv")
          x = data$Area
          y = data$Price
          In order to compare the results of linear and polynomial regression, firstly we fit linear regression:
          model1 = lm(y ~x)
          model1$fit
          model1$coeff

          The coefficients and predicted values obtained are:
          > model1$fit
          1 2 3 4 5 6 7 8 9 10
          169.0995 178.9081 188.7167 218.1424 223.0467 266.6949 291.7068 296.6111 316.2282 335.8454
          > model1$coeff
          (Intercept) x
          120.05663769 0.09808581
We create a data frame where the new variables are x and x squared.

          new_x = cbind(x,x^2)

          new_x
                   x        
          [1,] 500 250000
          [2,] 600 360000
          [3,] 700 490000
          [4,] 1000 1000000
          [5,] 1050 1102500
          [6,] 1495 2235025
          [7,] 1750 3062500
          [8,] 1800 3240000
          [9,] 2000 4000000
          [10,] 2200 4840000
          Now we fit usual OLS to the new data:
          model2 = lm(y~new_x)
          model2$fit
          model2$coeff

          The fitted values and regression coefficients of polynomial regression are:
          > model2$fit
          1 2 3 4 5 6 7 8 9 10
          122.5388 153.9997 182.6550 251.7872 260.8543 310.6514 314.1467 312.6928 299.8631 275.8110
          > model2$coeff
          (Intercept) new_xx new_x
          -7.684980e+01 4.689175e-01 -1.402805e-04

Using the ggplot2 package, we create a plot to compare the curves fitted by linear and polynomial regression.
          library(ggplot2)
          ggplot(data = data) + geom_point(aes(x = Area,y = Price)) +
          geom_line(aes(x = Area,y = model1$fit),color = "red") +
          geom_line(aes(x = Area,y = model2$fit),color = "blue") +
          theme(panel.background = element_blank())



          3. Logistic Regression

          In logistic regression, the dependent variable is binary in nature (having two categories). Independent variables can be continuous or binary. In multinomial logistic regression, you can have more than two categories in your dependent variable.

Here my model is:
P(Y = 1) = exp(β0 + β1X1 + ..... + βkXk) / (1 + exp(β0 + β1X1 + ..... + βkXk))
which is equivalent to modelling the log-odds: log[P(Y = 1) / (1 - P(Y = 1))] = β0 + β1X1 + ..... + βkXk

          Why don't we use linear regression in this case?
          • The homoscedasticity assumption is violated.
          • Errors are not normally distributed
          • y follows binomial distribution and hence is not normal.

          Examples
• HR Analytics: IT firms recruit a large number of people, but one of the problems they encounter is that many candidates do not join after accepting the job offer. This results in cost over-runs because they have to repeat the entire process again. Now when you get an application, can you actually predict whether that applicant is likely to join the organization (Binary Outcome - Join / Not Join)?

          • Elections: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign and the amount of time spent campaigning negatively.

          Predicting the category of dependent variable for a given vector X of independent variables

Through logistic regression we have
P(Yi = 1) = exp(β0 + β1X1i + ..... + βkXki) / (1 + exp(β0 + β1X1i + ..... + βkXki))

Thus we choose a cut-off probability, say 'p', and if P(Yi = 1) > p then we can say that Yi belongs to class 1, otherwise class 0.

          Interpreting the logistic regression coefficients (Concept of Odds Ratio)
If we take the exponential of a coefficient, we get the odds ratio for that explanatory variable. Suppose the odds ratio is equal to two; then the odds of the event are 2 times greater than the odds of the non-event. Suppose the dependent variable is customer attrition (whether the customer will close the relationship with the company) and the independent variable is citizenship status (National / Expat). If the odds ratio for expats is 3, then the odds of an expat attriting are 3 times greater than the odds of a national attriting.

          Logistic Regression in R:
In this case, we are trying to estimate whether a person will have cancer depending on whether he smokes or not.


          We fit logistic regression with glm( )  function and we set family = "binomial"
          model <- glm(Lung.Cancer..Y.~Smoking..X.,data = data, family = "binomial")
          The predicted probabilities are given by:
          #Predicted Probablities

          model$fitted.values
                  1         2         3         4         5         6         7         8         9 
          0.4545455 0.4545455 0.6428571 0.6428571 0.4545455 0.4545455 0.4545455 0.4545455 0.6428571
          10 11 12 13 14 15 16 17 18
          0.6428571 0.4545455 0.4545455 0.6428571 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571
          19 20 21 22 23 24 25
          0.6428571 0.4545455 0.6428571 0.6428571 0.4545455 0.6428571 0.6428571
          Predicting whether the person will have cancer or not when we choose the cut off probability to be 0.5
          data$prediction <- model$fitted.values>0.5
          > data$prediction
          [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
          [16] FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
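To connect this back to the odds ratio discussion above, exponentiating the fitted coefficients of the model gives the odds ratios -
# Odds ratios of the logistic model fitted above
exp(coef(model))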

          4. Quantile Regression

Quantile regression is an extension of linear regression, and we generally use it when outliers, high skewness or heteroscedasticity exist in the data.

          In linear regression, we predict the mean of the dependent variable for given independent variables. Since mean does not describe the whole distribution, so modeling the mean is not a full description of a relationship between dependent and independent variables. So we can use quantile regression which predicts a quantile (or percentile) for given independent variables.
          The term “quantile” is the same as “percentile”

Basic Idea of Quantile Regression: In quantile regression we try to estimate the quantile of the dependent variable given the values of the X's. Note that the dependent variable should be continuous.

The quantile regression model:
For the qth quantile we have the following regression model:
Q_y(q) = β0(q) + β1(q)X1 + ..... + βk(q)Xk
This seems similar to the linear regression model, but here the objective function we consider to minimize is:
Σ q|yi - xi'β|  (over observations with yi ≥ xi'β)  +  Σ (1 - q)|yi - xi'β|  (over observations with yi < xi'β)
where q is the qth quantile.

If q = 0.5, i.e. if we are interested in the median, then it becomes median regression (or least absolute deviation regression), and substituting q = 0.5 in the above equation we get the objective function as:
0.5 Σ |yi - xi'β|
          Interpreting the coefficients in quantile regression:
Suppose the regression equation for the 25th quantile is:
y = 5.2333 + 700.823 x

It means that for a one unit increase in x, the 25th quantile of y is estimated to increase by 700.823 units.
          Advantages of Quantile over Linear Regression
          • Quite beneficial when heteroscedasticity is present in the data.
          • Robust to outliers
          • Distribution of dependent variable can be described via various quantiles.
          • It is more useful than linear regression when the data is skewed.

          Disclaimer on using quantile regression!
It is to be kept in mind that the coefficients which we get in quantile regression for a particular quantile should differ significantly from those we obtain from linear regression. If that is not the case, then our usage of quantile regression isn't justifiable. This can be checked by comparing the confidence intervals of the regression coefficients obtained from both regressions.

          Quantile Regression in R
          We need to install quantreg package in order to carry out quantile regression.

          install.packages("quantreg")
          library(quantreg)

Using the rq function we estimate the 25th quantile of the Fertility rate in the swiss data. For this we set tau = 0.25.

          model1 = rq(Fertility~.,data = swiss,tau = 0.25)
          summary(model1)
          tau: [1] 0.25

          Coefficients:
          coefficients lower bd upper bd
          (Intercept) 76.63132 2.12518 93.99111
          Agriculture -0.18242 -0.44407 0.10603
          Examination -0.53411 -0.91580 0.63449
          Education -0.82689 -1.25865 -0.50734
          Catholic 0.06116 0.00420 0.22848
          Infant.Mortality 0.69341 -0.10562 2.36095

          Setting tau = 0.5 we run the median regression.
          model2 = rq(Fertility~.,data = swiss,tau = 0.5)
          summary(model2)

          tau: [1] 0.5

          Coefficients:
          coefficients lower bd upper bd
          (Intercept) 63.49087 38.04597 87.66320
          Agriculture -0.20222 -0.32091 -0.05780
          Examination -0.45678 -1.04305 0.34613
          Education -0.79138 -1.25182 -0.06436
          Catholic 0.10385 0.01947 0.15534
          Infant.Mortality 1.45550 0.87146 2.21101

          We can run quantile regression for multiple quantiles in a single plot.
          model3 = rq(Fertility~.,data = swiss, tau = seq(0.05,0.95,by = 0.05))
          quantplot = summary(model3)
          quantplot

          We can check whether our quantile regression results differ from the OLS results using plots.

          plot(quantplot)
          We get the following plot:

Various quantiles are depicted on the X axis. The red central line denotes the estimates of the OLS coefficients and the dotted red lines are the confidence intervals around those OLS coefficients for various quantiles. The black dotted line shows the quantile regression estimates and the gray area is their confidence interval for various quantiles. We can see that for all the variables both regression estimates coincide for most of the quantiles, hence our use of quantile regression is not justifiable for such quantiles. In other words, we want the red and gray regions to overlap as little as possible to justify our use of quantile regression.

          5. Ridge Regression

          It's important to understand the concept of regularization before jumping to ridge regression.

          1. Regularization

Regularization helps to solve the overfitting problem, which means the model performs well on training data but poorly on validation (test) data. Regularization solves this problem by adding a penalty term to the objective function and controlling the model complexity using that penalty term.

          Regularization is generally useful in the following situations:
          1. Large number of variables
2. Low ratio of number of observations to number of variables
          3. High Multi-Collinearity

          2. L1 Loss function or L1 Regularization

In L1 regularization we minimize the objective function by adding a penalty term equal to the sum of the absolute values of the coefficients. This is also known as the least absolute deviations method. Lasso Regression makes use of L1 regularization.

          3. L2 Loss function or L2 Regularization

In L2 regularization we minimize the objective function by adding a penalty term equal to the sum of the squares of the coefficients. Ridge Regression or shrinkage regression makes use of L2 regularization.

          In general, L2 performs better than L1 regularization. L2 is efficient in terms of computation. There is one area where L1 is considered as a preferred option over L2. L1 has in-built feature selection for sparse feature spaces.  For example, you are predicting whether a person is having a brain tumor using more than 20,000 genetic markers (features). It is known that the vast majority of genes have little or no effect on the presence or severity of most diseases.

In the linear regression objective function we try to minimize the sum of squares of errors. In ridge regression (also known as shrinkage regression) we add a constraint on the sum of squares of the regression coefficients. Thus in ridge regression our objective function is:
Minimize  Σ (yi - ŷi)²  +  λ Σ βj²
Here λ is the regularization parameter, which is a non-negative number. Here we do not assume normality in the error terms.

          Very Important Note: 
          We do not regularize the intercept term. The constraint is just on the sum of squares of regression coefficients of X's.
          We can see that ridge regression makes use of L2 regularization.


On solving the above objective function we can get the estimates of β as:
β̂ = (X'X + λI)⁻¹X'y

          How can we choose the regularization parameter λ?

If we choose lambda = 0 then we get back the usual OLS estimates. If lambda is chosen to be very large then it will lead to underfitting. Thus it is highly important to determine a desirable value of lambda. To tackle this issue, we plot the parameter estimates against different values of lambda and select the minimum value of λ after which the parameters tend to stabilize.

          R code for Ridge Regression

          Considering the swiss data set, we create two different datasets, one containing dependent variable and other containing independent variables.
          X = swiss[,-1]
          y = swiss[,1]

          We need to load glmnet library to carry out ridge regression.
          library(glmnet)
Using the cv.glmnet( ) function we can do cross validation. Here alpha = 0 means we are carrying out ridge regression, and lambda is a sequence of values of lambda which will be used for cross validation.
          set.seed(123) #Setting the seed to get similar results.
          model = cv.glmnet(as.matrix(X),y,alpha = 0,lambda = 10^seq(4,-1,-0.1))

          We take the best lambda by using lambda.min and hence get the regression coefficients using predict function.
          best_lambda = model$lambda.min

ridge_coeff = predict(model,s = best_lambda,type = "coefficients")
ridge_coeff
The coefficients obtained using ridge regression are:
          6 x 1 sparse Matrix of class "dgCMatrix"
          1
          (Intercept) 64.92994664
          Agriculture -0.13619967
          Examination -0.31024840
          Education -0.75679979
          Catholic 0.08978917
          Infant.Mortality 1.09527837
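To visualize the stabilization of coefficients against lambda described earlier, we can plot the ridge coefficient paths with glmnet (a sketch reusing the same X and y) -
ridge_path = glmnet(as.matrix(X), y, alpha = 0, lambda = 10^seq(4,-1,-0.1))
plot(ridge_path, xvar = "lambda", label = TRUE)  # coefficients vs log(lambda)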

          6. Lasso Regression
Lasso stands for Least Absolute Shrinkage and Selection Operator. It makes use of the L1 regularization technique in the objective function. Thus the objective function in LASSO regression becomes:
Minimize  Σ (yi - ŷi)²  +  λ Σ |βj|
          λ is the regularization parameter and the intercept term is not regularized. 
          We do not assume that the error terms are normally distributed.
          For the estimates we don't have any specific mathematical formula but we can obtain the estimates using some statistical software.

          Note that lasso regression also needs standardization.

          Advantage of lasso over ridge regression

Lasso regression can perform in-built variable selection as well as parameter shrinkage. While using ridge regression, one may end up getting all the variables but with shrunken parameters.

          R code for Lasso Regression

          Considering the swiss dataset from "datasets" package, we have: 
          #Creating dependent and independent variables.
          X = swiss[,-1]
          y = swiss[,1]
Using cv.glmnet in the glmnet package we do cross validation. For lasso regression we set alpha = 1. By default standardize = TRUE, hence we do not need to standardize the variables separately.
          #Setting the seed for reproducibility
          set.seed(123)
          model = cv.glmnet(as.matrix(X),y,alpha = 1,lambda = 10^seq(4,-1,-0.1))
          #By default standardize = TRUE

We take the best value of lambda by filtering out lambda.min from the model and then get the coefficients using the predict function.
          #Taking the best lambda
          best_lambda = model$lambda.min
lasso_coeff = predict(model,s = best_lambda,type = "coefficients")
lasso_coeff
The lasso coefficients we got are:
          6 x 1 sparse Matrix of class "dgCMatrix"
          1
          (Intercept) 65.46374579
          Agriculture -0.14994107
          Examination -0.24310141
          Education -0.83632674
          Catholic 0.09913931
          Infant.Mortality 1.07238898


          Which one is better - Ridge regression or Lasso regression?

Both ridge regression and lasso regression are designed to deal with multicollinearity.
Ridge regression is computationally more efficient than lasso regression. Either of them can perform better on a given problem, so the best approach is to select the regression model which fits the test set data well.
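A minimal sketch of that approach on the swiss data (the 70/30 split and nfolds = 5 are arbitrary choices) -
set.seed(123)
train_idx = sample(nrow(swiss), round(0.7 * nrow(swiss)))
X_train = as.matrix(swiss[train_idx, -1]);  y_train = swiss[train_idx, 1]
X_test  = as.matrix(swiss[-train_idx, -1]); y_test  = swiss[-train_idx, 1]

ridge_cv = cv.glmnet(X_train, y_train, alpha = 0, nfolds = 5)
lasso_cv = cv.glmnet(X_train, y_train, alpha = 1, nfolds = 5)

rmse = function(actual, pred) sqrt(mean((actual - pred)^2))
rmse(y_test, predict(ridge_cv, newx = X_test, s = "lambda.min"))  # test RMSE for ridge
rmse(y_test, predict(lasso_cv, newx = X_test, s = "lambda.min"))  # test RMSE for lasso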

          7. Elastic Net Regression
          Elastic Net regression is preferred over both ridge and lasso regression when one is dealing with highly correlated independent variables.

          It is a combination of both L1 and L2 regularization.

The objective function in case of Elastic Net Regression is:
Minimize  Σ (yi - ŷi)²  +  λ1 Σ |βj|  +  λ2 Σ βj²
Like ridge and lasso regression, it does not assume normality.

          R code for Elastic Net Regression

Setting a value of alpha between 0 and 1, we can carry out elastic net regression.
          set.seed(123)
          model = cv.glmnet(as.matrix(X),y,alpha = 0.5,lambda = 10^seq(4,-1,-0.1))
          #Taking the best lambda
          best_lambda = model$lambda.min
          en_coeff = predict(model,s = best_lambda,type = "coefficients")
          en_coeff
The coefficients we obtained are:
          6 x 1 sparse Matrix of class "dgCMatrix"
          1
          (Intercept) 65.9826227
          Agriculture -0.1570948
          Examination -0.2581747
          Education -0.8400929
          Catholic 0.0998702
          Infant.Mortality 1.0775714
          8. Principal Components Regression (PCR) 
PCR is a regression technique which is widely used when you have many independent variables or multicollinearity exists in your data. It is divided into 2 steps:
          1. Getting the Principal components
          2. Run regression analysis on principal components
          The most common features of PCR are:
          1. Dimensionality Reduction
          2. Removal of multicollinearity

          Getting the Principal components

          Principal components analysis is a statistical method to extract new features when the original features are highly correlated. We create new features with the help of original features such that the new features are uncorrelated.

Let us consider the first principal component, which is a linear combination of the original features:
U1 = a11X1 + a12X2 + ..... + a1pXp
The weights are chosen such that the first PC has the maximum variance.
Similarly, we can find the second PC, U2, such that it is uncorrelated with U1 and has the second largest variance.
In a similar manner, for 'p' features we can have a maximum of 'p' PCs such that all the PCs are uncorrelated with each other, the first PC has the maximum variance, the second PC has the next largest variance, and so on.

          Drawbacks:

It is to be mentioned that PCR is not a feature selection technique; instead it is a feature extraction technique. Each principal component we obtain is a function of all the features. Hence, on using principal components, one would be unable to explain to what extent each factor is affecting the dependent variable.

          Principal Components Regression in R

We use the longley dataset available in R, which is known for high multicollinearity. We exclude the Year column.
          data1 = longley[,colnames(longley) != "Year"]

View(data1)
This is how some of the observations in our dataset will look:
          We use pls package in order to run PCR.
          install.packages("pls")
          library(pls)

          In PCR we are trying to estimate the number of Employed people; scale  = T denotes that we are standardizing the variables; validation = "CV" denotes applicability of cross-validation.
          pcr_model <- pcr(Employed~., data = data1, scale = TRUE, validation = "CV")
          summary(pcr_model)

          We get the summary as:
          Data:  X dimension: 16 5 
          Y dimension: 16 1
          Fit method: svdpc
          Number of components considered: 5

          VALIDATION: RMSEP
          Cross-validated using 10 random segments.
          (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
          CV 3.627 1.194 1.118 0.5555 0.6514 0.5954
          adjCV 3.627 1.186 1.111 0.5489 0.6381 0.5819

          TRAINING: % variance explained
          1 comps 2 comps 3 comps 4 comps 5 comps
          X 72.19 95.70 99.68 99.98 100.00
          Employed 90.42 91.89 98.32 98.33 98.74

Here RMSEP denotes the root mean squared error of prediction, while 'TRAINING: % variance explained' gives the cumulative % of variance explained by the principal components. We can see that with 3 PCs more than 99% of the variation in the predictors can be explained.
          We can also create a plot depicting the mean squares error for the number of various PCs.
          validationplot(pcr_model,val.type = "MSEP")
By writing val.type = "R2" we can plot the R-square for various numbers of PCs.
          validationplot(pcr_model,val.type = "R2")
If we want to fit PCR with 3 principal components and get the predicted values, we can write:
          pred = predict(pcr_model,data1,ncomp = 3)

          9. Partial Least Squares (PLS) Regression 

It is an alternative to principal component regression when the independent variables are highly correlated. It is also useful when there are a large number of independent variables.

          Difference between PLS and PCR
Both techniques create new independent variables, called components, which are linear combinations of the original predictor variables. However, PCR creates components that explain the observed variability in the predictor variables without considering the response variable at all, whereas PLS takes the dependent variable into account and therefore often fits the dependent variable with fewer components.
          PLS Regression in R
library(plsdepot)
data(vehicles)
# Response is column 13 of 'vehicles'; the remaining columns are predictors
pls.model = plsreg1(vehicles[, c(1:12,14:16)], vehicles[, 13], comps = 3)
          # R-Square
          pls.model$R2
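A couple of other elements of the fitted object are often useful. The sketch below assumes the plsdepot object exposes standardized coefficients as std.coefs and fitted values as y.pred; treat these names as assumptions to verify against your installed version of the package.
# Sketch (assumed plsdepot components): coefficients and observed vs fitted
pls.model$std.coefs
plot(vehicles[, 13], pls.model$y.pred,
     xlab = "Observed", ylab = "Predicted")
abline(a = 0, b = 1, col = "red")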


          10. Support Vector Regression

Support vector regression can handle both linear and non-linear relationships. For non-linear problems, SVM uses kernel functions (such as the polynomial or radial basis kernel) to find the optimal solution.

The main idea of SVR is to minimize error while finding the hyperplane that maximizes the margin of tolerance around the fitted function. The snippet below is an illustrative example using the built-in cars dataset (dist as the response, speed as the predictor); substitute your own variables and data frame.
library(e1071)
svr.model <- svm(dist ~ speed, data = cars)
pred <- predict(svr.model, cars)
plot(cars$speed, cars$dist)
points(cars$speed, pred, col = "red", pch = 4)
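The default kernel and cost parameters are rarely optimal. A minimal tuning sketch with e1071's tune() is shown below; the parameter grid is arbitrary and only for illustration.
# Sketch: grid-search cost and epsilon for the SVR model above
tuned <- tune(svm, dist ~ speed, data = cars,
              ranges = list(cost = 2^(0:4), epsilon = c(0.1, 0.2, 0.5)))
tuned$best.parameters
best.svr <- tuned$best.model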

          11. Ordinal Regression

          Ordinal Regression is used to predict ranked values. In simple words, this type of regression is suitable when dependent variable is ordinal in nature. Example of ordinal variables - Survey responses (1 to 6 scale), patient reaction to drug dose (none, mild, severe).

Why can't we use linear regression when dealing with an ordinal target variable?

Linear regression assumes that changes in the level of the dependent variable mean the same thing throughout its range. For example, the difference in weight between a person who weighs 100 kg and a person who weighs 120 kg is 20 kg, and it has the same meaning as the 20 kg difference between a person who weighs 150 kg and a person who weighs 170 kg. These relationships do not necessarily hold for ordinal variables.
library(ordinal)
# 'wine' ships with the ordinal package; 'rating' is an ordered factor
o.model <- clm(rating ~ ., data = wine)
          summary(o.model)
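Predicted classes can then be obtained with predict(). The sketch below assumes that dropping the response column from newdata makes predict() return the most likely rating for each observation, stored in the fit element of the result.
# Sketch: predicted rating class for each wine
# (response column dropped from newdata so classes are returned)
pred.class <- predict(o.model, newdata = subset(wine, select = -rating),
                      type = "class")
table(observed = wine$rating, predicted = pred.class$fit)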

          12. Poisson Regression

Poisson regression is used when the dependent variable is count data.

          Application of Poisson Regression -
          1. Predicting the number of calls in customer care related to a particular product
          2. Estimating the number of emergency service calls during an event
The dependent variable must meet the following conditions:
1. The dependent variable has a Poisson distribution.
2. Counts cannot be negative.
3. Counts must be whole numbers; the method is not suitable for fractional values.

In the code below, we use the warpbreaks dataset, which records the number of breaks in yarn during weaving. The model includes terms for wool type, tension and the interaction between the two.
          pos.model<-glm(breaks~wool*tension, data = warpbreaks, family=poisson)
          summary(pos.model)
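Before settling on the Poisson model, it is worth checking for overdispersion, which motivates the negative binomial model in the next section. A simple informal check, sketched below, compares the residual deviance to its degrees of freedom; a ratio well above 1 suggests the variance exceeds the mean.
# Informal overdispersion check for the Poisson model
deviance(pos.model) / df.residual(pos.model)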

          13. Negative Binomial Regression

Like Poisson regression, it also deals with count data. How is it different from Poisson regression? Negative binomial regression does not assume that the variance of the counts equals their mean, whereas Poisson regression does.
When the variance of the count data is greater than the mean count, it is a case of overdispersion; when it is smaller, it is a case of under-dispersion.
library(MASS)
# 'quine' (days of school absence) ships with the MASS package
nb.model <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)
          summary(nb.model)

          14. Quasi Poisson Regression

It is an alternative to negative binomial regression and can also be used for overdispersed count data. The two approaches give similar results, but they differ in how the effects of covariates are estimated: the variance of a quasi-Poisson model is a linear function of the mean, while the variance of a negative binomial model is a quadratic function of the mean.
          qs.pos.model <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine,  family = "quasipoisson")
          Quasi-Poisson regression can handle both over-dispersion and under-dispersion.
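To see the practical difference, one can fit the same formula with family = poisson and compare the coefficient standard errors; the quasi-Poisson errors are rescaled by the estimated dispersion. A minimal sketch, fitting the plain Poisson model only for comparison, is shown below.
# Sketch: compare standard errors under Poisson and quasi-Poisson
pos.fit <- glm(Days ~ Sex/(Age + Eth*Lrn), data = quine, family = "poisson")
cbind(poisson      = summary(pos.fit)$coefficients[, "Std. Error"],
      quasipoisson = summary(qs.pos.model)$coefficients[, "Std. Error"])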


          15. Cox Regression

          Cox Regression is suitable for time-to-event data. See the examples below -
1. Time from when a customer opens an account until attrition.
          2. Time after cancer treatment until death.
          3. Time from first heart attack to the second.
          Logistic regression uses a binary dependent variable but ignores the timing of events. 
          As well as estimating the time it takes to reach a certain event, survival analysis can also be used to compare time-to-event for multiple groups.

Two targets are set for the survival model:
1. A continuous variable representing the time to event.
2. A binary variable representing whether the event occurred or not.
          library(survival)
          # Lung Cancer Data
          # status:2=death
          lung$SurvObj <- with(lung, Surv(time, status == 2))
          cox.reg <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data =  lung)
          cox.reg
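To visualise the fitted model, the estimated survival curve can be plotted via survfit(); a minimal sketch is given below (by default the curve is evaluated at the mean values of the covariates).
# Sketch: survival curve implied by the fitted Cox model
plot(survfit(cox.reg), xlab = "Days", ylab = "Survival probability")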

          How to choose the correct regression model?
1. If the dependent variable is continuous and the model suffers from collinearity, or there are a lot of independent variables, you can try PCR, PLS, ridge, lasso and elastic net regression. You can select the final model based on adjusted R-squared, RMSE, AIC and BIC.
2. If you are working with count data, you should try Poisson, quasi-Poisson and negative binomial regression.
3. To avoid overfitting, we can use cross-validation to evaluate models used for prediction. Ridge, lasso and elastic net regression can also be used to correct an overfitting issue.
4. Try support vector regression when the relationship is non-linear.

          Run Python from R

This article explains how to call or run Python from R. Both tools have their own advantages and disadvantages, and it is often a good idea to combine the best packages and functions from both. In the data science world, these tools have a large market share in terms of usage: R is mainly known for data analysis, statistical modeling and visualization, while Python is popular for deep learning and natural language processing.

In a recent KDnuggets analytics software poll, Python and R were ranked the top two tools for data science and machine learning. If you really want to boost your career in data science, these are the languages you need to focus on.
          Combine Python and R

          RStudio developed a package called reticulate which provides a medium to run Python packages and functions from R.

          Install and Load Reticulate Package

Run the commands below to install the package and load it into your session.
          # Install reticulate package
          install.packages("reticulate")

          # Load reticulate package
          library(reticulate)

          Check whether Python is available on your system
          py_available()
          It returns TRUE/FALSE. If it is TRUE, it means python is installed on your system.

          Import a python module within R

          You can use the function import( ) to import a particular package or module.
          os <- import("os")
          os$getcwd()
The above program returns the working directory.
          [1] "C:\\Users\\DELL\\Documents"

You can use the listdir() function from the os module to see all the files in the working directory.
          os$listdir()
           [1] ".conda"".gitignore"".httr-oauth"
          [4] ".matplotlib"".RData"".RDataTmp"
          [7] ".Rhistory""1.pdf""12.pdf"
          [10] "122.pdf""124.pdf""13.pdf"
          [13] "1403.2805.pdf""2.pdf""3.pdf"
          [16] "AIR.xlsx""app.r""Apps"
          [19] "articles.csv""Attrition_Telecom.xlsx""AUC.R"


          Install Python Package

          Step 1 : Create a new environment 
          conda_create("r-reticulate")
          Step 2 : Install a package within a conda environment
          conda_install("r-reticulate", "numpy")
          Since numpy is already installed, you don't need to install it again. The above example is just for demonstration.

          Step 3 : Load the package
          numpy <- import("numpy")

          Working with numpy array

          Let's create a sample numpy array
          y <- array(1:4, c(2, 2))
          x <- numpy$array(y)
               [,1] [,2]
          [1,] 1 3
          [2,] 2 4


          Transpose the above array
          numpy$transpose(x)
              [,1] [,2]
          [1,] 1 2
          [2,] 3 4

Eigenvalues and eigenvectors
          numpy$linalg$eig(x)
          [[1]]
          [1] -0.3722813 5.3722813

          [[2]]
          [,1] [,2]
          [1,] -0.9093767 -0.5657675
          [2,] 0.4159736 -0.8245648

          Mathematical Functions
          numpy$sqrt(x)
          numpy$exp(x)

          Working with Python interactively

          You can create an interactive Python console within R session. Objects you create within Python are available to your R session (and vice-versa).

By using the repl_python() function, you can make it interactive. Download the dataset used in the program below.
          repl_python()

          # Load Pandas package
          import pandas as pd

          # Importing Dataset
          travel = pd.read_excel("AIR.xlsx")

          # Number of rows and columns
          travel.shape

          # Select random no. of rows
          travel.sample(n = 10)

          # Group By
          travel.groupby("Year").AIR.mean()

          # Filter
          t = travel.loc[(travel.Month >= 6) & (travel.Year >= 1955),:]

          # Return to R
          exit
          Note : You need to enter exit to return to the R environment.

          How to access objects created in python from R

You can use the py object to access objects created within Python.
          summary(py$t)
In this case, I am using R's summary() function on the dataframe t which was created in Python. Similarly, you can create a line chart using the ggplot2 package.
          # Line chart using ggplot2
          library(ggplot2)
          ggplot(py$t, aes(AIR, Year)) + geom_line()

          How to access objects created in R from Python

          You can use the r object to accomplish this task. 

1. Let's create an object in R
          mydata = head(cars, n=15)
          2. Use the R created object within Python REPL
          repl_python()
          import pandas as pd
          r.mydata.describe()
          pd.isnull(r.mydata.speed)
          exit

          Building Logistic Regression Model using sklearn package

The sklearn package is one of the most popular packages for machine learning in Python. It supports various statistical and machine learning algorithms.
          repl_python()

          # Load libraries
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

          # load the iris datasets
          iris = datasets.load_iris()

          # Developing logit model
          model = LogisticRegression()
          model.fit(iris.data, iris.target)

          # Scoring
          actual = iris.target
          predicted = model.predict(iris.data)

          # Performance Metrics
          print(metrics.classification_report(actual, predicted))
          print(metrics.confusion_matrix(actual, predicted))
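After typing exit to return to R, the objects created above can be pulled back through the py object described earlier; for example, a confusion matrix can be cross-tabulated on the R side as sketched below.
exit

# Back in R: cross-tabulate actual vs predicted classes created in Python
table(py$actual, py$predicted)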

          Other Useful Functions

          To see configuration of python

Run the py_config() command to find the version of Python installed on your system. It also shows details about Anaconda and NumPy.
          py_config()
          python:         C:\Users\DELL\ANACON~1\python.exe
          libpython: C:/Users/DELL/ANACON~1/python36.dll
          pythonhome: C:\Users\DELL\ANACON~1
          version: 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
          Architecture: 64bit
          numpy: C:\Users\DELL\ANACON~1\lib\site-packages\numpy
          numpy_version: 1.14.2


          To check whether a particular package is installed

In the following program, we check whether the pandas package is installed or not.
          py_module_available("pandas")
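A common pattern is to combine this check with conda_install() so that a missing module is installed on demand; a minimal sketch reusing the "r-reticulate" environment created earlier is shown below.
# Install pandas into the conda environment only if it is not available
if (!py_module_available("pandas")) {
  conda_install("r-reticulate", "pandas")
}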