
SAS SQL : Find records that exist in one table but not the other

It is one of the most common data manipulation problems: finding records that exist in table 1 but not in table 2. This post covers 3 methods with PROC SQL and 1 method with a data step to solve it. In SAS, this problem is often called 'If a and not b': pull the records that exist only in Table A but not in Table B (i.e., exclude the records common to both tables). See the Venn diagram below -
If a and not b

The key area of the Venn diagram is the intersection of table A and table B. It is NOT highlighted in red because we don't want the records that are common to both tables.

Let's create a sample data

If you look at the tables below, we want to fetch all the records from table1 except 'Ram' and 'Priya', as these two names also appear in table2.
Input Data

Create two datasets in SAS

The following programs create two data sets in SAS which are used to demonstrate methods to solve this problem.
data dataset1;
input name $;
cards;
Dave
Ram
Sam
Matt
Priya
;
run;
data dataset2;
input name$;
cards;
Ram
Priya
;
run;
In SQL, there are multiple ways to solve this problem. The methods are listed below -

Method I - NOT IN Operator

The simplest method is to write a subquery with the NOT IN operator, which tells the system not to include records that appear in dataset2.
proc sql;
select * from dataset1
where name not in (select name from dataset2);
quit;
The output is shown in the image below -
Output

Method II - LEFT JOIN with NULL Operator

In this method, we perform a left join and tell SAS to include only the rows from table 1 that do not exist in table 2.
proc sql;
select a.name from dataset1 a
left join dataset2 b
on a.name = b.name
where b.name is null;
quit;
How it works -
In the first step, it reads the common column from both tables - a.name and b.name. In the second step, these columns are matched; b.name is set to NULL (missing) wherever a name exists in table A but not in table B. Finally, the WHERE clause 'b.name is null' tells SAS to keep only those records from table A.

Method III -  Not Exists Correlated SubQuery

The NOT EXISTS subquery writes an observation to the output only when there is no matching row for a.name in dataset2. This check is repeated for each row of the variable name.
proc sql;
select a.name from
dataset1 a
where not exists (select name from dataset2 b
where a.name = b.name);
quit;
How it works -

Step 1 - In the background, it performs a left join of the tables -
proc sql;
create table step1 as
select a.* from dataset1 a
left join dataset2 b
on a.name = b.name;
quit;
Step 2 - Next, it identifies the common records by applying an INNER JOIN
proc sql;
create table step2 as
select a.name from dataset1 a, dataset2 b
where a.name = b.name;
quit;

Step 3 - Finally, it excludes the common records.
proc sql;
select * from step1
where name not in (select distinct name from step2) ;
quit;

Method IV : SAS Data Step MERGE Statement

In a SAS data step, the tables must be sorted by the common variable before merging. Sorting can be done with PROC SORT.
proc sort data = dataset1;
by name;
run;
proc sort data = dataset2;
by name;
run;
Data finaldata;
merge dataset1 (in=a) dataset2(in=b);
by name;
if a and not b;
run;
The MERGE statement joins dataset1 and dataset2 by the variable name. The IN= options create temporary flags marking which input dataset contributed each observation, so the subsetting IF 'a and not b' keeps only observations that come from dataset1 alone.

Q. Which is the most efficient method?

To answer this question, let's create two larger datasets (tables) and compare the 4 methods as explained above.

Table1 - Dataset Name : Temp, Observations - 1 Million, Number of Variables - 1

Table2 - Dataset Name : Temp2, Observations - 10K, Number of Variables - 1
data temp;
length x $15.;
do i = 1 to 1000000;
x = "AA"||strip(i);
output;
end;
drop i;
run;
data temp2;
length x $15.;
do i = 1 to 10000;
x = "AA"||strip(i);
output;
end;
drop i;
run;

Result
The SAS data step MERGE (including the prior sorting) took the least time (1.3 seconds) to complete this operation, followed by the NOT IN subquery at 1.4 seconds and the LEFT JOIN with the WHERE ... IS NULL clause at 1.9 seconds. NOT EXISTS took the most time.

Tip - In many popular forums, it is generally advised to use NOT EXISTS rather than NOT IN. This advice is often taken out of context. Modern software runs every SQL query through an optimizer; some systems may treat the two queries as equivalent in execution, so there would be no noticeable difference in CPU time, while others may favor NOT EXISTS. SAS seems to favor the NOT IN operator, as it does not require the tables to be joined.

SAS : INTCK Function with Examples

INTCK is one of the most important date functions. It is used to calculate the difference between two dates, two times or two datetime values.

The following is a list of common real-world examples where INTCK is used (a quick sketch of the first one follows the list) -
  1. Calculation of individual's age
  2. Tenure of an employee with company
  3. Customer's tenure with the organization
  4. Number of working days
  5. Number of hours spent on a particular course
  6. Number of quarterly payments paid
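For instance, the first use case - calculating someone's age - can be sketched as below. This is a minimal example with a made-up date of birth; the 'C' (continuous) argument, explained later in this tutorial, makes INTCK count complete years elapsed rather than calendar-year boundaries crossed.
data _null_;
dob = '15MAR1990'd; /* hypothetical date of birth */
age = intck('YEAR', dob, today(), 'C'); /* complete years elapsed */
put age=;
run;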

INTCK - Syntax

The syntax of INTCK is defined below -
INTCK(date-or-time-interval, start-date-or-time, end-date-or-time, [method])
1. date-or-time-interval : The date or time period is defined in the first parameter, e.g. MONTH, YEAR, QTR, WEEK, HOUR, MINUTE etc. Specify the period in single quotes.

2. start-date-or-time : Starting date or time to calculate the number of periods.

3. end-date-or-time: End date or time to calculate the number of periods.

4. method : Optional parameter. The method used to calculate the difference - 'CONTINUOUS' or 'DISCRETE'. By default, it is DISCRETE.

Simplest Example of INTCK

Calculate the number of years between two dates. In this case, two dates are 01JAN2015 and 01JAN2017.
data temp;
date1 = '01JAN2015'd;
date2 = '01JAN2017'd;
no_of_years  = intck ('YEAR', date1, date2);
format date1 date2 date9.;
proc print data = temp;
run;
The 'YEAR' keyword tells SAS to calculate the number of year intervals between the dates. Since 01JAN2015 is the starting date, it is specified in the INTCK function before 01JAN2017. The FORMAT statement is used to display the date values in date format when we print our results.

The output is shown below -
SAS : INTCK Function

Other aliases of YEAR are 'YEARS' and 'YR' -
no_of_years  = intck ('YEARS', date1, date2)
no_of_years  = intck ('YR', date1, date2)

SAS INTCK Examples

As with years, we can use other intervals such as semiyear, quarter, month, week and day. Examples of these intervals are displayed below -
data temp;
date1 = '01JAN2015'd;
date2 = '01JAN2017'd;
no_of_years  = intck ('YEAR', date1, date2);
no_of_semiyears  = intck ('SEMIYEAR', date1, date2);
no_of_quarters  = intck ('QUARTER', date1, date2);
no_of_months  = intck ('MONTH', date1, date2);
no_of_weeks  = intck ('WEEK', date1, date2);
no_of_days  = intck ('DAY', date1, date2);
format date1 date2 date9.;
proc print data = temp noobs;
run;
INTCK Examples

Custom Intervals

Suppose you are asked to calculate the number of 4 months interval between two dates -
data temp;
date1 = '01JAN2015'd;
date2 = '01JAN2017'd;
no_of_4months  = intck ('MONTH4', date1, date2);
run;

The MONTH4 interval means an interval of 4 months, so the result counts 4-month boundaries crossed. Don't confuse it with QUARTER: a quarter is an interval of 3 months - remember, there are 4 quarters in a year.

Result : no_of_4months = 6
Similarly, we can use custom multiples of YEAR, QUARTER and other periods. For example, 'YEAR2' tells SAS the interval is 2 years; it would return 1 for the dates above.

Set Starting Point for Calculation
data temp;
date1 = '31JAN2015'd;
date2 = '31DEC2016'd;
diff  = intck ('YEAR', date1, date2);
diff2  = intck ('YEAR.3', date1, date2);
format date1 date2 date9.;
proc print;
run;
Function                                        Result
intck ('YEAR', '31JAN2015'd, '31DEC2016'd)      1
intck ('YEAR.3', '31JAN2015'd, '31DEC2016'd)    2

How it works :
  1. intck ('YEAR', date1, date2) - It counts how many times the first of January occurs between the two dates, since 1st January is the default starting point of a YEAR interval. The variable diff returns 1, as the range includes only 01JAN2016.
  2. intck ('YEAR.3', date1, date2) - It counts how many times the first of March occurs, as YEAR.3 refers to a year running from 1st March to the end of February. The variable diff2 returns 2, as the range includes 01MAR2015 and 01MAR2016.

Is it a month difference?

INTCK says there is a one-month difference between 25OCT2016 and 03NOV2016, but no month difference between 01OCT2016 and 31OCT2016. How?
data temp;
month1= intck('month', '25OCT2016'd, '03NOV2016'd);
month2= intck('month', '01OCT2016'd, '31OCT2016'd);
proc print;
run;

Function                                        Result
intck ('month', '25OCT2016'd, '03NOV2016'd)     1
intck ('month', '01OCT2016'd, '31OCT2016'd)     0
INTCK (with the default DISCRETE method) checks whether the first day of a month lies within the range. In the first case, 01NOV2016 falls between October 25 and November 03, so it returns 1. In the second case, it returns 0, as 01NOV2016 does not fall between 01OCT2016 and 31OCT2016.

How to correct it?

Add one more parameter at the end of the INTCK function: specify 'C', which refers to the continuous method of calculation.
data temp;
month1= intck('month', '25OCT2016'd, '03NOV2016'd, 'C');
proc print;
run;
The above function returns 0.

The CONTINUOUS method measures elapsed time from the start date given in the second parameter of the INTCK function, so a full month must pass before a month interval is counted.

Calculating Weekdays

Suppose you are asked to calculate the number of weekdays -
data eg;
weekdays = intck('WEEKDAY', '11DEC2016'd ,'18DEC2016'd);
proc print;
run;
It returns 5. In this case, Saturday and Sunday are considered the weekend and are excluded from the calculation.

Define 6 days working

If you need to calculate the number of working days between two dates with a 6-day working week -
data eg;
weekdays = intck('WEEKDAY1W', '11DEC2016'd ,'18DEC2016'd);
proc print;
run;
WEEKDAY1W treats Sunday as the only weekend day (1=Sunday, 2=Monday ... 7=Saturday).

Set Custom Weekends
data eg;
weekdays = intck('WEEKDAY24W', '11DEC2016'd ,'16DEC2016'd);
proc print;
run;
WEEKDAY24W means Monday (2) and Wednesday (4) are treated as weekend days. The above function returns 3.

Calculate between Datetime values

Suppose you need to calculate hours, minutes and seconds between two datetime values.
data temp2;
hours=intck('hour','01jan2016:10:50:00'dt,'01jan2016:11:55:00'dt);
minutes=intck('minute','01jan2016:10:50:00'dt,'01jan2016:11:55:00'dt);
seconds=intck('second','01jan2016:10:50:00'dt,'01jan2016:11:55:00'dt);
proc print noobs;
run;
Time Calculation
Result - 1 hour, 65 minutes and 3900 seconds

Time Difference
data temp3;
hours=intck('hour','12:00:00't, '23:05:00't);
minutes=intck('minute','12:00:00't,'23:05:00't);
seconds=intck('second','12:00:00't,'23:05:00't);
proc print noobs;
run;
Result : 11 hours 665 minutes 39900 seconds

Wish Christmas with SAS

This post is dedicated to lovers of the SAS programming language. Flaunt your knowledge in your peer group with the following SAS program 😊 It's one of the stupidest and coolest ways to make simple things complex. The idea is to show off your creativity - solving an easy problem in a complicated way teaches some cool tricks which can be used later on real, complex problems.
Be Creative with SAS
Run the following program and see what I mean -
DATA A(KEEP=WISH);
   LENGTH WISH $15 F1 $5 F2 $9;
   DO J= 1 TO 5;
   IF J = 1 THEN K = TAN(4.702389315) / ATAN(3.584020431);
   ELSE IF J = 2 THEN K = SUM(INPUT(PUT('99',$HEX2.),8.),             INPUT(PUT('0',$HEX2.),8.));
   ELSE IF J <= 4 THEN K = SUBSTR(PUT('12JAN2004'D , 5.),4, 2);
   ELSE IF J = 5 THEN K = ROUND(CONSTANT("PI")*30, 10) - 1;
   SUBSTR(F1,J)=BYTE(K);
   END;

   DO I= 1 TO 9;
   IF I = 1 THEN K = 100 / ARCOS(0.078190328);
   ELSE IF I = 2 THEN K = MOD(CONSTANT("PI") / 3, 2*CONSTANT("PI")) * 360/(2*CONSTANT("PI")) + 3*4;
   ELSE IF I = 3 THEN K = SUBSTR(PUT('12JAN2004'D , 5.),4, 2);
   ELSE IF I = 4 THEN K = ROUND(2**6.189823562,1);
   ELSE IF I = 5 THEN K = EXP(4.418841708);
   ELSE IF I = 6 THEN K = MEDIAN(82,86);
   ELSE IF I = 7 THEN K = TAN(4.702389315) / ATAN(3.584020431);
   ELSE IF I = 8 THEN K = SUM(INPUT(PUT('69',$HEX2.),8.)-1, INPUT(PUT('0',$HEX2.),8.));
   ELSE IF I = 9 THEN K = EXP(4.418841708);
      SUBSTR(F2,I)=BYTE(K);
      END;
WISH = CATX(' ',F1,F2);
OUTPUT;
PROC PRINT NOOBS;
RUN;

Output

The output of the program is not shown here, to keep the surprise. Please post your output in the comment box below!

How it works

1. You can create a character value with the BYTE() function, which returns the character for a specified ASCII code. Run BYTE(65) and see what it evaluates to. Then run the program below and check the log.
data _null_;
x = byte(65);
put x=;
run;
Result : A. The above program writes the letter A to the log.

Similarly, you can create A to Z alphabets with the following program.
data _null_;
do i = 65 to 90;
x = byte(i);
put x;
end;
run;
See log after submitting the above code.

2. What the trigonometric functions do in the code -

Step 1 : TAN(4.702389315) returns 100
Step 2 : ATAN(3.584020431) returns 1.2987
Step 3 : 100 / 1.2987 is approximately 77
Step 4 : BYTE(77) returns the M of 'MERRY'

3. How to find the exact angle before using it in trigonometric functions

Use the Goal Seek feature of MS Excel. See the GOAL SEEK Tutorial

SAS : Count Missing and Non-missing Values Across Columns

This post explains how to count the number of missing (blank) and non-missing (non-blank) values across multiple columns. It's very easy to calculate with SAS. This is a common data manipulation task that SAS programmers deal with day to day; note that here we count across a row (over several variables), not down a column. There is a rule in SAS worth remembering - we use PROCs when we need to aggregate a COLUMN, whereas we use FUNCTIONs to aggregate a ROW.

Let's create a sample data for demonstration -

The program below creates a sample dataset named TEMP, stored in the WORK library.
data temp;
input x y z a b$;
cards;
1 23 24 50 AA
1 . 24 50 AC
1 13 . 50 AB
1 23 . 50 .
;
run;
The sample data looks like below -

Sample Data

Count Missing and Nonmissing NUMERIC Values

The SAS function N counts the non-missing numeric values across multiple columns. To count the missing numeric values, use the NMISS function.
data outdata;
set temp;
nvalues = N(of x--a);
nmiss = nmiss(of x--a);
proc print;
run;
Output
Note - N(of x--a) is equivalent to N(x, y, z, a). The character variable b is not used in the above program, as N and NMISS accept only numeric arguments.

Count total missing and nonmissing values

Suppose you need to count missing and non-missing values over both character and numeric variables.

Since SAS has no built-in function that returns the number of variables in a dataset, we use PROC CONTENTS to count them and then store the count in a macro variable named totvar.

CMISS Function

The CMISS function counts the number of missing values across columns, covering both numeric and character variables.
proc contents data=temp out=cols noprint;
run;

data _null_;
set cols nobs=total;
call symputx('totvar', total);
run;

data outdata;
set temp;
totalvar=&totvar;
totmiss=cmiss(of x--b);
totnonmiss=totalvar- cmiss(of x--b);
proc print ;
run;
SAS : Output

Free SAS Access - No Install Required

Accessing SAS was never so easy. With the SAS OnDemand for Academics software, SAS is just a click away! Yes, you heard it right: we can access SAS software for free simply by logging in to a SAS website. No installation is required and it's available to everyone - instructors, students and individual learners. In short, it's available for any non-commercial use. It's an important step taken by the SAS marketing team to counter competitors such as R and Python. There was a time when getting access to SAS software and learning SAS programming was a pain. With this software and the free SAS University Edition (UE), learning SAS programming is no longer a distressing task.

Steps to access SAS OnDemand for Academics

1. Register yourself and create your account by visiting the registration page.

2. Submit the required details (first name, last name and Email ID) in the registration page.

3. You will get an email from SAS team with the link to activate your profile.

4. You need to enter your email address and password information and accept the license agreement and then click Create Account.

5. After completing step 4, you will get an email with the subject 'You are ready to start using SAS OnDemand for Academics' and your user ID. Click on the link specified in the email.

6. Enter your user id and password to log in to the software.

7. Click on SAS Studio link on the dashboard page. See the screenshot below -
Free SAS Software

The detailed user guide is available at this link - User Guide for SAS Academics

Important Points : SAS OnDemand for Academics
  1. No installation required
  2. Internet required to access the software
  3. Free 5GB of personal data storage. Additional 2 GB data storage for Instructors.
  4. If you do not use your account for a period of one year, then your account might expire.

List of SAS software available for FREE

You can access not only Base SAS and the SAS STAT module but also SAS GRAPH and the SAS modules for econometrics (ETS) and operations research (OR). And if you are searching for free SAS 9.4 or free SAS Enterprise Miner, your search is over - you get the latest version of SAS and its popular packages for free.
  1. SAS Studio (Includes Base SAS, SAS STAT, SAS GRAPH, ETS, IML, OR)
  2. SAS Enterprise Guide
  3. SAS Enterprise Miner
  4. SAS Credit Scoring
  5. SAS Text Miner
  6. SAS Forecast Studio
  7. SAS High Performance Forecasting
If you are a statistician or data scientist, you can practice data mining and text mining techniques with SAS Enterprise Miner and Text Miner absolutely free.

System Requirements

Since it runs in the cloud, all you need is a web browser. The following web browsers are supported :
  1. Microsoft Internet Explorer 9+
  2. Mozilla Firefox 14+
  3. Google Chrome 21+
  4. Apple Safari 5+
  5. Mobile - Apple Safari 6+
Support
If you face any issue related to the software, you can write an email to SASAnalyticsU@sas.com or post a message in the SAS Community

Basics of SAS Academic Software

1. Upload External Data

First select the Files (Home) area under 'Server Files and Folders', then click the fourth icon (Upload). Click on the Choose Files box and hit the Upload button.

Upload data to SAS Studio Server

2. How to access uploaded data
FILENAME REFFILE '/home/youruserid/sasuser.v94/Files/df.csv';

PROC IMPORT DATAFILE=REFFILE
DBMS=CSV
OUT=WORK.MYFILE;
GETNAMES=YES;
RUN;
In the code above, youruserid refers to your user ID and df.csv to the name of your uploaded file.
Tip - Alternatively, you can access uploaded data by clicking on the 'Import Data' option under the first icon of the 'Server Files and Folders' area (see the screenshot in the next step).

3. You can create a new program or folder by clicking on the first icon. Follow the screenshot below -

SAS Studio - Files / Programs

If you are already using SAS University Edition, which is also available for free, you might not find this academic version as useful. However, the academic software has certain features that are not available in the University Edition, such as the operations research module, SAS Enterprise Miner and Text Miner. Check out this link - SAS University Edition vs. SAS OnDemand For Academics

Companies using R

R is a free programming language for data analysis, statistical modeling and visualization. It is one of the most popular tools in the predictive modeling world, and its popularity is growing day by day. In the 2016 data science salary survey conducted by O'Reilly, R ranked second among programming languages for data science (SQL ranked first). In the popular KDnuggets Analytics software poll, R took the top rank with 49% of votes. These survey polls answer the question about the scope of R: if you really want to boost your career in analytics, R is a language you need to focus on.

Companies using R
If you are skeptical about the future scope of R, this tutorial should help clear your doubts. It covers some great facts about R: why it is considered a top programming language for predictive modeling, and which companies primarily use it.

What's great about R?

1. It supports close to 10,000 free packages, which add muscle to R for data science and analysis (CRAN figure as of December 27, 2016). In the graph below, we can see that the growth rate of packages is very high, implying more and more people are taking an interest in R. I used the CRANpackages dataset from the Ecdat library to produce this graph.

R : Packages Download

2. R can perform various data analysis and data science tasks for free. To achieve similar functionality in SAS, you need to purchase a bundle of SAS products and modules. See the list below -
  • Interactive Visualization with Shiny package (Equivalent SAS Product : Visual Analytics)
  • Ensemble Learning / Machine Learning (SAS Product : SAS Enterprise Miner)
  • Text / Social Media Mining (SAS Product : SAS Text Miner)
  • Optimization and Forecasting (SAS Product : SAS ETS, PROC OPTMODEL)
  • RStudio IDE (SAS Product : SAS Enterprise Guide)

3. R is one of the highest paid IT skills and holds a large share of the advanced analytics software market (Source : LinkedIn Skills, O'Reilly Survey).

4. It integrates easily with popular software such as Tableau, SQL Server etc. Last year Microsoft acquired Revolution Analytics; post acquisition, Microsoft merged the enterprise version of R with SQL Server, Visual Studio and Power BI.

5. New statistical and machine learning algorithms are implemented in R much more quickly than in any other statistical tool, as R is the first choice of researchers.

Companies using R

The following is a list of companies using R. It's not a complete list but a sample of the actual population.
The data used here were collected from the Microsoft R blog, RStudio, Quora, the Indeed job portal, and other job consulting firms and career community websites.


Top Tier Companies using R

The following is a list of top brands or large organizations using R.
  1. Facebook - For behavior analysis related to status updates and profile pictures.
  2. Google - For advertising effectiveness and economic forecasting.
  3. Twitter - For data visualization and semantic clustering
  4. Microsoft - Acquired Revolution R company and use it for a variety of purposes.
  5. Uber - For statistical analysis
  6. Airbnb - Scale data science.
  7. IBM - Joined R Consortium Group
  8. ANZ - For credit risk modeling
  9. HP
  10. Ford
  11. Novartis
  12. Roche
  13. New York Times - For data visualization
  14. Mckinsey
  15. BCG
  16. Bain



IT Companies using R

It includes major companies providing IT and professional services using R in India and other parts of the world.
  1. Accenture
  2. Amadeus IT Group
  3. Capgemini
  4. Cognizant
  5. CSC
  6. HCL Technologies
  7. Hexaware Technologies
  8. HP
  9. IBM
  10. IGATE
  11. Infosys
  12. Larsen & Toubro Infotech
  13. Microsoft
  14. Mindtree
  15. Mphasis
  16. NIIT Tech
  17. Oracle Financial Services Software
  18. Paytm
  19. Snapdeal
  20. R Systems Ltd
  21. Tata Consultancy Services
  22. Tech Mahindra
  23. Wipro



Analytics and Consulting Companies using R

The list below comprises niche analytics companies as well as consulting companies providing analytics or market research services.

  1. A.T. Kearney
  2. AbsolutData
  3. AC Nielsen
  4. Accenture
  5. Bain & Company
  6. Booz Allen Hamilton
  7. Capgemini
  8. Convergytics
  9. Deloitte Consulting
  10. Evalueserve
  11. EXL
  12. EY
  13. Fractal Analytics
  14. Gartner
  15. Genpact
  16. IBM
  17. KPMG
  18. Latent View
  19. Manthan Systems
  20. McKinsey & Company
  21. Mu Sigma
  22. PricewaterhouseCoopers
  23. SIBIA Analytics
  24. Simplify360
  25. SmartCube
  26. Target
  27. The Boston Consulting Group
  28. Tiger Analytics
  29. Towers Watson
  30. WNS
  31. ZS Associates



Financial Institutions

It includes major US and European Banks, Insurance Companies and Other financial institutions using R.

  1. American Express
  2. ANZ
  3. Bank of America
  4. Barclays Bank
  5. Bajaj Allianz Insurance
  6. Bharti AXA Insurance
  7. Blackrock
  8. Citibank
  9. Dun & Bradstreet
  10. Fidelity
  11. HSBC
  12. JP Morgan
  13. KeyBank
  14. Lloyds Banking
  15. RBS
  16. Standard Chartered
  17. UBS
  18. Wells Fargo
  19. Goldman Sachs
  20. Morgan Stanley
  21. PNC Bank
  22. Citizens Bank
  23. Fifth Third Bank




End Notes
I tried to cover as many names as possible, but I'm sure many are missing from the list. Would you mind sharing the remaining names? :)

SAS Date Formats and Informats

This tutorial describes the usage of SAS Date formats and informats. It includes practical real-world data problems related to SAS formats.

What are Formats and Informats?

An informat tells SAS how to read a variable, whereas a format tells SAS how to display or write the values of a variable.

Informats are mainly used when you read or import data from an external file (Text/Excel/CSV) or read sample data created with a CARDS/DATALINES statement. They are also used when you create a new variable in a dataset.

Formats can be used in both DATA steps and PROC steps, whereas an informat can be used only in DATA steps. Let's understand this with examples -
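As a quick illustration (the dataset and variable names here are made up), the same format can be attached permanently in a DATA step or applied temporarily in a PROC step:
data have;
dob = '30DEC2016'd; /* stored internally as a number */
format dob date9.; /* attached in the DATA step - stays with the variable */
run;
proc print data = have;
format dob yymmdd10.; /* applied for this PROC step only */
run;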

Example 1 - Read Dates in SAS

In the program below, we use the INFORMATS ddmmyy8. and ddmmyy10. to read dates in SAS. It creates a dataset called sampledata, stored in the WORK library.

DATA sampledata;
     INPUT @6 date1 ddmmyy8. @15 date2 ddmmyy10.;
    CARDS;
     30-12-16 30-12-2016
;
RUN;
The informat ddmmyy8. is used to read the date 30-12-16 and ddmmyy10. to read 30-12-2016. Here, 8 and 10 refer to the width of the date value.

The created dataset looks like below -
Read SAS Date
It returns 20818, which is the SAS date value. The value itself is not meaningful - you cannot tell which date it is just by looking at it. To display it as a real date, use the FORMAT statement.
DATA sampledata;
     INPUT @6 date1 ddmmyy8. @15 date2 ddmmyy10.;
     FORMAT date1 ddmmyy8. date2 ddmmyy10.;
cards;
     30-12-16 30-12-2016
;
RUN;
Output : SAS Format

Example 2 - Display Today's Date

The today() function can be used to generate current date.
data _null_;
    dt=today();
    format dt yymmdd10.;
    put dt ;
run;
Result : It returns 2016-12-30, as 30DEC2016 is today's date. It's in YYYY-MM-DD form because we used the yymmdd10. format; the 10 refers to the width of the value, as 2016-12-30 contains 10 characters. The PUT statement writes the value to the log window.

To display date in WORD format

1. Short Word Date Format

The format date9. returns 30DEC2016.
format dt date9.;

2. Complete Word Date Format

The format WORDDATE. returns DECEMBER 30, 2016. There is no need to specify a width with this format; it adjusts automatically depending on the month.
format dt WORDDATE.;

3. Including WEEK

The format WEEKDATE. gives Friday, December 30, 2016
format dt WEEKDATE.;

Display DAY / MONTH / YEAR

In this section, we will see how to write only the day, month, year and weekday.
data _null_;
dt=today();
put "Day :" dt  DAY.;
put "Month :" dt MONTH.;
put "YEAR:" dt YEAR.;
put "WEEKDAY:" dt DOWNAME.;
run;
SAS : Date Formats
We can also use a format in the PUT statement without specifying a FORMAT statement explicitly. The DAY. format returned 30, the MONTH. format returned 12 and the YEAR. format returned 2016. In addition, we used the DOWNAME. format to extract the weekday (Friday).

Other Popular Formats

Some of the commonly used date formats are listed below -
Format          Result
DDMMYYP10.      30.12.2016
DDMMYYS10.      30/12/2016
MMDDYYP10.      12.30.2016
MMDDYYS10.      12/30/2016
WORDDATX19.     30 DECEMBER 2016
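A quick way to try these formats is to PUT the same date through each of them and check the log - a minimal sketch:
data _null_;
dt = '30DEC2016'd;
put dt ddmmyyp10.; /* 30.12.2016 */
put dt ddmmyys10.; /* 30/12/2016 */
put dt mmddyyp10.; /* 12.30.2016 */
put dt mmddyys10.; /* 12/30/2016 */
put dt worddatx19.; /* 30 DECEMBER 2016 */
run;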

Endnotes

Hopefully you now have a better understanding of the difference between SAS date formats and informats and how they are used.

Data Analyst Interview Questions

This tutorial explains common and tricky data analyst interview questions with answers. The main responsibility of a data analyst is to generate insights from data and present them to stakeholders such as external or internal clients. During this process, the analyst extracts data from a database and cleans it to prepare it for analysis. The analysis step involves exploring the data with descriptive statistics and then building predictive models. A data analyst must know basic and intermediate statistics and how to apply them with SAS / SPSS. Excel and SQL are the two most popular tools in data analytics, so candidates must possess good knowledge of both. Excel is used for a variety of purposes, such as generating quick summaries and presenting them in an interactive dashboard; most offline reporting deliverables are in Excel or PowerPoint formats.

This tutorial covers interview questions on the following topics:
  1. MS Excel
  2. Basic and Intermediate Statistics
  3. SAS
  4. SQL
  5. HR / Project related questions

Excel Questions

The following is a list of some tricky or advanced excel interview questions.

1. What is the default value of last parameter of VLOOKUP?

TRUE/1. It finds the closest (approximate) match, assuming the table is sorted in ascending order, whereas FALSE/0 refers to an exact match.

2. What is the main limitation of VLOOKUP function?

The lookup value must be in the leftmost column of the table array: VLOOKUP only looks to the right of it and cannot look from right to left.

3. Does VLOOKUP look up case-sensitive values?

No, it is not case-sensitive. The texts 'ram' and 'RAM' are identical to VLOOKUP.

4. Name two ways to extract unique values in Excel

Use the Advanced Filter option (shortcut: ALT D F A) or the 'Remove Duplicates' option under the Data tab.

5. How to find duplicates in a column?

Use CONDITIONAL FORMATTING to highlight duplicate values, or use the COUNTIF function as shown below (for values stored in cells D4:D7).
=COUNTIF(D4:D7,D4)
Then apply a filter on the column containing the COUNTIF formula and select values greater than 1.

6. How to insert a drop down?

Go to the Data tab >> select Data Validation. Another way to insert a drop down is to enable the Developer tab and insert a combo box.

7. How to sum values based on some conditions?

Use SUMIF or SUMPRODUCT functions. The SUMIF function is explained below -
=SUMIF(range, criteria, sum_range)
=SUMIF(A2:A5,"A",B2:B5)
Excel : SUMIF Function

8. How to create cross tabulation in Excel?
Use Pivot Table and select one variable in Row label and the other variable in Column label.

9. What is an Excel Array Formula?

An array formula performs multiple calculations on one or more sets of values and can return either a single result or multiple results. In classic Excel it is confirmed with Ctrl+Shift+Enter instead of just Enter.

10. How to extract First Name from a full name?

Suppose you need to pull 'Neha' from 'Neha Sharma'. Use MID and FIND functions.
=MID(A2,1,FIND(" ",A2)-1)
Tutorial : Practical Uses of MID Function

11. How Index and Match Function works?

Index function returns a value from a range based on row number.
= INDEX(range, row_number)
See the image below -

Excel : Index Function
In this case, we are telling EXCEL to return second value of the range A2:A4. It returns 30.

Match function returns the relative position of a value in range.
= MATCH(lookup_value, range, match_type)
match_type controls the match: 0 for an exact match, 1 for the largest value less than or equal to lookup_value, -1 for the smallest value greater than or equal to it.

Excel : Match Function
In this case, we are asking EXCEL to find the relative position of 30 in the range A2:A4. It returns 2.

12. How Index and Match Function works together?
=INDEX(range, MATCH(lookup_value, lookup_range, match_type))
Suppose Product and Sales information is stored in columns A and B, and you need to look up the product against a sales value. You need to tell Excel to look from right to left, as the sales value sits to the right of the product in the range/table.
=INDEX(A2:A5,MATCH(45,B2:B5,0))
Nested INDEX MATCH Excel Functions


Basic and Intermediate Statistics

The following questions touch upon some basics and intermediate statistics topics. These topics are generally taught in undergraduate / graduate courses.

1. What is p-value?

It is the lowest level of significance at which you can reject the null hypothesis. If p-value < 0.05, you reject the null hypothesis at 5% level of significance.

2. Difference between MEAN, MEDIAN and MODE

The mean is the sum of all values divided by the number of observations. The median is the middle value. The mode is the most frequently occurring value.

3. For which data types are MEAN, MEDIAN and MODE most suitable?

MEAN is suitable for continuous data with no outliers, as it is affected by extreme values (outliers).
MEDIAN is suitable for continuous data with outliers, or for ordinal data. MODE is suitable for categorical data (both nominal and ordinal).

4. Different Types of Sampling Techniques?

The four main types of sampling techniques are listed below.
  1. Simple random sampling
  2. Stratified sampling
  3. Cluster sampling
  4. Systematic sampling

5. Difference between Cluster and Stratified Sampling?

The main difference is that in stratified sampling every stratum is sampled, whereas in cluster sampling one first selects a number of clusters at random and then samples (or takes a census of) each selected cluster - usually not all clusters are included.

6. When should we use T-test than Z-test?

Theoretically, we should use the t-test when the sample size (N) is less than 30. In practice, we almost always use the t-test, because the t-test and z-test become equivalent as N tends to infinity.

7. What is the difference between R-square and Adjusted R-square?
Check out this link - R-square vs. Adjusted R-square

8. How to detect outliers?

Box Plot Method - a value more than 1.5*IQR above the upper quartile (Q3), or more than 1.5*IQR below the lower quartile (Q1), is considered an outlier.

Standard Deviation Method - a value more than three standard deviations above or below the mean is considered an outlier.
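The box plot rule is easy to automate in SAS. Below is a minimal sketch, assuming a dataset named HAVE with a numeric variable X (both names are made up for this example):
proc means data = have noprint;
var x;
output out = stats q1 = q1 q3 = q3; /* lower and upper quartiles */
run;
data flagged;
if _n_ = 1 then set stats; /* attach Q1/Q3 to every row */
set have;
iqr = q3 - q1;
outlier = (x > q3 + 1.5*iqr) or (x < q1 - 1.5*iqr);
run;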


9. Define Homoscedasticity?

In a linear regression model, there should be homogeneity of variance of the residuals. In other words, the variance of the residuals is approximately equal for all predicted values of the dependent variable.

10. Difference between Standardized and Unstandardized Coefficients?

To calculate standardized coefficients, we first standardize both the dependent and independent variables and then use the standardized variables in the regression model. By 'standardize', we mean subtracting the mean from each observation and dividing by the standard deviation. A standardized coefficient is interpreted in terms of standard deviations, whereas an unstandardized coefficient is measured in actual units.

11. Difference between Factor Analysis and Principal Component Analysis?

The two analyses are very similar, but they differ in calculation and practical usage :
  1. In Principal Components Analysis, the components are calculated as linear combinations of the raw input variables. In Factor Analysis, the raw input variables are defined as linear combinations of the factors.
  2. The main idea of using PCA is to explain as much of the total variance in the variables as possible. Whereas, the factor analysis explains the covariances or correlations between the variables.
  3. PCA is used when we need to reduce the number of variables (dimensionality reduction) whereas FA is used when we need to group variables into some factors.

12. Difference between Linear and Logistic Regression?

There are more than 10 differences between these two algorithms. Check out the link below -

13. How to statistically compare means between groups?

Use an independent t-test when you have a continuous variable and a categorical variable with two independent categories.

Use a paired t-test when you have a continuous variable and a categorical variable with two dependent (paired) categories.

Use one-way ANOVA when you have a continuous variable and a categorical variable with more than two independent categories.

Use GLM Repeated Measures when you have a continuous variable and a categorical variable with more than two dependent categories.

14. Explain eigenvalues and eigenvectors intuitively

Eigenvalues are the variances explained by the principal components. By 'variances', I mean the diagonal values of the covariance matrix below -

        x       y       z
x     1.34   -0.16    0.19
y    -0.16    0.62   -0.13
z     0.19   -0.13    1.49

The sum of the diagonal values is 3.45.

Why is an eigenvalue greater than 1 the usual cutoff for retaining components? Because the average eigenvalue (for standardized variables) is 1, so an eigenvalue > 1 explains more variance than an average variable.

Eigenvectors are the coefficients of the orthogonal (uncorrelated) transformation of variables into principal components.


SAS

The following questions would help you to prepare for SAS interview round.

1. Difference between WHERE and IF statement?
  • A WHERE statement can be used in procedures to subset data, while an IF statement cannot be used in procedures.
  • WHERE can be used as a data set option while IF cannot.
  • A WHERE statement is more efficient than IF: it tells SAS not to read unneeded observations from the data set.
  • A WHERE statement cannot be used when reading data with an INPUT statement, whereas an IF statement can.
  • When you need to filter on newly created variables, use IF, as it doesn't require the variables to already exist in the input data set (see the sketch below).
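A minimal sketch of the two subsetting styles, using the built-in SASHELP.CLASS table (output dataset names are made up):
* WHERE as a data set option - rows are filtered while reading;
data boys;
set sashelp.class(where = (sex = 'M'));
run;
* IF on a newly created variable - WHERE cannot reference RATIO here;
data heavy;
set sashelp.class;
ratio = weight / height;
if ratio > 1.5;
run;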

2. How PROC MEANS works?
PROC MEANS DATA = dataset_name;
VAR analysis_variable;
CLASS grouping_variable;
RUN;
Detailed Explanation : PROC MEANS


3. Difference between INFORMAT and FORMAT?

Informat is used to read data whereas Format is used to write or display data.

4. Difference between NODUPKEY and NODUP in PROC SORT?

The NODUPKEY option removes duplicate observations where the value of a variable listed in the BY statement is repeated, while the NODUP option removes duplicate observations where the values of all the variables are repeated (see the sketch below).
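A minimal sketch of both options on SASHELP.CLASS (output dataset names are made up):
* keeps only the first observation for each value of SEX;
proc sort data = sashelp.class out = one_per_sex nodupkey;
by sex;
run;
* drops an observation only when ALL variable values repeat;
* sorting BY _ALL_ makes full duplicates adjacent so NODUP can catch them;
proc sort data = sashelp.class out = no_full_dups nodup;
by _all_;
run;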

5. What is the maximum length of a SAS library name?

A valid library name must start with a letter or underscore and cannot have more than 8 characters.

6. Which is faster - PROC SQL or the SAS data step?

The SQL procedure performs better with smaller datasets, whereas the data step performs better with larger datasets (more than approx. 100 MB).

7. Two main advantages of Proc SQL Joins over Data Step Merging?
  1. PROC SQL joins do not require the variables to be sorted prior to joining, whereas data step merging does.
  2. PROC SQL works perfectly even when the key variables have different names.

8. What would happen if I don't use a BY statement in MERGE?

Without a BY statement, SAS will merge the 1st observation from dataset A with the 1st observation from dataset B to form the 1st observation of the final dataset, and so on. This might lead to meaningless results.

9. What are the ways to create a macro variable?

There are 5 ways to create macro variables:
  • %Let
  • Iterative %DO statement
  • Call Symput 
  • PROC SQL INTO clause
  • Macro Parameters

10. How to rename columns with PROC SQL?

Use AS alias.
Proc SQL;
select name as fullname from table1;
quit;

11. How to calculate percentile values with SAS?

We can use PROC MEANS or PROC UNIVARIATE to calculate percentile values. For example, specify the options P10 and P90 to calculate the 10th and 90th percentiles. PROC MEANS cannot calculate custom percentiles such as the 97.5th or 99.5th; for those, use the PCTLPTS= option in PROC UNIVARIATE (see the sketch below).

Tutorial : PROC UNIVARIATE
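A minimal sketch of both approaches on SASHELP.CLASS (the output dataset name PCTLS is made up):
* standard percentiles with PROC MEANS;
proc means data = sashelp.class p10 p90;
var height;
run;
* custom percentiles with PROC UNIVARIATE;
proc univariate data = sashelp.class noprint;
var height;
output out = pctls pctlpts = 97.5 99.5 pctlpre = P; /* creates P97_5 and P99_5 */
run;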


12. How to replace missing values of all the numeric variables to 0 in a single run?

We can use the special name list _numeric_ to reference all the numeric variables, and the DIM function on the array to count them.
data temp;
set sampledata;
array Q(*) _numeric_;
do i= 1 to dim(Q);
if Q(i) = . then Q(i)= 0;
end;
run;

SQL

1. How to write conditional statements (IF ELSE) in SQL?

In SQL, this is possible with CASE WHEN statements.
proc sql;
select
        case when sex='M' then 1 else 0 end as males
        , case when sex='F' then 1 else 0 end as females
    from sashelp.class;
quit;

2. What are the common SQL data types?

Data Type     Format
Numeric       NUMBER(L,D), INTEGER, SMALLINT, DECIMAL(L,D)
Character     CHAR(L), VARCHAR(L)
Date          DATE

3. How to subset or filter data in SQL?

We can use WHERE clause to subset or filter data.
SELECT *
FROM PRODUCT
WHERE SALES > 200

4. Difference between WHERE and HAVING clauses

The HAVING clause comes into effect when GROUP BY is used: it runs after GROUP BY, so it can filter on grouped (aggregated) values. The WHERE clause runs before GROUP BY, so it filters individual rows and cannot reference aggregated values (see the sketch below).
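A minimal PROC SQL sketch of the difference, using SASHELP.CLASS (the CALCULATED keyword is SAS-specific):
proc sql;
select sex, mean(height) as avg_height
from sashelp.class
where age > 12                      /* row-level filter, applied before grouping */
group by sex
having calculated avg_height > 60;  /* group-level filter, applied after aggregation */
quit;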


5. Difference between Full Join and Cross Join?

A full join keeps all rows from both input tables even when no matching row is found.
A cross join returns the Cartesian product of the tables: it matches every row of one table with every row of the other.

6. Difference between UNION and UNION ALL

The main use of UNION and UNION ALL is to stack the results of two queries. The difference is that UNION removes duplicate records whereas UNION ALL keeps them. By 'duplicate records', we mean rows in which all the values are the same (see the sketch below).
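A minimal PROC SQL sketch, assuming two hypothetical tables TABLE_A and TABLE_B that share a column NAME:
proc sql;
/* distinct rows from both tables */
select name from table_a
union
select name from table_b;
/* all rows, duplicates kept - also faster, as no de-duplication pass is needed */
select name from table_a
union all
select name from table_b;
quit;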

7. How to create a blank table

Method I :

The following method creates a new table called temp2 with the same column names and attributes as the table temp.
CREATE TABLE TEMP2 LIKE TEMP;
Method II :

In this case, we create a blank table by subsetting the data. As 1 is never equal to 2, the query returns zero rows.
create table temp3 as
select * from temp
where 1=2;

8. What will be the result of the query below?
select case when null = null then 'Yes' else 'No' end as Result;
It will return 'No', as the code above is not the right way to compare null values. The correct way is to use the IS keyword -
select case when null is null then 'Yes' else 'No' end as Result;

9. Suppose you have a table named TEMP and need to recode the column Y, swapping the values 2 and 3

Table

UPDATE TEMP
SET Y= CASE WHEN Y = 2 THEN 3
WHEN Y = 3 THEN 2
ELSE Y END;

10. Identify second maximum value
select max(y) from temp
where y not in (select max(y) from temp);
In this code, the logic is to remove the maximum value from the main table and then take the max of the remaining values, which gives the second maximum of the main table.

11. Identify second maximum value by a group

The code below first removes all the group-wise maximum values from the main table (the left join plus the IS NULL filter), and then calculates the maximum per group among the remaining values, i.e. the second maximum by group.
select a.x, max(a.y) as maxy from temp a left join
(select x, max(y) as maxy from temp group by 1) b
on a.x = b.x and a.y = b.maxy
where b.x is null and b.maxy is null
group by 1;

12. Is the query below correct? If not, what's the issue?
SELECT custid, YEAR(ref_date) AS ref_year
FROM custmart
WHERE ref_year  >= 2015;
The calculated column cannot be used in the WHERE condition, so we need to modify the code like this -
SELECT custid, YEAR(ref_date) AS ref_year
FROM custmart
WHERE YEAR(ref_date)  >= 2015;

HR / Project Related Questions

1. Explain one of your projects

  • Start from problem definition
  • Explain Data Cleaning, Exploration and Data Preparation Steps
  • What technique / algorithm is used in the project?
  • Financial (Dollar) value impact of the project


2. What are your strengths and weaknesses?

3. Why are you leaving the current organization?

4. Why should we hire you?

5. Where do you see yourself five years from now?

6. What was the toughest decision you ever had to make?

End Notes

The above list of questions will help you prepare for senior / lead data analyst interviews. Don't just mug up the answers; understand the concepts behind the topics covered in these questions.

Support Vector Machines Simplified using R

This tutorial describes the theory and practical application of Support Vector Machines (SVM) with R code. SVM is a very popular supervised learning algorithm that works for both classification and regression problems.

What is Support Vector Machine?

The main idea of the support vector machine is to find the optimal hyperplane (a line in 2D, a plane in 3D and a hyperplane in more than 3 dimensions) which maximizes the margin between two classes. In this case, the two classes are the red and blue balls. In layman's terms, it finds the optimal separating boundary between two classes (events and non-events).
Support Vector Machine (SVM)
Support vectors are the observations that support the hyperplane on either side. In the image above, the filled red and blue boxes and circles are support vectors.

Why Hyperplane?

A hyperplane is just a line in 2D and a plane in 3D. In higher dimensions (more than 3D), it's called a hyperplane. SVM helps us find the hyperplane (separating boundary) that can separate the two classes (red and blue dots).

What is Margin?

It is the distance between the hyperplane and the closest data point; doubling that distance gives the margin.
Objective: Maximize the margin between two categories

How to find the optimal hyperplane?

In your dataset, select two parallel hyperplanes which separate the data with no points between them, and maximize the distance between these two hyperplanes. That distance is the margin (see the formulation below).
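For reference, this is the standard hard-margin formulation from the SVM literature (the notation here is mine, not from the original post): for training points $x_i$ with labels $y_i \in \{-1, +1\}$,

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w^{\top}x_i + b) \ge 1, \qquad i = 1, \dots, n$$

The two supporting hyperplanes are $w^{\top}x + b = \pm 1$, so the gap between them is $2/\lVert w \rVert$; minimizing $\lVert w \rVert$ is therefore the same as maximizing the margin.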

How to treat Non-linear Separable Data?
Imagine a case where no straight line (or hyperplane) can separate the two classes. In the image below, red and blue data points are spread over a circle in 2D such that adjacent data points are of different colors.
SVM : Nonlinear Separable Data

SVM handles the above case by using a kernel function to deal with non-linearly separable data. It is explained in the next section.

What is Kernel?

In simple words, it is a method that lets SVM run on non-linearly separable data points. The kernel function transforms the data into a higher dimensional feature space where a linear separation becomes possible. See the image below -
SVM Kernel
In this case, we have transformed our data into a 3 dimensional space, where the data points can be separated by a simple plane. This is made possible by the kernel function. We can say a nonlinear function is learned by a linear learning machine in a high-dimensional feature space.

Different Kernels

1. linear: u'*v
2. polynomial: (gamma*u'*v + coef0)^degree
3. radial basis (RBF) : exp(-gamma*|u-v|^2)
4. sigmoid : tanh(gamma*u'*v + coef0)

RBF is generally the most popular one.

How SVM works?
  1. Choose an optimal hyperplane which maximizes the margin.
  2. Apply a penalty for misclassifications (the cost 'C' tuning parameter).
  3. If the data points are not linearly separable, transform them to a high dimensional space where they are easier to classify with linear decision surfaces (the kernel trick).

Advantages of SVM
  1. SVM performs well on non-linearly separable data, thanks to the kernel trick.
  2. It works well in high dimensional spaces (i.e. with a large number of predictors).
  3. It works well for text or image classification.
  4. It does not suffer from the multicollinearity problem.

Disadvantages of SVM
  1. It takes a lot of time on large data sets.
  2. It does not directly return probability estimates.
  3. With linearly separable data, the linear kernel behaves much like logistic regression.


Multi-Category Classes and SVM
Multi-category problems can be split into multiple one-versus-one or one-versus-rest binary classification problems.

Support Vector Machine - Regression

Yes, support vector machines can also be used for regression problems, where the dependent (target) variable is continuous.

The goal of SVM regression is the same as in classification, i.e. to find the maximum margin - here, in the sense of minimizing error. In the regression case, a margin of tolerance (epsilon) is set around the fitted function. The primary goal is to minimize error while finding the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.

Tuning Parameters of SVM

If Linear Kernel SVM

There is only one tuning parameter in the linear kernel - cost (C). It sets the misclassification cost on the training data.
  1. A large C gives you low bias and high variance: you penalize the cost of misclassification heavily, forcing the algorithm to fit the training data more strictly and potentially overfit.
  2. A small C gives you higher bias and lower variance: misclassification is cheap, allowing more of it for the sake of a wider "cushion".
The goal is to find the balance between "not too strict" and "not too loose". Cross-validation and resampling, along with grid search, are good ways to find the best C.

If Non-Linear Kernel (Radial)

Two parameters for fine tuning in radial kernel - Cost and Gamma

The parameter cost is already explained above (See the 'Linear Kernel' section).

Gamma controls how far the influence of a single training example reaches. When gamma is very small, the model is too constrained and cannot capture the complexity or "shape" of the data; when gamma is very large, each point's influence is very local and the model can overfit.

R : Grid Search
cost=10^(-1:2), gamma=c(.5,1,2)
Polynomial kernel Tuning Parameters:
  1. degree (Polynomial Degree)
  2. scale (Scale)
  3. C (Cost)

Support Vector Regression Tuning Parameters:
  1. Epsilon (e)
  2. Cost (c)
epsilon = seq(0, 0.9, 0.1),  cost = 2^(2:8)

R Code : Support Vector Machine (SVM)

Read Data
library(caret)
data(segmentationData)

Data Exploration

#Number of rows and columns
dim(segmentationData)
[1] 2019   61

#Distribution of Target Variable
table(segmentationData$Class)
  PS   WS 
1300  719

table(segmentationData$Class) / length(segmentationData$Class)
       PS        WS 
0.6438831 0.3561169 


Split Data into Training and Validation
Index <- createDataPartition(segmentationData$Class,p=.7,list=FALSE)
svm.train <- segmentationData[Index,]
svm.validate  <- segmentationData[-Index,]
trainX <-svm.train[,4:61] 

Build SVM model in R

# Setup for cross validation
set.seed(123)
ctrl <- trainControl(method="cv",
                     number = 2,
                     summaryFunction=twoClassSummary,
                     classProbs=TRUE)

# Grid search to fine tune SVM
grid <- expand.grid(sigma = c(.01, .015, 0.2),
                    C = c(0.75, 0.9, 1, 1.1, 1.25)
)

#Train SVM
svm.tune <- train(x=trainX,
                  y= svm.train$Class,
                  method = "svmRadial",
                  metric="ROC",
                  tuneGrid = grid,
                  trControl=ctrl)

svm.tune

# Predict Target Label
valX <-svm.validate[,4:61]
pred <- predict(svm.tune, valX, type="prob")[,2]  # probabilities of the second class, as a vector

# Model Performance Statistics
library(ROCR)
pred_val <- prediction(pred, svm.validate$Class)

# Calculating Area under Curve
perf_val <- performance(pred_val,"auc")
perf_val

# Calculating True Positive and False Positive Rate
perf_val <- performance(pred_val, "tpr", "fpr")

# Plot the ROC curve
plot(perf_val, col = "green", lwd = 1.5)

#Calculating KS statistics
ks <- max(attr(perf_val, "y.values")[[1]] - (attr(perf_val, "x.values")[[1]]))
ks

Model Performance Metrics -

1. AUC on validation =  0.9168
Area under Curve
2. KS on validation = 0.68

How to choose the right Kernel
There is no rule of thumb for choosing the best kernel; the only solution is cross-validation. Try several different kernels, evaluate their performance metrics such as AUC, and select the one with the highest AUC. If speed matters, linear kernels usually compute much faster than radial or polynomial kernels.

Calculating Driving Distance with SAS

This tutorial demonstrates how to calculate driving distance with SAS. Calculating the travel distance between two locations is an important step of geospatial analysis, and geo analytics / network analysis is very much in demand today. The main ideas are finding the shortest path over existing road networks, calculating actual travel time, and estimating the trip fare based on distance. For example, when you hire a private cab (taxi) through the Uber app, you can estimate the fare prior to booking. Open-source languages such as R and Python offer various packages to solve this kind of problem. SAS also has a couple of built-in functions to calculate the distance between two points.

Ways to measure travel distance

There are multiple ways to compute geographical distance. Some of the methods are listed below -
  1. Driving Distance via Google Map
  2. Straight Line Distance such as Euclidean
  3. Manhattan (Taxi cab distance)
  4. Vincenty distance formula
  5. Haversine distance formula

Which method is more suitable
  1. Driving distance via Google Maps is the most accurate when you need the real driving distance, which accounts for waiting time at traffic signals or real-time traffic between the two locations.
  2. Straight line distance (Euclidean) is more suitable when you need the distance between two fixed points, such as two trees, or if you are traveling by plane; note that it ignores the earth's curvature.
  3. Manhattan distance (taxicab) is more suitable when you are traveling by cab/taxi. It is a better measure when you need the travel distance between a customer's location and an office location.
  4. Vincenty and Haversine distances are calculated from the latitude and longitude of the zip or postal code. Vincenty is generally more accurate than Haversine; the Vincenty distance between two points is accurate to within 0.5 mm.


SAS : Driving Distance via Google Map

In SAS, there is no built-in function to connect to Google Maps and extract driving distance. To work around this, we connect to Google Maps via a web link. Check out the link below - https://www.google.com/maps/dir/110051/110053?force=lite

In this link, 110051 is a starting point and 110053 is a destination point.

SAS Macro to extract driving distance and time
*Starting point;
%let p1=12303;

* Destination Points (postcodes);
data postcodes;
input pcode @@;
datalines;
12502 12507
;

proc datasets lib=work nolist;
delete driving_distance;
quit;

data _null_;
call symputx('ncodes',obs);
stop;
set postcodes nobs=obs;
run;

* Macro to connect to Google Map;
%macro driving_distance;
%do j=1 %to &ncodes;
data _null_;
nrec = &j;
set postcodes point=nrec;
call symputx('p2',put(pcode,z5.));
stop;
run;

%put &p1 &p2;
filename x url "https://www.google.com/maps/dir/&p1/&p2/?force=lite";
filename z temp;

data _null_;
infile x recfm=f lrecl=1 end=eof;
file z recfm=f lrecl=1;
input @1 x $char1.;
put @1 x $char1.;
if eof;
call symputx('filesize',_n_);
run;

* drive time as a numeric variable;
data temp;
retain pcode &p2;
infile z recfm=f lrecl=&filesize. eof=done;
input @ 'miles' +(-15) @ '"' distance :comma12. text $30.;
units    = scan(text,1,'"');
text     = scan(text,3,'"');

* convert times to seconds;
  select;
* combine days and hours;
   when (find(text,'d') ne 0)  time = sum(86400*input(scan(text,1,' '),best.),
                                        3600*input(scan(text,3,' '),best.));
* combine hours and minutes;
   when (find(text,'h') ne 0)  time = sum(3600*input(scan(text,1,' '),best.),
                                        60*input(scan(text,3,' '),best.));
* just minutes;
   otherwise                   time = 60*input(scan(text,1,' '),best.);
  end;

output;
keep pcode distance units time;
stop;
done:
output;
run;

filename x clear;
filename z clear;

proc append base=driving_distance data=temp;
run;
%end;
%mend;

* use the macro;
%driving_distance;

proc print data=driving_distance;
format pcode z5. distance comma6. time time6.;
run;
The output is shown in the image below -
SAS Driving Distance and Time
Straight Line Distance / Euclidean Distance

The Euclidean distance is one of the most popular ways to calculate the distance between two points. It is also used in cluster analysis to measure the similarity between two observations.
Euclidean = √((x2 - x1)² + (y2 - y1)²)
For geographical points, the x and y coordinates are easting and northing: easting is the eastward-measured distance (x coordinate) and northing the northward-measured distance (y coordinate).

For zip code - 12502,  x and y coordinates are listed below -
x - 610435  y - 4660250
For zip code - 12303, x and y coordinates are as follows -
x - 589594  y - 4732566
data _null_;
x1 = 610435;
y1 = 4660250;
x2 = 589594;
y2 = 4732566;
euclidean_dist =  round(sqrt(( x2 - x1)**2 + (y2 - y1)**2)/1609,0.1);
put euclidean_dist =;
run;
Result - 46.8 miles

Since the coordinates are in metres, the computed distance is in metres; we divide it by 1609 because 1609 metres equal 1 mile. (To measure the distance in km, divide by 1000 instead.) The resulting value of 46.8 miles is far below the actual driving distance of 61 miles. Since Euclidean distance ignores earth curvature, it is not a good approximation of actual driving distance.


Manhattan Distance

It is a good approximation of actual driving distance. It is also called taxi cab distance. It is useful when we need to estimate the road distance between two geographical points.
Manhattan = abs(x2 - x1) + abs(y2 - y1)
data _null_;
x1 = 610435;
y1 = 4660250;
x2 = 589594;
y2 = 4732566;
Manhattan_dist =  round((abs(x2 - x1) + abs(y2 - y1))/1609,0.1);
put Manhattan_dist =;
run;
Result - 57.9 miles

Vincenty or Haversine Distance

SAS has two in-built functions - ZIPCITYDISTANCE and GEODIST - which are based on the Vincenty distance method to compute geographical distance.

The ZIPCITYDISTANCE function uses latitude and longitude from the SASHELP.ZIPCODE data set and computes the distance between zip code centroids for any two zip codes specified by the user.
proc print data = SASHELP.ZIPCODE (where = (ZIP IN(12502, 12303))) noobs;
var zip x y;
run;
LAT / LON
In this case, Y refers to latitude and X refers to longitude. In the ZIPCITYDISTANCE function, you just need to specify the zip codes.
data _null_;
zip_dist = zipcitydistance(12502, 12303);
put zip_dist =;
run;
Result : 48.5 miles

Note : The ZIPCITYDISTANCE function works only for US zipcodes.

The GEODIST function works for all countries. In this function, you need to specify the latitude and longitude of the zip codes.
data _null_;
long1 = -73.679520;
lat1 = 42.107220;
long2 = -73.938595;
lat2 = 42.783122;
zip_dist = geodist(lat1, long1, lat2, long2, 'M');
put zip_dist =;
run;
It returns 48.5 miles (same as ZIPCITYDISTANCE), as both functions are based on the same algorithm, i.e. the Vincenty distance.

Haversine Distance
data _null_;
long1 = -73.679520;
lat1 = 42.107220;
long2 = -73.938595;
lat2 = 42.783122;
conversion = constant('pi')/180;
lat1  = lat1  * conversion;
long1 = long1 * conversion;
lat2  = lat2  * conversion;
long2 = long2 * conversion;
zip_dist = 3949.99 * arcos(sin(lat1) * sin(lat2) + cos(lat1) *
               cos(lat2) * cos(long2 - long1));

put zip_dist =;
run; 
Result : 48.4 miles

Comparison of Distance Methods

Method        Distance (miles)
Google Map    61
Euclidean     46.8
Manhattan     57.9
Vincenty      48.5
Haversine     48.4
Calculating driving distance via Google Maps is the most precise method, but it is not efficient in terms of speed, as it takes time to connect to the map and pull values. Sometimes the Windows firewall blocks the program from connecting to Google Maps. In that case, we should use Manhattan distance, as it stands out as the best estimate of actual driving distance among the remaining methods.

SAS : INTNX Function with Examples

This tutorial explains how the SAS INTNX function works, with practical examples to help you understand it.

SAS INTNX : Introduction

The SAS function INTNX increments a SAS date by a specified number of intervals. It helps to answer questions like the following.

Examples 
  1. When is next Monday?
  2. When was last Friday?
  3. What would be the date after 21 weeks?
  4. Subtract 2 quarters from the current date

SAS INTNX Syntax

The first three parameters of the INTNX function are mandatory and the fourth one is optional.
INTNX(interval, start-from, increment,  [alignment])
  1. Interval is the unit of measurement. Intervals can be days, weeks, months, quarters or years.
  2. Start-from is the SAS date value to be incremented.
  3. Increment is the number of intervals by which the date is incremented. It can be zero, positive or negative. A negative value refers to previous dates.
  4. Alignment [optional parameter] specifies where the date value is aligned within the interval before being incremented. The values you can specify are 'beginning', 'middle', 'end' and 'sameday'. The default is 'beginning'.

INTNX : Examples

1. Add 7 days to a specific date

In the following code, we are adding seven days to 02 January 2017.
data temp;
mydate = '02JAN2017'd;
day=intnx('day', mydate , 7);
format mydate day date9.;
run;

Result : day = 09JAN2017 
SAS INTNX
If you are wondering how INTNX differs from simply adding 7 to the mydate variable, as in the code below, you will find the answer in the next example.
day = mydate + 7;

2. Find Next Sunday

In this case, we need to answer the question 'When is next Sunday?'. 02 January 2017 is a Monday.
data temp;
mydate = '02JAN2017'd;
nextsunday=intnx('week', mydate , 1);
format mydate nextsunday date9.;
run;
Result : nextsunday = 08JAN2017

It returns 08JAN2017 as it aligns to the beginning of the period; the 'beginning' alignment is the default in the INTNX function. In other words, if you change mydate to '04JAN2017'd, it still returns 08JAN2017, as the next Sunday is the same within this week interval.

If you want to add exactly 1 week to the date, you can use 'sameday' as the fourth parameter of this function. See the statement below -
nextsunday = intnx('week', mydate, 1, 'sameday'); * returns 09JAN2017;

3. Get First Date

Suppose you need to find the first day of the month for a specific date. For example, if today is 09 January 2017, the first day of the month is 01 January 2017.
data temp;
set sashelp.citiday;
firstday=intnx('month', date , 0);
format firstday date9.;
proc print data = temp;
var date firstday;
run;
SAS : Get First Day
By specifying 0 in the third parameter of the INTNX function, we calculate the first day of the month for each date.

4. When was Last Tuesday?

It is tricky to figure out the date of last Tuesday. 13 January 2017 is a Friday. In a real-world dataset, we don't know the weekday of each date in a list, so we need code to get last Tuesday.
data temp;
mydate = '13JAN2017'd;
lasttuesday = intnx('week.3', mydate , 0);
format mydate lasttuesday date9.;
proc print;
run;
It returns 10JAN2017. In this case, week.3 defines a week interval that begins on Tuesday, so aligning to the beginning of that interval returns the most recent Tuesday. Similarly, week.2 refers to a week beginning on Monday.

5. Adjustment within the Interval

This program explains how the INTNX function adjusts / aligns dates within the specified interval.
data temp;
mydate = '31JAN2017'd;
beginning=intnx('year', mydate , 1, 'b');
middle=intnx('year', mydate , 1, 'm');
end=intnx('year', mydate , 1, 'e');
sameday=intnx('year', mydate , 1, 's');
format mydate beginning middle end sameday date9.;
proc print;
run;
The abbreviation 'b' refers to beginning, 'm' to middle, 'e' to end and 's' to sameday. The default value is 'b' if you don't specify anything in the fourth parameter.

Result
  1. beginning = 01JAN2018
  2. middle = 02JUL2018
  3. end = 31DEC2018
  4. sameday = 31JAN2018
SAS INTNX Alignment

6. Datetime Formats

Like date intervals, we can use time and datetime intervals in the INTNX function to increment time (seconds / minutes / hours).
data temp;
mydt = '29JAN2017:08:34:00'dt;
seconds=intnx('second', mydt , 1);
minutes=intnx('minute', mydt , 1);
hours=intnx('hour', mydt , 1);
days=intnx('dtDay', mydt , 1);
weeks=intnx('dtWeek', mydt , 1);
format mydt seconds minutes hours days weeks datetime20.;
proc print NOOBS;
run;
SAS Datetime Formats


Understanding Bias-Variance Tradeoff

This tutorial explains the concept of the bias-variance tradeoff in machine learning. It is one of the most commonly confused terms in the predictive modeling world.

Bias

Bias refers to a model that fits the training data poorly but produces similar results on data outside the training set. It is related to underfitting. In simple/layman's terms, we are building simple models that predict terribly far from reality, but they don't change much from dataset to dataset.

For example, a linear regression model would have high bias when trying to model a non-linear relationship. It is because linear regression model does not fit non-linear relationship well.

  1. High bias : linear regression applied to a quadratic relationship.
  2. Low bias : a second-degree polynomial applied to quadratic data.


High Bias Techniques
Linear Regression, Linear Discriminant Analysis and Logistic Regression
Low Bias Techniques
Decision Trees,  K-nearest neighbours and Gradient Boosting

We can also interpret like this -
Parametric algorithms, which assume something about the distribution of the data points, suffer from high bias, whereas non-parametric algorithms, which do not assume anything special about the distribution, have low bias.

Variance

With high variance, we are building complex models that fit well on training data but cannot generalise the pattern, which results in overfitting. It means they don't fit well on data outside training (i.e. validation / test datasets). In simple terms, they might predict close to reality on average, but they tend to change much more with small changes in the input.

An algorithm like a decision tree has low bias but high variance, because its structure can change easily with small changes in the input data. In general, it does not generalize the pattern well, which leads to overfitting.

Low Variance Techniques
Linear Regression, Linear Discriminant Analysis, Random Forest, Logistic Regression
High Variance Techniques
Decision Trees,  K-nearest neighbours and Support Vector Machine (SVM)

Bias Variance Trade-off
It means there is a trade-off between predictive accuracy on the training data and generalization of the pattern outside the training data. Increasing the accuracy of the model on training data can lead to less generalization outside it. Increasing the bias will decrease the variance, and increasing the variance will decrease the bias.
Bias Variance Tradeoff
In the image above, the red ball is the target. Hits close to it are low-bias cases. If each subsequent hit is close to the previous hit, those are low-variance cases.


How to correct Bias-Variance Error

1. Try a smaller number of predictors (only the important ones) when you have high variance.

2. Try a larger number of predictors, or transform the existing predictors, when you have high bias.

3. Get more training data when you have high variance.


Practical Approaches

1. In a support vector machine (SVM), the cost (C) parameter decides the bias-variance tradeoff. A large C gives you low bias and high variance: low bias because you penalize the cost of misclassification heavily, which forces the algorithm to explain the input data more strictly and potentially overfit. A small C gives you higher bias and lower variance: it makes the cost of misclassification low, allowing more misclassifications for the sake of a wider "cushion".

2. In the k-nearest neighbors algorithm, the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction; this in turn increases the bias and lowers the variance of the model (see the sketch after this list).

3. In decision trees, pruning is a method to reduce variance. It reduces the size of decision trees by removing sections of the tree that provide little power to classify instances.
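To make the k-nearest neighbors point concrete, below is a minimal R sketch using the class package and the built-in iris data; the seed, the 100-row split and the two k values are arbitrary choices for illustration.
library(class)   # provides the knn() classifier
set.seed(123)
idx   <- sample(nrow(iris), 100)    # 100 training rows, the rest for testing
train <- iris[idx, 1:4];  test <- iris[-idx, 1:4]
cl    <- iris$Species[idx]; truth <- iris$Species[-idx]
# k = 1 : low bias, high variance (the fit follows individual training points)
pred_k1  <- knn(train, test, cl, k = 1)
# k = 25 : higher bias, lower variance (predictions are averaged over many neighbors)
pred_k25 <- knn(train, test, cl, k = 25)
mean(pred_k1 != truth)    # test error with k = 1
mean(pred_k25 != truth)   # test error with k = 25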

Training and Cross-Validation Error
  1. High Variance - a large difference between the cross-validation error and the training-set error. To overcome it, use more training data. If it is not possible to include more data, keep only the important predictors or independent variables.
  2. High Bias - a small difference between the cross-validation and training error. Try more predictors or transform the existing predictors.

Import Data in Python

This tutorial explains various methods to read data into Python. Data can be in any of the popular formats - CSV, TXT, XLS/XLSX (Excel), sas7bdat (SAS), Rdata (R) etc.
Import Data into Python
While importing external files, we need to check the following points -
  1. Check whether a header row exists or not
  2. Treatment of special values as missing values
  3. Consistent data type in a variable (column)
  4. Date-type variables in a consistent date format
  5. No truncation of rows while reading external data

Install and Load pandas Package

pandas is a powerful data analysis package. It makes data exploration and manipulation easy. It has several functions to read data from various sources.

If you are using Anaconda, pandas is already installed. You need to load the package by using the following command -
import pandas as pd
If the pandas package is not installed, you can install it by running the following code in the IPython console. If you are using Spyder, you can submit it in the IPython console within Spyder.
!pip install pandas
If you are using Anaconda, you can try the following line of code to install pandas -
!conda install pandas
1. Import CSV files

It is important to note that a single backslash does not work when specifying the file path. You need to either change it to a forward slash or add one more backslash, as shown below -
import pandas as pd
mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
If no header (title) in raw data file
mydata1  = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None)
You need to include header = None option to tell Python there is no column name (header) in data.

Add Column Names

We can include column names by using names= option.
mydata2  = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])
The variable names can also be added separately by using the following command.
mydata1.columns = ['ID', 'first_name', 'salary']


2. Import File from URL

You don't need to perform additional steps to fetch data from a URL. Simply put the URL in the read_csv() function (applicable only for CSV files stored at a URL).
mydata  = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

3. Read Text File 

We can use read_table() function to pull data from text file. We can also use read_csv() with sep= "\t" to read data from tab-separated file.
mydata = pd.read_table("C:\\Users\\Deepanshu\\Desktop\\example2.txt")
mydata  = pd.read_csv("C:\\Users\\Deepanshu\\Desktop\\example2.txt", sep ="\t")

4. Read Excel File

The read_excel() function can be used to import excel data into Python.
mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls",sheetname="Data 1", skiprows=2)
If you do not specify the sheet name in the sheetname= option, it takes the first sheet by default.

5. Read delimited file

Suppose you need to import a file that is separated with white spaces.
mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", header = None)
To include variable names, use the names= option like below -
mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", names=['a', 'b', 'c', 'd'])
6. Read SAS File

We can import SAS data file by using read_sas() function.
mydata4 = pd.read_sas('cars.sas7bdat')
7. Read SQL Table

We can extract a table from a SQL database. The program below reads from a SQLite database; the same read_sql() approach works for other databases such as Teradata or SQL Server.
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('C:/Users/Deepanshu/Downloads/flight.db')
query = "SELECT * FROM flight;"
results = pd.read_sql(query, con=conn)
print(results.head())

8. Read sample of rows and columns

By specifying nrows= and usecols=, you can fetch specified number of rows and columns.
mydata7  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=(1,5,7))
nrows=5 implies you want to import only the first 5 rows and usecols= refers to the specific columns you want to import.

9. Skip rows while importing

Suppose you want to skip the first 5 rows and read data from the 6th row (the 6th row becomes the header row) -
mydata8  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
10. Specify values as missing values

By including the na_values= option, you can specify values to be treated as missing. In this case, we are telling Python to consider dot (.) as a missing value.
mydata9  = pd.read_csv("workingfile.csv", na_values=['.'])

Detecting Interaction in Regression Model

This tutorial describes an easy and effective method to detect interaction in a regression model.

What is Interaction

Interaction is defined as a combination of variables. If the dependent variable is Y and there is an interaction between two predictors X1 and X2, it means that the relationship between X1 and Y differs depending on the value of X2.

Example -

Suppose you need to predict employee attrition - whether an employee will leave the organisation or not (binary - 1/0). Employee attrition depends on various factors such as tenure within the organization, educational qualification, last year's rating, type of job, skill type etc.

Let's build a simple predictive employee attrition model - 

For demonstration, take only two independent variables - tenure within the organization (Tenure) and last year's rating (Rating; two categories - Average / Above Average). Target variable - Attrition (1/0). The logistic regression equation looks like below -
logit(p) = Intercept + B1*(Tenure) + B2*(Rating)

Adding Interaction of Tenure and Rating

Adding interaction indicates that the effect of Tenure on the attrition is different at different values of the last year rating variable. The revised logistic regression equation will look like this:

logit(p) = Intercept + B1*(Tenure) + B2*(Rating) + B3*Tenure*Rating

Run Logistic Regression without Interaction

In SAS, you can run logistic regression with PROC LOGISTIC.
proc logistic data = mydata;
class Rating;
model Attrition = Tenure Rating;
run;
Model Statistics
The c-statistic is 0.905. It is also called the area under the curve (AUC). It is an important metric which helps to compare models.

Run Logistic Regression with Interaction
proc logistic data = mydata;
class Rating;
model Attrition = Tenure | Rating @2 / selection = stepwise slentry=0.15 slstay=0.20;
run;
To include all possible interactions, you can use '|' in the MODEL statement of PROC LOGISTIC. The @n specifies the maximum number of predictors that can be involved in an interaction. For example, '@2' refers to 2-way interactions and '@3' refers to 3-way interactions. In this code, the 2-way specification expands to the main effects Tenure and Rating plus the interaction Tenure*Rating.

In the code, we are performing stepwise logistic regression, which uses a 0.15 significance level for adding a variable and a 0.20 significance level for removing a variable.

Model Statistics - Model II
The AUC score has increased from 0.905 to 0.926. It means it is worth adding the interaction to the predictive model.

Important Points to Consider
  1. Make sure you check both training and validation scores when adding interactions, because adding interactions may overfit the model.
  2. Check AUC and lift in the top deciles while comparing models.
  3. Make sure there is no break in rank ordering when interactions are included.
  4. Adding transformed variables along with interactions makes the model more robust.
  5. You can add more than 2-way interactions, but that would be memory intensive.

SAS : Calculate AUC of Validation Data

This tutorial explains how to calculate the area under the curve (AUC) of a validation sample. The AUC of the validation sample is calculated by applying the coefficients (estimates) derived from the training sample to the validation sample. This process is called scoring. The detailed explanation is listed below -

Steps of calculating AUC of validation data

1. Split data into two parts - 70% Training and 30% Validation. It can be 60/40 or 80/20.

2. Run logistic regression model on training sample.

3. Note the coefficients (estimates) of the significant variables from the model run in Step 2.

4. Apply the following equation to calculate predictive probability in the validation sample

Logistic Regression Equation
In this case, b0 is the intercept and b1...bk are the coefficients derived from the training sample (Step 2).

5. Calculate Area under Curve (AUC) considering probability scores derived in Step 4.


Method I : PROC LOGISTIC to calculate AUC of Validation
Proc Logistic Data = training outest=coeff descending;
class rank / param = ref;
Model admit = gre gpa rank / selection = stepwise slstay=0.15 slentry=0.15 stb;
Run;
Proc Logistic Data = validation inest=coeff descending;
class rank / param = ref;
Model admit = gpa rank / MAXITER=0;
Run;
The OUTEST= option in PROC LOGISTIC stores the final estimates in a SAS dataset; in this case, it is stored in the dataset named COEFF. We ran stepwise regression, which dropped the insignificant variable GRE.

The INEST= option in the PROC LOGISTIC uses the final parameter estimates calculated from training dataset.

The MAXITER= option in the MODEL statement specifies the maximum number of iterations to perform. The combination of DATA=validation data, INEST=final estimates from training data, and MAXITER=0 causes PROC LOGISTIC to score, not refit, the validation data.

Important Points
  1. Use only the significant variables from the training model in the MODEL statement of PROC LOGISTIC DATA = VALIDATION. In the above example, the GRE variable was not included.
  2. The parameter estimates (coefficients) produced by both PROC LOGISTIC programs remain the same, because the second PROC LOGISTIC scores the data rather than building a model.
  3. Ignore the warning 'Convergence was not attained in 0 iterations' when running the second PROC LOGISTIC step.
AUC - Validation

The c-statistic is the AUC (area under the curve). In this case, it is 0.607.

Method II : PROC NPAR1WAY

In this method, we use the Wilcoxon method to calculate the AUC of the validation sample. First, we score the validation sample using the SCORE statement in PROC LOGISTIC.
Proc Logistic Data = training descending;
class rank / param = ref;
Model admit = gre gpa rank / selection = stepwise slstay=0.15 slentry=0.15 stb;
score data=validation out = valpred;
Run;

ods select none;
ods output WilcoxonScores=WilcoxonScore;
proc npar1way wilcoxon data= valpred ;
where admit^=.;
class admit;
var  p_1;
run;
ods select all;

data AUC;
set WilcoxonScore end=eof;
retain v1 v2 1;
* v1 = |expected sum - observed sum| of Wilcoxon scores for the first class;
if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
* v2 accumulates the product of the two class sizes (n1*n2);
v2=N*v2;
if eof then do;
d=v1/v2;
* AUC = d + 0.5 and Gini = 2*AUC - 1;
Gini=d * 2;    AUC=d+0.5;
put AUC=  GINI=;
keep AUC Gini;
output;
end;
run;
The above program returns an AUC score of 0.6062.

Extract last 4 values of variable in SAS

This tutorial explains how to extract the last n characters or digits of a variable in SAS. We will cover several cases in which we pull the last 4 character or numeric values from a column. In MS Excel, this is easily done with the RIGHT() function, but there is no in-built function for it in SAS. The RIGHT() function in SAS does something else: it right-aligns a string or character value.

Example 1

Suppose you have a product ID in which the last 4 characters refer to a product category, and you are asked to pull the product category information.
data example;
productID = "AB123ZX253";
run;

data example;
set example;
referenceid = substr(productID,length(productID)-3,4);
run;

The output is shown in the image below -
Get last N Characters
Explanation

The SUBSTR() function returns a sub-string from a character variable.
= SUBSTR(character-variable, beginning-position, number-of-characters-to-pull)
The LENGTH() function returns the length of a character variable. In this case, it is 10 characters long. The SUBSTR() call therefore resolves as below -
= SUBSTR(productID, 10-3, 4)
= SUBSTR(productID, 7, 4)

Example 2 : Handle missing while extracting

Suppose you have multiple product IDs in which some of them are missing.
data example2;
input productID $10.;
cards;
AB123ZX253

AB123ZX254
AB123ZX255
;
run;
When you apply the SUBSTR() function to missing cases, it writes the note 'Invalid second argument' to the log. See the image below

SUBSTR : Missing Cases
To work around this problem, we can use SUBSTRN(), which handles missing cases while extracting.
data example2;
set example2;
referenceid = substrn(productID,length(productID)-3,4);
run;
Example 3 : Get last N digits from a numeric variable

Suppose product ID variable is numeric.
data example3;
input productID;
cards;
12345253

12354234
12354235
;
run;
The SUBSTR() / SUBSTRN() functions work only on character variables. To make this work for numeric variables, we first need to convert the numeric variable to character with the PUT() function.
data example3;
set example3;
x = put(productID, 10.);
referenceid = input(substrn(x,length(x)-3,4),8.);
drop x;
proc print noobs;
run;
Since SUBSTRN() returns a character value, we need to convert it back to numeric with the INPUT() function.

Another Method -
data example3;
set example3;
referenceid = input(substrn(productID,INT(LOG10(productID))+1-3,4),8.);
proc print noobs;
run;
How it works -

Step 1 : LOG10(1021) returns 3.009.
Step 2 : INT(LOG10(1021)) makes it 3.
Step 3 : Adding 1 to Step 2 makes it 4, which is the number of digits in 1021.

Predictive Modeling Interview Questions and Answers

Predictive modeling knowledge is one of the most sought-after skills today and is in demand in almost every domain, from finance and retail to manufacturing. It is seen as a method of solving complex business problems and helping businesses grow, e.g. a predictive acquisition model or an optimization engine to solve network problems.

It is not easy to get into these roles, as they require technical understanding of various statistical techniques and machine learning algorithms along with tools like SAS/R/Python. Hence, it is important to prepare well before going for an interview. To help you in interview preparation, I've jotted down the most frequently asked interview questions on logistic regression, linear regression and predictive modeling concepts. In general, an analytics interview process includes multiple rounds of discussion. Possible rounds are as follows -
  1. Technical Round on Statistical Techniques and Machine Learning Concepts
  2. Technical Round on Programming Languages such as SAS/R/Python/SQL
  3. Managerial Round on Business/Domain Knowledge
During these multiple rounds, interviewers also check your communication skills and logical / problem-solving skills.
Predictive Modeling Interview Questions

Let's start with a list of some basic and tricky predictive modeling interview questions with answers.

1. What are the essential steps in a predictive modeling project?

It consists of the following steps -
  1. Establish business objective of a predictive model
  2. Pull Historical Data - Internal and External
  3. Select Observation and Performance Window
  4. Create newly derived variables
  5. Split Data into Training, Validation and Test Samples
  6. Clean Data - Treatment of Missing Values and Outliers
  7. Variable Reduction / Selection
  8. Variable Transformation
  9. Develop Model
  10. Validate Model
  11. Check Model Performance
  12. Deploy Model
  13. Monitor Model

2. What are the applications of predictive modeling?

Predictive modeling is mostly used in the following areas -
  1. Acquisition - Cross Sell / Up Sell
  2. Retention - Predictive Attrition Model
  3. Customer Lifetime Value Model
  4. Next Best Offer
  5. Market Mix Model
  6. Pricing Model
  7. Campaign Response Model
  8. Probability of Customers defaulting on loan
  9. Segment customers based on their homogenous attributes
  10. Demand Forecasting
  11. Usage Simulation
  12. Underwriting
  13. Optimization - Optimize Network

3. Explain the problem statement of your project. What are the financial impacts of it?

Cover the objective or main goal of your predictive model. Compare the monetary benefits of the predictive model vs. no model. Also highlight the non-monetary benefits (if any).


4. Define observation and performance window?
Tutorial : Observation and Performance Window


5. Difference between Linear and Logistic Regression?

Two main differences are as follows -
  1. Linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups), while binary logistic regression requires the dependent variable to be binary - two categories only (0/1). Multinomial or ordinal logistic regression can have a dependent variable with more than two categories.
  2. Linear regression is based on least squares estimation, which says regression coefficients should be chosen in such a way that they minimize the sum of the squared distances of each observed response from its fitted value. Logistic regression is based on maximum likelihood estimation, which says coefficients should be chosen in such a way that they maximize the probability of Y given X (the likelihood).
Please note there are more than 10 differences between these two techniques; refer to the link below -

6. How to handle missing values?

We fill/impute missing values using the following methods, or make missing values a separate category (a small imputation sketch follows the list).
  1. Mean Imputation for Continuous Variables (No Outlier)
  2. Median Imputation for Continuous Variables (If Outlier)
  3. Cluster Imputation for Continuous Variables
  4. Impute Continuous Variables with Zero (Require business knowledge)
  5. Conditional Mean Imputation for Continuous Variables
  6. Other Imputation Methods for Continuous  - Predictive mean matching, Bayesian linear regression, Linear regression ignoring model error etc.
  7. WOE for missing values in categorical variables
  8. Decision Tree, Random Forest, Logistic Regression for Categorical Variables
  9. Decision Tree, Random Forest works for both Continuous and Categorical Variables
  10. Multiple Imputation Method
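As a quick illustration of the first two methods above, here is a minimal base R sketch; the data frame and the income values are made up for demonstration.
df <- data.frame(income = c(52000, NA, 61000, 58000, NA, 75000))
# Mean imputation (use when there are no outliers)
df$income_mean <- ifelse(is.na(df$income), mean(df$income, na.rm = TRUE), df$income)
# Median imputation (more robust when outliers are present)
df$income_median <- ifelse(is.na(df$income), median(df$income, na.rm = TRUE), df$income)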

7. How to treat outliers?

There are several methods to treat outliers (a percentile-capping sketch follows the list) -
  1. Percentile Capping
  2. Box-Plot Method
  3. Mean plus minus 3 Standard Deviation
  4. Weight of Evidence
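Here is a minimal base R sketch of percentile capping; the 1st and 99th percentile cutoffs are a common but arbitrary choice, and the data is simulated purely for illustration.
set.seed(5)
x <- c(rnorm(98), 25, -30)                  # simulated data with two extreme values
caps <- quantile(x, probs = c(0.01, 0.99))  # 1st and 99th percentile cutoffs
x_capped <- pmin(pmax(x, caps[1]), caps[2]) # values outside the cutoffs are capped
summary(x); summary(x_capped)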


8. Explain Dimensionality / Variable Reduction Techniques

Unsupervised Method (No Dependent Variable)
  1. Principal Component Analysis (PCA)
  2. Hierarchical Variable Clustering (Proc Varclus in SAS)
  3. Variance Inflation Factor (VIF)
  4. Remove zero and near-zero variance predictors
  5. Mean absolute correlation. Removes the variable with the largest mean absolute correlation. See the detailed explanation of mean absolute correlation
Supervised Method (In respect to Dependent Variable)

For Binary / Categorical Dependent Variable
  1. Information Value
  2. Wald Chi-Square
  3. Random Forest Variable Importance
  4. Gradient Boosting Variable Importance
  5. Forward/Backward/Stepwise - Variable Significance (p-value)
  6. AIC / BIC score
For Continuous Dependent Variable
  1. Adjusted R-Square
  2. Mallows' Cp Statistic
  3. Random Forest Variable Importance
  4. AIC / BIC score
  5. Forward / Backward / Stepwise - Variable Significance

9. Explain equation of logistic regression model

10. What is multicollinearity and how to deal it?

Multicollinearity implies high correlation between independent variables. The absence of multicollinearity is one of the assumptions in linear and logistic regression. It can be identified by looking at the VIF scores of the variables. VIF > 2.5 implies a moderate collinearity issue; VIF > 5 is considered high collinearity.

It can be handled by an iterative process: first remove the variable having the highest VIF, then check the VIF of the remaining variables. If any remaining VIF is > 2.5, repeat the same step until all VIFs are <= 2.5.


11. How VIF is calculated and interpretation of it?

VIF measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. If the VIF of a predictor variable is 9 (√9 = 3), the standard error for the coefficient of that predictor is 3 times as large as it would be if that predictor were uncorrelated with the other predictor variables.

Steps of calculating VIF (see the sketch after these steps)
  1. Run a linear regression in which one of the independent variables is treated as the target variable and all the other independent variables are the predictors
  2. Calculate the VIF of the variable: VIF = 1/(1 - RSquared)
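The two steps can be written out in a few lines of base R. This is a minimal sketch on simulated data; the variable names x1, x2, x3 and the coefficient are made up for illustration.
set.seed(1)
x2 <- rnorm(100); x3 <- rnorm(100)
x1 <- 0.6 * x2 + rnorm(100)   # x1 is deliberately correlated with x2
# Step 1: regress x1 on the remaining independent variables
r2 <- summary(lm(x1 ~ x2 + x3))$r.squared
# Step 2: VIF = 1 / (1 - RSquared)
vif_x1 <- 1 / (1 - r2)
vif_x1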

12. Do we remove intercepts while calculating VIF?

No. VIF depends on the intercept because there is an intercept in the regression used to determine VIF. If the intercept is removed, R-square is not meaningful and may even be negative, in which case one can get VIF < 1, implying that the standard error of a variable would go up if that independent variable were uncorrelated with the other predictors.

13. What is p-value and how it is used for variable selection?

The p-value is the lowest level of significance at which you can reject the null hypothesis. In the case of independent variables, it indicates whether the coefficient of a variable is significantly different from zero.

14. How AUC, Concordance and Discordance are calculated?


15. Explain important model performance statistics
  1. AUC > 0.7, with no significant difference between the AUC scores of training vs validation.
  2. KS should be in the top 3 deciles and should be more than 30.
  3. Rank ordering - no break in rank ordering.
  4. Same signs of parameter estimates in both training and validation.

16. Explain Gain and Lift Charts
Check out this tutorial : Understanding Gain and Lift Charts

17. Explain collinearity between continuous and categorical variables. Is VIF a correct method to compute collinearity in this case?

Collinearity between categorical and continuous variables is very common. The choice of reference category for dummy variables affects multicollinearity, which means changing the reference category of dummy variables can reduce collinearity. Pick a reference category with the highest proportion of cases.

VIF is not a correct method in this case; VIFs should only be run for continuous variables. The t-test method can be used to check collinearity between a continuous variable and a dummy variable.
We can also safely ignore collinearity between dummy variables. To avoid high VIFs in this case, just choose a reference category with a larger fraction of the cases (see the sketch below).
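In R, the reference category of a factor can be changed with the base function relevel(); the sketch below uses made-up data purely for illustration.
f <- factor(c("A", "B", "C", "A", "A", "B"))
table(f)                    # "A" has the largest fraction of cases
f <- relevel(f, ref = "A")  # make "A" the reference category
# model.matrix(~ f) now builds dummy variables against the "A" baseline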

18. Assumptions of Linear Regression Model
Linear Regression Explained

19. How WOE and Information Value are calculated?
WOE and Information Value Explained

20. Difference between Factor Analysis and PCA?

The 3 main differences between these two techniques are as follows -
  1. In Principal Components Analysis, the components are calculated as linear combinations of the original variables. In Factor Analysis, the original variables are defined as linear combinations of the factors.
  2. Principal Components Analysis is used as a variable reduction technique whereas Factor Analysis is used to understand what constructs underlie the data.
  3. In Principal Components Analysis, the goal is to explain as much of the total variance in the variables as possible. The goal in Factor Analysis is to explain the covariances or correlations between the variables.

Technical Interview Questions on SAS and R

The following is a list of SAS/R technical interview questions that are generally asked. It includes some tricky questions which require hands-on experience.

SAS
  1. Difference between INPUT and PUT Functions
  2. How to generate serial numbers with SAS
  3. Difference between WHERE and IF statements
  4. Difference between '+' operator and SUM Function
  5. Use of COALESCE Function
  6. Difference between FLOOR and CEIL functions
  7. How to use arrays to recode all the numeric variables
  8. Number of ways you can create macro variables
  9. Difference between MERGE and SQL Joins
  10. How to calculate cumulative sum in SAS
You will find answers to the above questions in the links below -

R
  1. Difference between sort() and order() functions
  2. Popular R packages for decision tree
  3. How to transpose data in R
  4. How to remove duplicates in R
  5. Popular packages to handle big data
  6. How to perform LEFT join in R
  7. How R handles missing values
  8. How to join vertically two data frames
  9. Use of with() and by() functions
  10. Use of which() function
Check out the link below for solutions to the above questions, plus other interview questions on R -

SQL and Excel

Prior to the interview, you can also look at questions on SQL concepts and advanced Excel. SQL and Excel are still the most widely used tools for basic and intermediate analytics.

R : If Else and Nested If Else

If Else and Nested If Else in R

If Else statements are an important part of R programming. In this tutorial, we will see various ways to apply conditional statements (If..Else If..Else) in R. R has a lot of powerful packages for data manipulation; in the later part of this tutorial, we will see how IF ELSE statements are used in popular packages.

Sample Data

Let's create sample data to demonstrate the IF ELSE statements. This data frame is used in the examples that follow.
x1   x2    x3
1    129   A
3    178   B
5    140   C
7    186   D
9    191   E
11   104   F
13   150   G
15   183   H
17   151   I
19   142   J

Run the program below to generate the above table in R.
set.seed(123)
mydata = data.frame(x1 = seq(1,20,by=2),
                    x2 = sample(100:200,10,FALSE),
                    x3 = LETTERS[1:10])
x1 = seq(1,20,by=2) : The variable 'x1' contains the odd numbers from 1 to 19 (alternate numbers starting from 1). In total, there are 10 numeric values.

x2 = sample(100:200,10,FALSE) : The variable 'x2' contains 10 non-repeating random numbers between 100 and 200.

x3 = LETTERS[1:10] : The variable 'x3' contains the first 10 letters of the alphabet, A through J.

Syntax of ifelse() function :

The ifelse() function in R works similar to MS Excel IF function. See the syntax below -
ifelse(condition, value if condition is true, value if condition is false)
Example 1 : Simple IF ELSE Statement

Suppose you are asked to create a binary variable - 1 or 0 based on the variable 'x2'. If value of a variable 'x2' is greater than 150, assign 1 else 0.
mydata$x4 = ifelse(mydata$x2>150,1,0)
In this case, it creates a variable x4 on the same data frame 'mydata'. The output is shown in the image below -
ifelse : Output
Create variable in a new data frame

Suppose you need to add the above created binary variable in a new data frame. You can do it by using the code below -
x = ifelse(mydata$x2>150,1,0)
newdata = cbind(x,mydata)
The cbind() function is used to combine two vectors, matrices or data frames by columns.

Apply ifelse() on Character Variables

If the variable 'x3' contains the character values 'A' or 'D', the variable 'x1' should be multiplied by 2; otherwise it should be multiplied by 3.
mydata$y = ifelse(mydata$x3 %in% c("A","D") ,mydata$x1*2,mydata$x1*3)
The output is shown in the table below

x1   x2    x3   y
1    129   A    2
3    178   B    9
5    140   C    15
7    186   D    14
9    191   E    27
11   104   F    33
13   150   G    39
15   183   H    45
17   151   I    51
19   142   J    57

Example 2 : Nested If ELSE Statement in R

Multiple If Else statements can be written similarly to Excel's IF function. In this case, we are telling R to multiply variable x1 by 2 if variable x3 contains 'A' or 'B'; if the value is 'C' or 'D', multiply it by 3; else multiply it by 4.
mydata$y = ifelse(mydata$x3 %in% c("A","B") ,mydata$x1*2,
                  ifelse(mydata$x3 %in% c("C","D"), mydata$x1*3,
                         mydata$x1*4))
Do you hate specifying data frame multiple times with each variable?

You can use with() function to avoid mentioning data frame each time. It makes writing R code faster.
mydata$y = with(mydata, ifelse(x3 %in% c("A","B") , x1*2,
                  ifelse(x3 %in% c("C","D"), x1*3, x1*4)))

Special Topics related to IF ELSE

In this section, we will cover the following topics -

  1. How to treat missing (NA) values in IF ELSE.
  2. How to use OR and AND operators in IF ELSE
  3. Aggregate or Summary Functions and IF ELSE Statement

Handle Missing Values

Incorrect Method
x = NA
ifelse(x==NA,1,0)
Result : NA
It should have returned 1.

Correct Method 
x = NA
ifelse(is.na(x),1,0)
Result : 1
The is.na() function tests whether a value is NA or not.

Use OR and AND Operators

The & symbol is used to perform AND conditions
ifelse(mydata$x1<10 & mydata$x2>150,1,0)
Result : 0 1 0 1 1 0 0 0 0 0

The | symbol is used to perform OR conditions
ifelse(mydata$x1<10 | mydata$x2>150,1,0)
Result : 1 1 1 1 1 0 0 1 1 0

Count cases where condition meets

In this example, we are counting the number of records where the condition is met.
sum(ifelse(mydata$x1<10 | mydata$x2>150,1,0))
Result : 7

If Else Statement : Another Style

There is one more way to define an if..else statement in R. This style is mostly used with conditional statements inside loops and R functions, in other words, when we need to perform various actions based on a condition.

Syntax -
if(condition) yes else no
k = 99
if(k > 100) 1 else 0
Result : 0

If..Else If..Else Statements
k = 100
if(k > 100){
  print("Greater than 100")
} else if (k < 100){
  print("Less than 100")
} else {
  print ("Equal to 100")
}
Result : "Equal to 100"

If Else in Popular Packages

1. dplyr package

if_else(condition, value if condition is true, value if condition is false, value if NA)

The following program checks whether a value is a multiple of 2
library(dplyr)
x=c(1,NA,2,3)
if_else(x%%2==0, "Multiple of 2", "Not a multiple of 2", "Missing")
Result :"Not a multiple of 2""Missing""Multiple of 2""Not a multiple of 2"

The %% symbol returns remainder after a value is divided by divisor. In this case, first element 1 is divided by 2.

2. sqldf package

We can write SQL queries in R using the sqldf package. In SQL, an If Else statement is written as CASE WHEN.
df=data.frame(k=c(2,NA,3,4,5))
library(sqldf)
sqldf(
  "SELECT *,
  CASE WHEN (k%2)=0  THEN 'Multiple of 2'
  WHEN  k is NULL  THEN 'Missing'
  ELSE 'Not a multiple of 2'
  END AS T
  FROM df"
)
Output

k    T
2    Multiple of 2
NA   Missing
3    Not a multiple of 2
4    Multiple of 2
5    Not a multiple of 2

Partial and Semipartial Correlation

Partial correlation measures the linear relationship between two variables while controlling for the effect of one or more other variables. In this tutorial, we will see the difference between partial and semipartial correlation, how these statistical metrics are calculated mathematically, and how they are used in regression analysis.

What is Partial Correlation?

Partial correlation explains the correlation between two continuous variables (let's say X1 and X2) holding X3 constant for both X1 and X2.

Partial Correlation Mathematical Formula

In this case, r12.3 is the correlation between variables x1 and x2 keeping x3 constant, and r13 is the correlation between variables x1 and x3.
Partial Correlation
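The formula can be checked in a few lines of base R on simulated data (the variables below are made up). Note that the partial correlation is also the plain correlation of the residuals after regressing both x1 and x2 on x3.
set.seed(7)
x3 <- rnorm(100)
x1 <- x3 + rnorm(100)
x2 <- x3 + rnorm(100)
r12 <- cor(x1, x2); r13 <- cor(x1, x3); r23 <- cor(x2, x3)
(r12 - r13 * r23) / sqrt((1 - r13^2) * (1 - r23^2))   # formula for r12.3
cor(resid(lm(x1 ~ x3)), resid(lm(x2 ~ x3)))           # residual version, same value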

Let's take an example -

Suppose we want to see the relationship between sales and the number of high-performing employees, keeping the promotion budget constant. In this case, sales is variable 1, high-performing employees is variable 2 and promotion budget is variable 3.
Formula : Compute Partial Correlation
Examples

  1. Relationship between the demand for coffee and tea, keeping the price of tea constant.
  2. Relationship between GMAT score and number of hours studied, keeping SAT score constant.
  3. Relationship between weight and number of meals consumed, while controlling for age.
  4. Relationship between bank deposits and interest rate, keeping the household rate constant.


What is Semipartial Correlation

Semipartial correlation measures the strength of linear relationship between variables X1 and X2 holding X3 constant for just X1 or just X2. It is also called part correlation.

Semipartial Correlation
In the above image,  r1(2.3) means the semipartial correlation between variables X1 and X2 where X3 is constant for X2.
Semi Partial Correlation Score
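By analogy, the semipartial correlation r1(2.3) can be computed in base R by residualizing only X2 on X3; the simulated variables below are for illustration only.
set.seed(7)
x3 <- rnorm(100)
x1 <- x3 + rnorm(100)
x2 <- x3 + rnorm(100)
cor(x1, resid(lm(x2 ~ x3)))   # x3 is partialled out of x2 only, not out of x1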

Difference between Partial and Semipartial Correlation
Partial correlation holds variable X3 constant for both of the other two variables, whereas semipartial correlation holds X3 constant for only one of them (either X1 or X2). Hence, it is called 'semi'partial.
Assumptions : Partial and Semipartial Correlation

  1. Variables should be continuous in nature, for example weight, GMAT score, sales etc.
  2. There should be a linear relationship between all three variables. If a variable has a non-linear relationship, transform it or ignore the variable.
  3. There should be no extreme values (i.e. outliers). If outliers are present, we need to treat them, either by percentile capping or by removing the outlier observations.
  4. The variables you want to hold constant can be one or more than one.


SAS Code : Partial Correlation Coefficient

In this example, we are checking association between height and weight keeping age constant.
PROC CORR data=sashelp.class;
 Var Height;
 With weight;
 Partial age;
 Run;

Partial Correlation Coefficient

The partial correlation coefficient between weight and height is 0.70467, holding age constant. The p-value for the coefficient is 0.0011, so we can reject the null hypothesis and conclude that the coefficient is significantly different from zero.

R Script : Partial Correlation Coefficient
# Load Library
library(ppcor)
# Read data
mydata=read.csv("C:\\Users\\Deepanshu\\Documents\\Example1.csv")
# Partial correlation between "height" and "weight" given "age"
with(mydata, pcor.test(Height,Weight,Age))


R Script : Semipartial Correlation Coefficient

Semi partial correlation - Age constant for Weight only
with(mydata, spcor.test(Height,Weight,Age))
Output
estimate    p.value statistic  n gp  Method
0.4118409 0.08947395  1.807795 19  1 pearson

The estimate value is Pearson Semipartial correlation coefficient.

Semipartial correlation coefficient - Age constant for Height only
with(mydata, spcor.test(Weight,Height,Age))
   estimate    p.value statistic  n gp  Method
1 0.4732797 0.04727912  2.149044 19  1 pearson

Squared Partial and Semipartial Correlation

In regression, squared partial and squared semipartial correlation coefficients are used.

Squared partial correlation tells us how much of the variance in the dependent variable (Y) that is not explained by variable X2 is explained by X1. In other words, it is the proportion of the variation in the dependent variable that was left unexplained by the other predictors / independent variables but has been explained by the independent variable X1.
Squared Partial and Semipartial Correlation

Here, R²y.12 is the r-squared from the regression model in which X1 and X2 are independent variables.

Squared semipartial correlation tells us the unique contribution of an independent variable to the total variation in the dependent variable. In other words, it is the increment in R-square when that independent variable is added.

Squared Partial correlation will always be greater than or equal to squared semi-partial correlation.
Squared Partial Correlation >= Squared Semi-partial Correlation

SAS Code  : Squared Partial and Semi-Partial Correlation

In PROC REG, the PCORR2 option tells SAS to produce squared-partial correlation and SCORR2 option tells SAS to produce squared semi-partial correlation. The STB option is used to generate standardized estimate and TOL is used to calculate tolerance.
Proc Reg data= Readin;
Model Overall = VAR1 - VAR5 / SCORR2 PCORR2 STB TOL ;
run;
Regression Output

The squared semipartial correlation between Overall and VAR1 tells us that the model R-square increases by 0.18325 when VAR1 is included in the model.

The squared partial correlation between Overall and VAR1 tells us that, of the variance in Overall not explained by the other independent variables, 43% is explained by VAR1.

Which indicates variable importance?

Squared Semipartial correlation indicates variable importance because it measures incremental value in R-Square. We can rank variables based on high to low values of squared semipartial correlation score.

Relationship between Squared Semipartial correlation and Standardized Estimate
Squared Semipartial Correlation = (Standardized Estimate)² * Tolerance
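This identity can be verified numerically; the minimal R sketch below uses simulated data and base lm(), with all names made up for illustration.
set.seed(42)
x1 <- rnorm(200); x2 <- 0.5 * x1 + rnorm(200)
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(200)
# Squared semipartial correlation of x1 = increment in R-square when x1 is added
sq_semi <- summary(lm(y ~ x1 + x2))$r.squared - summary(lm(y ~ x2))$r.squared
# Standardized estimate of x1 and its tolerance
beta_std  <- coef(lm(scale(y) ~ scale(x1) + scale(x2)))[2]
tolerance <- 1 - summary(lm(x1 ~ x2))$r.squared
all.equal(unname(sq_semi), unname(beta_std^2 * tolerance))   # TRUE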

Can individual squared semi-partial correlation add to R-squared?

The answer is NO. It is because the total variation in the dependent variable also includes a portion that is due to the correlation among the independent variables.

Facebook Data Mining using R

In this tutorial, we will see how to extract and analyze Facebook data using R. Facebook has crossed more than 1 billion active users and has gathered the most extensive data set ever about human behavior. In R, we can extract data from Facebook and then analyze it. Social media mining is one of the most interesting pieces of data science: you can analyze sentiment around an important event by pulling information about it from Facebook and deriving insights from the data in R.
Extract Facebook Data using R
Step by Step Guide : Extract Data from Facebook

Step I : Facebook Developer Registration

Go to https://developers.facebook.com and register yourself by clicking on the Get Started button at the top right of the page (see the snapshot below). It will then open a registration form, which you need to fill in to get yourself registered.
Facebook  Developer Registration



Step II : Add a new App

Once you are done with the registration shown in Step 1, click on the My Apps button (check out the snapshot below), then select Add a New App from the drop-down.

Facebook : My Apps

Then write a Display Name for the App ID (type any name), select a drop-down option in Category (choose Education) and press the Create App ID button.
Create a new App

Step 3 : Get App ID and App Secret

In this step, we need to note down our App ID and App Secret (Refer the screenshot below).
Fb App ID and App Secret

Step 4 : OAuth Settings

  1. On the left hand side menu, click on Add Product Button
  2. Click on Facebook Login link
  3. Under Settings, make sure YES is selected in Client OAuth Login
  4. Type http://localhost:1410/ in Valid OAuth redirect URIs box
  5. Click on Save Changes button

OAuth redirect URIs

If you don't enter the information correctly, you will get the following error -
Can't Load URL: The domain of this URL isn't included in the app's domains. To be able to load this URL, add all domains and subdomains of your app to the App Domains field in your app settings. 
Step 5 :  Write R Script

1. Install required packages

Go to R and install Rfacebook and RCurl packages. Run the following code to install them.
install.packages("Rfacebook")
install.packages("RCurl")
The Rfacebook package lets you access the Facebook App via R.

2. Load desired packages

In this step, we will load the above installed packages.
library(Rfacebook)
library(RCurl)
3. Paste your app id and app secret below 
fb_oauth <- fbOAuth(app_id="183xxxxxxxx3748", app_secret="7bfxxxxxxxxcf0",extended_permissions = TRUE)
Press ENTER in R Console or CTRL+ENTER in R Studio.

It would return the following message -
Copy and paste into Site URL on Facebook App Settings: http://localhost:1410/ 
When done, press any key to continue...
Waiting for authentication in browser...
Press Esc/Ctrl + C to abort

Authentication in Browser

Authentication Status

4. Check your profile account information
me <- getUsers("me",token=fb_oauth, private_info=TRUE)
me$name
[1] "Deepanshu Bhalla"


5. List of all the pages you have liked

Suppose you want to see all the pages you have liked in the past.
likes = getLikes(user="me", token = fb_oauth)
sample(likes$names, 10)
The sample() function is used to list some 10 random pages you have liked.

 [1] "The Hindu"                 "ADGPI - Indian Army"       "Brain Humor"             
 [4] "Jokes Corner"              "The New York Times"        "Oye! Extra Pen Hai?"     
 [7] "So You Think You Can Dance""Shankar Tucker"            "Rihanna"                 
[10] "Lindsey Stirling"


6. Update Facebook Status from R

You can also update status in Facebook via R.
updateStatus("this is just a test", token=fb_oauth)

7. Search Pages that contain a particular keyword
pages <- searchPages( string="trump", token=fb_oauth, n=200)
In the above code, we are telling R to search for all pages that contain 'trump' as a keyword. The n=200 refers to the number of pages to return.

It returns 16 variables. See the list of variables -

[1] "id"                 "about"              "category"       
 [4] "description"        "general_info"       "likes"           
 [7] "link"               "city"               "state"           
[10] "country"            "latitude"           "longitude"       
[13] "name"               "talking_about_count""username"       
[16] "website"
head(pages$name)
[1] "Donald J. Trump"                "Ivanka Trump"                 
[3] "President Donald Trump Fan Club""President Donald J. Trump"   
[5] "Donald Trump Is My President"   "Donald Trump For President"   


8. Extract list of posts from a Facebook page

See the status posted by BBC News. The facebook page name of BBC News is bbcnews.
page <- getPage(page="bbcnews", token=fb_oauth, n=200) 
Posts Details
The above image is truncated. It returns in total 11 variables. See the variables' list -

 [1] "from_id"       "from_name"     "message"       "created_time"
 [5] "type"          "link"          "id"            "story"     
 [9] "likes_count"   "comments_count""shares_count

9. Get all the posts from a particular date

You can also specify the beginning and end dates of the posts you want to extract.
page <- getPage("bbcnews", token=fb_oauth, n=100,
since='2016/06/01', until='2017/03/20')

10. Which of these posts got maximum likes?

To know the most popular BBCNews post, you can submit the following line of code.
summary = page[which.max(page$likes_count),]
summary$message
[1] "Could circular runways take off? (via BBC World Hacks)"

11. Which of these posts got maximum comments?

Some posts are not so popular in terms of likes, but they fetch the maximum comments; it might be because they are controversial.
summary1 = page[which.max(page$comments_count),]
summary1$message
[1] "When Angela Merkel met Donald J. Trump, did her reactions speak louder than words?"

12. Which post was shared the most?
summary2 = page[which.max(page$shares_count),]
summary2$message
[1] "Islam will be the world's largest religion by 2070, new research suggests."

13. Extract a list of users who liked the maximum liked posts

In terms of marketing or growth of a website, it is very important to know about the users who liked a certain post.
post <- getPost(summary$id[1], token=fb_oauth, comments = FALSE, n.likes=2000)
To view the list of people:
likes <- post$likes
head(likes)
Result - 
from_name           from_id
Tommy Johnson 10154527932013108
Mirtunjay Raj   399490251425210
Sony Joseph   142559101272027

Note - I have edited the IDs to maintain privacy


14. Extract FB comments on a specific post

To know what users think about a post, it is important to analyze their comments.
post <- getPost(page$id[1], token=fb_oauth, n.comments=1000, likes=FALSE)
comments <- post$comments
fix(comments)

15. What is the comment that got the most likes?
comments[which.max(comments$likes_count),]
16. What are the most common first names in the user list?
head(sort(table(users$first_name), dec=TRUE), n=3)
  David   John Daniel
    14     13     10

17. Extract Reactions for most recent post

Facebook has more than a like button. Last year, it launched emoji reactions (emoticons). If a post gets 1k likes, it does not mean everyone really loves the post; the reaction can be happy, sad or angry.
post <- getReactions(post=page$id[1], token=fb_oauth)
love_count = 60, haha_count = 286, wow_count = 62, sad_count = 169, angry_count = 532

End Notes

Social text mining has gained a lot of interest in the last couple of years. Every company has started analyzing customers' opinions about its products and what customers say about the company in the social media world. It helps the marketing team define marketing strategies and the development team modify upcoming products based on customer feedback.