
Case Study : Sentiment analysis using Python

In this article, we walk you through an application of topic modelling and sentiment analysis to solve a real-world business problem. This approach requires a one-time effort to build a robust taxonomy, which can then be updated regularly as new topics emerge. It is widely used in topic mapping tools. Please note that this is not a replacement for topic modelling methodologies such as Latent Dirichlet Allocation (LDA); it goes beyond them.
Text Mining Case Study using Python


Case Study : Topic Modeling and Sentiment Analysis

Suppose you are the head of the analytics team at a leading hotel chain, “Tourist Hotel”. Each day, you receive hundreds of reviews of your hotel on the company’s website and multiple other social media pages. The business faces a challenge of scale in analysing such data and identifying areas of improvement. You use a taxonomy-based approach to identify topics and then use the built-in functionality of the Python NLTK package to attribute sentiment to the comments. This will help you identify what customers like or dislike about your hotel.

Data Structure

The customer review data consists of a serial number (an arbitrary identifier that identifies each review uniquely) and a text field that holds the customer review.
Example : Sentiment Analysis

Steps to topic mapping and sentiment analysis

1. Identify Topics and Sub Topics
2. Build Taxonomy
3. Map customer reviews to topics
4. Map customer reviews to sentiment

Step 1 : Identifying Topics
The first step is to identify the different topics in the reviews. You can use simple approaches such as Term Frequency–Inverse Document Frequency (TF-IDF) or more popular methodologies such as LDA to identify the topics in the reviews. In addition, it is good practice to consult a subject matter expert in that domain to identify the common topics. For example, the topics in the “Tourist Hotel” example could be “Room Booking”, “Room Price”, “Room Cleanliness”, “Staff Courtesy”, “Staff Availability”, etc.

Step 2 : Build Taxonomy

I. Build Topic Hierarchy

Based on the topics from Step 1, build a taxonomy. A taxonomy can be considered a network of topics, sub topics and key words.
Topic Hierarchy
II. Build Keywords
The taxonomy is built in a CSV file format. There are 3 levels of key words for each sub topic, namely Primary key words, Additional key words and Exclude key words. The keywords for the topics need to be manually identified and added to the taxonomy file. TF-IDF, bigram frequencies and LDA methodologies can help you identify the right set of keywords. Although there is no single best way of building key words, below is a suggested approach.

i. Primary key words are the key words that are mostly specific to the topic. These key words need to be mutually exclusive across different topics as far as possible.

ii. Additional key words are specific to the sub topic. These key words need not be mutually exclusive between topics, but it is advised to maintain exclusivity between sub topics under the same topic. To explain further, say there is a sub topic “Price” under both the topics “Room” and “Food”; then the additional key words will overlap. This will not create any issue as the primary key words are mutually exclusive.

iii. Exclude key words are used relatively less than the other two types. They are useful when two sub topics have some overlap of additional words, or when a sub topic picks up unrelated comments. For example, if the sub topic “booking” incorrectly maps comments about taxi bookings as room bookings, such key words can be placed in the exclude list to solve the problem.

Snapshot of sample taxonomy:

Sample Taxonomy

Note: while building the key word list, you can put an “*” at the end of a word; it acts as a wildcard character. For example, all the different inflections of “clean” such as “cleaned”, “cleanly” and “cleanliness” can be handled by the single keyword “clean*”. If you need to add a phrase or any keyword with a special character in it, wrap it in quotes. For example, “online booking”, “Wi-Fi” etc. need to be in double quotes.


Benefits of using taxonomic approach
Topic modelling approaches identify topics based on the keywords that are present in the content. Novel keywords that are related to a topic but only appear in the future are not identified. There are also use cases where businesses want to track certain topics that may not always surface as topics in topic modelling approaches. A curated taxonomy handles both situations.

Step 3 : Map customer reviews to topic

Each customer comment is mapped to one or more sub topics. Some of the comments may not be mapped to any topic. Such instances need to be manually inspected to check whether we missed any topics in the taxonomy so that it can be updated. Generally, about 90% of the comments have at least one topic. The rest of the comments could be vague. For example, “it was good experience” does not tell us anything specific and it is fine to leave it unmapped.
Snapshot of how the topics are mapped:
Topic Mapping

Below is the Python code that maps reviews to categories. First, import all the libraries needed for this task (install them if needed).
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Download Datafiles
Customer Review
Taxonomy

Import reviews data
df = pd.read_csv("D:/customer_reviews.csv")
Import taxonomy
df_tx = pd.read_csv("D:/taxonomy.csv")

Build functions to handle the various repetitive tasks during the mapping exercise. The first function treats taxonomy keywords ending with an asterisk (*) as wildcards. It takes the keyword set and the comment words as input and uses a regular expression to check whether any comment word matches a wildcard keyword.
def asterix_handler(asterixw, lookupw):
    mtch = "F"
    for word in asterixw:
        for lword in lookupw:
            if(word[-1:]=="*"):
                if(bool(re.search("^"+ word[:-1],lword))==True):
                    mtch = "T"
                    break
    return(mtch)
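As a quick check (this example is not part of the original code), a wildcard keyword such as "clean*" should match a comment word like "cleanliness":
print(asterix_handler({'clean*'}, ['the', 'room', 'cleanliness', 'was', 'poor']))  # prints 'T'
print(asterix_handler({'clean*'}, ['friendly', 'staff']))                          # prints 'F'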
The next function removes all punctuation, which helps with data cleaning. You can edit the list of punctuation characters inside the function for your own custom punctuation removal.
def remov_punct(withpunct):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return(without_punct)

Function to remove just the quotes(""). This is different from the above as this only handles double quotes. Recall that we wrap phrases or key words with special characters in double quotes.
def remov_quote(withquote):
    quote = '"'
    without_quote = ""
    for char in withquote:
        if char not in quote:
            without_quote = without_quote + char
    return(without_quote)
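A quick illustration of both helpers (the inputs are hypothetical examples, not from the article's data):
print(remov_punct("Wi-Fi wasn't working!"))   # returns 'WiFi wasnt working'
print(remov_quote('"online booking"'))        # returns 'online booking'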

Split each document into sentences and append them one below the other for sentence-level topic mapping.
sentence_data = pd.DataFrame(columns=['slno','text'])
for d in range(len(df)):
    doc = df.iloc[d,1].split('.')
    for s in doc:
        temp = {'slno': [df['slno'][d]], 'text': [s]}
        sentence_data = pd.concat([sentence_data, pd.DataFrame(temp)])
    temp = ""

Drop empty text rows if any and export data
sentence_data['text'].replace('',np.nan,inplace=True);      
sentence_data.dropna(subset=['text'], inplace=True);

data = sentence_data
cat2list = list(set(df_tx['Subtopic']))
#data = pd.concat([data, pd.DataFrame(columns = list(cat2list))])
data['Category'] = 0
mapped_data = pd.DataFrame(columns = ['slno','text','Category'])
temp = pd.DataFrame()
for k in range(len(data)):
    comment = remov_punct(data.iloc[k,1])
    data_words = [str(x.strip()).lower() for x in str(comment).split()]
    data_words = list(filter(None, data_words))
    output = []

    for l in range(len(df_tx)):
        key_flag = False
        and_flag = False
        not_flag = False
        if (str(df_tx['PrimaryKeywords'][l]) != 'nan'):
            kw_clean = remov_quote(df_tx['PrimaryKeywords'][l])
        else:
            kw_clean = df_tx['PrimaryKeywords'][l]
        if (str(df_tx['AdditionalKeywords'][l]) != 'nan'):
            aw_clean = remov_quote(df_tx['AdditionalKeywords'][l])
        else:
            aw_clean = df_tx['AdditionalKeywords'][l]
        if (str(df_tx['ExcludeKeywords'][l]) != 'nan'):
            nw_clean = remov_quote(df_tx['ExcludeKeywords'][l])
        else:
            nw_clean = df_tx['ExcludeKeywords'][l]
        key_words = 'nan'
        and_words = 'nan'
        and_words2 = 'nan'
        not_words = 'nan'
        not_words2 = 'nan'

        if (str(kw_clean) != 'nan'):
            key_words = [str(x.strip()).lower() for x in kw_clean.split(',')]
            key_words2 = set(w.lower() for w in key_words)

        if (str(aw_clean) != 'nan'):
            and_words = [str(x.strip()).lower() for x in aw_clean.split(',')]
            and_words2 = set(w.lower() for w in and_words)

        if (str(nw_clean) != 'nan'):
            not_words = [str(x.strip()).lower() for x in nw_clean.split(',')]
            not_words2 = set(w.lower() for w in not_words)

        if (str(kw_clean) == 'nan'):
            key_flag = False
        elif set(data_words) & key_words2:
            key_flag = True
        elif asterix_handler(key_words2, data_words) == 'T':
            key_flag = True

        if (str(aw_clean) == 'nan'):
            and_flag = True
        elif set(data_words) & and_words2:
            and_flag = True
        elif asterix_handler(and_words2, data_words) == 'T':
            and_flag = True

        if (str(nw_clean) == 'nan'):
            not_flag = False
        elif set(data_words) & not_words2:
            not_flag = True
        elif asterix_handler(not_words2, data_words) == 'T':
            not_flag = True

        if (key_flag == True and and_flag == True and not_flag == False):
            output.append(str(df_tx['Subtopic'][l]))
            temp = {'slno': [data.iloc[k,0]], 'text': [data.iloc[k,1]], 'Category': [df_tx['Subtopic'][l]]}
            mapped_data = pd.concat([mapped_data, pd.DataFrame(temp)])
    #data['Category'][k] = ','.join(output)

#output mapped data
mapped_data.to_csv("D:/mapped_data.csv", index = False)

Step 4: Map customer reviews to sentiment
#read category mapped data for sentiment mapping
catdata = pd.read_csv("D:/mapped_data.csv")
Build a function that leverages the built-in NLTK functionality to identify sentiment. The output 1 means positive, 0 means neutral and -1 means negative. You can choose your own thresholds for positive, neutral and negative sentiment. If you have not used VADER before, you may need to run nltk.download('vader_lexicon') once.

def findpolar(test_data):
    sia = SentimentIntensityAnalyzer()
    polarity = sia.polarity_scores(test_data)["compound"]
    if(polarity >= 0.1):
        foundpolar = 1
    if(polarity <= -0.1):
        foundpolar = -1
    if(polarity > -0.1 and polarity < 0.1):
        foundpolar = 0
    return(foundpolar)
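A couple of illustrative calls (the review snippets are hypothetical; the exact compound scores depend on the VADER lexicon version):
print(findpolar("The staff were wonderful and very helpful"))  # expected 1 (positive)
print(findpolar("The booking process was a nightmare"))        # expected -1 (negative)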

Apply the function to the text of each mapped sentence (stored here in a new Sentiment column) and output the sentiment mapped data
catdata['Sentiment'] = catdata['text'].apply(findpolar)
catdata.to_csv("D:/sentiment_mapped_data.csv", index = False)
Output : Sentiment Analysis

Additional Reading

Polarity Scoring Explained: 

NLTK offers the Valence Aware Dictionary for sEntiment Reasoning (VADER) model, which identifies both the direction (polarity) and the magnitude (intensity) of the sentiment of a text. Below is a high-level explanation of the methodology.

VADER is a combination of lexical features and rules for identifying sentiment and intensity, so it does not need any training data. To explain further, for a sentence such as “the food is good”, it is easy to identify that the sentiment is positive. VADER goes a step further and identifies intensity using rule-based cues such as punctuation, capitalised words and degree modifiers.

The polarity scores for different variations of similar sentences are as follows:

Polarity Score
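You can reproduce this behaviour yourself; the sentences below are illustrative and the exact scores depend on your NLTK/VADER version:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')  # one-time download if the lexicon is missing
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("the food is good"))
print(sia.polarity_scores("the food is GOOD!!"))   # capitalisation and punctuation raise the intensity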

Use cases where training sentiment models is suggested over Sentiment Intensity Analyzer:

Although VADER works well across multiple domains, there could be some domains where it is preferable to build one’s own sentiment training models. Below are two examples of such use cases.
  1. Customer reviews of alcoholic beverages: It is common to see people using otherwise negative sentiment words to describe a positive experience. For example, the sentence “this sh*t is fu**ing good” means that the drink is good, but the VADER approach gives it a negative score, suggesting negative sentiment.

  2. Patient reviews regarding hospital treatment: A patient’s description of their problem is neutral in sentiment, but the VADER approach treats it as negative. For example, the sentence “I had an unbearable back pain and your medication cured me in no time” is given “-0.67”, suggesting negative sentiment.

Take Screenshot of Webpage using R

Programmatically taking screenshots of a web page is essential in a testing environment for checking how the page renders. The same capability can be used for automation, such as getting a screenshot of a news website into your inbox every morning or generating a report of candidates’ GitHub activities. This wasn’t possible from the command line until the rise of headless browsers and the JavaScript libraries supporting them. Even when such JavaScript libraries were made available, R programmers did not have any option to integrate that functionality into their code.
That is where webshot comes in: an R package that helps R programmers take web screenshots programmatically, with PhantomJS running in the backend.
Take Screenshot from R


What is PhantomJS?

PhantomJS is a headless webkit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

PhantomJS is an optimal solution for the following:
  • Headless website testing
  • Screen Capture
  • Page Automation
  • Network Monitoring

Webshot : R Package 

The webshot package allows users to take screenshots of web pages from R with the help of PhantomJS. It also can take screenshots of R Shiny App and R Markdown Documents (both static and interactive).

Install and Load Package

The stable version of webshot is available on CRAN hence can be installed using the below code:
install.packages('webshot')
library('webshot')

Also, the latest development version of webshot is hosted on github and can be installed using the below code:
#install.packages('devtools')
devtools::install_github('wch/webshot')

Initial Setup

As we saw above, the R package webshot works with PhantomJS in the backend, hence it is essential to have PhantomJS installed on the local machine where webshot package is used. To assist with that, webshot itself has an easy function to get PhantomJS installed on your machine.
webshot::install_phantomjs()
The above function automatically downloads PhantomJS from its website and installs it. Please note this is a one-time setup; once both webshot and PhantomJS are installed, these two steps can be skipped and the package can be used as described in the sections below.

Now, webshot package is installed and setup and is ready to use. To start with let us take a PDF copy of a web page.

Screenshot Function

The webshot package provides one simple function, webshot(), that takes a webpage URL as its first argument and saves the screenshot under the file name given as its second argument. It is important to note that the filename includes a file extension such as '.jpg', '.png' or '.pdf', which determines the format of the output file. Below is the basic structure of the function:
library(webshot)

#webshot(url, filename.extension)
webshot("https://www.listendata.com/", "listendata.png")

If no folder path is specified along with the filename, the file is downloaded in the current working directory which can be checked with getwd().

Now that we understand the basics of the webshot() function, it is time to work through our cases, starting with downloading/converting a webpage as a PDF copy.

Case #1: PDF Copy of WebPage

Let us assume, we would like to download Bill Gates' notes on Best Books of 2017 as a PDF copy.

#loading the required library
 library(webshot)

#PDF copy of a web page / article
 webshot("https://www.gatesnotes.com/About-Bill-Gates/Best-Books-2017",
 "billgates_book.pdf",
 delay = 2)

The above code generates a PDF whose (partial) screenshot is below:
Snapshot of PDF Copy

Dissecting the above code, we can see that the webshot() function is supplied with 3 arguments:
  1. URL from which the screenshot has to be taken. 
  2. Output Filename along with its file extensions. 
  3. Time to wait before taking screenshot, in seconds. Sometimes a longer delay is needed for all assets to display properly.
Thus, a webpage can be converted/downloaded as a PDF programmatically in R.

Case #2: Webpage Screenshot (Viewport Size)

Now, I'd like to get an automation script running to get screenshot of a News website and probably send it to my inbox for me to see the headlines without going to the browser. Here we will see how to get a simple screenshot of livemint.com an Indian news website.
#Screenshot of Viewport
webshot('https://www.livemint.com/','livemint.png', cliprect = 'viewport')
While the first two arguments are similar to the above function, there's a new third argument cliprect which specifies the size of the Clipping rectangle.

If cliprect is unspecified, a screenshot of the complete web page is taken (as in the previous case). Since we are interested only in the latest news (which is usually at the top of the website), we use cliprect with the value 'viewport', which clips only the viewport part of the browser, as below.

Screenshot of Viewport of Browser

Case #3: Multiple Selector Based Screenshots

So far we have taken simple screenshots of whole pages, dealing with one screenshot and one file, but that is not what usually happens when you are automating something. In most cases we end up performing more than one action, so this case deals with taking multiple screenshots and saving multiple files. Instead of taking screenshots of different URLs (which is quite straightforward), we will take screenshots of different sections of the same web page using different CSS selectors and save them in separate files.
#Multiple Selector Based Screenshots
webshot("https://github.com/hadley",
 file = c("organizations.png","contributions.png"),
 selector = list("div.border-top.py-3.clearfix","div.js-contribution-graph"))
In the above code, we take screenshots of two CSS selectors from the GitHub profile page of Hadley Wickham and save them in two PNG files: organizations.png and contributions.png.

Contributions.png

Organizations.png
Thus, we have seen how to use the R package webshot for taking screenshots programmatically in R. Hopefully this post helps fuel your automation needs and helps your organisation improve its efficiency.


Identify Person, Place and Organisation in content using Python

This article outlines the concept and Python implementation of Named Entity Recognition using StanfordNERTagger. Technical challenges such as installation issues, version conflicts and operating system issues that commonly affect this kind of analysis are out of scope for this article.

NER NLP using Python

Table of contents:

1. Named Entity Recognition defined
2. Business Use cases
3. Installation Pre-requisites
4. Python Code for implementation
5. Additional Reading: CRF model, Multiple models available in the package
6. Disclaimer

1. Named Entity Recognition Defined
Named Entity Recognition (NER) is the process of detecting and classifying proper names mentioned in a text. In simple words, it locates person names, organisations, locations and so on in the content. It is generally the first step in most Information Extraction (IE) tasks in Natural Language Processing.
NER Sample

2. Business Use Cases

There is a need for NER across multiple domains. Below are a few sample business use cases for your reference.
  1. Investment research: To identify company announcements, people’s reactions to them and their impact on stock prices, one needs to identify people and organisation names in the text
  2. Chat-bots in multiple domains: To identify places and dates for booking hotel rooms, air tickets etc.
  3. Insurance domain: To identify and mask people’s names in feedback forms before analysis. This is needed for regulatory compliance (example: HIPAA)

3. Installation Prerequisites
1. Download the Stanford NER tagger package (a zipped folder) from the Stanford NLP website.
2. Unzip the zipped folder and save it in a drive.
3. Copy “stanford-ner.jar” from the folder and save it just outside the folder as shown in the image.
4. Download the caseless models from https://stanfordnlp.github.io/CoreNLP/history.html by clicking on “caseless” as shown below. The standard models work as well; however, the caseless models help in identifying named entities even when they are not capitalised as required by formal grammar rules.
5. Save the folder in the same location as the Stanford NER folder for ease of access.
Stanford NER Installation - Step 1
NER Installation - Step 2
4. Python Code for implementation:
#Import all the required libraries.
import os
from nltk.tag import StanfordNERTagger
import pandas as pd

#Set environmental variables programmatically.
#Set the classpath to the path where the jar file is located
os.environ['CLASSPATH'] = "<path to the file>/stanford-ner-2015-04-20/stanford-ner.jar"

#Set the Stanford models to the path where the models are stored
os.environ['STANFORD_MODELS'] = '<path to the file>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner'

#Set the java jdk path
java_path = "C:/Program Files/Java/jdk1.8.0_161/bin/java.exe"
os.environ['JAVAHOME'] = java_path


#Set the path to the model that you would like to use
stanford_classifier  =  '<path to the file>/stanford-corenlp-caseless-2015-04-20-models/edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz'

#Build NER tagger object
st = StanfordNERTagger(stanford_classifier)

#A sample text for NER tagging
text = 'srinivas ramanujan went to the united kingdom. There he studied at cambridge university.'

#Tag the sentence and print output
tagged = st.tag(str(text).split())
print(tagged)

Output
[(u'srinivas', u'PERSON'), 
(u'ramanujan', u'PERSON'),
(u'went', u'O'),
(u'to', u'O'),
(u'the', u'O'),
(u'united', u'LOCATION'),
(u'kingdom.', u'LOCATION'),
(u'There', u'O'),
(u'he', u'O'),
(u'studied', u'O'),
(u'at', u'O'),
(u'cambridge', u'ORGANIZATION'),
(u'university', u'ORGANIZATION')]
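The tagger returns one tuple per token. If you want entity-level output, a small helper (not part of the original article) can group consecutive tokens that share the same tag:
from itertools import groupby

def extract_entities(tagged_tokens):
    entities = []
    # group consecutive tokens with the same tag and drop the 'O' (other) tag
    for tag, group in groupby(tagged_tokens, key=lambda x: x[1]):
        if tag != 'O':
            entities.append((' '.join(word for word, _ in group), tag))
    return entities

print(extract_entities(tagged))
# e.g. [('srinivas ramanujan', 'PERSON'), ('united kingdom.', 'LOCATION'), ('cambridge university', 'ORGANIZATION')]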

5. Additional Reading

The Stanford NER algorithm leverages a general implementation of linear chain Conditional Random Fields (CRFs) sequence models. CRFs may look similar to Hidden Markov Models (HMMs) but are quite different.

Below are some key points to note about the CRFs in general.
  1. It is a discriminative model, unlike the HMM, and thus models the conditional probability directly
  2. It does not assume independence of features, unlike the HMM. This means that the current word, previous word and next word can all be considered as features in the model
  3. Relative to HMMs or Maximum Entropy Markov Models, CRFs are the slowest

6. Disclaimer
This article explains the implementation of the Stanford NER algorithm for research purposes and does not promote it for commercial gain. For any questions on the commercial aspects of implementing this algorithm, please contact Stanford University.

Regex Tutorial with Examples

This tutorial covers various concepts of regular expression (regex) with hands-on examples. It also includes usage of regex using various tools such as R and Python.

Introduction

Regex is short for 'Regular Expression'. It is mainly used to extract sub-strings from a string by searching for a specific search pattern. The search pattern is defined by a regular expression.

The search pattern can be a single letter, a fixed string, or a complex pattern consisting of numeric, punctuation and character values.
Regular expressions can be used to search and replace text.
Regex Made Easy


Uses of Regular expression

There are several use-cases of regular expressions in the real world. Some of them are as follows -
  1. Fetch email addresses mentioned in the long paragraph
  2. Validate 10-digit phone number, Social Security Number and email address
  3. Extract text from HTML or XML code
  4. Rename multiple files at a single run
  5. Remove punctuation specified in the text
  6. Web scraping : Searching specific content from all the web pages that contain a specific string
  7. Replace complex pattern with blank or specific character


Let's start with the basics

1. Anchor and Word Boundaries

Symbol   Description
^        Beginning of line
$        End of line
\b       Whole word

Examples

1. ^abc matches the string that begins with abc in text 'abcd'

2. ^the matches the string that starts with the in text 'the beginning'

3. done$ matches the string that ends with done in text 'I am done'

4. \ban\b matches the whole word an in text 'Elephant an animal'
\ban\b does not match the 'an' inside 'Elephant' or 'animal' as it only performs whole-word matching.

2. OR Condition

OR condition can be defined by symbols | or [ ]. See the examples below.

1. the[m|n] matches strings them or then in text 'them then there theme'

2. the[mn] is equivalent to the[m|n]

3. \bthe[mn]\b matches the complete them or then in text 'them then there theme'

3. Case Insensitive

The search patterns mentioned in all of the above examples are case-sensitive. To make a pattern case-insensitive, we use the expression (?i)

1. (?i)abc matches both abc and ABC in text 'abc ABC'

2. (?i)a[bd]a performs insensitive match 'a' followed by either b or d and then a in text 'abc ABA Ada'

4. Quantifiers

It talks about quantity of element(s). In simple words, it means how often a particular regex element can occur.
Expression   Description
*            Item occurs zero or more times
+            Item occurs one or more times
?            Item occurs zero or one time
{A}          Item occurs A number of times
{A,B}        Item occurs between A and B times
.            Any character
.*           Matches zero or more of any character

1. def* matches strings that contain de followed by f zero or more times. Example - de def deff defff

2. def+ matches strings having de followed by f at least one time. Example - def deff defff

3. \bdef?\b matches whole words consisting of de followed by f zero or one time. Example - de def

4. \bdef{2}\b matches whole words consisting of de followed by f exactly two times. Example - deff

5. \bdef{2,}\b matches whole words consisting of de followed by f two or more times. Example - deff defff

6. \bdef{3,4}\b matches whole words consisting of de followed by f either 3 or 4 times. Example - defff deffff

7. a.* matches a and all characters after it

5. Create Grouping

By wrapping part of a regular expression inside ( ), you create a group, which lets you apply an OR condition to a portion of the regex or apply a quantifier to the entire group.

It also helps to extract a portion of information from strings.

ab(cd|de)* matches strings having ab followed by either cd or de zero or more times.

6. Back Reference

(name)\1 matches 'namename' : the group captures 'name' and the back-reference \1 then matches the same text again.

Replace (Substitution) using Back-reference

(ab|cd)e(fg|hi) matches either ab or cd then followed by e then either fg or hi
Enter \1\2 in substitution, it will return values of first and second group.
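The same substitution with group references can be done in Python with the re module (illustrative sketch):
import re

print(re.sub(r'(ab|cd)e(fg|hi)', r'\1\2', 'abefg cdehi'))  # returns 'abfg cdhi'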

7. Lazy Quantifier

A lazy (or non-greedy) quantifier matches a regex element as few times as possible, whereas a greedy quantifier matches it as many times as possible.
You can convert a greedy quantifier into a lazy quantifier by simply adding a ?

<.*?> matches the shortest possible string enclosed in < and >.
Regex lazy quantifier
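A short Python illustration of greedy versus lazy matching (the example string is hypothetical):
import re

html = "<b>bold</b> and <i>italic</i>"
print(re.findall("<.*>", html))   # greedy : ['<b>bold</b> and <i>italic</i>']
print(re.findall("<.*?>", html))  # lazy   : ['<b>', '</b>', '<i>', '</i>']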


8. How to match the literal meaning of dot and asterisk

By using a backslash \ you can escape special characters such as the asterisk and dot. In other words, it makes the regex treat the character literally.
abc\* matches the literal text abc*, whereas abc* would match ab, abc, abcc and so on.

9. POSIX Regular Expressions

POSIX character classes are written inside square brackets. Like regular expressions, they match characters, digits, punctuation and more.
POSIX       Description                                        ASCII
[:digit:]   Digits                                             [0-9]
[:lower:]   Lowercase letters                                  [a-z]
[:upper:]   Uppercase letters                                  [A-Z]
[:alpha:]   Lower and uppercase letters                        [a-zA-Z]
[:alnum:]   Lower and uppercase letters and digits             [a-zA-Z0-9]
[:blank:]   Space and tab                                      [ \t]
[:space:]   All whitespace characters, including line breaks   [ \t\r\n\v\f]
[:punct:]   Punctuation                                        [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]

Select string having first letter character followed by numeric
[[:alpha:]][[:digit:]]+
  1. [[:alpha:]] means any letter character
  2. [[:digit:]] means any digit
  3. + means the previous element one or more times

How to use regex with R and Python

R

1. grep(pattern, x)
Search for a particular pattern in each element of a vector x

2. gsub(pattern, replacement, x)
Replace a particular pattern in each element of a vector x
x = "sample text B2 testing B52"
gsub('[[:alpha:]][[:digit:]]+', '',x)

Python

The package re can be used for regular expressions in Python.

1. re.search(pattern, x)
Search for a particular pattern in each element of a vector x

2. re.sub(pattern, replacement, x)
Replace a particular pattern in each element of a vector x
import re
x = 'Welcome to Python3.6'
re.sub( '[a-zA-Z]+[0-9|.]+','', x)
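re.search() returns a match object (or None if there is no match); a minimal sketch of using it:
import re

x = 'Welcome to Python3.6'
m = re.search(r'[0-9]+\.[0-9]+', x)
if m:
    print(m.group())  # prints '3.6'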

Exercises : Regular Expression

1. Replace abbreviation of thousand (K) with 000?

x = "K 25K 2K"
Desired Output : K 25000 2000

Solution
gsub('([0-9])K', '\\1000',x)

Two backslashes are used because a single backslash is not allowed in R string literals

2. Remove extra characters

x = "var1_avg_a1 var1_a_avg_7"
Desired Output :var1 var1_a

Solution
gsub('_avg_.*?[0-9]', '',x)

The ? makes the quantifier non-greedy (lazy)

Install and Load Multiple R Packages

In an enterprise environment, we generally need to automate the process of installing multiple R packages so that users do not have to install them separately before submitting your program.

The function below performs the following operations -
  1. First it finds all the already installed R packages
  2. Checks whether the packages we want to install are already installed
  3. If a package is already installed, it does not install it again
  4. If a package is missing (not installed), it installs the package
  5. Loops through steps 2, 3 and 4 for all the packages we want to install
  6. Loads all the packages (both the already available ones and the new ones)

Install_And_Load <- function(packages) {
  k <- packages[!(packages %in% installed.packages()[,"Package"])];
  if(length(k))
  {install.packages(k, repos='https://cran.rstudio.com/');}

  for(package_name in packages)
  {library(package_name,character.only=TRUE, quietly = TRUE);}
}
Install_And_Load(c("fuzzyjoin", "quanteda", "stringdist", "stringr", "stringi"))

Explanation

1. installed.packages() returns details of all the already installed packages. installed.packages()[,"Package"] returns names of these packages.

To see version of the packages, submit the following command
installed.packages()[,c("Package","Version")]
2. You can use any of the following repositories (URLs of CRAN mirrors). Try another of these 3 repositories if one of them is blocked in your company due to firewall restrictions.
https://cloud.r-project.org
https://cran.rstudio.com
http://www.stats.ox.ac.uk/pub/RWin
3. quietly = TRUE tells R not to print errors/warnings if attaching (loading) a package fails.

How to check version of R while installation

In the program below, the package RDCOMClient is installed from the repository http://www.omegahat.net/R if the R version is greater than or equal to 3.5; otherwise it is installed from the repository http://www.stats.ox.ac.uk/pub/RWin.
if (length("RDCOMClient"[!("RDCOMClient" %in% installed.packages()[,"Package"])])) {
  if (as.numeric(R.Version()$minor)>= 5)
    install.packages("RDCOMClient", repos = "http://www.omegahat.net/R")
  else
    install.packages("RDCOMClient", repos = "http://www.stats.ox.ac.uk/pub/RWin")
}
library("RDCOMClient")

Add JavaScript and CSS in Shiny

In this tutorial, I will cover how to include your own JavaScript, CSS and HTML code in your R shiny app. By including them, you can make a very powerful professional web app using R.

First let's understand the basics of a Webpage

In general, a web page contains the following kinds of detail.
  1. Content (Header, Paragraph, Footer, Listing)
  2. Font style, color, background, border
  3. Images and Videos
  4. Popups, widgets, special effects etc.

HTML, CSS and JavaScript

These 3 web programming languages together take care of all the information a webpage contains, from the text to the special effects.
  1. HTML determines the content and structure of a page (header, paragraph, footer etc.)
  2. CSS controls how webpage would look like (color, font type, border etc.)
  3. JavaScript decides advanced behaviors such as pop-up, animation etc.
Make JavaScript, CSS work for Shiny
Fundamentals of Webpage
One of the most common web development terms you should know is rendering. It is the act of putting together a web page for presentation.
Shiny Dashboard Syntax

In this article, I will use the shinydashboard library as it gives a more professional and elegant look to the app. The structure of shinydashboard syntax is similar to the shiny library; both require ui and server components, but the functions are totally different. Refer to the code below, and make sure to install the library before running the program.
# Load Library
library(shiny)
library(shinydashboard)

# User Interface
ui =
dashboardPage(
dashboardHeader(title = "Blank Shiny App"),
dashboardSidebar(),
dashboardBody()
)

# Server
server = function(input, output) { }

# Run App
runApp(list(ui = ui, server = server), launch.browser =T)

Example : Create Animation Effect

The program below generates an animation in the web page. When the user hits the "Click Me" button, it triggers the demojs() JavaScript function, which starts the animation. It's a very basic animation; you can edit the code and make it as complex as you want.

HTML

CSS

#sampleanimation {
width: 50px;
height: 50px;
position: absolute;
background-color: blue;
}

#myContainer {
width: 400px;
height: 400px;
position: relative;
background: black;
}

JS

function demojs() {
  var elem = document.getElementById('sampleanimation');
  var position = 0;
  var id = setInterval(frame, 10);
  function frame() {
    if (position == 350) {
      clearInterval(id);
    } else {
      position++;
      elem.style.top = position + 'px';
      elem.style.left = position + 'px';
    }
  }
}

There are several ways to include custom JavaScript and CSS codes in Shiny. Some of the common ones are listed below with detailed explanation -

Method I : Use tags to insert HTML, CSS and JS Code in Shiny


HTML
tags$body(HTML("Your HTML Code"))
CSS
tags$head(HTML("<style type='text/css'>
Your CSS Code
</style>"))
OR

CSS code can also be defined using tags$style. 
tags$head(tags$style(HTML(" Your CSS Code ")))

JS
tags$head(HTML("<script type='text/javascript'>
Your JS Code
</script>"))

OR

JS code can be described with tags$script.
tags$head(tags$script(HTML(" Your JS Code ")))

Code specified in tags$head is included and executed inside <head> </head>. Similarly, tags$body can be used to make shiny run code within <body> </body>.

tags$head vs. tags$body

In general, JavaScript and CSS files are defined inside <head> </head>. Things which we want to display under body section of the webpage should be defined within <body> </body>.

Animation Code in Shiny



Important Note
In JS, CSS and HTML code, make sure to replace double quotation marks with single quotation marks inside shiny's HTML("") function, as a double quotation mark would be interpreted as closing the function's string argument.

Method II : Call JavaScript and CSS files in Shiny

You can use the includeScript( ) and includeCSS( ) functions to pull JS and CSS code from files saved in your local directory. You can save the files anywhere and pass their file locations to the functions.

How to create JS and CSS files manually
Open notepad and paste JS code and save it with .js file extension and file type "All files" (not text document). Similarly you can create css file using .css file extension.


When to use Method 2?
When you want to include a big (lengthy) piece of JS / CSS code, use method 2. Method 1 should be used for small code snippets, as RStudio does not support syntax colouring and error-checking of JS / CSS code inside strings. Inline code also makes the app unnecessarily lengthy, which makes it difficult to maintain.

Method III : Add JS and CSS files under www directory

Step 1 : 
Create an app using shinyApp( ) function and save it as app.R. Refer the code below.



Step 2 :
Create a folder named www in your app directory (where your app app.r file is stored) and save .js and .css files under the folder. Refer the folder structure below.
├── app.R
└── www
└── animate.js
└── animation.css

Step 3 :
Submit runApp( ) function. Specify path of app directory.
runApp(appDir = "C:/Users/DELL/Documents", launch.browser = T)

Method IV : Using Shinyjs R Package

The shinyjs package allows you to perform the most frequently used JavaScript tasks without knowing any JavaScript programming at all. For example, you can hide, show or toggle an element. You can also enable or disable inputs.

Example : Turn content on and off by pressing the same button

Make sure to install shinyjs package before loading it. You can install it by using install.packages("shinyjs").

Important Point : Use function useShinyjs( ) under dashboardBody( ) to initialize shinyjs library



In the above program, we have used toggle( ) function to turn content on and off.


Example : Enable or disable Numeric Input based on checkbox selection



Communication between R and JavaScript

You can also define and call your own JavaScript functions using the shinyjs package with the extendShinyjs( ) function inside dashboardBody( ).
  1. Make sure to define custom JavaScript function beginning with word shinyjs
  2. JS function should be inside quotes
  3. In server, you can call the function by writing js$function-name
The program below closes the app when the user clicks on the action button.



End Notes

With the huge popularity of JavaScript and many recent advancements, it is recommended to learn the basics of JavaScript so that you can use them in your R Shiny apps. According to surveys, JavaScript is used by about 95% of websites. Its huge popularity comes from its broad, active developer community and its use by big players like Google, Facebook and Microsoft.
Do comment on how you use shiny apps in the comment box below. If you are a beginner and want to learn to build a web app using shiny, check out this tutorial.

Install Python Package

Python is one of the most popular programming languages for data science and analytics. It is widely used for a variety of tasks in startups and in many multi-national organisations. The beauty of the language is that it is open-source, which means it is available for free and has a very active community of developers across the world. Python developers share their solutions in the form of packages or modules with other Python users. This tutorial explains the various ways to install a Python package.

Ways to Install Python Package


Method 1 : If Anaconda is already installed on your System

Anaconda is a data science platform that comes with popular Python packages pre-installed and a powerful IDE (Spyder) that has a user-friendly interface for writing Python scripts.

If Anaconda is installed on your system (laptop), click on Anaconda Prompt as shown in the image below.

Anaconda Prompt

To install a python package or module, enter the code below in Anaconda Prompt -
pip install package-name
Install Python Package using PIP Windows

Method 2 : NO Need of Anaconda


1. Open RUN box using shortcut Windows Key + R

2. Enter cmd in the RUN box
Command Prompt

Once you press OK, it will show command prompt screen.



3. Search for folder named Scripts where pip applications are stored.

Scripts Folder

4. In command prompt, type cd <file location of Scripts folder>

cd refers to change directory.

For example, folder location is C:\Users\DELL\Python37\Scripts so you need to enter the following line in command prompt :
cd C:\Users\DELL\Python37\Scripts 

Change Directory

5. Type pip install package-name

Install Package via PIP command prompt


Syntax Error : Installing Package using PIP

Some users face the error "SyntaxError: invalid syntax" when installing packages. To work around this issue, use the command line below -
python -m pip install package-name
python -m pip tells python to import a module for you, then run it as a script.

Install Specific Versions of Python Package
python -m pip install Packagename==1.3     # specific version
python -m pip install "Packagename>=1.3"  # version greater than or equal to 1.3

How to load or import package or module

Once a package is installed, the next step is to import it so it can be used. There are several ways to load a package or module in Python:

1. import math loads the module math. Then you can use any function defined in math module using math.function. Refer the example below -
import math
math.sqrt(4)

2. from math import * loads the module math. Now we don't need to prefix the module name to use its functions.
from math import *
sqrt(4)

3. from math import sqrt, cos imports the selected functions of the module math.

4. import math as m imports the math module under the alias m.
m.sqrt(4)

Other Useful Commands
Description                            Command
To uninstall a package                 pip uninstall package
To upgrade a package                   pip install --upgrade package
To search for a package                pip search "package-name"
To list all the installed packages     pip list

PIP connection Error : SSL CERTIFICATE VERIFY FAILED

The most common issue when installing a Python package on a company network is failure to verify the SSL certificate. Companies sometimes block certain websites on their network so that employees can't access them; whenever employees try to visit these websites, they see "Access Denied because of company's policy". This causes a connection error when reaching the main Python package index.

Error looks like this :

Could not fetch URL https://pypi.python.org/: connection error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:598)

PIP SSL Certification Issue


Solution :

Run the following command. Make sure to specify package name in <package_name>
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org <package_name> -vvv
Suppose you want to install pandas package, you should submit the following line of command
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pandas -vvv

The --trusted-host option marks the host as trusted even when it does not have a valid HTTPS certificate (or any at all).

Create Dummy Data in Python

This article explains various ways to create dummy or random data in Python for practice. Like R, we can create dummy data frames using the pandas and numpy packages. Many analysts prepare data in MS Excel and later import it into Python to hone their data wrangling skills, which is not an efficient approach. The efficient approach is to prepare random data in Python and use it later for data manipulation.

1. Enter Data Manually in Editor Window

The first step is to load the pandas package and use the DataFrame function.
import pandas as pd
data = pd.DataFrame({"A" : ["John","Deep","Julia","Kate","Sandy"],
"MonthSales" : [25,30,35,40,45]})
       A  MonthSales
0 John 25
1 Deep 30
2 Julia 35
3 Kate 40
4 Sandy 45

Note : Character values should be defined in single or double quotes.

2. Prepare Data using sequence of numeric and character values

Let's import two popular Python packages for this task: string and numpy. The string package is used to generate a series of alphabet letters, whereas the numpy package is used to generate a sequence of numbers incremented by a specific value.
import pandas as pd
import string
import numpy as np
data2 = pd.DataFrame({"A": np.arange(1,10,2),
"B" : list(string.ascii_lowercase)[0:5],
})
   A  B
0 1 a
1 3 b
2 5 c
3 7 d
4 9 e

Explanation
1. np.arange(1,10,2) tells Python to generate values between 1 and 10 (excluding 10), incremented by 2.
2. string.ascii_lowercase returns abcdefghijklmnopqrstuvwxyz. list(string.ascii_lowercase)[0:5] is used to pick the first 5 letters.

3. Generate Random Data

In numpy, there are many functions to generate random values. The two most popular random functions are random.randint( ) and random.normal( )
import pandas as pd
import numpy as np
np.random.seed(1)
data3 = pd.DataFrame({"C" : np.random.randint(low=1, high=100, size=10),
"D" : np.random.normal(0.0, 1.0, size=10)
})
    C         D
0 38 -0.528172
1 13 -1.072969
2 73 0.865408
3 10 -2.301539
4 76 1.744812
5 6 -0.761207
6 80 0.319039
7 65 -0.249370
8 17 1.462108
9 2 -2.060141

Explanation
np.random.seed(1) tells Python to generate the same random values with this seed each time you run the code. np.random.randint(low=1, high=100, size=10) returns 10 random integers between 1 and 100. np.random.normal(0.0, 1.0, size=10) returns 10 random values drawn from a standard normal distribution with mean 0 and standard deviation 1.

Check mean and standard deviation of normal distribution
np.round(np.std(np.random.normal(0.0, 1.0, size=1000)))
np.round(np.mean(np.random.normal(0.0,1.0, size=1000)))
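Resetting the same seed reproduces the same values, which is handy when sharing practice data (short sketch):
import numpy as np

np.random.seed(1)
print(np.random.randint(low=1, high=100, size=5))
np.random.seed(1)
print(np.random.randint(low=1, high=100, size=5))  # identical output because the seed was reset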


4. Create Categorical Variables

In this step, we will create two types of categorical variables :
  • Categories ranging from 1 to 4
  • Binary variable (0 / 1)
import pandas as pd
import numpy as np
np.random.seed(1)
data4 =pd.DataFrame({"X" : np.random.choice(range(1,5), 20, replace=True),
"X1" : np.where(np.random.normal(0.0, 1.0, size=20)<=0,0,1)})
    X  X1
0 2 1
1 4 0
2 1 1
3 1 0
4 4 1
5 2 0
6 4 0
7 2 0
8 4 1
9 1 0
10 1 0
11 2 0
12 1 1
13 4 1
14 2 0
15 1 1
16 3 1
17 2 1
18 3 1
19 1 0

Explanation
  1. np.random.choice(range(1,5), 20, replace=True) generates 20 values from 1 to 4 (excluding 5) with replacement (i.e. repeated values are allowed).
  2. np.where(np.random.normal(0.0, 1.0, size=20)<=0,0,1) means: if the random value is zero or negative, make it 0, otherwise 1. np.where( ) is used to construct an IF-ELSE statement in Python.
Like R's factor( ) function, you can define variable(s) as categorical variables. See the code below.
data4.X  = data4.X.astype("category")
data4.X1 = data4.X1.astype("category")
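Putting the pieces together, here is a minimal sketch (column names are arbitrary) of a dummy dataset that also includes a date column via pd.date_range:
import pandas as pd
import numpy as np

np.random.seed(1)
n = 10
dummy = pd.DataFrame({
    "date"  : pd.date_range("2019-01-01", periods=n, freq="D"),
    "sales" : np.random.randint(low=1, high=100, size=n),
    "flag"  : np.random.choice([0, 1], size=n)
})
dummy["flag"] = dummy["flag"].astype("category")
print(dummy.head())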

5. Import CSV or Excel File

Using the pandas functions read_csv( ) and read_excel( ), you can read data from a CSV or Excel file into Python.
import pandas as pd
mydata= pd.read_csv("C:\\Users\\Deepanshu\\samplefile.csv")
mydata = pd.read_excel("C:\\Users\\Deepanshu\\samplefile.xlsx")

Detailed Tutorial : How to import data in Python

Loops in Python explained with examples

This tutorial covers various ways to execute loops in Python. Loops are an important concept in any programming language: they perform iterations, i.e. run specific code repeatedly until a certain condition is reached.

1. For Loop

Like R and the C programming language, you can use a for loop in Python. It is one of the most commonly used loop constructs for automating repetitive tasks.

How for loop works?

Suppose you are asked to print sequence of numbers from 1 to 9, increment by 2.
for i in range(1,10,2):
    print(i)
Output
1
3
5
7
9
range(1,10,2) means starts from 1 and ends with 9 (excluding 10), increment by 2.

Iteration over list
This section covers how to run for in loop on a list.
mylist = [30,21,33,42,53,64,71,86,97,10]
for i in mylist:
    print(i)
Output
30
21
33
42
53
64
71
86
97
10

Suppose you need to select every 3rd value of list.
for i in mylist[::3]:
    print(i)
Output
30
42
71
10
mylist[::3] is equivalent to mylist[0::3] which follows this syntax style list[start:stop:step]

Python Loop Explained with Examples

Example 1 : Create a new list with only items from list that is between 0 and 10
l1 = [100, 1, 10, 2, 3, 5, 8, 13, 21, 34, 55, 98]

new = [] #Blank list
for i in l1:
    if i > 0 and i <= 10:
        new.append(i)

new
Output: [1, 10, 2, 3, 5, 8]
This can also be done with the numpy package by converting the list into a numpy array. See the code below.
import numpy as np
k=np.array(l1)
new=k[np.where(k<=10)]

Example 2 : Check which alphabet (a-z) is mentioned in string

Suppose you have a string named k and you want to check which alphabet exists in the string k.
k = "deepanshu"

import string
for n in string.ascii_lowercase:
    if n in k:
        print(n + ' exists in ' + k)
    else:
        print(n + ' does not exist in ' + k)
string.ascii_lowercase returns 'abcdefghijklmnopqrstuvwxyz'.

Practical Examples : for in loop in Python

Create sample pandas data frame for illustrative purpose.
import pandas as pd
import numpy as np
np.random.seed(234)
df = pd.DataFrame({"x1"     : np.random.randint(low=1, high=100, size=10),
                   "Month1" : np.random.normal(size=10),
                   "Month2" : np.random.normal(size=10),
                   "Month3" : np.random.normal(size=10),
                   "price"  : range(10)
                   })

df
1. Multiply each month column by 1.2
for i in range(1,4):
    print(df["Month"+str(i)]*1.2)
range(1,4) returns 1, 2 and 3. The str( ) function is used to convert a number to a string, so "Month" + str(1) means Month1.
2. Store computed columns in new data frame
import pandas as pd
newDF = pd.DataFrame()
for i in range(1,4):
    data = pd.DataFrame(df["Month"+str(i)]*1.2)
    newDF = pd.concat([newDF, data], axis=1)
pd.DataFrame( ) is used to create blank data frame. The concat() function from pandas package is used to concatenate two data frames.

3. Check if value of x1 >= 50, multiply each month cost by price. Otherwise same as month.
import pandas as pd
import numpy as np
for i in range(1,4):
    df['newcol'+str(i)] = np.where(df['x1'] >= 50,
                                   df['Month'+str(i)] * df['price'],
                                   df['Month'+str(i)])
In this example, we are adding new columns named newcol1, newcol2 and newcol3. np.where(condition, value_if_condition_is_met, value_if_condition_is_not_met) is used to construct an IF-ELSE statement.

4. Filter data frame by each unique value of a column and store it in a separate data frame
mydata = pd.DataFrame({"X1" : ["A","A","B","B","C"]})

for name in mydata.X1.unique():
    temp = pd.DataFrame(mydata[mydata.X1 == name])
    exec('{} = temp'.format(name))
The unique( ) function returns the distinct values of a variable. The exec( ) function is used for dynamic execution of a Python statement. See the usage of the string format( ) function below -
s= "Your Input"
"i am {}".format(s)

Output: 'i am Your Input'
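Creating variables dynamically with exec( ) can be hard to debug; a common alternative (not from the original article) is to keep the subsets in a dictionary keyed by the unique value:
subsets = {name: mydata[mydata.X1 == name] for name in mydata.X1.unique()}
subsets["A"]   # the subset where X1 == "A"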

Loop Control Statements

Loop control statements change execution from its normal sequence of iterations. When execution leaves a loop's scope, all automatic objects created in that scope are destroyed.

Python supports the following control statements.
  1. Continue statement
  2. Break statement

Continue Statement
When a continue statement is executed, it skips the rest of the code inside the loop for the current iteration and continues with the next iteration.
In the code below, we prevent the letters a and d from being printed.
for n in "abcdef":
    if n == "a" or n == "d":
        continue
    print("letter :", n)
letter : b
letter : c
letter : e
letter : f
Break Statement
When a break statement runs, it stops the loop.
In this program, when n is either c or d, the loop stops executing.
for n in "abcdef":
    if n == "c" or n == "d":
        break
    print("letter :", n)
letter : a
letter : b

for loop with else clause

Using an else clause with a for loop is not common among the Python developer community.
The else clause executes after the loop completes normally, meaning the loop did not encounter a break statement.
The program below finds factors for the numbers between 2 and 10. The else clause prints the numbers that have no factors and are therefore prime:

for k in range(2, 10):
    for y in range(2, k):
        if k % y == 0:
            print(k, '=', y, '*', round(k/y))
            break
    else:
        print(k, 'is a prime number')
2 is a prime number
3 is a prime number
4 = 2 * 2
5 is a prime number
6 = 2 * 3
7 is a prime number
8 = 2 * 4
9 = 3 * 3

While Loop

A while loop executes code repeatedly as long as a condition is true. When the condition becomes false, the line immediately after the loop is executed.
i = 1
while i < 10:
    print(i)
    i += 2  # means i = i + 2
    print("new i :", i)
Output:
1
new i : 3
3
new i : 5
5
new i : 7
7
new i : 9
9
new i : 11

While Loop with If-Else Statement

If-Else statements can be used along with a while loop. See the program below -

counter = 1
while (counter <= 5):
    if counter < 2:
        print("Less than 2")
    elif counter > 4:
        print("Greater than 4")
    else:
        print(">= 2 and <=4")
    counter += 1

Python Lambda Function with Examples

This article covers detailed explanation of lambda function of Python. You will learn how to use it in some of the common scenarios with examples.

Table of Contents

Introduction : Lambda Function

In non-technical language, lambda is an alternative way of defining a function. It lets you define a function inline: you can apply a function to some data in one line of Python code. It is called an anonymous function because the function can be defined without a name.

Syntax of Lambda Function

lambda arguments: expression
A lambda function can have more than one argument but only one expression. The expression is evaluated and returned.
Example
addition = lambda x,y: x + y
addition(2,3) returns 5
In the above python code, x,y are the arguments and x + y is the expression that gets evaluated and returned.

Difference between Lambda and Def Function

By using both lambda and def, you can create your own user-defined function in python.
def square(x):
    return x**2

square(2) returns 4
square = lambda x:x**2

square(2) returns 4

There are some difference between them as listed below.

1. A lambda expression returns a function object without binding it to a name, whereas def creates a name in the local namespace
2. lambda functions are good for situations where you want to minimise lines of code, as you can create a function in one line of Python code; this is not possible using def
3. lambda functions are somewhat less readable for most Python users.
4. lambda functions can only be used once, unless assigned to a variable name.

Lambda Function : Examples

In this section of tutorial, we will see various practical examples of lambda functions. Let's create a pandas data frame for illustration purpose.
import pandas as pd
import numpy as np
np.random.seed(12)
df = pd.DataFrame(np.random.randn(5, 3), index=list('abcde'), columns=list('XYZ'))
          X         Y         Z
a 0.472986 -0.681426 0.242439
b -1.700736 0.753143 -1.534721
c 0.005127 -0.120228 -0.806982
d 2.871819 -0.597823 0.472457
e 1.095956 -1.215169 1.342356
Example 1 : Add 2 to each value of Data Frame
def add2(x):
    return x+2

df.apply(add2)
df.apply(lambda x: x+2)
Both return the same output, but the lambda function can be defined directly inside the apply( ) function.
          X         Y         Z
a 2.472986 1.318574 2.242439
b 0.299264 2.753143 0.465279
c 2.005127 1.879772 1.193018
d 4.871819 1.402177 2.472457
e 3.095956 0.784831 3.342356
Example 2 : Create function that returns result of number raised to power
def power(x,n):
    return x**n

df.apply(power, n=3)
df.apply(lambda x : x**3)
              X         Y         Z
a 1.058143e-01 -0.316414 0.014250
b -4.919381e+00 0.427201 -3.614836
c 1.347751e-07 -0.001738 -0.525523
d 2.368489e+01 -0.213657 0.105460
e 1.316375e+00 -1.794361 2.418820
Example 3 : Conditional Statement (IF-ELSE)
Suppose you want to create a new variable that is missing (blank) when the value of an existing variable is less than 90, and otherwise equal to the existing variable. Let's create a dummy data frame called sample which contains only one variable, var1. Condition: if var1 is less than 90, the function should return missing, else the value of var1.
import numpy as np
sample = pd.DataFrame({'var1':[10,100,40] })
sample['newvar1'] = sample.apply(lambda x: np.nan if x['var1'] < 90 else x['var1'], axis=1)
How to read the above lambda function
lambda x: value_if_condition_true if logical_condition else value_if_condition_false
axis=1 tells Python to apply the function to each row, using the row's column values. By default it is 0, which means the function is applied to each column.

There is one more way to write the above function without specifying axis option. It will be applied to series sample['var1']
sample['newvar1'] = sample['var1'].apply(lambda x: np.nan if x < 90 else x)

The same function can also be written using def. See the code below.
def miss(x):
    if x["var1"] < 90:
        return np.nan
    else:
        return x["var1"]

sample['newvar1'] = sample.apply(miss, axis=1)
   var1  newvar1
0 10 NaN
1 100 100.0
2 40 NaN
Example 4 : Multiple or Nested IF-ELSE Statement
Suppose you want to create a flag which is yes when the value of a variable is between 1 and 5 (inclusive), no when the value is equal to 7, and missing otherwise.
mydf = pd.DataFrame({'Names': np.arange(1,10,2)}) 
mydf["flag"] = mydf["Names"].apply(lambda x: "yes" if x>=1 and x<=5 else "no" if x==7 else np.nan)
   Names flag
0 1 yes
1 3 yes
2 5 yes
3 7 no
4 9 NaN
Lambda functions are often used along with built-in functions like map() and filter().

map() function

The map() function executes the function object (i.e. a function created with lambda or def) for each element of an iterable and returns an iterator of the modified elements. In the code below, we are multiplying each element by 2.
mylist = [1, 2, 3, 4]
map(lambda x : x*2, mylist)
It returns a map object, so you cannot see the returned values directly. To view the result, you need to wrap it in list( ).
list(map(lambda x : x*2, mylist))
Output : [2, 4, 6, 8]

filter() function

It returns the items for which the function returns True. If no element meets the condition, the result is empty. In the code below, we are keeping values greater than 2.
list(filter(lambda x : x > 2 , mylist))
Output : [3, 4]
filter( ) returns a filter object. To see the output values, you need to wrap the filter( ) call within list( ).
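map( ) and filter( ) can also be chained together. A small sketch, continuing with the same mylist, that first keeps the values greater than 2 and then doubles them:
mylist = [1, 2, 3, 4]
# keep values greater than 2, then multiply the remaining values by 2
list(map(lambda x: x * 2, filter(lambda x: x > 2, mylist)))
Output : [6, 8]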

Create Animation in R : Learn by Examples

This tutorial covers various ways you can create animated charts or plots using R. Animation is a very important element of data visualization. Animated charts are visually appealing and fetch the attention of the audience. There are many online data visualization tools available in the market which can generate animated charts, but most of them are paid tools. Another problem with online animation tools is that they ask you to upload data to their servers, which can be a data breach if you work with real-world client data. Since R is open source, you can download it for free and create animated charts without moving data to any external server.

Simple Animation in R

Let's create dummy data for illustration. In the program below, we are generating two columns of random observations. The first column, A, contains 50 observations ranging from 1 to 75. The second column, B, contains the same number of observations but drawn from a different range (1 to 100).
df = data.frame(A=sample(1:75, 50, replace=TRUE),
B=sample(1:100, 50, replace=TRUE),
stringsAsFactors = FALSE)
The gganimate package is used for animation in R. It is an extension of ggplot2, the popular package for graphics.
library(ggplot2)
library(tidyverse)
library(gganimate)
library(directlabels)
library(png)
library(transformr)
library(grid)

ggplot(df, aes(A, B)) +
geom_line() +
transition_reveal(A) +
labs(title = 'A: {frame_along}')
Animation R
geom_line() is used for creating a line chart. transition_reveal(A) lets the data appear gradually along A. frame_along gives the position that the current frame corresponds to.

What is frame and rendering in animation?

In animation, a frame is one of the many still images which compose the complete moving picture. Rendering is the computation that produces the final output. The gganimate package renders 100 frames by default. You can change the number of frames with the nframes= parameter of the animate() function.
p = ggplot(df, aes(A, B)) +
geom_line() +
transition_reveal(A) +
labs(title = 'A: {frame_along}')

animate(p, nframes=40)

How to save animated plot in GIF format file?

You can use anim_save(file_location,plot) function to export animated chart in GIF format.
anim_save("basic_animation.gif", p)

Frames per Second (fps)

It is the number of frames shown per second. You can set it with the fps parameter in the animate() function. By default, it is 10 frames per second.
animate(p, nframes=40, fps = 2)
Decreasing fps below 10 slows down the animation.

How to stop loop in animation?

Looping means the animation repeats over and over again. To stop the loop, you can use the renderer = gifski_renderer(loop = FALSE) option in the animate() function.
animate(p, renderer = gifski_renderer(loop = FALSE))

How to change layout of plot?

You can change the height and width of the plot by specifying the size in the animate( ) function.
animate(p, fps = 10, duration = 14, width = 800, height = 400)

Advanced Animation in R : Examples

Prepare Data for Example
In this example, we will create a bar chart showing the change in monthly sales figures of different products.
set.seed(123)
dates = paste(rep(month.abb[1:10], each=10), 2018)
df = data.frame(Product=rep(sample(LETTERS[1:10],10), 10),
Period=factor(dates, levels=unique(dates)),
Sales=sample(1:100,100, replace = TRUE))
head(df)
Product Period Sales order
1 E Jan 2018 15 1
2 H Jan 2018 34 2
3 F Jan 2018 42 3
4 E Jan 2018 49 4
5 J Jan 2018 49 5
6 C Jan 2018 60 6
# Ranking by Period and Sales
df = df %>%
arrange(Period, Sales) %>%
mutate(order = 1:n())

# Animation
p = df %>%
ggplot(aes(order, Sales)) +
geom_bar(stat = "identity", fill = "#ff9933") +
labs(title='Total Sales in {closest_state}', x=NULL) +
theme(plot.title = element_text(hjust = 0.5, size = 18)) +
scale_x_continuous(breaks=df$order, labels=df$Product, position = "top") +
transition_states(Period, transition_length = 1, state_length = 2) +
view_follow(fixed_y=TRUE) +
ease_aes('cubic-in-out')

animate(p, nframes=50, fps=4)
anim_save("bar_animation.gif", p)
Detailed Explanation
  1. transition_states() animates the plot by a categorical or discrete variable. "States" are the animation sequences that play; when a state transition is triggered, a new state's animation sequence runs. In this case, the state is the Period column. state_length is the relative length of the pause at each state, and transition_length is the relative length of the transition between states.
  2. view_follow(fixed_y=TRUE) means the y-axis stays fixed while the animation runs.
  3. ease_aes( ) controls the easing of the motion, e.g. 'cubic-in-out' starts quickly and then decelerates (or vice versa).
  4. You can set theme using theme_set(theme_minimal())

Indian General Election (1984 to 2019) Study

Recently the BJP secured a majority in the Lok Sabha election. They contested a Lok Sabha election for the first time in 1984. The INC (Indian National Congress) used to be the biggest political party in India a decade ago. Here we will look at the trend in the percentage of seats won by these two parties from 1984 to 2019. Source of data: Election Commission of India.
library(ggplot2)
library(tidyverse)
library(gganimate)
library(directlabels)
library(png)
library(transformr)
library(grid)

# Read Data
df = read.table(text =
" Year Perc_Seats Party
1984 0.79 INC
1989 0.38 INC
1991 0.45 INC
1996 0.27 INC
1998 0.27 INC
1999 0.22 INC
2004 0.28 INC
2009 0.4 INC
2014 0.09 INC
2019 0.1 INC
1984 0 BJP
1989 0.17 BJP
1991 0.23 BJP
1996 0.31 BJP
1998 0.35 BJP
1999 0.35 BJP
2004 0.27 BJP
2009 0.23 BJP
2014 0.52 BJP
2019 0.56 BJP
", header=TRUE)

# Set Theme
theme_set(theme_minimal())

# Plot and animate
p =
ggplot(data = df, aes(x= factor(Year), y=Perc_Seats, group=Party, colour=Party)) +
geom_line(size=2, show.legend = FALSE) +
scale_color_manual(values=c("#ff9933", "#006400")) +
scale_x_discrete(position = "top") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = 'Lok Sabha Election : % of seats won',
x = NULL, y = NULL) +
geom_text(aes(label=scales::percent(Perc_Seats, accuracy = 1),
vjust= -2), show.legend = FALSE) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_dl(aes(label=Party), method="last.points") +
transition_reveal(Year) +
coord_cartesian(clip = 'off') +
ease_aes('cubic-in-out')

animate(p, fps = 10, width = 800, height = 400)
anim_save("election.gif", p)

How to save animated plot as video

Make sure ffmpeg is installed on your system before using the code below. It is available for download for all the operating systems.

animate(p, renderer = ffmpeg_renderer(), width = 800, height = 450)
anim_save("nations.mp4")

String Functions in Python with Examples

This tutorial outlines various string (character) functions used in Python. To manipulate strings and character values, Python has several built-in functions, which means you don't need to import or depend on any external package to deal with the string data type. That is one of the advantages of using Python over other data science tools. Dealing with string values is very common in the real world. Suppose you have customers' full names and your manager asks you to extract each customer's first and last name, or you want to fetch information on all products whose code starts with 'QT'.

Python String Functions

List of frequently used string functions

The table below shows many common string functions along with their descriptions and equivalent functions in MS Excel. Most of us use MS Excel in our workplace and are familiar with its functions, so the comparison of string functions in MS Excel and Python can help you learn the functions quickly and revise them before an interview.
Function | Description | MS EXCEL FUNCTION
mystring[:N] | Extract N characters from the start of the string | LEFT( )
mystring[-N:] | Extract N characters from the end of the string | RIGHT( )
mystring[X:Y] | Extract characters from the middle of the string, starting at position X and ending at Y | MID( )
str.split(separator) | Split a string | -
str.replace(old_substring, new_substring) | Replace a part of text with a different sub-string | REPLACE( )
str.lower() | Convert characters to lowercase | LOWER( )
str.upper() | Convert characters to uppercase | UPPER( )
str.contains('pattern', case=False) | Check if pattern matches (pandas function) | SQL LIKE operator
str.extract(regular_expression) | Return matched values (pandas function) | -
str.count('sub_string') | Count occurrences of a pattern in a string | -
str.find( ) | Return position of a sub-string or pattern | FIND( )
str.isalnum() | Check whether string consists of only alphanumeric characters | -
str.islower() | Check whether characters are all lower case | -
str.isupper() | Check whether characters are all upper case | -
str.isnumeric() | Check whether string consists of only numeric characters | -
str.isspace() | Check whether string consists of only whitespace characters | -
len( ) | Calculate length of string | LEN( )
cat( ) | Concatenate strings (pandas function) | CONCATENATE( )
separator.join(str) | Concatenate strings | CONCATENATE( )

LEFT, RIGHT and MID Functions

If you are an intermediate MS Excel user, you must have used the LEFT, RIGHT and MID functions. These functions are used to extract N characters or letters from a string.
1. Extract first two characters from beginning of string
mystring = "Hey buddy, wassup?"
mystring[:2]
Out[1]: 'He'
  1. string[start:stop:step] means items go from start (default 0) through (stop-1), stepping by step (default 1); see the step example after this list.
  2. mystring[:2] is equivalent to mystring[0:2].
  3. mystring[:2] tells Python to pull the first 2 characters from the mystring string object.
  4. Indexing starts from zero, so it includes the first and second characters and excludes the third.
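Since the step part of the slice rarely gets an example, here is a small illustrative sketch using the same mystring, taking every second character of the first ten:
mystring[0:10:2]
Out[1]: 'Hybdy'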
2. Find last two characters of string
mystring[-2:]
The above command returns p?. The -2 starts the range from the second-to-last position and goes through to the end of the string.
3. Find characters from middle of string
mystring[1:3]
Out[1]: 'ey'
mystring[1:3] returns second and third characters. 1 refers to second character as index begins with 0.
4. How to reverse string?
mystring[::-1]
Out[1]: '?pussaw ,yddub yeH'
A step of -1 tells Python to start from the end and move one character at a time from right to left.
5. How to extract characters from string variable in Pandas DataFrame?
Let's create a fake data frame for illustration. In the code below, we are creating a dataframe named df containing only 1 variable called var1
import pandas as pd
df = pd.DataFrame({"var1": ["A_2", "B_1", "C_2", "A_2"]})

var1
0 A_2
1 B_1
2 C_2
3 A_2
To deal with text data in a pandas DataFrame, we can use the str accessor. It can be used for slicing character values.
df['var1'].str[0]
In this case, we are fetching first character from var1 variable. See the output shown below.

Output
0 A
1 B
2 C
3 A

Extract Words from String

Suppose you need to extract word(s) instead of characters from a string. Generally we treat a single blank space as the delimiter between words.
1. Find first word of string
mystring.split()[0]
Out[1]: 'Hey'
How it works?
  1. split() function breaks string using space as a default separator
  2. mystring.split() returns ['Hey', 'buddy,', 'wassup?']
  3. 0 returns first item or word Hey
2. Comma as separator for words
mystring.split(',')[0]
Out[1]: 'Hey buddy'
3. How to extract last word
mystring.split()[-1]
Out[1]: 'wassup?'
4. How to extract word in DataFrame
Let's build a dummy data frame consisting of customer names and call it variable custname

mydf = pd.DataFrame({"custname": ["Priya_Sehgal", "David_Stevart", "Kasia_Woja", "Sandy_Dave"]})

custname
0 Priya_Sehgal
1 David_Stevart
2 Kasia_Woja
3 Sandy_Dave

#First Word
mydf['fname'] = mydf['custname'].str.split('_').str[0]

#Last Word
mydf['lname'] = mydf['custname'].str.split('_').str[1]
Detailed Explanation
  1. str.split( ) is similar to split( ). It applies the split function to each value of a column in a pandas data frame.
  2. In the code above, we created two new columns named fname and lname storing first and last name.

  3. Output
    custname fname lname
    0 Priya_Sehgal Priya Sehgal
    1 David_Stevart David Stevart
    2 Kasia_Woja Kasia Woja
    3 Sandy_Dave Sandy Dave

SQL LIKE Operator in Pandas DataFrame

In SQL, the LIKE operator is used to find out whether a character string matches or contains a pattern. We can implement similar functionality in Python using the str.contains( ) function.

df2 = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "a_2"],
"var2": ["X_2", "Y_1", "Z_2", "X2"]})

var1 var2
0 AA_2 X_2
1 B_1 Y_1
2 C_2 Z_2
3 a_2 X2
How to find rows containing either A or B in variable var1?
df2['var1'].str.contains('A|B')
str.contains(pattern) is used to match pattern in Pandas Dataframe.

Output
0 True
1 True
2 False
3 False
The above command returns FALSE against fourth row as the function is case-sensitive. To ignore case-sensitivity, we can use case=False parameter. See the working example below.
df2['var1'].str.contains('A|B', case=False)
How to filter rows containing a particular pattern?
In the following program, we are asking Python to subset the data on the condition that var1 contains either A or B. It is equivalent to the WHERE clause in SQL.
df2[df2['var1'].str.contains('A|B', case=False)]

Output
var1 var2
0 AA_2 X_2
1 B_1 Y_1
3 a_2 X2
Suppose you want only those values that start with a single letter followed by '_'.

df2[df2['var1'].str.contains('^[A-Z]_', case=False)]
^ is a regular expression token that anchors the match to the beginning of the string.

var1 var2
1 B_1 Y_1
2 C_2 Z_2
3 a_2 X2

Find position of a particular character or keyword

str.find(pattern) is used to find position of sub-string. In this case, sub-string is '_'.

df2['var1'].str.find('_')

0 2
1 1
2 1
3 1

Replace substring

str.replace(old_text,new_text,case=False) is used to replace a particular character(s) or pattern with some new value or pattern. In the code below, we are replacing _ with -- in variable var1.

df2['var1'].str.replace('_', '--', case=False)

Output
0 AA--2
1 B--1
2 C--2
3 a--2
We can also use more complex patterns, as in the following program. + means the preceding item occurs one or more times; in this case, a letter occurring one or more times.

df2['var1'].str.replace('[A-Z]+_', 'X', case=False)

0 X2
1 X1
2 X2
3 X2

Find length of string

len(string) is used to calculate length of string. In pandas data frame, you can apply str.len() for the same.

df2['var1'].str.len()

Output
0 4
1 3
2 3
3 3
To find the count of occurrences of a particular character (let's say, how many times 'A' appears in each row), you can use the str.count(pattern) function.
df2['var1'].str.count('A')

Convert to lowercase and uppercase

str.lower() and str.upper() functions are used to convert string to lower and uppercase values.

#Convert to lower case
mydf['custname'].str.lower()

#Convert to upper case
mydf['custname'].str.upper()

Remove Leading and Trailing Spaces

  1. str.strip() removes both leading and trailing spaces.
  2. str.lstrip() removes leading spaces (at beginning).
  3. str.rstrip() removes trailing spaces (at end).

df1 = pd.DataFrame({'y1': [' jack', 'jill ', ' jesse ', 'frank ']})
df1['both']=df1['y1'].str.strip()
df1['left']=df1['y1'].str.lstrip()
df1['right']=df1['y1'].str.rstrip()

y1 both left right
0 jack jack jack jack
1 jill jill jill jill
2 jesse jesse jesse jesse
3 frank frank frank frank

Convert Numeric to String

With the use of str( ) function, you can convert numeric value to string.

myvariable = 4
mystr = str(myvariable)

Concatenate or Join Strings

By simply using +, you can join two string values.

x = "Deepanshu"
y ="Bhalla"
x+y

DeepanshuBhalla
In case you want to add a space between two strings, you can use x + ' ' + y, which returns Deepanshu Bhalla. Suppose you have a list containing multiple string values and you want to combine them. You can use the join( ) function.

string0 = ['Ram', 'Kumar', 'Singh']
' '.join(string0)

Output
'Ram Kumar Singh'
Suppose you want to combine or concatenate two columns of a pandas dataframe, keeping a space between the names.
mydf['fullname'] = mydf['fname'] + ' ' + mydf['lname']
OR
mydf['fullname'] = mydf[['fname', 'lname']].apply(lambda x: ' '.join(x), axis=1)

custname fname lname fullname
0 Priya_Sehgal Priya Sehgal Priya Sehgal
1 David_Stevart David Stevart David Stevart
2 Kasia_Woja Kasia Woja Kasia Woja
3 Sandy_Dave Sandy Dave Sandy Dave

SQL IN Operator in Pandas

We can use isin(list) function to include multiple values in our filtering or subsetting criteria.

mydata = pd.DataFrame({'product': ['A', 'B', 'B', 'C','C','D','A']})
mydata[mydata['product'].isin(['A', 'B'])]

product
0 A
1 B
2 B
6 A
How to apply NOT criteria while selecting multiple values?
We can use sign ~ to tell python to negate the condition.

mydata[~mydata['product'].isin(['A', 'B'])]

Extract a particular pattern from string

str.extract(r'regex-pattern') is used for this task.

df2['var1'].str.extract(r'(^[A-Z]_)')
r'(^[A-Z]_)' means the value starts with a letter A-Z followed by '_'.

0 NaN
1 B_
2 C_
3 NaN
To remove missing values, we can use dropna( ) function.

df2['var1'].str.extract(r'(^[A-Z]_)').dropna()

Free SQL Download to practice queries

In this tutorial, we will cover how you can download a relational database management system for free to practice SQL queries at home. Many people ask, "Like Python and R, is there any free software where I can learn and practice SQL queries?" The answer is yes. Before getting into the details of the installation process, we need to understand what SQL is and how it relates to a relational database management system.
What is SQL (Structured Query Language)?
SQL is a programming language mainly used to manipulate data stored in a relational database management system. We can select, create and modify data (rows and columns) in tables using SQL queries. We can also alter and delete tables using queries.
What is RDBMS (Relational Database Management System)?
A Relational Database Management System (RDBMS) is a software system that stores data in tabular form. Most databases used in businesses these days are relational databases, as opposed to CSV or Excel files. SQL is the language used for communicating with data in an RDBMS.

How to download SQL Server for free?

Microsoft SQL Server is a powerful relational database management system owned by Microsoft. It is one of the most popular RDBMSs, used in both small and big organizations. It is an enterprise system which is not available for free, but Microsoft offers a free version of it called SQL Server Express edition.
Benefits of using SQL SERVER Express Edition
  1. You can create SQL tables by simply importing CSV files. You don't need to create sample data manually.
  2. You can create and execute stored procedures.
  3. You will get a feel for how SQL is used in companies.
  4. It supports window functions like ROW_NUMBER, RANK, NTILE and DENSE_RANK.
Steps to download and install SQL Server Express Edition
  1. Go to Microsoft website and download SQL Server 2017 Express Edition. Click on Download now button as shown below.
    SQL Server Express
  2. After completing above step, click on the downloaded file. It will take you to the screen shown in the following image.

    Select the Basic install option. This new installation feature selects all of the most commonly used configuration options and is ideal for the beginning MSSQL user.
  3. It will install the software. When installation is completed, it will show information like connection settings, and file locations.
    The next step is to install SQL Server Management Studio (SSMS) by pressing the Install SSMS button. It is an IDE, like RStudio or Spyder, which helps you manage databases and write code with ease.
  4. Once you click on the above Install SSMS button, it will take you to the page shown below. Click on the Download SQL Server Management Studio 18.0 (GA) link. Downloading of the software will begin after that. It may take some time as it is roughly a 0.5 GB file.
  5. Install the SSMS software. Open it once installation is completed. It will show a screen asking you to connect to a server. Click on the Connect button.
    Server Name : PC_Name\SQLEXPRESS. In the following image, DELL is the PC_NAME.
Are you facing issue in connection and getting the following error?
Cannot connect to XXXXXX. A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections.
To fix this issue, follow the steps below.
  1. Open SQL Server 2017 Configuration Manager and then go to SQL Server Network configuration and then click on Protocols for SQLEXPRESS and make sure both Named Pipes and TCP/IP are enabled. Right-click to enable them.
  2. Right-click on TCP/IP and go to Properties. Now Select IP Addresses Tab and then go to the last item IP All and enter 1433 in TCP Port.
  3. Press the Windows + R shortcut to open the Run window and then type services.msc. It will open the Services window; search for SQL SERVER(SQLEXPRESS) and then start the service by right-clicking on it.
  4. Open SQL Server Management Studio again. If it's already opened, reopen it.
How to use SQL Server Management Studio?
1. Press CTRL + N to open New Query where you can write your SQL query.
2. Check databases by clicking on Databases folder shown under Object Explorer
3. How to check current database in use?

SELECT DB_NAME() AS [Current Database]
Run the above command and press F5 shortcut to execute or submit sql query.
4. How to create a fake dummy table

USE tempdb;
create table employeetbl (employee_id integer, first_name varchar(10), salary float)
insert into employeetbl (employee_id,first_name,salary) values (123, 'Deep', 44561)
Select * from employeetbl
USE tempdb refers to database you want to use.
5. How to import CSV File?
1. Right click on the Databases folder, click on the New Database option and then type any name you want to assign (let's say newdb).

2. Right click on newdb >> Tasks >> Import Flat File
3. Select CSV file and import it.
6. How to create a simple stored procedure?

CREATE PROCEDURE sampleproc
AS
BEGIN
    SELECT AGE, Attrition, HourlyRate
    FROM Employee_details
    ORDER BY HourlyRate;
END;

EXECUTE sampleproc;
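Since this blog focuses mainly on Python, you may also want to run such queries from Python itself. Below is a minimal, illustrative sketch using the pyodbc package (install it with pip install pyodbc). The driver name, the server name DELL\SQLEXPRESS and the table employeetbl are assumptions based on the setup above and may differ on your machine.
import pyodbc

# connection string assumes SQL Server Express with Windows authentication;
# replace DELL\SQLEXPRESS with your own PC_Name\SQLEXPRESS
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=DELL\\SQLEXPRESS;"
    "DATABASE=tempdb;"
    "Trusted_Connection=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT employee_id, first_name, salary FROM employeetbl")
for row in cursor.fetchall():
    print(row)

conn.close()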
Limitations of SQL SERVER Express Edition
  1. Maximum database size of 10 GB per database
  2. No SQL Server Agent service
  3. SQL Server Integration Services and Analysis Services are not available.

How to build login page in R Shiny App

This tutorial covers how you can build a login page where the user needs to enter a username and password for authentication in a Shiny app. RStudio offers paid products like Shiny Server and RStudio Connect which have an authentication feature to verify the identity of the user. But if you want this feature for free, you can follow the steps mentioned below.
Features of R Program shown in the tutorial below
  1. Dashboard will be opened only when user enters correct username and password
  2. You can hide or show functionalities of dashboard (like tabs, widgets etc) based on type of permission
  3. Passwords are hashed (via the sodium package) instead of being stored in plain text, which mitigates brute-force attacks
login form shiny

Steps to add login authentication feature in Shiny

Step 1 : Install the following packages by using the command install.packages("package-name")
  • shiny
  • shinydashboard
  • DT
  • shinyjs
  • sodium

Step 2 : Run the program below
library(shiny)
library(shinydashboard)
library(DT)
library(shinyjs)
library(sodium)

# Main login screen
loginpage <- div(id = "loginpage", style = "width: 500px; max-width: 100%; margin: 0 auto; padding: 20px;",
wellPanel(
tags$h2("LOG IN", class = "text-center", style = "padding-top: 0;color:#333; font-weight:600;"),
textInput("userName", placeholder="Username", label = tagList(icon("user"), "Username")),
passwordInput("passwd", placeholder="Password", label = tagList(icon("unlock-alt"), "Password")),
br(),
div(
style = "text-align: center;",
actionButton("login", "SIGN IN", style = "color: white; background-color:#3c8dbc;
padding: 10px 15px; width: 150px; cursor: pointer;
font-size: 18px; font-weight: 600;"),
shinyjs::hidden(
div(id = "nomatch",
tags$p("Oops! Incorrect username or password!",
style = "color: red; font-weight: 600;
padding-top: 5px;font-size:16px;",
class = "text-center"))),
br(),
br(),
tags$code("Username: myuser Password: mypass"),
br(),
tags$code("Username: myuser1 Password: mypass1")
))
)

credentials = data.frame(
username_id = c("myuser", "myuser1"),
passod = sapply(c("mypass", "mypass1"),password_store),
permission = c("basic", "advanced"),
stringsAsFactors = F
)

header <- dashboardHeader( title = "Simple Dashboard", uiOutput("logoutbtn"))

sidebar <- dashboardSidebar(uiOutput("sidebarpanel"))
body <- dashboardBody(shinyjs::useShinyjs(), uiOutput("body"))
ui<-dashboardPage(header, sidebar, body, skin = "blue")

server <- function(input, output, session) {

login = FALSE
USER <- reactiveValues(login = login)

observe({
if (USER$login == FALSE) {
if (!is.null(input$login)) {
if (input$login > 0) {
Username <- isolate(input$userName)
Password <- isolate(input$passwd)
if(length(which(credentials$username_id==Username))==1) {
pasmatch <- credentials["passod"][which(credentials$username_id==Username),]
pasverify <- password_verify(pasmatch, Password)
if(pasverify) {
USER$login <- TRUE
} else {
shinyjs::toggle(id = "nomatch", anim = TRUE, time = 1, animType = "fade")
shinyjs::delay(3000, shinyjs::toggle(id = "nomatch", anim = TRUE, time = 1, animType = "fade"))
}
} else {
shinyjs::toggle(id = "nomatch", anim = TRUE, time = 1, animType = "fade")
shinyjs::delay(3000, shinyjs::toggle(id = "nomatch", anim = TRUE, time = 1, animType = "fade"))
}
}
}
}
})

output$logoutbtn <- renderUI({
req(USER$login)
tags$li(a(icon("fa fa-sign-out"), "Logout",
href="javascript:window.location.reload(true)"),
class = "dropdown",
style = "background-color: #eee !important; border: 0;
font-weight: bold; margin:5px; padding: 10px;")
})

output$sidebarpanel <- renderUI({
if (USER$login == TRUE ){
sidebarMenu(
menuItem("Main Page", tabName = "dashboard", icon = icon("dashboard"))
)
}
})

output$body <- renderUI({
if (USER$login == TRUE ) {
tabItem(tabName ="dashboard", class = "active",
fluidRow(
box(width = 12, dataTableOutput('results'))
))
}
else {
loginpage
}
})

output$results <- DT::renderDataTable({
datatable(iris, options = list(autoWidth = TRUE,
searching = FALSE))
})

}

runApp(list(ui = ui, server = server), launch.browser = TRUE)
How to customize the program
  1. In the above program, two user names and passwords are defined
    Username : myuser, Password : mypass and Username : myuser1, Password : mypass1. To change them, you can edit the following code in the R program.

    credentials = data.frame(
    username_id = c("myuser", "myuser1"),
    passod = sapply(c("mypass", "mypass1"),password_store),
    permission = c("basic", "advanced"),
    stringsAsFactors = F
    )
  2. In order to modify sidebar section, you can edit the following section of code.
        if (USER$login == TRUE ){ 
    sidebarMenu(
    menuItem("Main Page", tabName = "dashboard", icon = icon("dashboard"))
    )
    }
    In order to edit main body of the app, you can make modification in the following section of code.
      if (USER$login == TRUE ) {
    tabItem(tabName ="dashboard", class = "active",
    fluidRow(
    box(width = 12, dataTableOutput('results'))
    ))
    }
    else {
    loginpage
    }
  3. Suppose you want to show multiple tabs when the permission level is set to "advanced" and a single tab otherwise. If you log in with the credentials Username : myuser1, Password : mypass1, you will see two tabs; otherwise only one tab named "Main Page" is shown. Replace the renderUI functions of output$sidebarpanel and output$body with the following script.
      output$sidebarpanel <- renderUI({
    if (USER$login == TRUE ){
    if (credentials[,"permission"][which(credentials$username_id==input$userName)]=="advanced") {
    sidebarMenu(
    menuItem("Main Page", tabName = "dashboard", icon = icon("dashboard")),
    menuItem("About Page", tabName = "About", icon = icon("th"))
    )
    }
    else{
    sidebarMenu(
    menuItem("Main Page", tabName = "dashboard", icon = icon("dashboard"))
    )

    }
    }
    })


    output$body <- renderUI({
    if (USER$login == TRUE ) {
    if (credentials[,"permission"][which(credentials$username_id==input$userName)]=="advanced") {
    tabItems(
    tabItem(
    tabName ="dashboard", class = "active",
    fluidRow(
    box(width = 12, dataTableOutput('results'))
    ))
    ,
    tabItem(
    tabName ="About",
    h2("This is second tab")
    )
    )
    }
    else {
    tabItem(
    tabName ="dashboard", class = "active",
    fluidRow(
    box(width = 12, dataTableOutput('results'))
    ))

    }

    }
    else {
    loginpage
    }
    })
Note
The Docker-based shinyproxy package is available for free and has an authentication feature along with some other great enterprise features, but you need to know Docker to use it and many users find it complicated.

Create Infographics with R

This tutorial explains how to create charts used for infographics in R. The word Infographics is made up of two words, Information and Graphics. It simply means a graphical, visual representation of information. Infographics are visually appealing and attract the attention of the audience. In presentations, they add a wow factor and make you stand out in a crowd.
Install the packages used for Infographic Charts
You can install these packages by running the command install.packages(). The package echarts4r.assets is not available on CRAN, so you need to install it from GitHub by running devtools::install_github("JohnCoene/echarts4r.assets").
  1. waffle
  2. extrafont
  3. tidyverse
  4. echarts4r
  5. echarts4r.assets

Waffle (Square Pie Chart)

In this section we will see how to create a waffle chart in R. Waffle charts are also known as square pie or matrix charts. They show the distribution of a categorical variable and are an alternative to the pie chart. They work best when the number of categories is small; the more categories there are, the harder the chart is to read. In the following example, we are showing the percentage of respondents who answered 'yes' or 'no' in a survey.

library(waffle)
waffle(
c('Yes=70%' = 70, 'No=30%' = 30), rows = 10, colors = c("#FD6F6F", "#93FB98"),
title = 'Responses', legend_pos="bottom"
)
waffle in r
Use Icon in Waffle
Steps to download and install fontawesome fonts
  1. First step is to load extrafont library by running this command library(extrafont)
  2. Download and install fontawesome fonts from this URL https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/fonts/fontawesome-webfont.ttf
  3. Import downloaded fontawesome font by using this command extrafont::font_import (path="C:\\Users\\DELL\\Downloads", pattern = "awesome", prompt = FALSE)
  4. Load fonts by using the command loadfonts(device = "win")
  5. Check whether font awesome is installed successfully by running this command fonts()[grep("Awesome", fonts())]. It should return FontAwesome
In the example below, we are showing the performance of girls in a particular subject. The option use_glyph= refers to the icon you want to show in the chart and glyph_size= to the size of the icon.

waffle(
c(`Poor=10` =10, `Average=18` = 18, `Excellent=7` =7), rows = 5, colors = c("#FD6F6F", "#93FB98", "#D5D9DD"),
use_glyph = "female", glyph_size = 12 ,title = 'Girls Performance', legend_pos="bottom"
)
waffle icon
How to align multiple waffle charts
By using the iron( ) function you can stack and left-align waffle plots. You can use ggplot2 functions to customize the plot (as done in the program below to center-align the title using plot.title =).

iron(
waffle(
c('TRUE' = 7, 'FALSE' = 3),
colors = c("pink", "grey70"),
use_glyph = "female",
glyph_size = 12,
title = "Female vs Male",
rows = 1,
legend_pos = "none"
) + theme(plot.title = element_text(hjust = 0.5))
,
waffle(
c('TRUE' = 8, 'FALSE' = 2),
colors = c("skyblue", "grey70"),
use_glyph = "male",
glyph_size = 12,
rows = 1,
legend_pos = "none"
)
)
multiple waffle plots

Pictorial Charts in R

Pictorial charts show data scaled in picture or image form instead of bars or columns. They are also called pictogram charts. Let's create fake data for illustrative purposes.

df22 <- data.frame(
x = sort(LETTERS[1:5], decreasing = TRUE),
y = sort(sample(20:80,5))
)

x y
1 E 27
2 D 29
3 C 45
4 B 46
5 A 78
e_pictorial(value, symbol) function is used for pictorial plots. The second parameter symbol refers to built-in symbols like circle, rect, roundRect, triangle, diamond, pin, arrow, icon, images and SVG Path. Built-in symbols can be used like symbol = "rect"

library(echarts4r)
library(echarts4r.assets)

df22 %>%
e_charts(x) %>%
e_pictorial(y, symbol = ea_icons("user"),
symbolRepeat = TRUE, z = -1,
symbolSize = c(20, 20)) %>%
e_theme("westeros") %>%
e_title("People Icons") %>%
e_flip_coords() %>%
# Hide Legend
e_legend(show = FALSE) %>%
# Remove Gridlines
e_x_axis(splitLine=list(show = FALSE)) %>%
e_y_axis(splitLine=list(show = FALSE)) %>%
# Format Label
e_labels(fontSize = 16, fontWeight ='bold', position = "right", offset=c(10, 0))
Add Images in Chart
If you are using images, make sure to prefix the image address with image://. In the code below, we have used the paste0( ) function to concatenate it with the image address.

Unity <- "https://im.rediff.com/news/2018/oct/29statue-of-unity.png"
Buddha <-"http://im.rediff.com/news/2018/oct/29spring-temple-buddha-china.png"

data <- data.frame(
x = c("Statue of Unity", "Spring Temple Buddha"),
value = c(182, 129),
symbol = c(paste0("image://", Unity),
paste0("image://", Buddha))
)

data %>%
e_charts(x) %>%
e_pictorial(value, symbol) %>%
e_theme("westeros") %>%
e_legend(FALSE) %>%
# Title Alignment
e_title("Statues Height", left='center', padding=10) %>%
e_labels(show=TRUE) %>%
e_x_axis(splitLine=list(show = FALSE)) %>%
e_y_axis(show=FALSE, min=0,max=200, interval=20, splitLine=list(show = FALSE))
Pencil Chart in R
Instead of bars, we are using pencil to show comparison of values.

df02 <- data.frame(
x = LETTERS[1:10],
y = sort(sample(10:80,10), decreasing = TRUE)
)

df02 %>%
e_charts(x) %>%
e_pictorial(y, symbol = paste0("image://","https://1.bp.blogspot.com/-klwxpFekdEQ/XOubIhkalyI/AAAAAAAAHlE/25psl9x4oNkbJoLc2CKTXgV2pEj6tAvigCLcBGAs/s1600/pencil.png")) %>%
e_theme("westeros") %>%
e_title("Pencil Chart", padding=c(10,0,0,50))%>%
e_labels(show = TRUE)%>%
e_legend(show = FALSE) %>%
e_x_axis(splitLine=list(show = FALSE)) %>%
e_y_axis(show=FALSE, splitLine=list(show = FALSE))

Fill Male, Female Icons based on percentage

To find an SVG path, download the desired SVG file from https://iconmonstr.com/, open it in Chrome and then find the path in the page source.

gender = data.frame(gender=c("Male", "Female"), value=c(65, 35),
path = c('path://M18.2629891,11.7131596 L6.8091608,11.7131596 C1.6685112,11.7131596 0,13.032145 0,18.6237673 L0,34.9928467 C0,38.1719847 4.28388932,38.1719847 4.28388932,34.9928467 L4.65591984,20.0216948 L5.74941883,20.0216948 L5.74941883,61.000787 C5.74941883,65.2508314 11.5891201,65.1268798 11.5891201,61.000787 L11.9611506,37.2137775 L13.1110872,37.2137775 L13.4831177,61.000787 C13.4831177,65.1268798 19.3114787,65.2508314 19.3114787,61.000787 L19.3114787,20.0216948 L20.4162301,20.0216948 L20.7882606,34.9928467 C20.7882606,38.1719847 25.0721499,38.1719847 25.0721499,34.9928467 L25.0721499,18.6237673 C25.0721499,13.032145 23.4038145,11.7131596 18.2629891,11.7131596 M12.5361629,1.11022302e-13 C15.4784742,1.11022302e-13 17.8684539,2.38997966 17.8684539,5.33237894 C17.8684539,8.27469031 15.4784742,10.66467 12.5361629,10.66467 C9.59376358,10.66467 7.20378392,8.27469031 7.20378392,5.33237894 C7.20378392,2.38997966 9.59376358,1.11022302e-13 12.5361629,1.11022302e-13',
'path://M28.9624207,31.5315864 L24.4142575,16.4793596 C23.5227152,13.8063773 20.8817445,11.7111088 17.0107398,11.7111088 L12.112691,11.7111088 C8.24168636,11.7111088 5.60080331,13.8064652 4.70917331,16.4793596 L0.149791395,31.5315864 C-0.786976655,34.7595013 2.9373074,35.9147532 3.9192135,32.890727 L8.72689855,19.1296485 L9.2799493,19.1296485 C9.2799493,19.1296485 2.95992025,43.7750224 2.70031069,44.6924335 C2.56498417,45.1567684 2.74553639,45.4852068 3.24205501,45.4852068 L8.704461,45.4852068 L8.704461,61.6700801 C8.704461,64.9659872 13.625035,64.9659872 13.625035,61.6700801 L13.625035,45.360657 L15.5097899,45.360657 L15.4984835,61.6700801 C15.4984835,64.9659872 20.4191451,64.9659872 20.4191451,61.6700801 L20.4191451,45.4852068 L25.8814635,45.4852068 C26.3667633,45.4852068 26.5586219,45.1567684 26.4345142,44.6924335 C26.1636859,43.7750224 19.8436568,19.1296485 19.8436568,19.1296485 L20.3966199,19.1296485 L25.2043926,32.890727 C26.1862111,35.9147532 29.9105828,34.7595013 28.9625083,31.5315864 L28.9624207,31.5315864 Z M14.5617154,0 C17.4960397,0 19.8773132,2.3898427 19.8773132,5.33453001 C19.8773132,8.27930527 17.4960397,10.66906 14.5617154,10.66906 C11.6274788,10.66906 9.24611767,8.27930527 9.24611767,5.33453001 C9.24611767,2.3898427 11.6274788,0 14.5617154,0 L14.5617154,0 Z'))

gender %>%
e_charts(gender) %>%
e_x_axis(splitLine=list(show = FALSE),
axisTick=list(show=FALSE),
axisLine=list(show=FALSE),
axisLabel= list(show=FALSE)) %>%
e_y_axis(max=100,
splitLine=list(show = FALSE),
axisTick=list(show=FALSE),
axisLine=list(show=FALSE),
axisLabel=list(show=FALSE)) %>%
e_color(color = c('#69cce6','#eee')) %>%
e_pictorial(value, symbol = path, z=10, name= 'realValue',
symbolBoundingData= 100, symbolClip= TRUE) %>%
e_pictorial(value, symbol = path, name= 'background',
symbolBoundingData= 100) %>%
e_labels(position = "bottom", offset= c(0, 10),
textStyle =list(fontSize= 20, fontFamily= 'Arial',
fontWeight ='bold',
color= '#69cce6'),
formatter="{@[1]}% {@[0]}") %>%
e_legend(show = FALSE) %>%
e_theme("westeros")

Show icon as label in plot

In label =, mention unicode of the fontawesome icon.

library(ggplot2)
ggplot(mtcars) +
  geom_text(aes(mpg, wt, colour = factor(cyl)),
            label = "\uf1b9",
            family = "FontAwesome",
            size = 7)

Python Pandas : Drop columns from Dataframe

In this tutorial, we will cover how to remove or drop one or multiple columns from pandas dataframe.
What is pandas in Python?
pandas is a python package for data manipulation. It has several functions for the following data tasks:
  1. Drop or Keep rows and columns
  2. Aggregate data by one or more columns
  3. Sort or reorder data
  4. Merge or append multiple dataframes
  5. String Functions to handle text data
  6. DateTime Functions to handle date or time format columns
drop columns python
Import or Load Pandas library
To make use of any Python library, we first need to load it by using the import command.
import pandas as pd
import numpy as np
Let's create a fake dataframe for illustration
The code below creates 4 columns named A through D.
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
          A         B         C         D
0 -1.236438 -1.656038 1.655995 -1.413243
1 0.507747 0.710933 -1.335381 0.832619
2 0.280036 -0.411327 0.098119 0.768447
3 0.858730 -0.093217 1.077528 0.196891
4 -0.905991 0.302687 0.125881 -0.665159
5 -2.012745 -0.692847 -1.463154 -0.707779

Drop a column in python

In pandas, the drop( ) function is used to remove column(s). axis=1 tells Python that you want to apply the function on columns instead of rows.
df.drop(['A'], axis=1)
Column A has been removed. See the output shown below.
          B         C         D
0 -1.656038 1.655995 -1.413243
1 0.710933 -1.335381 0.832619
2 -0.411327 0.098119 0.768447
3 -0.093217 1.077528 0.196891
4 0.302687 0.125881 -0.665159
5 -0.692847 -1.463154 -0.707779
In order to create a new dataframe newdf storing remaining columns, you can use the command below.
newdf = df.drop(['A'], axis=1)
To delete the column permanently from original dataframe df, you can use the option inplace=True
df.drop(['A'], axis=1, inplace=True)
#Check columns in df after dropping column A
df.columns

Output
Index(['B', 'C', 'D'], dtype='object')

Remove Multiple Columns in Python

You can specify all the columns you want to remove in a list and pass it in drop( ) function.
Method I
df2 = df.drop(['B','C'], axis=1)
Method II
cols = ['B','C']
df2 = df.drop(cols, axis=1)
Select or Keep Columns
If you wish to select a column (instead of drop), you can use the command
df['A']
To select multiple columns, you can submit the following code.
df[['A','B']]

How to drop column by position number from pandas Dataframe?

You can find the name of the first column by using the command df.columns[0]. Indexing in Python starts from 0.
df.drop(df.columns[0], axis =1)
To drop multiple columns by position (first and third columns), you can specify the position in list [0,2].
cols = [0,2]
df.drop(df.columns[cols], axis =1)

Drop columns by name pattern

df = pd.DataFrame({"X1":range(1,6),"X_2":range(2,7),"YX":range(3,8),"Y_1":range(2,7),"Z":range(5,10)})
   X1  X_2  YX  Y_1  Z
0 1 2 3 2 5
1 2 3 4 3 6
2 3 4 5 4 7
3 4 5 6 5 8
4 5 6 7 6 9

Drop column whose name starts with letter 'X'

df.loc[:,~df.columns.str.contains('^X')]
How it works?
  1. ^X is a regular expression which matches column names beginning with the letter 'X'.
  2. df.columns.str.contains('^X') returns the array [True, True, False, False, False]:
    True where the condition is met, otherwise False.
  3. The ~ sign negates the condition.
  4. df.loc[ ] is used to select columns.
It can also be written like :
df.drop(df.columns[df.columns.str.contains('^X')], axis=1)
Other Examples
#Removing columns whose name contains string 'X'
df.loc[:,~df.columns.str.contains('X')]

#Removing columns whose name contains string either 'X' or 'Y'
df.loc[:,~df.columns.str.contains('X|Y')]

#Removing columns whose name ends with string 'X'
df.loc[:,~df.columns.str.contains('X$')]

Drop columns where percentage of missing values is greater than 50%

df = pd.DataFrame({'A':[1,3,np.nan,5,np.nan],
'B':[4,np.nan,np.nan,5,np.nan]
})
The percentage of missing values can be calculated as the mean of NaNs in each column.
cols = df.columns[df.isnull().mean()>0.5]
df.drop(cols, axis=1)
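To see the intermediate missing-value ratios before dropping anything, you can print df.isnull().mean(); a short sketch with the values implied by the data frame above:
df.isnull().mean()
# A    0.4
# B    0.6
# dtype: float64

# only column B crosses the 50% threshold, so only B is dropped
df.drop(df.columns[df.isnull().mean() > 0.5], axis=1)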

Python Matplotlib Tutorial – Learn Plotting in 3 hours

This tutorial outlines how to perform plotting and data visualization in python using Matplotlib library. The objective of this post is to get you familiar with the basics and advanced plotting functions of the library. It contains several examples which will give you hands-on experience in generating plots in python.

What is Matplotlib?

It is a powerful Python library for creating graphics or charts. It takes care of all of your basic and advanced plotting requirements in Python. It took inspiration from the MATLAB programming language and provides a similar MATLAB-like interface for graphics. The beauty of this library is that it integrates well with the pandas package, which is used for data manipulation. With the combination of these two libraries, you can easily perform data wrangling along with visualization and get valuable insights out of data. Like the ggplot2 library in R, matplotlib is the standard plotting library in Python and the most widely used library for charts.
visualization python

Basics of Matplotlib

The first step is to install and load the matplotlib library. It is already installed if you used Anaconda to set up your Python environment.
Install library
If matplotlib is not already installed, you can install it by using the command
pip install matplotlib
Import / Load Library
We will import Matplotlib's pyplot module and use the alias (short form) plt.
from matplotlib import pyplot as plt
Elements of Graph
Different elements or parts of a standard graph are shown in the image below -
basics of plot
Figure
You can think of the figure as a big canvas on which one or more sub-plots are drawn. In the graphics world, it is called the 'canvas'.
figure vs axes
Axes
You can call them 'sub-plots'.
Axis
It's the same x-axis or y-axis that you studied in school or college. A standard graph shows marks on the axis; in the matplotlib library these are called ticks, and the text or values at the ticks are called ticklabels.
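To see these elements in code, here is a small, purely illustrative sketch that creates one figure containing two axes (sub-plots) and customizes the ticks and ticklabels of the second one:
from matplotlib import pyplot as plt

# one figure (the canvas) containing two axes (sub-plots) side by side
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)

ax1.plot([1, 2, 3, 4], [10, 20, 25, 30])
ax1.set_title("First sub-plot")

ax2.bar([1, 2, 3], [3, 7, 5])
ax2.set_xticks([1, 2, 3])               # tick positions on the x-axis
ax2.set_xticklabels(["A", "B", "C"])    # ticklabels shown at those ticks
ax2.set_title("Second sub-plot")

plt.show()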
Basic Plot
x = [1, 2, 3, 4, 5]
y = [5, 7, 3, 8, 4]
plt.bar(x,y)
plt.show()
bar plot python
If you are using Jupyter Notebook, you can run the command %matplotlib inline once to display plots automatically, without needing to call plt.show() after generating each plot.

Functions used for different types of plots

The following table lists different graph types along with the matplotlib functions used to create them.
Type of Plot | Function
line plot (default) | plt.plot( )
vertical bar plot | plt.bar( )
horizontal bar plot | plt.barh( )
histogram | plt.hist( )
box plot | plt.boxplot( )
area / stacked plot | plt.stackplot( )
scatter plot | plt.scatter( )
pie plot | plt.pie( )
hexagonal bin plot | plt.hexbin( )
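As a quick, illustrative sketch of two of these functions (the data values are random and made up for the example):
import numpy as np
from matplotlib import pyplot as plt

data = np.random.randn(500)

# histogram of the random values
plt.hist(data, bins=20)
plt.show()

# scatter plot of two slices of the same data
plt.scatter(data[:100], data[100:200])
plt.show()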

Python : How to read CSV file with pandas

This tutorial explains how to read a CSV file in Python using the read_csv function of the pandas package. Without the read_csv function, it is not straightforward to import a CSV file using plain Python. Pandas is a powerful Python package for data manipulation and supports various functions to load and import data from different formats. Here we cover how to deal with common issues when importing a CSV file.

Install and Load Pandas Package
Make sure you have the pandas package installed on your system. If you set up Python using Anaconda, it comes with pandas so you don't need to install it again. Otherwise you can install it with the command pip install pandas. The next step is to load the package by running the following command. pd is an alias for the pandas package; we will use it instead of the full name "pandas".
import pandas as pd
Create Sample Data for Import
The program below creates a sample pandas dataframe which can be used further for demonstration.

dt = {'ID': [11, 12, 13, 14, 15],
'first_name': ['David', 'Jamie', 'Steve', 'Stevart', 'John'],
'company': ['Aon', 'TCS', 'Google', 'RBS', '.'],
'salary': [74, 76, 96, 71, 78]}
mydt = pd.DataFrame(dt, columns = ['ID', 'first_name', 'company', 'salary'])
The sample data looks like below -

ID first_name company salary
0 11 David Aon 74
1 12 Jamie TCS 76
2 13 Steve Google 96
3 14 Stevart RBS 71
4 15 John . 78
Save data as CSV in the working directory
Check working directory before you save your datafile.

import os
os.getcwd()
In case you want to change the working directory, you can specify it in the os.chdir( ) function. A single backslash is treated as an escape character in Python, so use two backslashes (or a raw string) when specifying a Windows file location.

os.chdir("C:\\Users\\DELL\\Documents\\")
The following command tells python to write data in CSV format in your working directory.

mydt.to_csv('workingfile.csv', index=False)

Example 1 : Read CSV file with header row

It's the basic syntax of read_csv() function. You just need to mention the filename. It assumes you have column names in first row of your CSV file.

mydata = pd.read_csv("workingfile.csv")
It stores the data the way it should be, as we have headers in the first row of our data file. It is important to highlight that header=0 is the default value, so we don't need to mention the header= parameter; it means the header is taken from the first row, as indexing in Python starts from 0. The above code is equivalent to pd.read_csv("workingfile.csv", header=0).
Inspect data after importing

mydata.shape
mydata.columns
mydata.dtypes
It returns 5 rows and 4 columns. The column names are ['ID', 'first_name', 'company', 'salary'].

See the column types of data we imported. first_name and company are character variables. Remaining variables are numeric ones.


ID int64
first_name object
company object
salary int64

Example 2 : Read CSV file with header in second row

Suppose you have column or variable names in second row. To read this kind of CSV file, you can submit the following command.
mydata = pd.read_csv("workingfile.csv", header = 1)
header=1 tells Python to pick the header from the second row, i.e. it sets the second row as the header. It's not a realistic example; I used it for illustration so that you get an idea of how to solve this situation. To make it practical, you can add random values to the first row of the CSV file and then import it again.

11 David Aon 74
0 12 Jamie TCS 76
1 13 Steve Google 96
2 14 Stevart RBS 71
3 15 John . 78
Define your own column names instead of header row from CSV file

mydata0 = pd.read_csv("workingfile.csv", skiprows=1, names=['CustID', 'Name', 'Companies', 'Income'])
skiprows=1 means we are ignoring the first row, and the names= option is used to assign variable names manually.

CustID Name Companies Income
0 11 David Aon 74
1 12 Jamie TCS 76
2 13 Steve Google 96
3 14 Stevart RBS 71
4 15 John . 78
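In the sample data above, the '.' in the company column looks like a missing-value marker rather than a real company name. A small sketch, assuming you want '.' treated as missing, using the na_values parameter of read_csv:
mydata1 = pd.read_csv("workingfile.csv", na_values=['.'])

   ID first_name company  salary
0  11      David     Aon      74
1  12      Jamie     TCS      76
2  13      Steve  Google      96
3  14    Stevart     RBS      71
4  15       John     NaN      78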

Python list comprehension with Examples

This tutorial covers how list comprehension works in Python. It includes many examples which would help you to familiarize the concept and you should be able to implement it in your live project at the end of this lesson.

What is list comprehension?

Python is an object-oriented programming language; almost everything in it is treated consistently as an object. Python also supports a functional programming style, which is very similar to the mathematical way of approaching a problem: you pass inputs to a function and you always get the same output for the same input values. Given a function f(x) = x², f(x) will always return the same result for the same x value. Such a function has no "side effects", meaning the operation has no effect on any variable or object outside its intended usage; a "side effect" refers to a leak in your code which modifies a mutable data structure or variable.
List comprehension is a part of functional programming which provides a crisp way to create lists without writing a for loop.
list comprehension python
In the image above, the for clause iterates through each item of the list. The if clause filters the list and keeps only those items where the filter condition is met. The if clause is optional, so you can omit it if you don't have a conditional statement.

[i**3 for i in [1,2,3,4] if i>2] means: take the items of the list [1,2,3,4] one by one and check whether each is greater than 2. If yes, take its cube; otherwise ignore the value. The result is a list of the cubes of 3 and 4. Output : [27, 64]
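For comparison, here is a sketch of the same cube example written as a plain for loop:
cubes = []
for i in [1, 2, 3, 4]:
    if i > 2:
        cubes.append(i**3)

cubes
Output : [27, 64]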

List Comprehension vs. For Loop vs. Lambda + map()

All three are different programming styles for iterating through each element of a list, but they serve the same purpose and return the same output. There are some differences between them, as shown below.
1. List comprehension is more readable than For Loop and Lambda function.
List Comprehension

[i**2 for i in range(2,10)]
For Loop

sqr = []
for i in range(2,10):
sqr.append(i**2)
sqr
Lambda + Map

list(map(lambda i: i**2, range(2, 10)))

Output
[4, 9, 16, 25, 36, 49, 64, 81]
List comprehension performs a loop operation and combines the items into a list in just a single line of code. It is more understandable and clearer than the for loop and lambda versions.

range(2,10) returns 2 through 9 (excluding 10).

**2 refers to the square (the number raised to the power of 2). sqr = [] creates an empty list, and the append( ) function stores the output of each iteration (i.e. the squared value) of the for loop.

map( ) applies the lambda function to each item of the iterable (list). Wrap it in list( ) to generate a list as output.
