In this tutorial, we will cover how to extract information from a matrimonial website using R. We will do web scraping, which is the process of converting data available in an unstructured format on a website into a structured format that can be used for further analysis.
We will use an R package called rvest, which was created by Hadley Wickham. This package simplifies the process of scraping web pages.
Install the required packages
To download and install the rvest package, run the following command. We will also use dplyr which is useful for data manipulation tasks.
install.packages("rvest")
install.packages("dplyr")
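If you rerun the script often, reinstalling both packages every time is unnecessary. The sketch below checks which packages are missing first. The helper name missing_packages is made up for illustration, and the interactive() guard is an assumption so that nothing gets installed from a non-interactive run.

```r
# Packages this tutorial relies on
needed <- c("rvest", "dplyr")

# Hypothetical helper: return the subset of `pkgs` that is not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)

# Assumption: only install when a person is running the session
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```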
Load the required Libraries
To load the packages into your R session, run the code below.
library(rvest)
library(dplyr)
Scrape Information from Matrimonial Website
First we need to understand the structure of the URL. See the URLs below.
https://www.jeevansathi.com/punjabi-brides-girls
https://www.jeevansathi.com/punjabi-grooms-boys
The first URL takes you to the webpage where girls' profiles from the Punjabi community are shown, whereas the second URL shows boys' profiles from the Punjabi community.
We need to split the main URL into its elements so that we can construct it for any combination of inputs.
Main_URL = Static_URL + Mother_Tongue + Brides_Grooms
The following R code shows how to prepare the main URL. In the code, you need to provide the following details -
- Whether you are looking for girls'/boys' profiles. Type bride to see girls' profiles. Enter groom to check out boys' profiles.
- Select Mother Tongue. For example, punjabi, tamil etc.
# Looking for bride/groom
Bride_Groom = "bride"
# Possible Values : bride, groom
# Select Mother Tongue
Mother_Tongue = "punjabi"
# Possible Values
# punjabi
# tamil
# bengali
# telugu
# kannada
# marathi
# URL
if (tolower(Bride_Groom) == "bride") {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-brides-girls')
} else {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-grooms-boys')
}
# Print the URL
html
See the output :
[1] "https://www.jeevansathi.com/punjabi-brides-girls"
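The same URL logic can be wrapped in a function so that a typo in the input fails early instead of silently falling into the groom branch. This is only a sketch: the function name build_listing_url and the input check are assumptions added for illustration, not part of the original script.

```r
# Hypothetical helper: build the listing URL from the two inputs
build_listing_url <- function(bride_groom, mother_tongue) {
  bride_groom <- tolower(bride_groom)
  # Fail early on anything other than the two supported values
  if (!bride_groom %in% c("bride", "groom")) {
    stop("bride_groom must be 'bride' or 'groom'")
  }
  suffix <- if (bride_groom == "bride") "-brides-girls" else "-grooms-boys"
  paste0("https://www.jeevansathi.com/", tolower(mother_tongue), suffix)
}

build_listing_url("Bride", "Punjabi")
# [1] "https://www.jeevansathi.com/punjabi-brides-girls"
build_listing_url("groom", "tamil")
# [1] "https://www.jeevansathi.com/tamil-grooms-boys"
```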
Extract Profile IDs
First you need to select parts of an HTML document using CSS selectors with html_nodes(). Use SelectorGadget, which is a free Chrome extension. It is the easiest and quickest way to find out which selector pulls the data that you are interested in.
How to use SelectorGadget : Click on a page element that you would like your selector to match (it will turn green). It will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
profileIDs = data.frame(ID = text)
ID
1 ZARX0345
2 ZZWX5573
3 ZWVT2173
4 ZAYZ6100
5 ZYTS6885
6 ZXYV9849
7 TRZ8475
8 VSA7284
9 ZXTU1965
10 ZZSA6877
11 ZZSZ6545
12 ZYSW4809
13 ZARW2199
14 ZRSY0723
15 ZXAT2801
16 ZYXX8818
17 ZAWA8567
18 WXZ2147
19 ZVRT8875
20 ZWWR9533
21 ZYXW4043
The basic functions in rvest are very user-friendly and robust. These functions are explained below -
- read_html() : creates an HTML document from a URL
- html_nodes() : extracts pieces out of HTML documents
- html_nodes(".class") : selects nodes based on their CSS class
- html_nodes("#class") : selects the node (for example a <div>, <span> or <pre>) whose id is "class"
- html_text() : extracts only the text from an HTML tag
- html_attr() : extracts the contents of a single attribute
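To see these functions work together without hitting a live website, you can parse a small HTML string. The markup, IDs and links below are made up and only loosely imitate the listing page. The selector is the one used in this tutorial.

```r
library(rvest)

# Made-up markup that loosely imitates a listing page
page <- read_html('
  <div class="profileContent">
    <div class="color11"><a href="/profile/AAA0001">AAA0001</a></div>
    <div class="color11"><a href="/profile/BBB0002">BBB0002</a></div>
  </div>')

# Select every <a> inside .color11 inside .profileContent
links <- page %>% html_nodes(".profileContent .color11 a")

ids   <- links %>% html_text()        # text between the <a> tags
hrefs <- links %>% html_attr("href")  # value of the href attribute

data.frame(ID = ids, Link = hrefs)
```

html_text() and html_attr() return one value per matched node, so the two vectors line up row by row in the data frame.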
Difference between .class and #class
1. .class targets the following element:
<div class="class"></div>
2. #class targets the following element:
<div id="class"></div>
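A quick way to check the difference is to run both kinds of selector against a tiny document. The class name note and the id summary below are invented for this sketch.

```r
library(rvest)

# Made-up document: two elements share a class, one element has an id
page <- read_html('
  <div class="note">first note</div>
  <div class="note">second note</div>
  <div id="summary">the only summary</div>')

by_class <- page %>% html_nodes(".note") %>% html_text()
by_id    <- page %>% html_nodes("#summary") %>% html_text()

by_class  # both elements that carry class "note"
by_id     # the single element whose id is "summary"
```

A class can be shared by many elements, while an id is meant to be unique within a page, which is why the class selector returns two values here and the id selector returns one.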
Get Detailed Information of Profiles
The program below performs the following tasks -
- Loop through profile IDs
- Pull information about Age, Height, Qualification etc.
- Extract details about appearance
- Fetch 'About Me' section of profiles
# Get Detailed Information
finaldf = data.frame()
for (i in seq_along(profileIDs$ID)) {
  ID = as.character(profileIDs$ID[i])
  link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)
  # Download and parse the profile page once, then reuse it
  page = read_html(link)
  # One column per list item (Age, Height, Qualification etc.)
  FormattedInfo = data.frame(t(page %>% html_nodes(".textTru li") %>%
                                 html_text()))
  # Final Table
  FormattedInfo = data.frame(ProfileID = ID,
                             Description = page %>%
                               html_nodes("#myinfoView") %>%
                               html_text(),
                             Appearance = page %>%
                               html_nodes("#section-lifestyle #appearanceView") %>%
                               html_text(),
                             FormattedInfo)
  finaldf = bind_rows(finaldf, FormattedInfo)
}
# Assign Variable Names
names(finaldf) = c("ProfileID", "Description", "Appearance", "Age_Height", "Qualification", "Location", "Profession", "Mother Tongue", "Salary", "Religion", "Status", "Has_Children")
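One thing to watch in a loop like this: if a selector matches nothing on some profile, html_text() returns a zero-length vector and data.frame() stops with an error about differing numbers of rows. A possible guard is sketched below. The helper text_or_na is hypothetical, and the sample HTML is made up.

```r
library(rvest)

# Hypothetical helper: text of the first match, or NA when nothing matches
text_or_na <- function(page, css) {
  found <- page %>% html_nodes(css) %>% html_text()
  if (length(found) == 0) NA_character_ else trimws(found[1])
}

# Made-up profile page that has a description but no appearance section
profile <- read_html('<div id="myinfoView"> Likes hiking </div>')

text_or_na(profile, "#myinfoView")      # "Likes hiking"
text_or_na(profile, "#appearanceView")  # NA
```

Using such a helper for Description and Appearance would let the loop record NA for an incomplete profile instead of stopping.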
Download Display Pic
To download the display pic, you first need to fetch the image URL of the profile and then call the download.file( ) function to download it. In the script below, you need to provide a profile ID.
# Download Profile Pic of a particular Profile
ID = "SWU6683"
text3 = read_html(html) %>% html_nodes(".vtop") %>% html_attr('src')
pic = data.frame(cbind(profileIDs, URL = text3[!is.na(text3)]))
download.file(as.character(pic$URL[match(ID, pic$ID)]), "match.jpg", mode = "wb")
# File saved as match.jpg
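Note that match() returns NA when the profile ID is not among the IDs scraped from the listing page, and download.file() then receives NA instead of a URL. The sketch below adds an explicit check. The helper find_pic_url and the demo data are made up for illustration.

```r
# Hypothetical helper: look up the image URL for a profile ID,
# stopping with a clear message when the ID is not in the table
find_pic_url <- function(pic, id) {
  row <- match(id, pic$ID)
  if (is.na(row)) {
    stop("Profile ID not found on this page: ", id)
  }
  as.character(pic$URL[row])
}

# Made-up table with the same columns as `pic` above
pic_demo <- data.frame(ID  = c("AAA0001", "BBB0002"),
                       URL = c("https://example.com/a.jpg",
                               "https://example.com/b.jpg"))

find_pic_url(pic_demo, "BBB0002")
# [1] "https://example.com/b.jpg"

# With the real table you could then call, for example:
# download.file(find_pic_url(pic, ID), paste0(ID, ".jpg"), mode = "wb")
```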
Disclaimer
We have accessed only publicly available data which does not require login or registration. The purpose is not to cause any damage or copy the content from the website.
Other Functions of rvest
You can extract, modify and submit forms with html_form(), set_values() and submit_form(). From rvest 1.0.0 onward, set_values(), submit_form() and html_session() are deprecated in favour of html_form_set(), session_submit() and session(). Refer to the case study below -
You can collect Google search results by submitting the Google search form with a search term. Here, I entered 'Datascience' as the search term.
library(rvest)
url = "http://www.google.com"
pgsession = html_session(url)
pgform = html_form(pgsession)[[1]]
# Set search term
filled_form = set_values(pgform, q="Datascience")
session = submit_form(pgsession,filled_form)
# look for headings of first page
session %>% html_nodes(".g .r a") %>% html_text()
[1] "Data science - Wikipedia"
[2] "Data Science Courses | Coursera"
[3] "Data Science | edX"
[4] "Data science - Wikipedia"
[5] "DataScience.com | Enterprise Data Science Platform Provider"
[6] "Top Data Science Courses Online - Updated February 2018 - Udemy"
[7] "Data Science vs. Big Data vs. Data Analytics - Simplilearn"
[8] "What Is Data Science? What is a Data Scientist? What is Analytics?"
[9] "Online Data Science Courses | Microsoft Professional Program"
[10] "News for Datascience"
[11] "Data Science Course - Cognitive Class"
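If you want to try the form-filling step without a network connection, you can parse a form from an HTML string. This sketch assumes rvest 1.0.0 or later, where html_form_set() is the current name for filling in form fields. The form markup is made up and is not Google's actual search form.

```r
library(rvest)

# Made-up search form standing in for a real page
page <- read_html('
  <form action="/search" method="get">
    <input type="text" name="q" value="">
    <input type="submit" name="btn" value="Search">
  </form>')

# html_form() returns a list of forms; take the first one
form <- html_form(page)[[1]]

# Set the search term on the field named "q"
filled <- html_form_set(form, q = "Datascience")

filled$fields$q$value
# [1] "Datascience"
```

Against a live site, you would start from session(url) and pass the filled form to session_submit() to send the request.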