Web Scraping Matrimonial Website with R

In this tutorial, we will cover how to extract information from a matrimonial website using R. Web scraping is the process of converting data that appears on a website in an unstructured format into a structured format that can be used for analysis.

We will use an R package called rvest, which was created by Hadley Wickham. This package simplifies the process of scraping web pages.

Web Scraping in R

Install the required packages

To download and install the rvest package, run the following commands. We will also use dplyr, which is useful for data manipulation tasks.

install.packages("rvest")
install.packages("dplyr")

Load the required Libraries

To load the libraries, run the code below.

library(rvest)
library(dplyr)

Scrape Information from Matrimonial Website

First we need to understand the structure of the URL. See the URLs below.
https://www.jeevansathi.com/punjabi-brides-girls
https://www.jeevansathi.com/punjabi-grooms-boys

The first URL takes you to the page that lists girls' profiles from the Punjabi community, whereas the second URL lists boys' profiles from the Punjabi community.

We need to split the main URL into its component parts so that we can build it from our inputs.

Main_URL = Static_URL + Mother_Tongue + Brides_Grooms

The following R code shows how to prepare the main URL. In the code, you need to provide the following details -

Whether you are looking for girls' or boys' profiles. Enter bride to see girls' profiles or groom to see boys' profiles.
The mother tongue. For example, punjabi, tamil etc.

# Looking for bride/groom
Bride_Groom = "bride"
# Possible Values : bride, groom

# Select Mother Tongue
Mother_Tongue = "punjabi"
# Possible Values
# punjabi
# tamil
# bengali
# telugu
# kannada
# marathi

# URL (stored in the variable html, which later code passes to read_html())
if (tolower(Bride_Groom) == "bride") {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-brides-girls')
} else {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-grooms-boys')
}

Printing html shows the URL that was built :

[1] "https://www.jeevansathi.com/punjabi-brides-girls"
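
The same logic can also be wrapped in a small function that checks its inputs. This is only a minimal sketch: the function name build_profile_url() is hypothetical, and it assumes the site keeps the URL pattern shown above.

```r
# Hypothetical helper: builds the listing URL from the two inputs.
# Assumes the URL pattern shown above (static part + mother tongue + suffix).
build_profile_url <- function(bride_groom, mother_tongue) {
  # match.arg() stops with an error for anything other than "bride" or "groom"
  bride_groom <- match.arg(tolower(bride_groom), c("bride", "groom"))
  suffix <- if (bride_groom == "bride") "-brides-girls" else "-grooms-boys"
  paste0("https://www.jeevansathi.com/", tolower(mother_tongue), suffix)
}

url_brides <- build_profile_url("bride", "punjabi")
url_grooms <- build_profile_url("Groom", "Tamil")
```

Because the inputs are lower-cased first, "Groom" and "groom" give the same URL, and a typo such as "grom" fails immediately instead of silently falling through to the else branch.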

Extract Profile IDs

First you need to select parts of an HTML document using CSS selectors with html_nodes(). SelectorGadget, a free Chrome extension, is the easiest and quickest way to find out which selector pulls the data that you are interested in.

How to use SelectorGadget : Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element and highlight (in yellow) everything that is matched by the selector.

text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
profileIDs = data.frame(ID = text)

         ID
1  ZARX0345
2  ZZWX5573
3  ZWVT2173
4  ZAYZ6100
5  ZYTS6885
6  ZXYV9849
7   TRZ8475
8   VSA7284
9  ZXTU1965
10 ZZSA6877
11 ZZSZ6545
12 ZYSW4809
13 ZARW2199
14 ZRSY0723
15 ZXAT2801
16 ZYXX8818
17 ZAWA8567
18  WXZ2147
19 ZVRT8875
20 ZWWR9533
21 ZYXW4043

The basic functions in rvest are user-friendly. They are explained below -

read_html() : creates an HTML document from a URL, a file or a string of HTML
html_nodes() : extracts pieces out of HTML documents
html_nodes(".class") : selects nodes by CSS class
html_nodes("#id") : selects the node with the given id attribute (for example on a <div>, <span> or <pre>)
html_text() : extracts only the text from an HTML tag
html_attr() : extracts the contents of a single attribute
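
To see these functions together without hitting a live website, here is a small sketch that parses an HTML string. The fragment, its IDs and its links are made up for illustration; only the class names are borrowed from the selector used above.

```r
library(rvest)

# A made-up HTML fragment; the profile IDs and links are illustrative only
page <- read_html('<div class="profileContent">
    <span class="color11"><a href="/profile/AAA0001">AAA0001</a></span>
    <span class="color11"><a href="/profile/BBB0002">BBB0002</a></span>
  </div>')

# Select every <a> inside .color11 inside .profileContent
links <- page %>% html_nodes(".profileContent .color11 a")

ids   <- html_text(links)          # text inside each <a> tag
hrefs <- html_attr(links, "href")  # value of the href attribute of each <a> tag
```

Here ids holds the two profile IDs and hrefs holds the two relative links, one element per matched node.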

Difference between .class and #id

1. .class targets every element whose class attribute is "class":

<div class="class"></div>

2. #class targets the single element whose id attribute is "class":

<div id="class"></div>
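
A short sketch of the difference, again on a made-up HTML string (the id "summary" and the class "detail" are hypothetical names chosen for this example):

```r
library(rvest)

# Illustrative fragment: one element matched by id, two matched by class
doc <- read_html('<div id="summary">About me</div>
  <p class="detail">Age 28</p>
  <p class="detail">Height 5ft 6in</p>')

# "#summary" matches the single element with id="summary"
by_id <- doc %>% html_nodes("#summary") %>% html_text()

# ".detail" matches every element with class="detail"
by_class <- doc %>% html_nodes(".detail") %>% html_text()
```

An id is meant to be unique on a page, so by_id has one element, while by_class has one element per matching tag.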

Get Detailed Information of Profiles

The program below performs the following tasks -

Loop through profile IDs
Pull information about Age, Height, Qualification etc.
Extract details about appearance
Fetch 'About Me' section of profiles

# Get Detailed Information
finaldf = data.frame()
for (i in seq_along(profileIDs$ID)) {
  ID = as.character(profileIDs$ID[i])
  link = paste0("https://www.jeevansathi.com/profile/viewprofile.php?stype=4&username=", ID)

  # Download the profile page once and reuse it for every selector
  page = read_html(link)

  # One row, with one column per list item (Age and Height, Qualification etc.)
  FormattedInfo = data.frame(t(page %>% html_nodes(".textTru li") %>% html_text()))

  # Final Table
  FormattedInfo = data.frame(ProfileID = ID,
                             Description = page %>%
                               html_nodes("#myinfoView") %>%
                               html_text(),
                             Appearance = page %>%
                               html_nodes("#section-lifestyle #appearanceView") %>%
                               html_text(),
                             FormattedInfo)

  finaldf = bind_rows(finaldf, FormattedInfo)
}

# Assign Variable Names (assumes every profile returned the same nine list items)
names(finaldf) = c("ProfileID", "Description", "Appearance", "Age_Height", "Qualification", "Location", "Profession", "Mother Tongue", "Salary", "Religion", "Status", "Has_Children")
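
One practical issue with a loop like this is that data.frame() stops with an error when a selector matches nothing on a particular profile, because html_text() then returns a zero-length vector. A possible way around it, sketched below on made-up HTML strings, is to use html_node() (singular), which yields NA for a missing element. The helper name parse_profile() is hypothetical and the selectors are simplified for the example.

```r
library(rvest)

# Hypothetical helper: turns one parsed profile page into a one-row data frame.
# html_node() (singular) returns a missing node when nothing matches, and
# html_text() turns that into NA instead of a zero-length vector.
parse_profile <- function(page, id) {
  data.frame(
    ProfileID   = id,
    Description = page %>% html_node("#myinfoView") %>% html_text(),
    Appearance  = page %>% html_node("#appearanceView") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# Made-up pages: the second one has no appearance section
full    <- read_html('<div id="myinfoView">Hello</div><div id="appearanceView">Tall</div>')
partial <- read_html('<div id="myinfoView">Hi</div>')

rows <- rbind(parse_profile(full, "AAA0001"), parse_profile(partial, "BBB0002"))
```

The second row keeps its place in the table with NA in the Appearance column, so one incomplete profile does not abort the whole run.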

Download Display Pic

To download a display pic, you first need to fetch the image URL of the profile and then call the download.file() function to download it. In the script below, you need to provide a profile ID that appears on the listing page.

# Download Profile Pic of a particular Profile
ID = "SWU6683"
text3 = read_html(html) %>% html_nodes(".vtop") %>% html_attr('src')
pic = data.frame(cbind(profileIDs, URL = text3[!is.na(text3)]))
download.file(as.character(pic$URL[match(ID, pic$ID)]), "match.jpg", mode = "wb")
# File saved as match.jpg
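
Note that match() returns NA when the requested ID is not in the table, and download.file() would then fail on an NA URL. The sketch below shows one way to guard against that. The data frame, IDs and URLs are made up, and lookup_pic_url() is a hypothetical helper name.

```r
# Illustrative data: IDs and image URLs are made up
pic <- data.frame(
  ID  = c("AAA0001", "BBB0002"),
  URL = c("https://example.com/a.jpg", "https://example.com/b.jpg"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the image URL for an ID, or NA if it is not listed
lookup_pic_url <- function(pic, id) {
  pic$URL[match(id, pic$ID)]
}

found_url   <- lookup_pic_url(pic, "BBB0002")
missing_url <- lookup_pic_url(pic, "ZZZ9999")

# Only attempt the download when a URL was actually found
if (!is.na(missing_url)) {
  download.file(missing_url, "match.jpg", mode = "wb")
}
```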

Disclaimer

We have accessed only publicly available data which does not require login or registration. The purpose is not to cause any damage or copy the content from the website.

Other Functions of rvest

You can extract, modify and submit forms with html_form(), set_values() and submit_form(). In rvest 1.0.0 and later these names are deprecated: html_session() became session(), set_values() became html_form_set() and submit_form() became session_submit(). Refer to the case study below -

You can collect Google search results by submitting the Google search form with a search term that you supply. Here, I entered the search term 'Datascience'.

library(rvest)
url = "http://www.google.com"
pgsession = html_session(url)
pgform = html_form(pgsession)[[1]]

# Set search term
filled_form = set_values(pgform, q="Datascience")
session = submit_form(pgsession,filled_form)

# look for headings of first page
session %>% html_nodes(".g .r a") %>% html_text()

 [1] "Data science - Wikipedia"
 [2] "Data Science Courses | Coursera"
 [3] "Data Science | edX"
 [4] "Data science - Wikipedia"
 [5] "DataScience.com | Enterprise Data Science Platform Provider"
 [6] "Top Data Science Courses Online - Updated February 2018 - Udemy"
 [7] "Data Science vs. Big Data vs. Data Analytics - Simplilearn"
 [8] "What Is Data Science? What is a Data Scientist? What is Analytics?"
 [9] "Online Data Science Courses | Microsoft Professional Program"
[10] "News for Datascience"
[11] "Data Science Course - Cognitive Class"
