How Does Web Scraping Help In Extracting Yelp Reviews Data?


Yelp is a localized search engine for companies in your area. People talk about their experiences with that company in the form of reviews, which is a great source of information. Customer input can assist in identifying and prioritizing advantages and problems for future business development.

We now have access to a variety of sources thanks to the internet, where people are prepared to share their experiences with various companies and services. We may exploit this opportunity to gather some useful data and generate some actionable intelligence to provide the best possible client experience.

By scraping all of those reviews, we can collect a substantial amount of primary and secondary data, analyze it, and suggest areas for improvement. Python has packages that make these tasks very simple. We can choose the requests library for web scraping since it performs the job and is very easy to use.

By inspecting the website in a web browser, we can quickly understand its structure. Here is the list of possible data variables to collect after researching the layout of the Yelp website:

  • Reviewer’s name
  • Review
  • Date
  • Star ratings
  • Restaurant Name

The requests module makes it simple to download files from the internet. It can be installed as follows:

pip install requests

To begin, we go to the Yelp website and type in “restaurants near me” in the Chicago, IL area.

We’ll then import all of the necessary libraries and build an empty pandas DataFrame.

import pandas as pd
import time as t
from lxml import html
import requests

reviews_df = pd.DataFrame()

Downloading the HTML page using requests.get():

import requests
searchlink = ',+IL'
user_agent = 'Enter your user agent here'
headers = {'User-Agent': user_agent}

You can find your user agent by searching “what is my user agent” in your browser.

To scrape restaurant reviews for any other location on the same platform, simply run the search there and copy the resulting URL; all you have to do is provide that link.

page = requests.get(searchlink, headers = headers)
parser = html.fromstring(page.content)

requests.get() downloads the HTML page. Now we must search the page for the links to the various restaurants.

businesslink=parser.xpath('//a[@class="biz-name js-analytics-click"]')
links = [l.get('href') for l in businesslink]

Because these links are incomplete, we will need to add the domain name.

u = []
for link in links:
    u.append('' + str(link))

We now have all of the restaurant links from the very first page; each page holds 30 search results. Let’s go through them one by one and fetch their reviews.

for item in u:
    page = requests.get(item, headers=headers)
    parser = html.fromstring(page.content)

A div with the class name “review review--with-sidebar” contains the reviews. Let’s go ahead and grab all of these divs.

xpath_reviews = '//div[@class="review review--with-sidebar"]'
reviews = parser.xpath(xpath_reviews)

We would like to scrape the author name, review body, date, restaurant name, and star rating for each review.

for review in reviews:
    temp = review.xpath('.//div[contains(@class, "i-stars i-stars--regular")]')
    rating = [td.get('title') for td in temp]
    xpath_author = './/a[@id="dropdown_user-name"]//text()'
    xpath_body = './/p[@lang="en"]//text()'
    author = review.xpath(xpath_author)
    date = review.xpath('.//span[@class="rating-qualifier"]//text()')
    body = review.xpath(xpath_body)
    heading = parser.xpath('//h1[contains(@class, "biz-page-title embossed-text-white")]')
    bzheading = [td.text for td in heading]

For all of these objects, we’ll create a dictionary, which we’ll then store in a pandas data frame.

review_dict = {'restaurant': bzheading,
               'rating': rating,
               'author': author,
               'date': date,
               'Review': body}
reviews_df = reviews_df.append(review_dict, ignore_index=True)

We can now access all of the reviews from a single page. By determining the maximum page number, you can loop across the pages. An <a> tag with the class name “available-number pagination-links_anchor” contains the last page number.

page_nums = '//a[@class="available-number pagination-links_anchor"]'
pg = parser.xpath(page_nums)
max_pg = len(pg) + 1

With the script above we scraped a total of 23,869 reviews, covering about 450 restaurants with 20-60 reviews each.
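The page loop itself can be sketched as follows. Both the base search URL (elided in the original) and the `start` offset parameter are assumptions about Yelp's URL scheme, so adjust them to whatever the site actually uses.

```python
# Hypothetical sketch: search results advance 30 at a time via a
# "start" offset query parameter. The base URL and parameter name
# are assumptions, not taken from the original script.
base_url = ''
max_pg = 15  # as derived from the pagination links above

# One search-results URL per page of 30 restaurants
page_urls = [f'{base_url}&start={30 * i}' for i in range(max_pg)]
print(page_urls[1])  # second page starts at offset 30
```

Each of these URLs would then be fetched and parsed exactly like the first page.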

Let’s launch a Jupyter notebook and do some text mining and sentiment analysis.

First, get the libraries you’ll need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We will save the data in a file named all.csv.


Let’s take a look at the data frame’s head and tail.


We get 23,869 records with 5 columns. As can be seen, the data requires formatting: symbols, tags, and spaces that aren’t needed should be eliminated. There are a few Null/NaN values as well.

Remove all values that are Null/NaN from the data frame.
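A minimal sketch of this cleanup, with toy rows standing in for the scraped data:

```python
import pandas as pd
import numpy as np

# Toy frame: one clean row, one with a NaN rating, one with a missing review
data = pd.DataFrame({'rating': ['5.0', np.nan, '1.0'],
                     'Review': ['Great food', 'No rating here', None]})

data.dropna(inplace=True)                  # drop every row holding a Null/NaN
data.reset_index(drop=True, inplace=True)  # renumber the surviving rows
print(data.shape)
```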


We’ll now eliminate the extraneous symbols and spaces using string slicing.
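The exact slices depend on how each field comes back from the scraper; the raw values below are hypothetical examples of the kind of padding and suffixes the scraped fields carry.

```python
# Hypothetical raw field values with extraneous text and whitespace
rating_raw = '5.0 star rating'
date_raw = '\n        1/2/2020\n    '

rating = rating_raw[:3]   # slice off the " star rating" suffix, keeping '5.0'
date = date_raw.strip()   # remove the surrounding whitespace
print(rating, date)
```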


Exploring the data further


The most common rating is 5 stars, with 11,859 reviews, while 1-star reviews are the fewest, at 979. However, a few records have an uncertain ‘t’ grade. These records ought to be discarded.
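Tallies like these come from a simple value_counts() call; a sketch on toy data:

```python
import pandas as pd

# Toy ratings column including the stray 't' grade mentioned above
data = pd.DataFrame({'rating': ['5.0', '5.0', '1.0', 't', '5.0']})

print(data['rating'].value_counts())  # exposes the 't' records to discard
```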

data.drop(data[data.rating=='t'].index , inplace =True)

We can create a new feature called review_length to help us better understand the data. This column will store the number of characters in each review, excluding white spaces.

data['review_length'] = data['Review'].apply(lambda x: len(x) - x.count(' '))

Let’s make some graphs and analyze the data now.

hist = sns.FacetGrid(data=data, col='rating'), 'review_length', bins=50)

We can see that the counts of 4- and 5-star reviews are far higher. For all ratings, the distribution of review lengths is fairly similar.

Let’s do the same thing using a box plot.

sns.boxplot(x='rating', y='review_length', data=data)

According to the box plot, reviews with 2- and 3-star ratings are longer than reviews with a 5-star rating. However, the dots above the boxes indicate that there are several outliers for each star rating. As a result, review length isn’t going to be very helpful for our sentiment analysis.

Sentiment Analysis

Only the 1-star and 5-star ratings will be used to evaluate whether a review is positive or negative. Let’s make a new data frame that keeps just the one- and five-star ratings.

df = data[(data['rating'] == 1) | (data['rating'] == 5)]

Output: (12838, 6)

We now have 12,838 records for 1- and 5-star ratings out of 23,869 total records.

The review text must be properly formatted in order to be used for analysis. Let’s take a look at a sample to see what we’re up against.
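Printing one raw review through repr() makes hidden characters visible; the row below is a made-up example carrying the same ‘\xa0’ artifact discussed next.

```python
import pandas as pd

# Toy row containing the non-breaking-space artifact found in real reviews
df = pd.DataFrame({'Review': ['The food was\xa0amazing!!  5 stars...']})

sample = df['Review'].iloc[0]
print(repr(sample))  # repr() exposes hidden codes such as '\xa0'
```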


There appear to be numerous punctuation symbols as well as some unknown codes such as ‘\xa0’. In Latin-1 (ISO 8859-1), ‘\xa0’ is a non-breaking space, chr(160). We should substitute a space for it. Let’s now write a function that removes all punctuation and stopwords, and then lemmatizes the text.


Bag-of-words is a common method for text pre-processing in natural language processing. A bag-of-words represents a text as the collection of its words, regardless of grammar or word order. The model is widely employed, with the frequency of each word used to train a classifier.


Lemmatization is the process of grouping together the inflected forms of a word so that they can be analyzed as a single item, or lemma. Lemmatization always returns the dictionary form of a word. For example, the words typing, typed, and types will all be reduced to the single word “type.” This will be extremely beneficial to our analysis.

import string     # Imports the library
import nltk       # Imports the natural language toolkit'stopwords')   # Download the stopwords dataset'wordnet')     # Download the WordNet dataset
wn = nltk.WordNetLemmatizer()
from nltk.corpus import stopwords
stopwords.words('english')[0:10]

Output: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

These stopwords are frequently used and are of a neutral nature. They have no positive or negative meaning and can be ignored.

def text_process_lemmatize(revw):
    """
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Removes all stopwords
    3. Creates a list of the cleaned text
    4. Returns a lemmatized version of the list
    """
    # Replace the \xa0 with a space
    revw = revw.replace('\xa0', ' ')
    # Check characters to see if they are in punctuation
    nopunc = [char for char in revw if char not in string.punctuation]
    # Join the characters again to form the string
    nopunc = ''.join(nopunc)
    # Now just remove any stopwords
    token_text = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    # Perform lemmatization of the above list
    cleantext = ' '.join(wn.lemmatize(word) for word in token_text)
    return cleantext

Let’s use the function we just constructed to handle our review column.
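Applying it over the review column is a single pandas apply call. The sketch below uses a simplified stand-in cleaner (punctuation stripping only, so it runs without the NLTK downloads) in place of the real text_process_lemmatize:

```python
import string
import pandas as pd

def simple_clean(revw):
    # Stand-in for text_process_lemmatize: strip punctuation, lowercase
    nopunc = ''.join(ch for ch in revw if ch not in string.punctuation)
    return nopunc.lower()

df = pd.DataFrame({'Review': ['Great food!!', 'Bad, slow service.']})
df['LemmText'] = df['Review'].apply(simple_clean)  # real code: .apply(text_process_lemmatize)
print(df['LemmText'].tolist())
```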



The collection of lemmas in df[‘LemmText’] must be converted to vectors so that a machine learning model can use and interpret them. This procedure is called vectorizing. It generates a matrix with each review as a row and each unique lemma as a column, holding the number of occurrences of each lemma in each review. We will use the scikit-learn library’s CountVectorizer with n-grams; we’ll simply look at unigrams in this section.

from sklearn.feature_extraction.text import CountVectorizer
ngram_vect = CountVectorizer(ngram_range=(1,1))
X_counts = ngram_vect.fit_transform(df['LemmText'])

Train Test Split

Let’s use scikit-learn’s train_test_split to create training and testing data sets.

from sklearn.model_selection import train_test_split

X = X_counts
y = df['rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # test_size=0.3 means a 70% training set and a 30% test set

For sentiment prediction, we’ll utilise Multinomial Naive Bayes. Negative reviews carry a one-star rating, while favourable reviews carry a five-star rating. Let’s build a MultinomialNB model and fit it to the X_train and y_train data.

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(), y_train)

Now let’s make a prediction on the test set, X_test.

NBpredictions = nb.predict(X_test)


Let’s compare our model’s predictions to the actual star ratings from y_test.

from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(y_test, NBpredictions))

The model has a 97% accuracy, which is excellent. Based on the customer’s review, this algorithm can predict whether he liked or disliked the restaurant.

Looking to scrape Yelp data and other restaurant reviews? Contact Foodspark today or ask for a free quote!
