This is the third in a series of articles that uses BeautifulSoup to scrape Yelp restaurant reviews and then applies Machine Learning to extract insights from the data. In this article, you will use the code below to extract all the reviews into a list. The script is as follows:
import requests
from bs4 import BeautifulSoup
import time
from textblob import TextBlob
import pandas as pd

# we use these arguments to scrape the website
rest_dict = [
    {
        "name": "the-cortez-raleigh",
        "link": "https://www.yelp.com/biz/the-cortez-raleigh?osq=Restaurants&start=",
        "pages": 3
    },
    {
        "name": "rosewater-kitchen-and-bar-raleigh",
        "link": "https://www.yelp.com/biz/rosewater-kitchen-and-bar-raleigh?osq=Restaurants&start=",
        "pages": 3
    }
]

# scraping function
def scrape(rest_list):
    all_comment_list = list()
    for rest in rest_list:
        comment_list = list()
        for pag in range(1, rest['pages']):
            try:
                # wait between requests to avoid hammering the server
                time.sleep(5)
                # URL = "https://www.yelp.com/biz/the-cortez-raleigh?osq=Restaurants&start=" + str(pag*10) + "&sort_by=rating_asc"
                URL = rest['link'] + str(pag * 10)
                print(rest['name'], 'downloading page ', pag * 10)
                page = requests.get(URL)
                # next step: parsing
                soup = BeautifulSoup(page.content, 'lxml')
                for comm in soup.find("yelp-react-root").find_all("p", {"class": "comment__373c0__Nsutg css-n6i4z7"}):
                    comment_list.append(comm.find("span").decode_contents())
                    print(comm.find("span").decode_contents())
            except:
                print("could not work properly!")
        all_comment_list.append([comment_list, rest['name']])
    return all_comment_list

# store all reviews in a list
reviews = scrape(rest_dict)
In this script, the output of the function is saved in a variable named reviews. Printing the variable shows the result:
The nested list's structure follows this pattern:
[[[review1, review2], restaurant1], [[review1, review2], restaurant2]]
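Given this structure, each element of reviews pairs a list of review strings with a restaurant name, so the items can be unpacked directly. This is an illustrative sketch, not part of the original script:

first_reviews, first_name = reviews[0]  # the first restaurant's entry
print(first_name)                       # e.g. 'the-cortez-raleigh'
print(len(first_reviews))               # how many reviews were scraped for it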
The list will now be converted into a DataFrame using pandas.
Now that you have a list pairing each set of reviews with its restaurant, you need a DataFrame to hold all of the information.
df = pd.DataFrame(reviews)
If we pass this nested list into a DataFrame directly, we end up with one column full of lists and another column holding a single restaurant name per row. To reshape the data correctly, we will use the explode method, which creates a separate row for each element of the list in the column it is applied to, in this example column 0.
df = df.explode(0)
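To see what explode does, here is a toy example with the same shape as our data (the names are hypothetical):

toy = pd.DataFrame([[['rev1', 'rev2'], 'rest1'],
                    [['rev3', 'rev4'], 'rest2']])
print(toy.explode(0))
# Each list element gets its own row, and the original index repeats:
#       0      1
# 0  rev1  rest1
# 0  rev2  rest1
# 1  rev3  rest2
# 1  rev4  rest2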
The dataset is now appropriately structured: each review has a restaurant associated with it.
Because the exploded rows still carry the original indices (only 0 and 1, repeated), the only thing left to do is reset the index.
df = df.reset_index(drop=True)
df[0:10]
Extracting the star rating that was assigned to each review on the website is complicated, so we will use sentiment analysis to compensate for the missing information. The NLP model's inferred polarity values will stand in for each review's star rating. Keep in mind that working with this information is an experiment: sentiment analysis depends on the model we employ, which is not always precise.
We will use TextBlob, a simple library that already includes a pre-trained model for the task. Because it has to be applied to every review, we will first write a function that returns the estimated sentiment of a paragraph as a polarity score between -1 and 1.
def perform_sentiment(x):
    testimonial = TextBlob(x)
    # testimonial.sentiment is a (polarity, subjectivity) pair;
    # we only need the polarity, in the range -1 to 1
    return testimonial.sentiment.polarity
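As a quick sanity check, the function can be called on a couple of sample sentences. The exact scores depend on TextBlob's lexicon, so treat the numbers as indicative:

print(perform_sentiment("The food was great and the staff were lovely."))    # positive, close to 1
print(perform_sentiment("The food was cold and the service was terrible."))  # negative, close to -1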
After defining the function, we will use pandas' apply method to add a new column to our dataset that holds the analysis results. The sort_values method will then be used to sort all of the reviews, starting with the most negative ones. A minimal sketch of these two steps is shown below.
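This sketch assumes the review text sits in column 0; the name of the new column, sentiment, is our choice rather than anything fixed by the article:

# score every review, then sort so the most negative come first
df['sentiment'] = df[0].apply(perform_sentiment)
df = df.sort_values(by='sentiment', ascending=True)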
The final dataset holds each review alongside its sentiment score, with the most negative reviews first.
To continue with the experiment, we will now extract the most frequently used words from a portion of the dataset. However, there is a problem. Although certain words share the same root, such as "eating" and "ate," the algorithm will not automatically place them in the same category, because as raw strings they are simply different tokens. As a solution to this difficulty, we will employ lemmatization, an NLP pre-processing approach.
Lemmatization isolates the root form (lemma) of a word, removing inflectional variation and allowing the data to be normalized. Lemmatizers are statistical models that come pre-trained. To import a lemmatizer, we will use the spacy library.
!pip install spacy
# the pre-trained English model must be downloaded separately
!python -m spacy download en_core_web_sm
Spacy is an open-source NLP library that includes a lemmatizer and many pre-trained models. The function below lemmatizes all of the words in a single text and returns the frequency of each term (the number of times it has been used). We will arrange the results in descending order of frequency to show which words appear most often in a set of reviews.
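Before looking at the full function, a quick sketch shows what the lemmatizer produces for inflected forms (this assumes the small English model, en_core_web_sm, has been downloaded as above):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We ate early and are eating again now")
for token in doc:
    print(token.text, '->', token.lemma_)  # both 'ate' and 'eating' map to 'eat'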
def top_frequent(text, num_words):
    # return the num_words most frequent lemmas in text
    import spacy
    from collections import Counter

    # spacy.load("en") no longer works in recent spaCy; load the model by name
    nlp = spacy.load("en_core_web_sm")

    # lemmatization: replace every token with its lemma
    doc = nlp(text)
    lemmatized = ' '.join(token.lemma_ for token in doc)

    # remove stopwords and punctuation, then count what is left
    doc = nlp(lemmatized)
    words = [token.text for token in doc
             if not token.is_stop and not token.is_punct]
    word_freq = Counter(words)
    common_words = word_freq.most_common(num_words)
    return common_words
We will extract the most common words from the worst-rated reviews rather than from the complete list. The data has already been sorted with the worst ratings at the front, so all that remains is to build a single string that contains those reviews. To convert the review list into a string, we will use the join function.
text = ' '.join(list(df[0].values[0:20]))
top_frequent(text, 100)

The output (truncated) looks like this:

[('great', 22), ('<', 21), ('come', 16), ('order', 16), ('place', 14), ('little', 10), ('try', 10), ('nice', 10), ('food', 10), ('restaurant', 10), ('menu', 10), ('day', 10), ('butter', 9), ('drink', 9), ('dinner', 8), ...
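Note the '<' token near the top of the list: because the scraper stored reviews with decode_contents(), leftover HTML tags (such as <br>) survive into the text. A possible cleanup, sketched here rather than taken from the original script, is to strip the markup with BeautifulSoup before counting:

from bs4 import BeautifulSoup

# strip residual HTML tags from the concatenated reviews
clean_text = BeautifulSoup(text, 'lxml').get_text(' ')
top_frequent(clean_text, 100)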
If you are looking to perform an EDA on Yelp data, contact Foodspark today! We will get back to you as soon as we receive your message.