To scrape Amazon for book information, you first need to install the Beautiful Soup library. The easiest way to install BeautifulSoup is with pip, so make sure pip is installed.
!pip3 install beautifulsoup4
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/site-packages (4.7.1)
Requirement already satisfied: soupsieve>=1.2 in /usr/local/lib/python3.7/site-packages (from beautiful)
Importing Required Libraries
Now import the packages you will use to scrape data from the website and to visualize it with matplotlib, bokeh, and seaborn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
Extracting Amazon’s Best Selling Books
The URL you will scrape is: https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo) (if this link does not work, use the parent link). The pageNo value can be changed to access each page. To build the full dataset you need to go through all the pages, so you first need to find out the total number of pages on the website.
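The way the page number is substituted into the URL can be sketched as follows. This is only an illustration: build_url and BASE_URL are hypothetical names that do not appear in the tutorial's code, and no_pages = 2 simply mirrors the value set by hand later on.

```python
# Hypothetical helper: builds the bestseller URL for one page number.
BASE_URL = "https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_"

def build_url(page_no):
    # The page number appears twice: in the ref segment and in the pg query parameter.
    return f"{BASE_URL}{page_no}?ie=UTF8&pg={page_no}"

no_pages = 2  # assumed total number of pages, set by hand
urls = [build_url(p) for p in range(1, no_pages + 1)]
for url in urls:
    print(url)
```

Keeping the URL construction in one small function makes it easier to adjust if the page layout or query parameters change.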
To connect to the URL and fetch the HTML content, the following steps are needed:
- Define a get_data function that takes the page number as an argument.
- Define a user-agent header, which helps avoid being detected as a scraper.
- Pass the URL to requests.get, along with the user-agent header as an argument.
- Fetch the content using requests.get.
- Parse the fetched page and assign it to the soup variable.
The next step, which is very important, is to identify the parent tag under which all the required data resides. The data you will scrape includes:
- Book’s Name
- Author’s Name
- Ratings
- Customer Ratings
- Pricing
The image shows where the parent tag is located; when you hover over it, all the required elements are highlighted.
As with the parent tag, you need to find the attributes for the book name, author, rating, customers rated, and price. Visit the webpage you want to scrape, select the element, right-click on it, and choose Inspect Element. This helps you find the specific data fields to scrape from the HTML page, as shown in the figure below:
Some authors' names are not linked on Amazon, so you need an additional find for the author. In the code cell below, nested if-else conditions for the author name scrape either the author or the publication name.
no_pages = 2

def get_data(pageNo):
    # Browser-like headers so the request is less likely to be flagged as a scraper
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    r = requests.get('https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo), headers=headers)#, proxies=proxies)
    content = r.content
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(content, 'html.parser')

    alls = []
    # Each book sits inside this parent div
    for d in soup.findAll('div', attrs={'class':'a-section a-spacing-none aok-relative'}):
        name = d.find('span', attrs={'class':'zg-text-center-align'})
        # Only search for the cover image when the name tag exists
        n = name.find_all('img', alt=True) if name is not None else []
        author = d.find('a', attrs={'class':'a-size-small a-link-child'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        users_rated = d.find('a', attrs={'class':'a-size-small a-link-normal'})
        price = d.find('span', attrs={'class':'p13n-sc-price'})

        all1 = []

        # The book name is stored in the alt text of the cover image
        if n:
            all1.append(n[0]['alt'])
        else:
            all1.append("unknown-product")

        if author is not None:
            all1.append(author.text)
        else:
            # Fall back to the plain span used when there is no author link
            author = d.find('span', attrs={'class':'a-size-small a-color-base'})
            if author is not None:
                all1.append(author.text)
            else:
                all1.append('0')

        if rating is not None:
            all1.append(rating.text)
        else:
            all1.append('-1')

        if users_rated is not None:
            all1.append(users_rated.text)
        else:
            all1.append('0')

        if price is not None:
            all1.append(price.text)
        else:
            all1.append('0')

        alls.append(all1)
    return alls
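The author fallback in get_data can be tried in isolation on a small inline snippet, without contacting Amazon. The sketch below is illustrative only: the HTML is made up, the "item" class and the extract_author helper are hypothetical, and only the two author class strings are taken from the code above.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the first item links the author, the second only names a
# publisher in a span, and the third has neither.
html = """
<div class="item"><a class="a-size-small a-link-child">Paulo Coelho</a></div>
<div class="item"><span class="a-size-small a-color-base">Wonder House Books</span></div>
<div class="item"></div>
"""

def extract_author(item):
    # Try the linked author first, then fall back to the plain span.
    tag = item.find('a', attrs={'class': 'a-size-small a-link-child'})
    if tag is None:
        tag = item.find('span', attrs={'class': 'a-size-small a-color-base'})
    # '0' is the placeholder the scraper uses for a missing value.
    return tag.text if tag is not None else '0'

soup = BeautifulSoup(html, 'html.parser')
authors = [extract_author(d) for d in soup.find_all('div', attrs={'class': 'item'})]
print(authors)
```

Testing the lookup on static HTML like this makes it easier to notice when a class name no longer matches the live page.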
The next code cell does the following:
- Calls the get_data function inside a for loop.
- The for loop iterates over the page numbers, starting at 1 and stopping before no_pages+1.
- Since the output is a nested list, it is first flattened and then passed to a DataFrame.
- Finally, the dataframe is saved as a CSV file.
results = []
for i in range(1, no_pages+1):
    results.append(get_data(i))
flatten = lambda l: [item for sublist in l for item in sublist]
df = pd.DataFrame(flatten(results),columns=['Book Name','Author','Rating','Customers_Rated', 'Price'])
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')
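The flattening step can also be written with itertools.chain from the standard library. The sketch below uses made-up rows as a stand-in for what get_data returns (one list of rows per page); it is an alternative to the lambda above, not the tutorial's own method.

```python
from itertools import chain

# Made-up stand-in for the scraper output: one list of rows per page.
results = [
    [["Book A", "Author A", "4.5 out of 5 stars", "1,200", "₹ 99.00"]],
    [["Book B", "Author B", "4.1 out of 5 stars", "87", "₹ 250.00"],
     ["Book C", "0", "-1", "0", "0"]],
]

# chain.from_iterable removes exactly one level of nesting (pages -> rows),
# leaving each row's five fields intact.
rows = list(chain.from_iterable(results))
print(len(rows))
```

Each element of rows is still a five-item list, which is the shape pd.DataFrame expects when the five column names are supplied.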
Read a CSV File
Now load the CSV file that you created and saved in the previous cell. This step is optional; you can also use the dataframe df directly and skip it.
df = pd.read_csv("amazon_products.csv")
df.shape
(100, 5)
The dataframe's shape shows that there are 100 rows and 5 columns in the CSV file.
Now print the first 61 rows of the dataset; the notebook truncates the middle of the display.
df.head(61)
Index | Book Name | Author | Rating | Customers_Rated | Price
0 | The Power of your Subconscious Mind | Joseph Murphy | 4.5 out of 5 stars | 13,948 | ₹ 99.00
1 | Think and Grow Rich | Napoleon Hill | 4.5 out of 5 stars | 16,670 | ₹ 99.00
2 | Word Power Made Easy | Norman Lewis | 4.4 out of 5 stars | 10,708 | ₹ 130.00
3 | Mathematics for Class 12 (Set of 2 Vol.) Exami... | R.D. Sharma | 4.5 out of 5 stars | 18 | ₹ 930.00
4 | The Girl in Room 105 | Chetan Bhagat | 4.3 out of 5 stars | 5,162 | ₹ 149.00
... | ... | ... | ... | ... | ...
56 | COMBO PACK OF Guide To JAIIB Legal Aspects Pri... | MEC MILLAN | 4.5 out of 5 stars | 114 | ₹ 1,400.00
57 | Wren & Martin High School English Grammar and ... | Rao N | 4.4 out of 5 stars | 1,613 | ₹ 400.00
58 | Objective General Knowledge | Sanjiv Kumar | 4.2 out of 5 stars | 742 | ₹ 254.00
59 | The Rudest Book Ever | Shwetabh Gangwar | 4.6 out of 5 stars | 1,177 | ₹ 194.00
60 | Sita: Warrior of Mithila (Ram Chandra Series -... | Amish Tripathi | 4.4 out of 5 stars | 3,110 | ₹ 248.00
Next, do some pre-processing on the Rating, Customers_Rated, and Price columns:
- Since every rating is out of 5, keep only the rating value and drop the rest of the string.
- From the Customers_Rated column, remove the comma.
- From the Price column, remove the rupee symbol and the comma, then split on the dot and keep the whole-rupee part.
- Finally, convert all three columns to integer or float.
df['Rating'] = df['Rating'].apply(lambda x: x.split()[0])
df['Rating'] = pd.to_numeric(df['Rating'])
df["Price"] = df["Price"].str.replace('₹', '')
df["Price"] = df["Price"].str.replace(',', '')
df['Price'] = df['Price'].apply(lambda x: x.split('.')[0])
df['Price'] = df['Price'].astype(int)
df["Customers_Rated"] = df["Customers_Rated"].str.replace(',', '')
df['Customers_Rated'] = pd.to_numeric(df['Customers_Rated'], errors='ignore')
df.head()
Index | Book Name | Author | Rating | Customers_Rated | Price
0 | The Power of your Subconscious Mind | Joseph Murphy | 4.5 | 13948 | 99
1 | Think and Grow Rich | Napoleon Hill | 4.5 | 16670 | 99
2 | Word Power Made Easy | Norman Lewis | 4.4 | 10708 | 130
3 | Mathematics for Class 12 (Set of 2 Vol.) Exami... | R.D. Sharma | 4.5 | 18 | 930
4 | The Girl in Room 105 | Chetan Bhagat | 4.3 | 5162 | 149
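The same cleaning rules can be expressed as plain functions on a single string, which makes each rule easy to check by hand. This is a minimal sketch; clean_rating, clean_price, and clean_count are hypothetical helper names, and the sample strings follow the format of the scraped values shown earlier.

```python
def clean_rating(text):
    # "4.5 out of 5 stars" -> 4.5 (keep only the first token)
    return float(text.split()[0])

def clean_price(text):
    # "₹ 1,400.00" -> 1400 (drop the symbol and commas, keep the whole-rupee part)
    digits = text.replace('₹', '').replace(',', '').strip()
    return int(digits.split('.')[0])

def clean_count(text):
    # "13,948" -> 13948
    return int(text.replace(',', ''))

print(clean_rating("4.5 out of 5 stars"))
print(clean_price("₹ 1,400.00"))
print(clean_count("13,948"))
```

The placeholder values written by the scraper ('0' for a missing price or count) pass through these functions unchanged as 0, so they can still be replaced with NaN afterwards.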
Now verify the data types of the DataFrame.
df.dtypes
Book Name           object
Author              object
Rating             float64
Customers_Rated      int64
Price                int64
dtype: object
Next, replace the zero values in the DataFrame with NaN.
df.replace(str(0), np.nan, inplace=True)
df.replace(0, np.nan, inplace=True)
Count Number of NaNs within DataFrame
count_nan = len(df) - df.count()
count_nan
Book Name          0
Author             6
Rating             0
Customers_Rated    0
Price              1
dtype: int64
From the output above, you can see that six books have no author name and one book has no price associated with it. This information matters to authors who want to sell their books, and it should not be left out.
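The placeholder replacement and the NaN count can be reproduced on a tiny made-up frame. The sketch below uses isna().sum(), a standard pandas way to count missing values per column; the demo data is invented and only mimics the scraper's placeholders.

```python
import numpy as np
import pandas as pd

# Made-up rows using the scraper's placeholders:
# '0' for a missing author, 0 for a missing price.
demo = pd.DataFrame({
    "Author": ["Joseph Murphy", "0", "Norman Lewis"],
    "Price": [99, 130, 0],
})

# Turn both the string and the integer placeholder into NaN.
demo = demo.replace("0", np.nan).replace(0, np.nan)

# isna().sum() counts the missing values in each column.
missing = demo.isna().sum()
print(missing)
```

Note that replacing an integer with NaN converts the Price column to float, which is why prices appear as values such as 1400.0 in the later outputs.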
It’s time to drop the NaNs.
df = df.dropna()
Highest Priced Books by Authors
Let's find out which authors have the highest-priced books. You will visualize the results for the top 15 authors.
data = df.sort_values(["Price"], axis=0, ascending=False)[:15]
data
Index | Book Name | Author | Rating | Customers_Rated | Price
56 | COMBO PACK OF Guide To JAIIB Legal Aspects Pri... | MEC MILLAN | 4.5 | 114 | 1400.0
98 | Diseases of Ear, Nose and Throat | P L Dhingra | 4.7 | 118 | 1285.0
3 | Mathematics for Class 12 (Set of 2 Vol.) Exami... | R.D. Sharma | 4.5 | 18 | 930.0
96 | Madhymik Bhautik Vigyan -12 (Part 1-2) (NCERT ... | Kumar-Mittal | 5.0 | 1 | 765.0
6 | My First Library: Boxset of 10 Board Books for... | Wonder House Books | 4.5 | 3116 | 750.0
38 | Indian Polity - For Civil Services and Other S... | M. Laxmikanth | 4.6 | 1210 | 700.0
42 | A Modern Approach to Verbal & Non-Verbal Reaso... | R.S. Aggarwal | 4.4 | 1822 | 675.0
27 | The Intelligent Investor (English) Paperback –... | Benjamin Graham | 4.4 | 6201 | 650.0
99 | Law of CONTRACT & Specific Relief | Dr. Avtar Singh | 4.4 | 23 | 643.0
49 | All In One ENGLISH CORE CBSE Class 12 2019-20 | Arihant Experts | 4.4 | 493 | 599.0
72 | The Secret | Rhonda Byrne | 4.5 | 11220 | 556.0
86 | How to Prepare for Quantitative Aptitude for t... | Arun Sharma | 4.4 | 847 | 537.0
8 | Quantitative Aptitude for Competitive Examinat... | R S Aggarwal | 4.4 | 4553 | 435.0
16 | Sapiens: A Brief History of Humankind | Yuval Noah Harari | 4.6 | 14985 | 434.0
84 | Concept of Physics Part-2 (2019-2020 Session) ... | H.C. Verma | 4.6 | 1807 | 433.0
from bokeh.models import ColumnDataSource
from bokeh.transform import dodge
import math
from bokeh.io import curdoc
curdoc().clear()
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.models import Legend
output_notebook()
Loading BokehJS ...
p = figure(x_range=data.iloc[:,1], plot_width=800, plot_height=550, title="Authors Highest Priced Book", toolbar_location=None, tools="")
p.vbar(x=data.iloc[:,1], top=data.iloc[:,4], width=0.9)
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)
From the graph, it is easy to see that the two highest-priced books are from the authors MEC MILLAN and P L Dhingra.
Top Rated Authors and Books wrt Customer Rated
Let's find out which authors have the top-rated books and which of their books make the top of the list. Before doing so, you will filter out the books that have fewer than 1000 customer ratings.
data = df[df['Customers_Rated'] > 1000]
data = data.sort_values(['Rating'],axis=0, ascending=False)[:15]
data
Index | Book Name | Author | Rating | Customers_Rated | Price
26 | Inner Engineering: A Yogi’s Guide to Joy | Sadhguru | 4.7 | 4091 | 254.0
70 | Bhagavad-Gita (Hindi) | A. C. Bhaktivedanta | 4.7 | 1023 | 150.0
11 | The Alchemist | Paulo Coelho | 4.7 | 22182 | 264.0
47 | Harry Potter and the Philosopher's Stone | J.K. Rowling | 4.7 | 7737 | 234.0
84 | Concept of Physics Part-2 (2019-2020 Session) ... | H.C. Verma | 4.6 | 1807 | 433.0
16 | Sapiens: A Brief History of Humankind | Yuval Noah Harari | 4.6 | 14985 | 434.0
38 | Indian Polity - For Civil Services and Other S... | M. Laxmikanth | 4.6 | 1210 | 700.0
29 | Wings of Fire: An Autobiography of Abdul Kalam | Arun Tiwari | 4.6 | 3513 | 301.0
39 | The Theory of Everything | Stephen Hawking | 4.6 | 2004 | 199.0
25 | The Immortals of Meluha (Shiva Trilogy) | Amish | 4.6 | 4538 | 248.0
23 | Life's Amazing Secrets: How to Find Balance an... | Gaur Gopal Das | 4.6 | 3422 | 213.0
34 | Dear Stranger, I Know How You Feel | Ashish Bagrecha | 4.6 | 1130 | 167.0
17 | The Monk Who Sold His Ferrari | Robin Sharma | 4.6 | 5877 | 137.0
13 | How to Win Friends and Influence People | Dale Carnegie | 4.6 | 15377 | 99.0
59 | The Rudest Book Ever | Shwetabh Gangwar | 4.6 | 1177 | 194.0
p = figure(x_range=data.iloc[:,0], plot_width=800, plot_height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")
p.vbar(x=data.iloc[:,0], top=data.iloc[:,2], width=0.9)
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)
From the output above, you can see that the top three books with more than 1000 customer ratings are Inner Engineering: A Yogi’s Guide to Joy, Bhagavad-Gita (Hindi), and The Alchemist, respectively.
p = figure(x_range=data.iloc[:,1], plot_width=800, plot_height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")
p.vbar(x=data.iloc[:,1], top=data.iloc[:,2], width=0.9)
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)
The graph above shows, in descending order, the 15 authors with the top-rated books that have more than 1000 customer ratings; the first three are Sadhguru, A. C. Bhaktivedanta, and Paulo Coelho, respectively.
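The filter-then-sort pattern used in this section can be sketched on a small made-up frame. The book names and numbers below are invented for illustration; only the column names and the 1000-rating threshold follow the tutorial.

```python
import pandas as pd

# Made-up data: book A has the best rating but too few customer ratings.
demo = pd.DataFrame({
    "Book Name": ["A", "B", "C", "D"],
    "Rating": [4.9, 4.7, 4.2, 4.6],
    "Customers_Rated": [12, 5000, 3000, 1500],
})

# Keep only books with more than 1000 customer ratings, then rank by rating.
popular = demo[demo["Customers_Rated"] > 1000]
top = popular.sort_values("Rating", ascending=False).head(2)
print(top["Book Name"].tolist())
```

Filtering before sorting is what keeps a book with a near-perfect rating from only a handful of customers out of the ranking.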
Maximum Customer Rated Books and Authors
You have already seen the top-rated books and authors, but it is more convincing to judge the best author and book by the total number of customers who rated the book.
So, let's quickly find out.
data = df.sort_values(["Customers_Rated"], axis=0, ascending=False)[:20]
data
Index | Book Name | Author | Rating | Customers_Rated | Price
11 | The Alchemist | Paulo Coelho | 4.7 | 22182 | 264.0
1 | Think and Grow Rich | Napoleon Hill | 4.5 | 16670 | 99.0
13 | How to Win Friends and Influence People | Dale Carnegie | 4.6 | 15377 | 99.0
16 | Sapiens: A Brief History of Humankind | Yuval Noah Harari | 4.6 | 14985 | 434.0
18 | Rich Dad Poor Dad : What The Rich Teach Their ... | Robert T. Kiyosaki | 4.5 | 14591 | 296.0
10 | The Subtle Art of Not Giving a F*ck | Mark Manson | 4.4 | 14418 | 365.0
0 | The Power of your Subconscious Mind | Joseph Murphy | 4.5 | 13948 | 99.0
48 | The Power of Your Subconscious Mind | Joseph Murphy | 4.5 | 13948 | 99.0
72 | The Secret | Rhonda Byrne | 4.5 | 11220 | 556.0
41 | 1984 | George Orwell | 4.5 | 10829 | 95.0
2 | Word Power Made Easy | Norman Lewis | 4.4 | 10708 | 130.0
46 | Man's Search For Meaning: The classic tribute ... | Viktor E Frankl | 4.4 | 8544 | 245.0
67 | The 7 Habits of Highly Effective People | R. Stephen Covey | 4.3 | 8229 | 397.0
47 | Harry Potter and the Philosopher's Stone | J.K. Rowling | 4.7 | 7737 | 234.0
40 | One Indian Girl | Chetan Bhagat | 3.8 | 7128 | 113.0
65 | Thinking, Fast and Slow (Penguin Press Non-Fic... | Daniel Kahneman | 4.4 | 7087 | 410.0
27 | The Intelligent Investor (English) Paperback –... | Benjamin Graham | 4.4 | 6201 | 650.0
17 | The Monk Who Sold His Ferrari | Robin Sharma | 4.6 | 5877 | 137.0
53 | Ram - Scion of Ikshvaku (Ram Chandra) | Amish Tripathi | 4.2 | 5766 | 262.0
93 | The Richest Man in Babylon | George S. Clason | 4.5 | 5694 | 129.0
from bokeh.transform import factor_cmap
from bokeh.models import Legend
from bokeh.palettes import Dark2_5 as palette
import itertools
from bokeh.palettes import d3
#colors has a list of colors which can be used in plots
colors = itertools.cycle(palette)
palette = d3['Category20'][20]
index_cmap = factor_cmap('Author', palette=palette, factors=data["Author"])
p = figure(plot_width=700, plot_height=700, title = "Top Authors: Rating vs. Customers Rated")
# legend_field replaces the deprecated 'legend' keyword, which raised a BokehDeprecationWarning
p.scatter('Rating','Customers_Rated',source=data,fill_alpha=0.6, fill_color=index_cmap,size=20,legend_field='Author')
p.xaxis.axis_label = 'RATING'
p.yaxis.axis_label = 'CUSTOMERS RATED'
p.legend.location = 'top_left'
show(p)
The graph above is a scatter plot of authors, showing the number of customer ratings against the actual rating. The following conclusions can be drawn from the plot.
The Alchemist, by Paulo Coelho, is hands down the best-selling book, since its rating and its number of customer ratings are both the highest in the list.
Ram - Scion of Ikshvaku (Ram Chandra), written by Amish Tripathi, has an average rating of 4.2 from 5766 customer ratings. The Richest Man in Babylon, written by George S. Clason, has a nearly identical number of customer ratings (5694) but an overall rating of 4.5. So it can be concluded that more customers gave a higher rating to The Richest Man in Babylon.
Conclusion
In this tutorial, we covered the basics of web scraping with BeautifulSoup and showed how to make sense of the scraped data by visualizing it with the bokeh plotting library. A good next exercise while learning data scraping with BeautifulSoup is to scrape data from other websites and see what insights you can get from it.
If you want to scrape book details from Amazon, contact Retailgators or ask for a free quote!