How to Scrape Amazon for Book Information Using Python and BeautifulSoup?

To scrape Amazon for book information, you first need to install the Beautiful Soup library. The easiest way to install BeautifulSoup is through pip, so make sure pip is installed.

!pip3 install beautifulsoup4

Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/site-packages (4.7.1)
Requirement already satisfied: soupsieve>=1.2 in /usr/local/lib/python3.7/site-packages (from beautiful)

Importing Required Libraries

Next, import the packages you will use to scrape data from the website and to visualize it with matplotlib, bokeh, and seaborn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

Extracting Amazon’s Best Selling Books

The URL you will scrape is built like this: 'https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo) (if you are unable to use this link, use the parent link). The page number in the URL can be changed to fetch the data for each page. To build the full dataset you need to go through all the pages, so first find out how many pages the list has.
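As a minimal sketch of how the page number slots into that URL, the snippet below builds the address for a given page. The helper name build_page_url and the constant BASE_URL are hypothetical (they do not appear in the original code), and the sketch assumes the URL pattern shown above is still valid.

```python
# Hypothetical helper (not part of the original code): builds the
# bestseller URL for one page, following the pattern shown above.
BASE_URL = "https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_"

def build_page_url(page_no):
    """Return the bestseller-list URL for the given 1-based page number."""
    return f"{BASE_URL}{page_no}?ie=UTF8&pg={page_no}"

# Build the URLs for the first two pages
urls = [build_page_url(n) for n in range(1, 3)]
print(urls[0])
```

Keeping the URL construction in one place makes it easier to adjust if the pattern changes.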

To connect to the URL and fetch the HTML content, you need to:

Define a get_data function that takes the page number as an argument,

Define a user-agent header, which helps the request avoid being detected as a scraper,

Pass the URL to requests.get together with the user-agent header,

Read the content from the response that requests.get returns,

Parse the page and assign the result to the soup variable.

The next important step is to identify the parent tag under which all the required data resides. The data you will scrape includes:

Book’s Name
Author’s Name
Ratings
Number of Customers Who Rated
Pricing

The image shows where the parent tag is located; when you hover over it, all the required elements are highlighted.

As with the parent tag, you need to find the attributes for the book name, author, rating, customers rated, and price. Open the web page you want to scrape, right-click the element you are interested in, and choose Inspect Element. This shows you the specific fields you need to scrape from the HTML page, as in the figure below:

Some authors’ names are not registered with Amazon, so you need an additional find for those cases. In the code cell below, nested if-else conditions for the author fall back to a second tag that holds the author or publication name.
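The fallback idea can be tried in isolation. The sketch below runs it on a small made-up HTML snippet: the "card" wrapper class, the sample names, and the helper name extract_author are assumptions for illustration, while the two author class strings are the ones used in this tutorial's scraper.

```python
from bs4 import BeautifulSoup

# Made-up product cards: one with an author link, one with only a plain
# span (for example a publisher name), and one with neither.
html = """
<div class="card"><a class="a-size-small a-link-child">Paulo Coelho</a></div>
<div class="card"><span class="a-size-small a-color-base">Wonder House Books</span></div>
<div class="card"></div>
"""

def extract_author(card):
    """Try the author link first, then the plain span, else return '0'."""
    tag = card.find('a', attrs={'class': 'a-size-small a-link-child'})
    if tag is None:
        tag = card.find('span', attrs={'class': 'a-size-small a-color-base'})
    return tag.text if tag is not None else '0'

soup = BeautifulSoup(html, 'html.parser')
cards = soup.find_all('div', attrs={'class': 'card'})
authors = [extract_author(card) for card in cards]
print(authors)
```

Returning the placeholder '0' mirrors the scraper, which later replaces these placeholders with NaN.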

no_pages = 2

def get_data(pageNo):  
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    r = requests.get('https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo), headers=headers)#, proxies=proxies)
    content = r.content
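    # Name the parser explicitly so results do not depend on which parsers are installed
    soup = BeautifulSoup(content, 'html.parser')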
    #print(soup)

    alls = []
    for d in soup.findAll('div', attrs={'class':'a-section a-spacing-none aok-relative'}):
        #print(d)
        name = d.find('span', attrs={'class':'zg-text-center-align'})
        # Only look for the cover image when the title span exists;
        # calling find_all on None would raise an AttributeError
        n = name.find_all('img', alt=True) if name is not None else []
        author = d.find('a', attrs={'class':'a-size-small a-link-child'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        users_rated = d.find('a', attrs={'class':'a-size-small a-link-normal'})
        price = d.find('span', attrs={'class':'p13n-sc-price'})

        all1=[]

        # The book title is taken from the alt text of the cover image
        if n:
            all1.append(n[0]['alt'])
        else:
            all1.append("unknown-product")

        if author is not None:
            all1.append(author.text)
        else:
            # Fall back to the plain span used when there is no author link
            author = d.find('span', attrs={'class':'a-size-small a-color-base'})
            if author is not None:
                all1.append(author.text)
            else:
                all1.append('0')

        if rating is not None:
            #print(rating.text)
            all1.append(rating.text)
        else:
            all1.append('-1')

        if users_rated is not None:
            #print(users_rated.text)
            all1.append(users_rated.text)
        else:
            all1.append('0')

        if price is not None:
            #print(price.text)
            all1.append(price.text)
        else:
            all1.append('0')
        alls.append(all1)    
    return alls

The next code cell does the following:

Calls the get_data function inside a for loop,

The for loop iterates over the page numbers from 1 to no_pages (range stops before no_pages+1),

Because the output is a nested list, it first flattens the list and then passes it to a DataFrame,

Finally, it saves the dataframe as a CSV file.
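The flattening step can also be written with itertools.chain.from_iterable from the standard library. The sketch below is an alternative to the lambda used in this tutorial, shown on made-up stand-in data rather than real scraper output.

```python
from itertools import chain

# Stand-in for the per-page results: each page is a list of rows.
pages = [
    [["The Alchemist", "Paulo Coelho"], ["1984", "George Orwell"]],
    [["The Secret", "Rhonda Byrne"]],
]

# Flatten one level: a list of pages becomes a single list of rows.
rows = list(chain.from_iterable(pages))
print(len(rows))
```

Only one level is flattened, so each row stays a list that maps to one DataFrame row.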

results = []
for i in range(1, no_pages+1):
    results.append(get_data(i))
flatten = lambda l: [item for sublist in l for item in sublist]
df = pd.DataFrame(flatten(results),columns=['Book Name','Author','Rating','Customers_Rated', 'Price'])
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')

Read a CSV File

Now load the CSV file that you created and saved in the previous cell. This step is optional; you can also use the dataframe df directly and skip it.

df = pd.read_csv("amazon_products.csv")

df.shape

(100, 5)

The shape of the dataframe shows that the CSV file has 100 rows and 5 columns.

Now print the first rows of the dataset. The call below asks for 61 rows, and the notebook truncates the middle of the display.

df.head(61)

Book Name	Author	Rating	Customers_Rated	Price
0	The Power of your Subconscious Mind	Joseph Murphy	4.5 out of 5 stars	13,948	₹ 99.00
1	Think and Grow Rich	Napoleon Hill	4.5 out of 5 stars	16,670	₹ 99.00
2	Word Power Made Easy	Norman Lewis	4.4 out of 5 stars	10,708	₹ 130.00
3	Mathematics for Class 12 (Set of 2 Vol.) Exami...	R.D. Sharma	4.5 out of 5 stars	18	₹ 930.00
4	The Girl in Room 105	Chetan Bhagat	4.3 out of 5 stars	5,162	₹ 149.00
...	...	...	...	...	...
56	COMBO PACK OF Guide To JAIIB Legal Aspects Pri...	MEC MILLAN	4.5 out of 5 stars	114	₹ 1,400.00
57	Wren & Martin High School English Grammar and ...	Rao N	4.4 out of 5 stars	1,613	₹ 400.00
58	Objective General Knowledge	Sanjiv Kumar	4.2 out of 5 stars	742	₹ 254.00
59	The Rudest Book Ever	Shwetabh Gangwar	4.6 out of 5 stars	1,177	₹ 194.00
60	Sita: Warrior of Mithila (Ram Chandra Series -...	Amish Tripathi	4.4 out of 5 stars	3,110	₹ 248.00

Some pre-processing on Ratings, Price Column, and customers_rated:

Since every rating is out of 5, keep only the numeric rating and drop the rest of the text.

From the customers_rated column, remove the comma.

From the price column, remove the rupee symbol and the comma, then split on the dot and keep the part before it.

Finally, convert all three columns to integer or float.
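To make the three cleaning rules concrete, here is a small sketch in plain Python that applies them to single strings. The function names are hypothetical and only illustrate the logic; the tutorial itself applies the same steps column-wise with pandas.

```python
def parse_rating(text):
    """Keep the leading number of a string like '4.5 out of 5 stars'."""
    return float(text.split()[0])

def parse_count(text):
    """Remove the thousands separator from a string like '13,948'."""
    return int(text.replace(',', ''))

def parse_price(text):
    """Drop the rupee symbol and commas, then keep the part before the dot."""
    cleaned = text.replace('₹', '').replace(',', '').strip()
    return int(cleaned.split('.')[0])

print(parse_rating('4.5 out of 5 stars'))
print(parse_count('13,948'))
print(parse_price('₹ 1,400.00'))
```

The placeholder values written by the scraper ('-1' for a missing rating, '0' for a missing count or price) also pass through these functions without raising an error.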

df['Rating'] = df['Rating'].apply(lambda x: x.split()[0])

df['Rating'] = pd.to_numeric(df['Rating'])

df["Price"] = df["Price"].str.replace('₹', '')

df["Price"] = df["Price"].str.replace(',', '')

df['Price'] = df['Price'].apply(lambda x: x.split('.')[0])

df['Price'] = df['Price'].astype(int)

df["Customers_Rated"] = df["Customers_Rated"].str.replace(',', '')

df['Customers_Rated'] = pd.to_numeric(df['Customers_Rated'], errors='ignore')

df.head()


Book Name	Author	Rating	Customers_Rated	Price
0	The Power of your Subconscious Mind	Joseph Murphy	4.5	13948	99
1	Think and Grow Rich	Napoleon Hill	4.5	16670	99
2	Word Power Made Easy	Norman Lewis	4.4	10708	130
3	Mathematics for Class 12 (Set of 2 Vol.) Exami...	R.D. Sharma	4.5	18	930
4	The Girl in Room 105	Chetan Bhagat	4.3	5162	149

Next, check the data types of the DataFrame.

df.dtypes

Book Name           object
Author              object
Rating             float64
Customers_Rated      int64
Price                int64
dtype: object

Then replace the zero values in the DataFrame with NaN.

df.replace(str(0), np.nan, inplace=True)
df.replace(0, np.nan, inplace=True)

Count Number of NaNs within DataFrame

count_nan = len(df) - df.count()

count_nan

Book Name          0
Author             6
Rating             0
Customers_Rated    0
Price              1
dtype: int64
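The same per-column count can be obtained with isna().sum(). The sketch below checks on a made-up toy frame (not the scraped data) that both approaches agree.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing author and one missing price (made-up rows)
toy = pd.DataFrame({
    "Author": ["Paulo Coelho", np.nan, "George Orwell"],
    "Price": [264.0, 99.0, np.nan],
})

by_count = len(toy) - toy.count()   # the approach used in this tutorial
by_isna = toy.isna().sum()          # equivalent and more direct

print(by_isna)
```

Both give one missing value per column here; isna().sum() states the intent more directly.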

From the output above, you can see that six books have no author name and one book has no price. This information matters to authors who want to sell their books, and it should not be left out.

It’s time to drop the NaNs.

df = df.dropna()

Highest Priced Books by Authors

Let's find out which authors have the highest-priced books. You will visualize the results for the top 15 authors.

data = df.sort_values(["Price"], axis=0, ascending=False)[:15]

data



Book Name	Author	Rating	Customers_Rated	Price
56	COMBO PACK OF Guide To JAIIB Legal Aspects Pri...	MEC MILLAN	4.5	114	1400.0
98	Diseases of Ear, Nose and Throat	P L Dhingra	4.7	118	1285.0
3	Mathematics for Class 12 (Set of 2 Vol.) Exami...	R.D. Sharma	4.5	18	930.0
96	Madhymik Bhautik Vigyan -12 (Part 1-2) (NCERT ...	Kumar-Mittal	5.0	1	765.0
6	My First Library: Boxset of 10 Board Books for...	Wonder House Books	4.5	3116	750.0
38	Indian Polity - For Civil Services and Other S...	M. Laxmikanth	4.6	1210	700.0
42	A Modern Approach to Verbal & Non-Verbal Reaso...	R.S. Aggarwal	4.4	1822	675.0
27	The Intelligent Investor (English) Paperback –...	Benjamin Graham	4.4	6201	650.0
99	Law of CONTRACT & Specific Relief	Dr. Avtar Singh	4.4	23	643.0
49	All In One ENGLISH CORE CBSE Class 12 2019-20	Arihant Experts	4.4	493	599.0
72	The Secret	Rhonda Byrne	4.5	11220	556.0
86	How to Prepare for Quantitative Aptitude for t...	Arun Sharma	4.4	847	537.0
8	Quantitative Aptitude for Competitive Examinat...	R S Aggarwal	4.4	4553	435.0
16	Sapiens: A Brief History of Humankind	Yuval Noah Harari	4.6	14985	434.0
84	Concept of Physics Part-2 (2019-2020 Session) ...	H.C. Verma	4.6	1807	433.0


from bokeh.models import ColumnDataSource
from bokeh.transform import dodge
import math
from bokeh.io import curdoc
curdoc().clear()
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.models import Legend
output_notebook()

Loading BokehJS ...

p = figure(x_range=data.iloc[:,1], plot_width=800, plot_height=550, title="Authors Highest Priced Book", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,1], top=data.iloc[:,4], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)

From the graph, it is easy to see that the two highest-priced books are from the authors MEC MILLAN and P L Dhingra.

Top Rated Authors and Books wrt Customer Rated

Let's find out which authors have the top-rated books, and which of their books make the top list. Before doing so, you will keep only the books that have more than 1,000 customer ratings.

data = df[df['Customers_Rated'] > 1000]

data = data.sort_values(['Rating'],axis=0, ascending=False)[:15]

data


Book Name	Author	Rating	Customers_Rated	Price
26	Inner Engineering: A Yogi’s Guide to Joy	Sadhguru	4.7	4091	254.0
70	Bhagavad-Gita (Hindi)	A. C. Bhaktivedanta	4.7	1023	150.0
11	The Alchemist	Paulo Coelho	4.7	22182	264.0
47	Harry Potter and the Philosopher's Stone	J.K. Rowling	4.7	7737	234.0
84	Concept of Physics Part-2 (2019-2020 Session) ...	H.C. Verma	4.6	1807	433.0
16	Sapiens: A Brief History of Humankind	Yuval Noah Harari	4.6	14985	434.0
38	Indian Polity - For Civil Services and Other S...	M. Laxmikanth	4.6	1210	700.0
29	Wings of Fire: An Autobiography of Abdul Kalam	Arun Tiwari	4.6	3513	301.0
39	The Theory of Everything	Stephen Hawking	4.6	2004	199.0
25	The Immortals of Meluha (Shiva Trilogy)	Amish	4.6	4538	248.0
23	Life's Amazing Secrets: How to Find Balance an...	Gaur Gopal Das	4.6	3422	213.0
34	Dear Stranger, I Know How You Feel	Ashish Bagrecha	4.6	1130	167.0
17	The Monk Who Sold His Ferrari	Robin Sharma	4.6	5877	137.0
13	How to Win Friends and Influence People	Dale Carnegie	4.6	15377	99.0
59	The Rudest Book Ever	Shwetabh Gangwar	4.6	1177	194.0


p = figure(x_range=data.iloc[:,0], plot_width=800, plot_height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,0], top=data.iloc[:,2], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2

show(p)

From the output above, you can see that the top three books with more than 1,000 customer ratings are Inner Engineering: A Yogi’s Guide to Joy, Bhagavad-Gita (Hindi), and The Alchemist, in that order.

p = figure(x_range=data.iloc[:,1], plot_width=800, plot_height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,1], top=data.iloc[:,2], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2

show(p)

The graph shows the top 15 authors, in descending order of rating, whose books have more than 1,000 customer ratings; the first three are Sadhguru, A. C. Bhaktivedanta, and Paulo Coelho.

Maximum Customer Rated Books and Authors

Now that you have seen the top-rated books and authors, it would be more convincing to pick the best author and book based on the total number of customers who rated the book.

So let's quickly find that out.

data = df.sort_values(["Customers_Rated"], axis=0, ascending=False)[:20]

data


Book Name	Author	Rating	Customers_Rated	Price
11	The Alchemist	Paulo Coelho	4.7	22182	264.0
1	Think and Grow Rich	Napoleon Hill	4.5	16670	99.0
13	How to Win Friends and Influence People	Dale Carnegie	4.6	15377	99.0
16	Sapiens: A Brief History of Humankind	Yuval Noah Harari	4.6	14985	434.0
18	Rich Dad Poor Dad : What The Rich Teach Their ...	Robert T. Kiyosaki	4.5	14591	296.0
10	The Subtle Art of Not Giving a F*ck	Mark Manson	4.4	14418	365.0
0	The Power of your Subconscious Mind	Joseph Murphy	4.5	13948	99.0
48	The Power of Your Subconscious Mind	Joseph Murphy	4.5	13948	99.0
72	The Secret	Rhonda Byrne	4.5	11220	556.0
41	1984	George Orwell	4.5	10829	95.0
2	Word Power Made Easy	Norman Lewis	4.4	10708	130.0
46	Man's Search For Meaning: The classic tribute ...	Viktor E Frankl	4.4	8544	245.0
67	The 7 Habits of Highly Effective People	R. Stephen Covey	4.3	8229	397.0
47	Harry Potter and the Philosopher's Stone	J.K. Rowling	4.7	7737	234.0
40	One Indian Girl	Chetan Bhagat	3.8	7128	113.0
65	Thinking, Fast and Slow (Penguin Press Non-Fic...	Daniel Kahneman	4.4	7087	410.0
27	The Intelligent Investor (English) Paperback –...	Benjamin Graham	4.4	6201	650.0
17	The Monk Who Sold His Ferrari	Robin Sharma	4.6	5877	137.0
53	Ram - Scion of Ikshvaku (Ram Chandra)	Amish Tripathi	4.2	5766	262.0
93	The Richest Man in Babylon	George S. Clason	4.5	5694	129.0



from bokeh.transform import factor_cmap
from bokeh.models import Legend
from bokeh.palettes import Dark2_5 as palette
import itertools
from bokeh.palettes import d3
#colors has a list of colors which can be used in plots
colors = itertools.cycle(palette)

palette = d3['Category20'][20]


index_cmap = factor_cmap('Author', palette=palette,
                         factors=data["Author"])


p = figure(plot_width=700, plot_height=700, title = "Top Authors: Rating vs. Customers Rated")
# legend_field replaces the deprecated 'legend' keyword
p.scatter('Rating','Customers_Rated',source=data,fill_alpha=0.6, fill_color=index_cmap,size=20,legend_field='Author')
p.xaxis.axis_label = 'RATING'
p.yaxis.axis_label = 'CUSTOMERS RATED'
p.legend.location = 'top_left'


show(p)

The graph above is a scatter plot of authors, showing the number of customers who rated each book against its rating. The following conclusions can be drawn from the plot.

The Alchemist by Paulo Coelho is the clear standout: it has a rating of 4.7, the highest in the table, together with the largest number of customer ratings (22,182).

Ram - Scion of Ikshvaku (Ram Chandra), written by Amish Tripathi, has an average rating of 4.2 from 5,766 customer ratings. The Richest Man in Babylon, written by George S. Clason, has nearly the same number of customer ratings (5,694), but its overall rating is 4.5. So, from a similar number of customers, The Richest Man in Babylon received higher ratings.

Conclusion

In this tutorial, we covered the basics of web scraping with BeautifulSoup and how to make sense of the scraped data by visualizing it with the bokeh plotting library. A good next exercise while learning to scrape with BeautifulSoup is to scrape data from other websites and see what insights you can get from it.

If you want to scrape Amazon book details, contact Retailgators or ask for a free quote!
