With thousands of companies offering product and price monitoring solutions for Amazon, scraping Amazon is big business.
But anyone who has tried to scrape it at scale knows how quickly you can get blocked.
So in this article, I’m going to show you how I built a Scrapy spider that searches Amazon for a particular keyword, then goes into every product it returns and scrapes all the main information:
- ASIN
- Product name
- Image url
- Price
- Description
- Available sizes
- Available colors
- Ratings
- Number of reviews
- Seller rank
With this spider as a base, you will be able to adapt it to scrape whatever data you need and scale it to scrape thousands or millions of products per month. The code for the project is available on GitHub here.
What Will We Need?
Obviously, you could build your scraper from scratch using basic libraries like requests and BeautifulSoup, but I chose to build it with Scrapy, the open-source web crawling framework written in Python, as it is by far the most powerful and popular web scraping framework amongst large-scale web scrapers.
Compared to other web scraping libraries such as BeautifulSoup, Selenium or Cheerio, which are great libraries for parsing HTML data, Scrapy is a full web scraping framework with a large community that has loads of built-in functionality to make web scraping as simple as possible:
- XPath and CSS selectors for HTML parsing
- data pipelines
- automatic retries
- proxy management
- concurrent requests
- etc.
This makes it really easy to get started, and very simple to scale up.
The second must-have, if you want to scrape Amazon at any kind of scale, is a large pool of proxies and the code to automatically rotate IPs and headers, along with handling bans and CAPTCHAs. Building this proxy management infrastructure yourself can be very time consuming.
For this project I opted to use Scraper API, a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.
Scraper API has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, but it can easily be scaled up to millions of pages per month if need be.
Getting Started With Scrapy
Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:
pip install scrapy
Then navigate to the folder where you want your project to live and run the “startproject” command along with the project name (“amazon_scraper” in this case), and Scrapy will build a web scraping project folder for you, with everything already set up:
scrapy startproject amazon_scraper
Here is what you should see:
├── scrapy.cfg          # deploy configuration file
└── tutorial            # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py        # project items definition file
    ├── middlewares.py  # project middlewares file
    ├── pipelines.py    # project pipelines file
    ├── settings.py     # project settings file
    └── spiders         # a directory where spiders are located
        ├── __init__.py
        └── amazon.py   # the spider we will create below
Similar to Django, when you create a project with Scrapy it automatically creates all the files you need, each of which has its own purpose:
- items.py is where you define the base item (a dictionary-like container) that you can import into the spider (see the short sketch after this list).
- settings.py is where you configure requests and activate pipelines and middlewares. Here you can change delays, concurrency, and lots more.
- pipelines.py is where the items yielded by the spider get passed; it’s mostly used to clean the text and connect to databases (Excel, SQL, etc.).
- middlewares.py is useful when you want to modify how requests are made and how Scrapy handles responses.
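For illustration, here is a minimal sketch of what an items.py definition could look like if you wanted to use Scrapy Items instead of plain dictionaries. The spider in this article yields plain dictionaries, so this file can be left untouched; the class and field names below are hypothetical and simply mirror the data we will scrape.

## items.py (optional sketch – the tutorial spider yields plain dicts instead)
import scrapy


class ProductItem(scrapy.Item):
    # each field mirrors a key we will scrape from the product page
    asin = scrapy.Field()
    Title = scrapy.Field()
    MainImage = scrapy.Field()
    Rating = scrapy.Field()
    NumberOfReviews = scrapy.Field()
    Price = scrapy.Field()
    AvailableSizes = scrapy.Field()
    AvailableColors = scrapy.Field()
    BulletPoints = scrapy.Field()
    SellerRank = scrapy.Field()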
Creating Our Amazon Spider
Okay, we’ve created the general project structure. Now, we’re going to develop the spider that will do the scraping.
Scrapy provides a number of different spider types; however, in this tutorial we will cover the most common one, the generic Spider.
To create a new spider, simply run the “genspider” command:
# syntax is --> scrapy genspider name_of_spider website.com
scrapy genspider amazon amazon.com
And Scrapy will create a new file, with a spider template.
In our case, we will get a new file in the spiders folder called “amazon.py”.
import scrapy


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/']

    def parse(self, response):
        pass
We’re going to remove the default code from this (allowed_domains, start_urls, parse function) and start writing our own code.
We’re going to create four functions:
- start_requests – will send a search query to Amazon for a particular keyword.
- parse_keyword_response – will extract the ASIN value for each product returned in the Amazon keyword query, then send a new request to Amazon to return the product page of that product. It will also move to the next page and repeat the process.
- parse_product_page – will extract all the target information from the product page.
- get_url – will send the request to Scraper API so it can retrieve the HTML response.
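Putting that plan into a rough skeleton (the bodies will be filled in over the following sections), amazon.py will be structured roughly like this:

## amazon.py (skeleton only – each function is implemented below)
import scrapy


def get_url(url):
    # will wrap the target URL in a Scraper API request (added later)
    pass


class AmazonSpider(scrapy.Spider):
    name = 'amazon'

    def start_requests(self):
        # send a search query to Amazon for each keyword
        pass

    def parse_keyword_response(self, response):
        # extract ASINs, request each product page, follow pagination
        pass

    def parse_product_page(self, response):
        # extract the target data and yield an item
        pass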
With a plan made, now let’s get to work…
Send Search Queries To Amazon
The first step is building start_requests, the function that sends search queries to Amazon with our keywords, which is pretty simple…
First let’s quickly define a list variable with our search keywords outside the AmazonSpider.
queries = ['tshirt for men', 'tshirt for women']
Then let’s create our start_requests function within the AmazonSpider that will send the requests to Amazon.
To access Amazon’s search functionality via a URL, we need to send a search query parameter “k=SEARCH_KEYWORD”:
https://www.amazon.com/s?k=<SEARCH_KEYWORD>
When implemented in our start_requests function, it looks like this.
## amazon.py
import scrapy
from urllib.parse import urlencode

queries = ['tshirt for men', 'tshirt for women']

class AmazonSpider(scrapy.Spider):
    name = 'amazon'

    def start_requests(self):
        for query in queries:
            url = 'https://www.amazon.com/s?' + urlencode({'k': query})
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)
For every query in our queries list, we will urlencode it so that it is safe to use as a query string in a URL, and then use scrapy.Request to request that URL.
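For example, urlencode turns a keyword containing spaces into a URL-safe query string:

from urllib.parse import urlencode

urlencode({'k': 'tshirt for men'})
# -> 'k=tshirt+for+men'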
Since Scrapy is async, we will use yield instead of return, which means the functions should either yield a request or a completed dictionary. If a new request is yielded it will go to the callback method, if an item is yielded it will go to the pipeline for data cleaning.
In our case, the scrapy.Request will activate the parse_keyword_response callback function, which will then extract the ASIN for each product.
Scraping Amazon’s Product Listing Page
The cleanest and most popular way to retrieve Amazon product pages is to use their ASIN ID.
An ASIN is a unique ID that every product on Amazon has. We can use this ID as part of our URL to retrieve the product page of any Amazon product, like this…
https://www.amazon.com/dp/<ASIN>
We can extract the ASIN value from the product listing page by using Scrapy’s built-in XPath selector extractor methods.
XPath is a big subject and there are plenty of techniques associated with it, so I won’t go into detail on how it works or how to create your own XPath selectors. If you would like to learn more about XPath and how to use it with Scrapy then you should check out the documentation here.
Using Scrapy Shell, I’m able to develop an XPath selector that grabs the ASIN value for every product on the product listing page and create a URL for each product:
products = response.xpath('//*[@data-asin]')
for product in products:
    asin = product.xpath('@data-asin').extract_first()
    product_url = f"https://www.amazon.com/dp/{asin}"
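If you want to experiment with these selectors yourself, you can open Scrapy Shell against a listing page (assuming Amazon serves the page without blocking the request – at scale you would route it through a proxy, as shown later):

scrapy shell "https://www.amazon.com/s?k=tshirt+for+men"

# then, inside the shell:
>>> response.xpath('//*[@data-asin]/@data-asin').extract()[:5]   # first few ASINs on the page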
Next, we will configure the function to send a request to this URL and then call the parse_product_page callback function when we get a response. We will also add the meta parameter to this request which is used to pass items between functions (or edit certain settings).
def parse_keyword_response(self, response):
    products = response.xpath('//*[@data-asin]')
    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.com/dp/{asin}"
        yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})
Extracting Product Data From Product Page
Now, we’re finally getting to the good stuff!
So after the parse_keyword_response function requests a product page’s URL, Scrapy passes the response it receives from Amazon to the parse_product_page callback function, along with the ASIN in the meta parameter.
Now, we want to extract the data we need from a product page like this.
To do so we will have to write XPath selectors to extract each field we want from the HTML response we receive back from Amazon.
def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"', response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
For scraping the image URL, I’ve gone with a regex selector over an XPath selector, as the XPath was extracting the image in base64.
With very big websites like Amazon, which have various types of product pages, you will notice that a single XPath selector sometimes won’t be enough, as it might work on some pages but not on others.
In cases like these, you will need to write several XPath selectors to cope with the various page layouts. I ran into this issue when trying to extract the product price, so I gave the spider 3 different XPath options:
def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"', response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    if not price:
        price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()
If the spider can’t find a price with the first XPath selector then it moves onto the next one, etc.
If we look at the product page again, we will see that it contains variations of the product in different sizes and colors. To extract this data we will write a quick test to see if this section is present on the page, and if it is we will extract it using regex selectors.
temp = response.xpath('//*[@id="twister"]')
sizes = []
colors = []
if temp:
    s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
    json_acceptable = s.replace("'", "\"")
    di = json.loads(json_acceptable)
    sizes = di.get('size_name', [])
    colors = di.get('color_name', [])
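Based on the keys the code reads, the parsed variationValues object is expected to have a shape roughly like the one below (the values here are purely illustrative, not real scraped data):

# illustrative shape only – the actual values depend on the product
di = {
    "size_name": ["Small", "Medium", "Large"],
    "color_name": ["Black", "White", "Navy"]
}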
Putting it all together, the parse_product_page function will look like this, and will yield an item (a dictionary) which will be sent to the pipelines.py file for data cleaning (we will discuss this later).
# note: this uses `re` and `json`, so make sure `import re` and `import json` are at the top of amazon.py
def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"', response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

    if not price:
        price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()

    temp = response.xpath('//*[@id="twister"]')
    sizes = []
    colors = []
    if temp:
        s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
        json_acceptable = s.replace("'", "\"")
        di = json.loads(json_acceptable)
        sizes = di.get('size_name', [])
        colors = di.get('color_name', [])

    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

    yield {'asin': asin, 'Title': title, 'MainImage': image, 'Rating': rating, 'NumberOfReviews': number_of_reviews,
           'Price': price, 'AvailableSizes': sizes, 'AvailableColors': colors, 'BulletPoints': bullet_points,
           'SellerRank': seller_rank}
Iterating Through Product Listing Pages
We’re looking good now…
Our spider will search Amazon based on the keyword we give it and scrape the details of the products it returns on page 1. However, what if we want our spider to navigate through every page and scrape the products on each one?
To implement this, all we need to do is add a small bit of extra code to our parse_keyword_response function:
def parse_keyword_response(self, response):
    products = response.xpath('//*[@data-asin]')
    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.com/dp/{asin}"
        yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})

    next_page = response.xpath('//li[@class="a-last"]/a/@href').extract_first()
    if next_page:
        # urljoin comes from urllib.parse
        url = urljoin("https://www.amazon.com", next_page)
        yield scrapy.Request(url=url, callback=self.parse_keyword_response)
After the spider has requested all the product pages on the first results page, it will then check whether there is a next page button. If there is, it will retrieve the URL extension and create the full URL for the next page. For example:
https://www.amazon.com/s?k=tshirt+for+men&page=2&qid=1594912185&ref=sr_pg_1
From there it will restart the parse_keyword_response function using the callback and extract the ASIN IDs for each product and extract all the product data like before.
Testing The Spider
Now that we’ve developed our spider it is time to test it. Here we can use Scrapy’s built-in CSV exporter:
scrapy crawl amazon -o test.csv
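(Scrapy’s feed exports can also write JSON or JSON Lines instead of CSV by simply changing the output file extension, for example:)

scrapy crawl amazon -o test.json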
All going well, you should now have items in test.csv, but you will notice there are 2 issues:
- the text is messy and some values are lists
- we are getting 429 responses from Amazon, which means Amazon has detected that our requests are coming from a bot and is blocking our spider.
Issue number two is by far the bigger one: if we keep going like this, Amazon will quickly ban our IP address and we won’t be able to scrape it at all.
In order to solve this, we will need to use a large proxy pool and rotate our proxies and headers with every request. For this we will use Scraper API.
Connecting Your Proxies With Scraper API
As discussed at the start of this article, Scraper API is a proxy API designed to take the hassle out of using web scraping proxies.
Instead of finding your own proxies and building your own infrastructure to rotate proxies and headers with every request, detect bans and bypass anti-bots, you just send the URL you want to scrape to Scraper API and it will take care of everything for you.
To use Scraper API you need to sign up for a free account here and get an API key, which will allow you to make 1,000 free requests per month and use all the extra features like JavaScript rendering, geotargeting, residential proxies, etc.
Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.
For this project I integrated the API by configuring my spiders to send all our requests to their API endpoint.
To do so, I just needed to create a simple function that sends a GET request to Scraper API with the URL we want to scrape.
API = '<YOUR_API_KEY>'

def get_url(url):
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
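To make this concrete, calling get_url on a product URL builds a Scraper API request along these lines (ASIN_HERE is just a placeholder, and your real key is substituted for API):

get_url('https://www.amazon.com/dp/ASIN_HERE')
# -> 'http://api.scraperapi.com/?api_key=YOUR_KEY&url=https%3A%2F%2Fwww.amazon.com%2Fdp%2FASIN_HERE'
# (the target URL is percent-encoded by urlencode)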
We then modify our spider functions to use the Scraper API proxy by setting the url parameter in scrapy.Request to get_url(url).
def start_requests(self):
    ...
    yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)


def parse_keyword_response(self, response):
    ...
    yield scrapy.Request(url=get_url(product_url), callback=self.parse_product_page, meta={'asin': asin})
    ...
    yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)
A really cool feature with Scraper API is that you can enable Javascript rendering, geotargeting, residential IPs, etc. by simply adding a flag to your API request.
As Amazon changes the pricing and supplier data shown based on the country you are making the request from, we’re going to use Scraper API’s geotargeting feature so that Amazon thinks our requests are coming from the US. To do this we need to add the flag “&country_code=us” to the request, which we can do by adding another parameter to the payload variable.
def get_url(url):
    payload = {'api_key': API, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
You can check out Scraper API’s other functionality in their documentation here.
Next, we have to go into the settings.py file and change the number of concurrent requests we’re allowed to make based on the concurrency limit of our Scraper API plan, which for the free plan is 5 concurrent requests.
## settings.py
CONCURRENT_REQUESTS = 5
Concurrency is the number of requests you are allowed to make in parallel at any one time. The more concurrent requests you can make the faster you can scrape.
Also, we should set RETRY_TIMES to tell Scrapy to retry any failed requests (to 5, for example) and make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren’t enabled, as these will lower your concurrency and are not needed with Scraper API.
## settings.py
CONCURRENT_REQUESTS = 5
RETRY_TIMES = 5
# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY
Cleaning Data With Pipelines
The final step is to do a bit of data cleaning in the pipelines.py file, as the text is messy and some values are lists.
class TutorialPipeline:

    def process_item(self, item, spider):
        for k, v in item.items():
            if not v:
                item[k] = ''  # replace empty list or None with empty string
                continue
            if k == 'Title':
                item[k] = v.strip()
            elif k == 'Rating':
                item[k] = v.replace(' out of 5 stars', '')
            elif k == 'AvailableSizes' or k == 'AvailableColors':
                item[k] = ", ".join(v)
            elif k == 'BulletPoints':
                item[k] = ", ".join([i.strip() for i in v if i.strip()])
            elif k == 'SellerRank':
                item[k] = " ".join([i.strip() for i in v if i.strip()])
        return item
After the spider has yielded an item, it is passed to the pipeline to be cleaned.
To enable the pipeline we need to add it to the settings.py file.
## settings.py
ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}
Now we are good to go. You can test the spider again by running it with the crawl command.
scrapy crawl amazon -o test.csv
This time you should see that the spider was able to scrape all the available products for your keyword without getting banned.
If you would like to run the spider for yourself or modify it for your particular Amazon project then feel free to do so. The code is on GitHub here. Just remember that you need to get your own Scraper API API key by signing up here.