Scrapy + AWS EC2 + S3

The main goal of the project was to develop a web scraping solution that collects data from millions of pages on indeed.com and stores it in an AWS S3 bucket under specific folder and naming conventions.

Project Details

Python’s Scrapy framework was used, with Crawlera as a rotating proxy so that the bot would not get blocked. A middleware pushed the scraped data into S3. The Scrapy bot was deployed on an AWS EC2 Ubuntu instance and ran on a specified schedule. As a result, we extracted millions of data points and stored them in an S3 bucket.
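
To illustrate the S3 step, here is a minimal sketch of how scraped items could be batched and written to a bucket under a folder and naming convention. It is not the project's actual code: it uses a Scrapy item pipeline rather than a middleware, and the key layout (prefix/spider/date/part number), the class name S3BatchPipeline, the helper build_s3_key, and the settings S3_BUCKET and S3_PREFIX are all assumptions made for the example.

```python
import json
from datetime import datetime, timezone


def build_s3_key(prefix, spider_name, when, part):
    """Build an object key such as 'jobs/indeed/2020/01/31/part-00001.jsonl'.

    The layout is a hypothetical convention, not the one used in the project.
    """
    return f"{prefix}/{spider_name}/{when:%Y/%m/%d}/part-{part:05d}.jsonl"


class S3BatchPipeline:
    """Buffer items and upload them to S3 as JSON Lines files, one per batch."""

    def __init__(self, client, bucket, prefix="jobs", batch_size=1000):
        self.client = client          # any object with a boto3-style put_object()
        self.bucket = bucket
        self.prefix = prefix
        self.batch_size = batch_size
        self.buffer = []
        self.part = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily so the rest of the class works without boto3 installed.
        import boto3

        # S3_BUCKET and S3_PREFIX are hypothetical setting names.
        return cls(
            client=boto3.client("s3"),
            bucket=crawler.settings.get("S3_BUCKET"),
            prefix=crawler.settings.get("S3_PREFIX", "jobs"),
        )

    def process_item(self, item, spider):
        # Serialize each item as one JSON line and flush when the batch is full.
        self.buffer.append(json.dumps(dict(item), ensure_ascii=False))
        if len(self.buffer) >= self.batch_size:
            self._flush(spider)
        return item

    def close_spider(self, spider):
        # Upload whatever is left when the crawl ends.
        if self.buffer:
            self._flush(spider)

    def _flush(self, spider):
        self.part += 1
        key = build_s3_key(
            self.prefix, spider.name, datetime.now(timezone.utc), self.part
        )
        body = "\n".join(self.buffer).encode("utf-8")
        self.client.put_object(Bucket=self.bucket, Key=key, Body=body)
        self.buffer = []
```

In a Scrapy project, a class like this would be enabled through the ITEM_PIPELINES setting, with AWS credentials supplied by the environment or the EC2 instance role. Scrapy's built-in feed exports can also write directly to an s3:// URI, which may be sufficient when no custom batching or naming is needed.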
