The main goal of the project was to develop a web scraping solution that collects data from millions of pages on indeed.com and stores it in an AWS S3 bucket following specific folder and naming conventions.
Project Details
Python’s Scrapy framework was used, with Crawlera as a proxy rotator so that the bot would not get blocked. A middleware pushed the scraped data into S3. The Scrapy bot was deployed on an AWS EC2 Ubuntu instance and ran on a specified schedule. As a result, we extracted millions of data points and stored them in an S3 bucket.
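The S3 upload step could look roughly like the sketch below. It is an illustration only, not the project's actual code: it is written as a Scrapy-style item pipeline (a class with a `process_item` method) rather than the middleware mentioned above, and the key layout (`indeed/raw/YYYY/MM/DD/<hash>.json`), the `url` item field, and the names `build_s3_key` and `S3UploadPipeline` are all assumptions. The client is passed in so that a boto3 S3 client, whose `put_object` method takes `Bucket`, `Key` and `Body`, can be swapped for a stub.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_s3_key(url, scraped_at, prefix="indeed/raw"):
    """Build an object key of the form <prefix>/YYYY/MM/DD/<sha1(url)>.json.

    The layout is hypothetical; the real folder and naming conventions
    are not described in this write-up.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{prefix}/{scraped_at:%Y/%m/%d}/{digest}.json"


class S3UploadPipeline:
    """Scrapy-style item pipeline that writes each item to S3 as JSON."""

    def __init__(self, client, bucket):
        # client: any object exposing put_object(Bucket=..., Key=..., Body=...),
        # for example boto3.client("s3")
        self.client = client
        self.bucket = bucket

    def process_item(self, item, spider):
        record = dict(item)
        # Assumes every item carries the page URL in a "url" field.
        key = build_s3_key(record["url"], datetime.now(timezone.utc))
        self.client.put_object(
            Bucket=self.bucket,
            Key=key,
            Body=json.dumps(record).encode("utf-8"),
            ContentType="application/json",
        )
        # Return the item so any later pipelines still receive it.
        return item
```

Hashing the URL keeps keys short and gives each page a stable file name, while the date prefix groups objects by crawl day. Both choices are one option among many.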