AI-powered Scrapy bot

The goal of the project was to create a Scrapy bot that visits thousands of websites from a particular sector and extracts the most important data points, such as the company's about page, address, news and, in particular, employee-related information.

Project Details

The difficulty was that every website had a different HTML structure, so it was not practical to write rules and XPath expressions by hand to capture all the data points.

My solution was to train a model that detects the tags containing the needed information, so that the Scrapy bot only has to extract the text or other attributes from those tags. The model of choice was an LSTM, which performed well on this task. With some help, I labelled a training set, trained the model, and successfully deployed it to the cloud.
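The "detect the tag, then extract from it" idea can be illustrated with a minimal sketch. It is not the project's code: it uses only Python's standard `html.parser` instead of Scrapy, and the function `score_tag` is a hypothetical keyword-based stand-in for the trained LSTM, included only so the example runs on its own. The names `TagTextCollector`, `score_tag` and `extract`, and the sample pages, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked as open elements.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagTextCollector(HTMLParser):
    """Collect every element that directly contains text, with its tag and attributes."""

    def __init__(self):
        super().__init__()
        self._stack = []      # open elements as [tag, attrs, text_parts]
        self.candidates = []  # closed elements that held some text

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append([tag, dict(attrs), []])

    def handle_data(self, data):
        # Attach text to the innermost open element.
        if self._stack and data.strip():
            self._stack[-1][2].append(data.strip())

    def handle_endtag(self, tag):
        # Only well-nested markup is handled in this sketch.
        if self._stack and self._stack[-1][0] == tag:
            name, attrs, parts = self._stack.pop()
            if parts:
                self.candidates.append(
                    {"tag": name, "attrs": attrs, "text": " ".join(parts)}
                )


def score_tag(candidate):
    """Hypothetical stand-in for the trained model: return a label or None."""
    hints = {
        "address": ("address", "street", "location"),
        "about": ("about", "who-we-are"),
        "employee": ("team", "staff", "employee"),
    }
    attrs = candidate["attrs"]
    haystack = " ".join(
        [candidate["tag"], attrs.get("class") or "", attrs.get("id") or ""]
    ).lower()
    for label, words in hints.items():
        if any(word in haystack for word in words):
            return label
    return None


def extract(html):
    """Label each text-bearing tag and group the extracted text by label."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    results = {}
    for candidate in collector.candidates:
        label = score_tag(candidate)
        if label:
            results.setdefault(label, []).append(candidate["text"])
    return results


# Two pages with different structures, handled by the same code path.
page_a = '<section id="about">We build pumps.</section><address>1 Main St, Springfield</address>'
page_b = '<div class="company-location"><br>42 Harbor Rd</div><ul><li class="team-member">Jane Doe, CTO</li></ul>'

print(extract(page_a))
print(extract(page_b))
```

The point of the design is that the extraction code carries no per-site rules. All site-specific knowledge lives in the classifier, so replacing `score_tag` with a call to a trained model leaves the rest unchanged.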

  • Date

    30 Dec, 2019
  • Categories

    Machine Learning, Deep Learning, Cloud, Web Scraping
  • Client

    Robin Tar