Industry Classification of Firms

The goal of this project was to infer the industry sectors of New York-based firms that had no pre-existing labels, first by matching them to reference datasets and then by building a machine-learning model to predict NAICS/SIC codes at scale.

Project Details

In this project I engineered an end-to-end pipeline that:

  • Matched unlabeled firms (described only by name, owner, address, ZIP code, city, state, and establishment/dissolution years) against secondary registries to retrieve known NAICS/SIC industry codes via fuzzy string and location-based matching.
  • Curated a labeled training set from the successfully matched entries and augmented it with few-shot GPT-4 labeling to expand coverage without manual annotation.
  • Developed a BERT-based classifier in Python, with SVM and Random Forest models as baselines, to predict industry codes for the remaining unmatched firms, balancing accuracy against inference cost.
  • Implemented QA controls, with random manual reviews of both matching results and model predictions to ensure high data quality.
  • Delivered full documentation and a replication package so stakeholders can reproduce the matching logic, labeling strategy, and model training/inference workflows.
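
The matching and QA steps above can be illustrated with a minimal sketch. It is not the project's actual code: it assumes exact ZIP-code blocking followed by name similarity from the standard library's `difflib.SequenceMatcher`, and the field names, suffix list, 0.85 threshold, and sample records are all hypothetical choices made for illustration.

```python
import random
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing firm names.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd", "company", "incorporated", "corporation"}


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def match_firms(firms, registry, threshold=0.85):
    """Return {firm_id: (naics_code, score)} for the best same-ZIP match at or above threshold."""
    # Blocking: index registry entries by ZIP so only co-located firms are compared.
    by_zip = {}
    for entry in registry:
        by_zip.setdefault(entry["zip"], []).append(entry)

    matches = {}
    for firm in firms:
        target = normalize_name(firm["name"])
        best_code, best_score = None, 0.0
        for entry in by_zip.get(firm["zip"], []):
            score = SequenceMatcher(None, target, normalize_name(entry["name"])).ratio()
            if score > best_score:
                best_code, best_score = entry["naics"], score
        if best_score >= threshold:
            matches[firm["id"]] = (best_code, best_score)
    return matches


def sample_for_review(matches, k, seed=0):
    """Draw a reproducible random sample of matched firm ids for manual QA."""
    ids = sorted(matches)
    return random.Random(seed).sample(ids, min(k, len(ids)))


# Made-up records, for illustration only.
firms = [
    {"id": 1, "name": "Joe's Pizza, Inc.", "zip": "10012"},
    {"id": 2, "name": "Hudson Valley Plumbing LLC", "zip": "12601"},
    {"id": 3, "name": "Acme Consulting", "zip": "10001"},
]
registry = [
    {"name": "JOES PIZZA INC", "zip": "10012", "naics": "722511"},
    {"name": "Hudson Valley Plumbing", "zip": "12601", "naics": "238220"},
    {"name": "Zenith Textiles Corp", "zip": "10001", "naics": "313210"},
]

matches = match_firms(firms, registry)
review_ids = sample_for_review(matches, k=1)
```

In this sketch, firms 1 and 2 clear the threshold and inherit a registry code, while firm 3 stays unmatched and would be passed on to the classifier. Fixing the seed makes the manual-review sample reproducible, which suits a replication package.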

Date: 08 Aug, 2024
Categories: Deep Learning, Machine Learning, Generative AI, LLM, ChatGPT
Client: Andrew Miles