Classification of industry of firm

The goal of this project was to infer the industry sectors of New York–based firms that had no pre-existing labels, first by matching them to reference datasets and then by building a machine-learning model to predict NAICS/SIC codes at scale.

Project Details

In this project I engineered an end-to-end pipeline that:

Matched unlabeled firms (with only name, owner, address, zip code, city, state, establishment/dissolution years) against secondary registries to retrieve known NAICS/SIC industry codes via fuzzy string and location‐based matching.
Curated a labeled training set by combining successfully matched entries and augmenting it with a few-shot labeling approach using GPT-4 to expand coverage without manual annotation.
Developed a BERT-based classifier in Python, benchmarked against SVM and Random Forest baselines, to predict industry codes for the remaining unmatched firms, balancing accuracy against inference cost.
Implemented QA controls, with random manual reviews of both matching results and model predictions to ensure high data quality.
Delivered full documentation and a replication package so stakeholders can reproduce the matching logic, labeling strategy, and model training/inference workflows.
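
To illustrate the matching step, here is a minimal sketch of fuzzy name matching restricted to firms in the same zip code. It is not the project's actual code: it uses Python's standard-library difflib as a stand-in for whatever string-similarity method the pipeline used, and the field names (name, zip, naics), the suffix list, the 0.85 threshold, and the sample records are all assumptions made up for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd", "company",
                  "incorporated", "corporation"}


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized firm names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def match_firm(firm: dict, registry: list, threshold: float = 0.85):
    """Return the most similar registry record in the same zip code.

    The zip code acts as a location filter, so names are only compared
    within the same area. Returns None if no candidate reaches the
    (assumed) similarity threshold.
    """
    candidates = [rec for rec in registry if rec["zip"] == firm["zip"]]
    best, best_score = None, 0.0
    for rec in candidates:
        score = name_similarity(firm["name"], rec["name"])
        if score > best_score:
            best, best_score = rec, score
    return best if best_score >= threshold else None


# Made-up sample records, for illustration only.
registry = [
    {"name": "Acme Plumbing Inc.", "zip": "10001", "naics": "238220"},
    {"name": "Acme Plumbing Supply", "zip": "11201", "naics": "423720"},
    {"name": "Blue Sky Bakery LLC", "zip": "10001", "naics": "311811"},
]

firm = {"name": "ACME Plumbing, Inc", "zip": "10001"}
match = match_firm(firm, registry)
unmatched = match_firm({"name": "Zenith Dental", "zip": "10001"}, registry)
```

In this sketch, firms for which match_firm returns None would be the ones passed on to the labeling and classification stages. Filtering by zip code before comparing names keeps the number of comparisons small and reduces false matches between similarly named firms in different locations.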
