BC Ministry of Environment and Parks

Classifying Contaminated Sites Records | Student Capstone Project

The site remediation program within the BC Ministry of Environment and Parks regulates the cleanup of sites in British Columbia that formerly hosted commercial or industrial activities. As part of that mandate, the Environmental Management Act (EMA) requires the ministry to establish a publicly accessible registry of records about the identification, investigation, and remediation of contaminated sites. Environmental professionals, First Nations, municipalities, and members of the public use information on the site registry to perform due diligence when purchasing a site that may be contaminated by past activities.

The site remediation program has tracked over 29,000 sites and their associated activities over the past few decades. Each site has documents either in paper form or in legacy databases. To ensure that these documents are easily accessible, the BC Ministry of Environment and Parks worked with a group of UBC Master of Data Science in Computational Linguistics students to research, implement, and compare data science approaches for adding metadata tags to site remediation documents.

The partner provided the students with approximately 6,000 files already digitized in PDF format. The students had four clear goals for their capstone project: extract accurate metadata from each scanned PDF, classify every document into the correct business category, remove or mark duplicates so that only one canonical version remains, and reorganize everything into a folder structure that the registry can ingest directly.
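To make those four goals concrete, a per-document record such as the hypothetical sketch below captures what the pipeline must produce for every file. The field names here are illustrative assumptions, not the team's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-document record mirroring the four project goals;
# field names are illustrative assumptions, not the team's schema.
@dataclass
class SiteDocumentRecord:
    source_path: str                        # where the scanned PDF came from
    site_id: Optional[str] = None           # goal 1: extracted metadata...
    sender: Optional[str] = None
    receiver: Optional[str] = None
    address: Optional[str] = None
    title: Optional[str] = None
    doc_type: Optional[str] = None          # goal 2: business category
    is_duplicate: bool = False              # goal 3: canonical vs. duplicate
    destination_path: Optional[str] = None  # goal 4: registry-ready location
```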

At a high level, the automated pipeline follows nine steps: (1) load the PDF; (2) pull a site identifier; (3) extract metadata with an LLM, including sender/receiver names, address, and title; (4) validate the extracted fields and fill gaps; (5) classify the document type (e.g., correspondence, report); (6) detect duplicates; (7) rename the file; (8) place it in the correct folder; and (9) log every decision to a central CSV.
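The sketch below, using only the Python standard library, shows how such a flow might be wired together. The site-ID regex, the keyword classifier, and the hash-based duplicate check are deliberately naive stand-ins for the team's actual components, not their real code.

```python
import csv
import hashlib
import re
from pathlib import Path

# Assumes registry site IDs are short numeric codes; the pattern is a guess.
SITE_ID_RE = re.compile(r"\bsite\s*(?:id|no\.?)?\s*[:#]?\s*(\d{3,6})\b", re.IGNORECASE)

def process_document(text: str, pdf_path: Path, out_root: Path,
                     seen: dict, log: csv.DictWriter) -> None:
    # Step 1 happens upstream: `text` is the OCR output of the loaded PDF.
    match = SITE_ID_RE.search(text)                      # step 2: pull a site identifier
    site_id = match.group(1) if match else "no-site-id"
    meta = {"title": pdf_path.stem}                      # step 3: stand-in for LLM extraction
    meta.setdefault("sender", "unknown")                 # step 4: validate and fill gaps
    doc_type = "correspondence" if "dear" in text.lower() else "report"  # step 5: toy classifier
    digest = hashlib.sha256(" ".join(text.split()).encode()).hexdigest()
    first_seen = seen.setdefault(digest, pdf_path.name)  # step 6: exact-duplicate check
    is_duplicate = first_seen != pdf_path.name
    new_name = f"{site_id}_{doc_type}_{pdf_path.name}"   # step 7: rename
    dest = out_root / site_id / doc_type / new_name      # step 8: registry-ready folder
    log.writerow({"source": pdf_path.name,               # step 9: log every decision
                  "site_id": site_id, "doc_type": doc_type,
                  "duplicate_of": first_seen if is_duplicate else "",
                  "destination": "" if is_duplicate else str(dest)})
```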

The pipeline combined rule-based logic, BERT-based models, and LLMs served by Mistral via Ollama to extract and standardize key metadata, all while tackling OCR noise, duplicate detection, and release tagging. The students started with 225 gold standard files that had been fully annotated by the team, which gave them a solid foundation for training and tuning; they also held out 40 newly gold-labelled documents that their models never saw during training.
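As one hedged illustration of the LLM piece, the sketch below queries a local Ollama server running Mistral through Ollama's REST endpoint. The prompt wording and the output fields are assumptions for illustration, not the team's actual prompt.

```python
import json
import requests

# Illustrative extraction prompt; the real prompt and fields may differ.
PROMPT = (
    "Extract the following fields from this contaminated-sites document and "
    "answer with JSON only: site_id, sender, receiver, address, title. "
    "Use null for any field you cannot find.\n\nDocument text:\n{text}"
)

def extract_metadata(text: str, model: str = "mistral") -> dict:
    response = requests.post(
        "http://localhost:11434/api/generate",          # Ollama's default REST endpoint
        json={
            "model": model,
            "prompt": PROMPT.format(text=text[:8000]),  # truncate very long OCR output
            "format": "json",                           # have Ollama enforce valid JSON
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return json.loads(response.json()["response"])      # the model's JSON answer
```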

One challenge the students faced was the condition of the provided PDFs: typewritten pages photocopied multiple times, handwritten annotations in the margins, and even pages stitched together from different scans. In addition, the documents did not follow a consistent format, contained optical character recognition (OCR) errors, lacked obvious labels such as who sent them, what they are about, or whether they can be publicly released, and were not categorized by document type.
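Duplicate detection shows why this noise matters: two scans of the same letter rarely OCR identically, so exact hashing misses them and a fuzzy comparison is more robust. The standard-library sketch below is one simple way to approach it; the normalization rules and the 0.9 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so layout differences don't count.
    return " ".join(text.lower().split())

def likely_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    # Similarity ratio on normalized text; threshold chosen for illustration.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

scan_1 = "Re: Site 1234 - Remediation Plan aproved by the Ministry"
scan_2 = "Re: Site 1234 - Remediation Plan approved by the Min1stry"  # OCR errors
print(likely_duplicates(scan_1, scan_2))  # True: same letter despite the noise
```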

If the students were to start over, they would invest more time upfront in data exploration and in designing the annotation schema for manual labelling of files, build evaluation tools earlier to catch pipeline issues faster, and plan better for edge cases, like documents missing Site IDs or containing multiple conflicting senders.

At the end of the day, the students felt this project helped unlock thousands of environmental records, documents that could support contaminated site investigations. The project can also help communities understand land risks, enable faster and more transparent public access to government data, and reduce staff workload when handling Freedom of Information (FOI) requests by identifying publicly releasable documents in advance.

Parts of this article first appeared on Medium in a post written by the BC Ministry of Environment capstone team.
