Finding Fossils in the Literature
Student Capstone Project
Neotoma is an open-source, community-led database that houses paleoecology data, including the location and age of fossil sites, the species found, and descriptive information about each site. The database is used to study past environmental change and to help predict and evaluate future climate change scenarios.
The process of data entry can be a barrier to data submission, and many published records are never submitted to the database. Neotoma worked with a group of UBC Master of Data Science Vancouver students to develop a full-text journal search that uses NLP tools to identify potential records for submission to Neotoma and to extract relevant data from the journal articles. The extracted data could then be displayed in a dashboard, with a human-in-the-loop review step to refine the results.
The ideal outcome for Neotoma would be for all new paleoecology research from around the world to be uploaded to the database. Neotoma is global in scope, but not all researchers know about it or realize that their research is relevant to it.
To work towards this outcome, the team developed three data products. The first is the Article Relevance Predictor, which determines whether an article is relevant to Neotoma.
The Article Relevance Predictor monitors new additions to the xDD API article repository. New articles are then queried against the CrossRef API to gather each article's metadata, such as title, authors, and abstract. These fields became candidate features for the classification model. For training data, Neotoma supplied a list of almost 4,500 articles, comprising non-relevant (3,523) and relevant (911) articles from the Neotoma database.
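As a rough illustration of the metadata-gathering step, the sketch below queries the public CrossRef REST API for one article's record. The fields retained and the example DOI are assumptions for illustration, not the team's actual pipeline code.

```python
import requests

def fetch_crossref_metadata(doi: str) -> dict:
    """Fetch basic article metadata (title, authors, abstract) from the public CrossRef API."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    work = resp.json()["message"]
    return {
        "doi": doi,
        "title": " ".join(work.get("title", [])),
        "authors": [
            f"{a.get('given', '')} {a.get('family', '')}".strip()
            for a in work.get("author", [])
        ],
        "abstract": work.get("abstract", ""),  # not all CrossRef records include an abstract
        "journal": " ".join(work.get("container-title", [])),
        "issued": work.get("issued", {}).get("date-parts", [[None]])[0],
    }

# Example: metadata for one candidate article (hypothetical DOI, for illustration only)
print(fetch_crossref_metadata("10.1000/example.doi")["title"])
```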
The students tried several supervised learning classification models and found that a Logistic Regression model performed best, with 96.5% precision and 85.2% recall. On a test set of 666 samples, the model correctly identified 138 articles as relevant, with 5 false positives and 24 false negatives. The students also ran the pipeline against real-world data: among the 5,000 most recent articles on the xDD API, 21 were identified as relevant. These relevant titles were passed on to the next stage for fossil data extraction.
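A minimal sketch of how such a relevance classifier could be trained and scored with scikit-learn is shown below. The use of TF-IDF features over title and abstract text, the hyperparameters, and the train/test split are assumptions, not the team's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_relevance_classifier(texts, labels):
    """texts: title + abstract per article; labels: 1 = relevant, 0 = non-relevant."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.15, stratify=labels, random_state=42
    )
    model = make_pipeline(
        TfidfVectorizer(stop_words="english", max_features=20_000),
        # class_weight="balanced" compensates for the relevant class being the minority
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"precision: {precision_score(y_test, preds):.3f}")
    print(f"recall:    {recall_score(y_test, preds):.3f}")
    return model
```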
The Fossil Data Extraction model takes the relevant publications and extracts data entities considered critical metadata for Neotoma: site name and coordinates, geographic region, taxonomy, altitude, fossil ages, and author email addresses. To do this, the team created a labelling pipeline to tag 40 articles (~300k words, ~11k entities) and then trained multiple state-of-the-art transformer-based models to extract the custom entities. The final model was able to extract ~80% of the known entities in the test set of labelled articles.
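The snippet below sketches how a fine-tuned token-classification model could be applied to new text with the Hugging Face transformers library. The model name, example sentence, and entity labels are placeholders, not the team's actual checkpoint or label set.

```python
from transformers import pipeline

# Load a token-classification model fine-tuned on the custom entity labels.
# "finetuned-fossil-ner" is a placeholder name, not the team's actual checkpoint.
ner = pipeline(
    "token-classification",
    model="finetuned-fossil-ner",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entity spans
)

text = (
    "Pollen records from Lake Example (52.1 N, 122.4 W) at 910 m elevation "
    "span roughly 11,000 cal yr BP."
)

for ent in ner(text):
    # each result includes the predicted label, the matched text, and a confidence score
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```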
Once data has been extracted, it moves to the Data Review Tool, an interactive dashboard that shows articles ready to be reviewed, articles that have already been reviewed, and articles not relevant to Neotoma. Reviewers check the extracted entities and make any corrections, with the actual sentences each entity appeared in shown for reference. If an article turns out not to be relevant, the reviewer can mark it as irrelevant during review. Data from articles deemed relevant and subsequently reviewed are then added to Neotoma.
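As a rough illustration of a human-in-the-loop review interface like this, the sketch below uses Streamlit. The framework choice, the input file name, and the field names are assumptions rather than details of the team's actual Data Review Tool.

```python
import json
import streamlit as st

# Load entities produced by the extraction step
# ("extracted_entities.json" is a placeholder file name for illustration).
with open("extracted_entities.json") as f:
    articles = json.load(f)

article = st.selectbox("Article to review", articles, format_func=lambda a: a["title"])

st.subheader("Extracted entities")
for ent in article["entities"]:
    cols = st.columns([1, 2, 3])
    cols[0].write(ent["label"])                               # e.g. SITE, TAXA, AGE
    cols[1].text_input("Value", ent["text"], key=ent["id"])   # editable correction field
    cols[2].caption(ent["sentence"])                          # source sentence for reference

if st.button("Mark article as not relevant"):
    st.warning("Article flagged as irrelevant; it will be excluded from submission.")
if st.button("Approve reviewed data"):
    st.success("Reviewed data queued for upload to Neotoma.")
```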
Through this pipeline, the team was able to increase access to Neotoma, reduce the time spent on data uploads, grow the Neotoma community, and create new opportunities for future funding.