climate-plus part 2: data augmentation

It’s been quite a while since the first installment of this humble series. Things happened in the meantime: the team and I presented this project to the cohort; I presented it again to my colleagues; and most importantly, the ClimateBERT team updated their website and released both their datasets and their downstream task models. In other words, our attempt to extend the functionality of ClimateBERT seems to have been in vain. Still, because of computational limitations, we had approached the same question in a “lite” manner: we skipped the step of training an LLM on a dedicated climate-related corpus and instead fine-tuned a distilroberta model directly on the data we had available.

What to do?

So this post is mainly about what our data is and how we obtained it.

As mentioned last time, we ended up with a folder of PDF files. Note that some of them are invalid because the original URLs parsed from the table could be faulty. Another issue we faced was that the content that actually contributes to the labels is concentrated in very few pages of each document. That is to say, the majority of the scraped data is redundant.

A quick filter on file size eliminated the broken files. We tried to use the page number (a single value or a range) from the scraped table, but page numbering is extremely inconsistent across the files. For instance, there could be a few Roman-numeral pages before the actual “page 1”, so the printed page number does not match the page's position in the PDF. Therefore, we manually extracted the contributing pages from each document. Call it “a labor of love”.
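The size-based filter described above could be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: the function name `split_by_size` and the 10 KB threshold are assumptions, and the right cutoff depends on what the broken downloads in your folder look like.

```python
from pathlib import Path

# Assumed threshold: files smaller than this are treated as failed downloads.
# Tune it by inspecting the sizes in your own folder.
MIN_BYTES = 10_000


def split_by_size(folder, min_bytes=MIN_BYTES):
    """Return (kept, dropped) lists of PDF paths, split by file size."""
    kept, dropped = [], []
    for path in sorted(Path(folder).glob("*.pdf")):
        if path.stat().st_size >= min_bytes:
            kept.append(path)
        else:
            dropped.append(path)
    return kept, dropped
```

Returning the dropped files as well makes it easy to eyeball them before deleting anything.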

Then the question is, **if we want to build a transformer model that classifies sentences into a couple of labels, what do we use for training?** Naturally, we use what we have at hand: we break the pages of the documents into usable sentences. However, not all sentences are equally informative. Depending on the page layout and parsing method, some “sentences” could make absolutely no sense. So why not let gpt-3.5 pick some representative ones, as long as we provide the selection criteria?
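To make the “not all sentences are equally informative” point concrete, here is a small sketch of splitting page text into sentences and flagging the obviously broken ones. It is only an illustration: the regex splitter, the helper names, and the thresholds (`min_words`, `min_alpha_ratio`) are all assumptions, and a heuristic like this is a rough pre-filter at best, not a substitute for the gpt-3.5 selection step.

```python
import re


def split_sentences(page_text):
    """Naive splitter: collapse whitespace, then break after '.', '!' or '?'."""
    text = re.sub(r"\s+", " ", page_text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def looks_usable(sentence, min_words=5, min_alpha_ratio=0.7):
    """Heuristic check: enough words, and mostly alphabetic characters."""
    words = sentence.split()
    if len(words) < min_words:
        return False
    chars = sentence.replace(" ", "")
    alpha = sum(c.isalpha() for c in chars)
    return alpha / len(chars) >= min_alpha_ratio


page = "The board oversees climate-related risks and opportunities.\n12 34 56 7.8 % 9"
candidates = [s for s in split_sentences(page) if looks_usable(s)]
```

Table fragments and page furniture tend to fail one of the two checks, while ordinary prose passes.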

The TCFD recommendation report defines four categories of disclosures (Governance, Strategy, Risk Management, and Metrics and Targets), and under each category there are further subcategories. We could pass these definitions to the prompt template that interacts with OpenAI’s model. Of course, the number of retrieved sentences is fully customizable, and we picked 5 in our case.
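A prompt template along these lines might look like the sketch below. The wording of the template, the `build_prompt` helper, and the one-line category descriptions are assumptions for illustration (the descriptions are abbreviated paraphrases and should be replaced with the text from the TCFD report); the call that actually sends the prompt to the model is left out.

```python
# Abbreviated paraphrases of the four TCFD categories (assumed wording;
# swap in the definitions from the report itself).
TCFD_CATEGORIES = {
    "Governance": "the organization's governance around climate-related risks and opportunities",
    "Strategy": "the impacts of climate-related risks and opportunities on business, strategy, and financial planning",
    "Risk Management": "how the organization identifies, assesses, and manages climate-related risks",
    "Metrics and Targets": "the metrics and targets used to assess and manage climate-related risks and opportunities",
}

# Hypothetical template; the real prompt may be worded differently.
PROMPT_TEMPLATE = (
    "You are given a page from a corporate report.\n"
    "Category: {category}\n"
    "Definition: disclosures about {definition}.\n"
    "Select {n} sentences from the page that best represent this category. "
    "Return them verbatim, one per line.\n\n"
    "Page:\n{text}"
)


def build_prompt(category, page_text, n_sentences=5):
    """Fill the template for one category and one page of text."""
    return PROMPT_TEMPLATE.format(
        category=category,
        definition=TCFD_CATEGORIES[category],
        n=n_sentences,
        text=page_text,
    )
```

Keeping the number of sentences as a parameter is what makes the “we picked 5” choice easy to change later.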

The code to generate answers was recycled from another project, “chitchat,” that I was working on. I recently added a Streamlit app which now accepts multiple file uploads (compared to some other tools available online, it’s a plus for sure).

In this way, we are able to augment our dataset to a usable size of 500+.

#nlp #climate