PDF Library & Next Steps

SimFin
Feb 26, 2019

Dear SimFin users and everyone in the financial community,

We just released an update to SimFin that includes our new PDF library. Check it out here: https://simfin.com/pdf/library

What is the PDF library and how do we build it?

The PDF library aims to collect financial PDFs of all listed companies around the world, with a focus on annual/quarterly reports but also earnings releases, presentations and earnings call transcripts. The library can be accessed by anyone and it is the first time that all financial reporting is made available openly in a single place.

The PDFs in our library are collected by our open source PDF crawler. If you want to improve it or are just curious about what’s happening behind the curtain, check it out on GitHub. For now, the crawler blindly collects all PDFs it can find on a company homepage. Once it is finished, we use machine learning to classify the PDFs into 7 categories: annual reports, quarterly reports, earnings releases, earnings presentations, earnings call transcripts (these 5 are the ones we mostly care about), other financial documents, and a last category we call “irrelevant”, which covers the remaining non-finance-related PDFs.
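To make the “blindly collecting” step concrete, here is a minimal sketch of how PDF links could be pulled out of a single page. It is not the actual SimFin crawler: the class `PdfLinkCollector`, the function `collect_pdf_links`, and the sample page are hypothetical, and a real crawler would also download pages and follow links across the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collects the targets of <a href="..."> tags that point at PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check only the path, so a query string does not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            if absolute not in self.pdf_urls:
                self.pdf_urls.append(absolute)


def collect_pdf_links(html, base_url):
    """Return all distinct PDF URLs linked from one HTML page, in page order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls


# Hypothetical page content, for illustration only.
sample_html = """
<a href="/investors/annual-report-2018.pdf">Annual report</a>
<a href="https://example.com/files/Q3_2018.PDF?download=1">Q3 report</a>
<a href="/about">About us</a>
"""
pdf_links = collect_pdf_links(sample_html, "https://example.com/")
```

Note that nothing here judges whether a PDF is relevant. In the pipeline described above, that decision is left entirely to the classification step.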

We also have separate classification algorithms that determine the applicable financial year, reporting period, etc. for the PDFs we care about. The result of all this is the PDF library, in which you can find the original financial reports and other documents released along with the earnings, all classified and therefore sortable and filterable. As a user, you can also correct or confirm the classifications made by our algorithms. Doing so helps us a lot in improving our machine learning models, so we are looking forward to your help here.
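As a rough illustration of what a classified entry might look like, the sketch below models the seven categories and guesses the financial year and reporting period from a file name. The regular-expression heuristic is a simple stand-in chosen for this example, not the classification algorithms SimFin uses, and all names (`Category`, `ClassifiedPdf`, `guess_year_and_period`) are hypothetical.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    ANNUAL_REPORT = "annual report"
    QUARTERLY_REPORT = "quarterly report"
    EARNINGS_RELEASE = "earnings release"
    EARNINGS_PRESENTATION = "earnings presentation"
    EARNINGS_CALL_TRANSCRIPT = "earnings call transcript"
    OTHER_FINANCIAL = "other financial document"
    IRRELEVANT = "irrelevant"


@dataclass
class ClassifiedPdf:
    url: str
    category: Category
    fiscal_year: Optional[int]
    period: Optional[str]  # "Q1".."Q4", "FY", or None if unknown


# A four-digit year starting with 19 or 20, not embedded in a longer number.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")
# "Q1".."Q4", not glued to a preceding letter/digit or a following digit.
QUARTER_RE = re.compile(r"(?<![A-Za-z0-9])Q([1-4])(?![0-9])", re.IGNORECASE)


def guess_year_and_period(text, category):
    """Heuristic stand-in: take the first year and quarter found in the text."""
    year_match = YEAR_RE.search(text)
    year = int(year_match.group(0)) if year_match else None
    quarter_match = QUARTER_RE.search(text)
    if quarter_match:
        period = "Q" + quarter_match.group(1)
    elif category is Category.ANNUAL_REPORT:
        period = "FY"  # assume annual reports cover the full year
    else:
        period = None
    return year, period


def classify(url, category):
    """Build a library entry from a URL and an already-known category."""
    year, period = guess_year_and_period(url.rsplit("/", 1)[-1], category)
    return ClassifiedPdf(url, category, year, period)


# Hypothetical entries, for illustration only.
library = [
    classify("https://example.com/Q3_2018.pdf", Category.QUARTERLY_REPORT),
    classify("https://example.com/annual-report-2017.pdf", Category.ANNUAL_REPORT),
]
# Structured fields are what make the library filterable and sortable.
annual_reports = [doc for doc in library if doc.category is Category.ANNUAL_REPORT]
newest_first = sorted(library, key=lambda doc: doc.fiscal_year or 0, reverse=True)
```

A heuristic like this will often be wrong, for example when a file name contains the publication year instead of the financial year, which is one reason user corrections and confirmations are useful.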

So why did we choose to build this PDF library, and what’s the point of it?

First, many analysts and investors like to look at a company’s original reports when they do a deep analysis of its fundamentals and business prospects. We think the PDF library makes it quick to find the original documents a company has published and is therefore a time saver for everyone who uses it.

Second, we think it is useful to have a single, centralised place on the internet that stores all these documents. Even though information that was once on the internet rarely disappears completely, it tends to get harder to find as time passes. As websites change, so does the data that is openly available on them. Keeping track of this information and making sure it doesn’t disappear completely is valuable in our opinion.

Third, and most importantly, the PDF library will be the basis for getting data from around the world into SimFin (you can read more about this here). We are already quite advanced in the development of our extraction engine, but we need a bit more time to make it really robust. It’s always hard to estimate when a project as complex as this will be “ready” for the public, but you will very probably see the first results on SimFin in the coming months. We are excited to advance the PDF extraction further, as it will enable us to:

a) improve data quality (as the XBRL data quality is quite poor),

b) access data from around the world (as all companies we are aware of publish their financials in PDF format) and

c) get data more quickly than we currently do, as the XBRL data usually comes online only a few days after the PDFs.

We hope you enjoy the new features on SimFin, and we are happy to hear your feedback.

Thomas Flassbeck, CEO and founder of SimFin
