PDF Extractor Alpha

SimFin
Apr 29, 2019

We are happy to announce that the first version of our PDF Extractor is now online. You can read more about the idea behind the PDF extraction in our previous posts (here and here), so this post will focus instead on the current state of the extractor, how it works and what the next steps are.

The PDF Extractor

You can find the PDF extractor by heading to the SimFin PDF Library and clicking on “open” in the “Extracted data” column for annual or quarterly reports. The extractor currently focuses solely on tables with numeric data, as these contain almost all the information we are interested in (financial statements, segment reporting etc.). Our general goal is to extract as much structured information as possible from annual and quarterly reports. In the future we will therefore expand the extractor to also gather information from outside the tables, but for the moment the focus on tables is an important simplification of the general problem. We will be testing the extractor over the coming weeks and are happy to hear your feedback, either in the forums or by e-mail.

How It Works

We built the PDF extraction engine ourselves for the specific purpose of extracting financial information from PDFs. We looked at various open source libraries for table extraction from PDFs but did not find them reliable enough. This is probably because these libraries try to extract any table from a PDF, while we can simplify the problem substantially by only looking at “numeric tables”. We also use a proprietary deep convolutional neural network to help with the extraction of data and increase precision even further.
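To give a rough idea of what restricting the problem to “numeric tables” can look like, here is a minimal sketch in Python. It is not our extraction engine: the function names, the number pattern and the 0.6 threshold are all assumptions made for illustration. It takes a table that has already been split into rows of cell strings and checks what share of the data cells look like financial numbers.

```python
import re

# Hypothetical pattern for a financial figure: optional opening parenthesis
# (negative values), optional sign, optional currency symbol, digits with
# optional thousands separators, optional decimals, optional percent sign.
_NUMERIC_CELL = re.compile(r"\(?[-+]?[$€£]?\d[\d,]*(?:\.\d+)?%?\)?")


def is_numeric_cell(text):
    """Return True if the cell text looks like a number such as 1,234 or (56)."""
    return _NUMERIC_CELL.fullmatch(text.strip()) is not None


def numeric_fraction(table):
    """Share of non-empty data cells that are numeric.

    The first row (column headers) and the first column (row labels)
    are skipped, since they are usually text even in financial tables.
    """
    cells = [cell for row in table[1:] for cell in row[1:] if cell.strip()]
    if not cells:
        return 0.0
    return sum(is_numeric_cell(cell) for cell in cells) / len(cells)


def is_numeric_table(table, threshold=0.6):
    """Treat a table as numeric if enough data cells are numbers.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return numeric_fraction(table) >= threshold


income_statement = [
    ["", "2018", "2017"],
    ["Revenue", "1,234", "1,100"],
    ["Net income", "(56)", "78"],
]
board_members = [
    ["Name", "Role"],
    ["Alice", "Director"],
    ["Bob", "Auditor"],
]

print(is_numeric_table(income_statement))
print(is_numeric_table(board_members))
```

A filter like this lets everything downstream ignore tables that are mostly text, which is the kind of simplification that makes the remaining problem more tractable than general table extraction.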

Next Steps

The next step is to connect the extractor with the SimFin database. Currently you can only look at the extracted data and download it to your computer. There is not yet a way to upload this data automatically to SimFin and then access it via the API or in the data finder.

We first want to test the extractor extensively before making this connection. It is the last step in the extraction process and probably the easiest: the data is already structured, the time periods, currencies and units are identified, and context information about the report (type of report, reporting period etc.) is available via the PDF library.

We have also built other machine learning classifiers that can detect, for example, on which page the financial statements are located, so in theory we have everything in place to upload the data. However, we think that now is a good moment to also rework SimFin Fuse (our current crawler/XBRL extractor) and combine it with the data extracted from PDFs to obtain a “SimFin Fuse 2.0” that can be used not only for US companies but for companies all around the world.

Covering markets other than the US will then be a step-by-step process in which we go from one market to the next (starting with Germany, our “home” market), continually testing the efficacy of our system while gathering feedback from you on where you want us to focus our efforts.

We are excited about getting very close to our goal of hosting data from all companies around the world on SimFin, and hope you enjoy the new feature.

Thomas Flassbeck, CEO
