Snapshot of how our PDF extractor perceives a table

SimFin Fuse 2.0

SimFin
Oct 2, 2019

We are happy to finally introduce SimFin Fuse 2.0, which you can find here. It has already been online for the last two months, but we still had some things to improve and didn’t feel the time was right for an official presentation, so this post is now its official introduction.

SimFin Fuse 2.0 is the last big step towards our goal of increasing our data quality and expanding our data beyond the US market. It combines our PDF crawling with the PDF extraction and manages the upload of the structured data to SimFin (it can also still process XBRL filings as before). As a result, we are now (technically) ready to crawl and host fundamental company data from any listed company around the world that reports its financial statements in PDF format. We already have some first non-US companies online (see for example Adidas, German telecom company 1 und 1 Drillisch or quarterly data back to 1999 for Bayer AG), but every new PDF extraction is still a challenge (more on that below). For now we are therefore progressing slowly but steadily, and we will get faster with every new company that we process.

To recap, we started building our PDF crawler in autumn last year (read more here) and published the first results of our PDF extraction engine in spring (more about this here). The PDF extraction in its core form was working at the beginning of this summer, but after testing it on some real-world cases, we realised that we still had a lot to improve. It’s one thing to recognise a table and extract data from it, but it’s another to get everything right about the actual financial statements. In order to extract data over long time periods (e.g. 20 years, that is 80 quarters), we have to correctly recognise the time period of every column in the table, as well as all units, currencies etc. Needless to say, there is a huge variety in how companies report and display this in their PDFs. We have this working solidly now, at least for the companies we have encountered so far. Every new company can still be a challenge though (suddenly you have 80% unreadable characters in a PDF, for example…) and can take quite long to process initially. But with every iteration of our crawler/extractor, our algorithms are getting a little bit better and more robust. Additionally, for some tasks where our algorithms currently perform poorly, we have built in tools for making manual corrections. All of these corrections go directly into a dataset that we’ll use to solve these issues with machine learning once the dataset is big enough.
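To give a feeling for what "recognising time periods, units and currencies" means in practice, here is a minimal Python sketch. It is not our actual extraction code, just an illustration under simplifying assumptions: the names `parse_period`, `parse_units`, `SCALES` and `CURRENCIES` are made up for this example, headers are assumed to look like "Q3 2019" or "FY 2018", and a header with a year but no quarter is assumed to be a full year. Real reports are far messier than this.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical lookup tables; real reports use many more variants.
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}
CURRENCIES = {"€": "EUR", "eur": "EUR", "$": "USD", "usd": "USD", "us dollars": "USD"}


@dataclass
class Period:
    year: int
    kind: str  # "Q1".."Q4" or "FY"


def parse_period(header: str) -> Optional[Period]:
    """Parse a column header such as 'Q3 2019' or 'FY 2018'."""
    year_match = re.search(r"\b(19|20)\d{2}\b", header)
    if year_match is None:
        return None  # no year found, so this is not a period column
    year = int(year_match.group(0))
    quarter_match = re.search(r"\bQ([1-4])\b", header, re.IGNORECASE)
    if quarter_match:
        return Period(year, "Q" + quarter_match.group(1))
    # Assumption: a year without a quarter marker means a full fiscal year.
    return Period(year, "FY")


def parse_units(caption: str) -> Tuple[Optional[str], int]:
    """Guess currency and scale from a caption such as 'in EUR millions'."""
    text = caption.lower()
    # Naive substring matching; first matching marker wins.
    currency = next((code for marker, code in CURRENCIES.items() if marker in text), None)
    scale = next((factor for word, factor in SCALES.items() if word in text), 1)
    return currency, scale


print(parse_period("Q3 2019"))
print(parse_units("(in thousands of US dollars)"))
```

Even this toy version shows where things go wrong: substring matching will happily find "eur" inside unrelated words, and nine-month or half-year columns are not handled at all. Cases like these are where the manual corrections mentioned above come in.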

We are as committed as ever to making fundamental financial data freely available, and we have just taken another big step in this direction by starting to offer data from international companies for free. But in the end, running the servers costs money, and the PDF extraction requires more computational power than crawling the structured data from the SEC, as we rely a lot on machine learning to make it work and still have to make some manual adjustments. So if you want to help us and speed up the process of covering more companies, consider acquiring a SimFin+ subscription, as all funds will help us tremendously in making SimFin an even better data platform. With a SimFin+ subscription you can also request international companies, and we’ll prioritise the upload of these (you can still request US companies without SimFin+).

Coming up next: Bulk download rework and API improvements

In the coming weeks there will be some big improvements to the bulk download and the API. The bulk download is being reworked completely right now. It will be much easier and more user-friendly to load the data and work with it, and the datasets will also be updated more frequently. The entire backend has to be changed for this, but that will also enable us to make the API much faster than it currently is. We’ll inform you with another post once that is done and polished, so stay tuned.

Thanks to all users/supporters

On a personal note, thanks also to all the people who helped in any way to get SimFin this far, be it with encouraging words, funding, knowledge or simply feedback. To be honest, I couldn’t have imagined getting to this point, as the workload ahead always seemed infinite and time is limited to 24 hours a day. SimFin is very slowly becoming the data platform that I had in mind when I first started coding it 5 years ago. To date, I have done almost all of the coding and design on SimFin personally, be it the entire front end and backend, the PDF extraction or the creation of the machine learning models. I know there is still a lot of work to do, but very slowly I am starting to see something like the light at the end of the tunnel. I never doubted that I had to enter this tunnel, I just didn’t know how long it was and whether I was going to come out of it at all. Then again, standing still or abandoning the project never felt like an option, so I kept walking, and hearing that some people appreciate the results, I think it was (and is) worth it.

Thomas Flassbeck, CEO
