We just released our new bulk download and, along with it, our new Python API. There are various tutorials already available on GitHub that range from the basics of using the new Python API to statistical analysis and machine learning. All tutorials are available as Jupyter Notebooks on Google Colab, so you can try everything in your browser without having to download anything.
In this blog post I’m going to quickly show you how we structured the new bulk download and what you can do with our new Python API.
Bulk Download File Structure
[If you mainly care about what amazing things you can do with the Python API, you can skip this part]
Before we could build a Python API, we had to decide how to structure the data so that it can be downloaded quickly and worked with seamlessly. So we threw away the design of our old bulk download and reworked everything. In this new iteration, the data is split into different datasets (i.e. tables), so that as little information as possible is “repeated”. In a way, you could say the data is structured like a relational database. All the general information that pertains to a single company and doesn’t change over time (or does change, but only the “current” value is relevant) is contained in one dataset (or table) called “Companies”. The information saved here is currently: the ticker, the company name, the SimFin ID (our internal unique identifier for a company) and the Industry ID.
You might ask: why is there only a cryptic “IndustryId” and not something more tangible like the actual name of the industry? Because each Industry ID encodes a sector (the first 3 digits) and an industry (the last 3 digits), and writing this information out would make the file bigger than it needs to be (and thus the API slower, since it downloads all the files from the server). For that reason, the mapping of Industry IDs to the actual sector/industry lives in a separate dataset called “Sector/Industry”.
If you are familiar with relational databases, splitting the data in this way will feel quite intuitive. Joining the data together (i.e. combining the companies and industries datasets) is quite easy using the Pandas DataFrames the Python API produces, but could also be done with any relational database engine (you could essentially transfer the tables from the bulk download one-to-one into a relational database).
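As a sketch of such a join in plain Pandas, using tiny made-up rows that only mimic the shape of the real “Companies” and “Sector/Industry” datasets (the tickers, IDs and industry names below are illustrative, not real SimFin data):

```python
import pandas as pd

# Made-up rows in the shape of the "Companies" dataset.
df_companies = pd.DataFrame({
    'Ticker': ['AAA', 'BBB'],
    'SimFinId': [1, 2],
    'Company Name': ['Alpha Corp', 'Beta Inc'],
    'IndustryId': [101002, 102001],  # first 3 digits: sector, last 3: industry
})

# Made-up rows in the shape of the "Sector/Industry" mapping dataset.
df_industries = pd.DataFrame({
    'IndustryId': [101002, 102001],
    'Sector': ['Technology', 'Healthcare'],
    'Industry': ['Software', 'Biotechnology'],
})

# Join the two tables on IndustryId, like a LEFT JOIN in SQL.
df = df_companies.merge(df_industries, on='IndustryId', how='left')
```

After the merge, each company row carries its human-readable sector and industry, without those strings ever being repeated in the companies file itself.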
So far, so good. But how about the more interesting data, like fundamentals and share prices?
Fundamental data is split into different datasets for the P&L, the balance sheet and the cash flow statement. Banks and insurance companies are also handled separately from the rest, since their standardisation template (i.e. the mapping of the items originally reported by a company to standardised figures) differs from the one used by most “normal” companies.
There are several different time periods available for the fundamental datasets: annual data, quarterly data and trailing twelve months (TTM) data.
Let’s take a look at the “Income Statement” dataset with the variant “annual data”:
As you can see, a company is identified here by its ticker and its SimFinId (both are unique identifiers for a company). There are “meta” columns for the currency, the fiscal year/period, and the report and publishing dates. If you have a SimFin+ subscription, you also get some additional columns, like the source of the data, which links directly to the SEC filings (or, for our international companies, the PDFs the data was extracted from).
A full list of the available fundamental columns in the income statement can be found on our website; the premium datasets come with some additional columns. The fundamental datasets do not contain ratios or derived figures such as EBITDA. This is because they are very easy to calculate with the new Python API (more on that below), and since there are often several ways to calculate a ratio, we leave that choice to you.
In essence, the fundamental files are very intuitive to read: each row is one time period for one company, and all the relevant information is contained in that row. Calculations across different datasets (e.g. for ratios) are very easy with our Python API, as shown below.
The Python API
The Python API automatically downloads the bulk download files, keeps them up to date and loads all data into Pandas DataFrames so that you can start working with it instantly. It also offers a lot of other cool functionality; I’ll demonstrate some of it below, but for more examples I suggest checking out the extensive tutorials.
We suggest using a Python package manager such as Conda and will be using Conda in the following examples. So once you have installed Conda, create a new environment by typing in your command line:
conda create --name simfintest python=3.6
If you get an error here, Conda is probably not installed properly.
Next, activate the environment:
source activate simfintest
With the environment active, we can install the SimFin Python API:
pip install simfin
And that’s all. It’s as easy as that! We can start using the Python API now.
If you want to plot charts (as we will do below), you also have to install matplotlib, and for more complex charts seaborn (this is optional and not required for the core functionality):
pip install matplotlib
pip install seaborn
We first have to import our package and set some variables, such as the location where the bulk download files should be saved.
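A minimal version of that setup might look as follows, using the SimFin package’s configuration helpers (the data directory below is just an example path; pick whatever suits you):

```python
import simfin as sf
from simfin.names import *

# Use your own API key here if you have a SimFin+ subscription;
# 'free' selects the free datasets.
sf.set_api_key('free')

# Directory where the bulk download files will be stored.
sf.set_data_dir('~/simfin_data/')
```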
If you have a SimFin+ subscription, you can pass your API key to sf.set_api_key() to get the premium datasets; otherwise the “free” key works just fine (the free data is delayed by 12 months but still contains more than 10 years of fundamental data).
sf.set_data_dir() sets the directory where the bulk download files will be stored; modify this to your liking.
We can now start to load the actual data:
This code loads:
- All companies in the US market
- All industry code mappings
- All quarterly income statement, balance sheet and cash flow data for the US market
If you execute the code, the data is downloaded and stored on your disk. On subsequent runs the downloaded files are reused, so you don’t have to download them every time. By default a dataset is “refreshed” (i.e. downloaded again) after 30 days, but you can change that; if you delete the downloaded bulk download files manually, fresh files will be downloaded as well.
Working With the Data
We can now work with the data. Let’s display Revenue and Net Income for Microsoft (ticker “MSFT”) for example:
This code produces the following image:
Pretty cool, right? As this is the quarterly dataset, there is some seasonality present; you can get rid of it by using the trailing twelve months dataset, for example, which sums up the last 4 available quarters at every point in time.
The variables “REVENUE” and “NET_INCOME” are our so-called “Python shortcuts”, with which you can select certain columns of the datasets. Alternatively, you can also use the full column names found in the CSV files. A list of all available columns and Python shortcuts for each dataset can be found here.
Let’s calculate the net profit margin now. The cool thing about this Python API is that you can run calculations over the whole dataset in a matter of seconds.
That’s it, you’ve just calculated the net profit margin for all time periods and all companies in the dataset. Say we want to plot the net profit margin for several companies; that is just as easy:
And the result:
The same also works with ratios that involve data from separate datasets. For example, to calculate EBITDA we take the operating income from the df_income DataFrame and simply add the D&A from the df_cashflow DataFrame:
We use .fillna(0) here to account for cases where values might not be available (e.g. because the company did not report them).
Plotting this for Microsoft:
It is also possible to do more advanced calculations, e.g. calculating the price/sales ratio for every single day across 10+ years of data.
To do this, we essentially have to convert the fundamentals we want to relate to the share price from quarterly to daily figures (the value stays constant for the duration of a quarter and then jumps to the newly reported one). The “reindex” function from the SimFin package makes this easy:
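The underlying idea can be shown in plain Pandas with tiny made-up numbers: a forward-filled reindex from quarterly report dates onto daily price dates (the figures below are illustrative, not real data):

```python
import pandas as pd

# Quarterly revenue for one hypothetical company (made-up numbers).
quarterly_revenue = pd.Series(
    [100.0, 110.0, 120.0],
    index=pd.to_datetime(['2019-03-31', '2019-06-30', '2019-09-30']),
)

# Daily share prices (here just a constant toy series).
daily_dates = pd.date_range('2019-03-31', '2019-09-30', freq='D')
prices = pd.Series(30.0, index=daily_dates)

# Forward-fill the quarterly figure onto every day: the value stays
# constant until the next quarterly report appears.
revenue_daily = quarterly_revenue.reindex(daily_dates, method='ffill')

# A toy price/sales-style ratio for every single day.
price_sales = prices / revenue_daily
```

The SimFin package’s reindex helper applies this idea per company across the whole dataset; see the tutorials for its exact parameters.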
Resulting in the following plot:
Computing this takes a while (roughly 30 seconds), so you can also explore the SimFin Data Hubs, which make complex calculations like this even easier.
That’s all for now! We’d be happy to hear your feedback and are excited to see what you can build with this.
The full example: