At this moment, there are exactly 604,844 registered corporations in the USA. The story of why I know that is pretty fascinating.
A while back, I embarked on a personal project to unify data sources and build a reasonably powerful business search tool, and in the process, I learned a lot about what information is and isn’t easy to acquire out there on the web.
First, you have social networks. Those were easy: connect to their APIs, run a basic string search for an entity, and let the user quickly dive into the results and flag the correct match. From there, you can look at that entity’s engagement and activity on each network.
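To make the "basic string search, then let the user confirm" step concrete, here is a minimal sketch using only Python's standard library. The function name `rank_candidates` and the substring bonus are my own illustrative choices, not something taken from any social network's API; the candidate names would come from whatever each API returns.

```python
from difflib import SequenceMatcher


def rank_candidates(query, candidates, limit=5):
    """Return (score, name) pairs, best match first.

    Hypothetical helper: scores each candidate name against the query with
    difflib's similarity ratio, and bumps names that contain the query
    outright, so the user sees the likeliest entities at the top.
    """
    q = query.casefold().strip()
    scored = []
    for name in candidates:
        n = name.casefold().strip()
        score = SequenceMatcher(None, q, n).ratio()
        if q and q in n:
            # Assumed heuristic: a plain substring hit counts as a strong match.
            score = max(score, 0.9)
        scored.append((score, name))
    # Sort on the score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

The user still makes the final call; the ranking only decides which profiles they look at first.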
Second, you have general information repositories, like Wikipedia. Using the DBpedia and Wikidata APIs, I started pulling in RDF data about companies. Because those data sources are crowdsourced, some companies are rich with data and others have none at all. In short, these sources proved unreliable when it came to identifying all possible companies.
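Here is a rough sketch of what that looks like against Wikidata, assuming its public SPARQL endpoint and the standard SPARQL JSON results format. The property and item IDs in the query (P31 "instance of", P279 "subclass of", P856 "official website", P571 "inception", Q4830453 "business") are my best understanding of Wikidata's vocabulary and are worth double-checking. The helper names are hypothetical, and the code only builds the query and parses a response, leaving the HTTP call to you.

```python
WIKIDATA_SPARQL_URL = "https://query.wikidata.org/sparql"

# Assumed query shape: match an English label, require the item to be some
# kind of business, and treat website and inception date as optional.
QUERY_TEMPLATE = """
SELECT ?company ?website ?inception WHERE {
  ?company rdfs:label "%s"@en ;
           wdt:P31/wdt:P279* wd:Q4830453 .
  OPTIONAL { ?company wdt:P856 ?website . }
  OPTIONAL { ?company wdt:P571 ?inception . }
}
"""


def build_company_query(name):
    """Build the SPARQL text, escaping backslashes and double quotes in the name."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE % escaped


def flatten_bindings(result):
    """Turn a SPARQL JSON result into plain dicts; absent optionals become None."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append(
            {v: binding[v]["value"] if v in binding else None for v in variables}
        )
    return rows


def missing_fields(row):
    """List the variables a row has no value for."""
    return sorted(key for key, value in row.items() if value is None)
```

Running `missing_fields` over every row is a quick way to see the unevenness for yourself: some companies come back fully populated, others with nothing but an ID.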
Third, identifying the products and services a company offers proved to be the most difficult part. Most sources don’t reliably track a company’s services, products, or product lines, so the most reliable place to get that information is the company’s own website. The only problem is that this is supposed to be a unified search, so I had to build a special scraper and parser that took that information and translated it into my data graphs.
Fourth, getting financial data depends more on the category of company than I thought it would. If you’re searching for a public company, you have stock prices, accounting data, and all sorts of interesting financial information. If you’re dealing with a private company, you have to be creative about how to characterize its finances.
It was here that I discovered that you can look into a private company’s finances by searching through SEC filings. They come through an antiquated XML / search interface called EDGAR, which is served over FTP. I was able to get a basic API going that consumed the information from that FTP server and parsed it into useful formats, but even then, it was limited in what you could do with it. One interesting thing I found in these filings is that the company’s board members are always listed.
It was then that I realized why tools like Bloomberg’s company search always have directors and leadership listed: you can extract that from SEC filings. Easy peasy! I also discovered that you can get a raw data list of all existing corporations in the US, and if you parse it correctly, you can tease out their CIK numbers, which are used to identify trails of SEC filings.
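As a sketch of what "parsed correctly" can mean, assume the raw list has one company per line in the form `COMPANY NAME:0000123456:`, which is how I understand the SEC's CIK lookup file to be laid out. The catch is that company names can contain colons themselves, so the split has to happen from the right. The function names below are hypothetical.

```python
def parse_cik_line(line):
    """Split 'COMPANY NAME:0000123456:' into (name, cik); None if malformed.

    Assumes the line ends with a colon and the CIK is the last
    colon-delimited field before it.
    """
    line = line.strip()
    if not line.endswith(":"):
        return None
    # rpartition splits on the LAST colon, so colons inside the name survive.
    name, sep, cik = line[:-1].rpartition(":")
    if not sep or not name or not cik.isdigit():
        return None
    # Zero-pad to 10 digits so CIKs compare consistently as strings.
    return name, cik.zfill(10)


def index_by_cik(lines):
    """Group names under each CIK, in case one CIK appears under several names."""
    index = {}
    for line in lines:
        parsed = parse_cik_line(line)
        if parsed is None:
            continue  # skip blank or malformed lines
        name, cik = parsed
        index.setdefault(cik, []).append(name)
    return index
```

Feeding the file's lines through `index_by_cik` gives a lookup table, and `len(index)` is the number of distinct CIKs it contains.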
Once you parse all of the above information, you can build quite the graph of data. And while I’ve seen some companies that provide bits and pieces of that kind of information, nobody really provides a unified search across everything. Hopefully I will provide that service someday…