Solving the Problem of Unified Deep Business Search is Fascinating

At this moment, there are exactly 604,844 registered corporations in the USA. The story of why I know that is pretty fascinating.

A while back, I embarked on a personal project to unify data sources and build a reasonably powerful business search tool, and in the process, I learned a lot about what information is and isn’t easy to acquire out there on the web.

First, you have social networks. Those were easy. Just connect to their APIs, run a basic string search for an entity, and let the user quickly browse the results and flag the correct match. From there, you can look into engagement and activity on each network.

Second, you have general information repositories, like Wikipedia. Using the DBpedia and Wikidata APIs, I started pulling in RDF data about companies. Because those data sources are crowdsourced, some companies are rich with data and others have none. Basically, these sources proved unreliable when it came to identifying all possible companies.
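One way to put a number on that unevenness is to check how many expected fields each company record actually fills in. The snippet below is only a rough sketch: the field names in `EXPECTED_FIELDS`, the record shape, and the 0.5 threshold are all hypothetical and do not come from DBpedia or Wikidata.

```python
# Hypothetical set of fields a "complete" company record would have.
EXPECTED_FIELDS = ("name", "website", "founded", "industry")


def coverage(record: dict) -> float:
    """Return the fraction of expected fields that hold a non-empty value."""
    filled = sum(
        1 for field in EXPECTED_FIELDS
        if record.get(field) not in (None, "", [])
    )
    return filled / len(EXPECTED_FIELDS)


def split_by_coverage(records: list[dict], threshold: float = 0.5):
    """Separate records into (rich, sparse) lists using a coverage threshold."""
    rich = [r for r in records if coverage(r) >= threshold]
    sparse = [r for r in records if coverage(r) < threshold]
    return rich, sparse
```

Records that land in the sparse list are the ones that would need to be filled in from another source.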

Third, identifying the products and services a company offers proved to be the most difficult part. Most sources don’t reliably track a company’s services, products, product lines, etc., so the most reliable place to get that information is the company’s own website. The only problem is that this is supposed to be a unified search, so I had to build a special scraper and parser that took that information and translated it into my data graphs.

Fourth, getting financial data depends more on the category of company than I thought it would. On one hand, if you’re searching for a public company, you have stock prices, accounting data, and all sorts of interesting financial information. On the other hand, if you’re dealing with a private company, you have to be creative about how to characterize its finances.

It was here that I discovered that you can look into a private company’s finances by searching through SEC filings. They come through an antiquated XML and search interface called EDGAR, which is served over FTP. I was able to get a basic API going that consumed the information from that FTP server and parsed it into useful formats, but even then, it limited what you could do with the data. One interesting thing I found in these filings is that the company’s board members are always listed.

It was then that I realized why tools like Bloomberg’s company search always have directors and leadership listed: you can extract that from SEC filings. Easy peasy! I also discovered that you can actually get a raw data list of all existing corporations in the US, and if you parse it correctly, you can tease out their CIK numbers, which are used to identify trails of SEC filings.
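Teasing out the CIK numbers might look something like the sketch below. It assumes each line of the list has the form `COMPANY NAME:0001234567:` (a name, a colon, a numeric CIK, and a trailing colon). That layout is an assumption, and `parse_cik_line` and `build_cik_index` are hypothetical helper names, so check them against the actual file before relying on this.

```python
def parse_cik_line(line: str):
    """Parse one 'NAME:CIK:' line into (name, cik), or None if it doesn't fit.

    The line format is an assumption. Company names may themselves contain
    colons, so the CIK is taken from the right-hand side of the line.
    """
    line = line.strip().rstrip(":")
    name, sep, cik = line.rpartition(":")
    if not sep or not cik.isdigit():
        return None
    # Pad the CIK to ten digits so every identifier has the same width.
    return name.strip(), cik.zfill(10)


def build_cik_index(lines):
    """Map each company name to the list of CIKs seen for it."""
    index: dict[str, list[str]] = {}
    for line in lines:
        parsed = parse_cik_line(line)
        if parsed is not None:
            name, cik = parsed
            index.setdefault(name, []).append(cik)
    return index
```

The index keeps a list per name because nothing guarantees that a company name appears only once in the raw list.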

When I was looking at that raw line-delimited list of registered corporations in the USA, I became curious how long the list was. Frankly, it was killing my browser’s scroll bar, so I just had to know. So, I opened up my browser’s JavaScript console, selected the text from the DOM, and ran a basic regex splitter function to break all of that content into a long JavaScript array. From there, I just console.log’ed out the array.length, et voilà! 604,844 items. Holy cow. Is that a lot? Or is that a little? The United States is so large. I couldn’t really map it mentally, so I started doing basic string searches in it, and found the regulars… Google, Uber, Twitter, etc. Then I started looking up other, smaller private companies that were either LLCs or traditional corporations, and yep, they were there. What about the local taco shop? Nope. Not a corporation. Fascinating!
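The same count-and-search exercise can be sketched outside the browser. This is a Python stand-in for the console one-liner, not the original code; `split_entries` and `find_matches` are hypothetical names, and the case-insensitive substring match is just one reasonable reading of "basic string search".

```python
import re


def split_entries(raw: str) -> list[str]:
    """Split line-delimited text into entries, skipping blank lines."""
    # Accept both Unix (\n) and Windows (\r\n) line endings.
    return [line.strip() for line in re.split(r"\r?\n", raw) if line.strip()]


def find_matches(entries: list[str], needle: str) -> list[str]:
    """Return every entry that contains needle, ignoring case."""
    needle = needle.lower()
    return [entry for entry in entries if needle in entry.lower()]
```

With the raw text loaded into a string, `len(split_entries(raw))` gives the length of the list, and `find_matches(entries, "taco")` answers the taco shop question.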

Once you parse all of the above information, you can build quite the graph of data. And while I’ve seen some companies that provide bits and pieces of that kind of information, nobody really provides a unified search across everything. Hopefully I will provide that service someday…
