There are all sorts of “stacks” out there… MEAN, WISP, LAMP. Today, I introduce you to a new one that I’m working on: GrEAN. It stands for Graph – Express – Angular – Node.
Recently, I was tasked with building a method for performing data analysis as a service, as well as building a data science team that could perform those services for various industries. So, I quickly kicked off a deep exploration of the field and the technologies available in the space. I found what I learned very engaging, and it got me thinking not only about the technical challenges of performing analysis, but also about managing the process of doing data science in a business context.
DBPedia, in general, is a linked-data extraction of Wikipedia. If you’ve been living under a rock and don’t know what Wikipedia is, it’s a crowdsourced encyclopedia hosted on the internet. In terms of infrastructure, Wikipedia reports on its own wiki page that it is powered by clusters of Linux servers and MySQL databases, and uses Squid caching servers to handle the 25,000 to 60,000 page requests per second that it gets on average. In terms of the product, it is culturally significant in that it is one of the most referenced sources of general information on earth, if not the outright leader. Again, DBPedia, for all intents and purposes, is a linked-data version of that dataset.
So, you’ve heard of a triplestore; that’s an important first step. Now you’re wondering why you’d need one. That is a good question. I believe the best way to answer it is to talk a little bit about what we know about triples as a data model, what SPARQL is good for, and where the industry has gone in the last few years that has caused us to need triples and SPARQL in the first place. Let’s get started.
Partial updates are somewhat problematic in the world of RESTful applications. Currently, we use POST and PUT to write or update data, but updating sub-properties can get hard to code for once you get into the more subtle application logic and error management, especially on datasets that are very large or that nest data deeply inside a single JSON object.
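To make that concrete, here is a minimal sketch in Python of what a partial update on a nested JSON object involves. It assumes the JSON Merge Patch rules (RFC 7396) as the update semantics, which is only one possible approach and not something the tooling above requires. The `merge_patch` helper, the `user` document, and its field names are all hypothetical and exist only for illustration.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7396 semantics) and return a new document.

    - A non-object patch replaces the target outright.
    - A null (None) value in the patch deletes that key.
    - Nested objects are merged recursively.
    """
    if not isinstance(patch, dict):
        return patch
    # If the target is not an object, start from an empty one.
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "remove this property"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical stored resource with a deeply nested sub-property.
user = {
    "name": "Ada",
    "address": {
        "city": "London",
        "geo": {"lat": 51.5, "lon": -0.12},
    },
}

# A PUT-style update would require the client to resend the whole document.
# A partial update only carries the leaf that changed:
patch = {"address": {"geo": {"lat": 51.51}}}
updated = merge_patch(user, patch)

# Removing a property is expressed with an explicit null.
without_name = merge_patch(user, {"name": None})
```

Even in this small sketch the subtle parts show up: the server has to decide what a null means, what to do when the target is not an object, and how to avoid clobbering sibling properties such as `city` and `lon`.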
But, regardless, PUT and POST have done a satisfactory job up until now, and nobody really needs to use PATCH in a relational context. But therein lies an interesting point: data is getting bigger, and, naturally, semantic data is becoming much more prevalent, and it’s URI-based. It logically follows that if data continues to become more semantic, and you’re dealing more often with deeply nested structures, you’ll need a URI-based updating method that is more flexible than PUT and POST. But you don’t have to take my word for it; let’s ask an expert.
When discussing triplestores and big data, people often ask me what the difference is between a triplestore and a traditional relational database. The following is a collection of everything I think you’d need or want to know in order to talk intelligently about the subject.