When discussing triplestores and big data, people often ask me what the difference is between a triplestore and a traditional relational database. The following is a collection of everything I think you’d need or want to know in order to talk intelligently about the subject.
The Basic Concept
The makers of AllegroGraph once gave a presentation about triplestores vs. relational databases that included an example I found helpful in illustrating the basic differences between these two types of databases. The hypothetical data was based on describing the Kennedys.
The speaker said, “Here’s some triples that describe the members of the Kennedy family. In a triplestore, there is a list of triples, each containing Subject-Predicate-Object, and is thus a collection of flat data.” Here, they showed this list of data. “triple 32” refers to the actual identifier/pointer to the data object, and the other three fields are the subject, predicate, and object (the triple). Thus, you could read it as “person2 is a thing with [type] person”, “person2 has the first name Rose”, “person2 is a female”, and so on.
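A flat list like that can be sketched in Python as plain tuples. The identifiers and predicate names here (person2, type, first-name, sex) are my own assumptions based on the readings above, not the actual data from the presentation.

```python
# Each triple is (subject, predicate, object).
# The names are illustrative assumptions, not the presentation's real data.
triples = [
    ("person2", "type", "person"),
    ("person2", "first-name", "Rose"),
    ("person2", "sex", "female"),
]

def describe(subject, store):
    """Return every (predicate, object) pair recorded for a subject."""
    return [(p, o) for s, p, o in store if s == subject]

for predicate, obj in describe("person2", triples):
    print(f"person2 --{predicate}--> {obj}")
```

Note that there is no table definition anywhere: the only structure is the three-part statement itself.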
In a relational database, there would be several tables populated with information about the family. You’d have a names table, a spouses table, a schools table, etc., plus link tables to relate all of those fields together. Via joining, you’d be able to generate useful information about the family.
But, from experience, you know that doing complex queries on these people gets really hard really fast.
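For comparison, here is a minimal sketch of the relational layout using Python's built-in sqlite3 module. The table names, column names, and the two rows are hypothetical; they only show that even one simple relationship already needs a link table and two joins.

```python
import sqlite3

# In-memory database with a hypothetical schema (not from the presentation).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, sex TEXT);
    CREATE TABLE spouses (person_id INTEGER, spouse_id INTEGER);
    INSERT INTO people VALUES (1, 'Joseph', 'male'), (2, 'Rose', 'female');
    INSERT INTO spouses VALUES (1, 2);
""")

# Resolving "who is married to whom" joins the link table to people twice.
rows = conn.execute("""
    SELECT a.first_name, b.first_name
    FROM spouses
    JOIN people AS a ON a.id = spouses.person_id
    JOIN people AS b ON b.id = spouses.spouse_id
""").fetchall()

print(rows)
```

Every additional kind of relationship means another table and another join, which is where the queries start to get hard.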
What Makes a Triplestore Valuable
Let’s say for example you wanted to view “all Kennedys who are involved in politics who have been involved in producing legislation that contributed to the capturing of Osama Bin Laden”. Imagine trying to do that query in SQL. The joins alone would be funny, but where would you even get all of that information? Wouldn’t it be unrealistic to assume there are tables out there that have columns that would support those joins?
The main reason that query would be hard is that you are unlikely to have all of the data it requires, and if you did, it would only be useful to people running that one query. Not to mention, there are abstract ideas at work in the predicates / relationships / assumptions, such as “involved in”, “produced”, “contributed to”, “captured”, etc. What is beautiful about triplestores, also known as RDF databases, is that they identify things and relationships with resources defined in shared ontologies. Resources look like URLs / URIs, and many of them can be looked up to return chunks of serialized RDF that a client can parse. This means that in a query such as the one above, one could use the “involvedIn” property as defined by resource X and the “Osama_Bin_Laden” resource from dbpedia.org’s datastores, and so on.
Also, because everything is stored as statements with three pieces of information, you can add as many new predicates to the database as you want without changing any schema, and nothing will change except how far your queries can reach. In other words, it’s easy to grow your database.
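To make the no-schema point concrete, here is a toy in-memory store in Python. It is only a sketch: the has-spouse predicate and the person1 identifier are assumptions I made up for illustration, and a real triplestore adds indexing and persistence on top of this idea.

```python
# A toy store: a set of (subject, predicate, object) tuples.
store = {
    ("person2", "first-name", "Rose"),
    ("person2", "sex", "female"),
}

def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# A brand-new predicate needs no ALTER TABLE: it is just another statement.
store.add(("person2", "has-spouse", "person1"))

print(match(store, p="has-spouse"))
```

The existing statements are untouched, and the same match function can immediately answer questions about the new predicate.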
More On Writing Queries and Things
When you run a query on a triplestore, what you get back depends on the kind of query: SELECT queries return a table of variable bindings, while CONSTRUCT and DESCRIBE queries return a graph. There are several query languages designed to query RDF databases (triplestores): RDQL, SPARQL, RQL, SeRQL, Versa, and more. The most popular of them by far is SPARQL, which could be characterized as something like a combination of SQL and Turtle. Much like an SQL database groups rows into tables, a SPARQL dataset can group triples into named graphs, and each named graph is identified by a URI, so it is itself a resource.
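A SPARQL SELECT query works by matching triple patterns that contain variables and returning the bindings that satisfy all of them. The following pure-Python sketch imitates that idea on a toy store. It is not a SPARQL engine, and the identifiers and predicate names are assumptions for illustration.

```python
def query(store, patterns):
    """Match a list of triple patterns; terms starting with '?' are variables.

    Returns a list of dicts mapping variable names to values, roughly what a
    SPARQL SELECT hands back as rows of bindings.
    """
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in store:
                candidate = dict(binding)
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # Bind the variable, or check it agrees with an earlier binding.
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        break
                else:
                    next_solutions.append(candidate)
        solutions = next_solutions
    return solutions

# Hypothetical data in the spirit of the Kennedy example.
store = [
    ("person1", "type", "person"),
    ("person1", "first-name", "Joseph"),
    ("person2", "type", "person"),
    ("person2", "first-name", "Rose"),
    ("person2", "sex", "female"),
]

# Roughly: SELECT ?who ?name WHERE { ?who :sex "female" . ?who :first-name ?name }
rows = query(store, [("?who", "sex", "female"), ("?who", "first-name", "?name")])
print(rows)
```

Reusing the same variable (?who) in two patterns is what plays the role of a join: both patterns have to agree on its value.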
Data Models and Serialization Formats
A triplestore, for all intents and purposes, is an RDF database. RDF stands for Resource Description Framework, which basically has the task of describing resources (entities of data) with statements (e.g. <Kristian is a male>, <Kristian is a programmer>, <Kristian lives in California>, <California is a place in the USA>, <a region has the property of being near other regions>, etc). The more statements you have, the richer the graph they form together.
The most popular serializations of RDF are RDF/XML and Turtle. Another well-known serialization is N3, which is supposed to be easier to read and write than the XML version (Tim Berners-Lee reportedly said that XML was the worst thing that has happened to RDF, presumably because it is so verbose and it confused developers who tried to process it as normal XML).
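Here is a small Python sketch of how those statements link up into a graph. The predicate names (is-a, lives-in, located-in) are my own shorthand for the statements above, not terms from any real ontology.

```python
# The statements from the text, written as (subject, predicate, object).
statements = [
    ("Kristian", "is-a", "male"),
    ("Kristian", "is-a", "programmer"),
    ("Kristian", "lives-in", "California"),
    ("California", "located-in", "USA"),
]

def containing_places(person, store):
    """Follow lives-in, then any chain of located-in links."""
    found = []
    frontier = [o for s, p, o in store if s == person and p == "lives-in"]
    while frontier:
        place = frontier.pop()
        if place in found:
            continue  # guard against cycles in the data
        found.append(place)
        frontier.extend(o for s, p, o in store if s == place and p == "located-in")
    return found

print(containing_places("Kristian", statements))
```

No single statement says that Kristian is in the USA. That conclusion only appears when two statements are connected through the shared node California.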
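To give a feel for what a serialization looks like on disk, here is a hand-rolled Python sketch that writes triples as N-Triples. N-Triples is a simpler line-based format that is a subset of Turtle, and I am swapping it in here because it is the easiest one to produce by hand. The example.org URIs are hypothetical, and treating any object that starts with http:// or https:// as a resource is a simplification of mine, not part of the format.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace

def to_ntriples(triples):
    """Serialize (subject, predicate, object) strings as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        if o.startswith("http://") or o.startswith("https://"):
            obj = f"<{o}>"  # simplification: URL-looking objects are resources
        else:
            # Plain literal: escape backslashes and double quotes.
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

triples = [
    (EX + "person2", RDF_TYPE, EX + "Person"),
    (EX + "person2", EX + "firstName", "Rose"),
]

print(to_ntriples(triples))
```

Each output line is one complete statement ending in a period, which is what makes these formats easier to read and to diff than nested XML.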
SPARQL is to Resources as SQL is to Tables & Databases
What is absolutely awesome about SPARQL querying of triplestores is that you can use resources, which, when written, look similar to an HTML tag: this is a <http://resource>.
Since a resource is a URL which refers to some ontology, property, resource, etc., you can basically join any data against any other data, anywhere, as long as it is RDF or some derivative of it. That means I can query companies according to Wikipedia’s definition (and accompanying properties) of a company, filter that list by companies located in California according to the definition and properties of a state as defined by the government, and then get all of the websites / homepages of the CEOs of those companies, as defined by ontology X and resource Y.
Basically, if you have access to enough resources, you will be able to derive almost any kind of information you’ll ever want.
By the way, that is precisely why electronic intelligence gathering is so crucial to entities like the NSA: if you have access to enough RDF data (hence them taking data from Facebook and Google, etc.), then you can know quite a lot about everything, and more importantly, you can draw conclusions and make predictions over complex ideas.
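The cross-source join can be sketched in Python as nothing more than a set union followed by pattern matching. Everything below is hypothetical: the example.org URIs, the two companies, and the locatedIn / ceo / homepage properties stand in for whatever real ontologies you would actually use.

```python
# Hypothetical property and place URIs.
LOCATED_IN = "http://example.org/onto/locatedIn"
CEO = "http://example.org/onto/ceo"
HOMEPAGE = "http://example.org/onto/homepage"
CALIFORNIA = "http://example.org/place/California"

# Dataset 1: company facts from one publisher.
companies = {
    ("http://example.org/co/Acme", LOCATED_IN, CALIFORNIA),
    ("http://example.org/co/Acme", CEO, "http://example.org/people/alice"),
    ("http://example.org/co/Globex", LOCATED_IN, "http://example.org/place/Nevada"),
    ("http://example.org/co/Globex", CEO, "http://example.org/people/bob"),
}

# Dataset 2: people facts from a different publisher.
people = {
    ("http://example.org/people/alice", HOMEPAGE, "http://alice.example.org/"),
    ("http://example.org/people/bob", HOMEPAGE, "http://bob.example.org/"),
}

# Because both sources use the same URIs for the same people, merging is a union.
merged = companies | people

def ceo_homepages(store, place):
    """Homepages of CEOs of companies located in the given place."""
    in_place = {s for s, p, o in store if p == LOCATED_IN and o == place}
    ceos = {o for s, p, o in store if p == CEO and s in in_place}
    return sorted(o for s, p, o in store if p == HOMEPAGE and s in ceos)

print(ceo_homepages(merged, CALIFORNIA))
```

The two datasets never had to agree on a schema. They only had to agree on the identifiers, which is exactly what shared URIs give you.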
Where Can I Get a Triplestore To Try It?
If you want to pay for a semantic database, there are several companies working on their own proprietary databases. To find them, just google “semantic database” or “rdf database” or “triplestore”, and you’ll find plenty. If you don’t want to pay for it, there are some open source products out there currently.
I’m assuming that since the semantic / RDF / triplestore area of databases is such a hot commodity right now, you’ll probably have to pay for it if you want it to perform well. After all, this is the big thing right now, as you can see in the tech news: Google has a lot invested in growing their Knowledge Graph, Facebook is desperately trying to import more information into its Social Graph, and other tech giants are either very interested in developing it, or aren’t and should be.