Thursday 13 May 2021

RDF Explained (hopefully) the Right Way

I had intended to create a few illustrations for my last blog entry (when I wrote it I did not have access to my usual graphic-design tools), but instead I think it would be more useful to go through the different elements of RDF and explain each in both concept and function, so that we can better understand how it all works together. If any of this sounds too obvious or condescending (to those who know better), please keep in mind that its goal is to better educate the ignorant me who existed only a few years ago.


Individuals

The most basic RDF concept is the "individual": consider it to be a 'cell' in a spreadsheet or relational database. Yet here it is free from those constraints: at its most basic level, an 'individual' is neither a column nor row, nor is it attributed to any 'table', and the only thing that counts is that we maintain its 'individuality', or make it distinguishable from other 'individuals'. If we want to create a new 'individual', all we have to do is name it: in fact, naming it is what brings it into existence. To better relate this to our relational-database experience, consider it as a cell with only an 'id' attribute.




An individual can be an 'empty' named shell, and as such it can be used to demonstrate relations between different individuals (more on this later). But an individual can also 'contain' data. In our relational-database minds, we could consider attributing data to this individual to be 'filling its cell', but a single individual can have many data attributes. We could think of these multiple data attributes as 'column names', but that would complicate things for our later understanding. Let's consider, instead, that each piece of data is in fact another type of 'unnamed individual' (a cell with neither an id nor a column 'identifier'), and that the only thing linking it to the individual is the declared data attribute relation.
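To make this a little more concrete, here is a minimal sketch in plain Python (not a real RDF library) of an individual and its data attributes stored as three-part statements. The identifiers ("person_0001", "data:name" and so on) and the sample values are made up for illustration.

```python
# Each statement is a (subject, relation, value) tuple.
# All identifiers and values below are illustrative assumptions.
triples = set()

# Each data attribute is one more statement linking the named
# individual to a plain, unnamed value.
triples.add(("person_0001", "data:name", "Bob"))
triples.add(("person_0001", "data:birthdate", "1930-04-02"))
triples.add(("person_0001", "data:address", "1 Elm Street"))


def attributes_of(individual, store):
    """Collect every relation/value pair stated about one individual."""
    return {rel: val for subj, rel, val in store if subj == individual}


bob = attributes_of("person_0001", triples)
print(bob["data:name"])
```

Note that nothing here resembles a table: the individual exists simply because statements mention its name, and it can gain further attributes at any time.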

Most RDF tutorials go straight to organising 'classes' (the next part of this), but I think this opposite-way-round approach is easier on the mind as far as understanding RDF is concerned: data is, after all, what we ultimately will be extracting from our RDF database, so it's best we understand that first. So when thinking of RDF database structure, I find that it's best to first think of the individuals we will be structuring, and think of the data attributes that each should have.

So, conceptually speaking, we have created an individual with 'name', 'birthdate' and 'address' attributes. What are we describing here? Most instinctively we would put something that had a name, birthdate and address in a 'person' box or category... and this brings us to 'classes'.


Classes

'Classes' are basically boxes in which we can group individuals of the same 'type': these could be any individuals with any data attributes, and it's only our putting them in the class that makes them 'of' that class.

So let's create a few 'people' individuals (with name, birthdate and address attributes), then create a 'people' class to put them in. In a way, thinking from our relational-database experience, we have created a 'people' table with several 'people' in it (with different rows (ids) and columns (data attributes)). At its most basic, an individual belonging to a class has an 'is a' relation to that class.
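Continuing in plain Python, class membership can be sketched as just one more statement per individual. The relation name "is_a" and the class name "class:people" are assumptions for this sketch; in actual RDF the 'is a' relation is written rdf:type.

```python
# Hypothetical identifiers throughout; "is_a" stands in for RDF's type relation.
triples = {
    ("person_0001", "data:name", "Bob"),
    ("person_0002", "data:name", "Alice"),
    ("person_0001", "is_a", "class:people"),
    ("person_0002", "is_a", "class:people"),
}


def members_of(class_name, store):
    """Return every individual declared to be 'of' the given class."""
    return sorted(
        subj for subj, rel, obj in store
        if rel == "is_a" and obj == class_name
    )


people = members_of("class:people", triples)
print(people)
```

The class is not a container that holds the individuals; membership is only the 'is a' statement, so removing that statement would leave the individual and its data intact.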

Pierrick trigger alert

So now we have a bunch of individuals grouped within a "people" class. Let's go through the procedure again, but this time let's create "house" individuals, and group them in a "house" class. What attributes does a house have? Let's say that they're between adjoining streets: each house would have a "data:number" and "data:streetname" attribute, and just for fun, a "data:floors" attribute.

But hey, won't the "data:address" attribute of our "people"-class individuals conflict, or become redundant, with the "data:streetname" and "data:number" attributes of the individuals in our "house" class? Yes, and this is why it is important to think down to the data attributes of each individual when creating a database schema or structure.

So now that we have individuals in 'house' and 'people' groups, who lives in which house? If we want to indicate that "Bob" (0001) lives in house h_0003, we have to create a new connection between the two individuals: "person_livesIn_house" (the naming scheme, as mentioned in my earlier post, doesn't really matter, but let's call it such to make our understanding easier). So, if we interconnect our individuals as so:
...we see that we have two different 'groups' of individuals with links between them.

Were we to apply this model to a typical MySQL database (in the most memory-economic way possible), we would need a) one table for 'houses', b) another table for 'people', and c) an 'id' column in each, plus a 'connecting' column on which to make 'joins' (something like 'living_in_house_id'). If, say, we wanted to know who lived in the house at "1 Elm Street", our SQL query would look like this:
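A minimal sketch of such a join, using Python's built-in sqlite3 module as a stand-in for MySQL. Apart from 'living_in_house_id', every table name, column name and row here is an assumption made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the schema and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (
        id INTEGER PRIMARY KEY,
        number INTEGER,
        streetname TEXT,
        floors INTEGER
    );
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        living_in_house_id INTEGER REFERENCES houses(id)
    );
    INSERT INTO houses VALUES (3, 1, 'Elm Street', 2), (4, 2, 'Elm Street', 1);
    INSERT INTO people VALUES (1, 'Bob', 3), (2, 'Alice', 4);
""")

# Join the two tables on the connecting column, then filter by address.
rows = conn.execute(
    """
    SELECT people.name
    FROM people
    JOIN houses ON people.living_in_house_id = houses.id
    WHERE houses.number = ? AND houses.streetname = ?
    """,
    (1, "Elm Street"),
).fetchall()

residents = [name for (name,) in rows]
print(residents)
conn.close()
```

The point to notice is that the link between a person and a house only exists because both tables were designed in advance with matching id columns.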


...whereas, in RDF, our query would be:
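A hedged sketch of that query. RDF stores are normally queried with SPARQL; the comment below shows roughly what such a query could look like (prefix declarations omitted, names assumed), and the plain-Python code underneath mimics the same pattern matching over a small set of made-up triples so that it can be run without any RDF library.

```python
# Roughly equivalent SPARQL (a sketch; prefixes omitted, names assumed):
#   SELECT ?name WHERE {
#     ?person :person_livesIn_house ?house ;
#             data:name ?name .
#     ?house  data:number 1 ;
#             data:streetname "Elm Street" .
#   }

# Hypothetical data; identifiers follow the naming used in this post.
triples = {
    ("person_0001", "data:name", "Bob"),
    ("person_0002", "data:name", "Alice"),
    ("h_0003", "data:number", 1),
    ("h_0003", "data:streetname", "Elm Street"),
    ("h_0004", "data:number", 2),
    ("h_0004", "data:streetname", "Elm Street"),
    ("person_0001", "person_livesIn_house", "h_0003"),
    ("person_0002", "person_livesIn_house", "h_0004"),
}


def subjects(relation, value, store):
    """Every subject that has the given relation pointing at the given value."""
    return {s for s, r, v in store if r == relation and v == value}


def values(subject, relation, store):
    """Every value the given subject has for the given relation."""
    return {v for s, r, v in store if s == subject and r == relation}


# Houses matching both address patterns, then the people linked to them.
houses = (subjects("data:number", 1, triples)
          & subjects("data:streetname", "Elm Street", triples))
residents = sorted(
    name
    for house in houses
    for person in subjects("person_livesIn_house", house, triples)
    for name in values(person, "data:name", triples)
)
print(residents)
```

There is no join column here: the query simply follows whatever relations have been stated between individuals.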


...and we can even do more complex, abstract queries such as "who lives in a house with more than 1 floor?" as such:
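As a sketch of the idea in plain Python (again a stand-in for a real SPARQL query, with made-up identifiers and data), the 'more than 1 floor' question becomes a filter on one relation followed by two hops along other relations:

```python
# Hypothetical data; identifiers follow the naming used in this post.
triples = {
    ("person_0001", "data:name", "Bob"),
    ("person_0002", "data:name", "Alice"),
    ("h_0003", "data:floors", 2),
    ("h_0004", "data:floors", 1),
    ("person_0001", "person_livesIn_house", "h_0003"),
    ("person_0002", "person_livesIn_house", "h_0004"),
}

# Step 1: houses whose floor count passes the filter.
tall_houses = {s for s, r, v in triples if r == "data:floors" and v > 1}

# Step 2: people linked to one of those houses.
residents = {
    s for s, r, v in triples
    if r == "person_livesIn_house" and v in tall_houses
}

# Step 3: the names of those people.
names = sorted(v for s, r, v in triples if r == "data:name" and s in residents)
print(names)
```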
 

...and this is just a database with two classes ('tables').

We can also add unlimited classes and individuals (e.g. 'pets' (not the French word), 'cars', 'trades'... whatever!) to our own ontology (nota: the preferred RDF terminology for a 'dataset' or 'database' that is a single catalog of RDF triples). So, for example, were we to add another 'class' and a set of individuals therein, say, 'lottery ticket wins' (and the individuals that are lottery winners), we could now ask things like "the names and addresses of people living in a one-floor house who won the lottery before August 19, 1957". Try doing that in a relational database.

Not only can we add new classes and individuals to our own ontology, but we can import others' data as well. Were we to import, say, the yellow pages, we would suddenly have a huge database of names, addresses, and phone numbers; all we would have to do is ensure (in our ontology) that the imported data's relational properties are understood by ours. For example, if their ontology's relation between a 'human' individual and a 'residence' is 'humanLivesAt', we would have to declare that their 'human' is the same as our 'person', that their 'residence' is the same as our 'house' class, and that their 'humanLivesAt' property is the same as our 'person_livesIn_house' property. Then, should we import their data, we can query both datasets as one using our own ontology's language. Another method, although one that makes our queries a bit more complex, is to access another dataset remotely (every ontology has a URL (URI) exactly for this) by querying both their database (with their terminology) and ours.
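The 'same as' declarations can be sketched in plain Python as a simple term mapping applied at import time. This is a simplification: a real RDF store would typically keep the imported terms and let a reasoner apply equivalence statements (OWL has owl:equivalentClass and owl:equivalentProperty for this) rather than rewrite anything. The imported identifiers ("yp_881", "yp_res_17") are made up.

```python
# Our own data and vocabulary (hypothetical identifiers).
ours = {
    ("person_0001", "is_a", "person"),
    ("person_0001", "person_livesIn_house", "h_0003"),
}

# Imported data expressed in the other ontology's terms (made-up rows).
theirs = {
    ("yp_881", "is_a", "human"),
    ("yp_881", "humanLivesAt", "yp_res_17"),
}

# Declared equivalences: their term -> our term.
same_as = {
    "human": "person",
    "residence": "house",
    "humanLivesAt": "person_livesIn_house",
}


def translate(triple, mapping):
    """Rewrite any term that has a declared equivalent in our vocabulary."""
    return tuple(mapping.get(term, term) for term in triple)


merged = ours | {translate(t, same_as) for t in theirs}

# One query, in our own terms, now covers both datasets.
living = sorted((s, v) for s, r, v in merged if r == "person_livesIn_house")
print(living)
```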

The next step, to avoid overlaps or errors, to improve query efficiency, and even give the query engine the ability to reason, is structuring our classes and applying rules to them, but we've gone quite far enough for one day.

