Thursday, 13 May 2021

RDF Explained (hopefully) the Right Way

I had intended to create a few illustrations for my last blog entry (when I wrote it I did not have access to my usual graphic-design tools), but I think it would be more useful instead to go through the different elements of RDF and explain each in both concept and function, so that we can better understand how it all works together. If any of this sounds too obvious or condescending (to those who know better), please keep in mind that its goal is to educate the ignorant me who existed only a few years ago.


Individuals

The most basic RDF concept is the "individual": consider it to be a 'cell' in a spreadsheet or relational database. Yet here it is free from those constraints: at its most basic level, an 'individual' is neither a column nor row, nor is it attributed to any 'table', and the only thing that counts is that we maintain its 'individuality', or make it distinguishable from other 'individuals'. If we want to create a new 'individual', all we have to do is name it: in fact, naming it is what brings it into existence. To better relate this to our relational-database experience, consider it as a cell with only an 'id' attribute.




An individual can be an 'empty' named shell, and as such it can be used to demonstrate relations between different individuals (more on this later). But an individual can also 'contain' data. In our relational-database minds, we could consider attributing data to this individual to be 'filling its cell', but a single individual can have many data attributes. We could consider these multiple data attributes as 'column names', but that would complicate things for our later understanding. Let's consider, instead, that each piece of data is in fact another type of 'unnamed individual' (a cell with neither an id nor a column 'identifier'), and that the only thing linking it to the individual is the declared data-attribute relation.
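To make the idea concrete, here is a minimal sketch in plain Python (not an RDF library): each statement is a (individual, relation, value) tuple. The identifier "p_0001", the "data:" relation names and the sample values (including the birthdate) are all invented for illustration.

```python
# A tiny in-memory 'triple store': a set of (individual, relation, value) tuples.
# All identifiers and values below are hypothetical examples.
triples = set()

def add(subject, predicate, obj):
    """Record one statement; naming an individual here is all it takes to create it."""
    triples.add((subject, predicate, obj))

# Each data attribute is just a relation pointing from the individual to a plain value.
add("p_0001", "data:name", "Bob")
add("p_0001", "data:birthdate", "1931-04-02")
add("p_0001", "data:address", "1 Elm Street")

def attributes(individual):
    """Return {relation: set of values} for one individual."""
    result = {}
    for s, p, o in triples:
        if s == individual:
            result.setdefault(p, set()).add(o)
    return result

print(attributes("p_0001"))
```

Note that nothing here resembles a table: adding a fourth attribute, or a second value for an existing one, is just one more tuple.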

Most RDF tutorials go straight to organising 'classes' (the next part of this), but I think this opposite-way-round approach is easier on the mind as far as understanding RDF is concerned: data is, after all, what we ultimately will be extracting from our RDF database, so it's best we understand that first. So when thinking of RDF database structure, I find that it's best to first think of the individuals we will be structuring, and think of the data attributes that each should have.

So, conceptually speaking, we have created an individual with 'name', 'birthdate' and 'address' attributes. What are we describing here? Most instinctively we would put something that had a name, birthdate and address in a 'person' box or category... and this brings us to 'classes'.


Classes

'Classes' are basically boxes in which we can group individuals of the same 'type': these could be any individuals with any data attributes, and it's only our putting them in the class that makes them 'of' that class.

So let's create a few 'people' individuals (with name, birthdate and address attributes), then create a 'people' class to put them in. In a way, thinking from our relational-database experience, we have created a 'people' table with several 'people' in it (the rows being the ids, the columns the data attributes). At its most basic, an individual belonging to a class has an 'is a' relation to that class.
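In the same plain-Python sketch, class membership is nothing more than one extra statement per individual. In RDF the standard 'is a' relation is called rdf:type; the identifiers "class:people", "p_0001", "p_0002" and the names are hypothetical.

```python
RDF_TYPE = "rdf:type"  # RDF's standard name for the 'is a' relation

# Hypothetical individuals; putting them 'in' the class is just another triple.
triples = {
    ("p_0001", "data:name", "Bob"),
    ("p_0002", "data:name", "Alice"),
    ("p_0001", RDF_TYPE, "class:people"),
    ("p_0002", RDF_TYPE, "class:people"),
}

def members(class_id):
    """List every individual declared to be 'a' member of the given class."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == class_id)

print(members("class:people"))
```

Removing the two rdf:type statements would take the individuals out of the class without touching their data.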

Pierrick trigger alert

So now we have a bunch of individuals grouped within a "people" class. Let's go through the procedure again, but this time let's create "house" individuals, and group them in a "house" class. What does a house have? Let's say that they're between adjoining streets: each house would have a "data:number" and "data:streetname" attribute, and just for fun, a "data:floors" attribute.

But hey, won't the "data:address" attribute of our "people"-class individuals conflict, or become redundant, with the "data:streetname" and "data:number" attributes of the individuals in our "house" class? Yes, and this is why it is important to think down to the data attributes of each individual when creating a database schema or structure.

So now that we have individuals in 'house' and 'people' groups, who lives in which house? If we want to indicate that "Bob" (0001) lives in house h_0003, we have to create a new connection between the two individuals: "person_livesIn_house" (the naming scheme, as mentioned in my earlier post, doesn't really matter, but let's call it such to make our understanding easier). So, if we interconnect our individuals in this way...
...we see that we have two different 'groups' of individuals with links between them.

Were we to apply this model to a typical MySQL database (in the most memory-economic way possible), we would need a) one table for 'houses', b) another table for 'people', and c) an 'id' column in each, plus a 'connecting' column between them on which to make 'joins' (something like 'living_in_house_id'). If, say, we wanted to know who lived in the house at "1 Elm Street", our SQL query would look like such:
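As a rough sketch of the relational version, here is one way the two tables and the join could look. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the table layout, the column names other than 'living_in_house_id', and the sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (
        id TEXT PRIMARY KEY, number INTEGER, streetname TEXT, floors INTEGER
    );
    CREATE TABLE people (
        id TEXT PRIMARY KEY, name TEXT,
        living_in_house_id TEXT REFERENCES houses(id)
    );
    INSERT INTO houses VALUES ('h_0003', 1, 'Elm Street', 2);
    INSERT INTO houses VALUES ('h_0004', 2, 'Elm Street', 1);
    INSERT INTO people VALUES ('0001', 'Bob', 'h_0003');
    INSERT INTO people VALUES ('0002', 'Alice', 'h_0004');
""")

# Who lives at 1 Elm Street? The answer requires a JOIN on the connecting column.
rows = conn.execute(
    """
    SELECT people.name
    FROM people
    JOIN houses ON people.living_in_house_id = houses.id
    WHERE houses.number = ? AND houses.streetname = ?
    """,
    (1, "Elm Street"),
).fetchall()
names = [name for (name,) in rows]
print(names)
```

Every further 'kind of thing' we want to ask about means another table and another JOIN.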


...whereas, in RDF, our query would be:
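A query in RDF's query language, SPARQL, is essentially a set of triple patterns with blanks to fill in. The sketch below imitates that with plain Python set operations rather than a real SPARQL engine; the data and helper names are hypothetical, and the SPARQL-like pattern in the comment is only an approximation for orientation.

```python
# Hypothetical data: two people, two houses, and the links between them.
triples = {
    ("0001", "data:name", "Bob"),
    ("0002", "data:name", "Alice"),
    ("h_0003", "data:number", 1),
    ("h_0003", "data:streetname", "Elm Street"),
    ("h_0004", "data:number", 2),
    ("h_0004", "data:streetname", "Elm Street"),
    ("0001", "person_livesIn_house", "h_0003"),
    ("0002", "person_livesIn_house", "h_0004"),
}

def subjects(predicate, obj):
    """All individuals that have `predicate` pointing at `obj`."""
    return {s for s, p, o in triples if p == predicate and o == obj}

def objects(subject, predicate):
    """Everything that `subject` points at through `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Roughly equivalent to the pattern:
#   ?house  data:number 1 ; data:streetname "Elm Street" .
#   ?person person_livesIn_house ?house ; data:name ?name .
houses = subjects("data:number", 1) & subjects("data:streetname", "Elm Street")
residents = {
    name
    for house in houses
    for person in subjects("person_livesIn_house", house)
    for name in objects(person, "data:name")
}
print(residents)
```

There is no table to join: the query simply follows the relations from one individual to the next.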


...and we can even do more complex, abstract queries such as "who lives in a house with more than 1 floor?" as such:
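Continuing the same plain-Python imitation (again with hypothetical data, and not an actual SPARQL query), the 'more than 1 floor' question only adds a numeric filter on one relation:

```python
# Hypothetical data: houses with a floor count, and who lives in which house.
triples = {
    ("0001", "data:name", "Bob"),
    ("0002", "data:name", "Alice"),
    ("h_0003", "data:floors", 2),
    ("h_0004", "data:floors", 1),
    ("0001", "person_livesIn_house", "h_0003"),
    ("0002", "person_livesIn_house", "h_0004"),
}

# Step 1: every house whose 'data:floors' value is greater than 1.
tall_houses = {s for s, p, o in triples if p == "data:floors" and o > 1}

# Step 2: every individual linked to one of those houses by 'person_livesIn_house'.
dwellers = {s for s, p, o in triples
            if p == "person_livesIn_house" and o in tall_houses}

# Step 3: the names attached to those individuals.
names = sorted(o for s, p, o in triples if p == "data:name" and s in dwellers)
print(names)
```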
 

...and this is just a database with two classes ('tables').

We can also add unlimited classes and individuals (e.g. 'pets' (not the French word), 'cars', 'trades'... whatever!) to our own ontology (nota: 'ontology' is the preferred RDF terminology for a 'dataset' or 'database' that is a single catalogue of RDF triples). So, for example, were we to add another 'class' and a set of individuals therein, say, 'lottery ticket wins' (and the individuals that are lottery winners), we can now ask for things like "the names and addresses of people living in a one-floor house who won the lottery before August 19, 1957". Try doing that in a relational database.
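Here is how that lottery question could be answered in the same plain-Python sketch. The relation name "win_wonBy_person", the win identifiers and all dates are invented for the example, and the address is assembled from the house's number and street name.

```python
from datetime import date

# Hypothetical data: people, houses, and a new set of 'lottery win' individuals.
triples = {
    ("0001", "data:name", "Bob"),
    ("0002", "data:name", "Alice"),
    ("0001", "person_livesIn_house", "h_0003"),
    ("0002", "person_livesIn_house", "h_0004"),
    ("h_0003", "data:number", 1),
    ("h_0003", "data:streetname", "Elm Street"),
    ("h_0003", "data:floors", 2),
    ("h_0004", "data:number", 2),
    ("h_0004", "data:streetname", "Elm Street"),
    ("h_0004", "data:floors", 1),
    ("w_0001", "win_wonBy_person", "0002"),
    ("w_0001", "data:date", date(1955, 6, 1)),
    ("w_0002", "win_wonBy_person", "0001"),
    ("w_0002", "data:date", date(1960, 1, 15)),
}

def value(subject, predicate):
    """Return one value of `predicate` on `subject`, or None if there is none."""
    return next((o for s, p, o in triples if s == subject and p == predicate), None)

cutoff = date(1957, 8, 19)
results = set()
for win, predicate, person in triples:
    # Keep only wins dated before the cutoff (every sample win has a date).
    if predicate != "win_wonBy_person" or value(win, "data:date") >= cutoff:
        continue
    house = value(person, "person_livesIn_house")
    if value(house, "data:floors") == 1:
        address = f'{value(house, "data:number")} {value(house, "data:streetname")}'
        results.add((value(person, "data:name"), address))
print(results)
```

Adding the 'lottery win' individuals required no change to the existing people or houses.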

Not only can we add new classes and individuals to our own ontology, but we can import others' data as well. Were we to import, say, the yellow pages, we would suddenly have a huge database of names, addresses, and phone numbers; all we would have to do is ensure (in our ontology) that the imported data's relational properties are understood by ours. For example, if their ontology's relation between a 'human' individual and a 'residence' is 'humanLivesAt', we would have to declare that their 'human' is the same as our 'person', that their 'residence' is the same as our 'house' class, and that their 'humanLivesAt' property is the same as our 'person_livesIn_house' property; then, should we import their data, we can query both datasets as one using our own ontology's language. Another method, although one that makes our queries a bit more complex, is to access another dataset remotely (every ontology has a URL (URI) exactly for this) by querying both their database (with their terminology) and ours.
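The 'their term is the same as our term' declarations can be pictured as a simple lookup table applied while importing. In real ontologies this is usually stated with terms such as owl:equivalentClass and owl:equivalentProperty and resolved by a reasoner; the sketch below only imitates the effect, and the imported identifiers ("yp_17", "yp_res_9") are made up.

```python
# Our own data, in our own vocabulary (hypothetical identifiers).
ours = {
    ("0001", "rdf:type", "person"),
    ("h_0003", "rdf:type", "house"),
    ("0001", "person_livesIn_house", "h_0003"),
}

# Imported data, expressed in someone else's vocabulary.
theirs = {
    ("yp_17", "rdf:type", "human"),
    ("yp_res_9", "rdf:type", "residence"),
    ("yp_17", "humanLivesAt", "yp_res_9"),
}

# Our declarations: which of their terms mean the same as which of ours.
same_as = {
    "human": "person",
    "residence": "house",
    "humanLivesAt": "person_livesIn_house",
}

def translate(triple):
    """Rewrite any term we have a declaration for; leave everything else as-is."""
    return tuple(same_as.get(term, term) for term in triple)

merged = ours | {translate(t) for t in theirs}

# One query, in our own terminology, now covers both datasets.
residents = sorted(s for s, p, o in merged if p == "person_livesIn_house")
print(residents)
```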

The next step, to avoid overlaps or errors, to improve query efficiency, and even give the query engine the ability to reason, is structuring our classes and applying rules to them, but we've gone quite far enough for one day.


Saturday, 1 May 2021

Why it took me Three Years to 'Get' RDF

I was introduced to RDF around five years ago by Paul Rouet, the former digital technologies director of the APUR (Atelier Parisien d'URbanisme), while in a meeting about the creation of what is now the "Paris Time Machine" HumaNum project. RDF is a data-modelling technology that had, much to my amazement, been around since the late 1990s, and Paul had proposed using it as a base for all the historical data we planned to accumulate, but in the end this point was left unaddressed, as no-one in the meeting (including me) knew the first thing about it.

We're all used to relational databases: lines, most usually made distinct from others with unique IDs, divided up into 'columns' of data-cells: every line in a relational-database table is an 'individual' set of data, a cell therein is a 'type' of data, and the table itself is a sort of 'context'. This is all well and fine, but should we want to add a new data-type for each individual, we would have to add an extra column to our table, or create another table entirely. When our data-modelling is over, and it comes time to actually query our data, if the database setup is not organised, or the information we are looking for is deep-rooted, things can get messy pretty quickly because of all the 'JOIN's required. Yet we learn to adapt to these limitations, and I did, to the point where I even became 'fluent' in them (the limitations didn't seem such anymore, and became 'part of the process' in my mind). This is probably why it was so hard for me to 'let go' of these methods, and why most of my past five years toying with RDF were spent trying to apply relational-database-'think' to it... which was exactly why I wasn't 'getting' it. I ended up leaving it to the side around a year ago.

What made me return to it was (another) side-project, researching a 13th-century Burgundy fort: over time I had accumulated a huge amount of historical, political and genealogical data about it, and when it came time to actually write a summary of my findings, providing citations for every claim in my write-up became a huge obstacle. Were we to create a new entry in, say, a genealogical chart, every element of that 'being' would require a citation: their name (and its spelling, and variations thereof), birthdate (and place thereof), reign (over what period), titles, marriages, children, death (and place thereof, and conflicting records), etc. Already the spelling of a certain person's name would require a table in itself (or adding extra columns for each new variation == bad practice), another table for marriages (as there was often more than one), another one for birth dates (as different sources often cite different ones), another for titles (to account for many, plus other conflicting source claims), etc. And on top of all that, we have to catalogue everything we can about each individual source any citation points to. In all, the idea of setting up a relational database that could deal with all that, plus writing queries for the same, seemed pretty daunting.

So I returned to RDF once again as a possible solution for this problem. There is a lot 'out there' on the web about it, but since RDF is a technical solution made by technicians, and it is largely unused by the public, I could find very little out there that was understandable by anyone not already having knowledge of it.

Already the most-often-found terminology used to describe it is an obstacle: the first thing we will most likely read about RDF is its "subject-predicate-object" structure of data, but this (especially to one with a long relational-database experience behind them) is misleading, as we might (as I did) try to project a 'line-column-data' structure onto it, which misses the point of RDF entirely. In fact, and this is a point most often left out of RDF explanations, the 'subject' and 'object' are largely interchangeable (the one exception being that a plain data value, a 'literal', can only sit in the 'object' position), and are in fact but 'individual instances', or 'bits of data'. The only thing that makes RDF what it is is the structure, or relations, between this data. So if there's one thing to retain in understanding RDF, it's 'predicates (relations) are everything'.

Any 'individual data' ('subject' as per common-tutorial parlance, but henceforth 'individual') can have an unlimited number of any other 'individual data' ('objects' ad idem) attached to it, and the 'type' of that data is dictated by the relation (predicate) linking the two together. If I were to apply this to my case, a given person (individual) could have an unlimited number of spelling variants (also 'individual data'), without the constraint of extra tables or columns, and one type of relation (say: 'hasName') linking them. One thing to retain here: thus far, as far as the database is concerned, all of the 'individual data' is of the same 'type' (only that they be identifiable as separate 'individuals' is important at this stage).
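A minimal plain-Python sketch of that idea (not an RDF library): one individual, one relation type, and as many attached values as we care to add. The identifier "pers_0001" is a placeholder, and the two spellings are the ones used elsewhere in this post.

```python
from collections import defaultdict

# (individual, relation) -> set of everything attached through that relation
graph = defaultdict(set)

def link(individual, relation, other):
    """Attach `other` to `individual` through the named relation."""
    graph[(individual, relation)].add(other)

# One person, any number of spelling variants, no extra tables or columns.
link("pers_0001", "hasName", "Saint-Verain")
link("pers_0001", "hasName", "Saint Verain")

variants = graph[("pers_0001", "hasName")]
print(sorted(variants))
```

A third or thirtieth variant is one more call to link(), with no change to any structure.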

(nota: will add diagram at a later date)

Yet if we were to consider the actual 'data-organisation' angle of this setup: if every bit of data is an 'individual', each linked to one another through 'relation-types' (predicates), we would quite quickly have a) a basket of 'individuals' and b) a basket of relations (relation-types, or predicates). Telling relations apart from one another is fairly easy if we 'name' them right (example: hasName, in the form: 'individual data' => hasName => 'individual data'), but how to differentiate the 'individuals' from one another? In my first primitive attempts with RDF, I had tried naming every individual as I would a relational-database column -and- row name (e.g.: (individual) "HuguesStVerain" => hasName => (individual) "Saint-Verain"). With this method, after creating just three individuals with multiple spelling variants for each, my database was already a mess. And when we consider that I was giving the bit of data that was an individual (human, in this case) the 'name' (unique ID) it was supposed to -point to-, this all seems (now) quite stupid. In fact, and this was perhaps the hardest part of RDF to grasp, as far as RDF is concerned, what the 'individual' is 'named' in our database doesn't matter: the only element of importance in RDF reasoning is that one individual bit of data (no matter what 'type' it is) does not have the same 'ID' as any other (and an individual could be an identifier with no data at all (but the identifier itself)). But if individual data-bits were 'targeted' only by unique IDs (say: 0001, 0002, etc.), this would make database-management and queries a nightmare.

And it's probably for this very reason that 'classes' were invented, and it's only here that an explanation about classes should come in to any tutorial on the subject.

So it's possible to categorise 'individuals' into 'classes' to better organise things. If we take the above example, the 'individual' that in reality is a 'person' could be labelled as a 'person' class, and the various 'individuals' that are the various spellings of a name could be labelled with a 'name' class. RDF achieves this with an 'rdf:type' label (that we would see in the .xml code), and nowadays we use the 'rdfs' vocabulary (itself a labelling system written in RDF as an 'extension' to RDF) to describe the classes themselves. Without getting into too much detail here, RDFS not only makes it possible to classify individual data, but to label it and add more 'relation-types', or 'relation-rules', between them. But let's stay with its ability to 'class'-ify individual data for now. So with our ability to classify data, we can more easily manage our data-model: a 'person' individual can have multiple 'name' individuals, each pointing (or not) to, say, 'source' individuals. This looks like it's exactly what our 'genealogy' situation requires.
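As a rough plain-Python illustration of the person / name / source arrangement just described: the relation names "citedIn" and "data:value", and every identifier, are hypothetical, and only one of the two names is given a source to show that the link is optional.

```python
# Hypothetical individuals, each classified with an 'rdf:type' statement.
triples = {
    ("pers_0001", "rdf:type", "Person"),
    ("nm_0005", "rdf:type", "Name"),
    ("nm_0005", "data:value", "Saint Verain"),
    ("nm_0006", "rdf:type", "Name"),
    ("nm_0006", "data:value", "Saint-Verain"),
    ("src_0001", "rdf:type", "Source"),
    ("pers_0001", "hasName", "nm_0005"),
    ("pers_0001", "hasName", "nm_0006"),
    ("nm_0005", "citedIn", "src_0001"),
}

def objects(subject, predicate):
    """Everything that `subject` points at through `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def names_with_sources(person):
    """Map each spelling of a person's name to the sources that cite it."""
    result = {}
    for name_id in objects(person, "hasName"):
        for spelling in objects(name_id, "data:value"):
            result[spelling] = sorted(objects(name_id, "citedIn"))
    return result

print(names_with_sources("pers_0001"))
```

Because each spelling is its own individual, it can carry its own citations without any extra table.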

(nota: a diagram would be helpful here, as well)

But here I had to take a step back and ask myself: If the 'spelling' of the identifier of any bit of individual data 'doesn't matter' (and I must add at this point that the same holds true for class identifiers), and only 'which (class of) data is related to which' is important, what exactly is going on here? In this simple model, "individual identifier "huguesStVerain" ("Person") => predicate identifier "hasName" => individual identifier "HuSaintVerain" ("name") data:"Saint Verain"" works, but "individual identifier "pers_0001" ("c_001") => predicate identifier "rel_0001" => individual identifier "nm_0005" ("c_002") data:"Saint Verain"" works as well!
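That both versions 'work' can be checked mechanically: rename every identifier consistently and the same question gives the same answer. The plain-Python sketch below uses the identifiers from the paragraph above; the relation name "data" and the use of "rdf:type" for class membership are assumptions about how the model would be written down.

```python
# The 'readable' version of the model.
readable = {
    ("huguesStVerain", "rdf:type", "Person"),
    ("huguesStVerain", "hasName", "HuSaintVerain"),
    ("HuSaintVerain", "rdf:type", "name"),
    ("HuSaintVerain", "data", "Saint Verain"),
}

# A consistent renaming of every identifier into opaque codes.
renaming = {
    "huguesStVerain": "pers_0001",
    "Person": "c_001",
    "hasName": "rel_0001",
    "HuSaintVerain": "nm_0005",
    "name": "c_002",
}
opaque = {tuple(renaming.get(term, term) for term in triple) for triple in readable}

def spellings(triples, person, has_name):
    """Follow person -> name individual -> literal spelling."""
    return {
        literal
        for s, p, name_id in triples if s == person and p == has_name
        for s2, p2, literal in triples if s2 == name_id and p2 == "data"
    }

print(spellings(readable, "huguesStVerain", "hasName"))
print(spellings(opaque, "pers_0001", "rel_0001"))
```

Only the structure of the relations carries the information; the labels are a convenience for us.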

If we were to apply this to other situations: if any given person has a certain number of other people in their entourage, we could 'call' any of these people anything at all, and that would change nothing in the relations between them. We could describe 'a rock sitting on a table' in any language or terminology we want, and that would change nothing in the fact that the... rock is sitting on the table.

Conclusion: RDF was not designed as a 'data manager', but designed to represent things exactly as they are in reality.

That's when the flash of understanding came: if we were to dig down using that model, we'd see that people are not only related with people, but people are related with animals (pets) as well, and cells are related with cells, and viruses are related with cells, and molecules are related with molecules, atoms are related with atoms, all the way down to gravity's relation with energy.

That seems to be going a bit far, but that should be taken to mind whenever constructing an RDF schema, and this seems to be exactly what most have not done when designing theirs. Everywhere I see RDF ontologies (that's the word for an RDF database) structured on 'what we call things' or 'how we categorise things' in almost complete ignorance of reality itself: it's we who apply our names, categorisations and classifications to reality, and when in our data models we try to make reality 'fit' those concepts (instead of the other way around), the result in RDF will always be a schema that not only won't work, but won't fit with anyone else's. But the 'shared knowledge' principle was the very reason for RDF's invention.

So I had to rethink my model yet again. This time I was careful to separate our concepts ('names', 'classification', etc.) from reality itself (an 'entity' class that has 'object' (itself with 'construction', 'implantation', 'machine', 'composition' (as in 'molecular' or 'atomic') and 'organism') and 'phenomenon' subclasses)... and to that, the element of 'time'. I may have to revisit this again. But in any case, we have a model that separates matter, concepts and time.

And therein I could describe relations between entities ('organisms') and concepts ('name') without affecting anything in the rest of the model: what's more, therein I gained the ability to import other databases into my model. For example, when into the 'concept' => 'classification' branch I imported a speciation ontology, I was suddenly able to classify my 'organisms' (as 'homo sapiens', subclass of 'homo', subclass of... and so on and so on) without changing anything in my own model (other than to add a link between any given 'organism' and its 'species' class). I could also do the same with geographical data and link my model's 'placename' to a specific geographical location (and elevation!), and when one adds the element of time for, say, an 'event' (that is a subclass of 'occurrence', itself a subclass of 'time'), we get even more information. And if I were to remove all the additional data sources, then, apart from the links to them being broken (if the data sources had been imported and referred to internally), my model would still work.
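The 'subclass of... and so on' chain can be sketched in plain Python as a lookup that is followed until it runs out. The 'hominidae' level and the identifier "pers_0001" are added purely for illustration; in RDFS the relation would be rdfs:subClassOf.

```python
# Imported classification: each class points to its parent class.
sub_class_of = {
    "homo sapiens": "homo",
    "homo": "hominidae",
}

# The only addition to our own model: a link from an organism to its species class.
species_of = {"pers_0001": "homo sapiens"}

def classes_of(individual):
    """Walk up the subclass chain, starting from the individual's species."""
    chain = []
    current = species_of.get(individual)
    while current is not None:
        chain.append(current)
        current = sub_class_of.get(current)
    return chain

print(classes_of("pers_0001"))
```

Deleting the imported classification would leave "pers_0001" intact, with only its species link pointing at nothing.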

But that last point is super-important in the RDF scheme of things: if data sources remain constant, and our references to them (in our own RDF models) are not internal copies but links to their remote data, then, save for a loss of internet access, our model would never break.

But to return from my digression (though one hopefully useful to this presentation), and to conclude, the prime element separating me from an understanding of how RDF works seems to have been... my misunderstanding, or misuse, if you will, of the methods we have at our disposal to perceive, interpret and communicate reality.