Saturday 1 May 2021

Why it took me Three Years to 'Get' RDF

I was introduced to RDF around five years ago by Paul Rouet, the former digital technologies director of the APUR (Atelier Parisien d'URbanisme), while in a meeting about the creation of what is now the "Paris Time Machine" HumaNum project. RDF is a data-modelling technology that had, much to my amazement, been around since the late 1990s, and Paul had proposed using it as a base for all the historical data we planned to accumulate. In the end this point was left unaddressed, as no-one in the meeting (including me) knew the first thing about it.

We're all used to relational databases: lines, usually made distinct from one another by unique IDs, divided up into 'columns' of data-cells. Every line in a relational-database table is an 'individual' set of data, a cell therein is a 'type' of data, and the table itself is a sort of 'context'. This is all well and fine, but should we want to add a new data-type for each individual, we would have to add an extra column to our table, or create another table entirely. When our data-modelling is over, and it comes time to actually query our data, if the database setup is not well organised, or the information we are looking for is deeply nested, things can get messy pretty quickly because of all the 'JOIN's required. Yet we learn to adapt to these limitations, and I did to the point where I even became 'fluent' in them (the limitations no longer seemed such, and became 'part of the process' in my mind). This is probably why it was so hard for me to 'let go' of these methods, and why most of my past five years toying with RDF was spent trying to apply relational-database-'think' to it... which was exactly why I wasn't 'getting' it. I ended up leaving it to the side around a year ago.

What made me return to it was (another) side-project: researching a 13th-century Burgundy fort. Over time I had accumulated a huge amount of historical, political and genealogical data about it, and when it came time to actually write a summary of my findings, providing citations for every claim in my write-up became a huge obstacle. Were we to create a new entry in, say, a genealogical chart, every element of that 'being' would require a citation: their name (and its spelling, and variations thereof), birthdate (and place thereof), reign (over what period), titles, marriages, children, death (and place thereof, and conflicting records), etc. The spelling of a certain person's name alone would require a table in itself (or adding extra columns for each new variation == bad practice), another table for marriages (as there was often more than one), another one for birth dates (as different sources often cite different ones), another for titles (to account for many, plus other conflicting source claims), etc. And on top of all that, we have to catalogue everything we can about each individual source any citation points to. In all, the idea of setting up a relational database that could deal with all that, plus writing queries for the same, seemed pretty daunting.

So I returned to RDF once again as a possible solution for this problem. There is a lot 'out there' on the web about it, but since RDF is a technical solution made by technicians, and it is largely unused by the public, I could find very little out there that was understandable by anyone not already having knowledge of it.

Already the most-often-found terminology used to describe it is an obstacle: the first thing we will most likely read about RDF is its "subject-predicate-object" structure of data, but this (especially to one with a long relational-database experience behind them) is misleading, as we might (as I did) try to project a 'line-column-data' structure onto it, which misses the point of RDF entirely. In fact, and this is a point most often left out of RDF explanations, the 'subject' and 'object' are perfectly interchangeable (the subject of one statement can be the object of another), and are in fact but 'individual instances', or 'bits of data'. The only thing that makes RDF what it is is the structure, or relations, between this data. So if there's one thing to retain in understanding RDF, it's 'predicates (relations) are everything'.

Any 'individual data' ('subject' as per common-tutorial parlance, but henceforth 'individual') can have an unlimited number of other 'individual data' ('objects' ad idem) attached to it, and the 'type' of that data is dictated by the relation (predicate) linking the two together. If I were to apply this to my case, a given person (individual) could have an unlimited number of spelling variants (also 'individual data'), without the constraint of extra tables or columns, and one type of relation (say: 'hasName') linking them. One thing to retain here: thus far, as far as the database is concerned, all of the 'individual data' is of the same 'type' (only that they be identifiable as separate 'individuals' is important at this stage).
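To make this concrete, here is a minimal sketch in plain Python (no RDF library) that stores each statement as an (individual, relation, individual) tuple. The identifiers 'person_1', 'name_1' and so on, the 'hasSpelling' relation and the third spelling variant are all invented for the illustration; only 'hasName' and the two "Saint Verain" spellings come from this post.

```python
# A toy triple store: each statement is an (individual, relation, individual) tuple.
# All identifiers are hypothetical; 'hasSpelling' is an assumed relation name.
triples = {
    ("person_1", "hasName", "name_1"),
    ("person_1", "hasName", "name_2"),
    ("name_1", "hasSpelling", "Saint-Verain"),
    ("name_2", "hasSpelling", "Saint Verain"),
}


def objects(store, subject, predicate):
    """Return every individual linked to `subject` through `predicate`."""
    return {o for (s, p, o) in store if s == subject and p == predicate}


def spellings(store, person):
    """Collect the spelling attached to each name individual of `person`."""
    return sorted(
        spelling
        for name in objects(store, person, "hasName")
        for spelling in objects(store, name, "hasSpelling")
    )


# Adding a third variant (an invented one) is two more statements,
# not a new column or a new table.
triples.add(("person_1", "hasName", "name_3"))
triples.add(("name_3", "hasSpelling", "St Verain"))

print(spellings(triples, "person_1"))
```

Note that nothing in the store says what 'person_1' or 'name_1' is: at this stage they are only distinguishable individuals, and all the meaning sits in the relations.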

(nota: will add diagram at a later date)

Yet if we were to consider the actual 'data-organisation' angle of this setup: if every bit of data is an 'individual', each linked to the others through 'relation-types' (predicates), we would quite quickly have a) a basket of 'individuals' and b) a basket of relations (relation-types, or predicates). Telling relations apart from one another is fairly easy if we 'name' them right (example: hasName, in the form 'individual data' => hasName => 'individual data'), but how to differentiate the 'individuals' from one another? In my first primitive attempts with RDF, I had tried naming every individual as I would a relational-database column -and- row name (e.g.: (individual) "HuguesStVerain" => hasName => (individual) "Saint-Verain"). After creating just three individuals this way, with multiple spelling variants for each, my database was already a mess. And when we consider that I was giving the bit of data that was an individual (human, in this case) the 'name' (unique ID) it was supposed to -point to-, this all seems (now) quite stupid. In fact, and this was perhaps the hardest part of RDF to grasp, as far as RDF is concerned, what the 'individual' is 'named' in our database doesn't matter: the only thing of importance in RDF reasoning is that one individual bit of data (no matter what 'type' it is) not have the same 'ID' as any other (and an individual could be an identifier with no data at all but the identifier itself). But if individual data-bits were 'targeted' only by opaque unique IDs (say: 0001, 0002, etc.), this would make database-management and queries a nightmare.

And it's probably for this very reason that 'classes' were invented, and it's only here that an explanation of classes should come into any tutorial on the subject.

So it's possible to categorise 'individuals' into 'classes' to better organise things. If we take the above example, the 'individual' that in reality is a 'person' could be labelled as belonging to a 'person' class, and the various 'individuals' that are the various spellings of a name could be labelled with a 'name' class. RDF achieves this with an 'rdf:type' property (that we would see in the .xml code), and the classes themselves are nowadays usually described with 'rdfs', or RDF Schema (itself a vocabulary written in RDF as an 'extension' to RDF). Without getting into too much detail here, RDFS not only makes it possible to classify individual data, but also to label it and to add more 'relation-types', or 'relation-rules', between classes. But let's stay with its ability to 'class'-ify individual data for now. So with our ability to classify data, we can more easily manage our data-model: a 'person' individual can have multiple 'name' individuals, each pointing (or not) to, say, 'source' individuals. This looks like it's exactly what our 'genealogy' situation requires.
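As a rough sketch of what classes buy us, the plain-Python example below adds class-membership statements to a toy set of triples and then asks which 'name' individuals have no 'source' attached. 'rdf:type' is the actual RDF property for class membership; every other identifier ('ind_1', 'hasSpelling', 'citedIn', and so on) is an assumption made up for this illustration.

```python
# Statements as (subject, predicate, object) tuples. "rdf:type" is the real RDF
# property for class membership; all other identifiers are hypothetical.
triples = {
    ("ind_1", "rdf:type", "Person"),
    ("ind_2", "rdf:type", "Name"),
    ("ind_3", "rdf:type", "Name"),
    ("ind_4", "rdf:type", "Source"),
    ("ind_1", "hasName", "ind_2"),
    ("ind_1", "hasName", "ind_3"),
    ("ind_2", "hasSpelling", "Saint-Verain"),
    ("ind_3", "hasSpelling", "Saint Verain"),
    ("ind_2", "citedIn", "ind_4"),  # ind_3 points to no source (yet)
}


def instances_of(store, cls):
    """Return every individual declared as a member of class `cls`."""
    return {s for (s, p, o) in store if p == "rdf:type" and o == cls}


def unsourced_names(store):
    """Return the 'Name' individuals that point to no 'Source' individual."""
    sources = instances_of(store, "Source")
    cited = {s for (s, p, o) in store if p == "citedIn" and o in sources}
    return instances_of(store, "Name") - cited


print(sorted(instances_of(triples, "Name")))
print(sorted(unsourced_names(triples)))
```

The individuals keep their meaningless identifiers; it is the class statements that let us ask for 'all names' or 'all names still lacking a citation' without knowing any ID in advance.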

(nota: a diagram would be helpful here, as well)

But here I had to take a step back and ask myself: if the 'spelling' of the identifier of any bit of individual data 'doesn't matter' (and I must add at this point that the same holds true for class identifiers), and only 'which (class of) data is related to which' is important, what exactly is going on here? In this simple model, [individual identifier 'huguesStVerain' (class 'Person')] => [predicate identifier 'hasName'] => [individual identifier 'HuSaintVerain' (class 'name'), data: "Saint Verain"] works, but [individual identifier 'pers_0001' (class 'c_001')] => [predicate identifier 'rel_0001'] => [individual identifier 'nm_0005' (class 'c_002'), data: "Saint Verain"] works as well!
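One way to check this claim is to write the model twice, once with each identifier scheme from the paragraph above, and verify that a one-to-one renaming turns one into the other. This is only a sketch in plain Python: the 'rdf:type' and 'hasData' statements are assumptions added to make the class and the data value explicit, and they are kept identical in both versions.

```python
# The same model written with the two identifier schemes from the text.
# 'rdf:type' and 'hasData' are assumptions added to spell out class and data.
readable = {
    ("huguesStVerain", "rdf:type", "Person"),
    ("HuSaintVerain", "rdf:type", "name"),
    ("huguesStVerain", "hasName", "HuSaintVerain"),
    ("HuSaintVerain", "hasData", "Saint Verain"),
}
opaque = {
    ("pers_0001", "rdf:type", "c_001"),
    ("nm_0005", "rdf:type", "c_002"),
    ("pers_0001", "rel_0001", "nm_0005"),
    ("nm_0005", "hasData", "Saint Verain"),
}

# A one-to-one mapping from the readable identifiers to the opaque ones.
renaming = {
    "huguesStVerain": "pers_0001",
    "Person": "c_001",
    "hasName": "rel_0001",
    "HuSaintVerain": "nm_0005",
    "name": "c_002",
}


def rename(store, mapping):
    """Apply an identifier mapping to every term; unmapped terms are kept."""
    return {tuple(mapping.get(term, term) for term in triple) for triple in store}


# True: the two versions differ only in what things are called.
print(rename(readable, renaming) == opaque)
```

The data value "Saint Verain" is the only thing that has to stay the same; every identifier around it can be swapped freely as long as distinct things keep distinct IDs.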

If we were to apply this to other situations: if any given person has a certain number of other people in their entourage, we could 'call' any of these people anything at all, and that would change nothing in the relations between them. We could describe 'a rock sitting on a table' in any language or terminology we want, and that would change nothing in the fact that the... rock is sitting on the table.

Conclusion: RDF was not designed as a 'data manager', but designed to represent things exactly as they are in reality.

That's when the flash of understanding came: if we were to dig down using that model, we'd see that people are not only related with people, but people are related with animals (pets) as well, and cells are related with cells, and viruses are related with cells, and molecules are related with molecules, atoms are related with atoms, all the way down to gravity's relation with energy.

That seems to be going a bit far, but it should be kept in mind whenever constructing an RDF schema, and this seems to be exactly what most have not done when designing theirs. Everywhere I see RDF ontologies (that's the word for an RDF schema, or vocabulary) structured on 'what we call things' or 'how we categorise things' in almost complete ignorance of reality itself: it's we who apply our names, categorisations and classifications to reality, and when in our data models we try to make reality 'fit' those concepts (instead of the other way around), the result in RDF will always be a schema that not only won't work, but won't fit with anyone else's. Yet the 'shared knowledge' principle was the very reason for RDF's invention.

So I had to rethink my model yet again. This time I was careful to separate our concepts ('names', 'classification', etc.) from reality itself: an 'entity' class with two subclasses, 'object' (itself with 'construction', 'implantation', 'machine', 'composition' (as in 'molecular' or 'atomic') and 'organism' subclasses) and 'phenomenon'... and added to that, the element of 'time'. I may have to revisit this again. But in any case, we have a model that separates matter, concepts and time.
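Read as a class hierarchy, the model described above could be sketched as a simple child-to-parent mapping. This is one interpretation of the paragraph, written in plain Python for illustration, and not the actual schema file; the lower-case class names are taken from the text, and the helper name 'root_of' is invented.

```python
# Child -> parent links, as read off the description above (an interpretation,
# not the actual schema). Top-level classes: 'entity', 'concept', 'time'.
SUBCLASS_OF = {
    "object": "entity",
    "phenomenon": "entity",
    "construction": "object",
    "implantation": "object",
    "machine": "object",
    "composition": "object",
    "organism": "object",
    "name": "concept",
    "classification": "concept",
}


def root_of(cls):
    """Walk up the hierarchy and return the top-level class `cls` falls under."""
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
    return cls


# An organism is matter, a name is one of our concepts, and time stands apart.
for cls in ("organism", "name", "time"):
    print(cls, "->", root_of(cls))
```

The point of the separation shows up in the roots: whatever we decide to call or classify an 'organism', that decision lives under 'concept' and never alters the 'entity' branch.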

And therein I could describe relations between entities ('organisms') and concepts ('name') without affecting anything in the rest of the model: what's more, I gained the ability to import other databases into my model. For example, when I imported a species-classification ontology into the 'concept' => 'classification' branch, I was suddenly able to classify my 'organisms' (as 'homo sapiens', subclass of 'homo', subclass of... and so on and so on) without changing anything in my own model (other than to add a link between any given 'organism' and its 'species' class). I could also do the same with geographical data and link my model's 'placename' to a specific geographical location (and elevation!), and when one adds the element of time for, say, an 'event' (that is a subclass of 'occurrence', itself a subclass of 'time'), we get even more information. And if I were to remove all the additional data sources, my model would still work, apart from the broken links (if the data sources had been imported or referred to internally).
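The import-then-remove behaviour described above can be sketched with two sets of triples that are merged and then separated again. This is a toy illustration in plain Python: 'rdfs:subClassOf' is the real RDFS property, while 'org_1', 'hasSpecies' and the two-statement 'imported' set are assumptions standing in for a real classification, which would be far larger.

```python
# Our own statements; identifiers and 'hasSpecies' are hypothetical.
own_model = {
    ("org_1", "rdf:type", "organism"),
    ("org_1", "hasName", "name_1"),
    ("org_1", "hasSpecies", "HomoSapiens"),  # the one link added to our model
}

# A tiny stand-in for an imported species classification.
imported = {
    ("HomoSapiens", "rdfs:subClassOf", "Homo"),
    ("Homo", "rdfs:subClassOf", "Hominidae"),
}


def ancestors(store, cls):
    """Follow rdfs:subClassOf links upward from `cls`, nearest class first."""
    chain = []
    seen = {cls}
    while True:
        parents = sorted(
            o for (s, p, o) in store
            if s == cls and p == "rdfs:subClassOf" and o not in seen
        )
        if not parents:
            return chain
        cls = parents[0]  # sketch: follow a single parent per class
        seen.add(cls)
        chain.append(cls)


def names_of(store, individual):
    """Return the name individuals attached to `individual`."""
    return {o for (s, p, o) in store if s == individual and p == "hasName"}


merged = own_model | imported
print(ancestors(merged, "HomoSapiens"))     # the chain is there after the import
print(ancestors(own_model, "HomoSapiens"))  # without it: a link that leads nowhere
print(names_of(own_model, "org_1"))         # our own relations are untouched
```

Removing the imported statements leaves 'hasSpecies' pointing at a class we know nothing more about, but every query on our own individuals and relations still answers as before.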

That last point is super-important in the RDF scheme of things: if data sources remain constant, and our references to them (in our own RDF models) are not internal copies but links to their remote data, then, save for the case of no internet access, our model would never break.

But to return from my digression (though one hopefully useful to this presentation), and to conclude, the prime element separating me from an understanding of how RDF works seems to have been... my misunderstanding, or misuse, if you will, of the methods we have at our disposal to perceive, interpret and communicate reality.

