Monday 29 April 2019

ANN (Artificial Neural Network) OCR: A likely dead-end method. Considering a new approach.

In my recent dives into AI-dependent endeavours, I've been presented with the gargantuan task of extracting data from countless pages of printed, and often ancient, text, and in every one I've run up against the same obstacle: the limitations of Artificial Neural Network (henceforth 'ANN') dependent OCR.

For starters, ANN is but an exercise in comparison: it contains none of the logic or other processes that the human brain uses to differentiate text from background, identify text as text, or identify character forms (why is an 'a' not a 'd', and what characteristics does each have?). Instead, it 'remembers' through a library of labelled 'samples' (images of 'a's named 'a') and 'recognises' by detecting those patterns in any given input image... and in many OCR applications, the analysis stops there. What's more, ANN is a 'black box': we know what's in the sample library, and we know what the output is, but we don't know at all what the computer 'sees' and retains as a 'positive' match. Of course it would be possible to capture this (save the output of every network step), but I don't think that would remedy the shortcomings just mentioned.
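To illustrate what I mean by 'an exercise in comparison', here's a rough Python sketch using scikit-learn's bundled digit images as a stand-in for a labelled character library. The point is simply that labelled samples go in and labels come out, with nothing humanly inspectable in between:

    # A minimal sketch of 'recognition by comparison': a small neural network
    # trained on labelled glyph images (scikit-learn's 8x8 digit samples,
    # standing in for a library of labelled character images).
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    digits = load_digits()                      # images already flattened to 64 values
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=0)

    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    net.fit(X_train, y_train)                   # 'remembers' the labelled samples

    print(net.predict(X_test[:5]))              # labels out...
    # ...but nothing in between: the learned weights (net.coefs_) are just
    # numbers, not an inspectable description of what makes a '3' a '3'.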

The present logic-less method may also be subject to over-training: the larger the sample library, especially considering all the forms (serif, sans serif, italic, scripted, etc.) a letter may take, the greater the chance that the computer may find 'false positives'. The only way to avoid this is further training, and/or training specific to each document, a procedure which would limit the required library (character styles) and thus reduce error. But this, and any further adaptation, requires human intervention, and still we have no means of intervening in or monitoring the 'recognition' process. Also absent from this system is any probability determination (it is present, but only as an 'accepted' threshold programmed into the application itself), which would prove useful in further character and word analysis.
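A toy sketch of that threshold problem (every probability value below is invented for illustration): the usual front-end keeps the best guess only if it clears a hard-coded threshold, and everything else is simply discarded:

    # A hypothetical per-character output distribution from a recogniser.
    import numpy as np

    alphabet = list("abcdefghijklmnopqrstuvwxyz")
    dist = np.full(26, 0.01)
    dist[alphabet.index('c')] = 0.40      # the recogniser's best guess
    dist[alphabet.index('e')] = 0.35      # ...but 'e' is nearly as likely
    dist /= dist.sum()

    THRESHOLD = 0.9                        # what a typical OCR front-end might do
    best = alphabet[int(np.argmax(dist))]
    accepted = best if dist.max() >= THRESHOLD else None
    print(accepted)                        # None: the guess is simply thrown away

    # Keeping the whole distribution instead leaves the 'c' vs 'e' ambiguity
    # available for later word-level analysis (sketched further below).
    print(sorted(zip(alphabet, np.round(dist, 2)), key=lambda p: -p[1])[:3])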

And all of the above applies to plain text on an uncluttered background: what of text on maps, partly-tree-covered billboards, art, and multi-coloured (and overlapping) layouts? The human brain 'extracts' character data quite well in these conditions; there, too, deduction/induction processes absent from ANN are at work.

Considering human 'text recognition' as a model.

Like many other functions of the human brain, text recognition seems to operate as an independent 'module' that contributes its output to the overall thought/analysis process of any given situation. It requires creation, then training, though: a dog, for example, might recognise text (or any other human-made entity) as 'not natural', but the analysis ends there, as it has never learned that certain forms have meaning (beyond 'function'), and so may ignore them completely; a human presented with text in a language they were not trained in may recognise the characters as 'text' (and there are other ANN-absent rules of logic at work here), but that's about it.

What constitutes a 'recognised character'? Every alphabet has a logic to it: a 'b', for example, in almost every circumstance, is a round-ish shape to the lower right of a vertical-ish line; stray from this, and the human brain won't recognise it as a 'b' anymore. Using those exact same forms, we can create 'p', 'a', 'd' and 'q' as well... the only things differentiating them are position and size. In fact, all told, the Roman alphabet consists of fewer than a dozen 'logic shapes'.
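As a rough illustration (the decomposition below is my own shorthand for the argument, not any formal typographic model), four of those letters can be described as the same two primitives, told apart only by where one sits relative to the other:

    # b, d, p and q as a bowl plus a stem, differing only in the side the bowl
    # sits on and whether the stem rises or drops.
    LETTER_LOGIC = {
        'b': ('bowl-right', 'stem-ascending'),
        'd': ('bowl-left',  'stem-ascending'),
        'p': ('bowl-right', 'stem-descending'),
        'q': ('bowl-left',  'stem-descending'),
    }

    def identify(*observed_primitives):
        """Return the letters whose primitive description matches what was observed."""
        return [letter for letter, parts in LETTER_LOGIC.items()
                if parts == observed_primitives]

    print(identify('bowl-right', 'stem-ascending'))   # ['b']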



Not only can the human brain detect and identify these forms: it can also 'fill in the blanks' in situations like, say, a tree branch covering a billboard. The overall identification process seems to begin with an initial 'text / not text' separation and the removal of the 'not text' from the picture; the brain then seems to 'imagine' what the covered 'missing bits' would be, and this is submitted for further analysis.
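Here is a toy sketch of matching on only what remains visible, assuming the 'not text' region (the branch) has already been separated out as a mask; the glyphs are tiny made-up binary grids, and the hidden rows are, in effect, 'imagined' to be whatever each candidate letter says they should be:

    import numpy as np

    # Toy 5x3 binary glyph templates, purely for illustration.
    TEMPLATES = {
        'E': np.array([[1,1,1],
                       [1,0,0],
                       [1,1,1],
                       [1,0,0],
                       [1,1,1]]),
        'F': np.array([[1,1,1],
                       [1,0,0],
                       [1,1,1],
                       [1,0,0],
                       [1,0,0]]),
    }

    observed = TEMPLATES['E'].copy()
    mask = np.ones_like(observed, dtype=bool)
    mask[3:, :] = False                 # bottom rows hidden behind the 'branch'

    # Score each template only on the visible pixels.
    for letter, tpl in TEMPLATES.items():
        score = np.mean(tpl[mask] == observed[mask])
        print(letter, round(float(score), 2))   # E and F tie: ambiguity left for later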

The same holds true where a character is badly printed, super-stylised, missing bits, etc.: if a word is not instantly readable (and that is a highly-trained process in itself), the brain seems to 'dig down' a level to determine what the 'missing' character should be and 'matches' against that, a whole other level of analysis absent from ANN. In fact, were we able to extract the probability of a match for every character of any given word and compare this to a dictionary, we would not only create another probability (the 'refined' chance of character 'x' being 'y'), we would also have a means for the computer to... train itself.
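A rough sketch of what that word-level step might look like (the character probabilities and the word list below are invented for illustration): combine per-character probabilities with a dictionary, keep only real words, and re-rank:

    from itertools import product

    # Per-position candidates with raw recogniser probabilities for a 3-letter word.
    char_probs = [
        {'c': 0.6, 'e': 0.4},       # first glyph: ambiguous c/e
        {'a': 0.9, 'o': 0.1},
        {'t': 0.8, 'l': 0.2},
    ]
    DICTIONARY = {'cat', 'eat', 'col', 'oat'}

    candidates = []
    for letters in product(*(d.keys() for d in char_probs)):
        word = ''.join(letters)
        if word in DICTIONARY:                          # keep only real words
            score = 1.0
            for pos, letter in enumerate(letters):
                score *= char_probs[pos][letter]
            candidates.append((word, score))

    # 'Refined' probabilities: renormalise over the surviving words.
    total = sum(score for _, score in candidates)
    for word, score in sorted(candidates, key=lambda c: -c[1]):
        print(word, round(score / total, 3))
    # The corrected letters (picking 'cat' over 'eat', say) could then be fed
    # back as fresh labelled samples: the self-training step suggested above.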