It is my pleasure and privilege to write the foreword for this book, whose results I
have been following and awaiting for the last few years. This monograph represents
the outcome of an ambitious project oriented towards advancing our knowledge of
the way the human visual system processes images and of the way it combines high-level hypotheses with low-level inputs during pattern recognition. The model proposed by Sven Behnke, carefully presented in the following pages, can now be applied by other researchers to practical problems in the field of computer vision, and it also provides clues for reaching a deeper understanding of the human visual system.
This book arose out of dissatisfaction with an earlier project: back in 1996, Sven
wrote one of the handwritten digit recognizers for the mail sorting machines of
the Deutsche Post AG. The project was successful: the machines could indeed recognize the handwritten ZIP codes at a rate of several thousand letters per
hour. However, Sven was not satisfied with the amount of expert knowledge that
was needed to develop the feature extraction and classification algorithms. He wondered
whether the computer could extract meaningful features by itself, and use
these for classification. His experience in the project told him that forward computation
alone would be incapable of improving the results already obtained. From his
knowledge of the human visual system, he postulated that only a two-way system
could work, one that could advance a hypothesis by focusing the attention of the
lower layers of a neural network on it. He spent the next few years developing a new
model for tackling precisely this problem.
The main result of this book is the proposal of a generic architecture for pattern
recognition problems, called the Neural Abstraction Pyramid (NAP). The architecture
is layered, pyramidal, competitive, and recurrent. It is layered because images are
represented at multiple levels of abstraction. It is recurrent because backward projections
connect the upper to the lower layers. It is pyramidal because the resolution
of the representations is reduced from one layer to the next. It is competitive because
units in each layer compete against each other, each trying to classify the input best. The main idea behind this architecture is to let the lower layers interact with
the higher layers. The lower layers send simple features to the upper layers; the upper layers recognize more complex features and bias the computation in the lower layers. This in turn improves the input to the upper layers, which can refine their hypotheses, and so on. After a few iterations the network settles on the best interpretation. The architecture can be trained in both supervised and unsupervised modes.
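To give the reader a flavor of this iterative refinement, the following is a minimal sketch in Python. It is not Behnke's implementation: the class name, the layer sizes, the pooling used as forward projection, the random backward weights, and the update rule are all illustrative assumptions, and the within-layer competition is omitted for brevity.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class TinyAbstractionPyramid:
    """Toy two-layer recurrent pyramid (illustrative only).

    Layer 0 holds a fine-grained feature map, layer 1 a coarser,
    more abstract one. Each iteration combines a bottom-up pass
    (fine -> coarse) with a top-down pass (coarse -> fine), so that
    higher-level hypotheses can bias the lower-level computation.
    """

    def __init__(self, size=8, rng=None):
        rng = rng or np.random.default_rng(0)
        self.size = size
        # Backward projection weights (coarse -> fine); random here,
        # learned in a real system.
        self.w_down = rng.normal(0.0, 0.1, (size, size // 2))

    def step(self, image, fine, coarse):
        # Bottom-up: pool 2x2 blocks of the fine layer to form the
        # coarser, more abstract representation.
        pooled = fine.reshape(self.size // 2, 2,
                              self.size // 2, 2).mean(axis=(1, 3))
        coarse = relu(pooled)
        # Top-down: expand the coarse hypothesis back to fine resolution
        # and let it bias the interpretation of the raw input.
        feedback = relu(self.w_down @ coarse @ self.w_down.T)
        fine = relu(image + 0.5 * feedback)
        return fine, coarse

    def interpret(self, image, iterations=5):
        # Start from empty representations and iterate until the
        # interpretation settles.
        fine = np.zeros_like(image)
        coarse = np.zeros((self.size // 2, self.size // 2))
        for _ in range(iterations):
            fine, coarse = self.step(image, fine, coarse)
        return fine, coarse

# Usage: iteratively interpret a noisy 8x8 "image".
pyramid = TinyAbstractionPyramid(size=8)
image = np.random.default_rng(1).random((8, 8))
fine, coarse = pyramid.interpret(image, iterations=5)
print(coarse.round(2))

Even in this toy form, the loop exhibits the key property described above: the coarse hypothesis feeds back to bias the fine layer, which in turn sharpens the coarse hypothesis on the next pass.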