RE-ENTRANT PP: A New Development
A re-entrant scheme is proposed to give the PURR-PUSS learning system greater sequential strength at low memory cost. An example of its operation in learning Nursery Rhymes provides evidence of the feasibility of the scheme.
Introduction
The PURR-PUSS system, which will be abbreviated to PP, is a brain for a robot and a theory of human learning. PP was described in detail in Andreae(1998). It is a working system that can be implemented in a standard computer, but is designed for parallel operation in dedicated hardware.
The structure of PP originated from a desire to produce a learning system with "sequential strength", largely because of the demands of language learning. This was formulated as the problem of learning a finite state machine, and later a Turing Machine. It was also influenced strongly by Newell & Simon's (1972) production system theory of human problem solving. PP fits comfortably in the framework of Reinforcement Learning (Sutton and Barto, 1998), but is less mathematically tractable because of a novelty mechanism. Here, I will use the fourth source of the PP structure, a simplified view of the cerebral cortex of the human brain, because it points strongly to the need for re-entrant features, as argued by Gerald Edelman since 1978 (Edelman & Mountcastle, 1978; Edelman, 1992; Edelman & Tononi, 2000).
I assume that the cortex consists of a large number of (say 100) distinct areas, "cortical areas", to which bundles of "afferent" (input) neurons converge and from which other bundles of "efferent" (output) neurons leave. The efferent neurons are assumed to convey a conjunction (AND) of the activity of the afferent neurons. Cortical areas differ in the sources of their afferent neurons, which may be outside the cortex (e.g. sensors or other parts of the brain, such as reward centres) or other cortical areas. These assumptions are made specific in the PP structure by defining cortical areas, "event-types" and "events". It is assumed that each neuron carries an event of the type appropriate to its bundle.
Time advances in steps, the current step being "now". Bundles of neurons can be delayed by none, one or more steps, so the minimal definition of a cortical area (CA) looks like this with afferent neuron bundles to the left of the ">>" sign and efferent neuron bundles to the right of it:
[CA: event-type/delay, event-type/delay, ? >> event-type(s)]
As an illustrative example, imagine that a robot sings nursery rhymes described by two event-types, "note" and "beat". By "note" I mean a note of a musical scale such as middle C, D, E, .., while beat includes note-length (crotchet, quaver etc) and stress marking three-time (e.g. waltz time) or four-time (e.g. march time). Our imaginary robot might have its first cortical area, CA-1, defined by
{CA-1: note-3, beat-3, note-2, beat-2, note-1, beat-1 >> note, beat}
In words, CA-1 says that the current note and beat "can be obtained from" or "can be predicted by" the conjunction of (AND) the note and beat in the previous step, the note and beat in the step before that, and the note and beat in the step before that. The "-1" in "note-1" and "beat-1" should be interpreted as meaning one step before the current step. Similarly, "-2" and "-3" mean 2 and 3 steps before the current step. The six event-types to the left of the >> sign describe the "context" of the CA.
The nursery rhyme "Baa, Baa, Black Sheep" begins
Step |
Note & beat |
Abbreviation |
Word |
1 |
note C, crotchet, stressed |
CcrS |
Baa |
2 |
note C, crotchet, unstressed |
Ccr |
Baa |
3 |
note G, crotchet, unstressed |
Gcr |
Black |
4 |
note G, crotchet, unstressed |
Gcr |
Sheep |
5 |
note A, quaver, stressed |
AquS |
Have |
6 |
note B, quaver, unstressed |
Bqu |
You |
7 |
note C (octave), quaver, unstressed |
C'qu |
An- |
8 |
note A, quaver, unstressed |
Aqu |
-y |
Notice that each step is a new note, so the steps are not of equal duration. This is just a choice made here for simplicity. After step 4 we can store in the memory of the robot brain the first instance of a CA-1, using the abbreviations listed in the 3rd column:
[1CA-1: CcrS Ccr Gcr >> Gcr]
If the sequence "CcrS Ccr Gcr" reappears later on, then the "association" 1CA-1 will suggest that the next event is going to be "Gcr". However, if on that occasion the next event is "Fqu", 1CA-1 will acquire a second "associated event":
[1CA-1: CcrS Ccr Gcr >> Gcr, Fqu]
and from then on this instance of a CA will predict the alternatives "Gcr" and "Fqu".
Another addition to the terminology leads to more compact descriptions. An instance of the context of a CA will be called a "node". This will enable me to say the 1CA-1 node "CcrS Ccr Gcr" has 2 associated events. In the full PP system, a variety of flags, numbers and pointers are attached to nodes, but these are not needed here.
My book Associative Learning for a Robot Intelligence (Andreae, 1998) gave many examples of the power and versatility of small groups of Cortical Areas, including the mimicking of a Universal Turing Machine. One of the ideas that was played with there, but not usefully, was the extension of the notion of "event" to "node event", that is an event that is itself the context or node of a CA. (By the way, CAs were called templates in that book.). This idea was called "Contexts of contexts" but I failed to see how to make it work until February of this year (2005). Before that I was taking the context events from different CAs, which had the effect of broadening the context. This was of no advantage because it could be achieved by using a larger CA in the first place. However, if the node event in a CA is taken to be the previous instance of the same CA, we produce a CA of too much sequential strength! Indeed, if there was memory space for 10,000 nodes of the CA it would generate a sequence of that length before "forgetting" started to break up the sequence. Also, it would be useless because it couldn't use repetitions of sequences for predicting future behaviour. So far, I have used two ways of limiting the sequential strength of one of these "re-entrant" CAs. (Re-entrant because a CA contains a previous instance of itself.) The first way is by hashing the number of the CA node to a much smaller number, and the second is by zeroing the number when there is a "pause" situation. The next section reports an example of these two processes being used together.
Hashing is a random, but fixed, transform which generates a number in a specified range for each node (instance of the context) of a CA. The method I will use is the division of the node number by a prime number, taking the remainder. (In other words, expressing the node modulo the prime.) A big diagnostic advantage of this simple method of hashing is that it is easy to check results.
In the example described in the next section, two CAs are used, each with a different prime, 11 for the first and 13 for the second. The amount of memory used by several re-entrant CAs will be proportional to the sum of the primes, while the resolution (inverse probability of ambiguity) of the combined CAs will be proportional to the product of the primes. This means that there is an enormous advantage in having several CAs operating in conjunction, a discovery which promises to make the re-entrant CAs, and therefore re-entrant PP, significantly more powerful than ordinary CAs and PP.
Nursery Rhymes Example
It is a remarkable fact that young children can learn a large number of nursery rhymes without confusing them, even though the tunes have a considerable amount of similarity. Consider the 71 tunes in a book called My First Sing-A-Song Book (Lloyd, 1966) as a test for a robot brain capable of learning sequences. To avoid irrelevant complexity (and to avoid giving PP extra clues!), each of the tunes has been transposed to the key of C and the notes have been coded for pitch, stress and duration. In each tune, the length of the main beat is 8, so in 3-time there will be the equivalent of three length-8 beats, while in 4-time, there will be the equivalent of four length-8 beats. The tunes are presented in random order to the robot and between each pair of tunes there is a sequence of pauses during which the robot can take over the "singing". We look at one of the attempts of the PP robot to reproduce the tunes.
To remove any effects of human interference, the set-up for the experiment has two robots, robot A having the learning brain and robot B being programmed to behave in a certain way which includes the singing of songs. I will frequently refer to robot A as PP. Both robots are programmed to wait (do pause actions) while the other robot sings until the other robot signals that it is finished. The default action for PP, when it can no longer choose a singing action, is a "hand-over pause".
PP is given 2 cortical areas, SICA and SICB, defined as follows:
| {SICA: note&beat-2 note&beat-1 SICA-1 >> note&beat} | hashing prime = 11 |
| {SICB: note&beat-2 note&beat-1 SICB-1 >> note&beat} | hashing prime = 13 |
Note. These definitions are exactly equivalent to the CAs used in the simulation, but omit complications arising from the fact that the system was developed for a more complex interaction of robots. Specifically, I will be describing the system as though only robot actions are involved, while in the actual simulation stimuli were used as well. It can be seen that here the note and beat are combined in a single event-type note&beat, instead of keeping them separate as in CA-1.
A simulation was run for 8,000 steps and then the memory of PP was dumped to give us a record of all the CA nodes (instances) stored. During the simulation, as a result of the random choice of song, 47 of the 71 songs were sung by robot B, a few as many as 3 times. On looking at the behaviour of robot A, I found a sequence of 93 notes starting at step 7813 during which robot A reproduced song number 14, which was OLD KING COLE. Let me call this sequence List-X. I could have chosen many other sequences, but this was the one nearest the dump. (It is available for playing in the Downloads Section as midilistX.mid, a MIDI audio file.)
Robot B sang OLD KING COLE from step 7719 to step 7811 and this was the first time, so it was novel to PP. PP starts on step 7813 with the introductory pauses given by robot B at the beginning of 3-time and 4-time songs. On step 7817, PP chooses the first note of song 14 from 18 possibles. It then proceeds to sing song 14 accurately, with an unambiguous choice of note on all but 6 of the remaining 87 notes. In each of the six cases, it chooses the appropriate note because of the increased worth of novel events.
In the following, I will refer to the first two event-types of SICA and SICB, namely note&beat-2 and note&beat-1, as the fixed context, the remaining re-entrant event-type as the node event-type, SICAnode or SICBnode.
There are a number of ways of showing the effect of the node events. If the re-entrant node events had not been included in SICA and SICB, all 331 2-note matches between List-X and the other 46 songs heard by PP would offer a possible transfer from song 14 to another song. This is convincing evidence that the re-entrant events increase the sequential strength of the CAs beyond the 2 events of the fixed context. That there are 22 3-note matches between List-X and the other 46 songs demonstrates a sequential strength greater than 3 notes. The work being done by the node events is even more clear when we see the details (Table 1) of how they deal with the 2-note match between (for example) notes 16 and 17 of List-X and 35 places in the other songs having the same pair of notes. The two notes are "CcrS Ccr" or "480248 80248" in the coding of Table 1. The probability of two entries in Table 1 having the same pair of node events is only (1/11)*(1/13) or 1/143, but more entries in the 36 rows of Table 1 have the same pairs of node events than we should expect. To explain this, we need to return to the "zeroing" of nodes, which was mentioned briefly in the Introduction.
Note/Song |
Note&Beat(coded) |
SICAnode |
SICBnode |
18/List-X |
1002D0 |
9 |
13 |
06/7 |
80248 |
4 |
5 |
19/7 |
80248 |
3 |
6 |
33/7 |
80248 |
12 |
7 |
38/7 |
100250 |
5 |
6 |
53/7 |
80248 |
7 |
6 |
04/8 |
480408 |
4 |
5 |
33/8 |
480408 |
7 |
14 |
08/21 |
80248 |
6 |
14 |
16/21 |
80248 |
8 |
6 |
59/25 |
100550 |
6 |
11 |
04/27 |
480248 |
4 |
5 |
06/27 |
4802C8 |
9 |
8 |
19/27 |
480248 |
7 |
6 |
21/27 |
4802C8 |
9 |
8 |
34/27 |
480248 |
12 |
10 |
36/27 |
5002D0 |
3 |
12 |
41/27 |
80248 |
5 |
14 |
43/27 |
5002D0 |
6 |
15 |
47/27 |
480248 |
8 |
6 |
49/27 |
4802C8 |
9 |
8 |
62/27 |
480248 |
7 |
6 |
64/27 |
4802C8 |
9 |
8 |
29/29 |
4802C8 |
13 |
11 |
05/37 |
80248 |
4 |
5 |
12/39 |
4802C8 |
7 |
4 |
06/60 |
80248 |
4 |
5 |
18/60 |
80248 |
11 |
9 |
43/60 |
80248 |
9 |
4 |
06/65 |
802C8 |
4 |
5 |
14/65 |
802C8 |
12 |
11 |
21/65 |
802C8 |
5 |
3 |
06/69 |
480108 |
12 |
5 |
10/69 |
5002D0 |
5 |
9 |
22/69 |
480108 |
6 |
8 |
26/69 |
880008 |
9 |
12 |
Matches with fixed context "480248 80248" of note 18/List-X Table 1 |
|||
For note 18 of List 8, the fixed context (coded) is "480248 80248". These two note&beats occur just before each note of a song given in the lefthand column of Table 1. Node events used to store the note&beat following this context are shown in the third and fourth columns. The node event values are "modulo prime + 3", so the SICAnode values range from 3 to 13 for prime = 11, and the SICBnode values range from 3 to 15 for prime = 13.
We would like identical sentences, behaviours, procedures and other complete sequences to be recognized as the same when they recur. For this to happen, there needs to be a resetting of parameters in the pause before a sentence, behaviour or procedure begins. This is achieved in Re-entrant PP by zeroing the node events during a pause. The rule, which is carried out automatically by PP is that if the last two events are pauses, the node events (SICAnode and SICBnode) being stored are each given the value "2". This value is allowed because the hashed values have 3 added to them, leaving the value "2" for this purpose and the values "0" and "1" for general housekeeping. All of the six entries in Table 1 with the node events 4 and 5 come from the beginning of songs just after zero node events have been stored. There has not been time for the various sequences to be distinguished, as they all begin with the same two note&beats.
Two of the entries in Figure 1 with node events 9 and 8, and two with node events 7 and 6 are from adjacent segments of the same 29-note sequence in song number 27, ON THE BRIDGE, which is repeated as a chorus. It is only by chance that the two occurrences of this sequence are not distinguished by the node events, but non-re-entrant CAs would be unable to distinguish them without having a context of length 29 notes! Two of the other three entries with these same pairs of node events are from the same repetitive song in which the context in question occurs 12 times. Anyway, if we assume that the 36 entries in the table are independent of each other, then we can use the Binomial Distribution to show the probability of clashes. It can be shown that there is a 78% probability that the set {9,13} for SICAnode and SICBnode of note 18 in List 8 will not be seen among the 35 other sets, and it is not. It can also be shown that there are likely to be about 3 pairs of associations among the 35 that share the same sets. These predictions assume random assignment of the sets, so the associations with set {4,5} can be disregarded because they are controlled by the zeroing process. There is certainly correlation between the two sets {7,6} and {9,8} because they occur in song 27, so the 9 associations sharing 3 sets {7,6}, {8,6} and {9,8} are not too far away from the prediction of 3 pairs sharing the same sets.
The song number 14, OLD KING COLE, which the re-entrant-PP reproduces, has a sequence of 18 notes that are repeated later in the song. These two occurrences of the sequence were distinguished in the reproduction, something that a standard PP couldn't do without having CAs covering the length of the sequence.
If the CA with a fixed context of 2 events takes M units of memory, then each additional value of node event will require another M units of memory. That is, roughly, the memory required for a CA is proportional to the number of values available to the node event. Using the prime method, the memory required for the CA will be M x prime. With 2 CAs, the memory will be M x prime1 + M x prime2. In general, the memory required by several CAs will be proportional to the sum of the prime numbers. However, we have already seen that the ability of the CAs to distinguish sequences depends on the product of the primes, 143 in the case of the two primes 11 and 13. It follows, that it is preferable to use several CAs with small hashing primes, rather than one CA with a large hashing prime. Of course, we must ensure that all decisions by the CAs are unanimous.
A Microsoft Word document "ReEntrantPP29oct05.doc" with more details may be downloaded from the Downloads section.
References
Andreae, J.H. (1998): Associative Learning for a Robot Intelligence. Imperial College Press.
Edelman, G.M. & Mountcastle, V.B. (1978): The Mindful Brain. MIT Press.
Edelman, G.M. (1992): Bright Air, Brilliant Fire. Penguin Books.
Edelman, G.M. & Tononi, G. (2000): A Universe of Consciousness. Basic Books.
Lloyd, N. (1966): My First Sing-A-Song Book. London: Paul Hamlyn.
Newell, A. & Simon, H.A.(1972): Human Problem Solving. Prentice-Hall.
Sutton, R.S. & Barto, A.G. (1998): Reinforcement Learning. MIT Press.