Finding the final bug

Since the 11-11-11 announcement of the project, nothing much happened for quite a while. Or so it probably seems.

Actually, I’ve worked on-and-off (mostly off, to be honest) on the minor, but still very annoying bug in 2.11 BSD – I mean the bug that crashed the C compiler when doing a kernel build.

I had already looked into the bug a bit before the 11-11-11 announcement, but with the somewhat complex structure of the C compiler, I did not easily find what was going wrong and where – considering that the exact message was just “Fatal error in /lib/c0”, there was no easy clue to start looking into. And the failed instruction the core dump pointed to made no immediate sense either.

The 2.11 BSD images I used were the RP06 disk image from PUPS, and the RK image composed by Walter Müller. The RP06 image is a bit flaky in some areas, for instance it has a mix of binaries including the short- and long versions of UT_NAMESIZE that affect the passwd bdb, so it’s not easily possible to change the shell for the root user. And also I suspect the /usr/include contents are not consistent. Walter’s images don’t have that kind of problem, but lack the sources – there simply is no room for them anywhere – but also, the source images that I tried were not consistent with Walter’s kernel and what came with it in terms of /usr/include. So, I did not have a set of reliable sources that I could consistently use to reproduce the bug; but, at one point I was playing with it, and I noticed that if a kernel build crashed on one image, I could move it to the other and it would not crash at the same point. So, by moving the build between the RP image and Walter’s RK set, I could actually complete a build – and the resulting kernel image would appear to boot somewhat ok-ish – at least, it would not be completely broken. I was completely flabbergasted at this point, and decided to concentrate on other things for a while – and basically forgot about this strange phenomenon that I could not explain. In retrospect though, it was an important clue – I should have remembered it. Well, hindsight is always 20/20.

Anyway, a couple weeks later I was playing with Unix V7 – which worked fine, I thought. But, I noticed that there was a similar problem with the C compiler, only not while building the kernel, but while building the commands in /usr/src/cmd. Some of the commands living in their own subdirectory failed. For example the build of the eqn command, that I would use to reproduce the problem and that I used as a test many times:

# cd /usr/src/cmd/eqn
# make
yacc -d e.y

conflicts: 85 shift/reduce, 71 reduce/reduce
mv y.tab.c e.c
mv y.tab.h e.def
yacc -d e.y

conflicts: 85 shift/reduce, 71 reduce/reduce
mv y.tab.c e.c
cc -O -c e.c
cc -O -c diacrit.c
cc -O -c eqnbox.c
cc -O -c font.c
cc -O -c fromto.c
cc -O -c funny.c
cc -O -c glob.c
cc -O -c integral.c
cc -O -c io.c
cc -O -c lex.c
cc -O -c lookup.c
cc -O -c mark.c
cc -O -c matrix.c
cc -O -c move.c
cc -O -c over.c
cc -O -c paren.c
cc -O -c pile.c
cc -O -c shift.c
Fatal error in /lib/c1
*** Error code 8

Stop.
#

I verified the same set with SIMH – but there the compiler worked flawlessly. In all images that I could find or create, SIMH worked flawlessly, but my VHDL consistently failed, and it always failed in the same places. So, I started to try and find out more about the problem. I started by trying to vary things in the setup. The first attempt was to see if it might be something to do with timing – so, I ran the same setup at 1Mhz instead of 12. Same thing. The next thing I tried was if there was a difference if I ran the same test on an Altera board and a Xilinx board. Or a board with dram and one with sram. No difference at all – so, obviously, the problem very likely would have to be in the VHDL.

I still did not have any useful idea on where to start looking. So, after a while, I came up with the idea to include some logic in the CPU core to show a signal outside the FPGA, to trigger my logic analyzer. I started at the instruction that consistently featured in the /lib/c1 core dumps – the mov 0104216(r0),r3.

Like so:

   when state_src6 =>
      if ir(8 downto 6) = "111" then
         addr_indirect <= unsigned(datain) + unsigned(rbus_data_p2);
      else
         addr_indirect <= unsigned(datain) + unsigned(rbus_data);
      end if;
      if rbus_data = "0010000001101110" and r7 = x"334C" then
--       no state transition - halt cpu
      else
         state <= state_src6a;
         r7 <= r7p2;
      end if;
So, effectively the CPU would crash at the point where the mov 0104216(r0),r3 instruction was executed – after several tries, I found that that was specific enough to allow Unix to boot and run everything up to the compiler run, and still consistently trigger the logic analyzer. But, to my disappointment, the ~4000 clock transitions that my logic analyzer can capture were not sufficient to show the problem – I traced back from where the trigger occurred and checked all the instructions, one by one, but all appeared to be processed correctly; at least, I did not see anything going wrong.
During the work with the logic analyzer, I also looked into the exact code that was breaking in the C compiler. It was the c12.c source file, and specifically this part:
optim(atree)
struct tnode *atree;
{
        struct { int intx[4]; };
        register op, dope;
        int d1, d2;
        struct tnode *t;
        register struct tnode *tree;

        if ((tree=atree)==0)
                return(0);
        if ((op = tree->op)==0)
                return(tree);
        if (op==NAME && tree->class==AUTO) {
                tree->class = OFFS;
                tree->regno = 5;
                tree->offset = tree->nloc;
        }
        dope = opdope[op];

where, in the compiled form the opdope[op] is interesting because the index, op, was out of range and thus causing the problem. To be more easily understood by reading the compiled form of the same bit of the compiler source, and especially considering the last two instructions, asl r0, and mov _opdope(r0):

_optim:
~~optim:
jsr     r5,csv
~atree=4
jbr     L1
L2:~op=r4
~dope=r3
~d1=177770
~d2=177766
~t=177764
~tree=r2
mov     4(r5),r2
jne     L4
clr     r0
jbr     L3
L4:mov  (r2),r4
jne     L5
mov     r2,r0
jbr     L3
L5:cmp  $24,r4
jne     L6
cmpb    $13,4(r2)
jne     L6
movb    $24,4(r2)
movb    $5,5(r2)
mov     10(r2),6(r2)
L6:mov  r4,r0
asl     r0
mov     _opdope(r0),r3

What I also found out while looking into the structure of the C compiler, was that Unix V7 already included a very advanced debugger: adb, that you can use to examine core dumps. Which produced the following:

# adb /lib/c1 core
$r
ps      0170000
pc      031512  ~optim+076
sp      0175234
r5      0175252
r4      010067
r3      0113544 _end+0154
r2      052
r1      0
r0      020156
~optim+076:     mov     0104216(r0),r3
$c
~optim(052)
~optim()
c routine not found

The first thing to check in the adb output was the instruction itself, and the ones preceding it. Obviously, a mode-6 access with the index in r0 would be out of range for a normal array for this value of r0 – like the opdope[op] would imply, and which I could easily verify to be correct from the C language sources. And also, the shift instruction converting the word index in r4 to the address index in r0 worked correctly – after all, 2 times 010067 is exactly 020156. So, from the adb output, and also by looking at the logic analyzer, it was becoming clear that the problem actually had to be something else: the stack was corrupted, because there was no way that the op value of 010067 could be correct – at the very least, the high order bit was not correct, but even if the offending bit would be ignored, 067 also did not make sense looking at the potential values for the field. Equally obvious, 052 (052 being the octal notation) could not be a valid value for a pointer into the heap. But still, looking into the logic analyzer output and all of the instructions in the assembler output from the compiler, I could not see how or why the stack pointer would become corrupted – and it simply happened outside of the tiny viewport the ~4000 clock transitions that the logic analyzer recorded.

So, effectively, I was still more or less at the same point that I started out at. Even though I had spent considerable time working on it, the problem had not become any more clear, I had found no theory to chew on, and really the only definite clue I had was that “it did not work” and the vague notion that it probably had something to do with stack corruption. But for that kind of problems, there are many possible explanations – stack corruption typically occurs after a lot of errors. So, at that point I was somewhat discouraged and again decided to work on other things for a while.

That’s where I more or less forgot about the whole problem for a couple of weeks. I concentrated on some of the other things I like to do for a while – like, working out in the climbing gym. So for some weeks I was not really thinking about the problem at all, spending a lot of time on other things, and maybe only playing with the PDP stuff in some lost moments. Until the day that I was climbing something slightly too difficult, and hurt a pulley in one of my fingers. And decided that it would be a good idea to take a rest from climbing training for some days to allow my finger to heal somewhat – or to find out if it was an injury to be worried about, because at one point it really hurt bad. So suddenly I was confronted with a whole weekend I did not have any plans for – I had no trip planned to anywhere, the weather was not inviting to go outside, and I would certainly not go to the gym for training. To cut the story short, I decided to spend the entire weekend on debugging instead.

So on the friday night, I started thinking where to begin work. I had already looked into the set of instructions that were used in the C compiler assembler source, and had found no obvious candidates to corrupt the stack – except perhaps the jsr instruction. However, that instruction had already taken the lead role in the last major debugging exercise – it was, after all, the vexing jsr r6 problem which caused compiled Fortran code to break on RT-11. And thus also broke sysgen and much more importantly Dungeon. So I thought I would not have to look at the jsr instruction, because I had already examined it ad nauseam.

So, even though I thought that there could be no problem there, I had already looked at the jsr, the differences with the implementation in SIMH, and also the formation of the MMU SR1 for jsr. Which, I noted at some point, were missing in the VHDL, but SIMH did include them. So I had already added a tentative fix in the VHDL:

                  when state_jsr =>
                     rbus_ix <= "110";
                     state <= state_jsra;

                  when state_jsra =>
                     addr_indirect <= rbus_data_m2;
                     rbus_waddr <= pswmf(15 downto 14) & "0110";
                     rbus_d <= rbus_data_m2;
                     rbus_we <= '1';
                     sr1_dstd <= sr1_m2;
                     rbus_ix <= ir(8 downto 6);
                     state <= state_jsrb;

                  when state_jsrb =>
                     if ir(8 downto 6) /= "111" then
                        rbus_waddr <= pswmf(15 downto 14) & pswmf(11) & ir(8 downto 6);
                        rbus_d <= r7;
                        rbus_we <= '1';
                     end if;
                     r7 <= dest_addr;
                     state <= state_ifetch;

but that did not cause any differences. And that was something that had already vexed me - why had Bob Supnik included the update of the SR1, if apparently none of the operating systems really needed it to run? I did not know what to think of it, really - so I started looking into the order by which the SR1 is formed, because I knew that was a difference between some of the PDP models and my VHDL - the rule appears to be that 'the register that is modified first goes into the lower byte, and the other goes into the upper byte'. But different models update the registers in different orders... so which version to follow?

That is when I realized that by the fix I had applied I wrote the sr1_dstd field - but, since the state_jsr follows the destination address calculation, that field is already potentially used by the state_dstX FSM states. I quickly changed the VHDL to:

                  when state_jsr =>
                     rbus_ix <= "110";
                     state <= state_jsra;

                  when state_jsra =>
                     addr_indirect <= rbus_data_m2;
                     rbus_waddr <= pswmf(15 downto 14) & "0110";
                     rbus_d <= rbus_data_m2;
                     rbus_we <= '1';
                     sr1_srcd <= sr1_m2;
                     rbus_ix <= ir(8 downto 6);
                     state <= state_jsrb;

                  when state_jsrb =>
                     if ir(8 downto 6) /= "111" then
                        rbus_waddr <= pswmf(15 downto 14) & pswmf(11) & ir(8 downto 6);
                        rbus_d <= r7;
                        rbus_we <= '1';
                     end if;
                     r7 <= dest_addr;
                     state <= state_ifetch;

and reran the usual test - flash the Unix V7 image, because it would be corrupt after failing a test, then booting it, changing to /usr/src/cmd/eqn, and running make. Even though I had not really expected the test to pass outright, I was still somewhat disappointed that it failed - but, it failed in a different way - so I knew that I had somewhat affected the nature of the problem. Now, it crashed in the same compilation, but apparently earlier in the same run - or at least, at an instruction closer to the start of the /lib/c1 executable.

So that puzzled me for a while. Then I came up with the idea that since it was likely a stack issue, and likely the jsr instruction that caused a stack push that caused the MMU to abort and therefore the instruction restart routines to kick in and somehow fail, that it would make sense to look into the order of the updates to the registers during the jsr instruction.

Oops. So, if the stack push would fail, the CPU state machine during the memory access that would be aborted would have already progressed to the state where it would update the target register for the instruction. So, even though the jsr instruction would be aborted because of the stack push, the target register would be written into - and thus, after the instruction restart carefully constructed by the OS fault handler, the restarted instruction would contain the wrong value of the target register. In any case but jsr pc, that would mean that the target register would be updated twice, but with only one update to the stack - so the original value of the register would be lost. Ouch. So, once I saw that, the fix was easy, just introduce a new intermediate state so that the update to the stack (which may cause the MMU to abort) and the update to the target register to actually occur in different cycles - so that if the memory cycle causes an abort, no update to the target register will have been initiated. As follows:

                  when state_jsr =>
                     rbus_ix <= "110";
                     state <= state_jsra;

                  when state_jsra =>
                     addr_indirect <= rbus_data_m2;
                     rbus_waddr <= pswmf(15 downto 14) & "0110";
                     rbus_d <= rbus_data_m2;
                     rbus_we <= '1';
                     sr1_srcd <= sr1_m2;
                     rbus_ix <= ir(8 downto 6);
                     state <= state_jsrb;

                  when state_jsrb =>
                     state <= state_jsrc;

                  when state_jsrc =>
                     if ir(8 downto 6) /= "111" then
                        rbus_waddr <= pswmf(15 downto 14) & pswmf(11) & ir(8 downto 6);
                        rbus_d <= r7;
                        rbus_we <= '1';
                     end if;
                     r7 <= dest_addr;
                     state <= state_ifetch;

After applying the fix, I was somewhat surprised - but happily so - that the problem in the make eqn test was apparently fixed. The realization that I had found the problem and actually fixed it came as something of a shock. And, going back to my plan of spending the entire weekend to chew on the problem, by the time I found the fix, it was only 10PM on the Friday night. Quickly I started test runs on several boards in parallel - so I had Unix V7 running all kinds of work on three different boards, and was working on another to check 2.11BSD. Which also now ran commands that previously repeatedly crashed and repeatedly completed kernel builds, in short: everything worked flawlessly.

Incredible.

So, I spent most of the night in the enlightened state that you can only reach when something of an incredibly complex technical nature unexpectedly works like a charm. Watching how it works over and over again, hours on end. In a state somewhere between disbelief and utter amazement, just looking at what I had created. If there is a heaven for geeks, it must be something very close to this.

So the next morning I had no plan for the rest of the weekend. Well, to be honest, by the time I woke up normal people were already well into their notion of afternoon. And this day, for the first time this season there was snow - and a lot of it too, 4 inches easily, but by the time I woke up the sun was shining and the sky was clear. So I took my camera and went out into the nature reserve that I consider my back yard and spent some time watching an incredible sunset over a frozen plain seemingly inhabited only by wildlife and me.

Some days, life really is wonderful.

1 comment to Finding the final bug