Early this summer, just after I had left home for the holidays, I got a mail from Robert Keck. He’d been looking into why RT-11 version 5 crashed on booting from an RL02 image, but hadn’t found the reason for the crash – just as I hadn’t in my previous debugging sessions.
Robert had used an interesting approach: he hooked up a hardware logic analyzer between the FPGA and the card. One of the things he found was that my SPI logic doesn’t adhere to the standard very well – to the point that his logic analyzer couldn’t decode it. That didn’t stop him though: he replaced my SPI logic with a generic component and made that work with the RL logic. Nice work, and a reminder that I should revisit my SPI logic – it works, but it could be better.
He hadn’t found the issue yet, but his mail included a premonition: ‘many problems are obvious only after you see them’. So it shouldn’t have been a surprise when I woke up the next day to another mail saying that he’d found the problem: the wcp register, the internal positive form of the word count that determines how much data is going to be read or written, was too narrow for one of the transfers that the RT-11 boot code tried to do.
The RL controller was the first of the disk controllers that I added to PDP2011, and back in March 2009, the only information I could find was the user manual for the controller. Like all user manuals of all RL-type controllers, it is quite specific about the word count register: the top 3 bits should always be 1.
It also goes on to give some stern warnings to programmers that the capabilities of the RL controller are indeed quite limited and that it is incapable of transfers that cross the end of a track – or, as is so nicely voiced in a comment in the 211bsd driver “This routine is stupid (because the rl is stupid)”. In other manuals, there’s even a warning that the timeout for transfers past the end of a track is quite long and should thus be avoided.
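To make the track limit concrete, here is a small Python sketch (not part of PDP2011) that checks whether a transfer would run past the end of a track. It assumes the usual RL01/RL02 geometry of 40 sectors per track and 128 words per sector; the function name and constants are made up for illustration.

```python
# Assumed RL01/RL02 geometry: 40 sectors per track, 128 16-bit words per sector.
SECTORS_PER_TRACK = 40
WORDS_PER_SECTOR = 128


def crosses_track_end(start_sector: int, word_count: int) -> bool:
    """Return True if a transfer of word_count words starting at
    start_sector would continue past the last sector of the track."""
    if not 0 <= start_sector < SECTORS_PER_TRACK:
        raise ValueError("start_sector must be in the range 0..39")
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    # Round up: a partially used sector still occupies a whole sector.
    sectors_needed = -(-word_count // WORDS_PER_SECTOR)
    return start_sector + sectors_needed > SECTORS_PER_TRACK


if __name__ == "__main__":
    print(crosses_track_end(0, 5120))  # exactly one full track
    print(crosses_track_end(0, 5121))  # one word more than a track
```

Under these assumptions a full track holds 40 × 128 = 5120 words, so any transfer longer than that crosses a track boundary no matter where it starts.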
But apparently the RT-11 programmers missed all that, because that’s exactly what they did in the boot process – a transfer long enough to go past the end of a track, and with a word count that won’t fit in the 13 bits allotted for the register. So clearly the manual is wrong about bits 15-13 being fixed to 1 – or 0, in the internal positive representation that I use. The register would have to be made wider than that. But how much wider – add one more bit, two, or all three? Robert used a more scientific approach: he checked the schematics of the controller, and those clearly show all 16 bits.
And sure enough, increasing the width of the register to the full 16 bits fixes the issue. And I think I even notice the long timeout the manuals warn about.
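The effect of the too-narrow register can be illustrated with a few lines of Python. This is only a sketch, not the VHDL from rl11.vhd: it assumes the word count is loaded as a 16-bit two’s complement negative number, as the ‘top 3 bits are always 1’ rule implies. The function names and the 9000-word transfer are invented for illustration, since the actual count that RT-11 uses isn’t given here.

```python
def encode_word_count(words: int) -> int:
    """Value a program would write to the word count register:
    the 16-bit two's complement of the number of words (assumed)."""
    if not 0 < words <= 0xFFFF:
        raise ValueError("words must be in the range 1..65535")
    return (-words) & 0xFFFF


def to_positive_count(wc_register: int, width: int) -> int:
    """Negate the register value and keep only the low `width` bits,
    mimicking an internal positive counter that is `width` bits wide."""
    positive = (-wc_register) & 0xFFFF
    return positive & ((1 << width) - 1)


if __name__ == "__main__":
    for words in (4096, 9000):  # hypothetical transfer sizes
        reg = encode_word_count(words)
        print(f"{words} words -> register {reg:#06x}, "
              f"13-bit count {to_positive_count(reg, 13)}, "
              f"16-bit count {to_positive_count(reg, 16)}")
```

For a count that fits in 13 bits, both widths agree. For the hypothetical 9000-word transfer, the 13-bit counter silently drops the top bits and ends up with 808 words, while the 16-bit counter keeps the full 9000.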
Since the fix is rather trivial, I’ll not push out an update to the download page straight away – there’s been a lot going on in the code base since the last releases on the download page, and it’d need creating branches relative to those earlier releases. Certainly doable, but it takes time I’d rather put in new developments. Here’s a screenshot of the diff:
and for your convenience, I’ll add the updated rl11.vhd file here – unzip, put the rl11.vhd in your source tree, recompile the whole lot and everything should work.
Many thanks to Robert for finding and fixing this issue, and also for not just fixing it, but showing that the fix is correct according to the schematics. It doesn’t really get better than that.