
Banging my Head Against the Wall

Still working on the uClinux bring-up for my 68000 protoboard machine. Still not there. Why is this so hard?

Yesterday the boot process was working up to the point where it tried to mount the root filesystem image. I finally got that sorted out, as well as the mistake I’d made in the available ROM size. I created a hybrid boot config that placed the filesystem image in ROM, and expected everything else to be preloaded in RAM by my monitor program. Finally I was ready to boot up all the way to a shell prompt! Alas, it wasn’t meant to be. Here’s what’s happening now – with a lot of debug prints added:

uClinux/MC68000
Flat model support (C) 1998,1999 Kenneth Albanowski, D. Jeff Dionne
KERNEL -> TEXT=0x080000-0x0b95d4 DATA=0x0bef18-0x0c5b50 BSS=0x0c5b50-0x0d3f53
KERNEL -> ROMFS=0x0c5b50-0x0c5b50 MEM=0x0d3f60-0x0fbe00 STACK=0x0fbe00-0x0ffe00
No Command line passed
Done setup_arch
start_mem is 0xd3f60
virtual_end is 0xfbe00
before free_area_init
free_area_init -> start_mem is 0xd6f60
virtual_end is 0xfbe00
MC68000 68 Katy support by Big Mess o' Wires, Steve Chamberlin
Calibrating delay loop.. ok - 0.81 BogoMIPS
Mem_init: start=db000, end=fbe00
Memory available: 128k/243k RAM, 0k/0k ROM (741k kernel code, 876k data)
Swansea University Computer Society NET3.035 for Linux 2.0
NET3: Unix domain sockets 0.13 for Linux NET3.035.
uClinux version 2.0.39.uc2 (ubuntu@ubuntu-VirtualBox) (gcc version 2.95.3 20010315 (release)(ColdFire patches - 20010318 from http://fiddes.net/coldfire/)(uClinux XIP and shared lib patches from http://www.snapgear.com/)) 48 Fri Nov 14 00:42:45 CET 2014
systctl_init
kernel_thread
idling...
init
setup
sys_setup
FTDI FT245 driver version 1.00 by Steve Chamberlin
ttyS0 at 0x00060000 (irq = 2) is a FTDI FT245
Ramdisk driver initialized : 16 ramdisks of 4096K size
Blkmem copyright 1998,1999 D. Jeff Dionne
Blkmem copyright 1998 Kenneth Albanowski
Blkmem 1 disk images:
0: 3000-25FFF (RO)
VFS: Mounted root (romfs filesystem) readonly.
sys_setup_done
still running.
open console
trying /etc/init
trying /bin/init
do_mmap:
Process blocks 1: 000dd2ec: 00000000 -> 000e60d8: 000e60f8 (20456 @000ee018 #1).
bdflush() activated...sleeping again.
Exit_mmap:
Process blocks 3: 000dd36c: 00000000.
do_mmap:
Process blocks 3: 000dd36c: 00000000 -> 000e6118: 000e6138 (12264 @000f5018 #1).
Jump to 0

zBug(ROM) for 68Katy (press ? for help)

084000>

It successfully mounts the filesystem, it starts the init process, there are a couple of memory allocations, and then suddenly I’m back at the monitor prompt. What the hell?!

There is no intentional mechanism by which uClinux can return control to the ROM monitor, but that’s what appears to have happened. But how? After doing some sleuthing, it looks like uClinux is jumping to address zero. Normally that would cause a crash, since address zero is supposed to hold the initial supervisor stack pointer that the CPU loads after a reset, not executable code. But my monitor program has a “feature” where it stores a branch instruction at address zero, and initializes the stack elsewhere. By making that branch instruction point to a different spot than the actual CPU reset vector, I was able to confirm that a jump to zero is happening, rather than an actual CPU reset or a call to the kernel’s hard_reset_now() function.

The easiest way to accidentally jump to address zero is to overwrite part of the stack with zeroes while in a subroutine or interrupt handler. When the return address is popped off the stack, boom! The program goes to zero city. That can happen in a buggy program, or after a stack overflow. But those shouldn’t be concerns when running standard pieces of the Linux boot process, should they?

My fear is that this isn’t a software bug at all, but glitchy hardware behavior due to noise or failed timing constraints or poor electrical decoupling. I saw something very similar a couple of days ago, where the monitor program would reliably jump back to the command prompt whenever I tried to load a binary file, before I’d even started loading the file. The problem disappeared when I removed a recently-added bypass capacitor from the protoboard. Hmmmm, not good.

But assuming for a moment that this is a software bug, let’s take a closer look at what’s happening right before it dies. Everything up to the line trying /bin/init is the kernel startup code talking. When it’s done initializing things, it launches the user-mode program init, which is the first real process. Init’s job is to start all the other user-mode processes, like the login shell and background services. Launching it is a big deal: the kernel needs to locate /bin/init in the filesystem, allocate memory, copy the program binary into the allocated memory, perform load-time patch-up of the program’s addresses given the physical address at which it was loaded (no virtual memory here), and finally start the program running as a new process to be managed by the scheduler. So it’s possible something’s going wrong there, but… this is the Linux kernel, it’s supposed to work. I thought that once I’d written my hardware-specific code, my porting job would be done. I didn’t expect to have to debug execve().
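
To make the “load-time patch-up” step concrete, here is a minimal sketch in C of what that kind of fix-up involves. It is not the actual uClinux loader and does not follow the real flat binary format: the function name demo_relocate, the offset table, and the assumption that the image was linked as if it started at address 0 are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical load-time fix-up. The image is assumed to have been linked
 * as if it started at address 0; relocs[] lists the byte offsets of every
 * 32-bit word inside the image that holds an absolute address.
 */
void demo_relocate(uint8_t *image, size_t image_len,
                   const uint32_t *relocs, size_t nrelocs,
                   uint32_t load_base)
{
    assert(image != NULL && (relocs != NULL || nrelocs == 0));

    for (size_t i = 0; i < nrelocs; i++) {
        uint32_t off = relocs[i];
        uint32_t value;

        /* Refuse entries that would patch outside the image. */
        assert(image_len >= sizeof value && off <= image_len - sizeof value);

        /* memcpy avoids unaligned access. Host byte order is used here,
           whereas a real 68000 image would be big-endian. */
        memcpy(&value, image + off, sizeof value);
        value += load_base;   /* turn a link-time address into a real one */
        memcpy(image + off, &value, sizeof value);
    }
}
```

With no virtual memory, every absolute address inside the program has to be adjusted like this before the first instruction runs, so a bad table entry or a wrong load base is enough to send the program off into the weeds.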

I’m not completely sure how to interpret those last mmap log entries, but I think they’re saying that process id 1 (that should be init) has a single allocated block of 20456 bytes at address 0x0ee018 with a refcount of 1. Then later there are a couple of lines from process 3, which seems to imply init successfully started some child processes. Or did it?
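
Taking that reading of the log at face value (size in bytes, then address, then refcount, which is only my guess at the format), a few lines of C can pull the numbers out and work out where each block ends. The struct and function names here are invented for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Fields as guessed from the log text: "(SIZE @ADDRESS #REFCOUNT)". */
struct demo_block {
    unsigned long size;   /* decimal byte count */
    unsigned long addr;   /* hexadecimal start address */
    unsigned refs;        /* reference count */
};

/* Parse one "(20456 @000ee018 #1)" fragment. Returns 1 on success, 0 otherwise. */
int demo_parse_block(const char *text, struct demo_block *out)
{
    assert(text != NULL && out != NULL);
    return sscanf(text, " (%lu @%lx #%u",
                  &out->size, &out->addr, &out->refs) == 3;
}

/* First address past the end of the block. */
unsigned long demo_block_end(const struct demo_block *b)
{
    assert(b != NULL);
    return b->addr + b->size;
}
```

Under that interpretation the first block ends at 0x0f3000 and the second (12264 bytes at 0x0f5018) ends at 0x0f8000, so both start 0x18 bytes into a 0x1000-aligned region and run exactly to the next 0x1000 boundary.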

I’ll try adding some debug print statements to init to see what’s happening, as well as waving a dead chicken over the board to scare off any hardware gremlins lurking there. This is proving to be a tougher task than I’d planned for!

16 Comments so far

  1. murdock - November 13th, 2014 8:55 pm

    It does say “Jump to zero”. Any idea why? Could you search for the offending code then change the jump address to what you want it to be at? Or at least find out what determines “0”?

  2. Steve Chamberlin - November 13th, 2014 10:04 pm

    The “jump to zero” message comes from my own monitor program, not any uClinux code. It means that a jump to zero just occurred (the branch instruction at address 0 was taken). It’s probably due to some kind of memory trashing going on, but I need to figure out where and why. I monkeyed around with things enough to convince myself it probably *is* a software bug and not a hardware glitch, so at least that’s some small consolation. It may just be running out of memory – by my accounting, there would only be about 30K left at that point. But I would expect a failed memory allocation to be handled better – kernel panic message, terminated process, or something. Not just random memory trashing.

  3. kbob - November 14th, 2014 9:59 am

    It could also be an indirect call through a function pointer that was never initialized. Not a “Jump to Zero”, a “Call Zero”. (-:

    Good luck.

  4. kbob - November 14th, 2014 10:01 am

    Can you put a logic analyzer on it and see where the PC was just before it was zero?

  5. Steve Chamberlin - November 14th, 2014 10:44 am

    Found it! Kbob was right, and it was due to a mistake in the serial driver I wrote. I didn’t provide a “write_room” function for the driver, because the docs said it was optional. After long and painful debugging, I discovered that the function *is* required. The kernel was trying to call my non-existent function using a zero value from a function pointer table, which led to the jump-to-zero reset.

    Now it gets even further into the boot process. Init is definitely running, and it spawns a shell to execute my startup script, /etc/rc. And the first thing the script does is “/bin/expand /etc/ramfs.img /dev/ram0”. Immediately afterwards it prints a bunch of memory-related messages including “Couldn’t get a free page” and “Unable to allocate RAM for process data”. It looks like /bin/expand is terminated correctly, then it tries the next command from the script “mount -t proc proc /proc”. That generates endless “Failed to free page” messages until I hit the reset button. So… need more RAM?
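
The write_room failure described in comment 5 boils down to calling through a function pointer that was left at zero. A minimal sketch of the pattern, and of the defensive check that would have turned the crash into an error code, might look like the following. The struct here is a cut-down, hypothetical stand-in and not the kernel’s real driver structure; every demo_ name and the DEMO_ENOSYS value are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified table of driver callbacks. */
struct demo_tty_ops {
    int (*write)(const char *buf, int count);
    int (*write_room)(void);   /* bytes the driver can accept right now */
};

enum { DEMO_ENOSYS = -1 };     /* made-up "not implemented" result */

static int demo_write(const char *buf, int count) { (void)buf; return count; }
static int demo_write_room(void) { return 64; }

/* Call write_room only if the driver supplied one. Without the NULL
   check, the call below would jump to address zero. */
int demo_query_write_room(const struct demo_tty_ops *ops)
{
    assert(ops != NULL);
    if (ops->write_room == NULL)
        return DEMO_ENOSYS;
    return ops->write_room();
}

/* A driver that left write_room out: the member is implicitly NULL. */
const struct demo_tty_ops demo_incomplete = { .write = demo_write };

/* A driver that fills in every callback. */
const struct demo_tty_ops demo_complete = {
    .write = demo_write,
    .write_room = demo_write_room,
};
```

Calling demo_query_write_room(&demo_incomplete) returns DEMO_ENOSYS instead of branching to zero, while demo_query_write_room(&demo_complete) returns 64. In this case the kernel made the call without such a check, which is why the driver has to provide the function.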

  6. kbob - November 14th, 2014 1:28 pm

    Glad you found your bug. Fortunately, in 2014 another Megabyte of RAM costs quite a bit less than it did in 1983.

  7. Steve Chamberlin - November 14th, 2014 2:10 pm

    Thanks! The hard part isn’t getting more RAM, it’s fitting the extra RAM into the CPU’s 1MB of address space. Given the parts I have, I think it’ll actually be easier to increase the amount of ROM in the address space, stealing from device address space, while keeping the RAM the same. Then I can put more of the uClinux kernel in ROM, which will free up additional RAM for running uClinux user programs.

  8. Steve Chamberlin - November 15th, 2014 11:18 am

    And we have a shell prompt! Video coming soon…

  9. Steve Chamberlin - November 15th, 2014 3:25 pm

    …and there’s a crazy memory leak somewhere. Looks like malloc’d memory never gets freed, so the machine dies before long. Ugh.

  10. John Honniball - November 17th, 2014 1:47 am

    Before I read that comment about the driver code, I wondered if there’s any way to run a soak test on the hardware? Something like a memory test that runs continuously and hammers the RAM. The old 68000 Suns used a write-write-read-read memory test. It wrote a byte (or word) that was the inverse of what the test required, then wrote the true byte, then read it back twice. That test sequence ensures that all the bits in that memory location are changed by the write, and then ensures that they don’t change when read back. I’ve implemented it for the 6809 using the two accumulators for the write and read cycles, so that they are as back-to-back as possible.

    Incidentally, I found a way on the 6809 to call a subroutine and return without using the stack. Can you work out what I did? Hint: it’s a bit like a RISC chip such as the ARM.
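
For reference, the write-write-read-read test described in comment 10 could be sketched in C roughly as follows. This is only an illustration of the sequence as described there, not the Sun or 6809 code; the function name and the choice to return the index of the first failing byte are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write-write-read-read test over len bytes starting at base.
 * Returns the index of the first byte that fails, or len if all pass.
 * volatile keeps the compiler from optimizing away the repeated accesses.
 */
size_t demo_wwrr_test(volatile uint8_t *base, size_t len, uint8_t pattern)
{
    assert(base != NULL);

    for (size_t i = 0; i < len; i++) {
        base[i] = (uint8_t)~pattern;  /* write the inverse first */
        base[i] = pattern;            /* then the true value, flipping every bit */

        uint8_t first = base[i];      /* read back twice in a row */
        uint8_t second = base[i];

        if (first != pattern || second != pattern)
            return i;
    }
    return len;
}
```

On real hardware this would be run in a loop over the RAM region with several patterns (0x55 and 0xAA, for example), taking care not to test the memory the test itself is running from. Compiled C will not keep the accesses as tightly back-to-back as hand-written assembly does.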

  11. Steve Chamberlin - November 17th, 2014 7:45 am

    A memory test like that shouldn’t be difficult to add to my monitor program – hopefully just a few lines of assembly.

    The early portions of the 68000 Mac startup ROM (before RAM is tested and initialized) make a one-deep subroutine/return using an address register to hold the return address, instead of the stack. Is that what you did? I’m not familiar with the ARM design so that doesn’t help much. Is there a way to perform indirection into the CPU register file itself?

  12. Erik Petrich - November 17th, 2014 3:25 pm

    6809 lets you do exchanges and transfers with the program counter, just like the rest of the registers. So if you have no RAM and thus no working stack, you could save your return address in S (the stack pointer register) itself.

    lds #subr ; or leas subr,pcr for position independent code
    exg s,pc ; call subr

    subr:
    nop ; do whatever
    exg s,pc ; return

    I had good times with 6809 several decades ago.

  13. Steve Chamberlin - November 17th, 2014 3:56 pm

    In my mind you’re a 25-year-old grad student, so you must have been a pretty talented infant to tackle 6809 development.

  14. Erik Petrich - November 17th, 2014 5:50 pm

    You’re off by two decades on my age, but right in spirit. I grew up learning electronics at the same time I was learning to read and then moved on into software around age 10. My first computer of my own was a TRS-80 Color Computer, which used a 6809, but I previously learned assembly for the Apple II (6502) and the original TRS-80 (Z-80). At the moment I am trying to make the leap from a very experienced grad student to actual faculty.

    I enjoy watching you work on your projects since they are things that I either have done or wish I could do if I had more free time.

  15. John Honniball - November 20th, 2014 6:20 am

    About the 6809 subroutine call: Yes, that’s exactly how I did it, using the EXG instruction to swap current PC into a 16-bit register and swap back to return.

  16. Dan - May 13th, 2015 7:04 am

    A bit late, but something that I’ve been working on resurrecting to make things like this simpler is a 68k simulator. I have been working with BSVC (https://github.com/BSVC/bsvc) and just released version 2.2.1 (https://github.com/BSVC/bsvc/releases/tag/v2.2.1). It doesn’t support all of the devices on 68katy, but it works well enough that I’ve gotten a port of the original PDP-11 version of Xinu to (almost!) boot, and adding devices is fairly straightforward; I bet it could be used to model 68katy and would boot uClinux. And it’s much more pleasant to debug software under an emulator than on real hardware. 🙂
