BMOW title
Floppy Emu banner

Floppy Emu AVRGCC Mystery Behavior

avrgcc-diff

For a while I’ve been struggling with occasional mystery bugs when making changes to the Floppy Emu AVR microcontroller firmware. I’ll make some seemingly innocuous code changes, and completely unrelated things will break. Frequently I’ll change something in a menu or other user-facing code, and suddenly the Emu will lose the ability to read the SD card, and insist there’s no card present. Needless to say, this makes development a real pain in the butt.

Today I found a perfect example of this behavior, where changing the max loop count in some code that’s not even executed would cause the Emu to lose access to the SD card. Buried in the code that simulates reading a floppy disk image is this:

#define INTER_SECTOR_GAP_SIZE 55
// lots of stuff omitted
for (uint16_t i=0; i<INTER_SECTOR_GAP_SIZE; i++)
{
  SendByteAndCheckRestart(0xFF);
}

What I found is that if I changed 55 to 56 or 65 or pretty much any other number, the Emu would suddenly become unable to read the SD card. What’s strange is that this for loop isn’t even executed during Emu startup, where the SD problem occurs. And just changing a single constant from 55 to 56 shouldn’t change the size or location of the compiled code, or anything else that might reveal previously hidden memory-related bugs. But it was extremely reliable: 55 always worked fine, 56 could never read the SD card.

Determined to find an explanation, I grabbed the binary diff tool VBinDiff and used it to compare the two compiled .hex files. I fully expected the files to be identical, save for a single byte that was 0x47 in one version and 0x48 in another version. But I was stunned to find the two files differ in hundreds of places! There are at least a hundred single-byte differences, as well as whole blocks that are different between the two. Thinking maybe I was getting different compile results from one compile to the next, due to a compiled-in timestamp or something weird, I tried compiling the same source code twice and comparing the results. They were identical. But if I changed the loop count, I got a hugely different result. And somehow that result was causing an SD card problem.

At the moment I don’t have any explanation for this, except maybe for some kind of compiler optimization that kicks in after 55 something-or-others. If you’re curious, you can compare the two .hex files yourself: floppyemu-55.hex, floppyemu-56.hex. The hex files show exactly what will be programmed into the ATMEGA1284’s flash memory.

Read 15 comments and join the conversation 

15 Comments so far

  1. Steve Chamberlin - February 27th, 2015 6:18 pm

    Screenshot shows a binary diff on the .elf files, but I linked the .hex files instead. Both show a similar pattern of differences.

  2. Doug Brown - February 27th, 2015 6:20 pm

    From what I can tell, it seems at least partly to be a difference of how things are being ordered in flash, although it’s hard to do an extensive analysis without seeing the .elf files and source code. A disassembly points to 0x1B89 being the first real difference in the files (before that, the only changes are references to things below which have been shifted around slightly).

    At 0x1B89, a few chunks of code are in a different order. They are little small independent chunks that are jumped to from other locations–so it seems like you have a big function with smaller pieces that are in different random places in flash. In the 55 file, the chunks are ordered A, B, C. In the 56 file, those same chunks are ordered C, A, B. One of those chunks is part of the INTER_SECTOR_GAP_SIZE loop. The actual loop code appears to be the same, it’s just that small pieces of it been ordered differently in the generated binary–which is bizarre to me, too!

    After about 1DA6, the file seems to be mostly the same again — except the 56 version is one byte longer at this point! So my quick analysis is that a bunch of code in that range is being reordered in flash, and somehow an extra byte is being stuck in there too. I don’t know exactly where the byte is, and my head’s starting to hurt. Very odd…

  3. Steve Chamberlin - February 27th, 2015 8:51 pm

    Wow, thanks Doug! I’ll take a look at that.

    I also generated the .lss files for both versions, which is basically the assembly code output. That’s probably more helpful than starting at raw hex data: floppyemu-55.lss, floppyemu-56.lss

    Here’s the source: floppyemu-56.cpp

    And the elf files: floppyemu-55.elf, floppyemu-56.elf

  4. Doug Brown - February 27th, 2015 11:43 pm

    Thanks for posting those!

    The addresses I gave were in terms of 2-byte flash words, but objdump and the .lss file both deal in terms of bytes for their addresses. So for example, when I said 0x1B89 is where the differences begin, that is 0x3712 in objdump and the lss file. As you can see at 0x3722 in 55.lss and 0x372c in 56.lss, changing the loop from 55 to 56 rearranged the position of the loop in the generated assembly for DoTypeSpecificMainFloppy().

    I don’t think it actually changed the order of execution — each of these chunks is jumped to from elsewhere. It also seems to cause a change in the ordering of the code from 0x3850 to 0x3b4c (which appears to be all of the SendByteAndCheckRestart() calls after the changed loop?). All of this is still inside of DoTypeSpecificMainFloppy().

    Honestly, the generated assembly for that function is a nightmare to follow! It’s a huge function, and the assembly jumps all over the place. Did you compile it with the –obfuscate option to gcc or something? 😉 All the other functions are staying the same, it’s just that one function that is changing itself around in a few places. The 2-byte size difference could be due to a jump that is no longer necessary due to how the function is rearranging itself, or something like that. I’m puzzled as to why changing a loop constant caused large pieces of that function to rearrange itself, especially since the loop code itself looks to be exactly the same, just in a different position.

  5. Bryan - February 28th, 2015 12:35 am

    I used to develop language compilers back in the 1980s/1990s. In the compilation process, it seems that the AVR handles 55 differently from 56. Without looking at the actual machine language, I would guess that 56 is an odd enough of a value to the compiler so that it has to use a totally different set of routines to handle the value. For example, a compiler would handle a number that isn’t evenly divisible by X differently from a number that is divisible by X. Because you are using constants (vs a variable,) a compiler can take on shortcuts like this in order to optimize their machine language. In any case, the change in machine language would be enough to create a side effect that is totally unrelated. It could be that, upon exiting a function, the registers are different, and maybe the SD routines make some assumptions that are actually failing. I learned in school that regardless of what you think actually happens at a given point, you always clean up your registers before exiting a function. But, because you’re writing in a high-level language, you really have no say in any of the machine language code.

  6. Owen Shepherd - February 28th, 2015 4:14 am

    Which version of AVR GCC are you using? I remember reading that recently GCC has had a few bugs with flash bank crossing. For example, I believe some versions generate jump tables for switch statements which don’t work if the function is compiled above a certain address.

  7. Steve Chamberlin - February 28th, 2015 8:17 am

    It’s AVRGCC 3.3.1.27. No, there’s no –obfuscate flag, but it sure looks that way doesn’t it! And it is a very large function. It’s what used to be main(), before I restructured things to add HD20 support.

    I discovered that if I use a different SD card, the -56 version works. That’s even more confusing. Maybe the code rearrangement in the -56 version leads to some slight timing differences in the pattern of SD card accesses, and one brand of card responds to this differently than another? This lets me move forward with the work I was trying to do in the first place, but it leaves me with a very worried feeling that there’s some hidden problem here. Do I just have a flakey SD card? Or are other Floppy Emu owners going to run into the same problem, if their card behaves the “wrong” way in response to this code change?

  8. Doug Brown - February 28th, 2015 9:30 am

    OK, I graphed that function in both versions of the ELF file in my favorite disassembler, and the difference in lengths is definitely that the 56 version gets a new rjmp instruction because of the rearrangement. The MFM code at 0x3ABC in 55 is equivalent to the same code at 0x38FA in 56. After the breq instruction, the 56 version needs an rjmp to jump to the next chunk, but in 55 the next chunk is already directly below it, so no rjmp necessary.

    So…as far as I can tell, the only real difference between the two builds (other than everything below that function being shifted down by 2 bytes) is that there’s one extra rjmp instruction inserted between the “insert Address to Data gap bytes” and “insert sync bytes” loops in the MFM case in the 56 version. Would that be enough to throw off the timing of the SD card code, and is that section of code even being executed in your tests? Seems doubtful. Really weird. It’s also really strange that it only affects one SD card you’ve tested. I’m with you–calling it a glitch with that particular SD card seems like ignoring a hidden problem. But what would that problem be? 🙁

    One last interesting thing I noticed about the assembly — the code that calculates “checksum” seems to be delayed until right before its SendByteAndCheckRestart call, so there’s a slightly longer delay between the format and checksum byte. Dunno if it matters at all.

  9. Doug Brown - February 28th, 2015 9:52 am

    Just to clarify in case I didn’t make it clear, the checksum code is the same in both versions. It was just a peculiarity I noticed.

  10. Murdock - February 28th, 2015 2:10 pm

    Have you tried changing the 55 to 56 in the hex file and nothing else, and then flash it?

  11. Steve Chamberlin - February 28th, 2015 2:36 pm

    That section of code isn’t even being executed in my test – I just reboot the Emu, and it complains there’s no SD card inserted. It never even gets to the actual disk emulation code. So the SD problem must be related to the changing size or layout of the code. Maybe there’s a bug somewhere that references an uninitialized pointer, and the value that’s pointed to changes when the code gets reorganized. Now why that would be different when using one specific SD card vs another, I have no idea. 🙂 Another thing I should do is try that firmware on a different floppy emu board, with the same card. Maybe it’s some electrical level problem with the board+card combination, that’s exacerbated by a particular code path in the AVR.

  12. Doug Brown - February 28th, 2015 5:59 pm

    Hmm. Well, Murdock’s suggestion could give you some information (good idea!). If changing the hex file from 55 to 56 works with that SD card, then it has something to do with the code reorganization. Otherwise, the code reorganization is irrelevant and it’s probably some kind of a timing thing…

  13. Keenan - March 2nd, 2015 11:24 pm

    When encountering strange problems like this, I like to come at them from a different direction. The code that you’re looking at doesn’t seem to be the culprit, so why not look at the output and see what *could* be the culprit? I would capture the SD card interface on a scope/LA and compare the working and non working versions. Investigate any timing/data differences between the captures. You should be able to trace your code alongside the output and figure out exactly where the discrepancy is popping up.

    FWIW, when I encounter bugs like this, they are usually because I have neglected to properly protect multi-byte variables that are used in (timed) interrupt handlers. Small changes in code will change the timing slightly and sometimes you’ll end up with interrupts firing when the data is half written and doing weird things. If you’re right that this code never executes, and all the other code is identical, then I don’t think this change would affect timing on AVR… Not positive about that though, especially on the gigantic chips with flash pages.

  14. Steve Chamberlin - March 3rd, 2015 7:12 am

    Thanks Keenan, those are good suggestions. I don’t have an easy place to make connections to tap into the SD traffic, but I might be able to rig something up. The specific code I changed isn’t executed, but it does seem to trigger many other code re-arrangements that might possibly cause the sort of interrupt issues you’re describing. For now the problem seems to have disappeared as long as I don’t use that one specific card, but I’m sure I’ll see similar problems again.

  15. Dennis - March 10th, 2015 6:12 pm

    I’ve ran into similar problems on an ATtiny85 project I’ve been working on. A small change in one part of the code will break a completely unrelated part. And there is no good way for me to debug it other than just trying different things randomly. Very frustrating compared with gcc targeted for other processors, like x86 and ARM.

    Perhaps having it connected up to a scope would help narrow down the problem, but when working in C and making small changes, it really doesn’t feel like a scope would help out.

Leave a reply. For customer support issues, please use the Customer Support link instead of writing comments.