More on Fast Interrupt Handling with Cortex M4

April 26th, 2018 | Category: Bit Bucket, Floppy Emu | Author: Steve

Can a fast microcontroller replace external glue logic, while also continuing to run application code? This is the third in a series of posts considering the question. It’s part of a potential simplification of my Floppy Emu disk emulator hardware, whose present design combines an MCU and a CPLD for glue logic. For readers that haven’t seen the first two parts, you can find them here. Read these first, including the comments discussion after the post body. Go ahead, I’ll wait.

Thoughts on Floppy Emu Redesign
Thoughts on Low Latency Interrupt Handling

There are several pieces of CPLD glue logic that I’m hoping to replace with interrupt handlers on a Cortex M4 microcontroller, specifically the 120 MHz Atmel SAMD51 Cortex M4. The most challenging is a piece of logic that behaves like a 16:1 mux, and must respond within 500 ns to any change on its address inputs. There’s also a write function that behaves a little like a 4-bit latch, as well as some enable logic. I haven’t yet done any real hardware testing, but I’ve spent many hours reading datasheets, writing code, and examining compiler output. I’ll save you the suspense: I don’t think it’s going to work. But it’s close enough to keep it interesting.

Coding an Interrupt Handler

A 120 MHz MCU means there are 120 clock cycles per microsecond. To meet the 500 ns (half a microsecond) timing requirement for the mux logic, the MCU needs to do its work in 60 clock cycles. Cortex M4 has a built-in interrupt latency of 12 clock cycles before the interrupt handler begins to run, so that leaves just 48 clock cycles to do the actual work. At best that’s enough time for 48 instructions. In reality it will be fewer than 48, due to pipeline issues, cache misses, branches, flash memory wait states, and the fact that some instructions just inherently take more than one clock cycle. But 48 is the theoretical upper bound.

I spent a while digging through the heavily-abstracted (or should I say obfuscated) code of Atmel Start, the hardware abstraction library provided for the SAMD51. Peeling back the layers of Start, I wrote a minimal interrupt handler that directly manipulates the MCU configuration registers for maximum speed, rather than using the Start API. I ignored the write latch and the enable logic for the moment, and just wrote an interrupt handler for the 16:1 mux function. Bearing in mind this code has never been run on real hardware, here it is:

volatile uint32_t selectedDriveRegister;
volatile uint32_t driveRegisters[16];

void EIC_Handler(void) 
{
	// a shared interrupt handler for changes on five different external pins:

	// EXTINT0 = PA00 = SEL - interrupt on rising or falling edge
	// EXTINT1 = PA01 = PH0 - interrupt on rising or falling edge
	// EXTINT2 = PA02 = PH1 - interrupt on rising or falling edge
	// EXTINT3 = PA03 = PH2 - interrupt on rising or falling edge
	// EXTINT4 = PA04 = PH3 - interrupt on rising edge
	// PA11 = output

	uint32_t flags = EIC->INTFLAG.reg; // a 1 bit means a change was detected on that pin

	// clear EXTINT0-4 flags, if they were set.
	EIC->INTFLAG.reg = (flags & 0x1F); // writing a 1 bit clears the interrupt flags

	// mask the 4 lowest bits and use them as the address of the desired drive register
	selectedDriveRegister = PORT->Group[GPIO_PORTA].IN.reg & 0xF; 

	// don't need to check if drive is enabled. 
	// output enable will be handled externally in a level shifter.
	switch (selectedDriveRegister)
	{
		case 7: 
			// motor tachometer
			// enable peripheral multiplexer selection
			PORT->Group[GPIO_PORTA].PINCFG[11].bit.PMUXEN = 1; 
			// choose TIMER/COUNTER1 peripheral
			PORT->Group[GPIO_PORTA].PMUX[11>>1].bit.PMUXO = MUX_PA11E_TC1_WO1; 
			break;

		case 8:
			// disk data side 0
			// enable peripheral multiplexer selection
			PORT->Group[GPIO_PORTA].PINCFG[11].bit.PMUXEN = 1; 
			// choose SERCOM0 peripheral
			PORT->Group[GPIO_PORTA].PMUX[11>>1].bit.PMUXO = MUX_PA11C_SERCOM0_PAD3; 
			// TODO: main loop must check selectedDriveRegister to see if it's 8 or 9 when adding 
			// new bytes to SPI
			break;

		case 9:
			// disk data side 1
			// enable peripheral multiplexer selection
			PORT->Group[GPIO_PORTA].PINCFG[11].bit.PMUXEN = 1; 
			// choose SERCOM0 peripheral
			PORT->Group[GPIO_PORTA].PMUX[11>>1].bit.PMUXO = MUX_PA11C_SERCOM0_PAD3; 
			// TODO: main loop must check selectedDriveRegister to see if it's 8 or 9 when adding
			// new bytes to SPI
			break;

		default:
			// disk state flags and configuration constants
			// disable peripheral multiplexer selection, return to standard GPIO 
			PORT->Group[GPIO_PORTA].PINCFG[11].bit.PMUXEN = 0; 
			// set the output pin high or low, according to the register state
			if (driveRegisters[selectedDriveRegister])
				PORT->Group[GPIO_PORTA].OUTSET.reg = (1 << 11);
			else
				PORT->Group[GPIO_PORTA].OUTCLR.reg = (1 << 11);
			// TODO: also change the PA11 output in the main loop, if the selected register
			// changes its value
			break;
	}
}

You'll notice that EXTINT4 (the PH3 signal on the disk interface) isn't actually used in this code, but it will be needed later for the write latch.

The default of the switch statement is about what you'd expect: it uses four of the inputs to construct a 4-bit address, then uses that address to access an array of 16 internal drive registers. Then it sets the output pin high or low, depending on the internal register value.

Addresses 7, 8, and 9 get special handling. These aren't really registers, but are pass-throughs of the drive motor tachometer signal or of the instantaneous read head data from the top or bottom of the disk. They're not static values, but rather are constantly changing streams of data. I plan to implement the tachometer using the timer/counter peripheral, and the read head data using the SPI peripheral. All of these functions share the same pin, PA11. The code must enable and disable the peripheral pin remapping functions as needed.

After finishing this speculative interrupt handler code, I compiled it in Atmel Studio, using gcc with -O2 optimization. Then I viewed the .lss to see what code the compiler generated:

00000c70 <EIC_Handler>:
 c70:	481f      	ldr	r0, [pc, #124]	; (cf0 <EIC_Handler+0x80>)
 c72:	4b20      	ldr	r3, [pc, #128]	; (cf4 <EIC_Handler+0x84>)
 c74:	6942      	ldr	r2, [r0, #20]
 c76:	4920      	ldr	r1, [pc, #128]	; (cf8 <EIC_Handler+0x88>)
 c78:	f002 021f 	and.w	r2, r2, #31
 c7c:	6142      	str	r2, [r0, #20]
 c7e:	6a1a      	ldr	r2, [r3, #32]
 c80:	f002 020f 	and.w	r2, r2, #15
 c84:	600a      	str	r2, [r1, #0]
 c86:	680a      	ldr	r2, [r1, #0]
 c88:	2a08      	cmp	r2, #8
 c8a:	d012      	beq.n	cb2 <EIC_Handler+0x42>
 c8c:	2a09      	cmp	r2, #9
 c8e:	d010      	beq.n	cb2 <EIC_Handler+0x42>
 c90:	2a07      	cmp	r2, #7
 c92:	f893 204b 	ldrb.w	r2, [r3, #75]	; 0x4b
 c96:	d01a      	beq.n	cce <EIC_Handler+0x5e>
 c98:	f36f 0200 	bfc	r2, #0, #1
 c9c:	f883 204b 	strb.w	r2, [r3, #75]	; 0x4b
 ca0:	4816      	ldr	r0, [pc, #88]	; (cfc <EIC_Handler+0x8c>)
 ca2:	680a      	ldr	r2, [r1, #0]
 ca4:	f850 2022 	ldr.w	r2, [r0, r2, lsl #2]
 ca8:	b9ea      	cbnz	r2, ce6 <EIC_Handler+0x76>
 caa:	f44f 6200 	mov.w	r2, #2048	; 0x800
 cae:	615a      	str	r2, [r3, #20]
 cb0:	4770      	bx	lr
 cb2:	f893 204b 	ldrb.w	r2, [r3, #75]	; 0x4b
 cb6:	f042 0201 	orr.w	r2, r2, #1
 cba:	f883 204b 	strb.w	r2, [r3, #75]	; 0x4b
 cbe:	f893 2035 	ldrb.w	r2, [r3, #53]	; 0x35
 cc2:	2102      	movs	r1, #2
 cc4:	f361 1207 	bfi	r2, r1, #4, #4
 cc8:	f883 2035 	strb.w	r2, [r3, #53]	; 0x35
 ccc:	4770      	bx	lr
 cce:	f042 0201 	orr.w	r2, r2, #1
 cd2:	f883 204b 	strb.w	r2, [r3, #75]	; 0x4b
 cd6:	f893 2035 	ldrb.w	r2, [r3, #53]	; 0x35
 cda:	2104      	movs	r1, #4
 cdc:	f361 1207 	bfi	r2, r1, #4, #4
 ce0:	f883 2035 	strb.w	r2, [r3, #53]	; 0x35
 ce4:	4770      	bx	lr
 ce6:	f44f 6200 	mov.w	r2, #2048	; 0x800
 cea:	619a      	str	r2, [r3, #24]
 cec:	4770      	bx	lr
 cee:	bf00      	nop
 cf0:	40002800 	.word	0x40002800
 cf4:	41008000 	.word	0x41008000
 cf8:	2000063c 	.word	0x2000063c
 cfc:	200005f8 	.word	0x200005f8

I don't know much about ARM assembly, but I can count 44 instructions. Already that looks pretty dubious for execution in 48 clock cycles. A couple of cache misses, or multi-cycle branches, or anything else that requires more than one clock per instruction, and the interrupt handler will be too slow to work. And if I attempt to add the missing write latch logic, the code will almost certainly be too slow. Even just an if() test to see whether the write latch was written would probably be too much extra code.

Meanwhile the microcontroller will be running the main application, responding to user input, updating the display, and streaming disk data. Occasionally the main loop will need to do an atomic operation, requiring interrupts to be disabled for a few clock cycles. If an external pin changes state during that time, the interrupt handler will be delayed by a few clock cycles.

The interrupt handler shown above is appropriate for one of the Floppy Emu's many disk emulation modes. In other modes, a different behavior is needed. A real interrupt handler would need some more if() checks at the beginning to perform different actions depending on the current emulation mode. This would add a few clock cycles more.

Even reaching this "almost fast enough" level would require some minor heroics. I'm fairly certain the interrupt handler code would need to be in RAM, not flash, to minimize or eliminate flash wait states. Even RAM might not be enough - it might need to be placed in the special "tightly coupled memory" region. The vector table itself probably also needs to be relocated from flash to RAM or TCM. This should be theoretically possible, but it's the sort of uncommon thing that's often difficult to find good documentation or examples about, and that eats up lots of development time.

To make a long story short - it doesn't look like it's going to work. And even if it did work, it might be such a pain in the ass that it negates any gain I'd get by eliminating the CPLD. And yet it looks pretty close to working, at least within a factor of two if not less. If the timing requirement were 1000 ns instead of 500 ns, I think I could make it work.

Other Interrupt Oddities

According to the docs I've read, interrupt handlers on Cortex M are just like any other function. There's no special interrupt prologue or epilogue, and there's no RTI return from interrupt instruction. And yet gcc does specify an interrupt attribute for ARM functions:

__attribute__ ((interrupt))

The code in Atmel Start doesn't appear to use that attribute for its interrupt handlers. So is it needed or not? What does it do? As best as I can tell, it adds some extra code that aligns the stack pointer upon entry to the interrupt handler, but why? If I add the interrupt attribute to my EIC_Handler(), it gets many instructions longer.

Another unanswered question is how to handle nested interrupts. EIC_Handler wouldn't be the only interrupt handler in the firmware, but it should be the highest priority. If another interrupt handler is running when an external pin changes state, that handler should be pre-empted and EIC_Handler should be started. The Cortex M4 supports nested interrupts, but is there any extra code needed in the interrupt handlers to make it work correctly? Extra registers that must be pushed and popped? I'm not sure, but this discussion suggests the answer is yes. If so, that would add still more instructions to the interrupt handler, making it even slower.

15 Comments so far

deater - April 27th, 2018 9:41 am

I’m not sure if this is true on Cortex-M, but on Cortex-A the __attribute__ ((interrupt)) tells gcc to subtract 4 from the link register before returning from the function [as this is necessary to return to the proper address for historical reasons]

As for priority, do Cortex-M chips support the FIQ (fastest, highest priority interrupt that also auto-saves registers for you) interrupts?

For code that is as real time as what you describe I almost wonder if the only option would be to disable the caches and do everything in assembly with cycle counting, sort of like the original hardware did back in the day.

deater - April 27th, 2018 9:48 am

and I should have researched more *before* replying, as it turns out Cortex-M interrupt handling is completely different than Cortex-A. So just ignore everything I said 🙁

I’ve just been spending too much time recently inside of Cortex-A interrupt handlers while writing a custom OS for the Raspberry Pi.

Tux2000 - April 27th, 2018 10:16 am

From experience with our bare-metal code at $WORK, __attribute__ ((interrupt)) is not needed on SAM4S, SAM4N, and SAMD21 series. My guess would be that it is also not needed on the SAMD51, as the SAMD5x series is an improved SAMD21.

Switching to a different vector table is quite easy. By default, it is at the start of flash, but it can be moved to some other location (in flash or RAM). See for example the SAM-BA code for ROM-less SAMs (http://ww1.microchip.com/downloads/en/DeviceDoc/SAM-BA_MONITOR_ROMLESS_v2.18.zip), function check_start_application() in main.c.

Scott - April 27th, 2018 12:57 pm

I don’t understand why you’re talking as if that ISR has to execute all 44 instructions? In the assembly, I count 4 different branches and 4 different exit points. The worst-case timing is only the total instructions of the longest branch sequence. Might be worth spending the time to annotate that assembly and figure out how long the worst case will really take.

Steve - April 27th, 2018 2:19 pm

You’re absolutely right! In my ignorance of ARM assembly, I didn’t realize “bx lr” was a return point. If I’m counting correctly the worst case is only 26 instructions. I’m not sure if that will be fast enough, given all the other potential sources of extra delay that I mentioned, but it just might.

David - April 29th, 2018 10:58 am

I’m not familiar with the ARM architecture, but if you could use a jump table or computed goto, it would save several instructions compared to the case statement logic.

Scott - April 29th, 2018 12:33 pm

It’s 2018. It’s extremely unlikely that you’ll beat the compiler’s optimizer at its own game.

Michel Engel - April 30th, 2018 4:07 pm

There seems to be quite a bit of optimization potential in the C code as well as the generated assembler code.

There is probably no need to clear the interrupt flags at the entry of the handler – clearing them just before returning would probably reduce the latency (unless interrupts occur back-to-back).

The ldr instruction at address 0xc86 is probably redundant (the volatile uint32_t declaration of selectedDriveRegister causes this). I think the volatile qualifier is not required here, since the value of the variable is not going to change in the handler.

A calculated branch, as mentioned by David, reduces the cycles required for the three separate comparisons against 7, 8, and 9. Depending on how it is implemented, it might waste some bytes of code space (e.g., by jumping to some base_address + 16 * selectedDriveRegister) due to required code duplication and padding, but this is probably less problematic.

Perhaps it’s a good idea to hand-code the IRQ handler in assembler. However, there are some pitfalls. The biggest one is that the Cortex-M automatically pushes only registers R0-R3, R12, LR, PC, and xPSR to the stack; all other registers used by the handler (R4-R11) have to be pushed and restored by code. Interrupts occurring back-to-back (i.e., an interrupt occurring while the handler is running or while the handler is exiting) use a special mode (tail-chaining) that saves a number of cycles in the process of preserving the contents of those registers.

Steve - April 30th, 2018 4:30 pm

Thanks for the analysis. If the interrupt flags aren’t cleared until just before returning, I believe it will create a potential bug. If the external state were to change again after the pins were read, but before the interrupt flags were cleared, then a second interrupt would not occur and the change would get ignored.

selectedDriveRegister is declared volatile for the benefit of main loop code, not shown here. From the point of view of the main loop, selectedDriveRegister can change at any moment. But I think I could avoid the extra ldr instruction by using a temporary variable in the interrupt handler.

I’ve seen the compiler generate a jump table for larger switch statements, so I have to assume it determined the if-else comparisons were faster in this situation. I seem to recall reading somewhere that explicitly loading the PC (as with a jump table) is an especially expensive operation. However, it’s interesting to wonder whether the compiler optimizes for the best average speed, or the best worst-case speed. I really only care about the worst case, which may not be what the compiler is optimizing for.

Joel Dillon - May 1st, 2018 6:22 am

I guess this has already been covered, but ‘your interrupt handlers are just regular C functions’ is specifically a selling point for the Cortex-M series. It’s not true of all ARM.

Doug Brown - May 1st, 2018 9:00 pm

I don’t know if it will make a difference in this small example, but you might want to consider trying -Os instead of -O2 for optimization. Might not help in this case but it would be interesting to compare the generated assembly.

manu - May 2nd, 2018 9:45 am

To minimize the uncertainty regarding pipeline refill/branch misprediction I would suggest getting rid of the branch/jump instructions in the time critical sections (i.e. the first 500ns) when possible.
One way of doing it would be to replace the switch statement with an array lookup and then apply the pre-computed configuration to the (port) registers.
The resulting code should be a sequence of load and store instructions that can be pipelined efficiently.
If the data is packed in a struct the compiler has to calculate the address only once and then can use relative addressing to load the elements.

E.g.
https://pastebin.com/VarpiNMA

For the different emulation modes I would suggest having a handler for each mode and setting the interrupt vector as needed.
To address the concerns regarding cache misses, flash wait states, etc., the datasheet suggests using tightly coupled memory (TCM, a locked section of the L1 cache). Given the complexity of setting it up (from what I read in the datasheet), it is probably something to be tried out at a later stage.

Steve - May 2nd, 2018 10:21 am

Interesting idea, thanks!

D Schultz - May 4th, 2018 4:38 pm

The reason that an ISR is just a regular C function is because the CPU pushes those registers to the stack that a C function can clobber. So any other registers it uses will be saved.

Then there is some magic in the return (bx lr). In a normal function return lr would hold the return address but during exception processing the CPU puts a special value there which tells it what needs to be done.

cbmeeks - May 10th, 2018 12:20 pm

One thing I’m wondering, why not use a faster ARM?

Or, IMHO, find another CPLD that is 5V tolerant.

I’m currently learning how to use the ATF1508 (and 04) which are still in production.

https://www.microchip.com/wwwproducts/en/ATF1508AS

Just a thought…
