
Jun 1, 2012

ARM Cortex-M3 instruction timing measured in a real test


Instruction   Width    Cycles
movw          32-bit   1
movt          32-bit   1
mov.w         32-bit   1
mul.w         32-bit   1
udiv          32-bit   2
ldrh          16-bit   2
ldr           16-bit   2
str           16-bit   2
b             16-bit   2
cmp           16-bit   1

Test method:
We toggle one GPIO and measure its low period with nothing in between; this is the default (baseline) time. We then add 50 copies of an ARM instruction in the middle of the GPIO low period and measure the low time again. Subtracting the default low time and dividing the difference by the AHB clock period (1 / 20 MHz = 50 ns) gives the number of cycles taken by the 50 instructions.
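One way to place 50 back-to-back copies of an instruction between the two GPIO edges is a repetition macro, so that no loop overhead is measured. This is only a sketch: the macro names REPEAT_5, REPEAT_10 and REPEAT_50 and the function count_repeat_50 are hypothetical and are not taken from the original test code, which does not say how the copies were generated.

```c
#include <assert.h>

/* Hypothetical helper macros (not from the original test project):
   expand a statement a fixed number of times so the copies are laid
   out back to back with no loop counter or branch in between. */
#define REPEAT_5(stmt)  stmt stmt stmt stmt stmt
#define REPEAT_10(stmt) REPEAT_5(stmt) REPEAT_5(stmt)
#define REPEAT_50(stmt) REPEAT_10(stmt) REPEAT_10(stmt) REPEAT_10(stmt) \
                        REPEAT_10(stmt) REPEAT_10(stmt)

/* Host-side check that REPEAT_50 really expands to 50 copies.
   On the target, the statement would be the instruction under test,
   for example REPEAT_50(__asm("movw r2, #41440 ");) placed between
   the write that drives the GPIO low and the write that drives it high. */
int count_repeat_50(void)
{
    int n = 0;
    REPEAT_50(n++;)
    assert(n == 50);
    return n;
}
```

Unrolling by macro keeps the measured region free of compare and branch instructions, which would otherwise add their own cycles to the result.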

AHB=20 MHz, Project release build configuration.
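At 20 MHz one AHB cycle lasts 50 ns, so the conversion from a measured low period to a cycle count can be sketched as follows. The function and parameter names are assumptions made for illustration; only the arithmetic comes from the test method.

```c
#include <assert.h>

/* Convert a measured GPIO-low period into AHB clock cycles.
   Names are illustrative, not from the original test code. */
double elapsed_cycles(double measured_ns, double baseline_ns, double ahb_hz)
{
    assert(ahb_hz > 0.0);
    assert(measured_ns >= baseline_ns);
    /* (ns * Hz) / 1e9 equals the elapsed time divided by the clock period */
    return (measured_ns - baseline_ns) * ahb_hz / 1e9;
}

/* Average cycles per instruction over `count` back-to-back copies. */
double cycles_per_instruction(double measured_ns, double baseline_ns,
                              double ahb_hz, unsigned count)
{
    assert(count > 0);
    return elapsed_cycles(measured_ns, baseline_ns, ahb_hz) / count;
}
```

For example, elapsed_cycles(5448, 368, 20e6) evaluates (5448 - 368) / 50 = 101.6 cycles, or roughly 2 cycles per instruction for 50 instructions.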

1)    Code running in Flash, Default GPIO low period is 368ns.

a)      Instruction only
i)        32bit width, 2 cycles
(1)   50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles is (5448-368) / 50 = 101, (100 cycles in theory)
ii)      32bit width, 1 cycle
(1)   50 “movw” instructions
__asm("movw r2, #41440 ");
Total cycles is (2888-368) / 50 = 50, (50 cycles in theory)
iii)    16bit width, 2 cycles
iv)    16bit width, 1 cycle
(1)   50 “cmp” instructions
__asm("cmp r2, r3 ");
Total cycles is (2868-368) / 50 = 50, (50 cycles in theory)
b)      Instruction with data access (0x20003000).
i)        32bit width, 2 cycles
ii)      32bit width, 1 cycle
iii)    16bit width, 2 cycles
Generally, load-store instructions take two cycles for the first access and one cycle for each additional access. Stores with immediate offsets take one cycle.
(1)   50 “str” instructions
__asm("str r2, [r7, #8] ");
Total cycles is (2961-368) / 50 = 51, (51 cycles in theory)
(2)   50 “ldrh” instructions
__asm("ldrh r3, [r7, #0] ");
Total cycles is (2966-368) / 50 = 51, (51 cycles in theory)
iv)    16bit width, 1 cycle

2)    Code running in SRAM, Default GPIO low period is 668ns.

a)      Instruction only
i)        32bit width, 2 cycles
(1)   50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles is (5708-668) / 50 = 100, (100 cycles in theory)
ii)      32bit width, 1 cycle
(1)   50 “movw” instructions
__asm("movw r2, #41440 ");
Total cycles is (5708-668) / 50 = 100, (50 cycles in theory)
iii)    16bit width, 2 cycles
iv)    16bit width, 1 cycle
(1)   50 “cmp” instructions
__asm("cmp r2, r3 ");
Total cycles is (3185-668) / 50 = 50, (50 cycles in theory)
b)      Instruction with data access. (0x20003000).
i)        32bit width, 2 cycles
ii)      32bit width, 1 cycle
iii)    16bit width, 2 cycles
Generally, load-store instructions take two cycles for the first access and one cycle for each additional access. Stores with immediate offsets take one cycle.
(1)   50 “str” instructions
__asm("str r2, [r7, #8] ");
Total cycles is (7304-668) / 50 = 132, (?? cycles in theory)
(2)   50 “ldrh” instructions
__asm("ldrh r3, [r7, #0] ");
Total cycles is (7005-668) / 50 = 126, (?? cycles in theory)
iv)    16bit width, 1 cycle

The SRAM results are not what we expected, so let’s analyze them. We found the following passage in the ARM Cortex-M3 Technical Reference Manual.
14.5.6. Pipelined instruction fetches
To provide a clean timing interface on the System bus, instruction and vector fetch requests to this bus are registered. This results in an additional cycle of latency because instructions fetched from the System bus take two cycles. This also means that back-to-back instruction fetches from the System bus are not possible.
Note:
Instruction fetch requests to the ICode bus are not registered. Performance critical code must run from the ICode interface.

From the above, we know that an instruction fetch from SRAM over the System bus needs two cycles. But why do some of the results still match our expectation? Let’s go through them one by one.
1.       “udiv”, 32bit width, 2 cycles instruction.
We know the ARM Cortex-M3 has a 3-stage pipeline: fetch->decode->execute. The stage that takes the most cycles determines the cycles per instruction. The execute stage of “udiv” takes two cycles, so increasing the fetch stage to two cycles does not affect the final result, and the result we got is the same as we calculated.
2.       “movw”, 32bit width, 1 cycle instruction.
Following the explanation above, the fetch stage now takes 2 cycles, so the final result becomes twice what we calculated. I have reconstructed the pipeline model, and this seems reasonable.
3.       “cmp”, 16bit width, 1 cycle instruction.
All Thumb instructions are halfword aligned in memory, so two Thumb instructions are fetched at a time. For sequential code, an instruction fetch is performed every second cycle.
Here are the words in spec "The PFU fetches instructions from the memory system that can supply one word each cycle. The PFU buffers up to three word fetches in its FIFO, which means that it can buffer up to three Thumb-2 instructions or six Thumb instructions. "
So each 2-cycle fetch from SRAM brings two Thumb instructions into the PFU. The processor takes its instructions from the PFU at a rate of one Thumb instruction per clock, so the result makes sense.
4.       “str” and “ldrh”, 16bit width, 2 cycles instruction with data access.
We know that without separate (Harvard-style) instruction and data paths, a data access cannot happen at the same time as an instruction fetch, but I failed to work out how that leads to the numbers we got.
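The reasoning in items 1 to 3 can be sketched as a small throughput model. It is a simplification assumed here for illustration, not a description of the real pipeline: every fetch returns one 32-bit word, and the slower of fetch and execute sets the rate. The function name model_total_cycles and its parameters are hypothetical. The model does not cover the “str” and “ldrh” case in item 4.

```c
#include <assert.h>

/* Simplified steady-state throughput model (an assumption made for
   illustration, not taken from the technical reference manual):
   - every fetch returns one 32-bit word and takes
     fetch_cycles_per_word cycles;
   - a 16-bit instruction uses half a word, a 32-bit one a full word;
   - the slower of fetch and execute sets the rate. */
unsigned model_total_cycles(unsigned count, unsigned width_bits,
                            unsigned exec_cycles,
                            unsigned fetch_cycles_per_word)
{
    assert(width_bits == 16 || width_bits == 32);
    /* Work in half-cycles so that a 16-bit instruction's share of a
       1-cycle fetch (0.5 cycle) stays an integer. */
    unsigned fetch_half = fetch_cycles_per_word * (width_bits / 16);
    unsigned exec_half = 2 * exec_cycles;
    unsigned per_instr_half = fetch_half > exec_half ? fetch_half : exec_half;
    return count * per_instr_half / 2;
}
```

With fetch_cycles_per_word set to 2 for the System bus, the model gives 100 cycles for 50 “udiv”, 100 for 50 “movw” and 50 for 50 “cmp”, in line with the SRAM measurements. With 1 cycle per word it gives 100, 50 and 50, in line with the flash measurements at 20 MHz.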

AHB=80 MHz, Project release build configuration.

1)    Code running in Flash, Default GPIO low period is 120ns.

a)      Instruction only
i)        32bit width, 2 cycles
(1)   50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles is (1370-120) / 12.5 = 100, (100 cycles in theory)
ii)      32bit width, 1 cycle
(1)   50 “movw” instructions
__asm("movw r2, #41440 ");
Total cycles is (1370-120) / 12.5 = 100, (50 cycles in theory)
iii)    16bit width, 2 cycles
iv)    16bit width, 1 cycle
(1)   50 “cmp” instructions
__asm("cmp r2, r3 ");
Total cycles is (761-120) / 12.5 = 51, (50 cycles in theory)
b)      Instruction with data access (0x20003000).
i)        32bit width, 2 cycles
ii)      32bit width, 1 cycle
iii)    16bit width, 2 cycles
Generally, load-store instructions take two cycles for the first access and one cycle for each additional access. Stores with immediate offsets take one cycle.
(1)   50 “str” instructions
__asm("str r2, [r7, #8] ");
Total cycles is (757-120) / 12.5 = 51, (51 cycles in theory)
(2)   50 “ldrh” instructions
__asm("ldrh r3, [r7, #0] ");
Total cycles is (781-120) / 12.5 = 53, (51 cycles in theory)
iv)    16bit width, 1 cycle

AHB=60 MHz, Project release build configuration.

1)    Code running in Flash, Default GPIO low period is 130ns.

a)      Instruction only
i)        32bit width, 2 cycles
(1)   50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles is (1803-130) / 16.7 = 100, (100 cycles in theory)
ii)      32bit width, 1 cycle
(1)   50 “movw” instructions
__asm("movw r2, #41440 ");
Total cycles is (1378-130) / 16.7 = 75, (50 cycles in theory)

AHB=50 MHz, Project release build configuration.

1)    Code running in Flash, Default GPIO low period is 160ns.

a)      Instruction only
i)        32bit width, 2 cycles
(1)   50 “udiv” instructions
__asm("udiv r3, r2, r3");
Total cycles is (2153-160) / 20 = 100, (100 cycles in theory)
ii)      32bit width, 1 cycle
(1)   50 “movw” instructions
__asm("movw r2, #41440 ");
Total cycles is (1640-160) / 20 = 74, (50 cycles in theory)
(2)   50 “mul.w” instructions
__asm("mul.w r3, r2, r3");
Total cycles is (1648-160) / 20 = 74, (50 cycles in theory)
iii)    16bit width, 1 cycle
(1)   50 “cmp” instructions
__asm("cmp r2, r3 ");
Total cycles is (1144-160) / 20 = 49, (50 cycles in theory)

I have no explanation for the results of the 32-bit, 1-cycle instructions at 50, 60 and 80 MHz, which take clearly more than the 50 cycles expected in theory…
