Posts: 1,418 | Thanked: 1,541 times | Joined on Feb 2008
#1
Hello, All!

I have just released the updated source code for EMULib, a library of emulation and service routines that includes image processing and audio synthesis. The new version adds Maemo support, with joystick emulation, direct frame buffer access, and assembler-optimized scaling routines. You can get the EMULib sources from

http://fms.komkon.org/EMUL8/

To see how EMULib can be used, check out the recently updated ColEm source code:

http://fms.komkon.org/ColEm
 

Posts: 503 | Thanked: 267 times | Joined on Jul 2006 @ Helsinki
#2
Some comments:
1. ioctl(FBFD,OMAPFB_VSYNC); is useless and does nothing (and if it actually waited for VSYNC, that would be bad for performance)
2. You don't need the OMAPFB_FORMAT_FLAG_FORCE_VSYNC flag (using it may actually screw up tearing synchronization); OMAPFB_FORMAT_FLAG_TEARSYNC alone is enough
3. And of course the license choice is bad
 
Posts: 1,418 | Thanked: 1,541 times | Joined on Feb 2008
#3
Originally Posted by Serge View Post
Some comments:
1. ioctl(FBFD,OMAPFB_VSYNC); is useless and does nothing (and if it actually waited for VSYNC, that would be bad for performance)
2. You don't need the OMAPFB_FORMAT_FLAG_FORCE_VSYNC flag (using it may actually screw up tearing synchronization); OMAPFB_FORMAT_FLAG_TEARSYNC alone is enough
Understood. Will fix for the next version.
 
Posts: 503 | Thanked: 267 times | Joined on Jul 2006 @ Helsinki
#4
Originally Posted by fms View Post
Understood. Will fix for the next version.
Good. Also let me know if you see any tearing with the latest Diablo firmware; there should not be any.

By the way, your assembly code is not well scheduled for ARM11. For example, LibARM.s contains lots of chunks of code like this:
Code:
	mov r14,r5,lsr #16
	orr r14,r14,r14,lsl #16
	mov r12,r5,lsl #16
	orr r12,r12,r12,lsr #16
The problem is that a shifted register operand is an "Early Reg" and increases latency by 1 (see the ARM11 TRM, section "Cycle Timings and Interlock Behavior", if you are interested in improving performance). In this particular case you pay a 2-cycle penalty because of pipeline stalls: you need to wait one extra cycle after a register is modified before you can use it as a shifted operand, and that happens twice here. Simply reordering the instructions is faster (4 cycles instead of 6), with supposedly no harm for other ARM cores (and it is surely also better for superscalar cores such as Cortex-A8, because it allows dual issue):
Code:
	mov r14,r5,lsr #16
	mov r12,r5,lsl #16
	orr r14,r14,r14,lsl #16
	orr r12,r12,r12,lsr #16
The ARM11 pipeline is not that complex (much simpler than x86 cores for sure), and it is usually possible to predict how it will behave and how to make code run faster on it.
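The 6-versus-4 cycle count can be reproduced with a toy model in C. This is only a sketch under one simplifying assumption taken from the explanation above: a register written in one cycle cannot be used as a shifted operand until one extra cycle has passed, and everything else issues at one instruction per cycle. The names `struct insn`, `count_cycles`, `original_order` and `reordered` are made up for illustration; this is not a real ARM11 pipeline simulator.

```c
#include <assert.h>

/* One instruction in the toy model: the register it writes and the
   register it reads through the shifter (-1 if it has none). */
struct insn {
    int dest;
    int shifted_src;
};

/* Count cycles for a straight-line sequence under one simplified rule:
   a register written in cycle c may feed the shifter no earlier than
   cycle c + 2. Otherwise one instruction issues per cycle. */
static int count_cycles(const struct insn *code, int n)
{
    int ready[16] = {0};   /* earliest cycle each register may be shifted */
    int cycle = 0;

    for (int i = 0; i < n; i++) {
        assert(code[i].dest >= 0 && code[i].dest < 16);
        cycle++;                              /* normal issue slot */
        int s = code[i].shifted_src;
        if (s >= 0 && ready[s] > cycle)
            cycle = ready[s];                 /* stall until operand is ready */
        ready[code[i].dest] = cycle + 2;
    }
    return cycle;
}

/* mov r14,r5,lsr #16 / orr r14,r14,r14,lsl #16 /
   mov r12,r5,lsl #16 / orr r12,r12,r12,lsr #16 */
static const struct insn original_order[4] = {
    {14, 5}, {14, 14}, {12, 5}, {12, 12}
};

/* Same four instructions with both mov first, then both orr. */
static const struct insn reordered[4] = {
    {14, 5}, {12, 5}, {14, 14}, {12, 12}
};
```

Calling `count_cycles(original_order, 4)` and `count_cycles(reordered, 4)` compares the two orderings under this rule. The two mov/orr chains touch different destination registers, so the reordering does not change the computed values.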

To make life easier and to check that you have scheduled instructions properly without missing anything, you can use oprofile and collect CYCLES_DATA_STALL events. Because of the pipeline properties, the samples do not point exactly at the poorly scheduled instruction but are reported with some delay. So if you are looking at 'opannotate' output and see spikes of CYCLES_DATA_STALL samples, the offending code is usually a few lines above. Checking the ARM11 TRM helps to understand why exactly you got the pipeline stall.

Optimizations that improve memory access performance are also important. ARM processors usually don't allocate a cache line on a write miss, but use a write buffer to store data to memory. This means special care needs to be taken with writes to memory, as they may become a bottleneck. On OMAP1710 (Nokia 770) and OMAP2420 (Nokia N800/810), it happens that 16-byte aligned stores of exactly 4 registers with the STM instruction can make use of burst transfers, and performance is much better (roughly twice as fast). So, somewhat counterintuitively, instead of
Code:
	LDM {set of 8 registers}
	STM {set of 8 registers}
it is better to use
Code:
	LDM {set of 8 registers}
	STM {set of first 4 registers}
	STM {set of last 4 registers}
That is, of course, only if the destination address is 16-byte aligned. Otherwise burst writes are not used, the memory bus is not used efficiently, and the code is slower. You can also check the following code, which implements this trick: https://garage.maemo.org/plugins/scm...er&view=markup

I'm not sure whether the same burst write optimization is useful for other ARM processors, though, because it may be too platform/microarchitecture specific.
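The shape of that copy loop can be sketched in C. This is not the code behind the link above, and a C compiler is free to pick whatever load/store instructions it likes, so the burst-write benefit really needs hand-written assembly. The sketch only shows the structure: align the destination to 16 bytes first, then write each 32-byte chunk as two separate 16-byte groups. The function name `copy_aligned16` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes so that, in the main loop, every 16-byte group is written
   to a 16-byte aligned destination address. */
static void copy_aligned16(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    /* Head: copy bytes until the destination is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: load 32 bytes (the "LDM of 8 registers"), then store them as
       two 16-byte groups (the two "STM of 4 registers"). */
    while (n >= 32) {
        uint32_t r[8];
        memcpy(r, s, 32);
        memcpy(d, r, 16);           /* first 4 words, 16-byte aligned */
        memcpy(d + 16, r + 4, 16);  /* last 4 words, also 16-byte aligned */
        d += 32;
        s += 32;
        n -= 32;
    }

    /* Tail: whatever is left over. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

In the assembly version, the head loop is what guarantees the alignment condition mentioned above; without it the 4-register stores would not start on 16-byte boundaries.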
 
