Saturday, June 21, 2014

DOOM on ZPU

The last few days I spent some time porting the beloved game DOOM to the ZPU CPU. For those of you who don't know what I am talking about: DOOM was a groundbreaking game released in the early '90s, and the ZPU is not only a Russian AA gun but also a 32-bit open source CPU.

DOOM v1.666 running in ZPU simulator.

So what is the point of doing this?
Well, it is just for fun I guess. A ZPU based game console will have a hard time beating the next Xbox 720 or PS4. In fact the ZPU is pretty slow and not very well suited for graphics.
That said, DOOM on ZPU is interesting from a profiling and verification point of view. Without any hard evidence I would say DOOM is a more interesting workload than Dhrystone. It also exercises the toolchain and hardware quite a bit.





Mr DOOM in trouble 
To summarize, these are the learnings/status:
  • With HW acceleration it is possible to play DOOM on an FPGA
  • long long operations and memcpy are not very efficient with the current toolchain
  • The port runs on a modified Phi platform
  • This is only tested in the emulator

The following is currently not supported in the port:
  • Sound 
  • Input
  • Network
I currently don't have an FPGA board, so I have only run it inside the ZPU simulator, which gets you about 1-2 FPS on a decent laptop. Not very playable, therefore I didn't bother with input or sound.

Some add-ons to the Phi platform were needed:
  • Mode 13 frame buffer
  • Doom1.wad preloaded into memory
  • 35Hz timer


Performance

So how fast is DOOM on ZPU? 
With the following setup:
  • All instructions implemented
  • 225 MHz 
  • Loads take 2 cycles (i.e. no RAM latency)
At 225 MHz, running 'timedemo demo3' achieves an average of 30.5 FPS, with occasional dips to 20-25 FPS. The resolution is the DOOM default, 320x200.

To reach the same performance on a PC, something like a 486 @ 66 MHz with a decent graphics card would be needed.

So is the ZPU about 1/4 the speed of a 486?
No, the comparison is a bit unfair:
  • In simulation the frame buffer has no latency; this is NOT true for computers from the 486 era. A good graphics card could almost double the FPS.
  • The ZPU port does not have any sound.
  • RAM access has 1 cycle latency in simulation, which is far from true for an old PC.
  • The x86 port has assembly optimizations for the inner loops; the ZPU port is C code only.
It would be hard to reach >100 MHz on an FPGA; as far as I know all ZPU implementations barely reach 100 MHz. But with some tricks it would still be possible to achieve decent performance (see the profiling section for details):
  • Some of the inner loops and fixed point arithmetic could be replaced by hardware units.
  • There is a frame buffer memcpy done every frame, this could be removed if a double buffering capable display controller is used.
  • The 64 kB frame buffer could reside in a dual-port on-chip RAM. This would give a very fast frame buffer.
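The double buffering idea above could be sketched like this. The display controller and its scan-out register are pure assumptions for illustration; no such peripheral exists on the Phi platform, so the register is passed in as a pointer here:

```c
#include <stdint.h>

#define SCREENWIDTH  320
#define SCREENHEIGHT 200

/* Two full Mode 13-sized buffers: the display controller scans out one
   while the renderer draws into the other, so no per-frame memcpy. */
static uint8_t framebuffers[2][SCREENWIDTH * SCREENHEIGHT];
static int back_index = 0;

/* Buffer the renderer should draw the next frame into. */
static uint8_t *backbuffer(void)
{
    return framebuffers[back_index];
}

/* Flip: point the (hypothetical) scan-out address register at the
   finished frame, then start drawing into the other buffer. On real
   hardware 'scanout_reg' would be a fixed memory-mapped address. */
static void flip(volatile uint32_t *scanout_reg)
{
    *scanout_reg = (uint32_t)(uintptr_t)framebuffers[back_index];
    back_index ^= 1;
}
```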


Profiling

The DOOM demo mode is very similar to real gameplay: it is normal gameplay but with the key strokes recorded, so demo runs are deterministic. The profiling has been run using the command line:
-timedemo demo3
(Note: that profiling was done on the original DOOM port; the ZPU port is ported from BOOM.)
Time spent in each frame, expressed as FPS. The workload varies a lot between frames.
Frame 1137, 76 FPS. The fastest frame. No time spent on ceilings or floors, and little time spent drawing walls.


Frame index 184, 14 FPS. The slowest frame. Way more fixed point multiplications than in surrounding frames; not sure why.


A typical frame takes about 7,400,000 cycles on the ZPU platform described above. Stairs, enemies, shooting and transparency are some of the things that increase render time. The table below contains a breakdown for frame 200 of the demo run; this frame is a good, typical frame. The frame takes 9,562,694 cycles to complete.
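As a sanity check on these numbers: at 225 MHz, a 7,400,000-cycle frame comes out at roughly 30 FPS, which matches the measured average. A trivial helper (names are mine, not from the port):

```c
/* Convert a per-frame cycle count into frames per second.
   E.g. 225e6 Hz / 7,400,000 cycles ~= 30.4 FPS, and the slower
   frame 200 (9,562,694 cycles) ~= 23.5 FPS. */
static double frame_fps(double clock_hz, double cycles_per_frame)
{
    return clock_hz / cycles_per_frame;
}
```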

Function name         Total cycles  Nbr calls  Avg cycles/call  Description
memcpy (all)               1414058       5356              263  All memcpy calls
memcpy (framebuffer)        224504          1           224504  memcpy to frame buffer
udivsi3                    2753956       1578             1745  long long div
muldi3                     2073756       2322              893  long long mult
R_DrawSpan                  837281        161             5200  Draw ceiling/floor
R_DrawColumn                647947        867              747  Draw walls
(This is just for frame 200, but other frames should be somewhat similar.)

There seems to be something fishy here. The majority of the time should be spent inside the R_DrawSpan and R_DrawColumn functions, since these are the functions that do all the pixel drawing. Instead most of the time is spent doing 64-bit arithmetic.
(See http://fabiensanglard.net/doomIphone/doomClassicRenderer.php for some interesting profiling results).


Frame 200. Used for function call profiling.

udivsi3 & muldi3

DOOM uses 64-bit multiplications as part of its fixed point arithmetic. Many old PCs lacked an FPU; I believe this is the reason why DOOM is integer only.

#define FRACBITS 16

__inline__ static fixed_t FixedMul(fixed_t a, fixed_t b)
{
  return (fixed_t)((long long) a*b >> FRACBITS);
}

__inline__ static fixed_t FixedDiv(fixed_t a, fixed_t b)
{
  return (abs(a)>>14) >= abs(b) ? (a^b)<0 ? MININT : MAXINT :
    (fixed_t)(((long long) a << FRACBITS) / b);
}
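As a quick worked example of the 16.16 format (FRACBITS = 16): 1.5 is 0x00018000 and 2.0 is 0x00020000, so FixedMul should return 3.0 = 0x00030000. A self-contained copy of the multiply above:

```c
#include <stdint.h>

#define FRACBITS 16
typedef int32_t fixed_t;

/* Same shape as the FixedMul above: 16.16 * 16.16 through a 64-bit
   intermediate, shifted back down to 16.16. */
static fixed_t FixedMul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> FRACBITS);
}

/* FixedMul(0x18000, 0x20000) -> 0x30000, i.e. 1.5 * 2.0 = 3.0.
   Multiplying by 1.0 (1 << FRACBITS) leaves the operand unchanged. */
```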

The ZPU does not have hardware support for 64-bit multiplication or division, so the compiler relies on software routines to implement the long long operations. A multiplication takes almost 900 cycles, and 400 of those cycles are spent inside two 8-byte memcpy calls. The division routine behaves similarly.
There is plenty of room for improvement in these routines. A good approach would be an assembler-optimized routine. Better, but not portable, would be to use dedicated hardware for the fixed point functions above. FPGAs usually have a lot of multipliers in them that can be used for the multiplication. Many FPGA vendors supply IPs for dividers, and there are also some available on opencores.com.
These hardware blocks would then be memory mapped, which would be easier than adding a new instruction to the instruction set. The hardware accelerated multiplier would be very quick, the divider a bit slower.
Including function overhead the FixedMul could take less than 50 cycles and the FixedDiv less than 100 cycles, but this is just a very wild guess.
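A driver for such a memory-mapped multiplier might look like the sketch below. The register layout, and the assumption that the result is ready by the time it is read back, are mine for illustration; on real hardware the struct would sit at a fixed I/O address instead of being passed in:

```c
#include <stdint.h>

typedef int32_t fixed_t;

/* Hypothetical register layout for a memory-mapped 16.16 multiplier.
   On real hardware this would live at a fixed address, e.g.
   (volatile struct fixmul_regs *)0x80000100 -- an assumption, not
   part of any existing ZPU platform. */
struct fixmul_regs {
    int32_t op_a;    /* write: first 16.16 operand                */
    int32_t op_b;    /* write: second 16.16 operand               */
    int32_t result;  /* read: (op_a * op_b) >> 16, filled by HW   */
};

static fixed_t FixedMul_hw(volatile struct fixmul_regs *regs,
                           fixed_t a, fixed_t b)
{
    regs->op_a = a;      /* writing the operands starts the multiply */
    regs->op_b = b;
    return regs->result; /* assume the result is ready when read */
}
```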

memcpy

If the Fixed functions are improved then the majority of the time spend in memcpy will disappear.
The memcpy complexity is approximately:

cycles_in_memcpy = 200 + bytes_to_copy * 3.5;

Any small copy has a pretty large overhead, and large copies aren't that fast either. I am having trouble locating which memcpy is being used; the only one I can find is this (zpugcc/toolchain/binutils/libiberty/bcopy.c):
void
bcopy (src, dest, len)
  register char *src, *dest;
  int len;
{
  if (dest < src)
    while (len--)
      *dest++ = *src++;
  else
    {
      char *lasts = src + (len-1);
      char *lastd = dest + (len-1);
      while (len--)
        *(char *)lastd-- = *(char *)lasts--;
    }
}
The assembler output doesn't look very similar to this, but GCC might have unrolled the loops.
The frame buffer memcpy takes about 2-3% of a frame, this could be removed completely by using hardware double buffering.
I have seen several posts about an optimized memcpy for the ZPU; have any of these reached the repository?
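A word-at-a-time copy along these lines would cut the ~3.5 cycles/byte considerably for the aligned case. This is only a sketch of the idea, not the toolchain's actual memcpy, and a real replacement would also handle a misaligned head:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: copy 32-bit words when both pointers are aligned, one
   load/store per 4 bytes instead of per byte; fall back to a byte
   loop for the tail and for the misaligned case. */
void *memcpy_words(void *dest, const void *src, size_t len)
{
    uint32_t *d = dest;
    const uint32_t *s = src;

    /* Fast path only when both pointers are 32-bit aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3) == 0) {
        while (len >= 4) {
            *d++ = *s++;
            len -= 4;
        }
    }

    /* Byte tail (and the whole copy when misaligned). */
    char *db = (char *)d;
    const char *sb = (const char *)s;
    while (len--)
        *db++ = *sb++;
    return dest;
}
```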

R_DrawColumn & R_DrawSpan

These are the functions that most of the time should be spent in. They draw the pixel values to the frame buffer (via a buffer). R_DrawColumn draws the walls and R_DrawSpan draws the floors and ceilings (see http://fabiensanglard.net/doomIphone/doomClassicRenderer.php for a good explanation).
This is one of the inner loops inside R_DrawColumn:

do
{
  // Re-map color indices from wall texture column
  //  using a lighting/special effects LUT.

  // heightmask is the Tutti-Frutti fix -- killough

  *dest = colormap[source[frac>>FRACBITS]];
  dest += SCREENWIDTH;
  if ((frac += fracstep) >= heightmask)
    frac -= heightmask;
}
while (--count);    

This loop would be a good candidate for hardware acceleration. The software would not need to wait for the hardware accelerated loop to finish; instead it could go on and do other things.
Each iteration takes about 30 cycles, and while the count value varies greatly, its average for this particular frame is about 20. So around 600 cycles could be saved per call.
R_DrawSpan has a similar but not identical inner loop that would also benefit greatly from hardware acceleration.
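The fire-and-forget pattern described above might look like this. The register block and its semantics are invented for illustration (no such accelerator exists); the point is that starting the blit returns immediately, so the CPU can prepare the next column while the hardware draws:

```c
#include <stdint.h>

/* Hypothetical register block for a column-blit accelerator. */
struct colblit_regs {
    uint32_t dest;      /* first destination pixel address           */
    uint32_t source;    /* texture column base address               */
    uint32_t colormap;  /* lighting LUT base address                 */
    uint32_t frac;      /* initial 16.16 texture coordinate          */
    uint32_t fracstep;  /* per-pixel 16.16 step                      */
    uint32_t count;     /* pixel count; writing this starts the blit */
    uint32_t busy;      /* nonzero while the blit is in flight       */
};

/* Program the registers and return immediately; the count write
   kicks off the blit. */
static void colblit_start(volatile struct colblit_regs *r,
                          uint32_t dest, uint32_t source,
                          uint32_t colormap, uint32_t frac,
                          uint32_t fracstep, uint32_t count)
{
    r->dest = dest;
    r->source = source;
    r->colormap = colormap;
    r->frac = frac;
    r->fracstep = fracstep;
    r->count = count;
}

/* Only needed before the CPU touches the destination pixels again. */
static void colblit_wait(volatile struct colblit_regs *r)
{
    while (r->busy)
        ;
}
```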

There is plenty of room for optimization: software-only optimizations would give a nice bump, but HW acceleration is needed to reach 35 FPS on an FPGA board clocked at 50 MHz.

OP code histogram

The emulator has a nice opcode histogram feature, which is useful for understanding the workload.
A first guess might be that a 3D game would have a pretty characteristic profile when it comes to which instructions are used the most. One might suspect lots of multiplications and divisions, but this is not the case.
Scroll down for the complete opcode histogram. It is calculated over the entire run, not just frame 200.
Here are some comments regarding some of the instructions:

loadsp / storesp / addsp 

The most common instructions are what I call the stack operation instructions: loadsp, storesp and addsp. I used to think of them as a necessity due to the lack of registers, or as the equivalent of registers in a register based architecture. In most workloads these will be among the most common instructions.
This is a typical example:
count = dc_yh - dc_yl; 
   32798: 73           loadsp 12
   32799: 76           loadsp 24
   3279a: 31           sub
   3279b: 55           storesp 20

A register based architecture would have something like:
       sub r20, r24, r12  // subtract r12 from r24 and store in r20

load

The 32-bit load is a very common instruction at 12.5%. 32-bit loads are used for accessing arrays and global variables, and there are many global variables in the code. They are often used in combination with the im instruction:

  fracstep = dc_iscale; 
   32532: af           im 47    
   32533: 99           im 25
   32534: c8           im -56
   32535: 08           load  // load from bccc8
00032536 <.LM6>:
  frac = dc_texturemid + (dc_yl-centery)*fracstep; 
   32536: 79           loadsp 36
   32537: af           im 47
   32538: b0           im 48
   32539: a4           im 36
   3253a: 08           load // load from bd824
   3253b: 31           sub
   3253c: 71           loadsp 4
   3253d: 29           mult
   3253e: af           im 47
   3253f: a0           im 32
   32540: 80           im 0
   32541: 08           load // load from bd000
   32542: 05           add
00032543 <.LM7>:
    register const byte *source = dc_source;            
   32543: af           im 47
   32544: 99           im 25
   32545: c4           im -60
   32546: 08           load // load from bccc4

All global variables are located close together in memory, but they are all fetched using absolute addresses, with im instructions used to build those addresses. In fact almost all arrays and global variables need 3 im instructions to address them, even though most global variables sit close together.
It would be nice to have something like a frame pointer and a relative load:

   im 47     // set fp to bccc4
   im 25
   im -60
   popfp

 fracstep = dc_iscale;    
   im -1
   loadfprel // load from bccc8

 frac = dc_texturemid + (dc_yl-centery)*fracstep; 
   loadsp 36
   im 5
   im 80
   loadfprel // load from bd824
   sub
   loadsp 4
   mult
   im 1
   im 79
   loadfprel // load from bd000

 register const byte *source = dc_source;            
   im 0
   loadfprel // load from bccc4

In the above example 3 instructions would be saved using a loadfprel instruction.
(To clarify, there is no such instruction. But it might be a good addition if there ever is a ZPUv2.)

loadb

Most pixel operations are done through char*, which in most cases compiles to loadb. It is surprising that loadb is only at 1.7%, given that this instruction is executed a lot in the inner loops of the R_Drawxxxx functions.



Opcode histogram for all frames

im 32.99%
storesp 12.77%
load 12.49%
add 9.35%
loadsp 8.39%
addsp 4.22%
neqbranch 3.17%
store 2.79%
loadb 1.76%
eq 1.39%
and 1.28%
storeb 1.16%
lshiftright 0.98%
ashiftright 0.61%
not 0.56%
sub 0.54%
lessthanorequal 0.53%
mult 0.51%
poppcrel 0.51%
lessthan 0.50%
nop 0.47%
pushspadd 0.36%
ulessthanorequal 0.35%
ulessthan 0.35%
popsp 0.33%
or 0.33%
poppc 0.21%
ashiftleft 0.16%
loadh 0.15%
callpcrel 0.15%
pushsp 0.06%
call 0.05%
storeh 0.05%
flip 0.02%
xor 0.01%
neg 0.009%

What is next?

It would be nice to port this to an FPGA, preferably to an already existing environment. Maybe the ZPUino?
Some things that the platform needs to handle:
  • Code size is ~700 kB
  • File data is ~4 MB
  • 4 MB RAM (either on-chip or off-chip + cache)
  • Display output (VGA/HDMI, preferably 6 bits per plane)
  • FPGA resources for a Mode 13 frame buffer



4 comments:

  1. Great work man! I really like these profiling/speed optimization studies.

  2. Awesome work! :D

    I think it might be fun to try and get it working on my FleaFPGA board:
    http://www.fleasystems.com/fleaFPGA.html

    Just curious, but is there a preferred way to contact you? Feel free to drop me an email when you can. Thanks! Valentin Angelovski

  3. would it run 100000 frames per second on the actual hardware?
