Saturday, May 24, 2014

Fixed Daphne 64-bit overflow bug

So, not too long ago, I modified the way vertical blanking timing is calculated to be very accurate.  I was working on Star Rider emulation, so I didn't spend that much time thinking about overflow issues when designing this.  Well, it turns out that my initial design was pretty prone to overflow, plus it was computationally expensive.  Here is how I calculated CPU cycle counts relative to vertical blanking timing:

#define DIVIDER 63000000

void VblankTimer::OnTopFieldVblankStart(unsigned int uFrameIdx)
{
// this operation is essentially [lines finished] * [microsecond / line] * [ cycles / microsecond]
// (uFrameIdx * 525) * (4004/63) * (cpuHz / 1000000)

// TODO : would floating point math be faster here?

// determine the base cycle count using a VERY accurate, self-correcting, and expensive method
m_u64Line1StartCycle = (uFrameIdx * 525);
m_u64Line1StartCycle *= 4004;
m_u64Line1StartCycle *= m_u32CpuHz;
m_u64Line1StartCycle /= DIVIDER;

// set up next event
this->m_pCallback->RegisterCpuEvent(m_u64Line1StartCycle + m_u32TopVblankEndCycleOffset, OnTopFieldVblankEndCallback, this);
}

As I show in my comments, the basic algorithm to compute the CPU cycle count for when vblank of the top field should begin is:

currentFrame * 525 * 4004 / 63 * CpuHz / 1000000

Now, one can plug this into a calculator and get a pretty accurate floating point result.  For example, if the cpu frequency is 1 MHz (1,000,000 Hz) the answer would be 0 if the currentFrame is 0, or 33,366.666666 or 33,367 if one rounds it up.

However, doing math the way a typical calculator does it in this program is expensive, and on CPUs such as ARM that handle floating point operations (ie fractional numbers) poorly, doing floating-point math brings the CPU to its knees.

So it is much faster to do integer math.

However, when doing integer math, the result of any math operation discards any fraction.  So 3/2 is not 1.5 but instead 1.  That means in order to get the best accuracy, division should be saved until the last possible moment.

This means that the order that one would perform the math would be:

currentFrame * 525 * 4004 * CpuHz

and then divide the whole thing by 63 * 1000000

This makes for a very large number before the division takes place!

I was noticing that Dragon's Lair was locking up after running for a while (meaning almost 24 hours) and I tracked it down to when the currentFrame value is greater or equal to 2,193,848.  At 29.97 frames per second, this is 20.33 hours.

2,194,847 is still manageable.  2194847 * 525 * 4004 * 4000000 (Dragon's Lair CPU speed) is 0xFFFFFF20BC896C00 in hex (64-bit number).

However, 2194848 * 525 * 4004 * 4000000 when performed on calc.exe is 0x1DE62BA75C8000 in hex (64-bit number) which is less than the previous result.  This means that overflow has occurred.  So the solution here is to either try to do 128-bit math which is expensive no matter what or try to find another way to solve this problem :)

I decided that the best solution to avoid overflow issues like this is to get rid of the multiplication entirely and just use addition (adding onto previous result each new frame).  This is dangerous to do with integer math because it can lead to greater and greater inaccuracy over time.

However, I did some experimentation and found a pattern with part of the calculation.  currentFrame * 525 * 4004 / 63 has a repeating pattern which can be mimicked with integer math:

Frame: 0, time is 0.000000
Frame: 1, time is 33366.666667
Frame: 2, time is 66733.333333
Frame: 3, time is 100100.000000
Frame: 4, time is 133466.666667
Frame: 5, time is 166833.333333
Frame: 6, time is 200200.000000

Every 3rd value is a whole number.

I observed that this pattern can be approximated (without loss of accuracy over time) by adding 33367, 33366, and 33367 over and over again.

I've now implemented a new system that does just this: adds 33367,33366,33367 after a one-time calculation of multiplying these numbers by cpuhz/1000000.  This now is much faster, doesn't lose accuracy, and will still work naturally when the total cycle count overflows (which, since I am using a 64-bit number, will take quite a while to do).

No comments:

Post a Comment