Friday, August 31, 2012

Raspberry Pi JPEG decoding to RGB now

After spending what feels like nearly a week, I've successfully got the Pi decoding to RGBA instead of YV12.  This was a lot harder than it sounds because I had to add a hardware "resizer" onto the hardware "decoder".  It is the resizer that does the colorspace conversion from YV12 to RGBA.  Getting these two to interact with each other was quite tricky despite having example source code to refer to.  But it's all working now, so the next step is to add support for RGBA surfaces to my existing GLES2 code and then modify my LDImage code to also work with RGBA surfaces in addition to the YUV444 surfaces that it already supports.  Finally, I need to copy the Pi's reference EGL code into the Dexter code to make my GLES2 code work on the Pi.
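For reference, here is a minimal software sketch of the kind of per-pixel colorspace conversion the resizer performs.  It assumes the full-range JFIF (BT.601) YCbCr equations that JPEG files normally use; I have not confirmed the exact coefficients the Pi hardware applies, and the function names are made up for illustration.

```python
def clamp8(value):
    """Round to the nearest integer and clamp into the 0..255 byte range."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgba(y, cb, cr):
    """Convert one full-range YCbCr sample to an opaque RGBA tuple.

    Assumption: JFIF / BT.601 full-range coefficients, with chroma
    centered on 128.
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b), 255


if __name__ == "__main__":
    # Neutral chroma (128, 128) leaves a gray pixel at the luma value.
    print(ycbcr_to_rgba(128, 128, 128))
```

With neutral chroma the three color channels simply equal the luma value, which makes gray test images a handy sanity check when comparing against something like GIMP.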

And then... video should work!

Tuesday, August 28, 2012

The YV12->YUV444 method is the culprit!

My software implementation of the YV12-to-YUV444 conversion is apparently dog slow, so I will need to take steps to eliminate it (either by sending the YV12 image straight to GLES2 or by using hardware to convert the image to RGB instead).
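To show what this conversion involves, here is a rough sketch of a nearest-neighbor YV12 to YUV444 expansion.  It is not the Dexter code: the function name is hypothetical, and it assumes the output is packed Y,U,V triplets (the real YUV444 surface layout may differ).  YV12 stores a full-resolution Y plane, then a quarter-size V plane, then a quarter-size U plane.

```python
def yv12_to_yuv444(buf, width, height):
    """Expand a YV12 buffer to packed YUV444 by repeating chroma samples.

    Assumptions: even width/height, tightly packed planes (no row padding),
    and packed Y,U,V output order.
    """
    if width % 2 or height % 2:
        raise ValueError("width and height must be even")
    chroma_w, chroma_h = width // 2, height // 2
    y_size = width * height
    chroma_size = chroma_w * chroma_h
    if len(buf) != y_size + 2 * chroma_size:
        raise ValueError("buffer size does not match YV12 layout")

    y_plane = buf[:y_size]
    v_plane = buf[y_size:y_size + chroma_size]   # V comes first in YV12
    u_plane = buf[y_size + chroma_size:]

    out = bytearray(3 * y_size)
    i = 0
    for row in range(height):
        chroma_row = (row // 2) * chroma_w
        for col in range(width):
            c = chroma_row + col // 2            # each chroma sample covers 2x2 pixels
            out[i] = y_plane[row * width + col]
            out[i + 1] = u_plane[c]
            out[i + 2] = v_plane[c]
            i += 3
    return bytes(out)


if __name__ == "__main__":
    # 2x2 image: four luma samples sharing one V (9) and one U (7) sample.
    print(list(yv12_to_yuv444(bytes([1, 2, 3, 4, 9, 7]), 2, 2)))
```

The inner loop writes every output byte individually and triples the size of the luma plane, which gives a feel for why doing this on the CPU for every field is expensive.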

With the YV12->YUV444 algorithm active, here is the benchmark:

pi@raspberrypi ~/vldp-hw/src/unit_tests $ ./daphne_test.bin
Starting test jpeg1_rpi
Total time: 3573 ms (83.963056 FPS)
Stopping test jpeg1_rpi (3610 ms)

With it commented out (and thus breaking functionality), here is the benchmark:

pi@raspberrypi ~/vldp-hw/src/unit_tests $ ./daphne_test.bin
Starting test jpeg1_rpi
Total time: 1635 ms (183.486239 FPS)
Stopping test jpeg1_rpi (1671 ms)

These are release builds.  83 FPS is actually still acceptable, but seeing that it would otherwise be 183 FPS tells me that I should still try to optimize.
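Put in per-frame terms, the two runs above work out as follows.  This is just arithmetic on the reported totals; the 300-frame count is an assumption inferred from the numbers (3573 ms at 83.963056 FPS comes to 300 frames), and 59.94 is the field rate the player has to hit.

```python
FRAMES = 300          # assumption: inferred from total time * reported FPS
TARGET_FPS = 59.94    # field rate needed for full-speed playback


def ms_per_frame(total_ms, frames=FRAMES):
    """Average time spent on one frame, in milliseconds."""
    return total_ms / frames


with_conversion = ms_per_frame(3573)      # decode + YV12->YUV444
without_conversion = ms_per_frame(1635)   # decode only
conversion_cost = with_conversion - without_conversion
budget = 1000.0 / TARGET_FPS              # time available per field

if __name__ == "__main__":
    print(f"with conversion:    {with_conversion:.2f} ms/frame")     # 11.91
    print(f"without conversion: {without_conversion:.2f} ms/frame")  # 5.45
    print(f"conversion cost:    {conversion_cost:.2f} ms/frame")     # 6.46
    print(f"budget at 59.94:    {budget:.2f} ms/field")              # 16.68
```

So the software conversion alone costs more time per frame than the hardware decode does.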

OpenMAX JPEG decoder working inside Dexter code

I've successfully gotten the OpenMAX JPEG decoder working inside Dexter code, although it does not yet render to the screen (just to a buffer).  There is some extra software overhead that was not present before, and it is still running too slowly to play at full speed (currently 49.97 FPS and it needs to be over 60), but I still have quite a few optimizations I can apply to get it up to speed.  Here's a comparison right now using libjpeg vs hardware:

pi@raspberrypi ~/vldp-hw/src/unit_tests $ ./daphne_test_dbg.bin
Starting test jpeg1_libjpeg
Total time: 16128 ms (18.601190 FPS)
Stopping test jpeg1_libjpeg (16147 ms)
pi@raspberrypi ~/vldp-hw/src/unit_tests $ ./daphne_test_dbg.bin
Starting test jpeg1_rpi
Total time: 6003 ms (49.975012 FPS)
Stopping test jpeg1_rpi (6034 ms)

Friday, August 24, 2012

Got abbreviated JPEG converted to full JPEG and ran benchmark

I read a little bit of the JPEG specification documents -- enough to accomplish my goal of converting an abbreviated JPEG and JPEG header buffer back to a full JPEG -- and logged my findings here.  I then re-ran the benchmark on one of these 640x240 files and came up with this result:

pi@raspberrypi /opt/vc/src/hello_pi/hello_jpeg $ ./hello_jpeg.bin nukeme.jpg
time_T is 4
Total elapsed milliseconds: 1562
Total frames decoded: 300
Total frames / second is 192.061460

That's looking great.  It needs to be 59.94 frames per second and it is 192.  So it's more than 3X as fast as it needs to be.  This is looking very good.

Now I need to modify some of the LDImage code to use OpenMAX instead of libjpeg.

Thursday, August 23, 2012

Hardware decoding more than fast enough!

I've modified my test JPEG decode program to loop through 300 frames in as rapid succession as possible and the results are _very_ encouraging!

pi@raspberrypi /opt/vc/src/hello_pi/hello_jpeg $ ./hello_jpeg.bin lair1.jpg
Total elapsed milliseconds: 2346
Total frames decoded: 300
Total frames / second is 127.877238

The source JPEG is 640x480 in size.  The .ldimg JPEGs are 640x240 or 720x240 in size since they are fields instead of frames and they need to decode at 59.94 frames per second to run at full speed.  So in other words, the hardware JPEG decoder is more than adequate to the task! (it would need to be decoding at 29.97 frames per second and it is running at 127.9 so that is awesome!)
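The benchmark loop itself is simple; a sketch of the idea looks like the following.  The `decode_frame` callable is a stand-in for whatever actually decodes one JPEG (here it is only a placeholder), and the helper names are made up for illustration.

```python
import time


def measure_fps(decode_frame, frame_count=300):
    """Call decode_frame() frame_count times; return (elapsed_ms, fps)."""
    start = time.perf_counter()
    for _ in range(frame_count):
        decode_frame()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    fps = frame_count * 1000.0 / elapsed_ms if elapsed_ms > 0 else float("inf")
    return elapsed_ms, fps


def headroom(measured_fps, required_fps):
    """How many times faster than required the decoder is running."""
    return measured_fps / required_fps


if __name__ == "__main__":
    # Placeholder workload standing in for a real decode call.
    elapsed_ms, fps = measure_fps(lambda: sum(range(1000)))
    print(f"Total elapsed milliseconds: {elapsed_ms:.0f}")
    print(f"Total frames / second is {fps:f}")
    # Using the measured 127.877238 FPS against the 29.97 FPS frame rate:
    print(f"Headroom: {headroom(127.877238, 29.97):.2f}x")
```

Against the 29.97 FPS requirement for full 640x480 frames, the measured rate is a bit over four times what is needed.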

Difference between full JPEG and headers+abbreviated

The libjpeg library I have been using supports creating "abbreviated" JPEGs which basically means that one can create a sequence of JPEGs that all share the same headers.  This cuts down on disk size.  This is what I have done with the .LDIMG file format.  But now that I am working with hardware decoders that apparently have no concept of this, it becomes my challenge to recreate the original "full" JPEG from the "headers" and the "abbreviated" JPEG.  Here is a file compare of how the headers and abbreviated content relate to the full.  As you may be able to see, it appears that a careful algorithm can do this reconstruction fairly simply without having to understand the JPEG header format at all.  At least, that's my hope as I do not want to spend time understanding the JPEG header.
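As a rough illustration of what such a reconstruction could look like, here is a sketch that splices the table segments from the headers buffer into the abbreviated image just before its frame header.  This is an assumption about the layout, not the algorithm I ended up with: it presumes the headers buffer is a tables-only stream (SOI, table segments, EOI) and that the abbreviated JPEG starts with SOI and contains an SOF marker.  The function names are hypothetical.

```python
SOI = b"\xff\xd8"   # start of image
EOI = b"\xff\xd9"   # end of image
SOS = 0xDA          # start of scan
# SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def find_first_sof(jpeg):
    """Return the byte offset of the first SOFn marker after SOI."""
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = jpeg[pos + 1]
        if marker in SOF_MARKERS:
            return pos
        if marker == SOS or marker == 0xD9:
            raise ValueError("no SOF marker before scan data")
        seg_len = int.from_bytes(jpeg[pos + 2:pos + 4], "big")
        if seg_len < 2:
            raise ValueError("bad segment length")
        pos += 2 + seg_len          # marker (2 bytes) + segment
    raise ValueError("no SOF marker found")


def rebuild_full_jpeg(tables, abbreviated):
    """Insert the shared table segments ahead of the image's SOF marker."""
    if not (tables.startswith(SOI) and tables.endswith(EOI)):
        raise ValueError("tables buffer is not SOI ... EOI")
    if not abbreviated.startswith(SOI):
        raise ValueError("abbreviated image does not start with SOI")
    table_segments = tables[2:-2]   # drop the SOI and EOI wrappers
    sof = find_first_sof(abbreviated)
    return abbreviated[:sof] + table_segments + abbreviated[sof:]
```

This would also be consistent with plain concatenation not working: joining the two buffers end to end leaves an EOI and a second SOI in the middle of the stream.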

Wednesday, August 22, 2012

Got basic hardware accelerated JPEG decoding working on Raspberry Pi!

I've got OpenMAX hardware JPEG decoding working on the Raspberry Pi!  I wrote a bunch of .cpp/.h files to make doing the OpenMAX API calls nice and organized so I should be able to use this in the Dexter source code.

Things left to do before I can say that the JPEG decoding problem is conquered:
- decode multiple images in sequence and verify that the speed is acceptable.
- decode the "abbreviated" JPEGs that I am using inside the LDImage file format.  The pi hardware can't handle these abbreviated JPEGs so I will need to figure out how to construct a "regular" JPEG from the "abbreviated" JPEG.  Concatenating the JPEG headers with the abbreviated JPEG does not do the trick unfortunately.
- (optional) decode to RGBA format instead of YV12 just to see if I can.  Then I can easily compare the pixel values with what I see inside something like GIMP.

Here are all of the files I wrote to accomplish this (it's an impressive amount of work in a short period of time!)

pi@raspberrypi /opt/vc/src/hello_pi/hello_jpeg $ ls -l *.cpp *.h
-rw-r--r-- 1 pi pi 5983 Aug 23 03:19 hello_jpeg.cpp
-rw-r--r-- 1 pi pi  139 Aug 22 17:33 ILocker.h
-rw-r--r-- 1 pi pi  166 Aug 22 19:15 ILogger.h
-rw-r--r-- 1 pi pi  820 Aug 22 17:11 Locker.cpp
-rw-r--r-- 1 pi pi  496 Aug 22 16:54 Locker.h
-rw-r--r-- 1 pi pi  128 Aug 22 19:31 Logger.cpp
-rw-r--r-- 1 pi pi  159 Aug 22 19:29 Logger.h
-rw-r--r-- 1 pi pi  381 Aug 22 17:07 MyDeleter.h
-rw-r--r-- 1 pi pi 6111 Aug 23 03:16 OMXComponent.cpp
-rw-r--r-- 1 pi pi 3136 Aug 23 03:10 OMXComponent.h
-rw-r--r-- 1 pi pi 1365 Aug 22 19:26 OMXCore.cpp
-rw-r--r-- 1 pi pi  826 Aug 22 19:26 OMXCore.h

And here is what it looks like to run my hello_jpeg.bin program on an arbitrary JPEG:

pi@raspberrypi /opt/vc/src/hello_pi/hello_jpeg $ ./hello_jpeg.bin lair1.jpg
Got event: 0
Got event: 0
Got event: 0
Got event: 0
Got event: 0
Got EmptyBufferDone
Got EmptyBufferDone
Got event: 3
Width: 640 Height: 480 Output Color Format: 20 Buffer Size: 460800
Got event: 0
Got event: 4
Got FillBufferDone

Color format "20" is YUV420PackedPlanar mode (OMX_COLOR_FormatYUV420PackedPlanar in the OpenMAX IL headers; plain YUV420Planar is 19).

pi@raspberrypi /opt/vc/src/hello_pi/hello_jpeg $ ls -l output.raw
-rw-r--r-- 1 pi pi 460800 Aug 23 04:36 output.raw

This means that the file is organized with the Y plane coming first at a full resolution of 640x480, 8 bits per pixel.  So that takes up 307200 bytes.  Then comes the V plane (I believe) at half resolution in both directions, 320x240, so that is 76800 bytes.  Then the U plane, also at 320x240, for another 76800.  307200 + 76800 + 76800 does indeed equal 460800 bytes.  So it appears to be wookin' perfectly!
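The size check above can be written out as a small sketch.  It assumes tightly packed 8-bit planes with no row padding and chroma subsampled by two in both directions; which chroma plane comes first is left open, since I am not certain of the order.  The function names are made up for illustration.

```python
def yuv420_planar_layout(width, height):
    """Return (offset, size) for each plane of a 4:2:0 planar buffer."""
    y_size = width * height
    chroma_size = (width // 2) * (height // 2)
    return {
        "y": (0, y_size),
        "chroma_a": (y_size, chroma_size),                # first chroma plane
        "chroma_b": (y_size + chroma_size, chroma_size),  # second chroma plane
        "total": y_size + 2 * chroma_size,
    }


def split_planes(raw, width, height):
    """Slice a raw buffer into its three planes, checking the total size."""
    layout = yuv420_planar_layout(width, height)
    if len(raw) != layout["total"]:
        raise ValueError("buffer is %d bytes, expected %d" % (len(raw), layout["total"]))
    return tuple(raw[off:off + size]
                 for off, size in (layout["y"], layout["chroma_a"], layout["chroma_b"]))


if __name__ == "__main__":
    print(yuv420_planar_layout(640, 480))
```

For 640x480 this gives a 307200-byte Y plane followed by two 76800-byte chroma planes, 460800 bytes in total, matching the size of output.raw.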

Monday, August 20, 2012

Raspberry Pi is awesome!

I got a power supply for the Raspberry Pi and tried it out tonight.  It is awesome!  I love it much more than the Beagleboard already.

  • Doesn't need X to render to the TV (GLES2 or video hardware decoding)
  • Boots up faster than the Beagleboard
  • Defaults to TV-out mode upon bootup
  • You can run the GLES2/video apps over SSH and debug with GDB without crashing the whole thing.
  • Much, much cheaper than the Beagleboard!
All in all, I am very optimistic about the Raspberry Pi doing everything Dexter needs it to do!

I will continue to learn the OpenMAX API and hope to have something really cool to show soon. :)

Belated CAX report

I had this conversation saved in my text editor so before I lose it I will post it here.  It's Warren talking about Firefox being shown running Dexter at this year's CAX (which I did not attend).

<Warren_O> I think the highpoint of the show was when Owen Rubin played Firefox, and remarked on how smoothly / seamlessly it played.
<Warren_O> it played beautifully the entire time we had it running, which was most of the show
<zaphX> No plobels at all?
<Warren_O> I think saw a few short freezes, which were probably due to the Windows 7 PC, and what looked like occasional single-field overruns (or under?)
<Warren_O> we're not sure if they were due to the game itself, or if dexter didn't process the skip commands until the next field
<Warren_O> but it played ASSOME
<Warren_O> We ran it briefly on a real VP931, and it looked crappy by comparison
<Warren_O> it did skip properly, but IIRC you could still tell that something was happening
<Warren_O> and then it crapped out :)
<Warren_O> when we tried switching discs to see if that was the problem, the lid interlock broke, so it wouldn't spin up anymore
<Warren_O> (Doug Jeffreys had fixed the interlock just before the show, so I guess it still wasn't quite right.)
<Warren_O> Unfortunately, this happened just before I was going to hook up my logic analyzer
<Warren_O> (I was finishing up soldering the passthrough adapter when this happened)

Friday, August 17, 2012

Beagleboard or Raspberry Pi will need hardware JPEG decoding

I've optimized the software JPEG decoding on the Beagleboard as much as I possibly can (using the libjpeg-turbo library) and also profiled Dexter. With no JPEG decoding running but everything else active, it was using about 9-10% CPU, which is acceptable. Here is what the profile shows as using all of the resources on the Beagleboard:

(with no JPEG decoding running the two methods taking up all the time are listener::Think and update_soundbuffer)

CPU: ARM Cortex-A8, speed 0 MHz (estimated)
Counted CPU_CYCLES events (Number of CPU cycles) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
968187   32.7438  dexter.bin               h2v2_fancy_upsample
745262   25.2046  dexter.bin               null_convert
614287   20.7750  dexter.bin               decode_mcu
365378   12.3570  dexter.bin               jsimd_idct_ifast_neon
95491     3.2295  dexter.bin               decompress_onepass
22163     0.7495  dexter.bin               jsimd_idct_ifast
16686     0.5643  dexter.bin               jpeg_make_d_derived_tbl
14974     0.5064  dexter.bin               __jsimd_idct_ifast_neon_from_thumb
13994     0.4733  dexter.bin               sep_upsample
8031      0.2716  dexter.bin               process_data_context_main
6007      0.2032  dexter.bin               jpeg_fill_bit_buffer
5789      0.1958  dexter.bin               jpeg_read_scanlines
4332      0.1465  dexter.bin               listener::Think(unsigned int)
3232      0.1093  dexter.bin               read_markers
3147      0.1064  dexter.bin               start_pass
2507      0.0848  dexter.bin               LDImageJPEG::DecompressAbbreviated(void*, unsigned char const*, unsigned int, bool)
2256      0.0763  dexter.bin               jinit_master_decompress
2078      0.0703  dexter.bin               ldp_img::OnVBlankStopped()
1904      0.0644  dexter.bin               jpeg_huff_decode
1899      0.0642  dexter.bin               LoopCommon::Think()
1799      0.0608  dexter.bin               alloc_small
1678      0.0567  dexter.bin               VideoObjectGLES2::LoadYUV444Field(void const*, unsigned int)
1563      0.0529  dexter.bin               audio_write_buf(void const*, unsigned int, void* (*)(void*, void const*, unsigned int), unsigned int)
1377      0.0466  dexter.bin               get_sof
1372      0.0464  dexter.bin               serial_rx_char_waiting()
1255      0.0424  dexter.bin               mpom::read_lile64(void*)
1192      0.0403  dexter.bin               start_pass_main
1164      0.0394  dexter.bin               SerialStream::Read(void*, unsigned int, unsigned int)
1157      0.0391  dexter.bin               serial_rx()
1149      0.0389  dexter.bin               consume_markers
1074      0.0363  dexter.bin               MpoContainer::JumpToBlob(unsigned long long)
1062      0.0359  dexter.bin               jinit_upsampler
1053      0.0356  dexter.bin               mpo_read(void*, unsigned int, unsigned int*, mpo_io*)
1041      0.0352  dexter.bin               jzero_far
982       0.0332  dexter.bin               numstr::my_strlen(char const*)
974       0.0329  dexter.bin               LDImageJPEGThreadStart(void*)
963       0.0326  dexter.bin               fullsize_upsample
909       0.0307  dexter.bin               LDImage::LoadVideoFieldStart()
854       0.0289  dexter.bin               jpeg_read_header
814       0.0275  dexter.bin               MpoContainer::StartReadBlob(unsigned int&)
763       0.0258  dexter.bin               jinit_color_deconverter
762       0.0258  dexter.bin               jpeg_calc_output_dimensions
729       0.0247  dexter.bin               jinit_d_coef_controller
727       0.0246  dexter.bin               examine_app0
721       0.0244  dexter.bin               update_soundbuffer(unsigned int)
699       0.0236  dexter.bin               MpoPipe::BlockingWrite(void const*, unsigned int, unsigned int*)
697       0.0236  dexter.bin               main
691       0.0234  dexter.bin               prepare_for_output_pass
673       0.0228  dexter.bin               listener::ProcessPacket()

Thursday, August 16, 2012

Beagleboard video working.. kinda..

I've got the GLES2 code working "perfectly" on the Beagleboard now.  The two problems are that I have only been able to get it rendering in a window on a desktop (instead of fullscreen) and it is far too slow right now.  But this does represent fantastic progress because I had to write and rewrite a lot of code to get this far.  Just having the correct image displayed with graphical overlay is huge.

Saturday, August 11, 2012

Beagleboard can see Dexter's serial port!

For fun I just tried plugging Dexter into the Beagleboard and I was very happy to see that it autodetected the USB serial port. I was able to see some Dexter chatter using minicom. This was expected but is still very exciting. It validates my decision to use a USB serial port on the Dexter board instead of a traditional DB9 port (which is rapidly becoming obsolete).

[68528.769439] usb 1-2.3: new full speed USB device using ehci-omap and address 4
[68528.973022] usbcore: registered new interface driver usbserial
[68528.973175] USB Serial support registered for generic
[68528.981475] usbcore: registered new interface driver usbserial_generic
[68528.981506] usbserial: USB Serial Driver core
[68529.004516] USB Serial support registered for FTDI USB Serial Device
[68529.008178] ftdi_sio 1-2.3:1.0: FTDI USB Serial Device converter detected
[68529.008605] usb 1-2.3: Detected FT232RL
[68529.008636] usb 1-2.3: Number of endpoints 2
[68529.008636] usb 1-2.3: Endpoint 1 MaxPacketSize 64
[68529.008666] usb 1-2.3: Endpoint 2 MaxPacketSize 64
[68529.008666] usb 1-2.3: Setting MaxPacketSize 64
[68529.010101] usb 1-2.3: FTDI USB Serial Device converter now attached to ttyUSB0
[68529.016174] usbcore: registered new interface driver ftdi_sio
[68529.016204] ftdi_sio: v1.6.0:USB FTDI Serial Converters Driver

Tuesday, August 7, 2012

Thayer's Quest fully working with Dexter now

I am ready to declare Thayer's Quest fully working with Dexter now after having fixed the left/right audio issue.

Wednesday, August 1, 2012

Lots of progress!

What does a vacation involving a laptop and a lack of internet result in? Major progress on Dexter! :)

The first thing I did was add support for left/right audio when used in conjunction with .ldimg files (which is what Thayer's Quest needs to function properly). This actually turned out to be a decent amount of work since it had not been implemented at all, and I had to implement it in such a way so that both .ldimg code and legacy VLDP code could share it. So I got that working and I am very pleased with the results.

Then in order to test it, I had to fix a bunch of problems in Daphne's Thayer's Quest driver that I broke when I did things like rip out SDL. Since Thayer's Quest uses a keyboard, I had to design a completely new (and abstract) way to pass keyboard events to a game driver. I'm happy to say that Thayer's Quest is running nicely once again (minus the speech synth, which I still need to hook up).

I then decided to tackle something that's been bugging me for a long time and that is to clean up the IVideoObject interface. This interface has been around for a while in my personal code but I've never released it to the public (because it is only in WIP code). But I've done enough work on it that it was growing to monstrous proportions and so I spent quite a bit of time splitting it up into smaller interfaces. This will be very beneficial when it comes time to port the Dexter viewer to something like the Beagleboard or Raspberry Pi, so it was a good investment.

Lastly, I did more work on the VP931 interpreter and am pleased with how it is coming along. I am going to write a plethora of unit tests to make sure it is behaving exactly like the original VP931 behaved (using MAME/Firefox as a reference, so hopefully MAME is mostly correct) even though I do not have documentation on what the status codes mean. This is the best way I can think of to add support for those status codes without documentation. It will take some time but should be quite stable in the end.
The one thing MAME can't provide me with is details on what the interface does during disc spin-up time and non-instant seek time. If anyone with a real Firefox machine could somehow sniff this for me, it would be pretty awesome, and the clock is actually ticking since these players are almost universally dead.