I believe we could further improve RetroPie on the RPi3 by compiling with Thumb2 support, or even entirely in Thumb2 mode if that turns out to be faster. GCC can target either ARM or Thumb2 (-marm or -mthumb), and profile data could tell us which one to use where. On the RPi2 we could even use thumb-interwork (-mthumb-interwork) so the two can be mixed, which should speed things along for a minor increase in code size where the 16-bit Thumb code is involved.
It’s a shame to have the ability to use Thumb2 and not do it.
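To make the idea a bit more concrete, here is a rough sketch of the flag sets I have in mind. The helper `cflags_for` and the board names are made up for illustration and are not part of RetroPie's build scripts. The GCC options themselves are real, but the exact CPU and FPU values are my assumption (Cortex-A7 for the original RPi2, Cortex-A53 for the RPi3, both in 32-bit mode), so check them against your toolchain.

```python
# Hypothetical helper: builds a GCC flag list for a 32-bit Raspberry Pi target.
# Board names and CPU/FPU choices are assumptions, not RetroPie's actual settings.

CPU_FLAGS = {
    "rpi2": ["-mcpu=cortex-a7", "-mfpu=neon-vfpv4", "-mfloat-abi=hard"],
    "rpi3": ["-mcpu=cortex-a53", "-mfpu=neon-fp-armv8", "-mfloat-abi=hard"],
}


def cflags_for(board, thumb=True, interwork=False):
    """Return GCC flags selecting ARM or Thumb2 code generation for a board."""
    if board not in CPU_FLAGS:
        raise ValueError("unknown board: " + board)
    flags = ["-O2"] + CPU_FLAGS[board]
    # -mthumb emits Thumb2 on these cores; -marm emits classic 32-bit ARM.
    flags.append("-mthumb" if thumb else "-marm")
    if interwork:
        # Lets ARM and Thumb code call each other.
        flags.append("-mthumb-interwork")
    return flags


if __name__ == "__main__":
    print(" ".join(cflags_for("rpi3")))
    print(" ".join(cflags_for("rpi2", thumb=False, interwork=True)))
```

Building the same emulator once with `thumb=True` and once with `thumb=False` and timing both would be the simplest way to see whether Thumb2 actually wins on a given core.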
I suppose this would, in essence, further split the repositories, but if we’re using deltas to store changes (which I would hope we are), the repositories should stay fairly small and compressible without requiring a large server to handle them.
Now, getting into the 64-bit side of things, using AArch64 would produce somewhat faster code if we can manage to process more data per register. However, that would require splitting the repositories even further and recompiling every binary as 64-bit, which is a real pain (especially if you try to bootstrap all of your compilers with the older versions of GCC presently used on RetroPie).
I’ve managed a few optimizations with GCC 5.3 that aren’t available in version 4.x; they reduce redundant code and speed things along nicely, mainly by using LTO. It would probably be better to compile with 5.3 if we can get it to work properly (it’s a hassle, since it doesn’t seem to have been built for RetroPie yet). Profiling the builds with thumb-interwork enabled might help find the best balance of performance and size for the kernel, programs, etc.
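For what it’s worth, the kind of profiled LTO build I mean is a two-pass affair, roughly like the sketch below. The function `pgo_lto_commands` and the file names are hypothetical, and it only assembles the command lines rather than running them. The flags `-flto`, `-fprofile-generate` and `-fprofile-use` are standard GCC options, but whether they behave well with RetroPie’s packages is something I’m assuming, not something I’ve verified across the board.

```python
# Hypothetical sketch: command lines for a profile-guided LTO build with GCC.
# Nothing is executed here; the function just returns the three steps.

def pgo_lto_commands(sources, output, extra_flags=()):
    """Return [instrumented build, training run, optimized rebuild] as argv lists."""
    base = ["gcc", "-O2", "-flto"] + list(extra_flags)
    # Pass 1: build with instrumentation so that running the program writes profile data.
    instrumented = base + ["-fprofile-generate"] + list(sources) + ["-o", output]
    # Run the instrumented binary on a representative workload.
    training_run = ["./" + output]
    # Pass 2: rebuild using the collected profile.
    optimized = base + ["-fprofile-use"] + list(sources) + ["-o", output]
    return [instrumented, training_run, optimized]


if __name__ == "__main__":
    for cmd in pgo_lto_commands(["emu.c"], "emu", ["-mthumb", "-mthumb-interwork"]):
        print(" ".join(cmd))
```

The training run matters most here: if the workload isn’t representative of real use, the profile will steer the optimizer in the wrong direction.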
But there’s my 2 cents.