Yeah, this is a fun problem. This is where the majority of my work went when writing an x86-virus that lived in the NOP areas between functions. Compilers produced those blocks of NOPs after function bodies to make the following function addresses aligned to to 16-byte boundaries. I chained all of those cavities together with jumps, distributing as much code as would fit and then putting a jump to the next cavity at the end. The trick is that you could fit more instructions (3 more bytes?) if you had a shorter jump.
I still want to write my own assembler some day.