Re variable instruction lengths: Consider some forced alignment, e.g. every 16-byte (or every 32-byte) boundary MUST begin a new instruction. This has minimal impact on the effective code density (if your compiler is not stupid), while allowing almost random access decoding of instructions. This is good for the processor (short dependency chains with respect to instruction decoding/parsing, as opposed to infinite), and even more important for reverse engineering software, debuggers, etc. Even better: Declare that only 16-byte aligned addresses are valid jump targets.
Effect: You can decode a huge binary, and be sure that all valid jump targets have been found. It is easy to check whether some instruction is contained in the code or not: use grep. Your compiler should be able to ensure alignment without issuing too many NOPs.
Enforce "write XOR execute". This has a lot of advantages. Firstly, security-- but this should rather be the application programmer's job.
Secondly, simplified cache coherency: instructions and data use separate caches, and attempting to ensure that a write to an instruction already in the pipeline does correctly roll back... yeah. On ARM you need to explicitly flush the instruction cache for this to work.
Third, and the biggest advantage: Consider QEMU. Emulating ARM is a breeze: you KNOW where code is located, and every jump/instruction cache flush triggers software decoding, translation to e.g. x86 and optimization (LLVM), caching of the result, etc, and then continuation.
Do not try this with x86. You don't know where instructions start, you don't know whether the code is self-modifying, everything sucks.
Btw: self-modifying code, jit-compilers, etc should be minimally affected by such a restriction: Between each self-modification / jit-round, you pay a round-trip to main memory and possibly a syscall (if you don't have permissions to map a page as executable).
Indeed, I imagine that your ISA could first find adoption as an intermediate representation (like llvm IR)-- if you ensure that it is very fast to jit/compile from your language to fast x86 and arm, and pretty fast to verify properties of code.
Bonus request: Performance-hint/NOP: Allow the instruction to hint at likely-ness, often the compiler/programmer can predict the branch pretty well.
Bonus request: The stack area, or some part of it, is always in a incoherent cache-state: The current core/context is the only one that can write/read there. Or maybe provide a separate magic memory location for this, if the stack would clash too much with calling conventions.
Desired effect: The compiler/programmer is limited by the number of addressable registers. Often, you will need to temporarily store stuff you need later. In reality, more registers exist. Allowing the processor to avoid hitting the L1-cache and using a renamed/anonymous register should give huge performance boosts in future processors with more anonymous registers running legacy code.
At the same time, this makes dependency analysis (in software and hardware) easier.
Bonus request: Hardware support for constant-time operations. It turns out that this is quite important in a lot of crypto protocols (timing side-channels). I am not entirely sure how to best help here, but if it was possible to get an accurate cycle counter, and a NOP-until-cycle-xyz instruction, this would probably be supremely useful. Both instruction (count-cycle and wait-until-cycle) are allowed to take very long: these are rare instructions, and it is totally fine to take 10000 a cycle hit! On the other hand, these would also simplify a lot of debugging and performance testing.
Re floating point exceptions: I think control registers, trapping and fault-handling are the right way to go for "extremely unlikely" branches. This way, you basically have a conditional branch on every memory access or floating point operation (page fault?), which has a very high cost for mispredictions, and is very cheap on correct predictions (effectively zero overhead).