Did you know that hand-optimized assembler is slower than allegedly “slow” languages, say, Lisp or JavaScript? It’s true, if you wait long enough.
Imagine you have found some old hand-tuned Lisp code, and a hand-tuned PDP-1 assembly language version of the same program. If you executed both today, using a modern Lisp interpreter and a modern PDP-1 emulator, the Lisp code would perform much better. That’s because it could take advantage of all the improvements in Lisp interpreters and computers since the 1960s, while the PDP-1 code would be held back by trade-offs made long ago to appease obsolete hardware.
I expect that JavaScript will ultimately outperform today’s SSE code, because hardware inevitably becomes obsolete over time.
When assembly code becomes obsolete, you are stuck with two slow options: use obsolete hardware to execute the code, or emulate the obsolete hardware on a new computer. Since computers keep getting faster, emulation will eventually become the faster of the two. Meanwhile the high-level code will still be executable, and able to take advantage of all subsequent hardware improvements.
Some x86 instructions are already obsolete, by which I mean they give worse performance than more common instructions.
Assembly/Compiler Coding Rule 31. (ML impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.
— Intel®64 and IA-32 Architectures Optimization Reference Manual
You could say that this is all academic, because in half a century the code you wrote today will be irrelevant, and I would agree. PDP-1 vs Lisp was hypothetical hyperbole, meant to show that the benefits of low-level optimization shrink the longer the code is in use, because its advantage over high-level code erodes over time.
What really drove this home for me was Apple’s switch from PPC to x86. Suddenly all the AltiVec code I’d ever written was obsolete. If someone bought a new Mac, my AltiVec code would be emulated, and slower than unchanged (but recompiled) C code! I could have reused existing AltiVec code to create SSE code, but then I would be stuck doing optimization work every time a sufficiently new computer came out.
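To make “unchanged (but recompiled) C code” a little more concrete, here is a minimal sketch of the kind of routine I mean. The function name `scale_add` and its signature are hypothetical, invented for this illustration and not taken from any code of mine. It is plain C99 with no AltiVec or SSE intrinsics; whether it actually gets vectorized, and for which instruction set, is left entirely to whatever compiler and flags you happen to use.

```c
#include <stddef.h>

/* Hypothetical example: dst[i] += k * src[i] for n floats.
 * Written as portable C, with no vector intrinsics. The `restrict`
 * qualifiers promise that dst and src do not overlap, which gives an
 * optimizing compiler more freedom to vectorize the loop for whatever
 * hardware it is targeting. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] += k * src[i];
    }
}
```

The point is not that this is as fast as hand-written vector code on the machine it was first tuned for. It probably isn’t. The point is that the same source can be recompiled for the next architecture without anyone rewriting it.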
Algorithmic-level optimization holds its value over time; low-level optimization does not. However, software is such a rapidly changing environment that a short-term investment can still make a hell of a lot of sense.