#EPIC and #VLIW aren't quite synonymous: the former implies a more compact encoding, more synchronization, and fewer no-op holes, and it also conveys the meaning better (the goal isn't long instructions, but explicit parallelism).
Generally, the idea is that there is an *interpreter* inside modern processors, as sophisticated as an embedded one can be, and interpreters aren't very efficient at running prepared programs (as opposed to a REPL), especially there.
So it makes sense to replace it with a *compiler*, doing the dispatching only once (not every time the program runs). This way the code can be optimized significantly better, saving some energy later at run time (and die area, of course).
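A toy illustration of that dispatch-once idea (a hypothetical mini-ISA, nothing to do with any real processor): the interpreter re-decodes every opcode on every run, while the "compiler" resolves each opcode to its operation exactly once, up front.

```python
# Hypothetical two-opcode "ISA"; OPS plays the role of the circuits.
OPS = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
}

program = ["inc", "dbl", "inc"]

def interpret(program, x):
    # Dispatch (the table lookup) happens every time the program runs.
    for opcode in program:
        x = OPS[opcode](x)
    return x

def compile_program(program):
    # Dispatch happens once; the result is a list of direct calls.
    resolved = [OPS[opcode] for opcode in program]
    def run(x):
        for op in resolved:
            x = op(x)
        return x
    return run

run = compile_program(program)
print(interpret(program, 3), run(3))  # both compute ((3+1)*2)+1 = 9
```

Once the dispatch cost is paid up front, `run` can also be optimized as a whole, which is the point being made above.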
• unpredictable delays due to asynchronous I/O
• higher memory size and bandwidth requirements
Nevertheless, it's like complaining that Emacs is slow and bloated… browsers and GUIs are much more so now.
WRT "execution" of empty instructions (no-ops):
In a simple processor, which executes instructions one by one, strictly in the order they're read from memory, there are actually different circuits for different operations, and the operation code activates just one of them at a time.
If the processor is more sophisticated, contains a microcode interpreter, and the operation-specific circuits can be activated simultaneously, the operations may be executed in parallel.
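A sketch of the two dispatch styles just described (the unit names are invented; the "wide" loop only models the semantics, it doesn't actually run in parallel). In the wide form each instruction is a bundle with one slot per circuit, and an unused slot holds exactly the kind of no-op discussed above.

```python
# Invented "functional units" acting on a tiny machine state.
def alu(state):   state["acc"] += 1
def load(state):  state["acc"] += state["mem"]
def nop(state):   pass  # empty slot: the unit stays idle this cycle

UNITS = {"alu": alu, "load": load, "nop": nop}

def run_serial(opcodes, state):
    # Simple processor: one opcode per cycle activates exactly one circuit.
    for op in opcodes:
        UNITS[op](state)

def run_wide(bundles, state):
    # Wide processor: each bundle's slots fire in the same cycle;
    # slots with no useful work are padded with no-ops.
    for bundle in bundles:
        for op in bundle:
            UNITS[op](state)

s1 = {"acc": 0, "mem": 10}
s2 = {"acc": 0, "mem": 10}
run_serial(["alu", "load", "alu"], s1)
run_wide([("alu", "load"), ("alu", "nop")], s2)
print(s1["acc"], s2["acc"])  # 12 12
```

The wide version finishes in two "cycles" instead of three, at the price of one wasted slot, which is exactly the no-op-holes trade-off EPIC tries to soften.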
@wictory The general theme of that thread was "#superscalar" microarchitectures: what they are and why they're good.
All superscalar processors use "#VLIW" internally: the #microcode parses the ISA instructions and translates them into internal "instructions" (I'm not sure about the exact representation).
Technically one might change the microcode and make the processor execute another ISA — say, JVM…
Or one might remove the interpreter altogether.
@wictory Another dimension of a micro-architecture is pipelining.
IMO, it's an even subtler technique than superscalarity. The idea is to increase the clock rate and the throughput at the expense of the response time, AFAIU.
So the point of pipelining, of executing instructions in multiple stages, is simply to increase the clock speed. Otherwise we'd have huge circuits, taking substantial time for the signals to propagate through them.
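Back-of-the-envelope numbers for that trade-off (the delays are made up): splitting a 4 ns circuit into four 1 ns stages lets the clock tick 4x faster; throughput approaches one instruction per short cycle, while each individual instruction now needs four cycles end to end, which is the response-time cost.

```python
def total_time_unpipelined(n_insns, circuit_delay_ns):
    # One instruction fully traverses the big circuit before the next starts.
    return n_insns * circuit_delay_ns

def total_time_pipelined(n_insns, stage_delay_ns, n_stages):
    # n_stages cycles to fill the pipeline, then one result per cycle.
    return (n_stages + n_insns - 1) * stage_delay_ns

print(total_time_unpipelined(1000, 4.0))   # 4000.0 ns
print(total_time_pipelined(1000, 1.0, 4))  # 1003.0 ns
```

For a single instruction the pipelined version is no faster (4 cycles x 1 ns = 4 ns), which is why the gain is in throughput, not latency.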
@wictory And on the logical level, the last thing I heard of was "hybrid" computing, where CPUs, GPUs, and DSPs can be configured to solve the same task. That is, #AMD's #HSA, #OpenCL, #OpenMP (and #OpenACC), and also #CUDA.
Or distributed computing (a cluster) made of myriads of low-power, mostly idle by now, yet Internet-connected embedded devices.
@wictory Or the 'H' in HSA stands for "heterogeneous"; not sure. Either way, the common theme is offloading: (re)organizing the program in the form of "kernels"; at run time the kernels are assigned to the available computing devices.
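A minimal sketch of that offloading idea: the "devices" and "kernels" below are plain Python stand-ins (invented names), not OpenCL/CUDA API objects. Kernels are simply assigned round-robin to whatever computing devices are available.

```python
from itertools import cycle

# Two toy "kernels": pure functions over a chunk of data.
def scale(data):   return [x * 2 for x in data]
def offset(data):  return [x + 1 for x in data]

kernels = [scale, offset, scale]
devices = ["cpu0", "gpu0"]  # hypothetical device names

def schedule(kernels, devices):
    # Pair each kernel with the next available device, round-robin.
    return list(zip(kernels, cycle(devices)))

data = [1, 2, 3]
for kernel, device in schedule(kernels, devices):
    data = kernel(data)  # in a real runtime this would run on `device`
print(data)  # [6, 10, 14]
```

A real runtime would also weigh transfer costs and device capabilities when assigning kernels; round-robin is just the simplest possible policy for the sketch.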
The applications may include not only CC, ML, VR, but also en-/decryption and (de)compression for secure communications; and I suspect compilation itself may be parallelized:
• character stream
• token stream
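The two streams above can be modelled as lazy pipeline stages, each consuming the previous one, so in principle the stages could run concurrently on different devices. A toy tokenizer, not a real compiler front end:

```python
def characters(source):
    # Stage 1: the character stream.
    yield from source

def tokens(chars):
    # Stage 2: the token stream; here tokens are just
    # whitespace-separated words.
    word = ""
    for ch in chars:
        if ch.isspace():
            if word:
                yield word
            word = ""
        else:
            word += ch
    if word:
        yield word

print(list(tokens(characters("let x = 42"))))  # ['let', 'x', '=', '42']
```

Because both stages are generators, the tokenizer starts producing tokens before the character stream is exhausted, which is what makes the pipeline (and hence the offloading) possible.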