parallel computing 

#EPIC and #VLIW aren't quite synonymous: the former implies utilizing a more compact encoding, more synchronization, and fewer no-op holes, and also conveys the meaning better (long instructions are not the goal, the explicit parallelism is).
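
To make the "no-op holes" concrete, here's a toy C sketch of a fixed-width VLIW bundle; the slot layout and opcodes are made up for illustration, not any real ISA. When the compiler can't fill every slot with an independent operation, the leftovers become no-ops, which EPIC-style encodings largely avoid:

```c
#include <stdint.h>

/* Hypothetical 3-slot VLIW bundle: one slot per functional unit.
 * All three operations issue (and retire) together. */
enum opcode { OP_NOP, OP_ADD, OP_MUL, OP_LOAD };

struct op {
    enum opcode code;
    uint8_t dst, src1, src2;   /* register numbers */
};

struct bundle {
    struct op slot[3];         /* e.g. ALU, MUL, MEM units */
};

/* A compiler that finds only two independent operations must pad
 * the third slot -- this is the "no-op hole": */
static const struct bundle example = {
    .slot = {
        { OP_ADD,  1, 2, 3 },  /* r1 = r2 + r3 */
        { OP_LOAD, 4, 5, 0 },  /* r4 = mem[r5] */
        { OP_NOP,  0, 0, 0 },  /* wasted encoding space */
    },
};
```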

Generally, the idea is that there is an *interpreter* inside modern processors, as sophisticated as something embedded can be, and interpreters aren't very efficient at running prepared programs (as opposed to a REPL), especially in hardware.

parallel computing 

So it does make sense to replace it with a *compiler*, doing the dispatching only once (and not every time the program runs). Surely this way it may be optimized significantly better, saving some energy later at run time (and space on the die, of course).
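
As a software analogy (a minimal sketch, not what the silicon actually does): an interpreter pays the dispatch cost on every executed instruction, while a compiler pays it once, ahead of time.

```c
#include <stdio.h>

enum op { PUSH, ADD, PRINT, HALT };

/* Interpreter: the switch (the "decoder") runs on EVERY instruction,
 * every time the program runs -- that is the repeated dispatch cost. */
static void interpret(const int *code) {
    int stack[16], sp = 0;
    for (;;) {
        switch (*code++) {
        case PUSH:  stack[sp++] = *code++;           break;
        case ADD:   sp--; stack[sp-1] += stack[sp];  break;
        case PRINT: printf("%d\n", stack[sp-1]);     break;
        case HALT:  return;
        }
    }
}

int main(void) {
    const int program[] = { PUSH, 2, PUSH, 3, ADD, PRINT, HALT };
    interpret(program);
    /* A compiler would emit the equivalent of printf("%d\n", 2 + 3)
     * once, paying the dispatch cost at build time instead. */
    return 0;
}
```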

Obstacles:
• unpredictable delays due to asynchronous I/O
• higher memory size and bandwidth requirements

Nevertheless, it's like complaining that Emacs is slow and bloated… browsers and GUIs are much more so now.


parallel computing 

WRT "execution" of the empty instructions (no-op):

In a simple processor, which executes instructions one by one, strictly in the order they're read from memory, there are actually different circuits for different operations, and the operation code activates just one of them at a time.

If the processor is more sophisticated, i.e. it contains a microcode interpreter and the operation-specific circuits may be activated simultaneously, then operations may be executed in parallel.
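
A toy model of the difference (illustrative C, not a real decoder): in the simple case the opcode raises exactly one "enable" line; a microcoded core may raise several at once.

```c
#include <stdbool.h>

/* One "enable" line per operation-specific circuit. */
struct enables { bool alu_add, alu_mul, mem_load; };

/* Simple in-order core: decoding activates exactly one circuit. */
static struct enables decode_simple(int opcode) {
    struct enables e = { false, false, false };
    switch (opcode) {
    case 0: e.alu_add  = true; break;
    case 1: e.alu_mul  = true; break;
    case 2: e.mem_load = true; break;
    }
    return e;
}

/* A wider core could raise several enables in one cycle, as long as
 * the selected operations are independent -- that's the parallelism. */
static struct enables decode_wide(void) {
    return (struct enables){ .alu_add = true, .mem_load = true };
}

int main(void) {
    struct enables a = decode_simple(0);  /* only ALU-add active */
    struct enables b = decode_wide();     /* two units active at once */
    return a.alu_add && b.mem_load ? 0 : 1;
}
```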


parallel computing 

@amiloradovsky
I really appreciate your responses. My interest in this discussion, apart from just learning new stuff, is that I want to understand what other people "in the community" think about these things.

I assumed that a simple processor is basically an ideal MIPS architecture. In those, the "whole" circuit is always activated. The reason for the quotation marks is the inverted Moore's law years.

In modern simple processors, i.e. Cortex-AX where X is a subjectively small number, they push the limit of what non-OOO can be.

An OOO processor only promises to retire instructions in order. In a RISC processor, the promise is the same. In a VLIW processor it's also the same, except that the instructions in a bundle may be independent, but must be retired at the same time. In an OOO processor the internal architecture can be like a VLIW, but it allows asynchronous retirement.

parallel computing 

@wictory The general theme of that thread was #VLIW microarchitectures: what that is and why it's good.
All superscalar processors use #VLIW internally: the decoder parses the instructions of the ISA and translates them into the internal "instructions" (I'm not sure about the exact representation).
Technically one might change the microcode and make the processor execute another ISA — say, JVM…
Or one might remove the interpreter altogether.

parallel computing 

@amiloradovsky ok, my argument then I guess is that VLIW specifically is a RISC architecture that exposes multiple functional units through the ISA. It is important that the VLIW instructions, in their entirety, are retired atomically. A VLIW processor may implement ILP, although it's kind of stupid if it doesn't.

A superscalar processor necessarily implements ILP, but doesn't necessarily retire the issued instructions synchronously. There are examples of superscalar processors that have an internal VLIW architecture, like what Transmeta was doing, and I think Nvidia was working on an ARM-ISA processor with internal translation to a VLIW.

But these are in general the exception. The x86 architecture is often quoted as being able to keep ~200 instructions in flight concurrently, retiring them in any order. This capability is different from the usual VLIW architectures since, for example, by tweaking the architecture, the instructions can come from different 'logical cores'. A problem with ILP in general is that there isn't much parallelism in most programs, which is why there are few microarchitectures with more than 2^2 functional units.
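
To see why most programs expose little ILP, compare a serial dependency chain with independent accumulators (a standard trick; actual speedups depend entirely on the microarchitecture):

```c
/* Each addition depends on the previous one: the functional units
 * mostly sit idle, no matter how wide the machine is. */
double sum_serial(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];            /* chain: s -> s -> s -> ... */
    return s;
}

/* Four independent chains expose ILP: up to four additions can be
 * in flight at once on a wide-enough core.  (Note: FP addition isn't
 * associative, so results may differ slightly from sum_serial.) */
double sum_ilp(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];   s1 += a[i+1];
        s2 += a[i+2]; s3 += a[i+3];
    }
    for (; i < n; i++) s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```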

parallel computing 

@wictory Another dimension of a micro-architecture is pipelining.
IMO, it's an even subtler technique than superscalarity. The idea is to increase the clock rate and the throughput at the expense of the response time, AFAIU.
So the point of pipelining, i.e. executing the instructions in multiple stages, is simply to increase the clock speed. Otherwise we'd have to have huge circuits, taking substantial time for the signals to propagate through them.
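
The back-of-the-envelope arithmetic, with made-up delay numbers: splitting the logic into stages shortens the critical path per cycle (higher clock), while per-instruction latency grows with the added latch overhead.

```c
#include <stdio.h>

int main(void) {
    double logic_ns = 10.0;  /* total combinational delay (made up) */
    double latch_ns = 0.2;   /* per-stage register overhead (made up) */

    for (int stages = 1; stages <= 16; stages *= 2) {
        /* Cycle time = longest stage + latch overhead. */
        double cycle     = logic_ns / stages + latch_ns;
        double clock_ghz = 1.0 / cycle;        /* throughput goes up   */
        double latency   = cycle * stages;     /* response time grows  */
        printf("%2d stages: %.2f GHz, instruction latency %.2f ns\n",
               stages, clock_ghz, latency);
    }
    return 0;
}
```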

parallel computing 

@amiloradovsky I'm not sure what the state of the art is, but there are benchmarks that have been used as a basis for microarchitecture design. Some of these benchmarks include code from gcc, perl, awk, bzip and so on. There's a lot of criticism of these benchmarks, but if one honestly wanted to improve them, at least in some fundamental way, I wouldn't know where to start.

parallel computing 

@wictory And on the logical level, the last thing I heard of was "hybrid" computing, where the CPUs, GPUs, and DSPs can be configured to solve the same task. That is, #HSA.
Or distributed computing (a cluster) made of myriads of low-power, mostly idle by now, yet Internet-connected embedded devices.

parallel computing 

@wictory Or the 'H' in HSA stands for "heterogeneous", I'm not sure. Either way, the common theme is offloading: (re)organizing the program in the form of "kernels"; then at runtime the kernels are assigned to the available computing devices.
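
A minimal sketch of the "kernel" idea in plain C (hypothetical dispatch, not the actual HSA or OpenCL API): the work is a pure function over an index range, so a runtime is free to split it across whatever devices are available.

```c
/* A "kernel": stateless, data-parallel work over an index range.
 * Because each index is independent, a runtime may split the range
 * across whatever devices happen to be available. */
void saxpy_kernel(int begin, int end,
                  float a, const float *x, float *y) {
    for (int i = begin; i < end; i++)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical dispatch: a real runtime (HSA, OpenCL, ...) would
 * enqueue this and pick a device; here we just call it directly. */
void run_on_some_device(int n, float a, const float *x, float *y) {
    saxpy_kernel(0, n, a, x, y);
}
```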

The applications may include not only CC, ML, VR, but also en-/decryption and (de)compression for secure communications; and I suspect compilation itself may be parallelized as a pipeline over (see the sketch after the list):
• character stream
• token stream
• expressions
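
A sketch of that pipeline with POSIX threads (illustrative; a real compiler's stages are far more involved): the "lexer" and the "parser" run concurrently, connected by a small queue.

```c
/* Pipeline-parallel compilation sketch: stage 1 turns characters into
 * tokens, stage 2 consumes tokens into an "expression", each stage on
 * its own thread, joined by a tiny queue.  Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <ctype.h>

#define QCAP 64
static int q[QCAP], qhead, qtail, qdone;
static pthread_mutex_t qm = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qc = PTHREAD_COND_INITIALIZER;

static void q_put(int tok) {
    pthread_mutex_lock(&qm);
    q[qtail++ % QCAP] = tok;   /* assume the queue never overflows */
    pthread_cond_signal(&qc);
    pthread_mutex_unlock(&qm);
}

static int q_get(int *tok) {
    pthread_mutex_lock(&qm);
    while (qhead == qtail && !qdone)
        pthread_cond_wait(&qc, &qm);
    int ok = qhead != qtail;
    if (ok) *tok = q[qhead++ % QCAP];
    pthread_mutex_unlock(&qm);
    return ok;
}

/* Stage 1: character stream -> token stream (digits become values). */
static void *lexer(void *arg) {
    for (const char *p = arg; *p; p++)
        if (isdigit((unsigned char)*p)) q_put(*p - '0');
    pthread_mutex_lock(&qm);
    qdone = 1;
    pthread_cond_signal(&qc);
    pthread_mutex_unlock(&qm);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, lexer, "1 + 2 + 3");
    /* Stage 2: token stream -> "expression" (here: a running sum),
     * overlapping in time with the lexer above. */
    int tok, sum = 0;
    while (q_get(&tok)) sum += tok;
    pthread_join(t, NULL);
    printf("expression value: %d\n", sum);
    return 0;
}
```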
