taktoa · 2026-03-22T21:20:54 1774214454

Regarding synthesis, I think approaches like this often seem promising to software engineers but ignore the realities of physical design. Hierarchical physical design tends to be worse than flat PD because there are many variables to optimize (placement density, congestion, IR drop, thermal, parasitics, signal integrity, di/dt, ...) and even if you have some solution in mind that optimizes area for a highly regular block, that layout could be worse than a solution that intersperses lower-power cells throughout that regular logic to reduce hotspots. And since placement is not going to be regular in any real design, delay won't be either, and there is a technique called resynthesis that restructures the logic network based on exactly which paths are critical, which will essentially destroy whatever logic regularity existed.

The other thing is that high level optimizations tend to be hard to come by in hardware. Most datapath hardware is not highly fixed-function; instead it consists of somewhat general blocks that contain a few domain specific fused ops. So we either have hardware specifications that are natural language or RTL specifications that are too low level to do meaningful design exploration. Newer RTL languages and high level synthesis tools _also_ tend to be too low level for this kind of thing. It's a pretty challenging problem to design a formal specification language that is simultaneously high level enough and yet allows a compiler to do a good job of finding the optimal chip design. Approximate numerics are the most concrete example of this: there just aren't really any good algorithms for solving the problem of "what is the most efficient way to approximate this algorithm with N% precision", and that's not even including the flexibility vs efficiency tradeoff which requires something like human judgement, or the fact that in many domains it's hard to formulate an error metric that isn't either too conservative or too permissive.

tasty_freeze · 2026-03-22T22:58:03 1774220283

I read the summary but not the paper and it seems like it has nothing to do with physical design. This is a means of making the elaborate/compile/simulation performance of the language faster.

Say someone wrote this code:

    wire [31:0] a, b, c;
    genvar i;  // a generate loop needs a genvar as its index
    for (i = 0; i < 32; i = i + 1) begin : and_bits
       assign c[i] = a[i] & b[i];
    end

it sounds like this paper is about recognizing it could be implemented something akin to this:

    wire [31:0] a, b, c;
    assign c = a & b;

Both will produce the exact same gates, but the latter form will compile and simulate faster.
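The "same gates" claim can be sanity-checked with a small Python model. This is only a sketch of the two forms on plain integers, not a Verilog simulation; the names `and_bitwise_loop`, `and_vector`, and `forms_agree` are made up for illustration.

```python
import random

WIDTH = 32                 # matches the [31:0] wires in the example
MASK = (1 << WIDTH) - 1

def and_bitwise_loop(a, b):
    """Model of the per-bit form: one AND per bit position."""
    c = 0
    for i in range(WIDTH):
        bit = ((a >> i) & 1) & ((b >> i) & 1)
        c |= bit << i
    return c

def and_vector(a, b):
    """Model of the vector form: a single word-wide AND."""
    return (a & b) & MASK

def forms_agree(trials=1000, seed=0):
    """Compare both forms on random 32-bit operands."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(WIDTH), rng.getrandbits(WIDTH)
        if and_bitwise_loop(a, b) != and_vector(a, b):
            return False
    return True
```

The per-bit version does 32 iterations of work for every evaluation while the vector version does one operation, which is roughly the kind of cost difference a simulator or analysis tool would see between the two forms.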

taktoa · 2026-03-22T23:11:34 1774221094

In section 4.4 it discusses the effect of the technique on Cadence Genus, which is a PD/synthesis tool. My point is that you have to flatten the graph at some point, and most of the benefit of flattening it later (keeping/making things vectorized) is to do higher level transformations, which are mostly not effective.

kyboren · 2026-03-23T02:20:41 1774232441

Well, to be fair, the authors propose this thesis: "Although the vectorization of Verilog designs does not change the hardware they describe, it reduces their symbolic complexity, enabling faster and more scalable analysis and verification."

Maybe it doesn't help Design Compiler turn your shitty design into gold, but faster verification is an unalloyed good.

kyboren · 2026-03-23T02:04:14 1774231454

> Hierarchical physical design tends to be worse than flat PD because there are many variables to optimize (placement density, congestion, IR drop, thermal, parasitics, signal integrity, di/dt, ...) and even if you have some solution in mind that optimizes area for a highly regular block, that layout could be worse than a solution that intersperses lower-power cells throughout that regular logic to reduce hotspots.

This paragraph goes hard. And this is exactly why design space exploration is essential. I think you're right that basically, simplistic delay/area models are insufficient and the exploration must be driven by actual metrics of complete P&R flows.

> [...] it's a pretty challenging problem to design a formal specification language that is simultaneously high level enough and yet allows a compiler to do a good job of finding the optimal chip design

My experience in this domain is that actually the challenging problem isn't so much the design of a formal language. Instead the challenge lies primarily in expressing your design in such a way that both generalizes over and meaningfully exposes the freedom in the design space.

> [...] solving the problem of "what is the most efficient way to approximate this algorithm with N% precision"

> [...] in many domains it's hard to formulate an error metric that isn't either too conservative or too permissive.

I think the latter comment alludes to my objection about the former: It all depends on what "N% precision" means.

Does it mean that for every input/output pair, the output is always within N% of the correct value?

Or does it mean that for N% of the inputs, the output is correct? Is that weighted by the likelihood distribution of getting those inputs?

Or does it mean that the total mean squared error over all input/output pairs is within N%? Etc. etc.

In other words, I think it goes beyond even conservative vs. permissive; simply, the devil is in the details, and digital circuit design is a difficult multi-objective optimization problem.
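To make the three readings concrete, here is a rough Python sketch that scores one approximate circuit under each of them. The circuit is a hypothetical 8-bit adder that drops the low 2 bits of each operand; `approx_add` and `error_metrics` are illustrative names, and inputs are assumed uniformly distributed.

```python
from itertools import product

def approx_add(a, b, drop_bits=2):
    """Hypothetical approximate adder: ignore the low bits of each operand."""
    return ((a >> drop_bits) + (b >> drop_bits)) << drop_bits

def error_metrics(bits=8, drop_bits=2):
    """Exhaustively score approx_add against exact addition."""
    worst_rel = 0.0   # reading 1: worst-case relative error over all inputs
    n_exact = 0       # reading 2: how many inputs give the exact answer
    sq_sum = 0        # reading 3: mean squared error
    n = 0
    for a, b in product(range(1 << bits), repeat=2):
        exact = a + b
        err = abs(exact - approx_add(a, b, drop_bits))
        n += 1
        sq_sum += err * err
        if err == 0:
            n_exact += 1
        if exact > 0:
            worst_rel = max(worst_rel, err / exact)
    return {
        "worst_relative_error": worst_rel,
        "exact_fraction": n_exact / n,
        "mean_squared_error": sq_sum / n,
    }

metrics = error_metrics()
```

For this toy adder the worst-case relative error is 100% (e.g. 3 + 0 comes out as 0), only 6.25% of input pairs are exactly right, and yet the mean squared error is just 11.5 on sums that range up to 510. Whether that counts as "good precision" depends entirely on which reading you pick.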

taktoa · 2025-10-09T03:30:21 1759980621

I think what you're describing is SPMD, which is a compilation strategy, not a hardware architecture. I am not sure but I think SIMT is SIMD but with multiple program counters (1 per N lanes) to enable some limited control flow divergence between lane groups.

AlotOfReading · 2025-10-09T03:50:38 1759981838

The PC is shared in traditional SIMT, but diverging branches are masked out until they execute. Nvidia introduced per-thread PCs with Volta. I think AMD still uses a shared PC across each wavefront?
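A toy Python model of the shared-PC behavior might look like the following. It is a sketch of the masking idea only, not of any vendor's scheduler; `simt_branch` and its return shape are invented for illustration.

```python
def simt_branch(values, cond, then_fn, else_fn):
    """Run an if/else over all lanes with one shared control flow.

    Every lane walks through both sides of the branch; a per-lane mask
    decides which side is allowed to write its result.
    Returns (per-lane results, number of branch sides actually issued).
    """
    mask = [cond(v) for v in values]
    out = list(values)

    # Side 1: the 'then' path. Lanes whose mask bit is False are masked out.
    for lane, active in enumerate(mask):
        if active:
            out[lane] = then_fn(values[lane])

    # Side 2: the 'else' path, with the mask inverted.
    for lane, active in enumerate(mask):
        if not active:
            out[lane] = else_fn(values[lane])

    # A side is only issued if at least one lane needs it.
    sides_issued = int(any(mask)) + int(not all(mask))
    return out, sides_issued
```

With `values = [1, 2, 3, 4]` and an even/odd condition the lanes diverge and both sides are issued; with `[2, 4]` only one side is. The cost of divergence in this model is the extra pass, during which the masked lanes do no useful work.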

taktoa · 2025-09-19T07:33:27 1758267207

Rust and C++ implement generics with monomorphization rather than boxing, so there is a potential performance hit associated with a type like this in Java that is guaranteed not to exist in Rust.

In practice, the JVM may still monomorphize it, but it is not guaranteed to, and this would be a good reason to avoid unnecessary uses of generics in a high performance codebase like a kernel, if you chose to write one in Java.

lock1 · 2025-09-19T09:04:01 1758272641

Sure, I guess that's worth mentioning in the context of the original post. It's true that Java's implementation of parametric polymorphism has a performance drawback compared to Rust or C++. And it's certainly a bad idea to use Java generics in hot code paths without considering that drawback.

But GP described something they wanted from a type system and basically said a container with `Functor`-like behavior is not possible to do in Java. It's possible, albeit with a performance drawback and a bit more clunky to work with compared to Rust, Haskell, or a language with native HKT support.

wavemode · 2025-09-19T12:35:00 1758285300

the parent commenter states that Java is their everyday language, so in this context I don't think we're talking about performance, nor about the needs of the kernel.

taktoa · 2025-07-08T00:13:54 1751933634

Not just when the codomain is a field, but more generally when the codomain is itself a vector space. The former is a special case of the latter where you construct a 1D vector space from a field.
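Spelled out, and assuming the construction being discussed is the set of functions from a set $X$ into a vector space $V$ over a field $K$ (the parent comment isn't shown here, so that is a guess at the context), the operations are defined pointwise:

```latex
(f + g)(x) = f(x) + g(x), \qquad (c \cdot f)(x) = c \cdot f(x),
\qquad f, g : X \to V,\quad c \in K,\quad x \in X
```

The vector space axioms for the functions then follow from the axioms in $V$ applied at each $x$. Taking $V = K$, viewed as a one-dimensional vector space over itself, recovers the field-valued case.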

taktoa · on Aug 31, 2024

Definitely not an ARM situation.

taktoa · on March 28, 2024

> In a clocked design the clock signal needs to be routed to every element on the chip which requires a lot of power, the more so the higher the frequency is.

Clock only needs to be distributed to sequential components like flip-flops or SRAMs. The number of clock distribution wire-millimeters in a typical chip is dwarfed by the number of data wire-millimeters, and if a neural network is well trained and quantized, activations should be random, so the number of transitions per clock on a data wire should be 0.5 (as opposed to 1 for clock wires), meaning that power can't be dominated by clock. The flops that prevent clock skew are a small % of area, so I don't think those can tip the scales either. On the other hand, in asynchronous digital logic you need to have a valid-bit calculation on every single piece of logic, which seems like a pretty huge overhead to me.
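As a back-of-the-envelope version of that argument: switching power on a wire scales with activity × capacitance × V² × f, and if capacitance is taken as proportional to wire length with V and f shared, only activity and length matter for the ratio. The sketch below assumes exactly that and nothing more; it ignores clock buffers, flop clock-pin capacitance, and gate capacitance, and the 20:1 wire-length ratio in the example is a made-up number, not a measurement.

```python
def clock_power_share(clock_wire_mm, data_wire_mm,
                      clock_activity=1.0, data_activity=0.5):
    """Fraction of wire switching power spent on clock wires.

    Assumes capacitance is proportional to wire length and that clock and
    data wires share the same voltage and frequency, so those terms cancel.
    The default activities are the per-clock transition counts from the
    comment above (1 for clock, 0.5 for data).
    """
    clock = clock_activity * clock_wire_mm
    data = data_activity * data_wire_mm
    return clock / (clock + data)

# Hypothetical chip with 20x more data wire than clock wire.
share = clock_power_share(clock_wire_mm=1.0, data_wire_mm=20.0)
```

Under these assumptions the clock share is 1/11, about 9%. In this model the clock only dominates once data wire length drops below twice the clock wire length.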

HarHarVeryFunny · on March 28, 2024

There's obvious potential savings in not wasting FLOPs recalculating things unnecessarily, but I'm not sure how much of that could be realized by just building a data-flow digital GPU. The only attempt at a data-flow digital processor I'm aware of was AMULET (by ARM designer Steve Furber), which was not very successful.

There's more promise in analog chip designs, such as here:

https://spectrum.ieee.org/low-power-ai-spiking-neural-net

Or otherwise smarter architectures (software only or S/W+H/W) that design out the unnecessary calculations.

It's interesting to note how extraordinarily wasteful transformer-based LLMs are too. The transformer's design was partly inspired by linguistics and partly shaped by the parallel hardware (GPUs etc.) available to run it on. Language mostly has only local sentence structure dependencies, yet the transformer's self-attention mechanism has every word in a sentence paying attention to every other word (to some learned degree)! Turns out it's better to be dumb and fast than smart, although I expect future architectures will be much more efficient.

taktoa · on Sept 29, 2023

The open source release of XLA predates Lattner's tenure at Google by 7 months, and it definitely existed before that -- the codebase was already 66k SLOC at that point. During his tenure it went from 100k SLOC to 250k SLOC. It's now 700k SLOC. He also has, as far as I can tell, zero commits in the XLA codebase. "Of LLVM fame" would be more accurate I think.

swyx · on Sept 29, 2023

my bad - i guess i was just saying he led that team but didnt mean to imply he originated it

you seem to have very precise knowledge of the SLOC at a point in time - just curious is there any tooling you used to do that? that can be pretty nifty to pull out on occasion

taktoa · on Sept 29, 2023

I git cloned the repo and then ran sloccount after checking out various commits (just did `git log | grep -C3 'Jan 1 [0-9:]* 2017'` or similar to find the relevant commits)
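If you want to script it, something along these lines should work. It is a sketch that swaps the `git log | grep` step for `git rev-list --before`, which picks the last commit before a date directly; the helper names are made up, and it assumes `git` and `sloccount` are on the PATH and that `repo_dir` is an existing clone.

```python
import subprocess

def commit_before_cmd(date, ref="HEAD"):
    """Build the git command that finds the last commit on `ref` before `date`."""
    return ["git", "rev-list", "-n", "1", "--before=" + date, ref]

def sloc_report_at(repo_dir, date):
    """Check out the last commit before `date` and return sloccount's report.

    Note: this leaves the clone on a detached HEAD at that commit.
    """
    commit = subprocess.run(
        commit_before_cmd(date), cwd=repo_dir, check=True,
        capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["git", "checkout", "--quiet", commit],
                   cwd=repo_dir, check=True)
    return subprocess.run(
        ["sloccount", "."], cwd=repo_dir, check=True,
        capture_output=True, text=True,
    ).stdout
```

Calling `sloc_report_at(path_to_clone, "2017-01-01")` for a handful of dates gives the same kind of point-in-time numbers.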

swyx · on Sept 29, 2023

ha, simple enough. thx

taktoa · on Jan 20, 2023

I'm pretty sure those numbers are for training, not inference. I've run it on _CPU_ and gotten ~1 token per second.

taktoa · on July 23, 2022

Spanish flu is an orthomyxovirus, not a coronavirus. There are some cold-causing coronaviruses though.

nicoburns · on July 23, 2022

My understanding is that we don’t know for sure what kind of virus caused Spanish flu (it occurring before our ability to analyse this kind of thing). What are you basing your assertion that it was an orthomyxovirus on?

doktorhladnjak · on July 23, 2022

Scientists have been able to retrieve genetic material from the remains of those who are confirmed to have died during that pandemic. They've effectively sequenced the genome even to know how it relates to other flu viruses https://www.news.vcu.edu/article/Genetic_sequencing_of_deadl...

You might be thinking of the 1889-90 "flu" pandemic which has been theorized to be from a coronavirus known now as OC43, but it's not certain.

taktoa · on July 18, 2022

The heat emitted by burning fossil fuels is completely irrelevant compared to the greenhouse impact. We burn about 11.7 gigatons of oil equivalent a year, which over 200 years would be 2.7 * 10^16 kWh. The greenhouse effect increase over the last 20 years leads to over 2.2 * 10^15 kWh/yr added to the planet, over 20x as much.
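The unit conversion can be reproduced in a few lines of Python. This is a sketch that takes the figures in the comment as given, uses the standard definition 1 tonne of oil equivalent = 41.868 GJ = 11.63 MWh, and, like the comment, treats consumption as constant at 11.7 Gtoe per year.

```python
KWH_PER_TOE = 11_630  # 1 toe = 41.868 GJ = 11.63 MWh

def combustion_heat_kwh(gtoe_per_year, years):
    """Heat released by burning fuel at a constant rate, in kWh."""
    return gtoe_per_year * 1e9 * KWH_PER_TOE * years

heat_200_years = combustion_heat_kwh(11.7, 200)   # about 2.7e16 kWh
heat_per_year = combustion_heat_kwh(11.7, 1)      # about 1.4e14 kWh

greenhouse_per_year = 2.2e15  # kWh/yr, the figure quoted in the comment
ratio_per_year = greenhouse_per_year / heat_per_year
```

With these inputs the per-year ratio comes out around 16, the same order of magnitude as the figure quoted above; the exact multiple depends on the conversion factor and the forcing estimate used.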