More

rafaelmn · 2026-04-15T09:37:11 1776245831

From what I understand you can have secure chats e2ee ? I like that I can login from multiple devices and continue the conversation. This was always annoying with whatsapp and signal. Worst case is mildly embarrassing stuff leaks.

lxgr · 2026-04-15T10:51:02 1776250262

> From what I understand you can have secure chats e2ee ?

Not with bots, though.

> I like that I can login from multiple devices and continue the conversation

This is also not possible with Telegram E2E, while it is with Signal and WhatsApp.

rafaelmn · 2026-04-13T15:22:43 1776093763

This should be the real benchmark of AI coding skills - how fast do we get safe/modern infrastructure/tooling that everyone agrees we need but nobody can fund the development.

If Anthropic wants marketing for Mythos without publishing it - show us servo contrib log or something like that. It aligns nicely with their fundamental infrastructure safety goals.

I'd trust that way more than x% increase on y bench.

Hire a core contributor on Servo or Rust, give him unlimited model access and let's see how far we get with each release.

mort96 · 2026-04-13T15:26:48 1776094008

We do not need vibe-coded critical infrastructure.

falcor84 · 2026-04-13T15:31:10 1776094270

As I see it, the focus should not be about the coding, but about the testing, and particularly the security evaluation. Particularly for critical infrastructure, I would want us to have a testing approach that is so reliable that it wouldn't matter who/what wrote the code.

jbvlkt · 2026-04-13T21:09:16 1776114556

I have been thinking about that lately and isn't testing and security evaluation way harder problem than designing and carefully implementing new features? I think that vibecoding automates easiest step in SW development while making more challenging/expensive steps harder. How are we suppose to debug complex problems in critical infrastructure if no one understands code? It is possible that in future agents will be able to do that but it feels to me that we are not there yet.

bawolff · 2026-04-13T15:49:01 1776095341

I dont think that will ever be possible.

At some point security becomes - the program does the thing the human wanted it to do but didn't realize they didn't actually want.

No amount of testing can fix logic bugs due to bad specification.

skrtskrt · 2026-04-13T17:25:42 1776101142

AI as advanced fuzz-testing is ridiculously helpful though - hardly any bug you can in this sort of advanced system is a specification logic bug. It's low-level security-based stuff, finding ways to DDOS a local process, or work around OS-level security restrictions, etc.

bawolff · 2026-04-13T19:25:31 1776108331

I'm kind of doubtful that AI is all that great at fuzz testing. Putting that aside though, we are talking about web browsers here. Security issues from bad specification or misunderstanding the specification is relatively common.

thephyber · 2026-04-13T19:03:19 1776106999

Re-read the thread you are replying to.

Each of the last 4 comments in your thread (including yours) are conflating what they mean by AI.

skrtskrt · 2026-04-14T19:46:20 1776195980

You must be lost.

falcor84 · 2026-04-13T16:20:55 1776097255

Well, yes, agreed - that is the essential domain complexity.

But my argument is that we can work to minimize the time we spend on verifying the code-level accidental complexity.

bawolff · 2026-04-13T17:21:04 1776100864

Sure, but that is what we've been doing since the early 2000s (e.g. aslr, read only stacks, static analysis, etc).

And we've had some succeses, but i wouldn't expect any game changing breakthroughs any time soon.

mort96 · 2026-04-13T15:33:58 1776094438

I disagree. Thorough testing provides some level of confidence that the code is correct, but there's immense value in having infrastructure which some people understand because they wrote it. No amount of process around your vibe slop can provide that.

px43 · 2026-04-13T15:56:40 1776095800

That's just status quo, which isn't really holding up in the modern era IMO.

I'm sure we'll have vibed infrastructure and slow infrastructure, and one of them will burn down more frequently. Only time will tell who survives the onslaught and who gets dropped, but I personally won't be making any bets on slow infrastructure.

falcor84 · 2026-04-13T16:18:34 1776097114

I somewhat agree, but even then would argue that the proper level at which this understanding should reside is at the architecture and data flow invariants levels, rather than the code itself. And these can actually be enforced quite well as tests against human-authored diagrammatical specs.

t43562 · 2026-04-13T16:38:10 1776098290

If you don't fully understand the code how do you know it implements your architecture exactly and without doing it in a way that has implications you hadn't thought of?

As a trivial example I just found a piece of irrelevant crap in some code I generated a couple of weeks ago. It worked in the simple cases which is why I never spotted it but would have had some weird effects in more complicated ones. It was my prompting that didn't explain well enough perhaps but how was I to know I failed without reading the code?

jbvlkt · 2026-04-13T21:46:32 1776116792

Exactly. We do not have another artifact than code which can be deterministically converted to program. That is reason we have to still read the code. Prompt is not final product in development process.

mort96 · 2026-04-13T16:35:22 1776098122

I disagree. The code itself matters too.

irishcoffee · 2026-04-14T02:25:44 1776133544

Who is writing the tests? An LLM? If so, they have little value.

teaearlgraycold · 2026-04-13T17:54:38 1776102878

Well if the big players want to tell me their models are nearly AGI they need to put up or shut up. I don't want a stochastically downloaded C compiler. I want tech that improves something.

rl3 · 2026-04-13T22:31:42 1776119502

>> ...give him unlimited model access

>We do not need vibe-coded critical infrastructure.

I think when you have virtually unlimited compute, it affords the ability to really lock down test writing and code review to a degree that isn't possible with normal vibe code setups and budgets.

That said for truly critical things, I could see a final human review step for a given piece of generated code, followed by a hard lock. That workflow is going to be popular if it already isn't.

mort96 · 2026-04-13T22:45:58 1776120358

The availability or lack thereof of compute has absolutely nothing to do with my opinion. More vibe coded tests doesn't fix the problem.

rl3 · 2026-04-13T22:52:08 1776120728

It might when an individual function has 50 different models reviewing it, potentially multiple times each.

Perhaps part of a complex review chain for said function that's a few hundred LLM invocations total.

So long as there's a human reviewing it at the end and it gets locked, I'd argue it ultimately doesn't matter how the code was initially created.

There's a lot of reasons it would matter before it gets to that point, just more to do with system design concerns. Of course, you could also argue safety is an ongoing process that partially derives from system design and you wouldn't be wrong.

It occurred to me there's some recent prior art here:

https://news.ycombinator.com/item?id=47721953

It's probably fair to say the Linux kernel is critical infra, or at least a component piece in a lot of it.

mort96 · 2026-04-13T23:07:28 1776121648

I do not care how strong your vibes are and how many claudes you have producing slop and reviewing each others' slop. I do not think vibe coding is appropriate for critical infrastructure. I don't understand why you think telling me you'd have more slop would make me appreciate it more.

rl3 · 2026-04-13T23:22:26 1776122546

Fair enough. I respect the commitment to purity.

In the not so distant future you'll probably be one of the few who haven't had their actual coding skills atrophy, and that's a good thing.

mort96 · 2026-04-13T23:27:56 1776122876

A terrifying thought but not implausible. IMO, the world needs more people with a deep understanding of how stuff works, but that's not the direction we're moving in.

rafaelmn · 2026-04-13T15:46:43 1776095203

If you're trusting core contributors without AI I don't see why you wouldn't trust them with it.

Hiring a few core devs to work on it should be a rounding error to Anthropic and a huge flex if they are actually able to deliver.

mort96 · 2026-04-13T16:36:27 1776098187

I trust people to understand the code they write. I don't trust them to understand code they didn't write.

weregiraffe · 2026-04-14T05:32:20 1776144740

So you don't trust projects with more than one author? By definition, they'd have to understand each other's code.

mort96 · 2026-04-14T06:49:12 1776149352

Different people can understand different parts of the code.

t43562 · 2026-04-13T16:42:46 1776098566

It's extremely tempting to write stuff and not bother to understand it similar to the way most of us don't decompile our binaries and look at the assembler when we write C/C++.

So, should I trust an LLM as much as a C compiler?

jddj · 2026-04-13T18:41:37 1776105697

What if it impairs judgement?

scrame · 2026-04-13T15:58:24 1776095904

Unfortunately we're going to get it whether or not we need it.

andai · 2026-04-13T18:04:35 1776103475

They're getting really good at proofs and theorems, right?

IshKebab · 2026-04-13T20:07:22 1776110842

Proofs/theorems and memory safety vulnerabilities are a special case because there's an easy way to verify whether the model is bullshitting or not.

That's not true for coding in general. The best you can do is having unreasonably good test coverage, but the vast majority of code doesn't have that.

nicoburns · 2026-04-13T16:00:53 1776096053

> show us servo contrib log or something like that

Servo may not be the best project for this experiment, as it has a strict no-AI contributions allowed policy.

beepbooptheory · 2026-04-13T18:24:45 1776104685

Oh good, I was worried for a sec that people wouldn't be talking about AI in this thread.

Night_Thastus · 2026-04-13T18:54:07 1776106447

The problem with such infrastructure is not the initial development overhead.

It's the maintenance. The long term, slow burn, uninteresting work that must be done continually. Someone needs to be behind it for the long haul or it will never get adopted and used widely.

Right now, at least, LLMs are not great at that. They're great for quickly creating smaller projects. They get less good the older and larger those projects get.

rafaelmn · 2026-04-13T19:08:48 1776107328

I mean the claim is that next generation models are better and better at executing on larger context. I find that GPT 5.4 xhigh is surprisingly good at analysis even on larger codebases.

https://x.com/mitchellh/status/2029348087538565612

Stuff like this where these models are root causing nontrivial large scale bugs is already there in SOTA.

I would not be surprised if next generation models can both resolve those more reliability and implement them better. At that point would be sufficiently good maintainers.

They are suggesting that new models can chain multiple newly discovered vulnerabilities into RCE and privilege escalations etc. You can't do this without larger scope planning/understanding, not reliabily.

andai · 2026-04-13T18:02:13 1776103333

Replicating Chromium as a benchmark? ;)

Replicating Rust would also be a good one. There are many Rust-adjacent languages that ought to exist and would greatly benefit mankind if they were created.

dabinat · 2026-04-13T17:56:27 1776102987

The true solution to this is to fund things that are important, especially when billion-dollar companies are making a fortune from them.

manx · 2026-04-13T15:32:48 1776094368

Agreed. Which other software does society need badly?

raincole · 2026-04-13T19:26:21 1776108381

Perhaps, you know, not every thing, especially not every thread on HN, has to be about AI?

I read the link twice and no AI or LLM mentioned. I don't know why people are so eager to chime in and try to steer the conversation towards AI.

rafaelmn · 2026-04-14T11:48:35 1776167315

Someone in the thread said they vibe coded something trivial so I just made the connection. I'd like to see Servo get to full browser status. I don't think they have the resources to do it. Anthropic is virtue signaling about their commitment to security in foundational software. Seems like a perfect match - even if Servo won't take it upstream - other companies spent hundreds of millions on Firefox/Chromium skins - Anthropic could ship their OSS browser based on Servo and showcase how effective their models are at coding. Hiring a few core contributors and giving them model access should be cheap in comparison to ARC acquisition and such. Will echo way louder than toy C compilers and benchmaxxing.

rafaelmn · 2026-04-12T15:49:57 1776008997

I'm in the same camp but I mostly do backed. My coworker doing frontend is chewing through rate limits consistently. React code is quite logic shallow, stuff gets pulled in all over so not localized, especially when you start using js styling frameworks - hundreds of k of tokens to do simple changes.

If you start to parallelize and you have permission prompts on you're likely missing cache windows as well.

rafaelmn · 2026-04-11T21:28:00 1775942880

I've heard plenty of anecdotes of people well off financially getting psychologically distressed after a layoff so I don't think it's purely financial.

cortesoft · 2026-04-11T22:01:29 1775944889

Sure, I am certain there are some people who feel that way.

The person I was directly responding to was talking about people who faced both money worries and identity struggles. I think a good portion of those people are likely mostly being affected by the financial worries, and won't feel better until that is resolved.

rafaelmn · 2026-04-11T08:29:00 1775896140

That's based on a silly belief (that's becoming more obvious with AI, but is silly in general) : just because you can read about something it means you learned it.

Even if I gave you exact instructions on how to use even basic stuff like power tools - if you had no experience using stuff like grinders/saws/routers and I gave you full detailed instructions on how to do something non-trivial - you're more likely to cut off body parts than achieve what you intended. There's so much fundamental stuff that you must internalize subconsciously/through trial and error - before you can have enough mental capacity to think about the higher level objectives.

Actually AI demonstrates this perfectly - once they get RL harness for programming they start to get better at it. Without experimentation they can ingest all source code/tutorials/books in the world and still produce shit.

ben_w · 2026-04-12T22:47:30 1776034050

Aye, that's the kind of thing I had in mind for the difference between information and competency.

(It's also why I wrote that I know what a calutron is and not claimed that I could build one today; the "and a few years" was not decorative).

rafaelmn · 2026-04-09T12:34:34 1775738074

Does this work with CSS in JS stuff and CSS frameworks - like if I was using Chakra would this be able to edit the site elements and have the agent reverse map to where the style attributes need to go ?

SirHound · 2026-04-09T12:39:40 1775738380

Yes - if your agent can edit your codebase it will put the changes in the right place.

rafaelmn · 2026-04-09T11:01:16 1775732476

I'm still paying the 10$ GH copilot but I don't use it because :

  - context is aggressively trimmed compared to CC obviously for cost saving reasons, so the performance is worse
  - the request pricing model forces me to adjust how I work

Just these alone are not worth saving the 60$/month for me.

I like the VSCode integration and the MCP/LSP usage surprised me sometimes over the dumb grep from CC. Ironically VSCode is becoming my terminal emulator of choice for all the CLI agents - SSH/container access and the automatic port mapping, etc. - it's more convenient than tmux sessions for me. So Copilot would be ideal for me but yeah it's just tweaked for being budget/broad scope tool rather than a tool for professionals that would pay to get work done.

lbreakjai · 2026-04-09T11:26:21 1775733981

You can use your GH subscription with a different harness. I'm using opencode with it, it turns GH into a pure token provider. The orchestration (compacting, etc.) is left to the harness.

It turns it into a very good value for money, as far as I'm concerned.

rafaelmn · 2026-04-09T11:41:53 1775734913

But you still get charged per turn right ? I don't like that because it impacts my workflow. When I was last using it I would easily burn through the 10$ plan in two days just by iterating on plans interactively.

lbreakjai · 2026-04-09T12:08:14 1775736494

Honestly I'm not sure, I'm on my company's plan, I get a progress bar vaguely filling, but no idea of the costs or billing under the hood.

sourcecodeplz · 2026-04-09T15:36:42 1775749002

But you still get the reduced context-window.

briHass · 2026-04-09T11:17:40 1775733460

Disagree entirely.

GHCP at least is transparent about the pricing: hit enter on a prompt= one request. CC/Codex use some opaque quota scheme, where you never really know if a request will be 1,2,10% of your hourly max, let alone weekly max.

I've never seen much difference with context ostensibly being shorter in GHCP, all of the models (in any provider) lose the thread well before their window is full, and it seems that aggressive autocompaction is a pretty standard way to help with that, and CC/Codex do it frequently.

rafaelmn · 2026-04-09T11:38:56 1775734736

>I've never seen much difference with context ostensibly being shorter in GHCP, all of the models (in any provider) lose the thread well before their window is full, and it seems that aggressive autocompaction is a pretty standard way to help with that, and CC/Codex do it frequently.

Then we've had wildly different results. Running CC and GH copilot with Opus 4.6 on same task and the results out of CC were just better, likewise for Codex and GPT 5.4. I have to assume it's the aggressive context compaction/limited context loading because tracking what copilot does it seems to read way less context and then misses out on stuff other agents pick up automatically.

rafaelmn · 2026-04-08T17:16:22 1775668582

> OneOrMoreOrNone

So IEnumerable<T> ? What's up with wrapping everything into fancy types just to arrive at the exact same place.

rafaelmn · 2026-04-08T15:08:39 1775660919

And underenginered at the same time !

rafaelmn · 2026-04-07T18:56:26 1775588186

GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase - ignore all the conventions, surrounding context, just slop all over the place to get it working. Claude is just a level above in terms of editing code.

sho_hn · 2026-04-07T19:22:56 1775589776

Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like is explicitly that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere no rational dev would write that way.

Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

It's annoying, too, because I don't much like OpenAI as a company.

(Background: 25 years of C++ etc.)

boring-human · 2026-04-07T21:09:54 1775596194

Same background as you, and same exact experience as you. Opus and Gemini have not come close to Codex for C++ work. I also run exclusively on xhigh. Its handling of complexity is unmatched.

At least until next week when Mythos and GPT 6 throw it all up in the air again.

Jcampuzano2 · 2026-04-07T19:03:12 1775588592

Not my experience. GPT 5.4 walks all over Claude from what I've worked with and its Claude that is the one willing to just go do unnecessary stuff that was never asked for or implement the more hacky solutions to things without a care for maintainability/readability.

But I do not use extra high thinking unless its for code review. I sit at GPT 5.4 high 95% of the time.

camdenreslink · 2026-04-07T22:05:55 1775599555

ChatGPT 5.4 with extra high reasoning has worked really well for me, and I don't notice a huge difference with Opus 4.6 with high reasoning (those are the 2 models/thinking modes I've used the most in the last month or so).

leobuskin · 2026-04-07T19:02:27 1775588547

And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), even when 5.4 gives a little bit better guesses (rarely, but happens) - it takes x2-x4 longer. So, it’s just easier to reiterate with Opus

blazespin · 2026-04-07T20:48:53 1775594933

Yeah, need some good RE benchmarks for the LLMs. :)

RE is very interesting problem. A lot more that SWE can be RE'd. I've found the LLMs are reluctant to assist, though you can workaround.

porker · 2026-04-07T21:02:50 1775595770

What is RE in this context?

astrange · 2026-04-07T21:16:48 1775596608

Reverse engineering

aizk · 2026-04-08T01:57:48 1775613468

I've been messing with using Claude, Codex, and Kimi even for reverse engineering at https://decomp.dev/ it's a ton of fun. Great because matching bytes is a scoring function that's easy for the models to understand and make progress on.

adamgoodapp · 2026-04-08T07:34:04 1775633644

I want to get into RE with AI. Which model you liking the most?

rustystump · 2026-04-08T15:20:53 1775661653

This. People drastically underestimate how much more useful a lightning fast slightly dumb model is compared to a super smart but mega slow model is. Sure, u may need to bust out the beef now and then. However, the overwhelming majority of work the fast stupid model is a better fit.

19h · 2026-04-07T23:53:49 1775606029

Mind sharing the use cases you're using IDA via MCP for?

zarzavat · 2026-04-07T19:05:03 1775588703

Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.

chaos_emergent · 2026-04-07T19:32:49 1775590369

An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.

zarzavat · 2026-04-08T11:50:07 1775649007

Alignment is a subspace of capability. Feeling good is nice, but it's also a manifestation of the level that the model can predict what I do and don't want it to do. The more accurately it can predict my intentions without me having to spell them out explicitly in the prompt, the more helpful it is.

GPT-5 is good at benchmarks, but benchmarks are more forgiving of a misaligned model. Many real world tasks often don't require strong reasoning abilities or high intelligence, so much as the ability to understand what the task is with a minimal prompt.

Not every shop assistant needs a physics degree, and not every physics professor is necessarily qualified to be a shop assistant. A person, or LLM, can be very smart while at the same time very bad at understanding people.

For example, if GPT-5 takes my code and rearranges something for no reason, that's not going to affect its benchmarks because the code will still produce the same answers. But now I have to spend more time reviewing its output to make sure it hasn't done that. The more time I have to spend post-processing its output, the lower its capabilities are since the measurement of capability on real world tasks is often the amount of time saved.

lilytweed · 2026-04-07T19:17:34 1775589454

Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, sloppishness, have been trained away by the other, more human-feeling models at this point.

kranke155 · 2026-04-07T21:59:00 1775599140

GPT was clearly changed after its sycophantic models lead to the lawsuits.

josephg · 2026-04-07T23:10:10 1775603410

It still has a very ... plastic feeling. The way it writes feels cheap somehow. I don't know why, but Claude seems much more natural to me. I enjoy reading its writing a lot more.

That said, I'll often throw a prompt into both claude and chatgpt and read both answers. GPT is frequently smarter.

kranke155 · 2026-04-08T10:14:34 1775643274

GPT is more accurate. But Claude has this way of association between things that seems smarter and more human to me.

whalesalad · 2026-04-07T18:59:57 1775588397

This has been my experience. With very very rigid constraints it does ok, but without them it will optimize expediency and getting it done at the expense of integrating with the broader system.

ctoth · 2026-04-07T19:35:51 1775590551

My favorite example of this from last night:

Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!