Hacker News | Topfi's comments

Title was slightly editorialized by adding the word "AI" to get the point across.

> Bing AI - Acquired by Microsoft.

> Microsoft's Bing search engine with AI-enhanced features The product has since been folded into Microsoft; visitors to the original URL are now redirected to copilot.microsoft.com.

What? Besides the fact that Bing was always a MSFT product, the LLM assisted search feature on Bing is still separate [0] from copilot.microsoft.com. At most it was a rename, though Copilot on the MSFT side is different from the one on Bing, is different from the one relying on your local NPU, is different from the one on GitHub... Great branding.

Even if the content was unreviewed LLM slop, I'd be hard pressed to find a model that outputs that Bing was bought by MSFT when at no point were the two separate.

Also, missing some of the greatest failures like Bard, Dia Browser, Sora, etc.

[0] https://www.bing.com/copilotsearch


Coming from someone who invested after it dropped by almost half (and never saw any value in Twitter to begin with once it stopped being an SMS platform first and foremost), the reason that happened was of his own making. And fortunately, one cannot make a binding agreement to take a company private, go around trashing its value with unfounded, incorrect, nonsensical statements, and just back out of the deal. Consistency in enforcement makes Delaware pretty appealing for companies.

And yes, I was well compensated and must thank him personally. Funded an apartment renovation with the profits, got out a bit early at slightly below $51 per share. I cannot, however, say whether this settlement is justifiable; the fact that another party profited does not generally justify other harm. But I am not knowledgeable in regards to the legal situation, so happy to hear what the legal grounding here was.


Welch Labs tends to do an amazing job explaining topics with visual representation, so I feel a purely written article would lose a lot of context and be far harder to parse. Additionally, at 30mins (minus sponsor segment), covering multiple fields and interspersing an interview with LeCun, it's honestly so information dense that pure text likely wouldn't be faster.

I bet it can be done in an article that doesn’t take more than 10mins to read.

For what it's worth, here are some of my experiences, as recently I have had some major deviations from what I've come to expect:

1.) Opus 4.7 via the API is great. Unlike 4.6, I have found the model to degrade far less beyond 120k; even 600k can be relied upon. Task inference, task evaluation, task adherence, tool calling all do very well on my evals. I did, however, end my Claude Max subscription for the first time in a while because, after their post-mortem [0], I saw true, reproducible, incredibly frustrating regressions in model output when using Claude Code.

Yes, this was after their post in the last week of April 26, and yes, I have been fortunate enough never to have been affected by regressions up to this point. The model via API with other harnesses provides consistent, useful and high quality output, but the recent changes have become an avalanche of "this requires more than two changes so we should table this for later" and "it seems the subagent finding was wrong and this is not actionable", with a healthy mix of suggestions that are clearly there to save tokens but go against clear instructions.

I understand that they are compute constrained, but as someone who until recently never maxed out their weekly limits and nearly never their 5 hour limits on the Max 5x plan, these changes are not just frustrating (and make reasonable users think the model was nerfed rather than the harness) but also cost more, as I now have to prompt four times and spend thousands of tokens more on a task that the same harness with the same model previously did far more efficiently. I regularly check the numbers and yes, by trying to be more efficient, they have made what I cost them far higher, going beyond what I pay for the subscription. Ironically, and I must emphasise this, I did not have regressions before, which suggests some major luck in A/B testing at least.

2.) GPT-5.5 is amazing, a true jump I have not seen since GPT-5, and far more than even GPT-5.4 it is approaching the way Anthropic models have handled task inference, which has also led to far reduced reasoning needs. I very much like it, with the exception of the reduced context window and the degradation in compaction. GPT-5.4 did compaction so consistently well that the 272k standard window before the price increase was of no concern and going beyond it was reliably possible. With GPT-5.5, the cost per token is doubled and compaction is far less reliable, leading to loss of task adherence and preventing task completion in certain cases. I am aware GPT-5.5 is a new pretrain (though how new, given frontend output is still abhorrently poor and has been since Horizon Alpha, which I maintain was worse than GPT-4.1) and am hopeful they can integrate some of the solutions they were leveraging for GPT-5.4's compaction, but until then it remains a model great for very challenging and complex blockers, not a GPT-5.4 drop-in replacement.

3.) Kimi K2.6 is great for the API price: efficient, fast and does very well on all my metrics. I very much like it, far more so than Deepseek V4 Pro, any Qwen, Z.AI or Meta model, and I truly am impressed. Composer 2 has shown how you can take the base even further given the right data, and if I had to pay exclusively API pricing without any subscriptions, I think I'd have no problem leaning on K2.6 for most needs. It is what I'd love to see from Mistral or Apple, and it shows that as an open weight company one can't just succeed in a few narrow areas (Z.AI with tool calling, Deepseek with world knowledge, Mistral with being European, etc.) but must provide a balanced product across all areas. I just wish they'd expose Agent Swarms via the API; there are a few experiments I'd like to try.

[0] https://www.anthropic.com/engineering/april-23-postmortem


Pricing by context length:

Input: $5/M tokens at <=272K, $10/M tokens above 272K.

Output: $30/M tokens at <=272K, $45/M tokens above 272K.

Cache read: $0.50/M tokens at <=272K, $1/M tokens above 272K.
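To make the tiering concrete, here is a minimal cost sketch. The rates and the 272K threshold come from the list above; how the provider bills at the boundary is an assumption on my part (I assume the higher rate applies to the whole request once the prompt crosses 272K, rather than only to the overflow tokens):

```python
# Hypothetical calculator for the tiered pricing listed above.
# Assumption: once the prompt exceeds 272K tokens, the higher per-token
# rate applies to the entire request (the boundary behavior is a guess,
# not confirmed by the pricing table).

THRESHOLD = 272_000

# (at-or-below-threshold rate, above-threshold rate) in $ per million tokens
RATES = {
    "input":      (5.00, 10.00),
    "output":     (30.00, 45.00),
    "cache_read": (0.50, 1.00),
}

def request_cost(input_tokens, output_tokens, cache_read_tokens=0):
    """Return the cost in dollars for a single request."""
    tier = 1 if input_tokens > THRESHOLD else 0
    counts = {
        "input": input_tokens,
        "output": output_tokens,
        "cache_read": cache_read_tokens,
    }
    return sum(counts[k] * RATES[k][tier] / 1_000_000 for k in counts)

# A 200K-token prompt with 8K output stays in the cheaper tier:
print(f"{request_cost(200_000, 8_000):.2f}")   # 200K * $5/M + 8K * $30/M = $1.24
# A 300K-token prompt with the same output is billed at the higher rates:
print(f"{request_cost(300_000, 8_000):.2f}")   # 300K * $10/M + 8K * $45/M = $3.36
```

Under that assumption, a long-context request costs well over double a just-under-threshold one, which is what drives the comparison below.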

Significantly more expensive than Opus 4.7 beyond 272K, and at least in my tasks I haven't seen the model be that much more token efficient, certainly not to such a degree that it'd compensate for this difference. GPT-5.4 had a solid context window at 400k with reliable compaction; both appear somewhat regressed, though it is still too early to truly say whether compaction is less reliable. Also, I have found frontend output to still skew towards that one very distinct, easily noticeable, card-laden, blue-hued, overindulgent template that made me skeptical of Horizon Alpha/Beta pre GPT-5's release. It ended up doing amazingly at the time on task adherence, which made it very useful for me outside that one major deficit. The fact that GPT-5.5 is still so restricted in that area is weird considering it's supposed to be an entirely new foundation.


Endpoint Detection and Response?

Heck, not giving the person admin privileges would have sufficed to prevent this. Or better hiring practices that screen out people who install Roblox cheats on work devices...

There is no excuse and no fine line here. Even outside them boasting about SOC 2 Type II, this would be embarrassing for an SME not in the tech sector.


OP was talking about the security team. Not sure what you are proposing?

Do you want to let any applicant be screened by the security team?


Any security team that gives unrestricted admin privileges to random employees is not a security team. So doing the most basic parts of their job, that would be my proposal.

If this was specific to my hiring comment, it was meant a bit facetiously, though I will point out this line in their "compliance" report by "auditor" Delve:

> The organization carries out background and/or reference checks on all new employees and contractors prior to joining in accordance with relevant laws, regulations and ethics. Management utilizes a pre-hire checklist to ensure the hiring manager has assessed the qualification of candidates to confirm they can perform the necessary job requirements.

Maybe those pre-hire checklists should include a question like "Are you a massive idiot, who'd install a game on their work computer, then on top of that be the type of idiot who likes to cheat, then on top of that be the type of idiot to install cheats on your work computer?", maybe that'd prevent this in the future. Or again, just don't give everyone Admin privileges...


I think one of us misunderstood how the event happened.

In my understanding, restricting local admin rights would not have changed anything here.

The Vercel employee signed up for Context.ai (a third-party tool) using their work account and granted it "Allow All" access to their environment.

Maybe Admin-Managed Consent would have helped prevent Context.ai from accessing the environment, but this is not configured locally on the employee's machine.

It is a cloud-level setting managed within their identity provider's administrative portal.


Just an addition to the prior comment: To be as generous as possible, I just pulled their audit report [0] and, to answer your question, all I propose is that they stick to it (especially the part on minimum permissions, and that any extended permissions need to be reasonable and justified), which they did not. The fault lies threefold:

First of all, with the team members at Context.ai, who either weren't experienced enough or did not care enough to know that the "all green" they got from Delve straight away couldn't have been accurate.

Secondly, with the people at Delve who, at least in this isolated case, seem not to have fulfilled their obligations and are suspected of having done so in a consistent, repeated and intentionally malicious manner.

Thirdly, with the people who, despite claiming to have done their due diligence, being experienced investors and professionals in the field whose own prior companies also had to undergo audits in the past, looked at Delve and were willing to overlook the misdeeds for financial gain.

[0] https://news.ycombinator.com/item?id=47848077


Odd, they used Delve [0] and a SOC2 compliant company like Context.ai [1] should have an AUP, EDR, etc. that prevents their employees from installing a Roblox cheat on their work computer. Heck, even outside SOC2, I have never worked at a company without endpoint restrictions to prevent unauthorised installs.

It's almost like the denials were in fact false and Delve truly was just selling a sticker, not providing an actual service.

If I were a VC that had funded Delve for a considerable amount of time, I'd be embarrassed that we did not catch that. I'd probably rework my processes, publicly analyse how this alleged fraud got past me and go above and beyond in disclosing my findings to rebuild trust. I'd most certainly not think that just cutting funding is sufficient given the situation, even more so if I'd encouraged other companies funded by me to use their "services".

I'd maybe even reevaluate whether a circular approach, wherein our funded companies are incentivised to rely on other companies also funded by us, leads to the best options being chosen, and whether that isn't antithetical to a forward-thinking environment and competition. At the same time, I'd also consider that maybe such a setup just hides unsuccessful companies, and potentially even alleged fraud, which may cause significant harm once it gets to the broader market...

[0] https://web.archive.org/web/20250918025724/https://trust.del...

[1] https://web.archive.org/web/20260217220817/https://www.conte...


K2.6-code-preview was a minor but noticeable jump, especially in a long running testing task, and prior Moonshot releases have been the only models that I'd consider a suitably competitive replacement for Anthropic models. The way they approach tool calls, task inference and adherence is far closer than any other provider's output, similar to how GLM models map far more closely to OpenAI's releases. Whether task adherence, task assessment, task evaluation or task inference, K2.5 got closer to Opus 4.5 than any other model (but was still behind overall).

I will have to test this full release of K2.6 but could see it serve as a very good overall drop-in replacement for Opus 4.5 and Opus 4.6 at 200k across the vast majority of tasks.

I will say however that Opus 4.7 Max 1M has been a very significant jump in performance for me, especially in tasks beyond 120k token where I'd argue it is now the most reliable model in continued task adherence and tool calling without compaction. Ironically, my initial experience was less than pleasant as on XHigh I found task adherence to have regressed even with less than 1/10th of the context window having been used.

Am very interested in K2.6's compaction strategy (which appears to be very simple, all things considered) and how it performs beyond 100k tokens. As it stands, only OpenAI models have made compaction for long running tasks work well, though overall, GPT-5.4 is still inferior in my tests to other models such as Opus 4.6 1M and Opus 4.7 1M, regardless of context window. Haven't gotten around to testing Opus 4.7 200k and will have to do so to properly assess K2.6 fairly, but I'd be very surprised if K2.6 truly beat Opus 4.7 200k given the jump I have experienced.
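For readers unfamiliar with what "compaction" means in these harnesses, here is a generic sketch of the simplest form of the technique: when the running transcript approaches the context budget, the oldest turns are replaced with a single summary message. Everything here is illustrative; this is not any specific vendor's implementation, and a real harness would have the model itself produce the summary rather than the placeholder used below.

```python
# Generic sketch of transcript compaction for long running agent tasks.
# All names and the summarization strategy are illustrative assumptions.

def rough_token_count(messages):
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def default_summary(turns):
    # Placeholder summarizer: a real harness would ask the model to
    # summarize the dropped turns instead of truncating them.
    return "\n".join(m["content"][:80] for m in turns)

def compact(messages, budget, keep_recent=4, summarize=default_summary):
    """Compact `messages` to move toward `budget` tokens, keeping the
    system prompt and the most recent turns verbatim."""
    if rough_token_count(messages) <= budget:
        return messages  # nothing to do
    system, rest = messages[:1], messages[1:]
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {
        "role": "user",
        "content": "[Summary of earlier turns]\n" + summarize(old),
    }
    return system + [summary] + recent
```

The quality differences discussed above come down to how much task-relevant detail survives the `summarize` step; a lossy summary is exactly what causes the loss of task adherence mentioned for GPT-5.5.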


Am very much the same, took a bunch of repos private two years ago for a multitude of reasons. I can, however, see why having no public repos could be a partial indicator and of concern in conjunction with sudden star growth, simply because it is hard for a person with no prior project to suddenly and publicly strike gold. Even on Youtube it is a rare treat to stumble across a well made video by a small channel, and without algos to surface repos on Github in the same way, any viral success from a previously inactive account should be treated with some suspicion. Same goes the other way: if you have never made any PRs, etc., sudden engagement is a bit odd.


I think they're using it as a signal for the accounts doing the starring, not the account being starred...

