His affiliates, armed with AI, built fake doctor profiles in Meta ads and made unscrupulous claims about weight loss using fake testimonials.
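Most people assume AI mainly boosts productivity and creativity, but the author shows how it is being used for deception and exploitation at scale, from fake doctor profiles to false advertising. This counterintuitive view exposes the dark side of AI, challenges optimistic assumptions about its value, and is a reminder of the ethical questions behind claims of technological neutrality.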
The cost of understanding what happens in a video has dropped by a factor of roughly 40, while the quality of that understanding has improved dramatically.
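Most people assume AI video analysis is still early-stage and expensive, but the author points out that its cost has fallen roughly 40-fold while quality has improved. This counterintuitive view suggests video analysis may already have crossed the threshold of practicality and will give rise to entirely new categories of applications, challenging conventional assumptions about what AI can do with video.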
Historically, AI evaluation has leaned toward the forest approach. Most researchers settle for 1 to 5 raters per item, assuming this is enough to find a single 'correct' truth.
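Most people treat the status quo in AI evaluation as reasonable, on the assumption that 1 to 5 raters are enough to find a single 'correct' truth, but the author argues that this assumption ignores the natural disagreement in human judgment. The critique challenges current practice in AI evaluation and suggests that many research conclusions may rest on inadequate data collection, so the reliability of evaluation methods needs to be re-examined.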
Reconstructing raw inputs forces models to model irrelevant low-level detail. Predicting in a learned embedding space allows the model to focus on semantically meaningful, causally relevant features.
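Most people assume an AI model must reconstruct its full input to understand the world, but the author argues that this forces the model to attend to irrelevant low-level detail. Predicting in an embedding space instead lets the model focus on semantically meaningful, causally relevant features, which is a counterintuitive insight.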
Whether or not this specific bet pays off, the underlying argument that the next meaningful leap in AI capability requires moving beyond language modeling is increasingly hard to dismiss.
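Although language models currently dominate AI, the author argues that the language-modeling paradigm has reached its limits and that real progress requires moving beyond it. This runs against the industry mainstream and suggests we may be at a turning point between AI paradigms.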
AMI Labs is not building a product for immediate deployment. This is a fundamental research effort, likely measured in years before commercial applications emerge.
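In an environment where AI startups chase fast monetization, the author notes that AMI Labs is doing fundamental research rather than product development. This runs counter to the business model of most AI startups and suggests that real breakthroughs require long-term investment rather than short-term commercial considerations.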
LLMs have no grounded understanding of the physical world. They model the statistical distribution of language about reality, not reality itself.
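Most people assume large language models understand reality by learning knowledge about the physical world, but the author argues that they only learn the statistical distribution of textual descriptions of reality, not reality itself. The view is counterintuitive because it challenges common beliefs about what AI understands.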
Whether or not this specific bet pays off, the underlying argument that the next meaningful leap in AI capability requires moving beyond language modeling is increasingly hard to dismiss.
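Most people expect AI to keep advancing along the language-model path, but the author argues that a real breakthrough requires moving beyond the language-modeling paradigm. This challenges the mainstream narrative of AI progress and suggests the direction of the field needs a fundamental rethink.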
The clustering of capital and talent around this problem is itself a signal. The applications that most clearly benefit from world models are those where LLMs have struggled most.
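Most people assume capital and talent should concentrate where current AI performs best, but the author argues that world models are advancing precisely because LLMs perform poorly in key areas. This challenges mainstream thinking about resource allocation and suggests that breakthroughs may come from fixing the weaknesses of existing systems.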
AMI Labs is not building a product for immediate deployment. This is a fundamental research effort, likely measured in years before commercial applications emerge.
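In today's AI environment, with its push for rapid commercialization, most people expect research to turn into products quickly. The author points out that AMI Labs is doing fundamental research rather than building a product directly, which challenges the tech industry's expectation of immediate commercialization and underlines the importance of basic research.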
LLMs have no grounded understanding of the physical world. They model the statistical distribution of language about reality, not reality itself.
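Most people assume large language models understand reality by learning knowledge about the physical world, but the author argues that LLMs only learn the statistics of text about reality rather than gaining any direct understanding of reality itself. This challenges assumptions about the nature of LLM capability and implies a fundamental gap in what current AI systems understand.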
You have to have people that have the ability to rethink the workflow at a scale that AI can execute, versus at a scale that humans can execute.
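Most people assume AI should adapt to existing workflows, but the author argues the opposite: people need to redesign workflows around what AI can execute. This counterintuitive view stresses that successful AI adoption takes more than technology; it takes a fundamental shift in organizational thinking, from human-scale execution to AI-scale execution.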
95% of organizations are getting zero return on AI deployed, with most failures found due to 'brittle workflows.'
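Despite the surge in AI investment, the vast majority of companies have seen no return, which contradicts the mainstream belief that AI significantly improves efficiency. The finding suggests that the main cause of failed AI deployments is poor workflow design rather than the technology itself, and that companies need to rethink how AI is integrated into workflows instead of simply layering it on top.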
in 2024, 47% of AI solutions were built internally and 53% were purchased; today, 76% of all AI is purchased rather than developed in-house.
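Most people assume companies will increasingly build their own AI models to keep competitive advantage and control, but the data shows the opposite trend: companies are rapidly shifting toward buying third-party AI solutions. The shift suggests they may value fast deployment over in-house expertise, though it may also cost organizations their ability to understand and optimize core AI capabilities.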
You have to have people that have the ability to rethink the workflow at a scale that AI can execute, versus at a scale that humans can execute.
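Most people assume AI only needs to fit into existing workflows, but the author stresses that companies must redesign workflows around what AI can do. This challenges traditional thinking about technology rollouts and suggests that successful AI adoption requires rebuilding processes from the ground up rather than simply adding technology.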
95% of organizations are getting zero return on AI deployed, with most failures found due to 'brittle workflows.'
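Despite the surge in AI investment, the vast majority of companies have seen no return. This contrasts sharply with the mainstream belief that AI delivers significant benefits automatically, and it suggests that the main problem in failed deployments is poor workflow design rather than the technology itself, a counterintuitive finding.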
in 2024, 47% of AI solutions were built internally and 53% were purchased; today, 76% of all AI is purchased rather than developed in-house.
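Most people assume companies will increasingly build their own AI models to keep competitive advantage and control, but the data shows companies moving quickly toward buying third-party AI solutions. The trend runs against conventional wisdom and suggests companies may value fast deployment and cost-effectiveness over technological autonomy.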
You don't need a separate agent API. You need to look at every `input()` call, every CWD assumption, every pretty-printed-only output, and ask: what if the user on the other end is a process, not a person?
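Most people assume AI agents need a dedicated API or interface, but the author makes the counterintuitive case that no separate agent API is needed; existing CLI tools should instead be redesigned to serve both humans and agents. A unified approach is more efficient and avoids the complexity of maintaining two interfaces.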
Implicit state is the Enemy
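Most developers take implicit state such as the current working directory (CWD) and environment variables for granted, as a shortcut that speeds up development. The author argues that implicit state is the enemy because it trips up AI agents. Making all state explicit not only solves the agents' problem but also makes the tool more predictable and scriptable for humans.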
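One way to picture the "explicit state" idea is a helper that takes the project directory as an argument instead of reading the working directory. This is a minimal sketch, not the author's code: `resolve_config`, `project_dir`, and the `config.toml` filename are hypothetical names chosen for illustration.

```python
from pathlib import Path


def resolve_config(project_dir: Path, filename: str = "config.toml") -> Path:
    """Return the config path under an explicitly supplied project directory.

    Hypothetical helper: nothing here reads os.getcwd() or environment
    variables, so the result depends only on the arguments passed in.
    """
    return project_dir.resolve() / filename
```

A CLI built this way could expose the directory as a flag (for example a hypothetical `--project-dir`), so an agent launched from any location gets the same result as a human running the tool from inside the project.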
The funny part is that none of this made the CLI worse for humans. The TUI picker still works and looks fancy, progress spinners still spin, confirmation dialogs still confirm. We just added a second door.
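Most people assume adding AI agent support makes a tool more complex and worsens the experience for human users. The author argues that the agent-facing features did not harm the human experience; adding a 'second door' (a non-interactive interface) improved things for both groups of users.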
Every prompt is a flag in disguise
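Most developers see interactive prompts as good CLI user experience, but the author makes the counterintuitive case that every interactive prompt should have a flag equivalent. AI agents cannot handle interactive input, and turning every prompt into a flag not only supports agents but also makes the tool more programmable and testable.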
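The "every prompt is a flag in disguise" pattern can be sketched with Python's standard `argparse`. This is an illustrative sketch under assumptions, not the tool described in the quote: the program name `deploy-demo`, the `--env` flag, and the `resolve_env` helper are all hypothetical. The flag wins when present, the prompt is used only when a person is at a terminal, and a non-interactive caller gets a clear error instead of a hung `input()` call.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser in which the would-be prompt is available as a flag."""
    parser = argparse.ArgumentParser(prog="deploy-demo")  # hypothetical tool name
    parser.add_argument("--env", choices=["staging", "production"])
    return parser


def resolve_env(args: argparse.Namespace, interactive: bool, ask=input) -> str:
    """Pick the environment from the flag, falling back to a prompt for humans."""
    if args.env is not None:
        return args.env  # the flag always takes precedence over prompting
    if not interactive:
        # An agent or script is calling: fail fast rather than block on input().
        raise SystemExit("error: --env is required when stdin is not a terminal")
    return ask("Environment (staging/production): ").strip()
```

In a real entry point, `interactive` would typically come from `sys.stdin.isatty()`. Passing `ask` as a parameter keeps the prompt path testable without a terminal.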
Designing for agents forced us to build better tools for everyone.
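Most people assume designing tools for AI agents makes them more complex or harder for humans to use, but the author argues that it actually improves the experience for all users. The constraints agents impose (explicit parameters, no implicit state) happen to make tools more modular, scriptable, and testable, which benefits human developers too.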
The funny part is that none of this made the CLI worse for humans.
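Most people assume adding machine-readable interfaces (flags, JSON configuration) makes a tool less friendly to humans. The author argues that these agent-oriented features actually improve the human experience, because they make the tool more explicit, predictable, and composable rather than more complex.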
Designing for agents forced us to build better tools for everyone.
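Most people assume tools designed for AI agents target machines specifically and may sacrifice the human experience. The author argues that designing for agents improves the experience for everyone, because the constraints agents bring (explicit state management, predictable interfaces) also make tools friendlier and more scriptable for human developers.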
Consequently, they cannot verify if tools were actually invoked, applied correctly, or used efficiently.
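The mainstream view holds that if an AI model gives the right answer, its tool use must have been sound. The author points out sharply that existing evaluation methods cannot verify whether tools were actually invoked, applied correctly, or used efficiently. The argument challenges the field's reliance on outcome-only evaluation and suggests we may be overestimating what current AI systems can do, especially in tool use.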
Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks
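Most people assume today's most advanced multimodal models match or exceed human performance on complex tasks. The author's data shows that even the best model falls well short of expectations on complex real-world tasks, with accuracy dropping from 56.3% overall to 23.0% on Level-3 tasks. The finding challenges optimistic assessments of current capability and shows how hard real-world multimodal agent tasks are.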
However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers.
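Most people assume existing multimodal evaluations are comprehensive enough to measure agent capability. The author identifies fundamental flaws: they lack flexible tool integration, test visual and search tools separately, and look mainly at final answers rather than process. This challenges the consensus in AI evaluation and suggests we need to rethink how agent capability is actually measured.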
the inherent limitations of such a single-paradigm approach pose a fundamental challenge for existing models
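The author implies that mainstream LLM agent models have a fundamental architectural flaw, because they try to solve inherently different problems with a single paradigm. The argument challenges the community's confidence in existing methods and suggests that deeper architectural change, not incremental improvement, is needed.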
these two challenges are fundamentally distinct: the former relies on fuzzy semantic planning, while the latter demands strict logical constraints
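Mainstream AI research usually treats semantic planning and logical verification as problems that can be handled together, but the author states plainly that they are fundamentally different challenges. This runs counter to most current LLM agent methods and points to the limits of a single neural-network architecture.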
existing methods typically attempt to address both issues simultaneously using a single paradigm
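Most people assume long-horizon LLM agent problems call for a unified method that handles global progress and local feasibility at once, but the author argues the two challenges are different in kind: one relies on fuzzy semantic planning, while the other needs strict logical constraints and state verification. This separation challenges the mainstream paradigm in current AI research.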
computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments
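The author implies that extending from text generation to persistent tool use is a fundamental shift in the AI safety paradigm, and that current research underestimates the safety challenges it brings. This challenges the mainstream practice of applying language-model safety methods directly to agent systems and calls for a safety evaluation framework designed specifically for agent behavior.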
current systems remain highly vulnerable
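Despite notable progress in AI safety in recent years, the author asserts that current systems remain highly vulnerable. This conclusion, at odds with industry optimism, rests on hands-on testing of several mainstream agent systems and suggests AI safety problems may be far more serious than the industry admits.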
intermediate actions that appear locally acceptable but collectively lead to unauthorized actions
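Most people assume AI safety problems come mainly from obviously harmful instructions, but the author reveals a counterintuitive pattern: intermediate steps that look harmless locally can combine into unauthorized behavior. This challenges the traditional focus on directly harmful actions in safety evaluation and shows why sequences of agent actions need to be evaluated.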
model alignment alone does not reliably guarantee the safety of autonomous agents.
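Most people see model alignment as the key to keeping AI systems safe, but the author shows experimentally that even computer-use agents built on well-aligned models (Claude Code, for example) reach an attack success rate as high as 73.63%. This challenges a core assumption in AI safety and shows that alignment alone cannot solve the safety problem for autonomous agents.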
computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments
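The mainstream view holds that text-only language models and computer-use agents face essentially the same safety challenges, so text safety measures only need extending. The author points out that computer-use agents add entirely new dimensions, including persistent state, tool use, and execution environments, which create safety challenges quite different from those of text-only systems. This undercuts the assumption that safety simply carries over.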
current systems remain highly vulnerable
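Despite notable progress in AI safety research, the author uses the AgentHazard benchmark to show that today's most advanced computer-use agent systems remain extremely vulnerable, which challenges the common belief in academia and industry that AI safety is already good enough.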
intermediate actions that appear locally acceptable but collectively lead to unauthorized actions
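Most people assume the safety risk of AI agents comes mainly from directly executing harmful instructions, but the author finds that the real threat lies in intermediate steps that look entirely reasonable locally yet lead to unauthorized behavior overall. This locally reasonable, globally harmful pattern is a key risk that current safety evaluations overlook.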
harmful behavior may emerge through sequences of individually plausible steps
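The mainstream view holds that harmful AI behavior usually stems from obviously unreasonable instructions, but the author points out that dangerous behavior often emerges gradually through a series of plausible-looking steps, each acceptable on its own but harmful in combination. This incremental risk model challenges traditional safety evaluation methods.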
model alignment alone does not reliably guarantee the safety of autonomous agents
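Most people assume model alignment is enough to keep AI agents safe, but the author argues it falls far short: in the experiments, Claude Code still shows an attack success rate of 73.63% even when running the aligned Qwen3-Coder model. This challenges the mainstream view in AI safety that alignment alone can solve the safety problem.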
The government has so far favoured a pro-innovation, sector-led approach, prioritising voluntary principles over hard regulation.
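Most people assume the UK government will take a hard line on AI regulation to protect creators' rights. The author points out that the government in fact favours a pro-innovation, sector-led approach that puts voluntary principles ahead of hard regulation. The finding contrasts sharply with public expectations that the government will protect creators and exposes a gap between policy reality and public perception.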
We introduce a minimal hierarchical partially observed control model with latent dynamics, structured episodic memory, observer-belief state, option-level actions, and delayed verifier signals.
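Most AI system designs lean toward fully observable models and assume the system state is known. The author instead proposes a hierarchical, partially observed control model with latent dynamics, structured episodic memory, an observer-belief state, option-level actions, and delayed verifier signals. This challenges the full-observability assumption in traditional AI system design and holds that partial observability is closer to the complexity of the real world.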
Existing research often studies these demands separately: robotics emphasizes control, retrieval systems emphasize memory, and alignment or assurance work emphasizes checking and oversight.
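Most AI research treats control, memory, and verification as separate problem areas and studies them independently. The author argues that this separation is flawed, because in natural systems (squirrels, for example) the three are tightly coupled. The view challenges the compartmentalized approach of current AI research and suggests future AI systems will need a more integrated approach that handles these interrelated demands together.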
Agentic AI is increasingly judged not by fluent output alone but by whether it can act, remember, and verify under partial observability, delay, and strategic observation.
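Most people assume the value of an AI system lies mainly in fluent output, but the author argues that AI should be judged on whether it can act, remember, and verify, because these matter more under partial observability, delay, and strategic observation. This challenges mainstream evaluation standards and puts the emphasis on how AI systems actually perform in complex real-world settings rather than on fluency alone.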
Lets you control every part of an AI video like a director
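Most people assume AI video generation tools can only produce content in a simple way, but the author argues that Wan2.7-Video has evolved into a full director's toolkit that gives users control over every aspect of a video, challenging the assumption that such tools offer only one-way output.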
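AI agents can read from and act on the 𝕏 platform directly through the standard MCP protocol: searching tweets, posting, viewing user profiles, managing bookmarks, sending and receiving direct messages, and more.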
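Most people assume social media platforms tightly restrict third-party automation to prevent abuse, but the author points out that xAI has opened up full MCP protocol support, letting AI agents perform a wide range of operations directly, in sharp contrast to the trend toward closed platforms.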
Built-in video and music generation
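Most people assume AI systems need dedicated modules or plugins to generate multimedia content, but the author implies that OpenClaw has these capabilities built in, which indicates a highly integrated architecture and challenges the conventional idea of modular AI system design.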
The memory system has learned to "dream"
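Most people think of AI 'learning' as the processing of algorithms and data, and of 'dreaming' as an unconscious mental activity unique to humans. The author implies that OpenClaw has developed a creative thinking process that goes beyond traditional learning, which challenges mainstream assumptions about the limits of AI capability.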
Built-in video and music generation; the memory system has learned to "dream"
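Most people assume an AI memory system is just simple data storage and retrieval, but the author implies that OpenClaw's memory system has developed something like the human ability to 'dream', a creative, associative, higher-order cognitive function, which challenges conventional assumptions about AI memory.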
The AI is actually very good at this, especially if you have a conversation with it beforehand. That's what Ask mode is for.
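The mainstream view holds that AI tools are mainly suited to generating code or automating simple tasks, but the author argues that AI performs very well at code review and architecture discussion, provided there is a thorough conversation beforehand. This challenges conventional assumptions about AI capability and suggests AI can act as an equal partner in architecture discussions, not just a code generator.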
Sandboxes made for running tens of thousands of agents
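Most people assume that running tens of thousands of AI agents in a single system is unrealistic and would lead to resource contention and degraded performance. Freestyle makes it an explicit design goal, which suggests its architecture may redefine the scale limits of AI agents and challenges mainstream assumptions about the scalability of AI systems.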
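After a long quiet stretch, Google has finally released a good model, and it is open source: the Gemma 4 series, built specifically to run on local devices such as phones and computers.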
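Most people assume Google, as a leader in AI, will stay focused on large cloud models, so its sudden move toward open on-device models is a surprise. The shift suggests Google may have reassessed the future of AI deployment, moving from centralized to distributed, which challenges the 'bigger models are better' consensus and hints that on-device AI could be the next hot area.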
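The quota on Claude Max Pro accounts can no longer be used with third-party products: unless you are using a product built on the Agent SDK and Claude Code, you cannot use the quota in that account.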
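Most people assume a cloud provider's subscription quota should be usable anywhere, but Anthropic's decision to restrict quota to specific products overturns that assumption. The strategy is effectively a form of lock-in that pushes developers and users toward products in its own ecosystem. It reflects a shift among AI service providers from open to closed and may become the new industry norm.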
I feel confident, though, that the slippery feeling people associate with AI products is a solvable problem, and the solution looks more like thoughtful interface design than better models. The models will keep improving on their own. The harder work is building the structure around them so that their output feels reliable, legible, and trustworthy.
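Most people assume the reliability of AI products will improve as model technology advances, but the author argues that the real challenge lies in building the structure and interface around the model, not in the model itself. This challenges technological determinism in AI and stresses the importance of design.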
When you delegate an issue to an agent in Linear, the delegation is visible. There's a person who set the agent loose within that system, and that person is accountable for the outcome. You design the environment well, you let the agent run, and you own what it produces.
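Most people assume responsibility for an AI agent's behavior lies with the agent itself or with a real-time monitoring system, but the author places it with the person who set the agent loose in the first place. This shifts accountability from real-time interaction to the initial delegation and challenges mainstream assumptions about who is responsible for AI.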
The more important work happens before the agent even starts. An agent operating inside a well-designed system already has the context and constraints it needs to do good work. In Linear, that means project plans, issue backlogs, code, and documentation. These all shape what the agent does and how it does it.
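Most people assume responsibility for AI systems lies in real-time monitoring and intervention, but the author argues that the real responsibility lies in designing the system and building the environment beforehand. This moves accountability from real-time interaction to the system design stage and challenges traditional thinking about AI governance.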
An agent cannot be held accountable. I think about this principle most. The instinct to put a human in the loop is understandable, but taken literally, it can mean a person approving every step before anything moves forward. The human becomes a bottleneck, rubber-stamping work rather than directing it, and you lose much of what makes agents valuable in the first place.
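Most people see human approval steps as a necessary way to ensure accountability in AI systems, but the author argues that they turn the human into a bottleneck and undercut the value of agents. This challenges mainstream thinking about AI safety and accountability and proposes an unconventional way of assigning responsibility.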
The first interface that spread for AI tools was the chat window. That makes sense. When you don't know what something can do, the safest approach is to let people ask. A conversation feels familiar, it stretches across many situations, and it doesn't force a specific structure up front.
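Most people see the chat interface as the ideal form of AI interaction because it is intuitive and flexible, but the author implies that it is a tool for the exploration phase rather than a solution for serious work. This challenges the current dominance of chat interfaces in AI tool design.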
Non-deterministic software breaks the contract. When outcomes can vary, sometimes wildly, based on what someone types into the same chat window, designing for reliability becomes genuinely harder. This slippery feeling is the design problem of this era, and it almost always traces back to the interface rather than the language model—which means it belongs to designers, not researchers.
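Most people see AI's unpredictability as a technical problem to be solved with better models, but the author argues that it is a design problem that belongs to designers rather than researchers. This challenges the mainstream belief that technical progress is the main route to fixing AI's unreliability.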
AI is a way to level the playing field, for sure! Successful writers have always operated with a lot of support around them, but not everyone has access to those resources.
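Most people assume AI writing will deepen inequality, but the author sees it as a democratizing tool that gives people without traditional writing resources access to professional-level support. This counters elitist criticism of AI writing and suggests it may narrow rather than widen gaps in creative work.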
When I sit down to write a piece, and before I even write a word, I have the agent interview me. It asks questions to draw out what I'm thinking about the topic.
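Most people assume AI writing starts with the human giving ideas to the AI, but the author shows the reverse: the AI first interviews the human to draw the ideas out. The reversal challenges assumptions about the direction of AI writing and shows that AI can do more than assist; it can prompt and guide human thinking, redefining who leads in the writing process.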
It has a panel of critics who tear my work apart from different angles—skills I wrote to invoke certain kinds of feedback, whether it's for length, pacing, or the soundness of the argument.
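Most people assume AI writing lacks a critical eye and rigorous editing, but the author describes an AI-driven panel of critics that tears her work apart from different angles. This addresses worries about the quality of AI writing and suggests AI can be set up to give feedback that is broader and stricter than traditional editing, possibly exceeding human editors in consistency and range.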
My process has about as much in common with that as cooking has with microwaving a frozen dinner.
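Most people think of AI writing as a simple prompt, generate, paste routine, but the author compares the difference to that between cooking and microwaving a frozen dinner, implying that real AI writing is complex and takes skill. This challenges the simplified picture of AI writing and presents it as a craft that requires expertise and creativity rather than a mechanical task.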
Research is thinking. Outlining is thinking. Writing is thinking. Any portion of that done by AI is less thinking done by you.
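Most people assume AI writing reduces the amount of thinking, but the author considers this view too simplistic. She shows that AI writing demands more thinking, critical judgment, and rigorous editing, not less. Her process involves complex interaction, deep reflection, and multiple rounds of revision, and may require more thought than traditional writing.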
both companies are hinting that these models are a real step forward, not just small upgrades.
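Most people assume AI models improve incrementally, with small gains in each iteration. The author argues that the models OpenAI and Anthropic are about to release (Spud and Claude Mythos) represent a real breakthrough rather than a routine upgrade, which suggests AI development may be about to accelerate.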
Gemma points in the opposite direction: smaller models, local compute, more ownership.
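Most people assume AI is bound to move toward larger, more centralized models, but the author argues that Google's Gemma 4 represents the opposite trend. This challenges the mainstream narrative and suggests AI may spread to personal devices and depend less on large infrastructure, in sharp contrast to the industry consensus.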
A founder in LA reportedly scaled Medvi toward $1.8B in annual sales with basically one full-time employee.
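Most people assume building a billion-dollar company takes a large team and a complex management structure, but the author argues that AI has made the 'one-person unicorn' possible. This challenges traditional ideas about startups and suggests AI may fundamentally change the relationship between company size and headcount.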
And once models get good at that, the question stops being whether they can make beautiful images. It becomes whether people still notice when something was never real to begin with.
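Most people focus on how realistic the output of AI image models can be, but the author makes a counterintuitive point: the real challenge is not creating realism but whether people can still tell what is real, which challenges assumptions about where progress in image models is heading.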
The first wave of image models was mostly about making cool-looking images. This next phase is about making ordinary things look real.
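Most people assume AI image models are mainly getting better at fantasy art and creative content, but the author argues that the next phase is about making ordinary, everyday things look real, which challenges common assumptions about the direction of AI imagery.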
We are building a world where machines write the code, machines choose the dependencies, and machines ship the updates. The AI agents are building the software. If we don't secure the supply chain they rely on, the AI agents are cooked.
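Most people assume AI will make software development more efficient and more secure, but the author warns that if we do not secure the supply chain AI agents depend on, the agents themselves become targets. This challenges the mainstream view that AI progress necessarily improves security and delivers a counterintuitive warning.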
The autonomous coding agents now entering production can install dependencies, execute builds, and open pull requests without a human ever touching the keyboard. They optimize for 'does this work?' not 'is this safe?'
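Most people assume AI coding assistants improve both productivity and security, but the author points out that these autonomous agents prioritize function over safety and act so fast that the window for security review shrinks to almost nothing. This challenges the general optimism about AI-assisted development.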
Hallucinated packages are the sleeper threat. LLMs regularly invent package names that don't exist. One study found that nearly 20% of AI-recommended packages were fabrications, and 43% of those hallucinated names appeared consistently across queries.
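Most people assume the packages AI recommends actually exist, but the author shows that AI often recommends packages that do not, and this has become a new attack vector. Attackers register these 'hallucinated packages' and plant malicious code in them, a technique called 'slopsquatting' that turns the AI itself into an amplifier for supply chain attacks.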
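One possible guard against hallucinated package names, sketched here as an assumption rather than anything the quoted article prescribes, is to compare every name an agent wants to install against a vetted allowlist before running the installer. The function name `unvetted_packages` and the example package names are hypothetical; the name normalization follows the PEP 503 rule for Python package names (lowercase, with runs of `-`, `_`, and `.` collapsed to `-`).

```python
import re


def normalize(name: str) -> str:
    """Normalize a Python package name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unvetted_packages(requested: list[str], vetted: set[str]) -> list[str]:
    """Return the requested names that are not on the vetted allowlist.

    Hypothetical helper: an agent pipeline would refuse to install anything
    in the returned list until a human has reviewed it.
    """
    vetted_normalized = {normalize(name) for name in vetted}
    return [name for name in requested if normalize(name) not in vetted_normalized]
```

An allowlist only helps if the vetted set itself is maintained by people; it does not detect a malicious package that was vetted by mistake.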
AI agents select known-vulnerable dependency versions 50% more often than humans. Worse, the vulnerable versions they pick are harder to fix, requiring major-version upgrades far more frequently.
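Most people assume AI coding assistants choose dependencies more safely than humans, but the author finds that AI picks known-vulnerable versions 50% more often than humans do, and those vulnerabilities are harder to fix. The reason is that AI optimizes for 'does this work?' rather than 'is this safe?', which challenges the security assumptions behind AI-assisted development.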
Talent density: the biggest prizes in capitalism attract the best minds in the field. These are the fastest growing software companies in history.
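Most people assume AI progress depends mainly on algorithmic breakthroughs and compute, but the author stresses talent density as a key driver of AI compression, implying that competition for talent matters more than capital or algorithms, contrary to the industry's usual emphasis on technical investment.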
At this rate, the phone in your pocket will run today's frontier models before you upgrade it.
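Most people assume phone hardware must keep being upgraded to run the latest AI features, but the author argues that compression is moving so fast that an existing phone will run what were once frontier models before it is replaced, which upends assumptions about the hardware upgrade cycle.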
In 23 months, the same capability that needed 1.8 trillion parameters now fits in 4 billion parameters. A 450x compression.
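Most people assume AI model performance improves mainly by adding parameters, but the author argues that algorithmic optimization and concentrated talent can deliver a 450x reduction in parameter count, which challenges the industry consensus that more parameters mean better performance.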
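The 450x figure follows directly from the two parameter counts in the quote, and the same numbers imply a halving time for the parameter count needed at fixed capability. The short calculation below is a sketch that takes the quoted figures at face value and assumes a smooth exponential decline over the 23 months, which the source does not claim.

```python
import math

# Figures taken from the quoted passage.
params_before = 1_800_000_000_000  # 1.8 trillion parameters
params_after = 4_000_000_000       # 4 billion parameters
months_elapsed = 23

compression_ratio = params_before / params_after

# Assumption: the decline is a smooth exponential, so the number of
# halvings is log2 of the ratio.
halvings = math.log2(compression_ratio)
months_per_halving = months_elapsed / halvings

print(f"compression: {compression_ratio:.0f}x")
print(f"halving time: {months_per_halving:.1f} months")
```

Under that assumption, the parameter count required for the same capability halves roughly every 2.6 months.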
Within three to four months, you can run a model with similar performance on your laptop; 23 months later, you can run the same model on your phone.
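Most people assume frontier AI takes a long time to reach consumer devices, but the author argues that a frontier model's performance can be matched on a laptop within 3 to 4 months and on a phone within 23 months, a pace of diffusion far faster than the industry generally expects.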
a free model that matches GPT-4o and runs entirely on your phone
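Most people assume top-tier AI models need massive compute and cloud support, but the author argues that the free Gemma 4 E4B model already runs entirely on a phone and matches GPT-4o, which breaks established assumptions about model size and resource requirements.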
Exposure alone is a completely meaningless tool for predicting displacement
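Most people assume that analysing how exposed a job's tasks are to AI can predict which jobs will be displaced, but the author argues that this single metric is completely meaningless because it ignores key factors such as price elasticity and changes in demand. This challenges the mainstream method in current research on AI's employment impact.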
in the past year Huawei has overtaken Nvidia as the leading source of AI computing power in China, at least in terms of rated FLOP/s
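Most people probably assume Nvidia still dominates the Chinese market, but the author argues that Huawei has overtaken Nvidia as the leading source of AI computing power in China. The finding challenges the idea that Nvidia's position in China is unshakeable and suggests domestic alternatives may be gaining share faster than expected.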
We estimate that as of the end of 2025, Chinese companies collectively own just over 5% of the cumulative computing power of the leading AI chips sold in recent years
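Given the rapid growth of China's AI industry and heavy government investment in AI, most people probably assume China holds a larger share of global AI computing power, but the author estimates that Chinese companies own only about 5%. The figure is far below expectations and challenges common beliefs about China's strength in AI.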
Many frontier AI developers, including Anthropic and OpenAI, acquire almost all of their compute from hyperscalers and other cloud providers.
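Most people probably assume leading AI companies own their computing infrastructure to keep a competitive edge, but the author states that frontier developers such as OpenAI and Anthropic rely almost entirely on hyperscalers and other cloud providers for compute. This suggests AI innovation depends more on big tech infrastructure than one might think, rather than on independent computing resources.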
We estimate that over 60% of global AI compute (in terms of total computing power) is owned by the five US hyperscalers, led by Google.
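Most people assume AI chips are more widely distributed, or dominated by dedicated AI companies such as OpenAI and Anthropic, but the author estimates that most of the world's AI compute is controlled by a handful of US hyperscalers, which challenges assumptions about the structure of the AI industry. This concentration gives a few companies disproportionate influence over the direction of AI development.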
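Complex research is not an accumulation of answers to single queries; it requires carrying out a whole process, from generating ideas, through searching for supporting evidence and resolving contradictions, to structuring the result as a final report.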
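Most people assume AI research assistants should focus on fast, direct answers, but the author stresses that complex research requires the full process from idea to structured report. This runs counter to the mainstream preference for instant answers from AI assistants and implies that quality matters more than speed, a non-consensus view of how AI should be applied.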
Letting a model think longer and more deeply at inference time draws out better output. That is the essence of inference-time scaling.
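Most people assume AI should pursue faster responses and higher efficiency, but the author argues that AI needs to think long and deeply to produce better output. This runs counter to the industry's pursuit of instant responses and makes a counterintuitive point: gains in compute efficiency should go toward deeper thinking rather than speed.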
For higher-interactivity scenarios, execution time for MoE models is bound by expert weight load time. By splitting, or sharding, the experts across multiple GPUs across NVL72 nodes, this bottleneck is reduced, improving end-to-end performance.
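Most people assume the main bottleneck for MoE models is compute, but the author points out that expert weight load time is the real bottleneck and proposes sharding expert weights across GPUs to reduce it. This challenges conventional thinking about model optimization and suggests I/O may matter more than compute.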
NVIDIA was the first and only platform to submit DeepSeek-R1 results on MLPerf Inference when the benchmark debuted last year.
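Most people assume AI benchmarks attract submissions from many competing platforms, but the author stresses that NVIDIA was the only platform to submit DeepSeek-R1 results, which implies NVIDIA's dominance in AI benchmarking and runs against the common picture of diversified competition.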
The E4B and E2B are the newest edition of on-device and mobile designed models first launched with Gemma 3n.
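Most people assume AI models on mobile devices must be heavily stripped down to run efficiently. The author implies that the E4B and E2B versions of Gemma 4 keep their multimodal capabilities on mobile devices, including text, audio, vision, and video processing, which challenges conventional assumptions about mobile AI.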
Modern physical AI agents are evolving rapidly with Gemma 4 models that integrate audio, multimodal perception, and deep reasoning capabilities.
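Most people assume physical AI agents are still at an early stage and mainly perform simple tasks. The author implies that Gemma 4 already lets physical AI agents understand speech, interpret visual context, and reason intelligently, a major step up from current robotics capabilities that may accelerate embodied AI.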
By using SAM, the Alta team has been able to process more than 20 million images without incurring exorbitant costs, allowing them to focus on building the best possible product for their users.
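Most people probably assume startups must rely on expensive third-party APIs to process images at scale, but the author shows that the open-source SAM model made large-scale image processing possible without exorbitant costs. This challenges the industry consensus that high-quality AI services must be expensive and shows the cost advantage of open-source models.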
If we knew that every image uploaded was a beautiful model shot, segmentation would be far easier, but because of the nature of user-uploaded content, we need the best possible segmentation.
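Most people probably think of high-quality professional photos as the typical input for AI image processing, but the author implies that such 'perfect' model shots are the easy case and that real user-uploaded content is much harder to segment. This challenges assumptions about 'ideal training data' and suggests that the imperfection of real-world data is the tougher technical challenge.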
The edge models feature a 128K context window, while the larger models offer up to 256K
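Most people assume AI models on edge and mobile devices are limited, especially in handling long context. The author states that Gemma 4 offers a 128K context window even on mobile devices, which challenges the common belief that edge AI capability is limited.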
Within ChatGPT Business and Enterprise, the number of Codex users has grown 6x since January.
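Most people probably assume enterprise adoption of AI tools is gradual, but the author reports explosive growth in Codex adoption in enterprise settings (6x since January). This suggests AI coding assistants may be moving from experimental tools to core productivity tools faster than expected, challenging conventional assumptions about the pace of enterprise AI adoption.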
Codex-only seats have no rate limits, and usage is billed on token consumption.
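Most people assume AI services set usage limits to control costs, but the author presents Codex's model of no rate limits with per-token billing as workable, since it offers a more transparent cost structure and more flexible use. This may reflect OpenAI's confidence in its own technical efficiency and in user demand.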
Priority areas include safety evaluation, ethics, robustness, scalable mitigations, privacy-preserving safety methods, agentic oversight, and high-severity misuse domains.
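Most people assume AI safety research centres on preventing malicious use and aligning systems with human values. The author lists privacy-preserving methods as a priority area, which suggests OpenAI treats privacy as a core part of safety rather than a separate concern, contrary to the traditional view of privacy and safety as two distinct fields.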
Fellows will receive API credits and other resources as appropriate, but will not have internal system access.
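In AI safety, many believe that studying system safety properly requires full access to internal systems. The author states explicitly that fellows will not have internal system access, which challenges that assumption and implies OpenAI believes safety research can be done without full access, or that it has other ways to assess safety.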
Fellows will work closely with OpenAI mentors and engage with a cohort of peers.
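Most people assume AI safety research should be highly confidential and isolated, especially when it concerns the safety of advanced AI systems. The author instead emphasizes close work with OpenAI mentors and engagement with peers, which suggests OpenAI is taking an open, collaborative approach to safety research, in sharp contrast to the industry's usual closed model.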
We are especially interested in work that is empirically grounded, technically strong, and relevant to the broader research community.
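Most people assume AI safety research should be highly theoretical and abstract, but the author emphasizes empirical grounding and technical strength. This suggests OpenAI is steering safety research away from pure theory toward practical application and verifiable results, in contrast to the elitist tendencies of traditional AI safety research.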
Demand from Claude customers has accelerated in 2026. Our run-rate revenue has now surpassed $30 billion—up from approximately $9 billion at the end of 2025.
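Most people assume AI companies are still burning cash, but Anthropic's revenue is growing at a striking pace, with run-rate revenue rising from about $9 billion at the end of 2025 to more than $30 billion in 2026. This suggests AI commercialization is moving far faster than the market expected and challenges the consensus that AI companies will lose money for years.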
The real monster
the correct term, and [[Monstertheorie 20030725114320]] the way to look at it.
There's a fundamental problem with these tools beyond the capacity of any deployment strategy to solve: the tool requires expertise to validate, but its use diminishes expertise and stunts its growth
the paradox here is that using algogens erodes the skills needed to judge their output. I think we already see that in the code leak from Anthropic.
The thing about agentic coding is that agents grind problems into dust. Give an agent a problem and a while loop and - long term - it’ll solve that problem even if it means burning a trillion tokens and re-writing down to the silicon. Like, where’s the bottom? Why not take a plain English spec and grind it out in pure assembly every time? It would run quicker. But we want AI agents to solve coding problems quickly and in a way that is maintainable and adaptive and composable (benefiting from improvements elsewhere), and where every addition makes the whole stack better. So at the bottom is really great libraries that encapsulate hard problems, with great interfaces that make the “right” way the easy way for developers building apps with them. Architecture! While I’m vibing (I call it vibing now, not coding and not vibe coding) while I’m vibing, I am looking at lines of code less than ever before, and thinking about architecture more than ever before. I am sweating developer experience even though human developers are unlikely to ever be my audience. How do we make libraries that agents love?
Is this an example of how to make better agents (better architecture and libraries underneath), or an example of 'the arc of AI bends towards deterministic software': architecture and libraries making agents as flat as functions?
Anthropic, the company behind the Claude AI model that was integrated into Palantir’s Maven Smart System, published a landmark paper on the problem in 2023. “Towards Understanding Sycophancy in Language Models,” presented at ICLR 2024, demonstrated that five state-of-the-art AI assistants consistently exhibited sycophantic behaviour across four varied text-generation tasks. The researchers found that when a response matched a user’s pre-existing views, it was significantly more likely to be rated as “preferred” by both humans and the preference models used to train the AI. Both humans and preference models, the paper concluded, prefer convincingly-written sycophantic responses over correct ones “a non-negligible fraction of the time.”
not just humans, but by extension also preference models prefer flattery over accuracy in generated outcomes.
2023 Towards Understanding Sycophancy in Language Models, paper: https://arxiv.org/abs/2310.13548 (cc-by)
A growing body of evidence, drawn from leaked planning documents, academic research, and the testimony of intelligence professionals, suggests that the most consequential military operation of the twenty-first century may have been shaped less by strategic necessity than by a phenomenon researchers now call AI sycophancy — the tendency of large language models to tell their users exactly what they want to hear.
US may have ai-flattered their way into Iran war.
On the role of AI in the US regime's Iran war planning.
Our preliminary results indicate that there is an additional phase, the intention to learn, and three relating factors, self-efficacy, conversion readiness, and peer support, that significantly influence the acceptance of mobile technologies among the participants, but are not represented in the existing models. With these findings, we propose a tentative theoretical model that extends the existing theories to explain the ways in which our participants came to accept mobile technologies.
sentences about extending existing theoretical models with research findings
Triangulating the empirical findings from our preliminary results with the existing theoretical models, we proposed an extension of the existing theoretical models that explains the technology acceptance behavior of our participants who were aged 60 or over. Our proposed model incorporates key elements of prior models and introduces novel components that significantly influence the participants' technology acceptance, namely one new phase, intention to learn, and three factors, self-efficacy, conversion readiness and peer support.
sentences about extending existing theoretical models with research findings
Consolidating our preliminary findings with the existing models, we propose an extended technology acceptance model for older adults illustrated in Figure 3. Extending to the predecessor theories, our tentative model introduces the perceived effort of learning a new technology as an obstacle for older adults' technology acceptance, which has not been reported in any studies of younger adults' technology acceptance.
sentences about extending existing theoretical models with research findings
Another stream of efforts sought to understand physical and cognitive performance of older adults in interacting with mobile technologies. Studies have shown that typical interaction components and techniques of a smartphone often prevent older adults from smooth and instant interactions with it. For example, the small size and the low contrast of buttons on a mobile display has a significant negative influence on interaction performance such as speed and accuracy [18], and decline in motor skills is correlated with time required to complete a task [30].
citations about older adults
Lee and Coughlin reviewed studies of older adults' technology acceptance and identified ten factors that are critical facilitators or determinants of older adults' acceptance of technology: value, usability, affordability, accessibility, technical support, social support, emotion, independence, experience, and confidence [20].
citations about older adults
Many studies have empirically investigated technology acceptance practices among older adults. While diverse in detail, most works point out that an individual's personal context [38] and the social context [36] in which the technology is introduced are the primary factors influencing the perception of, experience with, and evaluation of new technological developments among older adults [19].
citations about older adults
Seniors have historically been late adopters to the world of technology compared to their younger counterparts [24, 40]. As a result, older adults and their adoption of new technologies have been a topic of active research since the advent of consumer technologies (e.g., automated teller machine [32], scanner-equipped grocery stores [41], electronic funds transfer [15]).
citations about older adults
Nowadays, older adults are increasingly adopting and adapting to information and communication technologies [5]. For example, smartphone ownership among older adults has significantly risen in recent years [3]. However, its adoption levels among older adults in the US still sit at 27% as of 2015, whereas some 85% of Americans aged 18-29 are smartphone owners [31].
citations about older adults
Consolidating our preliminary findings with the existing models, we propose an extended technology acceptance model for older adults illustrated in Figure 3. Extending to the predecessor theories, our tentative model introduces the perceived effort of learning a new technology as an obstacle for older adults' technology acceptance, which has not been reported in any studies of younger adults' technology acceptance.
sentences that implicitly or explicitly mention theory
our key focus is to build a theoretical model that explains the process through which older adults accept (or reject) mobile technology, which can provide theoretical guidelines when designing a technology, and which may also be able to generate new investigations and experiments.
sentences that implicitly or explicitly mention theory
Triangulating the empirical findings from our preliminary results with the existing theoretical models, we proposed an extension of the existing theoretical models that explains the technology acceptance behavior of our participants who were aged 60 or over.
sentences that implicitly or explicitly mention theory
Again following grounded theory practices from [33], we compared the model that emerged from our data with existing theoretical models of technology acceptance to determine differences and similarities between them.
sentences that implicitly or explicitly mention theory
Employing the grounded theory method [33], we allowed recurring themes and concepts in relation to technology acceptance behaviors to arise from the data itself. Then, by triangulating our empirical findings with existing theoretical models from the literature, we found that the existing models of technology adoption require new theory components to describe the technology adoption processes of our participants.
sentences that implicitly or explicitly mention theory
Using TAM, UTAUT, and several other works as theoretical underpinning, Renaud and Van Biljon proposed a model to explain older adults' mobile phone adoption.
sentences that implicitly or explicitly mention theory
Extending the original TAM and consolidating the constructs of several other existing models, Venkatesh et al. proposed the Unified Theory of Acceptance and Use of Technology (UTAUT) [37].
sentences that implicitly or explicitly mention theory
We propose that cognitive engagement may be a useful construct in conceptualizing human engagement with AI and can help to distinguish between passive engagement, when individuals simply follow AI recommendations, and deeper forms of engagement, when they critically examine these recommendations and compare them with their own knowledge and judgement.
sentences about intended user's goals
An outcome of deeper cognitive engagement would be an ability to reject information that is inconsistent with individuals' own knowledge and beliefs, and to adjust their own knowledge to incorporate new information.
sentences about intended user's goals
Given continuous concerns regarding the reliability and trustworthiness of AI, human critical engagement may be a necessary component of successful human-AI interaction, particularly in domains with a high cost of errors, such as health and medicine.
sentences about intended user's goals
In many areas of human enterprise, individuals increasingly rely on Artificial Intelligence (AI) to inform their decisions and choices.
sentences about intended user's goals
How do people process the information and advice they receive from AI, and do they engage with it deeply enough to enable learning?
sentences about intended user's goals
When people receive advice while making difficult decisions, they often make better decisions in the moment and also increase their knowledge in the process.
sentences about intended user's goals
Incidental learning typically occurs as a byproduct of other activities (e.g., problem solving, advice seeking) rather than as a result of explicit or formal educational activities [47]. However, like formal learning, incidental learning can only occur if people engage deeply with information.
sentences that implicitly or explicitly mention theory
This would suggest that this design did not fully reach the constructive level from the ICAP framework [15, 16].
sentences that implicitly or explicitly mention theory
While prior work has highlighted the critical role of explanations in promoting learning [10, 18], our work additionally demonstrated the value of creating the conditions for learners to engage constructively (as defined in the ICAP framework [15, 16]) with the explanations.
sentences that implicitly or explicitly mention theory
We hypothesize that the observed difference in learning gain was due to the degree of cognitive engagement with AI-generated information. When individuals were provided with a solution to their task (in the form of a decision recommendation), they did not need to engage deeply with the explanations and could simply proceed with action. However, when they needed to arrive at their own decisions, they needed to engage with the provided explanations and synthesize the information to arrive at the conclusions.
sentences that implicitly or explicitly mention theory
Rotgans and Schmidt attribute these differences in cognitive engagement to different degrees of autonomy afforded by different learning tasks [59].
sentences that implicitly or explicitly mention theory
While some authors discuss cognitive engagement as a personal trait of a student that does not depend on context [3], others suggest that cognitive engagement depends on the structure of each task [15, 30, 59].
sentences that implicitly or explicitly mention theory
al. propose the Interactive-Constructive-Active-Passive (ICAP) framework to describe a continuum of learning behaviors (from passive, to active, to constructive, to interactive) and argue that each subsequent level leads to an increase in cognitive engagement and learning [15, 16].
sentences that implicitly or explicitly mention theory
Research in cognitive psychology suggested that people process information on different levels. Deep processing occurs when individuals engage in more meaningful analysis of information and link it to existing knowledge structures [2]. In learning sciences, depth of processing is often associated with the degree of cognitive engagement, which is described as a "psychological state in which students put in a lot of effort to truly understand a topic and in which students persist studying over a long period of time" [59].
sentences that implicitly or explicitly mention theory
Researchers in learning sciences use the term "cognitive engagement" to describe learners' engagement with the learning process. When people are cognitively engaged with the instructional process and materials, they are more likely to benefit from instruction and are more likely to acquire new skills and knowledge.
sentences that implicitly or explicitly mention theory
Addressing this feedback, by training more abstractive CTR models or performing a post-hoc abstraction of the generated summary, is an interesting future direction we plan to explore.
Please highlight any phrases that describe recommendations made in the paper
Additionally, our tool currently helps users in the reviewing step solely with the alignment functionality. Future work should add additional assistance during this step in the form of suggested improvements to selected unsatisfactory content in the summary, in addition to the alignment feature.
Please highlight any phrases that describe recommendations made in the paper
Future work should expand the application's capabilities to the multi-document setting, both in terms of the backend models and in terms of accessibility and intuitiveness of the application's frontend design.
Please highlight any phrases that describe recommendations made in the paper
Lastly, exploring strategies to scale SUMMHELPER to a multi-document setting presents another promising avenue for future investigation.
Please highlight any phrases that describe recommendations made in the paper
Additionally, in light of some user feedback, another interesting extension includes developing more abstractive consolidation and fusion models, which would offer control over the level of abstractness in the outputs.
Please highlight any phrases that describe recommendations made in the paper
Future work may include investigating more effective semantic strategies to locate summary-source alignments with acceptable latency.
Please highlight any phrases that describe recommendations made in the paper
highlights are incorporated into the input text with special markups, <extra_id_1> and <extra_id_2>, marking the beginning and end of each highlighted span, respectively. In our configuration, we set the maximum input length to 4096 and the maximum target length to 400. A greedy decoding strategy was used in order to optimize the decoding speed.
Please highlight any phrases that describe the libraries and tools used to implement the idea
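The highlight markup described in the excerpt above can be illustrated with a minimal sketch. It assumes highlights are given as non-overlapping character offsets and that the markers are inserted with no surrounding whitespace; neither detail is stated in the source, and the function name `mark_highlights` is hypothetical.

```python
def mark_highlights(text, spans, start_tag="<extra_id_1>", end_tag="<extra_id_2>"):
    """Wrap each (start, end) character span of `text` in start/end markers.

    Assumption: spans are half-open character offsets and do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or end > len(text) or start >= end:
            raise ValueError(f"invalid or overlapping span: ({start}, {end})")
        pieces.append(text[cursor:start])                    # unhighlighted text
        pieces.append(start_tag + text[start:end] + end_tag)  # marked span
        cursor = end
    pieces.append(text[cursor:])                              # trailing text
    return "".join(pieces)


marked = mark_highlights("The cat sat on the mat.", [(4, 7), (19, 22)])
print(marked)
```

The marked string would then be tokenized and truncated to the maximum input length (4096 in the configuration quoted above) before being passed to the model.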
Our approach locates the longest common subsequence (LCS) between the lemmas of each input sentence and each summary sentence, followed by several heuristics to filter out irrelevant LCSs
Please highlight any phrases that describe the libraries and tools used to implement the idea
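The LCS step mentioned in the excerpt above can be sketched with the standard dynamic-programming formulation. This is an illustration only: lowercased whitespace tokens stand in for lemmas (the source does not name a lemmatizer), the filtering heuristics are not specified in the excerpt and are omitted, and the function name `longest_common_subsequence` is hypothetical.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to recover the matched tokens.
    i = j = 0
    matched = []
    while i < m and j < n:
        if a[i] == b[j]:
            matched.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


# Stand-in for lemmatization: lowercase and split on whitespace.
source = "The committee approved the new budget on Friday".lower().split()
summary = "Committee approved budget".lower().split()
print(longest_common_subsequence(source, summary))
```

In an alignment setting, this would be run for every pair of input sentence and summary sentence, with the resulting subsequences filtered afterwards.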
For the summarization model, we used a BART-large model (Lewis et al., 2019) trained on the CNN/Daily Mail dataset (Hermann et al., 2015), selected for its noticeable popularity.
Please highlight any phrases that describe the libraries and tools used to implement the idea
For the initial auto-consolidation, we deploy an available Controlled Text Reduction model (Slobodkin et al., 2023), which is a Flan-T5-large model (Chung et al., 2022), finetuned on the highlights-focused CTR dataset.
Please highlight any phrases that describe the libraries and tools used to implement the idea
we deploy the ExtractiveSummarizer model from the TransformerSum library. The model, a RoBERTa-base (Liu et al., 2019) trained on the CNN/DailyMail summarization dataset (Hermann et al., 2015), operates as a binary classifier.
Please highlight any phrases that describe the libraries and tools used to implement the idea
This step coincides with the recently introduced Controlled Text Reduction task (CTR; Slobodkin et al., 2022), which produces a coherent fused version of the content of marked spans ("highlights") in a source document, as interpreted within the context of the full text.
Please highlight any phrases that describe the theory behind this work
SUMMHELPER is a modular system consisting of separate components, each performing one subtask, allowing user modifications of that sub-task's output. Such decomposition has been studied before in the context of fully automated summarization, with several works separating the process into salience detection and generation components (Barzilay and McKeown, 2005; Li et al., 2018; Ernst et al., 2022). These works focused on optimizing each component as part of a fully-automatic summarization process in order to improve the overall performance of the model. In contrast, our work uses this modularity to not only improve overall system output, but to also give more control to the user over each step in the summarization process.
Please highlight any phrases that describe the theory behind this work
Our objective in this paper is to promote such a human-involved approach to summarization, allowing the eventual output to be better tailored to real-world user needs, and to synergize the efficiency of the computer with the quality of the human (Hoc, 2000; Pacaux-Lemoine et al., 2017; Flemisch et al., 2019).
Please highlight any phrases that describe the theory behind this work
it is crucial to prioritize and direct human efforts toward more "suspicious" outputs from LLMs
Please highlight any phrases that describe recommendations made in the paper
we advocate a collaborative approach where humans and LLMs work together to produce reliable and high-quality labels
Please highlight any phrases that describe recommendations made in the paper
LLM annotators and human annotators should not be treated the same, and annotation tools should carefully design their data models and workflows to accommodate both types of annotators
Please highlight any phrases that describe recommendations made in the paper
it is advisable to either mask any confidential information or only use in-house LLMs
Please highlight any phrases that describe recommendations made in the paper
it is recommended that the format of a prompt be similar to the one used in training, as some LLMs use a different prompt format than others
Please highlight any phrases that describe recommendations made in the paper
the selection of label options may work better if it is similar to common options for given tasks, such as [positive, neutral, negative] > [super positive, positive, ..., negative] for sentiment classification
Please highlight any phrases that describe recommendations made in the paper
designing an annotation task and a prompt similar to more widely used and standardized NLP tasks is beneficial
Please highlight any phrases that describe recommendations made in the paper
errors encountered during API calls are handled in two ways: they are either handled within our system or delegated to users. We handle known LLM API errors that can be solved by user-side intervention. This would be in cases such as a Timeout or RateLimitError in OpenAI models
Please highlight any phrases that describe the libraries and tools used to implement the idea
errors such as APIConnectionError in OpenAI models occur because of an issue with the LLM API server itself and require intervention from OpenAI.
Please highlight any phrases that describe the libraries and tools used to implement the idea
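The two error categories described in the excerpts above can be illustrated with a small sketch. It is not MEGAnno+'s implementation: the exception classes `RecoverableError` and `ServerSideError` are hypothetical stand-ins defined locally (they are not the OpenAI client's classes), and the retry-with-backoff policy is an assumption, since the source does not say how recoverable errors are resolved.

```python
import time


class RecoverableError(Exception):
    """Hypothetical stand-in for errors such as a timeout or rate limit."""


class ServerSideError(Exception):
    """Hypothetical stand-in for errors caused by the API server itself."""


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying recoverable errors with exponential backoff.

    Server-side errors are not caught here, so they propagate to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RecoverableError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a call that is rate limited twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RecoverableError("rate limited")
    return "label: positive"

print(call_with_retry(flaky_call, base_delay=0.0))
```

Keeping the two categories as separate exception types lets the caller decide which failures are worth retrying and which should be reported immediately.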
While MEGAnno+ is designed to support any open-source LLM or commercial LLM APIs, in this work, we only demonstrate OpenAI Completion models for clarity and brevity.
Please highlight any phrases that describe the libraries and tools used to implement the idea
Data Model. MEGAnno+ extends MEGAnno's data model, in which data Record, Label, Annotation, and Metadata (e.g., text embedding or confidence score) are persisted in the service database along with the task Schema.
Please highlight any phrases that describe the libraries and tools used to implement the idea
We implement our system as an extension to MEGAnno (Zhang et al., 2022), an in-notebook exploratory annotation tool.
Please highlight any phrases that describe the libraries and tools used to implement the idea
MEGAnno+ is designed to provide a convenient and robust workflow for users to utilize LLMs in text annotation. To use our tool, users operate within their Jupyter notebook (Kluyver et al., 2016) with the MEGAnno+ client installed.
Please highlight any phrases that describe the libraries and tools used to implement the idea
LLM annotators and human annotators should not be treated the same, and annotation tools should carefully design their data models and workflows to accommodate both types of annotators.
Please highlight any phrases that describe the theory behind this work
we go beyond using LLMs to assist annotation for human annotators or to replace human annotators. Rather, MEGAnno+ advocates for a collaboration between humans and LLMs with our dedicated system design and annotation-verification workflows.
Please highlight any phrases that describe the theory behind this work
Despite these advancements, it is essential to acknowledge that LLMs have limitations, necessitating human intervention in the data annotation process. One challenge is that the performance of LLMs varies extensively across different tasks, datasets, and labels. LLMs often struggle to comprehend subtle nuances or contexts in natural language, making involvement of humans with social and cultural understanding or domain expertise crucial.
Please highlight any phrases that describe the theory behind this work
Large language models (LLMs) can label data faster and cheaper than humans for various NLP tasks. Despite their prowess, LLMs may fall short in understanding complex, sociocultural, or domain-specific context, potentially leading to incorrect annotations. Therefore, we advocate a collaborative approach where humans and LLMs work together to produce reliable and high-quality labels.
Please highlight any phrases that describe the theory behind this work
Valarie A Zeithaml and William L Fuerst. 1983. Age differences in response to grocery store price information. Journal of consumer affairs 17, 2 (1983), 402–420.
any bibliographic entry relating to older adults
Mary E Sesto, Curtis B Irwin, Karen B Chen, Amrish O Chourasia, and Douglas A Wiegmann. 2012. Effect of touch screen button size and spacing on touch characteristics of users with and without disabilities. Human Factors: The Journal of the Human Factors and Ergonomics Society 54, 3 (2012), 425–436.
any bibliographic entry relating to older adults
Zhao Xia Jin, Tom Plocher, and Liana Kiff. 2007. Touch screen user interfaces for older adults: button size and spacing. In Universal acess in human computer interaction. coping with diversity. Springer, 933–941.
any bibliographic entry relating to older adults
Robin Brewer, Raymundo Cornejo Garcia, Tedmond Schwaba, Darren Gergle, and Anne Marie Piper. 2016. Exploring Traditional Phones as an E-Mail Interface for Older Adults. ACM Transactions on Accessible Computing (TACCESS) 8, 2 (2016), 6.
any bibliographic entry relating to older adults
Janan Al-Awar Smither and Curt C Braun. 1994. Technology and older adults: Factors affecting the adoption of automatic teller machines. The Journal of General Psychology 121, 4 (1994), 381–389.
any bibliographic entry relating to older adults
Wiktoria Wilkowska and Martina Ziefle. 2009. Which factors form older adults' acceptance of mobile information and communication technologies? Springer.
any bibliographic entry relating to older adults
Kerryellen G Vroman, Sajay Arthanat, and Catherine Lysack. 2015. "Who over 65 is online?" Older adults' dispositions toward information communication technology. Computers in Human Behavior 43 (2015), 156–166.
any bibliographic entry relating to older adults
Phil Turner, Susan Turner, and Guy Van de Walle. 2007. How older people account for their experiences with interactive technology. Behaviour & Information Technology 26, 4 (2007), 287–296.
any bibliographic entry relating to older adults
Hironobu Takagi, Akihiro Kosugi, Tatsuya Ishihara, and Kentarou Fukuda. 2014. Remote IT education for senior citizens. In Proceedings of the 11th Web for All Conference. ACM, 41.
any bibliographic entry relating to older adults
Karen Renaud and Judy Van Biljon. 2008. Predicting technology acceptance and adoption by the elderly: a qualitative study. In Proceedings of the 2008 annual research conference of the South African Institute of Computer Scientists and Information Technologists on IT research in developing countries: riding the wave of technology. ACM, 210–219.
any bibliographic entry relating to older adults
Chee Wei Phang, Juliana Sutanto, Atreyi Kankanhalli, Yan Li, Bernard CY Tan, and Hock-Hai Teo. 2006. Senior citizens' acceptance of information systems: A study in the context of e-government services. Engineering Management, IEEE Transactions on 53, 4 (2006), 555–569.
any bibliographic entry relating to older adults
Bjorn Niehaves and Ralf Plattfaut. 2014. Internet adoption by the elderly: employing IS technology acceptance theories for understanding the age-related digital divide. European Journal of Information Systems 23, 6 (2014), 708–726.
any bibliographic entry relating to older adults
HH Nap and HP de Greef. 2010. Self-efficacy & stress in senior computer interaction. In Proceedings of the 28th Annual European Conference on Cognitive Ergonomics. ACM, 227–230.
any bibliographic entry relating to older adults
Michael G Morris and Viswanath Venkatesh. 2000. Age differences in technology adoption decisions: Implications for a changing work force. Personnel psychology 53, 2 (2000), 375–403.
any bibliographic entry relating to older adults
Tracy L Mitzner, Wendy A Rogers, Arthur D Fisk, Walter R Boot, Neil Charness, Sara J Czaja, and Joseph Sharit. 2014. Predicting older adults' perceptions about a computer system designed for seniors. Universal Access in the Information Society (2014), 1–10.
any bibliographic entry relating to older adults
Chaiwoo Lee and Joseph F Coughlin. 2014. PERSPECTIVE: Older Adults' Adoption of Technology: An Integrated Approach to Identifying Determinants and Barriers. Journal of Product Innovation Management (2014).
any bibliographic entry relating to older adults
Sri Kurniawan. 2008. Older people and mobile phones: A multi-method investigation. International Journal of Human-Computer Studies 66, 12 (2008), 889–901.
any bibliographic entry relating to older adults
Vicki L Hanson. 2011. Technology skill and age: what will be the same 20 years from now? Universal Access in the Information Society 10, 4 (2011), 443–452.
any bibliographic entry relating to older adults
Mary C Gilly and Valarie A Zeithaml. 1985. The elderly consumer and adoption of technologies. Journal of consumer research (1985), 353–357.
any bibliographic entry relating to older adults
Nancy M Gell, Dori E Rosenberg, George Demiris, Andrea Z LaCroix, and Kushang V Patel. 2013. Patterns of technology use among older adults with and without disabilities. The Gerontologist (2013), gnt166.
any bibliographic entry relating to older adults
Helene Gelderblom, Tobie van Dyk, and Judy van Biljon. 2010. Mobile phone adoption: Do existing models adequately capture the actual usage of older adults?. In Proceedings of the 2010 annual research conference of the south african institute of computer scientists and information technologists. ACM, 67–74.
any bibliographic entry relating to older adults
Arthur D Fisk, Wendy A Rogers, Neil Charness, Sara J Czaja, and Joseph Sharit. 2009. Designing for older adults: Principles and creative human factors approaches. CRC press.
any bibliographic entry relating to older adults
Anna Dickinson, Alan F Newell, Michael J Smith, and Robin L Hill. 2005. Introducing the Internet to the over-60s: Developing an email system for older novice computer users. Interacting with Computers 17, 6 (2005), 621–642.
any bibliographic entry relating to older adults
Mario Conci, Fabio Pianesi, and Massimo Zancanaro. 2009. Useful, social and enjoyable: Mobile phone adoption by older people. In Human-Computer Interaction–INTERACT 2009. Springer, 63–76.
any bibliographic entry relating to older adults
Miha Cimperman, Maja Makovec Brenčič, Peter Trkman, and Mateja de Leonni Stanonik. 2013. Older adults' perceptions of home telehealth services. Telemedicine and e-Health 19, 10 (2013), 786–790.
any bibliographic entry relating to older adults
Luca Buccoliero and Elena Bellio. 2014. The adoption of silver e-Health technologies: first hints on technology acceptance factors for elderly in Italy. In Proceedings of the 8th International Conference on Theory and Practice of Electronic Governance. ACM, 304–307.
any bibliographic entry relating to older adults
Today's generations of older adults have not grown up with information and communications technologies that are widely available these days. Thus, there is "a natural confound of age and experience, since today's older adults are exposed to these technologies at a different point in their lives than today's young adults." [17]
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults
Older people are less likely to have peers with sufficient technology experiences compared to their younger counterparts.
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults
Incorporating these human factors and practical design suggestions for older adults, Fisk et al. proposed key recommendations for designing mobile devices for this age group [12].
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults
Studies have shown that typical interaction components and techniques of a smartphone often prevent older adults from smooth and instant interactions with it. For example, the small size and the low contrast of buttons on a mobile display have a significant negative influence on interaction performance such as speed and accuracy [18], and decline in motor skills is correlated with time required to complete a task [30].
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults
Lee and Coughlin reviewed studies of older adults' technology acceptance and identified ten factors that are critical facilitators or determinants of older adults' acceptance of technology: value, usability, affordability, accessibility, technical support, social support, emotion, independence, experience, and confidence [20].
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults
most works point out that an individual's personal context [38] and the social context [36] in which the technology is introduced are the primary factors influencing the perception of, experience with, and evaluation of new technological developments among older adults [19].
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults
One exception is the senior technology acceptance model (STAM) [28]. Using TAM, UTAUT, and several other works as theoretical underpinning, Renaud and Van Biljon proposed a model to explain older adults' mobile phone adoption.
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults
Several studies have attempted to determine older adults' acceptance of technologies in general, and healthcare-related systems in particular, using the UTAUT framework (e.g., email [14], a telehealth service [7]).
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults
As a result, older adults and their adoption of new technologies have been a topic of active research since the advent of consumer technologies (e.g., automated teller machine [32], scanner-equipped grocery stores [41], electronic funds transfer [15]).
citations about older adults; for example, the citation numbers being highlighted when the citation is in regards to older adults