18 Matching Annotations
  1. Jan 2026
    1. why asynchronous agents deserve more attention than they currently receive, provides practical guidelines for working with them effectively, and shares real-world experience using multiple agents to refactor a production codebase.

      3 things in this article:
      - why async agents deserve more attention
      - practical guidelines for effective deployment
      - real-world examples

    1. While the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the APEX-Agents test is public, it’s an open challenge for AI labs that believe they can do better — something Foody fully expects in the months to come.

      expectation that models will get trained against the tests they currently fail.

    2. “The way we do our jobs isn’t with one individual giving us all the context in one place. In real life, you’re operating across Slack and Google Drive and all these other tools.” For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.

      I understand this para but the phrasing is off: Slack and Google Drive are not 'multi-domain', they are tools. It seems like two arguments are joined up here, multi-tool and multi-domain, meaning AI agents can't switch between either. (In practice I see people build small agents for each facet and then chain / join them; rough sketch below.)
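
      A rough sketch of that chaining pattern, purely as my own illustration: the agent prompts, the run_agent helper, and the model id are assumptions, not something from the article.

      ```python
      # Sketch of "one small agent per facet, then chain them".
      # All names, prompts and the model id are illustrative assumptions.
      from anthropic import Anthropic

      client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

      def run_agent(system_prompt: str, task: str) -> str:
          """One narrow agent = one fixed role prompt applied to one task."""
          response = client.messages.create(
              model="claude-sonnet-4-20250514",  # example model id
              max_tokens=1024,
              system=system_prompt,
              messages=[{"role": "user", "content": task}],
          )
          return response.content[0].text

      # Placeholder input; in practice this would come from the Slack API.
      slack_thread_text = "…exported thread…"

      # Two single-facet agents, chained: the output of the first feeds the second.
      summary = run_agent(
          "You summarise Slack threads into decisions and open questions.",
          slack_thread_text,
      )
      brief = run_agent(
          "You turn a list of decisions into a one-page project brief.",
          summary,
      )
      print(brief)
      ```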

    3. The new research looks at how leading AI models hold up doing actual white-collar work tasks, drawn from consulting, investment banking, and law. The result is a new benchmark called APEX-Agents — and so far, every AI lab is getting a failing grade. Faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the model came back with a wrong answer or no answer at all.

      In consulting, investment banking, and law, AI agents scored 18-24% or worse (and in real-life circumstances you don't know which answers are right and which are wrong, so you need to check all output).

  2. Dec 2025

    1. This type of thing sounds like what I thought wrt my annotation of [[AI agents als virtueel team]]. The example prompts with questions make me think of [[Filosofische stromingen als gereedschap 20030212105451]], which already contains a line of questioning per school of thought. Making personas of different thinking styles, lines of questioning. The same for reviews, or starting a project, etc.

    1. They are markdown files with a personality, frameworks, and output templates. I didn't write those myself - I asked Claude to create them. "Make a Product Owner agent who is good at prioritising and can do impact/effort analyses." Claude then writes the full file, including working method and examples. If I then say "ask Tessa about this", Claude loads that file and becomes Tessa.

      Seems like these agent .md files contain a description of a role that is then included in the prompt (sketch below).
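
      A minimal sketch of how that could work, assuming a hypothetical agents/tessa.md file; the author's setup has Claude load the file itself, so this API-based loader is only my own illustration of the same idea.

      ```python
      # Sketch: a persona .md file used as the system prompt.
      # agents/tessa.md, the function names and the model id are hypothetical.
      from pathlib import Path

      from anthropic import Anthropic

      AGENTS_DIR = Path("agents")

      def load_agent(name: str) -> str:
          """Read an agent description (role, frameworks, output templates) from markdown."""
          return (AGENTS_DIR / f"{name}.md").read_text(encoding="utf-8")

      def ask(agent: str, question: str) -> str:
          """Send a question to Claude with the agent file as the system prompt."""
          client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
          response = client.messages.create(
              model="claude-sonnet-4-20250514",  # example model id
              max_tokens=1024,
              system=load_agent(agent),  # Claude "becomes Tessa" by adopting the role text
              messages=[{"role": "user", "content": question}],
          )
          return response.content[0].text

      # e.g. agents/tessa.md describes a Product Owner persona
      print(ask("tessa", "Prioritise these three features by impact/effort."))
      ```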

    1. In my working folder I have a collection of "agents" - text files that tell Claude how to behave. Tessa is one of them. When I "load" her, Claude thinks from the perspective of a product owner.

      Author has .md files that describe separate 'agents' she involves in her coding work, for each of the roles in a dev team. Would something like that work for K-work? #openvraag E.g. for project management roles, or for facets you're less fond of yourself?

  3. Nov 2025
    1. AI checking AI inherits vulnerabilities, Hays warned. "Transparency gaps, prompt injection vulnerabilities and a decision-making chain becomes harder to trace with each layer you add." Her research at Salesforce revealed that 55% of IT security leaders lack confidence that they have appropriate guardrails to deploy agents safely.

      Abstracting away responsibilities is a dead end. Over half of IT security leaders are not confident they have the guardrails to deploy agentic AI safely.

  4. Jun 2025
    1. https://web.archive.org/web/20250630134724/https://www.theregister.com/2025/06/29/ai_agents_fail_a_lot/

      'Agent washing': agentic AI underperforms, getting at most 30% of tasks right (Gemini 2.5 Pro), but mostly under 10%.

      The article contains examples of what I think we should call agentic hallucination: when it can't find a solution, the agent takes steps to alter reality to fit the solution (e.g. renaming a user so it became the right user to send a message to, because the right user could not be found). Meredith Whittaker is mentioned, but compared to the statement of hers I saw, a key element is missing here: most of that access will be in clear text, as models can't do encryption. Meaning not just the model, but the very fact that the access exists, is a major vulnerability.