What went wrong when I tried to evaluate an AI agent in production

While evaluating an AI agent in production, most failures turned out to stem from system-level problems rather than the model itself. Broken URLs in tool calls, the agent calling localhost in a cloud environment, blocked external dependencies, and missing API keys all produced anomalous evaluation results. This suggests that evaluating AI agents requires validating the whole system and how its components interact, not just scoring model outputs.
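Several of the failures listed above (missing API keys, localhost calls in a cloud environment, malformed URLs) can be caught before any model output is scored. A minimal preflight-check sketch, with all function and variable names hypothetical rather than taken from any specific framework:

```python
import os
from urllib.parse import urlparse

def preflight_checks(tool_urls, required_env_vars):
    """Validate the agent's environment before scoring any model output.

    Surfaces system-level problems (missing keys, localhost URLs) so they
    are not misattributed to model quality.
    """
    problems = []
    # Missing API keys fail loudly here instead of silently in production.
    for var in required_env_vars:
        if not os.environ.get(var):
            problems.append(f"missing environment variable: {var}")
    for url in tool_urls:
        host = urlparse(url).hostname
        # localhost will not resolve to the intended service in a cloud
        # evaluation environment.
        if host in ("localhost", "127.0.0.1"):
            problems.append(f"tool URL points at localhost: {url}")
        elif host is None:
            problems.append(f"malformed tool URL: {url}")
    return problems

problems = preflight_checks(
    tool_urls=["http://localhost:8000/search", "https:/broken"],
    required_env_vars=["HYPOTHETICAL_SEARCH_API_KEY"],
)
# Each entry in `problems` is a system bug, not a model failure.
```

Running the checks as a gate before the eval suite keeps these bugs out of the model's score entirely.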

Author: colinfly, about 3 hours ago
I tried to evaluate an AI agent using a benchmark-style approach.

It failed in ways I didn't expect.

Instead of model quality issues, most failures came from system-level problems. A few examples from a small test suite:

- Broken URLs in tool calls → score dropped to 22
- Agent calling localhost in a cloud environment → got stuck at 46
- Real CVEs flagged as hallucinations → evaluation issue, not model issue
- Reddit blocking requests → external dependency failure
- Missing API key in production → silent failure

Each run surfaced a real bug, but not the kind I was originally trying to measure.

What surprised me is that evaluating agents isn't just about scoring outputs. It's about validating the entire system: tools, environment, data access, and how the agent interacts with all of it.

In other words, most of the failure modes looked more like software bugs than LLM mistakes.

This made me think that evaluation loops for agents should look more like software testing than benchmarking:

- repeatable test suites
- clear pass/fail criteria
- regression detection
- root cause analysis

Otherwise it's very easy to misattribute failures to the model when they're actually coming from somewhere else.

I ended up building a small tool to structure this process, but the bigger takeaway for me is how messy real-world agent evaluation actually is compared to standard benchmarks.

Curious how others are approaching this, especially in production settings.
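One way to read the list above is as an ordinary test suite: each failure mode becomes a repeatable case with an explicit pass/fail criterion, and regressions show up as diffs between runs. A minimal sketch under that framing, where `run_agent` is a hypothetical stand-in for whatever invokes the agent under test:

```python
# Hypothetical eval loop structured as software tests rather than a
# single benchmark score. All names are illustrative.

def run_agent(task):
    # Placeholder: a real implementation would call the agent under test
    # and record its output, tool calls, and any error.
    return {"output": "stub", "tool_calls": [], "error": None}

def check_no_silent_failure(result):
    # Pass/fail criterion: empty output with no recorded error is a
    # silent failure (e.g. a missing API key swallowed downstream).
    return result["error"] is not None or bool(result["output"])

def check_tool_calls_resolvable(result):
    # Pass/fail criterion: no tool call may target localhost, which
    # cannot reach the intended service in a cloud environment.
    return all("localhost" not in call.get("url", "")
               for call in result["tool_calls"])

CHECKS = [check_no_silent_failure, check_tool_calls_resolvable]

def evaluate(tasks):
    # Repeatable suite: every task runs every check. A failure names the
    # check that tripped, which is the starting point for root-cause
    # analysis; diffing this list between runs gives regression detection.
    failures = []
    for task in tasks:
        result = run_agent(task)
        for check in CHECKS:
            if not check(result):
                failures.append((task, check.__name__))
    return failures

failures = evaluate(["summarize CVE feed", "fetch forum thread"])
```

The point of the structure is attribution: a failing check names a system component, so a bad run is a bug report against the tooling or environment rather than a vague hit to the model's score.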