  1. Claude Opus 4.1 - anthropic.com

    Aug 5, 2025 · Today we're releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning. We plan to release substantially larger improvements …

  2. Create strong empirical evaluations - Claude Docs

    Learn how to craft prompts that maximize your eval scores. More code examples of human-, code-, and LLM-graded evals.

  3. Claude Opus 4.1 vs Claude Opus 4 – How good is this upgrade?

    Aug 6, 2025 · GitHub’s Evaluation: Claude Opus 4.1 demonstrates notable performance gains in multi-file code refactoring, surpassing Opus 4 in tasks that require nuanced understanding and …

  4. Introducing Claude Opus 4.5 \ Anthropic

    Nov 24, 2025 · Claude Opus 4.5 sets a new standard for Excel automation and financial modeling. Accuracy on our internal evals improved 20%, efficiency rose 15%, and complex tasks that once …

  5. An update on our preliminary evaluations of Claude 3.5 Sonnet ...

    Jan 31, 2025 · METR conducted preliminary evaluations of Anthropic’s upgraded Claude 3.5 Sonnet (October 2024 release), and a pre-deployment checkpoint of OpenAI’s o1. In both cases, …

  6. Claude Opus 4 and Claude Sonnet 4 Evaluation Results

    May 25, 2025 · A detailed analysis of Claude Opus 4 and Claude Sonnet 4 performance on coding and writing tasks, with comparisons to GPT-4.1, DeepSeek V3, and other leading models.
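Result 2 above points to code examples of human-, code-, and LLM-graded evals. As a rough illustration only (not taken from the linked Claude Docs page; the task, helper names, and grading rule below are hypothetical), a code-graded eval can be as simple as comparing model output against a programmatically checkable expected answer:

    # Minimal sketch of a code-graded eval. The cases and exact-match rule
    # are illustrative assumptions, not the Claude Docs examples.

    eval_cases = [
        # Each case pairs a prompt with an answer a program can verify.
        {"prompt": "Return only the SQL keyword used to remove duplicate rows.",
         "expected": "DISTINCT"},
        {"prompt": "What HTTP status code means 'Not Found'? Answer with the number only.",
         "expected": "404"},
    ]

    def grade(output: str, expected: str) -> bool:
        """Code-graded check: exact match after trimming whitespace and case."""
        return output.strip().lower() == expected.strip().lower()

    def run_eval(generate):
        """`generate` is any callable mapping a prompt string to model output."""
        passed = sum(grade(generate(c["prompt"]), c["expected"]) for c in eval_cases)
        return passed / len(eval_cases)

    if __name__ == "__main__":
        # Stand-in "model" for demonstration; swap in a real API call in practice.
        canned = {"Return only the SQL keyword used to remove duplicate rows.": "DISTINCT",
                  "What HTTP status code means 'Not Found'? Answer with the number only.": "404"}
        print(f"Pass rate: {run_eval(lambda p: canned[p]):.0%}")

The exact-match rule keeps grading fully automatic; human- and LLM-graded evals would replace the grade() step with a rubric-based reviewer or a judge model.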