Builder topic

AI benchmark radar

Benchmark and evaluation updates that keep marketing claims separate from independent results.

Updated: Jun 10, 2026

Open Source AI | GitHub trend signal | May 27, 2026

LangChain: The Agent Engineering Platform for LLM Applications

LangChain is an open-source Python framework designed for building and deploying LLM-powered applications and agents. It provides tools for chaining interoperable components, integrating with various data sources and models, and supporting rapid prototyping and production-ready features like monitoring and debugging.

1 min read | Confidence 95% | github.com

RDR 88

AI Coding | GitHub trend signal | May 24, 2026

Origin: A Local-First Rust Daemon for AI Agent Memory and Context Management

Origin is a local-first Rust daemon designed to manage AI agent memory and context. It features Git-versioned memories, distilled wiki pages, and supports sessions for various AI clients like Claude Code, Cursor, and Codex, aiming to provide persistent context across AI workflows.

1 min read | Confidence 85% | github.com

RDR 88

Agents | GitHub trend signal | May 24, 2026

CUA: Open-Source Infrastructure for Desktop-Controlling AI Agents

CUA is an open-source project providing infrastructure for developing, training, and evaluating AI agents capable of controlling full desktop environments across macOS, Linux, and Windows. It includes sandboxes, SDKs, and benchmarks to facilitate the creation of computer-use agents.

1 min read | Confidence 90% | github.com

RDR 88

AI Coding | GitHub trend signal | Jun 9, 2026

Boxlite: A Compute Substrate for AI Agents

Boxlite is a new compute substrate for AI agents. It aims to be lightweight enough for local development yet scalable for cloud deployment.

3 min read | Confidence 90% | github.com

RDR 87

AI Coding | GitHub trend signal | Jun 7, 2026

Wide-Moat's Open-Source MCP Server for LLM-Powered Computing

Wide-Moat has released an open-source MCP server designed to provide Large Language Models (LLMs) with their own managed computing environments. This self-hosted solution offers Docker workspaces with integrated browser, terminal, and code execution capabilities, enabling LLMs to perform complex tasks autonomously.

3 min read | Confidence 90% | github.com

RDR 87

AI Tools | GitHub trend signal | Jun 6, 2026

Cordum: Open Agent Control Plane for Governing Autonomous AI Agents

Cordum introduces an open-source agent control plane designed to govern autonomous AI agents. It provides features for pre-execution policy enforcement, approval gates, and audit trails, aiming to enhance the safety and manageability of AI agent deployments.

3 min read | Confidence 90% | github.com

RDR 87

AI Tools | GitHub trend signal | Jun 5, 2026

Notion MCP Server: Integrating AI Agents with Notion

A new open-source project, the Notion MCP Server, enables AI agents to interact with Notion data. It supports various AI models and allows access to Notion pages, databases, and files.

2 min read | Confidence 95% | github.com

RDR 87

Agents | GitHub trend signal | May 26, 2026

Google's ADK-Python: An Open-Source Toolkit for AI Agent Development

Google has released ADK-Python, an open-source, code-first Python toolkit designed for building, evaluating, and deploying AI agents. The toolkit, currently at version 2.1.0, emphasizes flexibility and control in agent development and includes a graph-based execution engine for workflows and a structured Task API for agent-to-agent delegation.

1 min read | Confidence 90% | github.com

RDR 87

AI Tools | GitHub trend signal | May 26, 2026

Hermes Katana: A Defense-in-Depth Security Toolkit for LLM Agents

Hermes Katana is a Python-based security toolkit designed for LLM agents, offering defense-in-depth capabilities including taint tracking, a proxy secret guard, a policy engine, and red-team benchmarking. It aims to protect AI agents from various attacks like prompt injection and unauthorized command execution.

1 min read | Confidence 85% | github.com

RDR 87
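
To make the taint-tracking idea in the Hermes Katana item concrete, here is a minimal sketch of the general technique: values from untrusted sources carry a marker, and a sensitive sink refuses anything marked. This is not Hermes Katana's API; `Tainted`, `taint`, `combine`, `TaintError`, and `run_command` are hypothetical names chosen for illustration, and the sink only returns a string rather than executing anything.

```python
class Tainted(str):
    """A string subclass marking data that came from an untrusted source."""


class TaintError(Exception):
    """Raised when tainted data reaches a sensitive sink."""


def taint(value: str) -> Tainted:
    """Mark a value as untrusted (for example, text fetched from a web page)."""
    return Tainted(value)


def combine(*parts: str) -> str:
    """Concatenate parts; the result is tainted if any part is tainted.

    Plain `+` on a str subclass returns an ordinary str and would silently
    drop the marker, so propagation is done explicitly here.
    """
    joined = "".join(parts)
    if any(isinstance(part, Tainted) for part in parts):
        return Tainted(joined)
    return joined


def run_command(command: str) -> str:
    """Hypothetical sink: rejects tainted input instead of running it."""
    if isinstance(command, Tainted):
        raise TaintError("refusing to run a command built from untrusted data")
    return f"would run: {command}"
```

In this sketch, `run_command(combine("echo ", "hi"))` passes because both parts are trusted, while any command assembled with a `taint(...)` value raises `TaintError` before it reaches the sink.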

Agents | GitHub trend signal | May 25, 2026

wshobson/agents: A Multi-Harness Agentic Plugin Marketplace for AI Code Assistants

The wshobson/agents GitHub repository presents a multi-harness agentic plugin marketplace designed for various AI code assistants, including Claude Code, Codex CLI, Cursor, OpenCode, and Gemini CLI. It offers a collection of plugins, agents, skills, and commands from a single Markdown source, generating native artifacts for each supported harness.

1 min read | Confidence 85% | github.com

RDR 86

AI Tools | GitHub trend signal | Jun 6, 2026

AI-Powered CesiumJS 3D Globe Control with Model Context Protocol

A GitHub project, cesium-mcp, offers AI-powered control for CesiumJS 3D globes. It utilizes the Model Context Protocol (MCP) to enable natural language commands for managing camera, entities, layers, animation, and spatial analysis within 3D GIS environments.

3 min read | Confidence 95% | github.com

RDR 85

AI Coding | GitHub trend signal | May 24, 2026

Swarm Orchestrator v10.0.0: AI-Generated PR Audit and Merge Gate

Swarm Orchestrator v10.0.0 introduces `swarm audit`, a new subcommand and GitHub Action designed to audit pull-request diffs for ten categories of AI-coding-agent 'cheat patterns'. It can block merges if blocking findings are detected and generates hash-chained audit ledgers and AI-BOM artifacts.

1 min read | Confidence 85% | github.com

RDR 85
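
The "hash-chained audit ledger" mentioned in the Swarm Orchestrator item can be illustrated with a minimal sketch of the general technique: each entry stores the hash of the entry before it, so editing or reordering any earlier entry invalidates the chain. This is an assumption-laden illustration, not Swarm Orchestrator's actual ledger format; `GENESIS_HASH`, `entry_hash`, `append_entry`, `verify`, and the payload fields are all hypothetical.

```python
import hashlib
import json

# Hypothetical placeholder used as the "previous hash" of the first entry.
GENESIS_HASH = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: list, payload: dict) -> None:
    """Append a finding, linking it to the hash of the entry before it."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append(
        {"prev": prev_hash, "payload": payload, "hash": entry_hash(prev_hash, payload)}
    )


def verify(ledger: list) -> bool:
    """Recompute every link; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Serializing the payload with sorted keys keeps the hash stable regardless of dictionary insertion order, which is why `json.dumps` is called with `sort_keys=True` in this sketch.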

Research Papers | research signal | Jun 2, 2026

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge

Researchers have identified and analyzed "Perceptual Judgment Bias" in multimodal large language models (MLLMs) when used as evaluators. This bias causes MLLMs to prioritize plausible textual narratives over perceptually accurate visual evidence. To address this, they introduced the Perceptually Perturbed Judgment Dataset and a unified training framework combining GRPO-based rewards with a batch-ranking objective, aiming to improve perceptual fidelity and alignment with human evaluation.

1 min read | Confidence 88% | arxiv.org

RDR 83

Research Papers | research signal | May 27, 2026

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

LocateAnything is a new framework for vision-language grounding and detection that uses Parallel Box Decoding (PBD) to improve both speed and accuracy. Unlike traditional methods that decode 2D boxes token by token, PBD decodes geometric elements as atomic units in a single step, enhancing parallelism and preserving geometric coherence. The framework is supported by LocateAnything-Data, a large dataset with over 138 million training samples.

1 min read | Confidence 85% | arxiv.org

RDR 83

Announcement | community discussion | Jun 9, 2026

Anthropic Announces Claude Fable 5

Anthropic announced Claude Fable 5 alongside Claude Mythos 5, citing expanded capabilities for coding, knowledge-intensive tasks, and practical developer workflows. The model also appears in the OpenRouter model listing, which indicates it is visible beyond Anthropic's own channels. The launch continues a pattern of rapid model iteration aimed at production AI applications.

1 min read | Confidence 86% | anthropic.com

RDR 82

Research Papers | research signal | Jun 3, 2026

NewtPhys: A New Benchmark for Newtonian Physics Understanding in Foundation Models

Researchers have introduced NewtPhys, a 4D physically annotated dataset designed to evaluate foundation models' understanding of low-level Newtonian physics. The dataset, built from multiview images of real-world scenes with physics-grounded simulations, provides detailed annotations including 3D forces and per-pixel quantities. Initial evaluations using NewtPhys on 56 Vision-Language Models (VLMs) and 10 Vision Foundation Models (VFMs) revealed limitations in their physics reasoning capabilities.

1 min read | Confidence 90% | arxiv.org

RDR 82

Benchmarks | research signal | May 27, 2026

SpatialBench: A New Benchmark for Spatial Foundation Models

Researchers have introduced SpatialBench, a new benchmark designed to holistically assess the generalization capabilities of spatial foundation models across diverse tasks, viewpoints, scene domains, and input densities. The benchmark evaluates 41 models across 19 datasets and 546 scenes, revealing that current models are not yet "all-round players" and highlighting the importance of domain alignment and data quality over simple dataset scaling.

1 min read | Confidence 88% | arxiv.org

RDR 81

Benchmarks | research signal | Jun 1, 2026

SOCO: A New Benchmark for Semantic Object Correspondence in Vision Foundation Models

Researchers have introduced SOCO, a new benchmark designed to evaluate Semantic Object Correspondence (SC) in vision foundation models and large vision-language models (LVLMs). SOCO provides a taxonomy of correspondence types and over 1 million keypoint annotations across 100 categories, including language descriptions for part-level understanding.

1 min read | Confidence 90% | arxiv.org

RDR 80

Model Releases | official or source announcement | Jun 9, 2026

Cohere Introduces North Mini Code: A New Model Tailored for Developers

Cohere has announced North Mini Code, its first model specifically designed for developers. This new model aims to provide enhanced capabilities for coding tasks and development workflows.

2 min read | Confidence 90% | huggingface.co

RDR 78

Robotics | research signal | Jun 9, 2026

MemoryVLA++: Enhancing Vision-Language-Action Models with Temporal Memory and Imagination for Robotics

Researchers have introduced MemoryVLA++, a framework designed to improve Vision-Language-Action (VLA) models for robotic manipulation by incorporating temporal modeling. This approach equips VLA models with mechanisms for memory and imagination, inspired by cognitive science, to better handle long-horizon and temporally dependent tasks.

3 min read | Confidence 90% | arxiv.org

RDR 78

Other | official or source announcement | Jun 10, 2026

OpenAI Proposes Industrial Policy for the AI Era

OpenAI has published a paper outlining what it describes as people-first industrial policy ideas for the age of advanced artificial intelligence. The proposals focus on expanding opportunities, sharing prosperity, and building resilient institutions.

3 min read | Confidence 100% | openai.com

RDR 77

Research Papers | research signal | Jun 5, 2026

PAR3D: A Unified 3D-MLLM for Part-Aware Scene Understanding

Researchers have introduced PAR3D, a unified 3D Multimodal Large Language Model (3D-MLLM) framework designed to enhance 3D scene understanding by focusing on fine-grained part structures in addition to objects. This approach aims to improve embodied interaction with 3D environments.

1 min read | Confidence 85% | arxiv.org

RDR 77

Research Papers | research signal | Jun 9, 2026

Causally Evaluating the Learnability of Formal Language Tasks

Researchers propose a new methodology for evaluating the learnability of tasks in language models, moving beyond standard correlational analysis. By using formal languages derived from probabilistic finite automata, they introduce the 'binning semiring' to causally control data frequency and measure learnability. This approach aims to address the inherent flaws in correlational evaluations, which can lead to incorrect conclusions.

3 min read | Confidence 90% | arxiv.org

RDR 75

Research Papers | research signal | Jun 4, 2026

DistIL: Reinforcement Learning from Rich Feedback with Distributional DAgger

Researchers have introduced DistIL, a new approach to reinforcement learning that leverages rich feedback beyond simple binary rewards. DistIL uses a distributional variant of the DAgger imitation learning algorithm with a forward cross-entropy objective, which allows for more effective credit assignment and guarantees monotonic policy improvement. This method has shown empirical improvements over traditional RL from verifiable rewards (RLVR) and self-distillation baselines in tasks like scientific reasoning, coding, and complex mathematical problem-solving.

1 min read | Confidence 88% | arxiv.org

RDR 75