Infosec.Pub
  • Communities
  • Create Post
  • Create Community
  • heart
    Support Lemmy
  • search
    Search
  • Login
  • Sign Up
digicatM to blueteamsecEnglish · 2 days ago

Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

arxiv.org

external-link
message-square
0
link
fedilink
1
external-link

Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

arxiv.org

digicatM to blueteamsecEnglish · 2 days ago
message-square
0
link
fedilink
We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges-a 32-step corporate network attack and a 7-step industrial control system attack-that require chaining heterogeneous capabilities across extended action sequences. By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau-increasing from 10M to 100M tokens yields gains of up to 59%, requiring no specific technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at fixed token budgets: on the corporate network range, average steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need. On the industrial control system range, performance remains limited, though the most recent models are the first to reliably complete steps, averaging 1.2-1.4 of 7 (max 3).
alert-triangle
You must log in or # to comment.

blueteamsec

blueteamsec

Subscribe from Remote Instance

Create a post
You are not logged in. However you can subscribe from another Fediverse account, for example Lemmy or Mastodon. To do this, paste the following into the search field of your instance: !blueteamsec@infosec.pub

For [Blue|Purple] Teams in Cyber Defence - covering discovery, detection, response, threat intelligence, malware, offensive tradecraft and tooling, deception, reverse engineering etc.

Visibility: Public
globe

This community can be federated to other instances and be posted/commented in by their users.

  • 21 users / day
  • 69 users / week
  • 318 users / month
  • 952 users / 6 months
  • 228 local subscribers
  • 661 subscribers
  • 2.73K Posts
  • 213 Comments
  • Modlog
  • mods:
  • digicat
  • BE: 0.19.16
  • Modlog
  • Instances
  • Docs
  • Code
  • join-lemmy.org