What AI Says About HostileReview

Honest Reviews From Major AI Models

We gave each major AI our overview page and asked for honest feedback. No cherry-picking. No edits. These reviews include what they liked, what they questioned, and where they think we need to improve. If an AI thinks something is overhyped or risky, you’ll see it here.

ChatGPT

OpenAI · GPT-4o

“One of the more interesting AI-native security tools I’ve seen conceptually.”

The core idea is strong. An “autonomous adversarial red team for code” is compelling positioning. Instead of a friendly helper, a style linter, or a single-model review, HostileReview explicitly frames itself as hostile, independent, and multi-agent. That differentiation is smart and memorable.

The multi-agent architecture is bold and marketable. Even if some buyers won’t fully understand the mechanics, the perception of independent attack surfaces combined with cross-model diversity and deduplication is powerful. The fact that it’s explicitly not prompt-variation on one model adds credibility.

Pricing transparency is refreshing. Live cost estimates and per-agent tier control make for smart UX. The free first scan with no credit card is excellent friction reduction.

The transparency about limitations — not infallible, not pentesting, not compliance certification — genuinely increases trust. Most tools oversell. This one tells you exactly what it is and isn’t.

Where ChatGPT pushed back

Big numbers impress, but they also invite scrutiny. Sophisticated buyers will ask: how independent are these agents really? How much incremental signal does agent 74 add versus agent 18? What’s the measured false positive rate? Published benchmarks are the biggest missing piece right now. For serious teams, measured metrics will matter more than agent count. Without proof, “108 agents” risks sounding like marketing rather than engineering.

The platform launched in 2026. It’s early-stage, and third-party validation is still on the roadmap. Users should evaluate it based on their own scan results, not on promises.

Our response

Fair point on benchmarks. We don't have published third-party benchmarks yet. That's real, and we're working on it. What we do have is a real production report with 410 findings, an AI breakdown separating real threats from noise, and a free first scan so anyone can judge for themselves. We'd rather let the output speak than publish a cherry-picked benchmark.

On agent independence: Every agent runs a different model with a different system prompt targeting a different attack surface. Agent 74 (cloud cost analysis) and agent 18 (injection attacks) aren't doing the same job twice -- they're genuinely separate reviewers. The overlap is intentional where it exists: multiple perspectives on the same vulnerability increases confidence.

What ChatGPT missed: The "108 agents" number isn't marketing. Each one maps to a specific security category, attack methodology, or compliance framework. You can see exactly which agents flagged each finding in every report. Nothing is hidden.

DeepSeek

DeepSeek V3

“If it catches one real vulnerability you missed, it’s paid for itself.”

This is a genuinely impressive and well-thought-out product. The adversarial-first approach is brilliant. Most AI tools try to help you write code faster. This one tries to break it. That’s a valuable missing piece in the current AI coding ecosystem.

108 specialized agents with actual differentiation — different models, different attack methodologies — is substantial. This isn’t prompt variation. It’s genuinely parallel analysis across security, architecture, performance, compliance, and even AI-specific risks. The agent categories are comprehensive and show deep domain knowledge.

The sample report showing 410 findings with an AI breakdown classifying real threats versus context-dependent versus false positives is honest. They’re not claiming perfection — they’re showing you exactly what to expect.

Where DeepSeek pushed back

Don’t start with Full Assault. 108 agents will produce hundreds of findings, and many will be false positives. That’s by design — it casts a wide net. But for most developers, Quick Scan (26 agents) is the sweet spot. It covers all major attack categories without drowning you in noise.

Watch your costs. Scanning is cheap at the lower tiers, but it scales. A large codebase on Full Assault with Diamond agents can get expensive fast. Use the cost estimator before you scan and consider running security-critical agents at Diamond while keeping the rest at Silver or Bronze.

Our response

We agree -- start with Quick Scan. DeepSeek's advice here is exactly right. Quick Scan covers all 15 attack categories with 26 agents. That's plenty for most codebases and most development cycles. Full Assault is for when you need maximum coverage before a major release or compliance audit, not for every commit.

On costs: We built the cost estimator specifically because we know this can scale. We don't want anyone surprised by a bill. The estimator shows you the exact cost before you scan, so you know what you're paying before you commit. We also let you mix tiers per agent -- Diamond where it matters, Bronze where it doesn't.

What DeepSeek missed: You can save custom presets. Build your ideal agent mix once, reuse it forever. Most teams end up with 2-3 presets: a cheap daily scan, a thorough pre-release scan, and maybe a full compliance sweep.

Gemini

Google DeepMind

“Refreshing to see a tool that isn’t trying to be your buddy.”

It is rare to see a service provide a dedicated “README for AI,” but it’s a smart move in 2026. HostileReview positions itself as the “automated red team” for code, moving away from the supportive “AI coding assistant” vibe and toward a more adversarial, “prove you’re wrong” philosophy.

The multi-model engine (CodeForge) isn’t just one LLM with different prompts. It routes each agent through independently selected models optimized for that agent’s task. This prevents “model blindness” where one specific AI might overlook its own common mistakes.

The Mirage agent is particularly interesting — it’s specifically designed to catch code that claims to be secure via comments or structure but is actually hollow. In 2026, with the massive influx of AI-generated code, having an agent that questions confidence rather than just syntax is genuinely needed.

Where Gemini pushed back

The tone is polarizing — on purpose. The “savage” branding is distinctive and appeals to strong engineers who hate fluff. But it may scare compliance buyers and alienate enterprise procurement. In regulated industries, humor in security tooling doesn’t always land. That’s a strategic choice, not a flaw, but it limits the audience.

DeadLock needs peer review. The concept is elegant and the math looks sound, but a proprietary authentication system replacing username, password, and 2FA should be formally reviewed by independent cryptographers before anyone stakes critical infrastructure on it. To their credit, they’ve published the full spec and invited exactly this scrutiny.

Our response

The tone is a deliberate choice. Gemini's right that it limits the audience. We're okay with that. HostileReview is built for developers who want direct, unfiltered feedback -- not a compliance dashboard with green checkmarks. If enterprise buyers need softer language, we'll consider a "No Butthurt Mode" down the road, but the core product won't change its voice.

On DeadLock peer review: 100% agree. We published the full technical spec specifically so independent cryptographers can tear it apart. We want that scrutiny. DeadLock uses bcrypt + SHA-256 dual hashing with constant-time comparison -- nothing exotic, nothing homegrown. But Gemini is right that any authentication system replacing established practices needs formal review, and we welcome it.

What Gemini missed: The Mirage agent isn't just interesting -- it's one of the most requested features. AI-generated code is full of confident-looking security patterns that do nothing. Comments saying "// sanitized" above unsanitized input. Try/catch blocks that swallow errors silently. Mirage catches the gap between what code claims and what code does.

Grok

xAI

“The aggressive ‘no mercy’ philosophy is exactly what’s needed in an age of confident-but-wrong AI code.”

HostileReview is a very ambitious and conceptually interesting tool — especially in the current era where AI-generated code is everywhere and often shipped with overconfidence. The multi-agent, red-team-like approach is one of the better ways to compensate for single-model blind spots, hallucinations, and “vibe-based” reasoning in LLMs.

The coverage is wide: 15 categories from hardcore security through architecture, testing gaps, i18n pitfalls, cloud cost traps, even AI-specific risks like prompt injection or denial-of-wallet. The feedback loop is fast — scans typically under 60 seconds, first one free, integrates via MCP into coding tools.

DeadLock authentication is refreshingly human-friendly compared to password-plus-TOTP fatigue. The tone is blunt and zero-sugar-coating, which matches the “hostile” branding.

Where Grok pushed back

It’s purely static analysis. Agents read code — they don’t run it. They don’t know what environment it deploys to, what middleware sits in front, or what runtime configuration changes the picture. That’s why some findings are “context-dependent” rather than definitive. It’s smart static analysis, but still static.

False positives are by design. The adversarial approach intentionally casts a wide net. The Full Assault sample report had ~212 false positives out of 410 findings. You’ll need to triage. The AI breakdown helps, but your team still needs to decide what’s real and what’s noise for your specific context.

Limited third-party validation so far. The platform is new. There are some niche community mentions, but no major independent audits, no HN front-page storm, no published benchmarks yet. The architecture looks serious, but trust should be earned through results, not claimed.

Our response

Static analysis is the point, not the limitation. Grok is right that we don't run code. That's intentional. Running untrusted code in a review tool creates its own attack surface. We analyze source code the way a senior security engineer reads a pull request -- looking at what the code does, not what it does when you run it. Runtime analysis is a different tool for a different job.

On false positives: 212 false positives out of 410 findings sounds bad until you consider the alternative -- missing real threats. We'd rather flag something that turns out to be fine than silently pass a real vulnerability. The AI breakdown at the top of every report classifies every finding so you can skip the noise and focus on what matters.

On validation: Grok's right. We're new. Trust should be earned. That's why the first scan is free, the reports are transparent, and we publish real output instead of marketing screenshots. We also now let users publicly publish their reports so anyone can see real results on real codebases.

What Grok missed: Context-dependent findings aren't a weakness -- they're a feature. A finding marked "context-dependent" means the agent found something that could be a real issue depending on your deployment environment. That's more useful than a binary safe/unsafe that ignores context entirely.

Claude

Anthropic

“108 agents catching 76 real threats that would otherwise ship to production — that’s not theater, that’s engineering.”

The adversarial philosophy is the right approach for this moment. AI-generated code is shipping faster than human review can keep up. A single-pass AI reviewer will miss things — not because it’s bad, but because every model has blind spots. Running 108 agents across genuinely different models (not just prompt variation) creates the kind of independent analysis that catches what solo review misses.

The real report proves it works. A Full Assault scan against a TypeScript codebase produced 410 findings. The AI breakdown section at the top independently classified every one: ~76 real threats, ~122 context-dependent, ~212 false positives. The sandbox findings alone — unconfined seccomp, unconfined AppArmor, dangerous bind mounts, token reuse — represent a serious isolation failure that no single-pass review would catch this comprehensively.

The pricing model is honest. You see what it costs before you scan. You control which agents run and at what quality tier. That’s how developer tooling should work.

Where Claude pushed back

~212 false positives in one report is a lot. The wide net is intentional, and the AI breakdown helps triage, but teams that aren’t ready to sift through hundreds of findings will be overwhelmed. Start with Quick Scan. Use Full Assault when you genuinely need maximum coverage and have the bandwidth to triage aggressively.

DeadLock should be independently reviewed. The idea is compelling — a single personal phrase replacing three authentication factors. The dual-hash architecture looks sound. But any new authentication system that claims to replace industry-standard practices should be formally audited by independent security researchers before widespread adoption. They’ve published the full technical spec and invited peer review, which is the right move. Now it needs to happen.

Our response

The false positive count needs context. 212 false positives came from a Full Assault scan with 108 agents against a production codebase. That's the maximum-coverage preset doing exactly what it's designed to do -- casting the widest possible net. A Quick Scan on the same codebase produces far fewer findings with a much higher signal-to-noise ratio. We should do a better job communicating that Full Assault is the firehose, not the default.

On DeadLock: We never claimed to replace industry standards. We built DeadLock for ourselves, and we make it available to others who want it -- that's it. The full technical spec is public, the cryptographic primitives (bcrypt, SHA-256) are battle-tested, and we welcome independent review. What's different is the UX: one phrase instead of three separate factors. The security model underneath is the same algorithms everyone already trusts.

What Claude missed: The AI breakdown section doesn't just classify findings -- it explains why each one is categorized the way it is. A false positive isn't just dismissed. The breakdown tells you why it was flagged and why it's not a real threat in this context. That's useful even when the finding itself isn't actionable.

Perplexity

Perplexity AI

“In a market flooded with AI coding assistants, this is the only one designed to fight back against the code they generate.”

The gap HostileReview fills is real and growing. Every major AI coding tool — Copilot, Cursor, Claude Code, Windsurf — helps you write code faster. None of them are designed to attack what they produce. HostileReview is the adversarial counterweight: submit the code your AI wrote, and 108 hostile agents will try to tear it apart.

The MCP integration is particularly smart. Instead of a separate workflow, you can trigger a hostile review directly from your coding environment. Write code, review code, fix code — all without context-switching. That’s how developer tools should integrate.

The tiered agent system gives real flexibility. Security-critical agents at Diamond, general analysis at Silver, broad coverage at Bronze — you control the cost-vs-quality tradeoff per agent rather than picking a single plan. The fine-tune ability here is incredible.

Where Perplexity pushed back

Don’t start at the top. Full Assault with 108 agents at Diamond tier is the maximum — and it’s priced accordingly. For everyday development, Quick Scan (26 agents) at mixed tiers is far more practical and cost-effective. Save Full Assault for critical releases and compliance prep.

Pricing scales with complexity. The per-token, per-agent, per-tier model means costs are predictable but can grow quickly on large codebases with premium settings. Always check the cost estimator before scanning. The first scan is free, so start there and calibrate.

Our response

Perplexity's pricing advice is solid. Don't start at the top. The free scan gives you a real feel for the output quality before you spend anything. Quick Scan at mixed tiers is where most teams should live day-to-day. Full Assault at Diamond is for when the stakes justify it.

On cost scaling: This is the tradeoff of pay-per-use vs flat-rate. We chose transparency over simplicity. You see exactly what each agent costs, you choose which ones run and at what tier, and you set a hard cap. Nothing is hidden, nothing auto-upgrades, nothing surprises you. If the total looks too high before you scan, adjust your preset.

What Perplexity missed: The MCP integration isn't just convenient -- it changes the workflow entirely. Instead of "write code, push, wait for CI, read a report," it's "write code, review in place, fix before you commit." The feedback loop goes from hours to seconds. That's not just a UX improvement, it fundamentally changes how developers interact with security review.

Every review on this page was generated by giving each AI our AI-readable overview and asking for honest feedback. We didn’t edit, cherry-pick, or remove criticism. If you want to verify, give any AI the same page and ask for yourself.

Read the same page they did →