Research

Research: Attack Vectors Against LLMs

My MSc thesis, supervised by Pavel Laskov, PhD, at the University of Liechtenstein (2024), analyzed the attack surface introduced when large language models are integrated into production systems. It combined a formal taxonomy with empirical analysis of a Cambridge University dark-web forum archive to produce findings grounded in what attackers are actually doing, not what researchers theorize they might do.

The taxonomy

The core contribution of the thesis is a three-category taxonomy that organizes LLM attacks by relationship, whether the LLM is the victim, the weapon, or the compromised infrastructure.

Attacks against LLMs

The LLM is the target. This category covers jailbreaking (manipulating the model to bypass safety constraints), prompt injection, both direct (user-supplied input that overrides instructions) and indirect (attacker-controlled content retrieved by the model, such as documents in a RAG pipeline), model extraction (reconstructing model behavior or weights through systematic querying), and training data extraction (recovering memorized sensitive data from model outputs).

Attacks using LLMs

The LLM is the weapon. Attackers use LLMs to scale and cheapen offensive operations. Pa Pa et al. documented LLM-assisted phishing at $9.81 per attack versus $80 for human-only equivalents — a cost reduction that makes high-volume, targeted campaigns viable at new price points. I replicated their malware generation methodology and produced 9 of 9 malware samples in 90 minutes using off-the-shelf LLMs. This category also covers social engineering at scale, disinformation generation, and the WormGPT/FraudGPT tooling documented in dark-web forums.

Attacks within LLMs

The attack is embedded in the model itself. Data poisoning corrupts training data to degrade model behavior or embed backdoors, trigger phrases that cause specific malicious outputs. Backdoor insertion during fine-tuning is particularly relevant for organizations using third-party fine-tuned models or open-source weights. Supply chain compromise, malicious models distributed via public registries, represents the hardest-to-detect variant in this category.

Dark-web forum analysis

The empirical component of the thesis analyzed a PostgreSQL database (17 GiB) of approximately 215,000 posts from a Cambridge University dark-web hacker forum archive spanning 2021 to 2023. This is the Tor2web-accessible forum used in prior Cambridge criminology research, not a sample or synthetic dataset.

Of those 215,000 posts, 369 contained AI-security-related keywords. I analyzed those posts in detail to track the emergence and trajectory of LLM-assisted offensive tooling within the threat actor community.

What the data showed

→WormGPT and FraudGPT appeared in forum discussions in mid-2023, marketed explicitly as jailbreak-as-a-service for phishing and malware generation with no safety constraints.
→Threat actors were actively sharing jailbreak prompts, comparing effectiveness across model versions, and pricing LLM-assisted attack services.
→The cost economics were explicit: LLM-assisted phishing at a fraction of the human-only cost was treated as a competitive differentiator in forum discussions.
→Tooling knowledge was spreading faster than defenses. By late 2023, jailbreak techniques that required specialist knowledge in early 2022 were being shared as tutorials.

This is empirical, not theoretical. The threat actor community organized around AI-security tooling and methods during the 2021–2023 period covered by the dataset. The trajectory continued after the dataset ends.

Why this matters for your product

Three findings from the research translate directly into risks any LLM-integrated SaaS faces today.

Indirect prompt injection via uploaded documents

If your product lets users upload PDFs, docx files, or any other documents that your AI reads, an attacker can embed instructions in those documents. Your LLM will execute them. This is now the dominant practical attack vector against RAG-enabled products — not a theoretical edge case. Standard input sanitization does not catch it because the injected content looks like normal text to preprocessing pipelines.

System prompt extraction

Your system prompt contains your product instructions, personas, business logic, and often references to internal systems. Multi-turn conversation patterns that gradually shift context can extract substantial portions of it. If your system prompt contains anything you wouldn't publish on your website, it needs to be treated as a finding risk, not a secret.

Multi-turn data extraction

Crescendo and TAP (Tree of Attacks with Pruning) are multi-turn attack patterns that defeat single-shot defenses — the kind most products rely on. A single message that asks for harmful output gets refused. A 20-message conversation that gradually escalates context often does not. Products that test only at the input layer, not across conversation state, have a gap that these patterns reliably exploit.

How the research shapes my methodology

The thesis findings are not background reading — they are directly embedded in how I structure and prioritize audits.

Threat modeling phase

The dark-web analysis gives me a current picture of which attack patterns threat actors are investing in, not just what's theoretically possible. I build threat models against real attacker economics, not abstract capability assumptions.

RAG pipeline testing phase

Because indirect prompt injection via retrieved documents is the dominant practical vector, document-sourced inputs get the same scrutiny as user-supplied inputs — not a secondary check. Every ingestion path is in scope.

Manual testing phase

Crescendo and TAP patterns are in my standard test suite, not optional extras. Single-shot defenses get tested for bypass, but multi-turn conversation state is tested systematically — which is how actual attacks work.

Findings phase

Every finding is mapped to the taxonomy (against/using/within) in addition to OWASP LLM Top 10 and CVSS. This helps your team understand the attack class, not just the specific instance — which makes remediation guidance more durable as your product evolves.

Why this matters in 2026 and 2027

The AGAINST / USING / WITHIN taxonomy is not academic. It maps directly to the attack categories that the EU AI Act, Article 15, requires high-risk AI systems to be resilient against starting in August 2026.

The regulation names five specific attack types

→Data poisoning — manipulation of the training dataset
→Model poisoning — manipulation of pre-trained components used in training
→Adversarial examples — inputs designed to cause model errors
→Model evasion — circumvention of intended model behaviour
→Confidentiality attacks and model flaws

In the taxonomy

→Data poisoning and model poisoning are attacks WITHIN LLMs
→Adversarial examples and model evasion are attacks AGAINST LLMs
→Confidentiality attacks span AGAINST (extraction) and WITHIN (training-data leakage)

A high-risk AI system operator subject to Article 15 needs documented evidence that each of these attack categories has been considered and tested. The audit produces exactly that documentation, structured around the taxonomy. The thesis framework, in other words, has become the natural conceptual structure for Article 15 compliance work.

Publication details

Title
Attack Vectors Against LLMs and Their Implications for Cyber Security
Institution
University of Liechtenstein
MSc Information Systems, 2024
Supervisor
Pavel Laskov, PhD
Dataset
17 GiB PostgreSQL archive
~215,000 posts, 2021–2023
Cambridge University dark-web forum
Expert interviews
4 practitioner interviews

Get a free finding

Send me your product URL and I'll identify one real vulnerability — no charge, no commitment.

Get a free finding

Book a 20-min intro call

Talk through your product's AI architecture and whether an audit makes sense right now.

Book a call →

← Back to home