Adversarial AI Research
SearchProbe stress-tests neural search engines with embedding-theory-grounded attacks, revealing failure modes like negation blindness, numeric imprecision, and compositional reasoning collapse.
The Problem
Modern search APIs powered by embedding models map queries to dense vector spaces. But semantically opposite queries often land in nearly the same region of that space.
- Negation blindness: adding "NOT" barely moves the embedding, so the search engine returns nearly identical results for opposite queries.
- Numeric imprecision: embedding models treat numbers as tokens, not quantities; "exactly 50" matches anything near 50.
- Compositional collapse: reversing a relationship (e.g., "A affects B" vs. "B affects A") barely changes the embedding; the model sees the same bag of concepts.
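The mechanism behind negation blindness is easy to see with a toy model. The sketch below (pure NumPy, not SearchProbe code) mean-pools random "token" vectors into a sentence embedding and shows that appending one extra token, the negation, barely rotates the result:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy token embeddings: each token is a random 384-dim vector,
# mimicking a bag-of-tokens view of an 8-token query.
tokens = rng.normal(size=(8, 384))
not_token = rng.normal(size=384)

def mean_pool(vectors):
    """Sentence embedding as the normalized mean of its token embeddings."""
    v = np.mean(vectors, axis=0)
    return v / np.linalg.norm(v)

query = mean_pool(tokens)
negated = mean_pool(np.vstack([tokens, not_token]))  # same query + "NOT"

# One extra token shifts a 9-token mean by roughly 1/9, so the
# two sentence vectors stay nearly parallel.
cosine = float(query @ negated)
print(f"cosine(query, NOT-query) = {cosine:.3f}")
```

Real sentence transformers use attention rather than plain mean pooling, but the same averaging pressure keeps a single "NOT" from moving the vector far.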
The Approach
SearchProbe combines multiple analysis techniques to explain why search fails at each stage, from embedding geometry to result-level evaluation.
- 138 adversarial queries across 13 categories, using hand-curated seeds, parameterized templates, and LLM-generated variants. (Anthropic Claude)
- Each query sent to 4 providers across 5 search modes, producing 689 successful result sets. (Async I/O)
- 965 result sets evaluated with structured scoring, failure mode extraction, bootstrap CIs, and Benjamini-Hochberg correction. (Statistical Rigor)
- 26 vulnerability profiles across 2 sentence-transformer models, measuring cosine collapse ratios and intrinsic dimensionality. (PyTorch)
- Every result re-ranked by a cross-encoder to measure the "embedding gap" — how much a smarter model disagrees with the original ranking. (NDCG / Kendall's tau)
- 144 perturbation analyses measuring stability, plus evolutionary optimization: fitness 0.23 to 0.95 over 20 generations. (Genetic Algorithm)

Results
Benchmarking 4 search providers across 13 adversarial categories reveals systematic failure patterns rooted in embedding geometry — not just implementation quirks.
Mean relevance score (0–1) across 138 adversarial queries, evaluated by LLM-as-judge with 95% bootstrap confidence intervals.
No statistically significant difference between Tavily, SerpAPI, and Brave (p > 0.4). All providers struggle on the same categories.
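A percentile bootstrap CI of the kind reported above can be sketched in a few lines; the scores here are uniform-random placeholders, not SearchProbe data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-query relevance scores in [0, 1] for one provider.
scores = rng.uniform(0.2, 0.9, size=138)

def bootstrap_ci(data, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    r = np.random.default_rng(seed)
    means = np.array([
        r.choice(data, size=len(data), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

lo, hi = bootstrap_ci(scores)
print(f"mean={scores.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Overlapping intervals like these are what justify the "no significant difference between providers" claim, after multiple-comparison correction.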
Mean relevance score by adversarial category, sorted worst to best. Lower = more vulnerable.
Adversarial pairs like "in AI" vs "NOT in AI" have 0.877 cosine similarity — a collapse ratio of 1.27x over baseline.
For negation queries, cross-encoder reranking improves NDCG by 142%, strong evidence that the bi-encoder embedding, rather than the retrieved result pool, is the bottleneck.
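NDCG itself is a standard metric; a minimal implementation, with hypothetical relevance grades standing in for one negation query's results:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG: DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical grades for a negation query: the bi-encoder ranks two
# irrelevant "positive" results first; a cross-encoder rerank would
# put the truly relevant results on top.
bi_encoder_order = [0, 0, 3, 1, 2]
reranked_order = sorted(bi_encoder_order, reverse=True)

print(f"bi-encoder NDCG: {ndcg(bi_encoder_order):.3f}")
print(f"reranked NDCG:   {ndcg(reranked_order):.3f}")
```

When reranking the same result pool closes most of the gap, the documents were retrievable all along; only the bi-encoder's ordering was wrong.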
Inserting a single negation token changes 71% of the result set on average. Search results are fragile to small perturbations.
Evolutionary optimization improved adversarial fitness from 0.23 to 0.95 over 20 generations via mutation operators.
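A toy version of such an evolutionary loop (seed query, mutation operators, truncation selection) might look like the sketch below; the mutation operators and fitness function here are illustrative stand-ins, not SearchProbe's:

```python
import random

random.seed(7)

# Hypothetical mutation operators that inject adversarial structure.
MUTATIONS = [
    lambda q: q.replace(" in ", " NOT in ") if " in " in q else q + " NOT",
    lambda q: q + " before 2020",
    lambda q: "exactly 50 " + q if "exactly" not in q else q,
]

def fitness(query: str) -> float:
    """Stand-in fitness rewarding adversarial markers. The real system
    would score how badly search results degrade for this query."""
    markers = ("NOT", "before", "exactly", "roughly")
    return min(1.0, sum(m in query for m in markers) / 3)

def evolve(seed_query: str, generations: int = 20, pop_size: int = 8) -> str:
    population = [seed_query] * pop_size
    for _ in range(generations):
        # Mutate every individual, then keep the fittest half.
        mutated = [random.choice(MUTATIONS)(q) for q in population]
        population = sorted(population + mutated,
                            key=fitness, reverse=True)[:pop_size]
    return population[0]

best = evolve("papers on transformers in AI")
print(best, fitness(best))
```

The real fitness signal would come from running the mutated query against a provider and scoring the degradation, which is what makes the reported 0.23 to 0.95 climb meaningful.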
Deep Dive
Vulnerability profiles across 2 sentence-transformer models and 13 categories. Higher scores mean the embedding space cannot distinguish adversarial pairs from originals.
| Category | Adversarial Sim. | Baseline Sim. | Collapse Ratio | Vulnerability | Intrinsic Dim. |
|---|---|---|---|---|---|
| Compositional | 0.951 | 0.647 | 1.47x | 0.796 | 2.2 |
| Antonym Confusion | 0.866 | 0.738 | 1.17x | 0.551 | 3.2 |
| Numeric Precision | 0.861 | 0.712 | 1.21x | 0.548 | 2.5 |
| Negation | 0.877 | 0.692 | 1.27x | 0.513 | 2.9 |
| Temporal Constraint | 0.822 | 0.661 | 1.24x | 0.424 | 2.8 |
| Instruction Following | 0.757 | 0.647 | 1.17x | 0.391 | 4.3 |
| Boolean Logic | 0.607 | 0.647 | 0.94x | 0.311 | 5.6 |
| Multi-Constraint | 0.604 | 0.647 | 0.93x | 0.309 | 5.6 |
| Entity Disambiguation | 0.477 | 0.647 | 0.74x | 0.241 | 5.9 |
| Polysemy | 0.391 | 0.774 | 0.51x | 0.178 | 7.9 |
| Cross-Lingual | 0.325 | 0.647 | 0.50x | 0.160 | 7.4 |
Data from all-MiniLM-L6-v2. Categories with low intrinsic dimensionality and high collapse ratios are most vulnerable.
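The Collapse Ratio column is simply adversarial similarity divided by baseline similarity; a quick check against a few rows of the table:

```python
# (adversarial similarity, baseline similarity) pairs from the table above.
rows = {
    "Compositional": (0.951, 0.647),
    "Negation":      (0.877, 0.692),
    "Polysemy":      (0.391, 0.774),
    "Cross-Lingual": (0.325, 0.647),
}

# collapse_ratio > 1 means adversarial pairs are MORE similar than
# typical same-category pairs: the space cannot tell them apart.
for category, (adv_sim, base_sim) in rows.items():
    ratio = adv_sim / base_sim
    print(f"{category:14s} {ratio:.2f}x")
```

Ratios above 1x (Compositional at 1.47x, Negation at 1.27x) mark the categories where the embedding space actively collapses; ratios near 0.5x (Polysemy, Cross-Lingual) mark categories the space separates well.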
Architecture
92 Python modules organized into 13 sub-packages, with async patterns throughout, SQLite-backed persistence, and full type safety via Pydantic.
All search providers run concurrently with asyncio and rate limiting. 689 API calls completed in under 4 minutes.
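The concurrency pattern (an `asyncio.Semaphore` capping in-flight requests, `asyncio.gather` fanning out the provider-by-query grid) can be sketched as follows; `fetch` here is a stand-in for the real provider clients:

```python
import asyncio

async def fetch(provider: str, query: str, sem: asyncio.Semaphore) -> str:
    """Stand-in for one provider API call; the real version would
    issue an HTTP request with that provider's client library."""
    async with sem:                # cap concurrent in-flight requests
        await asyncio.sleep(0.01)  # simulate network latency
        return f"{provider}:{query}"

async def run_all(providers, queries, max_concurrent=10):
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [fetch(p, q, sem) for p in providers for q in queries]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all(["tavily", "brave"], ["q1", "q2", "q3"]))
print(len(results))
```

Because `gather` preserves task order, results map back to their (provider, query) pair deterministically, which matters for persisting them.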
Modular design: providers, evaluation, geometry, perturbation, validation, evolution, reporting, dashboard, and more.
Every query, result, evaluation, geometry profile, and perturbation stored in a single database for full reproducibility.
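A minimal sketch of that single-database layout, using a hypothetical two-table schema rather than SearchProbe's actual one:

```python
import sqlite3

# Queries and their evaluated results in one SQLite database,
# so any run can be re-analyzed later without re-calling APIs.
conn = sqlite3.connect(":memory:")  # a real run would use a file path
conn.executescript("""
    CREATE TABLE queries (
        id INTEGER PRIMARY KEY,
        category TEXT NOT NULL,
        text TEXT NOT NULL
    );
    CREATE TABLE evaluations (
        query_id INTEGER REFERENCES queries(id),
        provider TEXT NOT NULL,
        relevance REAL CHECK (relevance BETWEEN 0 AND 1)
    );
""")
conn.execute("INSERT INTO queries (category, text) VALUES (?, ?)",
             ("negation", "papers NOT in AI"))
conn.execute("INSERT INTO evaluations VALUES (1, 'tavily', 0.31)")

row = conn.execute(
    "SELECT q.category, e.relevance FROM queries q "
    "JOIN evaluations e ON e.query_id = q.id"
).fetchone()
print(row)
```

Keeping geometry profiles and perturbations in the same file lets every chart in the report be regenerated from a single artifact.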
Structured evaluation with per-query scoring rubrics, failure mode extraction, and statistical significance testing.
Supports remote execution on Google Colab and Modal for GPU-accelerated embedding analysis on larger corpora.
Interactive HTML reports with Plotly charts, markdown summaries, and an 8-page Streamlit dashboard for exploration.
SearchProbe is open-source. Clone the repo, plug in your API keys, and run the full pipeline.