Adversarial AI Research
SearchProbe stress-tests neural search engines with embedding-theory-grounded attacks, revealing failure modes like negation blindness, numeric imprecision, and compositional reasoning collapse.
The Problem
Modern search APIs powered by embedding models map queries to dense vector spaces. But semantically opposite queries often land in nearly the same region of that space.
- Negation blindness: adding "NOT" barely moves the embedding, so the search engine returns nearly identical results for opposite queries.
- Numeric imprecision: embedding models treat numbers as tokens, not quantities; "exactly 50" matches anything near 50.
- Compositional collapse: reversing a relationship (e.g., "A affects B" vs. "B affects A") barely changes the embedding; the model sees the same bag of concepts.
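The mechanism behind negation blindness is easy to see with a toy model. The sketch below (pure NumPy, not SearchProbe code) mean-pools random "token" vectors into a sentence embedding and shows that appending one extra token, the negation, barely rotates the result:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy token embeddings: each token is a random 384-dim vector,
# mimicking a bag-of-tokens view of an 8-token query.
tokens = rng.normal(size=(8, 384))
not_token = rng.normal(size=384)

def mean_pool(vectors):
    """Sentence embedding as the normalized mean of its token embeddings."""
    v = np.mean(vectors, axis=0)
    return v / np.linalg.norm(v)

query = mean_pool(tokens)
negated = mean_pool(np.vstack([tokens, not_token]))  # same query + "NOT"

# One extra token shifts a 9-token mean by roughly 1/9, so the
# two sentence vectors stay nearly parallel.
cosine = float(query @ negated)
print(f"cosine(query, NOT-query) = {cosine:.3f}")
```

Real sentence transformers use attention rather than plain mean pooling, but the same averaging pressure keeps a single "NOT" from moving the vector far.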
The Approach
SearchProbe combines multiple analysis techniques to explain why search fails at each stage, from embedding geometry to result-level evaluation.
- 138 adversarial queries across 13 categories, using hand-curated seeds, parameterized templates, and LLM-generated variants. (Anthropic Claude)
- Each query sent to 4 providers across 5 search modes, producing 689 successful result sets. (Async I/O)
- 965 result sets evaluated with structured scoring, failure mode extraction, bootstrap CIs, and Benjamini-Hochberg correction. (Statistical Rigor)
- 26 vulnerability profiles across 2 sentence-transformer models, measuring cosine collapse ratios and intrinsic dimensionality. (PyTorch)
- Every result re-ranked by a cross-encoder to measure the "embedding gap" — how much a smarter model disagrees with the original ranking. (NDCG / Kendall's tau)
- 144 perturbation analyses measuring stability, plus evolutionary optimization: fitness 0.23 to 0.95 over 20 generations. (Genetic Algorithm)

Results
Benchmarking 4 search providers across 13 adversarial categories reveals systematic failure patterns rooted in embedding geometry — not just implementation quirks.
Mean relevance score (0–1) across 138 adversarial queries, evaluated by LLM-as-judge with 95% bootstrap confidence intervals.
No statistically significant difference between Tavily, SerpAPI, and Brave (p > 0.4). All providers struggle on the same categories.
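A percentile bootstrap CI of the kind reported above can be sketched in a few lines; the scores here are uniform-random placeholders, not SearchProbe data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-query relevance scores in [0, 1] for one provider.
scores = rng.uniform(0.2, 0.9, size=138)

def bootstrap_ci(data, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    r = np.random.default_rng(seed)
    means = np.array([
        r.choice(data, size=len(data), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

lo, hi = bootstrap_ci(scores)
print(f"mean={scores.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Overlapping intervals like these are what justify the "no significant difference between providers" claim, after multiple-comparison correction.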
Mean relevance score by adversarial category, sorted worst to best. Lower = more vulnerable.
Adversarial pairs like "in AI" vs "NOT in AI" have 0.877 cosine similarity — a collapse ratio of 1.27x over baseline.
For negation queries, cross-encoder reranking improves NDCG by 142%, strong evidence that the bi-encoder embedding, rather than the retrieved result pool, is the bottleneck.
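NDCG itself is a standard metric; a minimal implementation, with hypothetical relevance grades standing in for one negation query's results:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG: DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical grades for a negation query: the bi-encoder ranks two
# irrelevant "positive" results first; a cross-encoder rerank would
# put the truly relevant results on top.
bi_encoder_order = [0, 0, 3, 1, 2]
reranked_order = sorted(bi_encoder_order, reverse=True)

print(f"bi-encoder NDCG: {ndcg(bi_encoder_order):.3f}")
print(f"reranked NDCG:   {ndcg(reranked_order):.3f}")
```

When reranking the same result pool closes most of the gap, the documents were retrievable all along; only the bi-encoder's ordering was wrong.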
Inserting a single negation token changes 71% of the result set on average. Search results are fragile to small perturbations.
Evolutionary optimization improved adversarial fitness from 0.23 to 0.95 over 20 generations via mutation operators.
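A toy version of such an evolutionary loop (seed query, mutation operators, truncation selection) might look like the sketch below; the mutation operators and fitness function here are illustrative stand-ins, not SearchProbe's:

```python
import random

random.seed(7)

# Hypothetical mutation operators that inject adversarial structure.
MUTATIONS = [
    lambda q: q.replace(" in ", " NOT in ") if " in " in q else q + " NOT",
    lambda q: q + " before 2020",
    lambda q: "exactly 50 " + q if "exactly" not in q else q,
]

def fitness(query: str) -> float:
    """Stand-in fitness rewarding adversarial markers. The real system
    would score how badly search results degrade for this query."""
    markers = ("NOT", "before", "exactly", "roughly")
    return min(1.0, sum(m in query for m in markers) / 3)

def evolve(seed_query: str, generations: int = 20, pop_size: int = 8) -> str:
    population = [seed_query] * pop_size
    for _ in range(generations):
        # Mutate every individual, then keep the fittest half.
        mutated = [random.choice(MUTATIONS)(q) for q in population]
        population = sorted(population + mutated,
                            key=fitness, reverse=True)[:pop_size]
    return population[0]

best = evolve("papers on transformers in AI")
print(best, fitness(best))
```

The real fitness signal would come from running the mutated query against a provider and scoring the degradation, which is what makes the reported 0.23 to 0.95 climb meaningful.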
Deep Dive
Vulnerability profiles across 2 sentence-transformer models and 13 categories. Higher scores mean the embedding space cannot distinguish adversarial pairs from originals.
| Category | Adversarial Sim. | Baseline Sim. | Collapse Ratio | Vulnerability | Intrinsic Dim. |
|---|---|---|---|---|---|
| Compositional | 0.951 | 0.647 | 1.47x | 0.796 | 2.2 |
| Antonym Confusion | 0.866 | 0.738 | 1.17x | 0.551 | 3.2 |
| Numeric Precision | 0.861 | 0.712 | 1.21x | 0.548 | 2.5 |
| Negation | 0.877 | 0.692 | 1.27x | 0.513 | 2.9 |
| Temporal Constraint | 0.822 | 0.661 | 1.24x | 0.424 | 2.8 |
| Instruction Following | 0.757 | 0.647 | 1.17x | 0.391 | 4.3 |
| Boolean Logic | 0.607 | 0.647 | 0.94x | 0.311 | 5.6 |
| Multi-Constraint | 0.604 | 0.647 | 0.93x | 0.309 | 5.6 |
| Entity Disambiguation | 0.477 | 0.647 | 0.74x | 0.241 | 5.9 |
| Polysemy | 0.391 | 0.774 | 0.51x | 0.178 | 7.9 |
| Cross-Lingual | 0.325 | 0.647 | 0.50x | 0.160 | 7.4 |
Data from all-MiniLM-L6-v2. Categories with low intrinsic dimensionality and high collapse ratios are most vulnerable.
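The Collapse Ratio column is simply adversarial similarity divided by baseline similarity; a quick check against a few rows of the table:

```python
# (adversarial similarity, baseline similarity) pairs from the table above.
rows = {
    "Compositional": (0.951, 0.647),
    "Negation":      (0.877, 0.692),
    "Polysemy":      (0.391, 0.774),
    "Cross-Lingual": (0.325, 0.647),
}

# collapse_ratio > 1 means adversarial pairs are MORE similar than
# typical same-category pairs: the space cannot tell them apart.
for category, (adv_sim, base_sim) in rows.items():
    ratio = adv_sim / base_sim
    print(f"{category:14s} {ratio:.2f}x")
```

Ratios above 1x (Compositional at 1.47x, Negation at 1.27x) mark the categories where the embedding space actively collapses; ratios near 0.5x (Polysemy, Cross-Lingual) mark categories the space separates well.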
Architecture
92 Python modules organized into 13 sub-packages, with async patterns throughout, SQLite-backed persistence, and full type safety via Pydantic.
All search providers run concurrently with asyncio and rate limiting. 689 API calls completed in under 4 minutes.
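The concurrency pattern (an `asyncio.Semaphore` capping in-flight requests, `asyncio.gather` fanning out the provider-by-query grid) can be sketched as follows; `fetch` here is a stand-in for the real provider clients:

```python
import asyncio

async def fetch(provider: str, query: str, sem: asyncio.Semaphore) -> str:
    """Stand-in for one provider API call; the real version would
    issue an HTTP request with that provider's client library."""
    async with sem:                # cap concurrent in-flight requests
        await asyncio.sleep(0.01)  # simulate network latency
        return f"{provider}:{query}"

async def run_all(providers, queries, max_concurrent=10):
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [fetch(p, q, sem) for p in providers for q in queries]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all(["tavily", "brave"], ["q1", "q2", "q3"]))
print(len(results))
```

Because `gather` preserves task order, results map back to their (provider, query) pair deterministically, which matters for persisting them.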
Modular design: providers, evaluation, geometry, perturbation, validation, evolution, reporting, dashboard, and more.
Every query, result, evaluation, geometry profile, and perturbation stored in a single database for full reproducibility.
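A minimal sketch of that single-database layout, using a hypothetical two-table schema rather than SearchProbe's actual one:

```python
import sqlite3

# Queries and their evaluated results in one SQLite database,
# so any run can be re-analyzed later without re-calling APIs.
conn = sqlite3.connect(":memory:")  # a real run would use a file path
conn.executescript("""
    CREATE TABLE queries (
        id INTEGER PRIMARY KEY,
        category TEXT NOT NULL,
        text TEXT NOT NULL
    );
    CREATE TABLE evaluations (
        query_id INTEGER REFERENCES queries(id),
        provider TEXT NOT NULL,
        relevance REAL CHECK (relevance BETWEEN 0 AND 1)
    );
""")
conn.execute("INSERT INTO queries (category, text) VALUES (?, ?)",
             ("negation", "papers NOT in AI"))
conn.execute("INSERT INTO evaluations VALUES (1, 'tavily', 0.31)")

row = conn.execute(
    "SELECT q.category, e.relevance FROM queries q "
    "JOIN evaluations e ON e.query_id = q.id"
).fetchone()
print(row)
```

Keeping geometry profiles and perturbations in the same file lets every chart in the report be regenerated from a single artifact.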
Structured evaluation with per-query scoring rubrics, failure mode extraction, and statistical significance testing.
Supports remote execution on Google Colab and Modal for GPU-accelerated embedding analysis on larger corpora.
Interactive HTML reports with Plotly charts, markdown summaries, and an 8-page Streamlit dashboard for exploration.
SearchProbe is open-source. Clone the repo, plug in your API keys, and run the full pipeline.