Amgix Benchmarks
This report presents the first benchmark results for Amgix across several BEIR⁵ datasets and one custom dataset, measuring both relevance and query latency across different retrieval workloads.
It also provides an early look at Weighted Multilevel Token Representation (WMTR), Amgix's own keyword retrieval method. The goal is not only to compare vector configurations, but also to show where WMTR works well on its own, where it strengthens hybrid retrieval, and how Amgix behaves as corpus size grows from a few thousand documents to a few million.
BEIR Benchmark Context
The BEIR (Benchmarking IR) benchmark suite⁵ provides standardized datasets for evaluating information retrieval systems. For context, we include BM25 baseline scores from the original BEIR paper with each BEIR dataset tested.
Note on methodology: All results in this report reflect single-stage retrieval without re-ranking. Many top-performing models on the BEIR leaderboard use two-stage retrieval (initial retrieval + cross-encoder re-ranking), which significantly increases latency. Our focus is on production-ready search that balances relevance with predictable response times. Query latency was measured client-side with 4 parallel workers, and includes the full Amgix request path: API call, query embedding/tokenization inside Amgix, distributed search, score fusion, and response. These are not raw database timings or "bring your own vectors" search timings; they are end-to-end timings through the full system. Because model loading can dominate first-query latency, the reported P50 and P95 numbers should be read as warm, steady-state query latency rather than cold-start latency.
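To make the measurement methodology concrete, here is a simplified sketch of how client-side P50/P95 latency can be collected with a small worker pool. This is not Amgix's actual benchmark harness; `run_query` is a placeholder for a real end-to-end API call.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query: str) -> None:
    # Placeholder for a real end-to-end request through the Amgix API.
    time.sleep(0.002)

def measure_latency(queries, workers=4):
    """Run queries through a worker pool and report warm P50/P95 in ms."""
    latencies = []

    def timed(q):
        start = time.perf_counter()
        run_query(q)
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(timed, queries))

    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return cuts[49], cuts[94]  # P50, P95

# Warm-up runs would come first in practice, so these are steady-state numbers.
p50, p95 = measure_latency([f"query {i}" for i in range(100)])
```

Because the pool runs 4 workers, this measures latency under mild concurrency rather than one query at a time, matching the setup described above.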
Weighted Multilevel Token Representation (WMTR)
WMTR is Amgix's weighted lexical retrieval method. Instead of relying on a single tokenizer, it represents text through multiple lexical views at once: a surface-form view that stays closer to the original tokens, a language-aware normalized view built with Unicode word boundaries, stopword filtering, and stemming, and a character-level view that captures short local patterns inside the text. Those signals are then weighted together into a single sparse representation.
Because it's not a model-based algorithm, it's fast and lightweight, making it a great default for many use cases where you simply want high-quality lexical search. This makes it a good fit both for ordinary natural-language text and for noisier datasets (ERP, E-commerce, science, etc.) where short identifiers, mixed-format strings, and token structure still carry meaning. In practice, this is why Amgix’s keyword vector type maps to WMTR by default.
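WMTR's actual tokenizers and weighting scheme are internal to Amgix, but the multi-view idea can be illustrated with a toy sketch. The stopword list, stemming rule, and weights below are illustrative stand-ins, not Amgix's real implementation.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "for"}  # illustrative only

def surface_view(text):
    # Surface form: stays close to the original tokens, punctuation included.
    return re.findall(r"\S+", text.lower())

def normalized_view(text):
    # Language-aware view: word boundaries, stopword filtering, crude stemming.
    tokens = re.findall(r"\w+", text.lower())
    return [t.rstrip("s") for t in tokens if t not in STOPWORDS]

def char_view(text, n=3):
    # Character-level view: short local patterns inside the text.
    s = text.lower()
    return [s[i : i + n] for i in range(max(len(s) - n + 1, 1))]

def multi_view_sparse(text, weights=(1.0, 1.5, 0.5)):
    # Weight the three views together into one sparse representation.
    sparse = Counter()
    for weight, view in zip(weights, (surface_view, normalized_view, char_view)):
        for feature in view(text):
            sparse[feature] += weight
    return sparse

rep = multi_view_sparse("Gigabyte GV-N660OC-2GD")
# The surface view preserves the full identifier as a single feature:
assert "gv-n660oc-2gd" in rep
```

The key property is that no single view has to carry the whole signal: the normalized view handles ordinary language, while the surface and character views keep identifier structure intact.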
Test Setup
Hardware
Because we are measuring latency, the hardware used for the tests matters. All tests were performed on a single bare-metal machine with the following specifications:
- CPU: AMD Ryzen™ 9 5900X (12 cores / 24 threads)
- RAM: 64GB
- GPU: NVIDIA GeForce RTX 5060 Ti (16GB)
- Storage: SSD
- OS: Ubuntu 24.04.4 LTS
Amgix Deployment
Amgix v1.0.0-beta3.3 was deployed in Docker containers with a single Amgix API container and four (4) Amgix Encoder containers. We chose the four-Encoder setup partly to speed up the tests by running more queries in parallel, but it also reflects a more realistic production-like distributed deployment, where the system is scaled out to handle more concurrent queries. In these tests, clients send plain-text queries to the API; Amgix then distributes the query-processing work across the Encoder containers rather than expecting precomputed query vectors from the caller.
Amgix Encoder nodes are using GPU variants of the images, so dense embedding generation is GPU-accelerated. Since all instances are on the same physical machine, they all share the same single GPU.
RabbitMQ
RabbitMQ v4.1.3 is deployed in a single container with all default settings.
Backend Database
Qdrant v1.17.0 is deployed in a single container with all default settings.
Vector Configurations
Each test collection is indexed with 5 vectors: Full Text (FT) on name and content, WMTR on name and content, and a dense vector on the content field only. The exceptions are the Quora and PC Parts datasets: Quora has only a content field and PC Parts only a name field, so those two datasets are indexed with 3 vectors.
For dense model vectors we simply chose a common small model: sentence-transformers/all-MiniLM-L6-v2.
```json
[
  {"name": "wmtr", "type": "wmtr", "index_fields": ["name", "content"]},
  {"name": "ft", "type": "full_text", "index_fields": ["name", "content"]},
  {"name": "dense", "type": "dense_model", "model": "sentence-transformers/all-MiniLM-L6-v2", "index_fields": ["content"]}
]
```
Weight Tuning
We used the Nelder-Mead (simplex) search algorithm to tune weights for the best nDCG@10¹ score. Tuning was performed in two phases: first we ran 30 evaluations with initial weights set to 1.0 and an initial simplex edge of 0.5, then we ran 15 evaluations with an initial edge of 0.1, starting from the best weights found in the first phase.
The tuning process also served to warm up the Amgix cluster, ensuring all models were loaded and ready before running the actual benchmark queries.
In the results below, (T) marks configurations whose vector weights were tuned with this process. Configurations without (T) use the default untuned weights.
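The two-phase procedure can be sketched as follows. This is a simplified Nelder-Mead variant (reflection and contraction only), and the quadratic objective is a hypothetical stand-in for the real negated nDCG@10 evaluation, which in practice runs a full query pass per candidate weight set.

```python
def nelder_mead(f, x0, edge, max_evals):
    """Minimal Nelder-Mead minimizer with an evaluation budget."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex per dimension, offset by `edge`.
    simplex = [list(x0)] + [
        [x + (edge if i == j else 0.0) for j, x in enumerate(x0)] for i in range(n)
    ]
    scores = [f(v) for v in simplex]
    evals = n + 1
    while evals < max_evals:
        order = sorted(range(n + 1), key=scores.__getitem__)
        simplex = [simplex[i] for i in order]
        scores = [scores[i] for i in order]
        centroid = [sum(v[d] for v in simplex[:-1]) / n for d in range(n)]
        worst = simplex[-1]
        # Reflect the worst vertex through the centroid of the rest.
        reflect = [2 * c - w for c, w in zip(centroid, worst)]
        fr = f(reflect)
        evals += 1
        if fr < scores[-2]:
            simplex[-1], scores[-1] = reflect, fr
        else:
            # Otherwise contract the worst vertex toward the centroid.
            contract = [0.5 * (c + w) for c, w in zip(centroid, worst)]
            fc = f(contract)
            evals += 1
            if fc < scores[-1]:
                simplex[-1], scores[-1] = contract, fc
    best = min(range(n + 1), key=scores.__getitem__)
    return simplex[best], scores[best]

def objective(w):
    # Hypothetical stand-in for -nDCG@10 with a fictitious optimum at (0.7, 1.6).
    return (w[0] - 0.7) ** 2 + (w[1] - 1.6) ** 2

# Phase 1: 30 evaluations, initial weights 1.0, initial edge 0.5.
w1, s1 = nelder_mead(objective, [1.0, 1.0], edge=0.5, max_evals=30)
# Phase 2: 15 evaluations, initial edge 0.1, seeded with the phase-1 best.
w2, s2 = nelder_mead(objective, w1, edge=0.1, max_evals=15)
```

The second phase with a smaller edge refines the region found by the first phase rather than re-exploring the whole weight space.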
Amgix Search Query Flow
Before we get to the results, let's take a look at the search query flow through Amgix. Every query goes through the following process:
- Query text and vector weights are sent to Amgix API.
- The API dispatches query-processing work to Amgix Encoders over RabbitMQ.
- Amgix Encoders generate the query vectors inside the system (two or three, depending on the test): Full Text (FT), WMTR, and dense.
- Encoders use the generated vectors to search the database collection, matching against the indexed vectors (three or two, depending on the dataset): keyword (FT or WMTR) on `name` and `content`, and dense on `content`.
- Database search results are then ranked and fused using vector weights.
- Top 100 documents are returned to the caller.
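Amgix's exact fusion formula is not documented in this report; the sketch below assumes a simple weighted-sum fusion over per-vector scores, just to illustrate the rank-and-fuse step. The function name and the example weights are illustrative.

```python
def fuse_results(per_vector_hits, weights, top_k=100):
    """Combine per-vector search hits into one ranked list.

    per_vector_hits: {vector_name: {doc_id: score}}
    weights: {vector_name: weight}
    """
    fused = {}
    for vector_name, hits in per_vector_hits.items():
        weight = weights.get(vector_name, 1.0)
        for doc_id, score in hits.items():
            # A document found by several vectors accumulates weighted score.
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

hits = {
    "wmtr":  {"doc1": 0.8, "doc2": 0.3},
    "dense": {"doc2": 0.9, "doc3": 0.4},
}
ranked = fuse_results(hits, {"wmtr": 1.0, "dense": 1.45})
```

Note how raising the dense weight can promote a document (doc2) that neither vector ranked first on its own; this is the lever the weight-tuning phase optimizes.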
SciFact Dataset
The SciFact dataset contains 5,183 documents and 300 queries focused on scientific claims and evidence verification. This makes it an ideal benchmark for evaluating search performance on domain-specific scientific content.
Results
| | FT | FT (T) | WMTR | WMTR (T) | WMTR + FT (T) | Dense | WMTR + Dense | WMTR + Dense (T) |
|---|---|---|---|---|---|---|---|---|
| nDCG@1¹ | 0.4767 | 0.5300 | 0.4800 | 0.5233 | 0.5433 | 0.4967 | 0.5667 | 0.5967 |
| nDCG@10¹ | 0.6399 | 0.6782 | 0.6370 | 0.6679 | 0.6817 | 0.6402 | 0.6979 | 0.7197 |
| Recall@10² | 0.8046 | 0.8305 | 0.7947 | 0.8105 | 0.8294 | 0.7867 | 0.8339 | 0.8489 |
| P50 (ms)³ | 22 | 23 | 23 | 23 | 23 | 23 | 35 | 34 |
| P95 (ms)⁴ | 29 | 29 | 30 | 31 | 31 | 28 | 44 | 45 |
| Weight (KW name) | 1.00 | 0.66 | 1.00 | 0.57 | 0.52 | | 1.00 | 0.71 |
| Weight (KW content) | 1.00 | 2.02 | 1.00 | 1.67 | 1.65 | | 1.00 | 1.21 |
| Weight (D content) | | | | | | 1.00 | 1.00 | 1.45 |
Notes
- (T) marks configurations whose vector weights were tuned.
- KW: keyword vector weight
- D: dense vector weight
Analysis
This dataset responds well to both tuning and hybrid retrieval. The untuned single-vector runs cluster fairly close together, while the tuned lexical variants and especially the WMTR + Dense runs lift ranking quality more clearly. The strongest result comes from WMTR + Dense (T), which suggests that scientific claim retrieval benefits from combining lexical precision with semantic matching rather than relying on either signal alone. Latency stays low for the keyword-heavy configurations and remains fast even for the hybrid runs, despite the extra query processing and fusion work.
Key Takeaway: On SciFact, the clearest gains come from tuning and from WMTR + Dense, with the tuned hybrid providing the strongest overall balance of relevance and still-fast end-to-end latency.
TREC-COVID Dataset
The TREC-COVID dataset contains 171,332 documents and 50 queries related to COVID-19 research articles. Despite having fewer queries, this dataset's large corpus size and specialized medical terminology make it an excellent test of search performance at scale.
Results
| | FT | FT (T) | WMTR | WMTR (T) | WMTR + FT (T) | Dense | WMTR + Dense | WMTR + Dense (T) |
|---|---|---|---|---|---|---|---|---|
| nDCG@1¹ | 0.7200 | 0.7400 | 0.7700 | 0.7100 | 0.7400 | 0.6600 | 0.8100 | 0.7800 |
| nDCG@10¹ | 0.6307 | 0.6476 | 0.6225 | 0.6336 | 0.6536 | 0.5871 | 0.6670 | 0.6756 |
| Recall@10² | 0.0179 | 0.0187 | 0.0167 | 0.0177 | 0.0186 | 0.0159 | 0.0181 | 0.0186 |
| P50 (ms)³ | 30 | 23 | 26 | 25 | 24 | 25 | 37 | 37 |
| P95 (ms)⁴ | 44 | 31 | 37 | 34 | 31 | 58 | 45 | 46 |
| Weight (KW name) | 1.00 | 0.75 | 1.00 | 0.89 | 0.78 | | 1.00 | 0.77 |
| Weight (KW content) | 1.00 | 2.00 | 1.00 | 1.53 | 1.59 | | 1.00 | 1.40 |
| Weight (D content) | | | | | | 1.00 | 1.00 | 1.04 |
Notes
- (T) marks configurations whose vector weights were tuned.
- KW: keyword vector weight
- D: dense vector weight
Analysis
TREC-COVID shows a narrower win for hybrid retrieval. The clearest improvements come from WMTR + Dense and WMTR + Dense (T), while the other configurations stay close to, but generally below, the BM25 reference. That pattern suggests this corpus benefits most from combining strong lexical matching with dense semantics rather than leaning too heavily on either one by itself. Recall@10 remains low across the board, which is consistent with the difficulty and sparse relevance judgments of this benchmark. Even so, end-to-end latency remains fast overall, with the hybrid runs still well below the slowest dense-only result.
Key Takeaway: On TREC-COVID, the strongest results come from WMTR + Dense, which makes hybrid retrieval the most convincing default for this corpus.
Quora Dataset
The BEIR Quora dataset contains 523K documents and 10,000 queries, and pairs duplicate questions from Quora. Documents are question text only (content), so the collection uses the reduced three-vector layout described in Vector Configurations (no name field).
Results
| | FT | WMTR | Dense | WMTR + Dense | WMTR + Dense (T) |
|---|---|---|---|---|---|
| nDCG@1¹ | 0.6848 | 0.7030 | 0.8038 | 0.8024 | 0.8133 |
| nDCG@10¹ | 0.7807 | 0.7941 | 0.8751 | 0.8779 | 0.8848 |
| Recall@10² | 0.8823 | 0.8922 | 0.9495 | 0.9538 | 0.9577 |
| P50 (ms)³ | 17 | 17 | 25 | 31 | 31 |
| P95 (ms)⁴ | 21 | 22 | 31 | 38 | 38 |
| Weight (KW content) | 1.00 | 1.00 | | 1.00 | 0.72 |
| Weight (D content) | | | 1.00 | 1.00 | 1.44 |
Notes
- (T) marks configurations whose vector weights were tuned.
- KW: keyword vector weight
- D: dense vector weight
Analysis
Quora strongly favors semantic retrieval. Dense and WMTR + Dense produce the strongest ranking quality, while WMTR on its own also remains competitive and slightly ahead of the BM25 reference. That fits duplicate-question retrieval well: semantic similarity matters more here than exact lexical overlap. Latency is still lowest for the keyword-only configurations, but the extra cost of dense and hybrid retrieval is relatively modest compared with the gain in relevance.
Key Takeaway: On Quora, dense retrieval is the main driver of quality, and WMTR + Dense is the strongest overall configuration.
NQ Dataset
The BEIR Natural Questions (NQ) dataset contains 2.6M documents and 3,452 queries, and uses Wikipedia passages and factoid-style queries. Documents use both name and content fields, so runs use the standard five-vector layout in Vector Configurations.
Results
| | FT | FT (T) | WMTR | WMTR (T) | WMTR + FT (T) | Dense | WMTR + Dense | WMTR + Dense (T) |
|---|---|---|---|---|---|---|---|---|
| nDCG@1¹ | 0.1301 | 0.1419 | 0.1327 | 0.1501 | 0.1422 | 0.2277 | 0.2274 | 0.2402 |
| nDCG@10¹ | 0.2483 | 0.2607 | 0.2579 | 0.2741 | 0.2616 | 0.3841 | 0.3850 | 0.4075 |
| Recall@10² | 0.4050 | 0.4096 | 0.4222 | 0.4266 | 0.4111 | 0.5619 | 0.5685 | 0.5944 |
| P50 (ms)³ | 36 | 36 | 31 | 31 | 35 | 24 | 52 | 52 |
| P95 (ms)⁴ | 46 | 46 | 39 | 38 | 42 | 30 | 62 | 62 |
| Weight (KW name) | 1.00 | 0.88 | 1.00 | 0.84 | 1.12 | | 1.00 | 0.67 |
| Weight (KW content) | 1.00 | 1.60 | 1.00 | 1.69 | 1.76 | | 1.00 | 0.98 |
| Weight (D content) | | | | | | 1.00 | 1.00 | 1.70 |
Notes
- (T) marks configurations whose vector weights were tuned.
- KW: keyword vector weight
- D: dense vector weight
Analysis
NQ shows the clearest dependence on semantic matching. The lexical-only runs remain below the BM25 reference, while Dense and especially WMTR + Dense move clearly ahead, with the tuned hybrid strongest overall. For passage retrieval over a very large corpus, the quality gains from semantic signals are substantial enough to justify the heavier tail latency. Even so, the reported end-to-end latencies remain low for a single-stage setup at this scale.
Key Takeaway: On NQ, WMTR + Dense is the most effective configuration, and the results make a strong case for hybrid retrieval on large passage-search workloads.
PC Parts Dataset
To simulate a real-world use case, we prepared a dataset using a list of more than 6,600 video cards from the PC Part Dataset repository.
The Corpus
As you can see below, the data is a list of video card names that are largely brand/model names and cryptic identifiers.
```json
{"_id":"2123","name":"Asus TUF GAMING","text":"","metadata":{}}
{"_id":"2124","name":"Asus EAH6450 SILENT/DI/512MD3(LP)","text":"","metadata":{}}
{"_id":"2125","name":"Gigabyte GV-N660OC-2GD","text":"","metadata":{}}
{"_id":"2126","name":"Gigabyte WINDFORCE","text":"","metadata":{}}
```
This particular dataset has no descriptions at all, which is atypical even for real-world datasets, and the data lacks the linguistic richness that most tokenizers (model-based or not) are optimized for. Still, it is representative of data found in many domains (ERP, e-commerce, etc.), where you may have thousands of records with part numbers like Pin 12LP'-x03/5-XL and short or non-existent descriptions. Users searching these datasets often know the exact model or part number they want, so they search for unique fragments of identifiers, special characters and all, to find it quickly. Full-text tokenizers struggle with this: they typically strip out special characters, drop single digits, and so on. A full-text tokenizer may represent the part number above as pin 12lp x03 xl, dropping some or all of the differentiators that matter in this context and dramatically reducing the recall and precision of searches.
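The effect is easy to reproduce. The tokenizer below is a rough approximation of a typical full-text analyzer (lowercase, split on non-alphanumerics, drop single-character tokens), not the exact behavior of any specific engine, alongside a character-level view of the kind WMTR uses.

```python
import re

def naive_fulltext_tokens(text):
    # Rough approximation of a common full-text analyzer: lowercase,
    # split on non-alphanumerics, drop single-character tokens.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if len(t) > 1]

def char_trigrams(text):
    # A character-level view keeps the local structure that
    # punctuation-stripping destroys, so "x03/5" leaves recoverable traces.
    s = text.lower()
    return [s[i : i + 3] for i in range(len(s) - 2)]

part = "Pin 12LP'-x03/5-XL"
print(naive_fulltext_tokens(part))  # ['pin', '12lp', 'x03', 'xl']
print(char_trigrams(part)[:4])      # ['pin', 'in ', 'n 1', ' 12']
```

The full-text view loses the apostrophe, the slash, and the lone 5, exactly the differentiators a user may type; the trigram view still contains features like "/5-" that can match them.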
The Queries
We've created a short list of queries (only 15) that may be representative of the type of searches users would typically perform on this dataset.
geinword phoenix-s
512MD3(LP)
Asus EAH4670
1GD3/V2
900352
XFX RS XXX
MSI VENTUS 3X 8GD6X
Biostar VA7906XM00
HIS Mini IceQ X2
ATI FirePro W5000
AERO ITX 4G
16G-P
galaxy 10th
giga d6 2.0
odesey
There are some misspellings, special characters, pure numbers, etc. Most of the queries are designed to return a single result, but a few have 2-4 matches. In retrospect, we should have made it even harder for the tokenizers by using more special characters and partial strings, but this is a good start.
The PC Parts dataset targets product search; documents use the name field only, so the collection follows the reduced three-vector layout in Vector Configurations (no content field).
Results
| | FT | WMTR | Dense | WMTR + Dense | WMTR + Dense (T) |
|---|---|---|---|---|---|
| nDCG@1¹ | 0.8750 | 0.9375 | 0.5000 | 0.8125 | 0.9375 |
| nDCG@10¹ | 0.8216 | 0.9325 | 0.5494 | 0.8609 | 0.9325 |
| Recall@10² | 0.8125 | 0.9375 | 0.6406 | 0.9375 | 0.9375 |
| P50 (ms)³ | 12 | 19 | 24 | 29 | 18 |
| P95 (ms)⁴ | 18 | 21 | 40 | 35 | 22 |
| Weight (KW name) | 1.00 | 1.00 | | 1.00 | 1.75 |
| Weight (D name) | | | 1.00 | 1.00 | 0.00 |
Notes
- (T) marks configurations whose vector weights were tuned.
- KW: keyword vector weight
- D: dense vector weight
Analysis
PC Parts points in a different direction from the BEIR text corpora. Here the strongest result comes from WMTR, and the tuned WMTR + Dense configuration reaches the same score only because tuning reduces the dense weight to 0.00. That makes the outcome easy to interpret: on short, identifier-heavy names, lexical structure is the primary signal, while dense embeddings add little value in this setup. The dataset is small and intentionally specialized, so it should be read as a targeted stress test rather than a broad retrieval benchmark.
Key Takeaway: On PC Parts, WMTR is the right retrieval strategy for this style of data, and tuning reinforces that by effectively removing dense from the best-performing configuration.
Conclusion
Relevance Performance Summary
Taken together, these benchmarks show a consistent picture of how Amgix behaves across very different retrieval workloads. On the language-rich datasets in this report, the strongest nDCG@10 scores usually come from hybrid retrieval, especially WMTR + Dense (T). That pattern appears in SciFact, TREC-COVID, Quora, and NQ, showing that Amgix can combine lexical and dense signals effectively in a single-stage retrieval setup.
The results also show that WMTR is a strong retrieval method in its own right, not just as a helper for dense search. On Quora, plain WMTR beats BM25 while plain FT does not. On PC Parts, WMTR is the strongest configuration overall, and the tuned "WMTR + Dense" run reaches the same score only because tuning drives the dense weight to 0.00, effectively turning dense off. That is an important result: on identifier-heavy corpora, lexical structure can matter far more than semantic embeddings, and WMTR captures that well.
Latency Performance Summary
The latency story here is less about tiny per-configuration differences and more about how the system behaves across the tested workloads, including much larger corpora. This report moves from SciFact with 5,183 documents to NQ with roughly 2.6 million documents, yet the measured end-to-end P50 and P95 latencies remain low across those benchmark runs. Given that Amgix is generating query representations inside the system, searching multiple vectors, fusing the results, and returning responses through a distributed API/Encoder architecture, that is a notably strong latency result.
Within each dataset, the expected trade-off still appears: keyword-only configurations such as FT and WMTR usually have the lightest tail latency, while dense and hybrid search tend to increase P95. In return, however, hybrids often deliver materially better ranking quality on the language-heavy datasets while still staying fast in end-to-end terms. The useful question is therefore not simply "which method is fastest?" but rather "which method gives the best relevance while keeping end-to-end latency low for this dataset?"
Weight Tuning Impact
Weight tuning has a large impact on final quality. In multiple datasets, tuned variants outperform their untuned counterparts by a visible margin, showing that the relative contribution of lexical and dense signals should not be treated as fixed. Even when the same underlying vectors are available, changing the weights materially changes ranking behavior.
The most revealing case is PC Parts, where tuning sets the dense weight to zero. That shows tuning is not just polishing an already-correct hybrid recipe; it can decide that one signal should be removed entirely when it hurts ranking quality. In other words, tuning is not merely a final optimization step. It is how the system adapts retrieval strategy to the structure of the corpus, and in this report it helps show both where WMTR + Dense is best and where WMTR alone should dominate.
Practical Recommendations
For general-purpose search over natural-language content, start with WMTR + Dense, and tune it if quality matters. That combination is the most consistently strong across the BEIR datasets in this report, and it is the clearest default recommendation to come out of these benchmarks. If you need a lower-latency fallback, WMTR or FT can still provide good results, but they usually leave relevance on the table for more semantic tasks.
For identifier-heavy product, catalog, or ERP-style datasets, start with WMTR and treat dense retrieval as optional rather than mandatory. The PC Parts benchmark suggests that when queries depend on fragments, part numbers, abbreviations, and special-character patterns, lexical precision matters more than semantic similarity. More broadly, these benchmarks should be read as a practical guide to how Amgix behaves in single-stage retrieval, and how WMTR works both on its own and as the lexical component of strong hybrid search.
1. nDCG@k (normalized Discounted Cumulative Gain at k): A metric that measures the quality of ranking results. It considers both the relevance of retrieved documents and their position in the result list. A perfect ranking would have an nDCG of 1.0. Higher values indicate better search effectiveness.
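For the curious, the metric can be computed in a few lines. This uses the common linear-gain formulation; graded-relevance variants (e.g. exponential gain) differ slightly.

```python
import math

def dcg(relevances):
    # Each relevance is discounted by log2 of its 1-based rank plus one.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k):
    # Normalize by the DCG of the ideal (best possible) ordering.
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg else 0.0

# A relevant doc at rank 1, a miss at rank 2, a relevant doc at rank 3:
score = ndcg_at_k([1, 0, 1], k=3)   # ≈ 0.9197
```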
2. Recall@k: The proportion of relevant documents that are retrieved among the top k results. A Recall@10 of 1.0 means all relevant documents were found within the top 10 results. Higher values indicate more complete retrieval.
3. P50 (50th percentile latency): The median query latency, meaning 50% of queries complete faster than this time and 50% take longer. This represents the typical user experience.
4. P95 (95th percentile latency): The latency threshold under which 95% of queries complete. This metric captures the experience of the slowest 5% of queries, helping identify worst-case performance scenarios.
5. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. Thakur et al., 2021. https://arxiv.org/abs/2104.08663