WMTR: Lexical Retrieval Across Diverse Data Types
Standard tokenizers were built for natural language. They split on whitespace, normalize punctuation, apply stemming, and produce clean word tokens. For natural language text this works well. For data that looks like this:
Asus EAH6450 SILENT/DI/512MD3(LP)
Gigabyte GV-N660OC-2GD
they still work, but they leave signal on the table: special characters get stripped, single digits are dropped, and token structure is partially lost. Search on this kind of data is functional, but measurably weaker than it could be.
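To make the loss concrete, here is a minimal stand-in for a standard word tokenizer. It is a deliberately simplified sketch (lowercase, split on non-alphanumerics, drop one-character tokens), not the behavior of any specific analyzer:

```python
import re

def naive_analyze(text):
    """Simplified stand-in for a standard word tokenizer:
    lowercase, split on non-alphanumerics, drop short tokens."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) > 1]

print(naive_analyze("Asus EAH6450 SILENT/DI/512MD3(LP)"))
# → ['asus', 'eah6450', 'silent', 'di', '512md3', 'lp']
```

The word fragments survive, but the slash and parenthesis structure is gone: `di` and `lp` become free-floating two-letter tokens with no link to the identifier they qualify, and nothing records that `SILENT/DI/512MD3(LP)` was a single unit.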
This is not an exotic edge case. Product catalogs, ERP systems, manufacturing databases, parts inventories — a significant share of real-world search runs on data that looks exactly like this.
WMTR (Weighted Multilevel Token Representation) is our attempt to serve both kinds of data well: clean natural language and messy identifiers.
What WMTR Is
Weighted Multilevel Token Representation (WMTR) is Amgix's tokenizer for sparse vector indexing. It builds multiple lexical views of the same text simultaneously:

- a surface-form view that stays close to the original tokens,
- a language-aware normalized view with Unicode word boundaries, stopword filtering, and stemming, and
- a character-level view that captures local patterns within tokens.

These views are weighted and fused into a single sparse representation, which is then scored with TF-IDF: TF is computed at index time with running corpus statistics, and IDF is applied at query time against the current corpus state.
It requires no training, no model, and no GPU. It is fast, deterministic, and works the same way regardless of domain.
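The view-and-fuse idea can be sketched in a few lines. Everything below is an illustrative assumption, not Amgix's implementation: the view weights, the tiny stopword list, the trigram size, and the absence of a real stemmer are all placeholders chosen for brevity.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "for"}  # tiny illustrative list, not the real one

def char_ngrams(token, n=3):
    """Character trigrams of a token; short tokens pass through whole."""
    return [token[i:i + n] for i in range(len(token) - n + 1)] or [token]

def wmtr_sketch(text, weights=(1.0, 0.7, 0.4)):
    """Hypothetical multilevel sketch: build three lexical views of the
    same text, then fuse them into one weighted sparse vector.
    The real WMTR views, normalization, and weights differ."""
    surface = text.split()                              # surface-form view
    normalized = [t for t in re.split(r"\W+", text.lower())
                  if t and t not in STOPWORDS]          # normalized view (no stemming here)
    chars = [g for t in normalized for g in char_ngrams(t)]  # character-level view

    sparse = Counter()
    for weight, view in zip(weights, (surface, normalized, chars)):
        for token in view:
            sparse[token] += weight
    return dict(sparse)

vec = wmtr_sketch("Gigabyte GV-N660OC-2GD")
# the exact surface token, its normalized pieces, and character
# trigrams all coexist in one sparse vector at different weights
```

The point of the fusion is that a query can match at whichever level survives: an exact identifier hits the surface view, a misspelling or partial identifier still overlaps in the character view.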
Identifier-Heavy Data
We tested WMTR on a corpus of 6,600 video card records — short identifier-heavy names with no descriptions — using 15 queries that include partial identifiers, special characters, misspellings, and pure numbers. There is no BM25 baseline for this custom dataset; Amgix full-text tokenization (FT), which applies similar tokenization principles to BM25, is included as the closest available reference.
| Dataset | WMTR | FT | Δ |
|---|---|---|---|
| PC Parts | 0.9325 | 0.8216 | +0.111 |
The gap is over 11 points of nDCG@10. This dataset is small and has not been validated outside our own testing, so it should be read as an early indicator rather than a definitive result. For full details on the corpus and queries, see the PC Parts section of the Amgix Benchmarks page.
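All scores in this post are nDCG@10. For readers unfamiliar with the metric, here is a minimal implementation using the linear-gain formulation that trec_eval's `ndcg_cut` uses; the actual benchmark numbers come from full evaluation tooling that also handles qrels, unjudged documents, and per-query averaging.

```python
import math

def ndcg_at_10(relevances):
    """nDCG@10 for one query.
    `relevances` lists the gold relevance grades of the returned
    documents in ranked order; the ideal ordering is the same grades
    sorted descending."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# ranking the relevant documents first yields 1.0; burying them lowers the score
print(ndcg_at_10([3, 2, 1]))  # ideal ordering
print(ndcg_at_10([0, 0, 3]))  # best document ranked third
```

Because the discount is logarithmic in rank, moving a relevant document from position 3 to position 1 matters far more than moving it from 10 to 8, which is why small nDCG@10 deltas still reflect visible ranking changes near the top.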
Standard BEIR Datasets
The more important question is whether WMTR pays a cost on natural language text to get those identifier gains. The following results compare flat single-field WMTR against BM25 Flat baselines from Pyserini's BEIR regression results across five standard BEIR datasets. All results are nDCG@10, single-stage retrieval, no reranking. Tests were run on Amgix v1.0.0-beta4.4; for full hardware and deployment details see the Amgix Benchmarks page.
| Dataset | WMTR (flat) | BM25 Flat | Δ |
|---|---|---|---|
| SciFact | 0.6837 | 0.679 | +0.005 |
| TREC-COVID | 0.6383 | 0.595 | +0.043 |
| ArguAna | 0.5062 | 0.397 | +0.109 |
| Quora | 0.7941 | 0.789 | +0.005 |
| NQ | 0.3151 | 0.305 | +0.010 |
WMTR beats BM25 Flat on all five datasets. The gains on SciFact, Quora, and NQ are small. TREC-COVID and ArguAna show larger improvements, for different reasons: TREC-COVID benefits from WMTR's character-level view on dense technical terminology, while ArguAna benefits from WMTR's token ceiling reducing overlap noise between arguments and counter-arguments. Testing with varying top_k values confirms that a lower top_k helps on this dataset: nDCG@10 drops as top_k increases (80: 0.5258, 128: 0.5062, 512: 0.4568), and Recall@10 follows the same pattern (80: 0.8108, 128: 0.7838, 512: 0.6992). That is the opposite of what you would expect if top_k were limiting recall rather than filtering noise.
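The token ceiling can be pictured as keeping only the highest-weight terms of each document's sparse vector. A hypothetical sketch, assuming weight-ranked truncation (the real Amgix parameter name and tie-breaking behavior may differ):

```python
def apply_token_ceiling(sparse_vec, top_k):
    """Keep only the top_k highest-weight terms of a sparse vector.
    Low-weight terms -- the ones most likely to produce incidental
    overlap between near-duplicate documents -- are dropped first."""
    kept = sorted(sparse_vec.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(kept)

doc = {"argument": 3.0, "counter": 2.5, "the": 0.2, "a": 0.1}
print(apply_token_ceiling(doc, 2))  # only the two strongest terms survive
```

Under this picture, a smaller top_k discards exactly the weak shared terms through which an argument and its counter-argument spuriously match, which is consistent with the ArguAna numbers above improving as top_k shrinks.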
For completeness, the following table shows untuned split-field WMTR against BM25 Multifield, a fairer comparison when documents have separate title and text fields. Here WMTR trails BM25 on every dataset shown with default settings.
| Dataset | WMTR (split, untuned) | BM25 MF | Δ |
|---|---|---|---|
| SciFact | 0.6370 | 0.665 | -0.028 |
| TREC-COVID | 0.6225 | 0.656 | -0.034 |
| ArguAna | 0.3394 | 0.414 | -0.075 |
| NQ | 0.2579 | 0.329 | -0.050 |
Tuned configurations narrow these gaps: see the Amgix Benchmarks page for full results.
Note: ArguAna shows a larger drop in split-field configuration than other datasets. ArguAna queries appear verbatim in the corpus — the argument text used as a query is also indexed as a document. Indexing title and text separately in split-field configuration doubles this overlap signal, amplifying the noise problem. The flat configuration, which merges fields, reduces this duplication.
What This Means
WMTR is not a replacement for BM25 on natural language text. On standard benchmarks it is competitive with BM25 Flat and trails BM25 Multifield with default settings. What it offers is a single tokenizer that handles both natural language and identifier-heavy data without configuration changes. You don't need to know in advance what kind of corpus you're dealing with. On the data where standard tokenizers may struggle, the difference is not small.