
Why We Built Amalgam Index (Amgix)

"Can't we just add a 'Global Search' to our product?"

Amgix was born to address the real challenges of building search for messy, real-world data. Sure, there were "wouldn't it be cool" moments - but the idea was simple: intelligent search for your data shouldn't be hard.

The Data Problem

If you've worked on a software project, then at some point users, the sales team, management, or some other stakeholder has asked you this question. We've heard it for years: "Why can't we have just a single search box that searches everything in our database and returns relevant results? Like Google, but for our data." "Sure," we say, "but have you seen our data?"

Thousands of records full of cryptic identifiers, special characters, typos, and repetitive words, with little or no descriptive text or lexical richness.

These records may mean nothing to you and me, but they are very meaningful to the users. Users often know approximately what they are looking for, and they type short, partial queries to find the records they need: 010, LP10, insert, 1/4, etc.

The developers are, sooner or later, dispatched to find a solution.

Note: This reflects our experience building search for identifier-heavy ERP-style data. Your setup and mileage may vary.

Potential Solution 1: Full-Text Search (FTS)

Whether you are using Microsoft SQL Server, PostgreSQL, MariaDB, or another database, chances are it has some form of full-text search. With some work we can:

  • Add full-text indexes to all the relevant tables and fields.
  • Rewrite all the search queries to use those indexes.
  • etc.
Here is a sample of the data we are searching:

Part Number
.250 X 1.750
.250 X 2.000 X 6.000
.250 X 2.500 X 7.000
010
021
06132OB
08948001AA
1-8 X 5
1/16-27 NPT
1/2-13 X 1
LOGO INSERT
LIFT TAB INSERT
CAVITY BLOCK INSERT

So we try it. And what do we find?

query: inser (the user forgot to type the t)
results: no records found

query: 1/2
results: no records found

query: 1
results: no records found

Partial matches do not work.
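This failure is easy to reproduce. Below is a minimal sketch using SQLite's FTS5 as a stand-in for whichever FTS engine your database provides (exact behavior varies by engine and tokenizer, but the partial-match problem is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table; the default unicode61 tokenizer splits on
# non-alphanumeric characters, much like other FTS engines.
conn.execute("CREATE VIRTUAL TABLE parts USING fts5(part_number)")
conn.executemany(
    "INSERT INTO parts VALUES (?)",
    [("LOGO INSERT",), ("LIFT TAB INSERT",),
     ("CAVITY BLOCK INSERT",), ("1/16-27 NPT",)],
)

def search(q):
    return [r[0] for r in conn.execute(
        "SELECT part_number FROM parts WHERE parts MATCH ?", (q,))]

print(search("insert"))  # the three INSERT records match
print(search("inser"))   # a partial token matches nothing
```

The index stores whole tokens, so inser matches no token and returns nothing; the typo simply falls off the edge of the index.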

Fine, we say, we'll add LIKE operator searches with wildcards. Boom!

query: 1/16 npt

results:

Part Number Score
3/8-18 NPT 4.792638301849365
1/8-27 NPT 4.792638301849365
... 4.792638301849365
1/16-27 NPT 4.792638301849365

Excellent! We found records with NPT in them, and 1/16-27 NPT is there. But wait: all the scores are the same, and our result is last. The LIKE operator doesn't produce a score and can't add anything to the relevance of the results.

FTS implementations differ slightly between databases, so your mileage may vary, but they all have similar problems. They do not handle this sort of data and these sorts of queries well: partial matches, misspelled words, special characters, etc. And the LIKE operator, while it can be helpful, does nothing for relevance.
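The lack of ranking is inherent to LIKE: it is a boolean filter, not a scorer. A minimal sketch in SQLite (any SQL database behaves the same way; the table and rows are just the examples above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_number TEXT)")
conn.executemany(
    "INSERT INTO parts VALUES (?)",
    [("3/8-18 NPT",), ("1/8-27 NPT",), ("1/16-27 NPT",)],
)

# Every row containing NPT "matches"; LIKE gives us no per-row score,
# so there is nothing to sort by that would float 1/16-27 NPT to the top.
rows = [r[0] for r in conn.execute(
    "SELECT part_number FROM parts WHERE part_number LIKE ?",
    ("%NPT%",))]
print(rows)
```

All three rows come back as equally good "hits", which is exactly the flat-score result table above.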

In a nutshell, FTS strips special characters, drops stopwords (a, the, at, etc.), and in some implementations drops single digits; some implementations (like PostgreSQL's) also apply word stemming based on the language you specify; and it ranks purely on the number of words that matched.
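To make that concrete, here is a rough Python approximation of what that normalization does to one of our part numbers. This is not any engine's actual tokenizer (real ones differ in stemming, stopword lists, and single-character handling); it just mirrors the behavior described above:

```python
import re

STOPWORDS = {"a", "an", "the", "at", "of"}  # tiny illustrative subset

def fts_normalize(text):
    # 1. Special characters are treated as separators and discarded.
    words = re.split(r"[^0-9a-z]+", text.lower())
    # 2. Stopwords and single-character tokens are dropped
    #    (mirroring the single-digit behavior described above).
    return [w for w in words if len(w) > 1 and w not in STOPWORDS]

print(fts_normalize("1/16-27 NPT"))  # ['16', '27', 'npt'] - the 1 is gone
print(fts_normalize("1/2-13 X 1"))   # ['13'] - almost nothing survives
```

By the time the index sees 1/2-13 X 1, most of what the user would actually type has been thrown away.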

Let's Add Trigrams

If you are lucky and your database has trigram capabilities (PostgreSQL does), you can change all of your queries again and add trigram search on top of your FTS-and-LIKE solution.

query: 1/4

results:

Part Number Score
1/4 X .75 0.5052...
4 0.5
1/4-20 X 1 0.4444...
... ...
1/4-20 X 1.25 0.3636...

Kinda works, but the scores reflect the length of the text rather than the relevance the user cares about. And why is 4 there at all?
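That score behavior falls straight out of how trigram similarity is computed. Here is a simplified pg_trgm-style sketch (each word gets two leading spaces and one trailing space of padding, and similarity is shared trigrams over total distinct trigrams); PostgreSQL's actual functions differ in details, but the shape of the problem is the same:

```python
import re

def trigrams(text):
    grams = set()
    for word in re.split(r"[^0-9a-z]+", text.lower()):
        if word:
            padded = "  " + word + " "  # pg_trgm-style padding
            grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(round(similarity("1/4", "4"), 4))              # 0.5
print(round(similarity("1/4", "1/4-20 X 1"), 4))     # 0.4444
print(round(similarity("1/4", "1/4-20 X 1.25"), 4))  # 0.3636
```

The "/" is discarded, so 1/4 becomes the words 1 and 4, and the record 4 shares half of the resulting trigrams. The denominator grows with text length, so short records score high and longer (often more relevant) records sink.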

Regardless of the details, you see where we are going with this. Just not a great search experience.

Potential Solution 2: Elasticsearch

"All the big boys are using Elasticsearch"

Elasticsearch does a better job, with multiple tools available to handle our data: fuzzy matching, wildcards, edge_grams, ngrams, etc.

But those, we found, also have issues:

  • Wildcard searches do not do scoring (every result has a score of 1), which is no better than doing LIKE in SQL.
  • Fuzzy matching has gaps: when searching for 10, it finds records where 10 is a standalone token (as in 10-24), but not records where 10 is part of LP100L9.75. A search for LP10 returns nothing for some reason.
  • edge_grams and ngrams are only marginally better than trigrams in PostgreSQL.
  • And if you decide to use it, you have to learn a whole new query DSL, with all the complexity that comes with it. You need ES experts.
  • AND you have to have another database to store your search data.
  • AND you have to think about all the ingestion pipelines, deduplication, etc.
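For a taste of that DSL, here is roughly what a query combining fuzzy and wildcard matching looks like, sketched as the JSON body you would send to Elasticsearch's _search endpoint (the part_number field name is ours, not from any official example):

```python
import json

# Sketch of an Elasticsearch _search request body combining a fuzzy
# match clause with a wildcard clause; per the point above, the
# wildcard clause contributes only a constant score.
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"part_number": {"query": "1/16 npt",
                                           "fuzziness": "AUTO"}}},
                {"wildcard": {"part_number": {"value": "*1/16*"}}},
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```

Every search box in your product now needs queries like this built, tuned, and maintained.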

So yes, "big boys" do it. And it kinda works, but why is it so hard and messy?

The Infrastructure Problem

Whether you decide to use Elasticsearch, FTS or something else, the complexity of adding semantic hybrid search capability to your application is significant:

  • Semantic Search: You have to build code and pipelines to embed your data, find a place to store those embeddings, and write algorithms to fuse semantic scores with keyword scores.
  • Data Ingestion: Keeping your data in sync with your search index requires you to build complex ingestion pipelines to handle queuing, deduplication, retries, and failure handling.
  • Scaling: If you build your own embedding pipelines, how do you scale them? You have to figure out how to load-balance ML models across GPU/CPU resources without wasting money.
  • Privacy: Many "easy" solutions require sending your private data to cloud APIs, which may violate compliance or security guidelines.
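As one example of the fusion work in the first bullet: a common technique is reciprocal rank fusion (RRF), which combines ranked lists without requiring the underlying scores to be comparable. A minimal sketch (the function and sample lists are ours; RRF itself is a published technique):

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["1/16-27 NPT", "1/8-27 NPT", "3/8-18 NPT"]  # e.g. FTS results
semantic = ["1/16-27 NPT", "LOGO INSERT"]              # e.g. vector results
fused = rrf_fuse([keyword, semantic])
print(fused)
```

Records that rank well in both lists rise to the top; and this is only the last, easiest step after embedding, storage, and retrieval are already built.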

We Built Amgix to Solve This

Clearly, adding a "global, Google-like" search for messy enterprise data is a complex engineering problem. Amgix was built to reduce that complexity and remove the operational overhead.

It gives you a single, unified stack to search your data your way, so you can focus on your product instead of building and maintaining search infrastructure.

👉 Read our list of Cool Features to see how we solved these infrastructure problems.