jasonjmcghee 7 hours ago [-]

Nice project! A regular on HN and the creator of USearch built an embedding search for the same dataset and did a write-up, which is a great read.

https://ashvardanian.com/posts/usearch-molecules/

mireklzicar 6 hours ago [-]

Thanks — I read Ash’s post (great blog!) and even spun up USEARCH when I first explored this space.

Main differences:

* *Cost-efficiency:* USEARCH / FAISS / HNSW keep most of the index in RAM; at the billion scale that often means hundreds of GB. In CHEESE both build and search stream from disk. For the 5.5 B-compound Enamine set the footprint is ~1.7 TB of NVMe plus ~4 GB of RAM (only the centroids), so it can run on a laptop and still scale to tens of billions of vectors. This is also a huge difference from commercial vector DB providers (Pinecone, Milvus...), who would bill you many thousands of USD per month for it because of the RAM-heavy instances.

* *Vector type:* The USEARCH demo uses binary fingerprints with Tanimoto distance. I use 256-D float embeddings trained to approximate 3-D shape and electrostatic overlap, searched with Euclidean distance.

* *Latency vs. accuracy:* BigANN-style work optimises for QPS and millisecond latency. Chemists usually submit queries one-by-one, so they don’t mind 1–6 s if the top hits are chemically meaningful. I pull entire clusters from disk and scan them exactly to keep recall high.
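
The last point (centroids in RAM, whole clusters pulled from disk and scanned exactly) can be sketched roughly as below. This is a toy illustration under stated assumptions, not CHEESE's actual code: the function names `build_index` and `search`, the one-`.npy`-file-per-cluster layout, the `n_probe` parameter, and the random-centroid stand-in for real clustering are all hypothetical, and the toy build runs in memory rather than streaming from disk.

```python
import tempfile
from pathlib import Path

import numpy as np


def build_index(vectors, n_clusters, out_dir, seed=0):
    """Assign each vector to its nearest centroid; write one file pair per cluster."""
    rng = np.random.default_rng(seed)
    # Assumption: random vectors stand in for properly trained (e.g. k-means) centroids.
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    # Squared Euclidean distance from every vector to every centroid.
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    for c in range(n_clusters):
        ids = np.flatnonzero(assign == c)
        np.save(out_dir / f"cluster_{c}_ids.npy", ids)
        np.save(out_dir / f"cluster_{c}_vecs.npy", vectors[ids])
    return centroids  # the only structure that has to stay in RAM


def search(query, centroids, index_dir, k=5, n_probe=2):
    """Load the n_probe closest clusters from disk and scan them exactly."""
    d2_centroids = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(d2_centroids)[:n_probe]
    ids, d2 = [], []
    for c in probe:
        cluster_vecs = np.load(index_dir / f"cluster_{c}_vecs.npy")
        ids.append(np.load(index_dir / f"cluster_{c}_ids.npy"))
        # Exact distances to every vector in the cluster, no approximation.
        d2.append(((cluster_vecs - query) ** 2).sum(axis=1))
    ids = np.concatenate(ids)
    d2 = np.concatenate(d2)
    order = np.argsort(d2)[:k]
    return ids[order], np.sqrt(d2[order])


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    data = rng.standard_normal((2000, 32)).astype(np.float32)
    with tempfile.TemporaryDirectory() as tmp:
        centroids = build_index(data, n_clusters=16, out_dir=Path(tmp))
        hit_ids, hit_dists = search(data[0], centroids, Path(tmp), k=5)
        print(hit_ids, hit_dists)
```

Raising `n_probe` reads more clusters per query, which costs more disk I/O and time but makes it less likely that a true neighbour sitting in an adjacent cluster is missed.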

So the trade-off is queries that are a few seconds slower, in exchange for far cheaper hardware and results optimised for accuracy.
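
As a rough sanity check on the hardware numbers, here is a back-of-the-envelope calculation using only the figures quoted above (5.5 B vectors, 256 dimensions, ~1.7 TB on disk). The per-value widths are assumptions for comparison only; the comment does not say how CHEESE stores vectors on disk, or whether the 1.7 TB includes anything besides vectors.

```python
N_VECTORS = 5_500_000_000  # quoted: 5.5 B compounds
DIM = 256                  # quoted: 256-D embeddings
DISK_BYTES = 1.7e12        # quoted: ~1.7 TB NVMe


def raw_size_bytes(n_vectors, dim, bytes_per_value):
    """Size of a dense, uncompressed vector table."""
    return n_vectors * dim * bytes_per_value


# Hypothetical storage widths, listed only to show the scale.
for name, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    tb = raw_size_bytes(N_VECTORS, DIM, width) / 1e12
    print(f"{name}: {tb:.2f} TB")

# If the whole quoted footprint were vector data, this is the budget per dimension.
bytes_per_dim = DISK_BYTES / (N_VECTORS * DIM)
print(f"implied bytes per dimension: {bytes_per_dim:.2f}")
```

Uncompressed float32 vectors alone would be about 5.6 TB, which is why holding an index of this size in RAM requires large, expensive instances, while a couple of TB of NVMe is ordinary workstation hardware.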