This tutorial was written by Arek Borucki.
Search is one of the most critical components of applications that work with text data, yet traditional keyword-based search has a major limitation: it only works when users type the exact words stored in the database, usually in the same language.
Imagine an online fashion store with products described in English. A customer searches in French for "chaussures de course confortables pour marathon" (comfortable marathon running shoes) and then in Polish for "wygodne buty do biegania na maraton". Even though both queries describe the same concept, a keyword-based search will return no results because the wording and language are different from the English product descriptions.
Semantic search addresses this limitation by focusing on meaning rather than exact words or language. Instead of comparing text strings, it maps text to numerical representations that capture semantic intent, allowing it to recognize that all of these queries refer to the same concept.
In this blog post, you will build a semantic search engine using Hugging Face Transformers, a multilingual language model, a real dataset from Hugging Face, and MongoDB Atlas Vector Search.
Understand Semantic Search Mechanisms
Before moving to the implementation, it is important to understand how the individual components work together.
Semantic search relies on three core elements:
- A language model that converts text into numerical representations called embeddings.
- A vector database capable of storing and searching these representations.
- An application layer that connects everything into a working system.
Explore the Hugging Face Platform
Training language models from scratch requires enormous computational resources and training data. A single model can cost millions of dollars in infrastructure. Hugging Face solves this problem by providing a repository of ready-to-use models and datasets.
Often described as the “GitHub for AI”, Hugging Face is an open platform and community for machine learning, focused on natural language processing and multimodal AI. It hosts thousands of pre-trained models and public datasets, along with developer tools such as the Transformers library and hosted inference APIs, enabling teams to build AI-powered applications without training models from scratch. Companies like Google, Meta, Microsoft, and MongoDB publish models there that anyone can download and use.
In this tutorial, you will download two resources from Hugging Face: a multilingual language model that enables cross-language semantic search and a fashion product dataset that will be used as the searchable data source.
Understand the Transformers Architecture
Hugging Face Transformers is an open-source Python library that provides a unified API for working with Transformer-based models. The library handles model loading, tokenization, and inference, while pre-trained model checkpoints are hosted on the Hugging Face Hub and automatically downloaded when used.
Transformers 5 introduces a more modular architecture with a strong focus on interoperability, and makes quantization a first-class feature, enabling smaller models and faster inference.
This tutorial uses sentence-transformers, a library built on top of Hugging Face Transformers that specializes in generating sentence and text embeddings for semantic similarity tasks.
Discover Vector Search
Vector search is a database search technique that powers semantic search. MongoDB Atlas provides this capability through Atlas Vector Search, where instead of matching keywords, the database compares numerical vectors.
This approach works by storing embeddings generated by Hugging Face language models directly in the database. Using Hugging Face Transformers, text is converted into embeddings, which are numerical representations of semantic meaning.
These embeddings are stored in MongoDB Atlas Vector Search and indexed for efficient similarity search. Each Hugging Face model produces embeddings of a fixed size. For example, the paraphrase-multilingual-MiniLM-L12-v2 model generates vectors with 384 dimensions, and MongoDB Atlas Vector Search must be configured with the same dimensionality.
When a user submits a query, Hugging Face generates an embedding for the query text. MongoDB Atlas Vector Search compares this vector with stored embeddings and returns the most semantically similar documents using Approximate Nearest Neighbor algorithms.
In short, Hugging Face generates embeddings, and MongoDB Atlas Vector Search stores and searches them. Together, they enable fast, meaning-based search.
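To make this concrete, here is a minimal sketch (assuming the sentence-transformers package is installed) that embeds an English product description and the French query from the introduction with the same multilingual model used later in this tutorial, then compares them with cosine similarity:
from sentence_transformers import SentenceTransformer, util

# Multilingual model used throughout this tutorial (384-dimensional vectors)
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# An English product description and a French query with the same meaning
doc_vector = model.encode("comfortable marathon running shoes")
query_vector = model.encode("chaussures de course confortables pour marathon")

# Cosine similarity is high despite the two texts sharing no words or language
print(util.cos_sim(query_vector, doc_vector).item())
This is the same comparison MongoDB Atlas Vector Search performs at scale, but over many stored vectors with an approximate index instead of a single pairwise computation.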
Configure the Development Environment
The development environment requires several components: a Python interpreter with machine learning libraries, a local MongoDB Atlas deployment with Vector Search enabled, and a web framework to expose the API. Each of these components can be run locally.
Create the Project Structure
The project structure separates source code from configuration to keep the application organized and secure. Configuration files such as .env contain sensitive information, including connection strings, and should never be committed to the repository. The src directory isolates the application logic, making the project easier to test, maintain, and containerize later. Create the project directories and configuration files as shown below:
mkdir semantic-search-demo
cd semantic-search-demo
mkdir src
touch requirements.txt .env
Install Dependencies
Python does not isolate dependencies between projects by default, which can lead to version conflicts. To avoid this, project dependencies and their exact versions are defined in a requirements.txt file.
This project uses:
- sentence-transformers to generate embeddings with Hugging Face models
- The datasets library to download datasets from Hugging Face
- PyMongo to communicate with MongoDB
- FastAPI and Uvicorn to serve the API
- PyTorch (torch) as the underlying machine learning framework
Create the requirements.txt file using the following command:
cat <<EOF > requirements.txt
fastapi>=0.109.0
uvicorn>=0.27.0
pymongo>=4.6.1
sentence-transformers>=2.3.0
python-dotenv>=1.0.0
datasets>=2.16.0
Pillow>=10.0.0
torch
EOF
Now, create a Python virtual environment and install the project dependencies. A virtual environment isolates the project from the system Python and prevents version conflicts, which is standard practice in production environments.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Run MongoDB Atlas with Vector Search
To use MongoDB Atlas Vector Search without deploying a cloud cluster, you can run a local Atlas deployment using the Atlas CLI. This approach provides the same Vector Search and Atlas Search capabilities as the cloud version, but runs locally using Docker, making it ideal for development and learning. The Atlas CLI manages the entire setup and ensures your local environment behaves the same way as Atlas in the cloud, which helps avoid configuration mismatches later. First, install the Atlas CLI if you do not have it already (on macOS, via Homebrew):
brew install mongodb-atlas-cli
Create a local Atlas deployment using the atlas deployments setup command:
atlas deployments setup
? What type of deployment would you like to work with? local
[Default Settings]
Deployment Name local9328
MongoDB Major Version 8 (latest minor version)
? How do you want to set up your local Atlas deployment? default
Creating your cluster local9328
1/3: Starting your local environment...
2/3: Downloading the latest MongoDB image to your local environment...
3/3: Creating your deployment local9328...
Deployment created!
Connection string: "mongodb://localhost:51213/?directConnection=true"
? How would you like to connect to local9328? mongosh
When prompted, choose local and accept the default settings. The Atlas CLI downloads the required MongoDB image and creates a local Atlas cluster. Once the deployment is ready, it displays a connection string similar to:
mongodb://localhost:51213/?directConnection=true
When prompted for a connection method, choosing mongosh automatically opens a shell session connected to the local Atlas cluster, allowing you to verify that the database is running correctly.
AtlasLocalDev local9328 [direct: primary] test> show dbs
admin 224.00 KiB
config 224.00 KiB
local 392.00 KiB
AtlasLocalDev local9328 [direct: primary] test>
After creating the cluster, update your .env file with the connection string from the Atlas CLI (use the port from your own output, which will differ from the example below):
cat <<EOF > .env
MONGO_URI=mongodb://localhost:51213/?directConnection=true
DB_NAME=fashion_db
COLLECTION_NAME=products
# Multilingual embedding model (50+ languages, including French, Polish, and English)
MODEL_NAME=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
EOF
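Optionally, verify that the connection string works before writing any application code. The following is a minimal sketch that loads the .env file and sends a ping command to the local deployment:
python - <<'EOF'
import os
from pymongo import MongoClient
from dotenv import load_dotenv

load_dotenv()
client = MongoClient(os.getenv("MONGO_URI"))
# A successful ping returns {'ok': 1.0}, confirming the deployment is reachable
print(client.admin.command("ping"))
EOF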
Implement System Components
The system is composed of three main modules: a database module, a language model module, and an indexing module. Together, they form the core of the semantic search pipeline.
Create the Database Module
The database module is responsible for managing the MongoDB connection and configuring the vector search index. The vector index enables efficient similarity search by organizing embeddings in a structure optimized for nearest-neighbor retrieval in high-dimensional space. Create the database module:
cat <<'EOF' > src/database.py
import os
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel
from dotenv import load_dotenv
load_dotenv()
class MongoManager:
def __init__(self):
self.client = MongoClient(os.getenv("MONGO_URI"))
self.db = self.client[os.getenv("DB_NAME")]
self.collection = self.db[os.getenv("COLLECTION_NAME")]
def get_collection(self):
return self.collection
def create_vector_index(self):
"""Creates a Vector Search index"""
search_index_model = SearchIndexModel(
definition={
"fields": [
{
"type": "vector",
"path": "embedding",
"numDimensions": 384,
"similarity": "cosine"
}
]
},
name="vector_index",
type="vectorSearch"
)
try:
self.collection.create_search_index(model=search_index_model)
print("Vector index 'vector_index' has been initialized.")
except Exception as e:
print(f"Index status: {e}")
EOF
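One practical note: Atlas builds search indexes asynchronously, so a newly created vector index may take a few seconds before it accepts queries. If you want to block until the index is ready, a sketch along the following lines (the wait_for_index helper is an illustration, not part of the modules in this tutorial) polls the index status via PyMongo's list_search_indexes:
import time

def wait_for_index(collection, name="vector_index", timeout=60):
    """Polls until the named search index reports itself as queryable."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        for index in collection.list_search_indexes():
            if index["name"] == name and index.get("queryable"):
                return True
        time.sleep(2)
    return False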
Create the Language Model Module
The AI module encapsulates the embedding generation logic. The paraphrase-multilingual-MiniLM-L12-v2 model was trained on text from more than 50 languages, allowing it to capture semantic similarity across languages without explicit translation.
On the first run, the sentence-transformers library downloads the model weights from the Hugging Face Hub and stores them in a local cache.
cat <<'EOF' > src/ai.py
import os
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv
load_dotenv()
class AIModel:
def __init__(self):
model_name = os.getenv("MODEL_NAME")
print(f"Loading Transformer model: {model_name}...")
self.model = SentenceTransformer(model_name)
print("Model ready.")
def generate_embedding(self, text: str):
if not text:
return []
return self.model.encode(text).tolist()
EOF
The initial download requires an Internet connection and may take several minutes depending on network speed. Subsequent runs reuse the cached model and start almost instantly.
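You can confirm the model downloads and produces vectors of the expected size with a quick smoke test (a throwaway check that reads MODEL_NAME from your .env file):
python - <<'EOF'
import os
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer

load_dotenv()
model = SentenceTransformer(os.getenv("MODEL_NAME"))
# The model produces 384-dimensional vectors, matching the index definition
print(len(model.encode("comfortable marathon running shoes")))
EOF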
Create the Data Indexing Module
The indexing module downloads a dataset from Hugging Face, generates embeddings for each document, and stores the enriched data in MongoDB.
cat <<'EOF' > src/indexer.py
from datasets import load_dataset
from src.database import MongoManager
from src.ai import AIModel
def run_indexing():
print("=" * 50)
print("Starting indexing process...")
print("=" * 50)
print("\n[1/6] Initializing MongoDB connection...")
db = MongoManager()
print("✓ MongoDB connected")
print("\n[2/6] Loading AI model...")
ai = AIModel()
print("✓ AI model loaded")
print("\n[3/6] Getting MongoDB collection...")
collection = db.get_collection()
print("✓ Collection ready")
print("\n[4/6] Downloading fashion dataset from Hugging Face...")
dataset = load_dataset(
"ashraq/fashion-product-images-small",
split="train"
)
print(f"✓ Dataset downloaded: {len(dataset)} total products")
sample_data = dataset.shuffle(seed=42).select(range(200))
print(f"\n[5/6] Processing {len(sample_data)} products (randomly sampled)...")
collection.delete_many({})
print("✓ Cleared existing data")
docs_to_save = []
for idx, item in enumerate(sample_data):
if idx % 20 == 0:
print(f" Processing item {idx + 1}/{len(sample_data)}...")
product = {
"name": item["productDisplayName"],
"category": item["masterCategory"],
"subCategory": item["subCategory"],
"gender": item["gender"],
"price": 99.99,
}
text_to_embed = (
f"{item['productDisplayName']} "
f"{item['usage']} "
f"{item['baseColour']}"
)
product["embedding"] = ai.generate_embedding(text_to_embed)
docs_to_save.append(product)
if docs_to_save:
print(f"\n Saving {len(docs_to_save)} products to MongoDB...")
collection.insert_many(docs_to_save)
print(f"✓ Saved {len(docs_to_save)} products")
print("\n[6/6] Creating vector search index...")
db.create_vector_index()
print("\n" + "=" * 50)
print("✓ Indexing completed successfully!")
print("=" * 50)
if __name__ == "__main__":
run_indexing()
EOF
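A note on performance: the indexer encodes each product individually, which is simple but slow for larger samples. sentence-transformers can encode an entire list of texts in one call, and a batched variant (sketched below; the generate_embeddings_batch helper is illustrative, not part of the module above) is noticeably faster if you index more of the dataset:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def generate_embeddings_batch(texts, batch_size=32):
    """Encodes all texts in one call; returns a list of 384-dimensional vectors."""
    return model.encode(texts, batch_size=batch_size).tolist()

# Usage: collect the text_to_embed strings for all products first,
# then attach the resulting vectors to the documents in order.
# embeddings = generate_embeddings_batch(texts)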
Create the Search API
The API layer exposes semantic search functionality over HTTP. The /search endpoint accepts a text query, converts it into an embedding using the language model, and performs vector search in MongoDB.
MongoDB executes a $vectorSearch aggregation stage, which finds documents whose embedding vectors are most similar to the query vector. The results are returned together with a similarity score.
cat <<'EOF' > src/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, field_validator
from src.database import MongoManager
from src.ai import AIModel
app = FastAPI(title="Fashion Semantic Search")
db = MongoManager()
ai = AIModel()
collection = db.get_collection()
class SearchRequest(BaseModel):
query: str
limit: int = 5
@field_validator('query')
@classmethod
def query_must_not_be_empty(cls, v):
if not v or not v.strip():
raise ValueError('Query cannot be empty')
return v.strip()
@field_validator('limit')
@classmethod
def limit_must_be_positive(cls, v):
if v < 1:
raise ValueError('Limit must be at least 1')
if v > 100:
raise ValueError('Limit cannot exceed 100')
return v
@app.post("/search")
def search_products(payload: SearchRequest):
query_vector = ai.generate_embedding(payload.query)
if not query_vector:
raise HTTPException(status_code=400, detail="Failed to generate embedding for query")
pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"path": "embedding",
"queryVector": query_vector,
"numCandidates": 100,
"limit": payload.limit
}
},
{
"$project": {
"_id": 0,
"name": 1,
"category": 1,
"subCategory": 1,
"score": {"$meta": "vectorSearchScore"}
}
}
]
try:
results = list(collection.aggregate(pipeline))
return {"query": payload.query, "results": results}
except Exception as e:
raise HTTPException(status_code=500, detail=f"Search failed: {str(e)}")
@app.get("/")
def read_root():
return {"status": "ok", "service": "Fashion Semantic Search API"}
EOF
Test Multilingual Search
The database contains products with descriptions in English. A key advantage of semantic search is the ability to handle queries written in a different language than the stored data. A multilingual model recognizes that a query in French or Polish can express the same intent as an English product description.
Run the Indexing Process
Before running the indexer, mark the src directory as a Python package:
touch src/__init__.py
At this point, the project structure should look like this:
project-root/
├── src/
│ ├── __init__.py
│ ├── ai.py
│ ├── database.py
│ ├── indexer.py
│ └── main.py
├── requirements.txt
├── .env
└── venv/
Start by indexing the data. This step downloads the dataset from Hugging Face, generates embeddings for each product, and stores the documents together with their vectors in MongoDB. Run the indexer:
python -m src.indexer
The terminal displays progress information for each step of the process, including database initialization, model loading, dataset download, document processing, and vector index creation:
[1/6] Initializing MongoDB connection...
✓ MongoDB connected
[2/6] Loading AI model...
Model ready.
✓ AI model loaded
[3/6] Getting MongoDB collection...
✓ Collection ready
[4/6] Downloading fashion dataset from Hugging Face...
✓ Dataset downloaded: 44072 total products
[5/6] Processing 200 products (randomly sampled)...
✓ Cleared existing data
Processing item 1/200...
Processing item 21/200...
Processing item 41/200...
Processing item 61/200...
Processing item 81/200...
Processing item 101/200...
Processing item 121/200...
Processing item 141/200...
Processing item 161/200...
Processing item 181/200...
Saving 200 products to MongoDB...
✓ Saved 200 products
[6/6] Creating vector search index...
Vector index 'vector_index' has been initialized.
==================================================
✓ Indexing completed successfully!
==================================================
Run the API Server
Activate the virtual environment and start the FastAPI server:
source venv/bin/activate
python -m uvicorn src.main:app --reload
The API is now available at http://localhost:8000
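Before moving on to the test queries, you can optionally sanity-check the application in-process with FastAPI's TestClient, without going through HTTP (this requires the httpx package, which you may need to install into the virtual environment):
python - <<'EOF'
from fastapi.testclient import TestClient
from src.main import app

client = TestClient(app)
response = client.post("/search", json={"query": "running shoes", "limit": 3})
# Expect a 200 status and a ranked list of footwear products
print(response.status_code, response.json())
EOF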
Execute Test Queries
The first test checks cross-language semantic search using a French query. The database contains product descriptions written in English.
curl -X POST "http://localhost:8000/search" \
-H "Content-Type: application/json" \
-d '{"query": "chaussures de course confortables pour marathon"}'
Output:
{
"query": "chaussures de course confortables pour marathon",
"results": [
{
"name": "Nike Men Transform III White Sports Shoes",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.7942747473716736
},
{
"name": "Fila Men's Basic Low Red Shoe",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.7913200855255127
},
{
"name": "ADIDAS Unisex Brown Casual Shoes",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.789846658706665
},
{
"name": "Numero Uno Men's Red Canvas Shoe",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.7863043546676636
},
{
"name": "Lotto Men Sprint Black Sports Shoes",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.7819452285766602
}
]
}
The output shows that MongoDB Atlas Vector Search returns a ranked list of products based on vector similarity scores. Although the query is written in French and the product data is in English, the multilingual embedding model correctly identifies that the query refers to athletic footwear. The top results are sports shoes from well-known athletic brands such as Nike and Lotto, which are semantically close to "marathon running shoes".
The second test verifies cross-language understanding using a Polish query. "Wygodne buty do biegania na maraton" translates to "comfortable marathon running shoes", the same concept as the French query above.
curl -X POST "http://localhost:8000/search" \
-H "Content-Type: application/json" \
-d '{"query": "wygodne buty do biegania na maraton"}'
Output:
{
"query": "wygodne buty do biegania na maraton",
"results": [
{
"name": "Lotto Men Sprint Black Sports Shoes",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.7927727699279785
},
{
"name": "Nike Men Transform III White Sports Shoes",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.787016749382019
},
{
"name": "Fila Men's Basic Low Red Shoe",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.7850795984268188
},
{
"name": "ADIDAS Unisex Brown Casual Shoes",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.7824510335922241
},
{
"name": "Nike Men Downshifter White Sports Shoes",
"category": "Footwear",
"subCategory": "Shoes",
"score": 0.7783344388008118
}
]
}
The Polish query returns the same category of products as the French query — athletic footwear from Nike, Lotto, Fila, and Adidas. This confirms that the multilingual model correctly maps semantically equivalent queries across languages to the same product space.
The third test explores a different product category. This Polish query translates to "elegant evening handbag" and tests whether the model can identify accessories rather than footwear.
curl -X POST "http://localhost:8000/search" \
-H "Content-Type: application/json" \
-d '{"query": "elegancka wieczorowa torebka"}'
Output:
{
"query": "elegancka wieczorowa torebka",
"results": [
{
"name": "SDL by Sweet Dreams Women Green Printed Night Suit S11-3124",
"category": "Apparel",
"subCategory": "Loungewear and Nightwear",
"score": 0.7503366470336914
},
{
"name": "Lucera Women Silver Pendant",
"category": "Accessories",
"subCategory": "Jewellery",
"score": 0.7264891862869263
},
{
"name": "Giorgio Armani Women Idole Perfume",
"category": "Personal Care",
"subCategory": "Fragrance",
"score": 0.7015564441680908
},
{
"name": "Levis Men Comfort Style Grey Innerwear Vest",
"category": "Apparel",
"subCategory": "Innerwear",
"score": 0.6910820007324219
},
{
"name": "Kiara Women Teal Handbag",
"category": "Accessories",
"subCategory": "Bags",
"score": 0.6863116025924683
}
]
}
The results show that the semantic search engine understands the Polish query, even though products are described in English. However, the handbag appears only at position 5. The word "wieczorowa" (evening) led the model to nightwear, and "elegancka" (elegant) to jewelry and perfume. These connections make sense linguistically, but they are not what the user was looking for. In a real application, adding category filters would help return more relevant results.
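As a sketch of that improvement (an assumption about how you might extend this demo, not something the code above already does), the index definition in src/database.py would declare category as a filter field, and the $vectorSearch stage would restrict candidates before similarity scoring:
# In the index definition, add a filter field alongside the vector field:
"fields": [
    {
        "type": "vector",
        "path": "embedding",
        "numDimensions": 384,
        "similarity": "cosine"
    },
    {"type": "filter", "path": "category"}
]

# In the $vectorSearch stage, pass a filter on that field so only
# matching documents are considered as candidates:
"$vectorSearch": {
    "index": "vector_index",
    "path": "embedding",
    "queryVector": query_vector,
    "numCandidates": 100,
    "limit": payload.limit,
    "filter": {"category": "Accessories"}
}
With this in place, the handbag query would only be compared against documents in the Accessories category, eliminating the nightwear and fragrance matches.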
Summary
This semantic search system is built on Hugging Face language models, which transform text into dense vector embeddings that capture meaning across languages. Using a multilingual model from the Hugging Face Hub eliminates the need to train custom models and enables cross-language similarity out of the box.
These embeddings are stored and searched with MongoDB Atlas Vector Search, while the application layer connects user queries to relevant results. Together, Hugging Face and vector search enable multilingual, meaning-based retrieval that goes beyond traditional keyword matching.
As shown in the test results, semantic search correctly identifies broad categories across languages. The less precise results for the handbag query are primarily due to the limited sample size of 200 products used in this demo—with a full catalog, the system would have more relevant items to match. For production applications, consider indexing the complete dataset, combining vector search with metadata filters, implementing re-ranking, or using domain-specific models to improve precision.
