For many e-commerce businesses, Google Shopping drives the majority of their revenue. However, getting product data into Google Merchant Center isn't always straightforward. If you are working with a legacy CMS, a custom-built platform, or a client who lacks direct database access, you likely won't have a native "Export to Google Shopping" plugin at your disposal.
When traditional tools fail, Python can bridge the gap. By scraping a website's front-end and transforming that data into a Google-compliant XML format, you can build a reliable product feed from scratch.
This guide walks through building a custom pipeline that crawls an e-commerce site, cleans the product data, and generates a valid feed.xml file ready for high-performance ad campaigns.
## Understanding Google’s Feed Requirements
Before writing code, you need to understand the target format. Google Merchant Center is strict: if a currency is formatted incorrectly or an availability status uses the wrong keyword, Google will disapprove the offending items, and structural errors can invalidate the entire feed.
Google Shopping feeds are XML-based and rely on a specific namespace (g:). While there are dozens of optional attributes, every product must include these mandatory fields:
| Attribute | Description | Requirement Example |
|---|---|---|
| `id` | A unique, persistent identifier | `sku_123` or a stable hash |
| `title` | The name of your product | Max 150 characters |
| `description` | Detailed product info | No marketing jargon |
| `link` | The product landing page | Must be a verified domain |
| `image_link` | URL of the main image | High res, no watermarks |
| `availability` | Current stock status | `in_stock`, `out_of_stock`, `preorder` |
| `price` | Numerical price + currency | `1299.00 USD` |
For a full list of attributes, refer to the official Google Product Data specification.
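To make the target concrete, here is what a single compliant `<item>` looks like inside the feed (the values here are purely illustrative):

```xml
<item>
  <title>Acme Wireless Headphones</title>
  <description>Over-ear wireless headphones with 30-hour battery life.</description>
  <link>https://example-shop.com/products/acme-headphones</link>
  <g:id>sku_123</g:id>
  <g:image_link>https://example-shop.com/images/acme-headphones.jpg</g:image_link>
  <g:availability>in_stock</g:availability>
  <g:price>1299.00 USD</g:price>
</item>
```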
## Prerequisites
To follow this tutorial, you will need:
- Python 3.7+ installed
- Familiarity with `requests` and `BeautifulSoup`
- Python's built-in `xml.etree.ElementTree` module for XML generation (no installation required)
Install the required third-party packages via pip:

```bash
pip install requests beautifulsoup4
```
## Step 1: Acquiring the Raw Data
You need a way to gather product information. In a production scenario, you might use a spider to crawl the entire site, but for this example, we’ll create a function that extracts data from a single product page.
If you need a real-world example of structured e-commerce extraction, the Flipkart.com Scrapers Repository demonstrates scalable patterns that can be adapted into this feed generation pipeline.
The following code intentionally grabs "dirty" data, such as raw strings with currency symbols and extra whitespace, to simulate the reality of web scraping.
```python
import requests
from bs4 import BeautifulSoup

def scrape_product_page(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Fail fast on 4xx/5xx responses
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract raw data points (selectors depend on the target site's markup)
    product_data = {
        'url': url,
        'title': soup.find('h1').text.strip(),
        'description': soup.find('div', class_='description').text.strip(),
        'price_raw': soup.find('span', class_='price').text.strip(),        # e.g. "$1,299.00"
        'stock_raw': soup.find('div', class_='stock-status').text.strip(),  # e.g. "In Stock"
        'image_url': soup.find('img', id='main-image')['src']
    }
    return product_data
```
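In production, you would feed this function a list of product URLs collected from a sitemap or a category crawl. A minimal sketch, assuming a hypothetical list of URLs:

```python
import time

# Hypothetical URLs; in production, collect these from a sitemap or crawler
product_urls = [
    "https://example-shop.com/products/item-1",
    "https://example-shop.com/products/item-2",
]

raw_products = []
for url in product_urls:
    raw_products.append(scrape_product_page(url))
    time.sleep(1)  # Throttle requests to avoid overloading the server
```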
## Step 2: Data Transformation and Cleaning
Google will reject a price listed as "$1,299.00". It expects a numeric value followed by the ISO 4217 currency code, such as 1299.00 USD.
The product id must also be stable. If an ID changes, Google treats it as a new product, which resets the performance history and quality scores your ads have accumulated. If the site doesn't expose a SKU, you can create a deterministic ID by hashing the product URL.
```python
import re
import hashlib

def clean_data(raw_item):
    # 1. Normalize price: remove symbols/commas and append the currency code
    price_numeric = re.sub(r'[^\d.]', '', raw_item['price_raw'])
    formatted_price = f"{float(price_numeric):.2f} USD"

    # 2. Map availability: convert "In Stock" or "3 Left" to "in_stock"
    stock_text = raw_item['stock_raw'].lower()
    availability = "in_stock" if "in stock" in stock_text or "left" in stock_text else "out_of_stock"

    # 3. Generate a stable ID: hash the URL for consistency across runs
    product_id = hashlib.md5(raw_item['url'].encode()).hexdigest()[:10]

    return {
        'id': product_id,
        'title': raw_item['title'][:150],
        'description': raw_item['description'],
        'link': raw_item['url'],
        'image_link': raw_item['image_url'],
        'availability': availability,
        'price': formatted_price
    }
```
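For example, feeding the cleaner a raw item shaped like the scraper's output produces Google-ready values:

```python
raw = {
    'url': 'https://example-shop.com/products/item-1',
    'title': 'Acme Wireless Headphones',
    'description': 'Over-ear wireless headphones with 30-hour battery life.',
    'price_raw': '$1,299.00',
    'stock_raw': 'In Stock',
    'image_url': 'https://example-shop.com/images/item-1.jpg',
}

cleaned = clean_data(raw)
print(cleaned['price'])         # 1299.00 USD
print(cleaned['availability'])  # in_stock
print(cleaned['id'])            # 10-character hash derived from the URL
```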
## Step 3: Generating the XML Feed
A dedicated feed library like `feedgen` handles standard RSS/Atom feeds well, but it does not ship a Google Merchant extension, so producing the `g:` attributes with it would mean writing a custom extension. For this task, Python's built-in `xml.etree.ElementTree` is the more dependable tool: it supports custom namespaces natively and escapes special characters in text automatically.

Google Shopping feeds require the `http://base.google.com/ns/1.0` namespace. Registering it under the `g:` prefix ensures the product-specific tags serialize correctly.
```python
import xml.etree.ElementTree as ET

G_NS = "http://base.google.com/ns/1.0"

def generate_xml_feed(cleaned_products, output_file="google_feed.xml"):
    # Register the namespace so Google-specific tags serialize with the g: prefix
    ET.register_namespace('g', G_NS)

    rss = ET.Element('rss', {'version': '2.0'})
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = 'My Custom E-commerce Store'
    ET.SubElement(channel, 'link').text = 'https://example-shop.com'
    ET.SubElement(channel, 'description').text = 'Daily updated product feed for Google Shopping'

    for item in cleaned_products:
        entry = ET.SubElement(channel, 'item')
        # Standard RSS fields
        ET.SubElement(entry, 'title').text = item['title']
        ET.SubElement(entry, 'description').text = item['description']
        ET.SubElement(entry, 'link').text = item['link']
        # Google-specific attributes live in the g: namespace
        ET.SubElement(entry, f'{{{G_NS}}}id').text = item['id']
        ET.SubElement(entry, f'{{{G_NS}}}image_link').text = item['image_link']
        ET.SubElement(entry, f'{{{G_NS}}}availability').text = item['availability']
        ET.SubElement(entry, f'{{{G_NS}}}price').text = item['price']

    ET.ElementTree(rss).write(output_file, encoding='UTF-8', xml_declaration=True)
    print(f"Success! Feed generated at {output_file}")
```
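With all three steps in place, a minimal end-to-end run (reusing the hypothetical `product_urls` list from the Step 1 sketch) looks like this:

```python
if __name__ == "__main__":
    raw_products = [scrape_product_page(url) for url in product_urls]
    cleaned = [clean_data(item) for item in raw_products]
    generate_xml_feed(cleaned, output_file="google_feed.xml")
```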
## Step 4: Validation and Upload
Once the google_feed.xml is ready, use the Google Merchant Center "Test" feature. This allows you to upload a file and see errors without affecting live ads.
- Host the file: Google needs a static URL to fetch the feed. You can host this on an AWS S3 bucket, a VPS, or a private GitHub Pages site.
- Add Feed in Merchant Center: Navigate to Feeds > Primary Feeds > Add New.
- Select "Scheduled Fetch": Enter the URL where you hosted your XML file.
- Check for Errors: Look for encoding issues or missing mandatory attributes. A common pitfall is unescaped HTML characters in the description; the quick local check below catches malformed XML and missing fields before you upload.
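A minimal local sanity check, assuming the `google_feed.xml` produced in Step 3:

```python
import xml.etree.ElementTree as ET

G = "{http://base.google.com/ns/1.0}"
REQUIRED = ['title', 'description', 'link',
            f'{G}id', f'{G}image_link', f'{G}availability', f'{G}price']

tree = ET.parse("google_feed.xml")  # Raises ParseError if the XML is malformed
for item in tree.getroot().iter('item'):
    for tag in REQUIRED:
        element = item.find(tag)
        if element is None or not (element.text or '').strip():
            print(f"Missing or empty mandatory attribute: {tag}")
```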
## Automating the Update Cycle
A product feed is a living document. If you scrape the site once and never update it, your ads will eventually show incorrect prices. If Google detects a price mismatch between your feed and your landing page, they may disapprove the product or suspend your account.
Automate the script to run at least once every 24 hours. On a Linux server, a cron job is the most reliable method. To run the script every night at 2:00 AM, run `crontab -e` and add:

```bash
0 2 * * * /usr/bin/python3 /path/to/your/script.py
```
In Google Merchant Center, set the "Fetch Schedule" to 3:00 AM to ensure Google picks up the fresh data shortly after it is generated.
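One subtlety worth handling: if Google's fetch overlaps with the script mid-write, it could download a truncated file. A common safeguard, sketched below reusing `generate_xml_feed` from Step 3, is to write to a temporary file and atomically swap it into place:

```python
import os
import tempfile

def write_feed_atomically(cleaned_products, final_path="google_feed.xml"):
    # Write to a temp file in the same directory, then atomically replace the
    # live feed, so a scheduled fetch never sees a half-written file
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".", suffix=".xml")
    os.close(fd)
    generate_xml_feed(cleaned_products, output_file=tmp_path)
    os.replace(tmp_path, final_path)  # Atomic on the same filesystem
```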
## To Wrap Up
Building your own Google Shopping feed pipeline gives you total control over your data. Instead of being limited by a plugin, you can inject custom labels, optimize titles for SEO, and include products from platforms that were previously difficult to advertise.
By following this workflow, you can extract raw web data and transform it into a professional, compliant XML feed. To speed up development, you can explore the ScrapeOps AI Web Scraping Assistant to automatically generate scraper logic for your pipeline.