If you operate a custom e-commerce platform or a legacy CMS, you've likely hit a frustrating wall: there is no "Export to Google Shopping" button. Google Merchant Center is the gateway to Google Shopping ads, but it requires your product data in a strictly formatted XML or TSV file.
Without a native API or a pre-built plugin, you might assume you're stuck with manual data entry or expensive middleware. However, if your products are visible on your website, you already have the data you need.
You can bridge this gap by building a custom pipeline that scrapes your storefront, normalizes the data, and generates a compliant XML feed. This guide walks through turning raw HTML into a Google-ready shopping feed using Python.
Understanding the Google Shopping Feed Specification
Before writing any code, we need to understand the requirements. Google Merchant Center requires an RSS 2.0 or Atom 1.0 feed with specific namespaces.
Google's specification is famously strict. If your price format is off or a required field is missing, the offending items will be disapproved, and systematic errors can get the entire feed rejected.
Required Fields
To get products approved, you must provide these core attributes:
- `id`: A unique, persistent identifier for the product.
- `title`: The name of your product (max 150 characters).
- `description`: A clear description of the item.
- `link`: The direct URL to the product page.
- `image_link`: The URL of the main product image.
- `availability`: Must be `in_stock`, `out_of_stock`, `preorder`, or `backorder`.
- `price`: The numeric value followed by the ISO 4217 currency code (e.g., `29.99 USD`).
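If you want to fail fast before uploading, a small pre-flight check helps. Here is a minimal sketch, assuming each product is a plain dict keyed by attribute name; the `missing_fields` helper is this guide's own invention, not part of Google's tooling:

```python
# Hypothetical pre-flight check: products are plain dicts keyed by attribute name
REQUIRED_FIELDS = ["id", "title", "description", "link",
                   "image_link", "availability", "price"]

def missing_fields(product: dict) -> list:
    """Return the names of required attributes that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not product.get(f)]
```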
The XML Structure
A valid Google Shopping item looks like this:
```xml
<item>
  <g:id>SKU-12345</g:id>
  <g:title>Leather Chelsea Boots</g:title>
  <g:description>Handcrafted men's boots made from genuine leather.</g:description>
  <g:link>https://example.com/products/chelsea-boots</g:link>
  <g:image_link>https://example.com/images/boots.jpg</g:image_link>
  <g:availability>in_stock</g:availability>
  <g:price>120.00 USD</g:price>
</item>
```
The g: prefix refers to the Google namespace, which we must define in the root XML element.
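Concretely, that declaration lives on the root element, and every `g:`-prefixed tag resolves against it:

```xml
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <!-- feed metadata and <item> entries go here -->
  </channel>
</rss>
```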
Prerequisites
To follow this tutorial, you will need:
- Python 3.x
- The `requests` library for fetching web pages.
- The `beautifulsoup4` library for parsing HTML.
Install the dependencies via pip:
```bash
pip install requests beautifulsoup4
```
1. Setting Up the Scraper
The first task is extracting raw data from the website. In this example, we’ll scrape a list of products from a category page and store the raw strings in a list of dictionaries.
```python
import requests
from bs4 import BeautifulSoup

def scrape_products(url):
    # A timeout and an explicit error check keep the job from hanging silently
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    products = []
    # Targeting product containers (adjust selectors for your site)
    for product_div in soup.select('.product-item'):
        item = {
            'id': product_div.get('data-sku'),
            'title': product_div.select_one('.title').text,
            'price': product_div.select_one('.price').text,
            'link': product_div.select_one('a')['href'],
            'image': product_div.select_one('img')['src'],
            'stock_text': product_div.select_one('.stock-status').text,
        }
        products.append(item)
    return products
```
At this stage, the data is messy. The price might look like `\n $19.99 \n` and the image link might be a relative path like `/media/img.jpg`. This won't pass Google's validation.
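For illustration, a single scraped record might come back looking like this (the values are hypothetical):

```python
# Hypothetical raw record from scrape_products -- note the stray
# whitespace and relative paths copied straight from the HTML
raw_item = {
    'id': 'SKU-12345',
    'title': ' Leather Chelsea Boots ',
    'price': '\n $19.99 \n',
    'link': '/products/chelsea-boots',
    'image': '/media/img.jpg',
    'stock_text': 'In Stock!',
}
```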
2. Data Normalization
Normalization transforms human-readable HTML into machine-compliant data. These helper functions handle the heavy lifting.
Cleaning Prices and URLs
Google requires prices to have a currency code and links to be absolute (starting with https://).
```python
import re

def clean_price(raw_price):
    # Drop thousands separators so "1,299.99" parses, then extract the amount
    raw_price = raw_price.replace(",", "")
    match = re.search(r"(\d+\.\d{2})", raw_price)
    if match:
        return f"{match.group(1)} USD"
    return ""

def make_absolute(url, base_url="https://example.com"):
    # Google rejects relative links, so prepend the domain when needed
    if url.startswith('/'):
        return f"{base_url}{url}"
    return url

def map_availability(text):
    # Map human-readable stock labels onto Google's fixed vocabulary
    text = text.lower()
    if "in stock" in text or "add to cart" in text:
        return "in_stock"
    return "out_of_stock"
```
Processing the Scraped List
Now we iterate through the raw data and apply these transformations:
```python
import hashlib

def normalize_data(raw_products):
    clean_list = []
    for item in raw_products:
        # Built-in hash() is salted per process, so derive a stable
        # fallback ID from the product URL instead
        fallback_id = "id-" + hashlib.md5(item['link'].encode()).hexdigest()[:12]
        clean_item = {
            'id': item['id'] or fallback_id,
            'title': item['title'].strip()[:150],
            'description': f"Buy {item['title'].strip()} online.",  # Fallback description
            'link': make_absolute(item['link']),
            'image_link': make_absolute(item['image']),
            'availability': map_availability(item['stock_text']),
            'price': clean_price(item['price'])
        }
        # Per Google's rules, skip items without a usable price
        # rather than emitting an empty tag
        if not clean_item['price']:
            continue
        clean_list.append(clean_item)
    return clean_list
```
3. Generating the XML File
To generate the XML, use Python's built-in xml.etree.ElementTree. This is safer than manual string concatenation because it handles character escaping for symbols like & automatically.
We must define the Google Merchant Center namespace (http://base.google.com/ns/1.0) so Google recognizes the tags.
```python
import xml.etree.ElementTree as ET

G_NS = "{http://base.google.com/ns/1.0}"

def generate_xml_feed(products, filename="google_feed.xml"):
    # Register the namespace so tags serialize with the 'g:' prefix
    ET.register_namespace('g', "http://base.google.com/ns/1.0")

    # Create the root RSS structure
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    # Add basic channel info
    ET.SubElement(channel, "title").text = "My Digital Store"
    ET.SubElement(channel, "link").text = "https://example.com"
    ET.SubElement(channel, "description").text = "Product feed for Google Shopping"

    for product in products:
        item = ET.SubElement(channel, "item")
        # Standard RSS tags
        ET.SubElement(item, "title").text = product['title']
        ET.SubElement(item, "link").text = product['link']
        ET.SubElement(item, "description").text = product['description']
        # Google-specific tags with the 'g:' namespace
        ET.SubElement(item, G_NS + "id").text = product['id']
        ET.SubElement(item, G_NS + "image_link").text = product['image_link']
        ET.SubElement(item, G_NS + "availability").text = product['availability']
        ET.SubElement(item, G_NS + "price").text = product['price']

    # Write to file
    tree = ET.ElementTree(rss)
    tree.write(filename, encoding="utf-8", xml_declaration=True)
    print(f"Feed successfully generated: {filename}")
```
4. Validation and Automation
Once you have your google_feed.xml, open it in a web browser. If the XML is well-formed, the browser will display the tree structure. If there is a syntax error, it will show a parsing error.
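You can run the same check programmatically without leaving the terminal: if ElementTree can parse the file back, it is at least well-formed XML (Merchant Center still enforces the content rules separately):

```python
import xml.etree.ElementTree as ET

try:
    ET.parse("google_feed.xml")
    print("Feed is well-formed XML")
except ET.ParseError as e:
    print(f"Syntax error in feed: {e}")
```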
Common Pitfalls
- Relative URLs: Google cannot crawl `/images/product.jpg`. Ensure your script prepends the domain.
- Missing Prices: If a product has no price, exclude it from the feed rather than sending an empty tag.
- Character Encoding: Use `utf-8` to handle special characters (like the registered trademark symbol ®) and avoid rejection.
Automating the Update
Google Shopping feeds aren't static. If prices change or items go out of stock, your feed must reflect that. Host this script on a server and set up a Cron job to run it every 24 hours.
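For example, a crontab entry along these lines rebuilds the feed nightly; the script path and log location are placeholders for your own deployment:

```bash
# Hypothetical crontab entry: rebuild the feed daily at 02:00
0 2 * * * /usr/bin/python3 /opt/feeds/build_feed.py >> /var/log/feed_build.log 2>&1
```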
Once the file is hosted at a public URL, such as https://example.com/feeds/google_feed.xml, you can set up a Scheduled Fetch in Google Merchant Center. Google will then automatically download your fresh XML at a time you specify.
To Wrap Up
Generating a Google Shopping feed without an API is straightforward with Python. By extracting data directly from your HTML and normalizing it to meet Google's standards, you can run Google Shopping ads regardless of your e-commerce platform.
Key Takeaways:
- Scrape with intent: Focus on core attributes like ID, Price, and Availability.
- Normalize strictly: Use regex to ensure prices and URLs are perfectly formatted.
- Use ElementTree: Avoid manual string building to prevent syntax errors.
- Automate updates: Use a fetch schedule in Merchant Center to keep ads synced with your site.
For real-world scraping implementations, explore the Walmart.com Scrapers GitHub Repository.
If you are working with websites that require JavaScript rendering to load pricing, consider using the ScrapeOps Parser API to simplify extraction and reduce infrastructure overhead.