If you operate a custom e-commerce platform or a legacy CMS, you've likely hit a frustrating wall: there is no "Export to Google Shopping" button. Google Merchant Center is the gateway to Google Shopping ads, but it requires your product data in a strictly formatted XML or TSV file.
Without a native API or a pre-built plugin, you might assume you're stuck with manual data entry or expensive middleware. However, if your products are visible on your website, you already have the data you need.
You can bridge this gap by building a custom pipeline that scrapes your storefront, normalizes the data, and generates a compliant XML feed. This guide walks through turning raw HTML into a Google-ready shopping feed using Python.
Understanding the Google Shopping Feed Specification
Before writing any code, we need to understand the requirements. Google Merchant Center requires an RSS 2.0 or Atom 1.0 feed with specific namespaces.
Google's specification is famously strict. If your price format is off or a required field is missing, the offending items will be disapproved, and systematic errors can get the entire feed rejected.
Required Fields
To get products approved, you must provide these core attributes:
- `id`: A unique, persistent identifier for the product.
- `title`: The name of your product (max 150 characters).
- `description`: A clear description of the item.
- `link`: The direct URL to the product page.
- `image_link`: The URL of the main product image.
- `availability`: Must be `in_stock`, `out_of_stock`, `preorder`, or `backorder`.
- `price`: The numeric value followed by the ISO 4217 currency code (e.g., `29.99 USD`).
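If you want to fail fast before uploading, a small pre-flight check helps. Here is a minimal sketch, assuming each product is a plain dict keyed by attribute name; the `missing_fields` helper is this guide's own invention, not part of Google's tooling:

```python
# Hypothetical pre-flight check: products are plain dicts keyed by attribute name
REQUIRED_FIELDS = ["id", "title", "description", "link",
                   "image_link", "availability", "price"]

def missing_fields(product: dict) -> list:
    """Return the names of required attributes that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not product.get(f)]
```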
The XML Structure
A valid Google Shopping item looks like this:
```xml
<item>
  <g:id>SKU-12345</g:id>
  <g:title>Leather Chelsea Boots</g:title>
  <g:description>Handcrafted men's boots made from genuine leather.</g:description>
  <g:link>https://example.com/products/chelsea-boots</g:link>
  <g:image_link>https://example.com/images/boots.jpg</g:image_link>
  <g:availability>in_stock</g:availability>
  <g:price>120.00 USD</g:price>
</item>
```
The g: prefix refers to the Google namespace, which we must define in the root XML element.
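Concretely, that declaration lives on the root element, and every `g:`-prefixed tag resolves against it:

```xml
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <!-- feed metadata and <item> entries go here -->
  </channel>
</rss>
```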
Prerequisites
To follow this tutorial, you will need:
- Python 3.x
- The `requests` library for fetching web pages.
- The `beautifulsoup4` library for parsing HTML.
Install the dependencies via pip:
```bash
pip install requests beautifulsoup4
```
1. Setting Up the Scraper
The first task is extracting raw data from the website. In this example, we’ll scrape a list of products from a category page and store the raw strings in a list of dictionaries.
```python
import requests
from bs4 import BeautifulSoup

def scrape_products(url):
    # A timeout and an explicit error check keep the job from hanging silently
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    products = []
    # Targeting product containers (adjust selectors for your site)
    for product_div in soup.select('.product-item'):
        item = {
            'id': product_div.get('data-sku'),
            'title': product_div.select_one('.title').text,
            'price': product_div.select_one('.price').text,
            'link': product_div.select_one('a')['href'],
            'image': product_div.select_one('img')['src'],
            'stock_text': product_div.select_one('.stock-status').text,
        }
        products.append(item)
    return products
```
At this stage, the data is messy. The price might look like `\n $19.99 \n` and the image link might be a relative path like `/media/img.jpg`. This won't pass Google's validation.
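For illustration, a single scraped record might come back looking like this (the values are hypothetical):

```python
# Hypothetical raw record from scrape_products -- note the stray
# whitespace and relative paths copied straight from the HTML
raw_item = {
    'id': 'SKU-12345',
    'title': ' Leather Chelsea Boots ',
    'price': '\n $19.99 \n',
    'link': '/products/chelsea-boots',
    'image': '/media/img.jpg',
    'stock_text': 'In Stock!',
}
```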
2. Data Normalization
Normalization transforms human-readable HTML into machine-compliant data. These helper functions handle the heavy lifting.
Cleaning Prices and URLs
Google requires prices to have a currency code and links to be absolute (starting with https://).
```python
import re

def clean_price(raw_price):
    # Drop thousands separators so "1,299.99" parses, then extract the amount
    raw_price = raw_price.replace(",", "")
    match = re.search(r"(\d+\.\d{2})", raw_price)
    if match:
        return f"{match.group(1)} USD"
    return ""

def make_absolute(url, base_url="https://example.com"):
    # Google rejects relative links, so prepend the domain when needed
    if url.startswith('/'):
        return f"{base_url}{url}"
    return url

def map_availability(text):
    # Map human-readable stock labels onto Google's fixed vocabulary
    text = text.lower()
    if "in stock" in text or "add to cart" in text:
        return "in_stock"
    return "out_of_stock"
```
Processing the Scraped List
Now we iterate through the raw data and apply these transformations:
```python
import hashlib

def normalize_data(raw_products):
    clean_list = []
    for item in raw_products:
        # Built-in hash() is salted per process, so derive a stable
        # fallback ID from the product URL instead
        fallback_id = "id-" + hashlib.md5(item['link'].encode()).hexdigest()[:12]
        clean_item = {
            'id': item['id'] or fallback_id,
            'title': item['title'].strip()[:150],
            'description': f"Buy {item['title'].strip()} online.",  # Fallback description
            'link': make_absolute(item['link']),
            'image_link': make_absolute(item['image']),
            'availability': map_availability(item['stock_text']),
            'price': clean_price(item['price'])
        }
        # Per Google's rules, skip items without a usable price
        # rather than emitting an empty tag
        if not clean_item['price']:
            continue
        clean_list.append(clean_item)
    return clean_list
```
3. Generating the XML File
To generate the XML, use Python's built-in xml.etree.ElementTree. This is safer than manual string concatenation because it handles character escaping for symbols like & automatically.
We must define the Google Merchant Center namespace (http://base.google.com/ns/1.0) so Google recognizes the tags.
```python
import xml.etree.ElementTree as ET

G_NS = "{http://base.google.com/ns/1.0}"

def generate_xml_feed(products, filename="google_feed.xml"):
    # Register the namespace so tags serialize with the 'g:' prefix
    ET.register_namespace('g', "http://base.google.com/ns/1.0")

    # Create the root RSS structure
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    # Add basic channel info
    ET.SubElement(channel, "title").text = "My Digital Store"
    ET.SubElement(channel, "link").text = "https://example.com"
    ET.SubElement(channel, "description").text = "Product feed for Google Shopping"

    for product in products:
        item = ET.SubElement(channel, "item")
        # Standard RSS tags
        ET.SubElement(item, "title").text = product['title']
        ET.SubElement(item, "link").text = product['link']
        ET.SubElement(item, "description").text = product['description']
        # Google-specific tags with the 'g:' namespace
        ET.SubElement(item, G_NS + "id").text = product['id']
        ET.SubElement(item, G_NS + "image_link").text = product['image_link']
        ET.SubElement(item, G_NS + "availability").text = product['availability']
        ET.SubElement(item, G_NS + "price").text = product['price']

    # Write to file
    tree = ET.ElementTree(rss)
    tree.write(filename, encoding="utf-8", xml_declaration=True)
    print(f"Feed successfully generated: {filename}")
```
4. Validation and Automation
Once you have your google_feed.xml, open it in a web browser. If the XML is well-formed, the browser will display the tree structure. If there is a syntax error, it will show a parsing error.
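You can run the same check programmatically without leaving the terminal: if ElementTree can parse the file back, it is at least well-formed XML (Merchant Center still enforces the content rules separately):

```python
import xml.etree.ElementTree as ET

try:
    ET.parse("google_feed.xml")
    print("Feed is well-formed XML")
except ET.ParseError as e:
    print(f"Syntax error in feed: {e}")
```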
Common Pitfalls
- Relative URLs: Google cannot crawl `/images/product.jpg`. Ensure your script prepends the domain.
- Missing Prices: If a product has no price, exclude it from the feed rather than sending an empty tag.
- Character Encoding: Use `utf-8` to handle special characters (like the registered trademark symbol ®) and avoid rejection.
Automating the Update
Google Shopping feeds aren't static. If prices change or items go out of stock, your feed must reflect that. Host this script on a server and set up a Cron job to run it every 24 hours.
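For example, a crontab entry along these lines rebuilds the feed nightly; the script path and log location are placeholders for your own deployment:

```bash
# Hypothetical crontab entry: rebuild the feed daily at 02:00
0 2 * * * /usr/bin/python3 /opt/feeds/build_feed.py >> /var/log/feed_build.log 2>&1
```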
Once the file is hosted at a public URL, such as https://example.com/feeds/google_feed.xml, you can set up a Scheduled Fetch in Google Merchant Center. Google will then automatically download your fresh XML at a time you specify.
To Wrap Up
Generating a Google Shopping feed without an API is straightforward with Python. By extracting data directly from your HTML and normalizing it to meet Google's standards, you can run Google Shopping ads regardless of your e-commerce platform.
Key Takeaways:
- Scrape with intent: Focus on core attributes like ID, Price, and Availability.
- Normalize strictly: Use regex to ensure prices and URLs are perfectly formatted.
- Use ElementTree: Avoid manual string building to prevent syntax errors.
- Automate updates: Use a fetch schedule in Merchant Center to keep ads synced with your site.
For real-world scraping implementations, explore the Walmart.com Scrapers GitHub Repository.
If you are working with websites that require JavaScript rendering to load pricing, consider using the ScrapeOps Parser API to simplify extraction and reduce infrastructure overhead.