Rose Wabere
Nairobi Property Listings Scraping: 500+ Property Listings with Smart Size Extraction

The Challenge

Building a house price prediction model for Nairobi requires a dataset that simply doesn’t exist in the open. You have to build it yourself by scraping property portals. But real estate data is messy: listings are inconsistent, poorly structured, and often contradictory.

Take the size field, for instance: one listing gives size as "4,350 sq. ft." Another says "Approx. 350 – 400 sqm." A third buries three villa sizes inside a paragraph.

So, how do you extract clean, usable data from this chaos?

In this post, I’ll walk you through the architecture of a scraper that collected 528 listings with 10 structured fields, focusing on the size extraction logic that handles ranges, commas, and mixed units.

Project Structure

nairobi-house-price-prediction/
├── data/
│   └── raw_listings.csv        # Raw CSV output
├── notebooks/
│   └── extraction.ipynb        # Scraping logic
├── src/
│   ├── scraper.py              # Core scraping functions
│   └── utils.py                # Helpers (parsers, cleaners)
├── data_dictionary.json        # Schema definition
└── requirements.txt

Core Scraping Logic

The scraper iterates over pages, extracts the fields it needs from each listing card, then fetches the detail page for richer information. I used requests with retry logic and BeautifulSoup for parsing.
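fetch_page isn't shown in full in this post; here's a minimal sketch of the retry behaviour described later, where the headers, timeout, and return type are my assumptions:

import time

import requests
from bs4 import BeautifulSoup

def fetch_page(url, retries=3, delay=2):
    """Fetch a URL with simple retry logic; returns parsed soup or None."""
    headers = {"User-Agent": "Mozilla/5.0"}  # assumed; use a realistic UA string
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, "html.parser")
        except requests.RequestException:
            if attempt < retries - 1:
                time.sleep(delay)  # wait before the next attempt
    return None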

The Size Extraction Problem

Property size appears in many forms on the site:

Example                      Format
250 m²                       Single, metric
4,350 sq. ft.                Imperial, with comma
Approx. 350 – 400 sqm        Range, metric
465 to 476 square meters     Range with “to”
7,755 sq ft                  Imperial, no dot
240 SQM                      Uppercase

Some listings even contain multiple size mentions (built‑up area, terrace, garden). We need to pick out the most plausible built‑up size.

The extract_size_from_text() function handles all of this.

import re

def extract_size_from_text(text):
    """
    Extract built-up/property size from messy real estate descriptions.
    Returns original string of the most plausible size, or "N/A".
    """
    if not text:
        return "N/A"

    text = text.replace(",", "")
    candidates = []  # (size_in_sqm, original_text)

    # 1. Ranges in sqm
    range_matches = re.findall(
        r'(\d+(\.\d+)?)\s*(?:–|-|to)\s*(\d+(\.\d+)?)\s*(sqm|m²|square meters?)',
        text, re.IGNORECASE
    )
    for match in range_matches:
        high = float(match[2])  # keep the upper bound of the range
        if high >= 30:
            candidates.append((high, f"{match[0]}–{match[2]} sqm"))

    # 2. Single sqm
    sqm_matches = re.findall(
        r'(\d+(\.\d+)?)\s*(sqm|m²|square meters?)',
        text, re.IGNORECASE
    )
    for match in sqm_matches:
        val = float(match[0])
        if val >= 30:
            candidates.append((val, f"{match[0]} sqm"))

    # 3. Square feet (convert to sqm)
    sqft_matches = re.findall(
        r'(\d+(\.\d+)?)\s*(sq\.?\s*ft\.?|sqft)',
        text, re.IGNORECASE
    )
    for match in sqft_matches:
        sqft = float(match[0])
        if sqft >= 300:
            sqm = sqft * 0.092903
            candidates.append((sqm, f"{match[0]} sq ft"))

    # 4. Acres (including fractions)
    acre_matches = re.findall(
        r'(\d+/\d+|\d+(\.\d+)?)\s*-?\s*(acre)',
        text, re.IGNORECASE
    )
    for match in acre_matches:
        raw = match[0]
        if '/' in raw:
            num, den = raw.split('/')
            acres = float(num) / float(den)
        else:
            acres = float(raw)
        if acres >= 0.05:
            sqm = acres * 4046.86
            candidates.append((sqm, f"{raw} acre"))

    if candidates:
        plausible = [c for c in candidates if c[0] >= 30]
        if plausible:
            # Return the largest (most likely built-up area)
            return max(plausible, key=lambda x: x[0])[1]

    return "N/A"

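A quick sanity check against the formats from the table above (outputs assume the function exactly as listed):

samples = [
    "250 m²",
    "4,350 sq. ft. of living space",
    "Approx. 350 – 400 sqm",
    "Sitting on a 1/2 acre plot",
]
for s in samples:
    print(s, "->", extract_size_from_text(s))

# 250 m² -> 250 sqm
# 4,350 sq. ft. of living space -> 4350 sq ft
# Approx. 350 – 400 sqm -> 350–400 sqm
# Sitting on a 1/2 acre plot -> 1/2 acre

Note that the returned strings are normalized (commas stripped, units canonicalized), which is why "4,350 sq. ft." comes back as "4350 sq ft".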

Why This Approach?

Unit normalization allows fair comparison across different measurement systems.

Thresholds (30 sqm, 300 sq ft, 0.05 acre) eliminate noise from tiny areas that are clearly not the main house.

Max selection handles listings that describe multiple areas (e.g., terraces, gardens): we take the largest, which typically corresponds to the built‑up area.

Regex patterns are flexible enough to catch variations like "sqm", "m²", "square meters", "sq. ft.", "sqft", and acre fractions.
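For instance, re.IGNORECASE is what lets "240 SQM" hit the same pattern as "240 sqm":

import re

pattern = r'(\d+(\.\d+)?)\s*(sqm|m²|square meters?)'
print(re.findall(pattern, "240 SQM", re.IGNORECASE))
# [('240', '', 'SQM')]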

Integration with the Scraper

The scraper collects size from two places:

Listing card – quick size from swiper slides or truncated description.

Detail page – full description that contains a more accurate size. If found, it overrides the card size.

def extract_bedrooms_bathrooms_size(listing):
    bedrooms = bathrooms = "N/A"
    size_from_swiper = "N/A"
    size_from_desc = "N/A"  # default, in case the card has no description

    # ... extract from swiper slides ...

    # Extract from description on the card
    desc_div = listing.find('div', id='truncatedDescription')
    if desc_div:
        desc_text = desc_div.get_text(" ", strip=True)
        size_from_desc = extract_size_from_text(desc_text)

    size = size_from_desc if size_from_desc != "N/A" else size_from_swiper
    return bedrooms, bathrooms, size


On the detail page, the full description is passed through the same extract_size_from_text() function.


Then in scrape_listing:

if size_from_detail != "N/A":
    size = size_from_detail  # override card size
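Pieced together, the per-listing flow looks roughly like this. Treat it as a sketch: the link selector and the fetch_page helper are my assumptions, and the real function returns all ten fields:

def scrape_listing(listing, base_url):
    """Combine card data with detail-page data for one listing (sketch)."""
    bedrooms, bathrooms, size = extract_bedrooms_bathrooms_size(listing)

    # Follow the link to the detail page, if the card has one
    link = listing.find("a", href=True)
    if link:
        detail_soup = fetch_page(base_url + link["href"])
        if detail_soup:
            detail_text = detail_soup.get_text(" ", strip=True)
            size_from_detail = extract_size_from_text(detail_text)
            if size_from_detail != "N/A":
                size = size_from_detail  # override card size
    return {"Bedrooms": bedrooms, "Bathrooms": bathrooms, "Size": size}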

Error Handling & Resilience

Retry logic – fetch_page retries each request up to 3 times with a 2‑second delay between attempts (see the sketch above).

Fallback data – if the detail page fails, return the basic info (size from the card).

Listing cap – the main loop stops once max_listings (800) is reached, to avoid over‑scraping.

Polite delays – 1s between detail requests, 2s between pages to prevent IP blocking.
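Put together, the main loop with the cap and the delays might look like this (the page URL pattern and the card selector are placeholders, not the site's real ones):

import time

import pandas as pd

def scrape_pages(start_page, end_page, max_listings):
    rows = []
    for page in range(start_page, end_page + 1):
        soup = fetch_page(f"https://example.com/listings?page={page}")  # placeholder URL
        if soup is None:
            continue  # fetch_page already retried; skip this page
        for card in soup.find_all("div", class_="listing-card"):  # assumed selector
            rows.append(scrape_listing(card, "https://example.com"))
            if len(rows) >= max_listings:
                return pd.DataFrame(rows)  # stop at the cap
            time.sleep(1)  # polite delay between detail requests
        time.sleep(2)  # polite delay between pages
    return pd.DataFrame(rows)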

Data Dictionary: Schema as Code

I defined the schema upfront to ensure consistency:

data_dictionary = [
    {"Column": "Title", "Type": "String", "Description": "Property name"},
    {"Column": "Property Type", "Type": "String", "Description": "Apartment, Townhouse, etc."},
    # ... etc.
]

This is saved as data_dictionary.json, serving as documentation for collaborators and future‑me.
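Writing it out is a single json.dump call:

import json

with open("data_dictionary.json", "w") as f:
    json.dump(data_dictionary, f, indent=2)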

Results after running pages 1–40:

START_PAGE = 1
END_PAGE = 40
MAX_LISTINGS = 800  # safety cap

# Scrape
df = scrape_pages(START_PAGE, END_PAGE, MAX_LISTINGS)

print(len(df))  # 528

print(df.columns)
# ['Title', 'Property Type', 'Price', 'Location', 'Bedrooms', 'Bathrooms', 'Size', 'Amenities', 'Surroundings', 'Created At']

  • 528 listings from pages 1–40 (under the 800 cap)
  • 10 fields per listing
  • Size captured wherever it was available
  • 0 critical failures – thanks to retries and fallbacks
  • Raw CSV saved to data/raw_listings.csv

Lessons for Production Scrapers

Always normalise units – you can't compare apples and oranges.

Filter implausible values – they corrupt your model.

Have a fallback – if detail page fails, keep the card data.

Document your schema – you'll thank yourself later.

Respect the source – rate limiting isn't optional.

Next Steps: Data Cleaning

  • Load raw_listings.csv
  • Remove duplicates
  • Standardize location strings
  • Convert price to integer (remove “KSh”, commas)
  • Convert size to numeric (extract first number from ranges, convert acres to sqm; both conversions are sketched below)
  • Create features: price_per_sqft, amenity_score (count of amenities), month from Created At
  • Basic EDA and save as clean_listings.csv
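As a starting point, here's a minimal sketch of the price and size conversions, assuming pandas and the column names from the schema above (Size_sqm is a new column I'm introducing for illustration):

import re

import pandas as pd

df = pd.read_csv("data/raw_listings.csv")

# Price: "KSh 25,000,000" -> 25000000 (nullable int to survive missing values)
df["Price"] = pd.to_numeric(
    df["Price"].str.replace("KSh", "", regex=False).str.replace(",", "", regex=False),
    errors="coerce",
).astype("Int64")

SQFT_TO_SQM = 0.092903
ACRE_TO_SQM = 4046.86

def size_to_sqm(size):
    """Convert a scraped size string ('350–400 sqm', '4350 sq ft', '1/2 acre') to sqm."""
    if not isinstance(size, str) or size == "N/A":
        return None
    match = re.search(r"(\d+/\d+|\d+(?:\.\d+)?)", size)  # first number, fractions included
    if not match:
        return None
    raw = match.group(1)
    if "/" in raw:
        num, den = raw.split("/")
        value = float(num) / float(den)
    else:
        value = float(raw)
    if "acre" in size.lower():
        return value * ACRE_TO_SQM
    if "ft" in size.lower():
        return value * SQFT_TO_SQM
    return value  # already in square metres

df["Size_sqm"] = df["Size"].apply(size_to_sqm)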

Get the Code

All code is available on GitHub: github

Contributions welcome! Feel free to open issues or PRs.

Connect with Me

LinkedIn: linkedin

#Python #WebScraping #BeautifulSoup #DataScience #MachineLearning #Kenya #OpenSource #Regex
