Igor Nosatov
The Marie Kondo Method for MySQL: A Data Lifecycle Story

The 5-Stage Data Lifecycle: From Birth to... Whatever Comes After

Just like Marie Kondo's tidying method, data has stages. But instead of clothes, we're dealing with rows. Let me walk you through what I learned that sleepless night.

┌─────────────────────────────────────────────────────────┐
│         THE MYSQL BIG DATA LIFECYCLE                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  1. INGESTION     →  2. ACTIVE USE  →  3. AGING         │
│  (Hot Data)          (Warm Data)       (Cool Data)      │
│       ↓                                      ↓           │
│  5. DELETION     ←  4. ARCHIVAL                         │
│  (Goodbye)          (Cold Storage)                       │
│                                                          │
└─────────────────────────────────────────────────────────┘

Stage 1: Ingestion - The "Just Moved In" Phase 🚚

This is when data is fresh, exciting, and everyone wants to touch it. Like that new IKEA furniture you just assembled (and only cursed at twice).

-- The birth of data: INSERT operations
CREATE TABLE user_events (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    event_type VARCHAR(50),
    event_data JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_user_created (user_id, created_at),
    INDEX idx_created (created_at)
) ENGINE=InnoDB;

-- High-velocity ingestion
INSERT INTO user_events (user_id, event_type, event_data)
VALUES (12345, 'page_view', '{"page": "/dashboard", "duration": 45}');

Pro tip: At this stage, optimize for WRITE speed. Use bulk inserts, disable unnecessary indexes during batch loads, and consider partitioning from day one.
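For example, batching rows into multi-row INSERT statements cuts round-trips and log flushes compared to inserting one row at a time. A minimal sketch against the user_events table above (the values are invented):

-- One statement, many rows: far cheaper than row-by-row inserts
INSERT INTO user_events (user_id, event_type, event_data) VALUES
    (12345, 'page_view', '{"page": "/dashboard", "duration": 45}'),
    (12346, 'click',     '{"element": "signup_button"}'),
    (12347, 'page_view', '{"page": "/pricing", "duration": 12}');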

-- Partition by month for easier lifecycle management
CREATE TABLE user_events_partitioned (
    id BIGINT AUTO_INCREMENT,
    user_id INT NOT NULL,
    event_type VARCHAR(50),
    event_data JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id, created_at),
    INDEX idx_user_created (user_id, created_at)
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
    PARTITION p202401 VALUES LESS THAN (UNIX_TIMESTAMP('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (UNIX_TIMESTAMP('2024-03-01')),
    PARTITION p202403 VALUES LESS THAN (UNIX_TIMESTAMP('2024-04-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

Stage 2: Active Use - The "Daily Routine" Phase ☕

This is your data's productive years. It's being queried constantly, updated regularly, and earning its keep in RAM.

-- Active queries hit recent data
SELECT 
    event_type,
    COUNT(*) as event_count,
    AVG(JSON_EXTRACT(event_data, '$.duration')) as avg_duration
FROM user_events
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)
GROUP BY event_type;

The Reality Check: Only about 20% of your data lives here, but it accounts for 80% of your queries. Sound familiar? That's the Pareto Principle, alive and well in your database.

Performance trick:

-- Keep your working set in memory
-- Check how much of the buffer pool holds data pages vs. sits free
-- (assumes the default 16 KB InnoDB page size)
SELECT 
    ROUND(SUM(DATABASE_PAGES) * 16384 / POW(1024, 3), 2) AS buffer_pool_data_gb,
    ROUND(SUM(FREE_BUFFERS) * 16384 / POW(1024, 3), 2) AS buffer_pool_free_gb
FROM information_schema.INNODB_BUFFER_POOL_STATS;

Stage 3: Aging - The "Attic" Phase 📦

Data doesn't retire gracefully. It just... slows down. Like me trying to run after turning 30.

This is when data is:

  • Rarely queried (maybe once a month)
  • Still expected to be accessible
  • Taking up valuable SSD space
  • Making your backups unnecessarily long

-- Identify aging data
SELECT 
    TABLE_NAME,
    ROUND(((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024), 2) AS size_gb,
    TABLE_ROWS,
    CREATE_TIME,
    UPDATE_TIME
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database'
    AND UPDATE_TIME < DATE_SUB(NOW(), INTERVAL 6 MONTH)
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;

The Move: This is where partitioning pays off. You can move old partitions to slower, cheaper storage.

-- Prepare for archival: add a status column
ALTER TABLE user_events ADD COLUMN data_status ENUM('active', 'archived') DEFAULT 'active';

-- Mark old data
UPDATE user_events 
SET data_status = 'archived'
WHERE created_at < DATE_SUB(NOW(), INTERVAL 1 YEAR);

-- Composite index for archived-data queries
-- (MySQL doesn't support partial/filtered indexes, so index the full column pair)
CREATE INDEX idx_archived ON user_events(data_status, created_at);

Stage 4: Archival - The "Storage Unit" Phase 🏢

This is the data equivalent of saying, "I might need this someday, but not today, and definitely not in production."

Here's my battle-tested archival strategy:

-- Step 1: Create archive table (could be on different server/storage)
CREATE TABLE user_events_archive LIKE user_events;

-- Step 2: Move data in manageable chunks (avoid locking everything)
-- Use a stored procedure for this
DELIMITER //
CREATE PROCEDURE archive_old_events(IN batch_size INT, IN days_old INT)
BEGIN
    DECLARE rows_affected INT DEFAULT 1;
    DECLARE total_archived INT DEFAULT 0;

    WHILE rows_affected > 0 DO
        START TRANSACTION;

        -- Copy one batch, oldest rows first
        INSERT INTO user_events_archive
        SELECT * FROM user_events
        WHERE created_at < DATE_SUB(NOW(), INTERVAL days_old DAY)
        ORDER BY id
        LIMIT batch_size;

        SET rows_affected = ROW_COUNT();
        SET total_archived = total_archived + rows_affected;

        -- Delete the same batch: same predicate, same ordering, same size
        DELETE FROM user_events
        WHERE created_at < DATE_SUB(NOW(), INTERVAL days_old DAY)
        ORDER BY id
        LIMIT batch_size;

        COMMIT;

        -- Breathe (prevent replication lag)
        DO SLEEP(0.1);
    END WHILE;

    SELECT CONCAT('Archived ', total_archived, ' rows') AS result;
END//
DELIMITER ;

-- Run it
CALL archive_old_events(1000, 365);

Real talk: I learned this the hard way. My first archival attempt locked the table for 4 hours. The Slack channel was... unpleasant.

Alternative: Partition-based archival (my preferred method now)
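One prerequisite worth spelling out: EXCHANGE PARTITION needs a non-partitioned table with an identical structure to swap with, and for archival you want that table empty so data only moves one way. A quick setup sketch (the table name is illustrative):

-- Build an empty, structurally identical, non-partitioned target table
CREATE TABLE user_events_archive_202401 LIKE user_events_partitioned;
ALTER TABLE user_events_archive_202401 REMOVE PARTITIONING;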

-- Simply detach old partitions
ALTER TABLE user_events_partitioned 
EXCHANGE PARTITION p202401 
WITH TABLE user_events_archive_202401;

-- Now p202401 is empty and archive table has the data
-- This is INSTANT. No copying. Just metadata changes.

Stage 5: Deletion - The "Thanks, Goodbye" Phase 👋

Some data needs to die. GDPR requests, compliance requirements, or just the march of time.

-- Compliance-driven deletion (GDPR "right to be forgotten")
DELIMITER //
CREATE PROCEDURE delete_user_data(IN target_user_id INT)
BEGIN
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        ROLLBACK;
        SELECT 'Error occurred, rolled back' AS message;
    END;

    START TRANSACTION;

    -- Delete from the partitioned table
    -- (user_id isn't the partition key, so this hits every partition; the idx_user_created index keeps it fast)
    DELETE FROM user_events_partitioned WHERE user_id = target_user_id;

    -- Delete from archive
    DELETE FROM user_events_archive WHERE user_id = target_user_id;

    -- Log the deletion for compliance
    INSERT INTO deletion_log (user_id, deleted_at, deleted_by)
    VALUES (target_user_id, NOW(), CURRENT_USER());

    COMMIT;
    SELECT 'User data deleted successfully' AS message;
END//
DELIMITER ;

Gotcha alert: DELETE on large tables is a nightmare. Consider these alternatives:

-- Option 1: DROP partition (instant for old data)
ALTER TABLE user_events_partitioned DROP PARTITION p202101;

-- Option 2: TRUNCATE archive table (when you're done with it)
TRUNCATE TABLE user_events_archive_2021;

-- Option 3: pt-archiver (Percona Toolkit - safest for live systems)
-- pt-archiver --source h=localhost,D=mydb,t=user_events \
--              --where "created_at < DATE_SUB(NOW(), INTERVAL 3 YEAR)" \
--              --limit 1000 --commit-each --purge

Automation: Because I Need to Sleep Sometimes 😴

After implementing this manually for a month, I automated everything. Here's the cron job that runs my lifecycle:

# /etc/cron.d/mysql-lifecycle
# Daily archival at 2 AM
0 2 * * * mysql_user /usr/local/bin/archive_data.sh >> /var/log/mysql-archival.log 2>&1

# Monthly partition rotation
0 3 1 * * mysql_user /usr/local/bin/rotate_partitions.sh >> /var/log/mysql-partitions.log 2>&1
And here's archive_data.sh itself:
#!/bin/bash
# archive_data.sh
# Because automation is self-care

MYSQL_USER="archiver"
MYSQL_PASS="your_secure_password"   # better: use mysql_config_editor / ~/.my.cnf so the password isn't in the script or process list
DB_NAME="your_database"

# Archive data older than 1 year
mysql -u$MYSQL_USER -p$MYSQL_PASS $DB_NAME <<EOF
CALL archive_old_events(5000, 365);
EOF

# Reclaim space after archival
# (on InnoDB, OPTIMIZE TABLE rebuilds the table, so expect extra disk usage and I/O while it runs)
mysql -u$MYSQL_USER -p$MYSQL_PASS $DB_NAME <<EOF
OPTIMIZE TABLE user_events;
EOF

echo "Archival completed at $(date)"

Monitoring: Know Before You're Screwed 📊

I built a simple monitoring dashboard. Here's the query I run every hour:

-- Data lifecycle health check
SELECT 
    'Hot Data (< 7 days)' AS category,
    COUNT(*) AS row_count,
    ROUND(SUM(LENGTH(event_data)) / 1024 / 1024, 2) AS size_mb
FROM user_events
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)

UNION ALL

SELECT 
    'Warm Data (7-30 days)' AS category,
    COUNT(*) AS row_count,
    ROUND(SUM(LENGTH(event_data)) / 1024 / 1024, 2) AS size_mb
FROM user_events
WHERE created_at BETWEEN DATE_SUB(NOW(), INTERVAL 30 DAY) 
                     AND DATE_SUB(NOW(), INTERVAL 7 DAY)

UNION ALL

SELECT 
    'Cool Data (30-365 days)' AS category,
    COUNT(*) AS row_count,
    ROUND(SUM(LENGTH(event_data)) / 1024 / 1024, 2) AS size_mb
FROM user_events
WHERE created_at BETWEEN DATE_SUB(NOW(), INTERVAL 365 DAY) 
                     AND DATE_SUB(NOW(), INTERVAL 30 DAY)

UNION ALL

SELECT 
    'Should Be Archived (> 365 days)' AS category,
    COUNT(*) AS row_count,
    ROUND(SUM(LENGTH(event_data)) / 1024 / 1024, 2) AS size_mb
FROM user_events
WHERE created_at < DATE_SUB(NOW(), INTERVAL 365 DAY);

Alert when: "Should Be Archived" exceeds 10% of total data. That's my "clean your room" threshold.
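A minimal sketch of that check, assuming the same user_events table and the 365-day cutoff used above; point whatever alerting you already have at the result:

-- Percentage of live rows that are past the archival cutoff
SELECT ROUND(
    100 * SUM(created_at < DATE_SUB(NOW(), INTERVAL 365 DAY)) / COUNT(*),
    1
) AS pct_overdue_for_archival
FROM user_events;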

The Results: Did It Spark Joy? ✨

After implementing this lifecycle strategy:

  • Database size: 287 GB → 42 GB (85% reduction)
  • Query performance: Average query time reduced by 73%
  • Backup time: 4 hours → 35 minutes
  • My sleep quality: Significantly improved
  • 3 AM alerts: Dropped from weekly to "what's that weird notification sound?"

Before:                          After:
┌─────────────────┐              ┌──────────┐
│   287 GB        │              │  42 GB   │
│                 │              │          │
│  ██████████████ │              │  ███     │
│  ██████████████ │    ───────>  │          │
│  ██████████████ │              │          │
│  ██████████████ │              │          │
└─────────────────┘              └──────────┘
  Full, Slow, Sad                 Lean, Fast, Happy

Lessons from the Trenches 🎖️

  1. Start with partitioning. Future-you will send thank-you notes.

  2. Archive in small batches. Locking a 287 GB table is a resume-generating event.

  3. Test your archival strategy on a replica first. I cannot stress this enough.

  4. Automate, but monitor. Automation without monitoring is just automated disasters.

  5. Document your retention policy. Legal and DevOps will both thank you.

  6. Consider external storage. After a year, data can live in S3/GCS. MySQL doesn't need to be a long-term storage solution (see the export sketch after this list).
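For the external-storage route, one low-tech option is dumping cold rows to CSV and letting the OS ship the file to your bucket. A rough sketch, assuming secure_file_priv allows writing to /var/lib/mysql-files and that your account has the FILE privilege (paths and filenames are illustrative):

-- Export rows past the retention cutoff to a CSV for upload to S3/GCS
SELECT id, user_id, event_type, event_data, created_at
INTO OUTFILE '/var/lib/mysql-files/user_events_cold.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM user_events_archive
WHERE created_at < DATE_SUB(NOW(), INTERVAL 3 YEAR);

From there, something like aws s3 cp (or gsutil cp) moves the file out, and the exported rows can be dropped from MySQL.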

The Hard Questions You Should Ask 💭

Before implementing your lifecycle strategy:

  • What's your actual data access pattern? Don't guess. Query your slow query log (see the sketch after this list).

  • What's your legal retention requirement? GDPR, HIPAA, or your industry regulations might dictate this.

  • What's the cost of storage vs. the cost of a 3 AM incident? Sometimes keeping data is cheaper than the engineering time to manage it perfectly.

  • Can you query archived data separately? Design your application to handle "recent" vs. "historical" data differently.
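If you log slow queries to a table (slow_query_log=ON, log_output='TABLE'), the worst offenders are one query away. A sketch, assuming table-based logging:

-- Slowest recent statements from the built-in slow log table
SELECT start_time, query_time, rows_examined, LEFT(sql_text, 80) AS query_snippet
FROM mysql.slow_log
ORDER BY query_time DESC
LIMIT 20;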

Going Further 🚀

This is just the beginning. Advanced topics to explore:

  • TiDB or Vitess for automatic sharding of massive datasets
  • ClickHouse or TimescaleDB for time-series data that MySQL struggles with
  • Change Data Capture (CDC) to stream data to data lakes in real-time
  • MySQL 8.0 Instant DDL for faster schema changes on large tables

The Bottom Line

Your database isn't a hoarding closet. It's a workshop. Keep what you use. Archive what you might need. Delete what you don't.

And remember: if your data doesn't spark joy (or profit, or compliance), thank it and let it go.
