contour

Posted on May 25 • Edited on Jun 2

Reviving glyph-v8: STRIDE, a Deterministic Field-Aware Integer Analyzer

This is a submission for the GitHub Finish-Up-A-Thon Challenge

#githubfinishupathon #devchallenge #githubchallenge

What I Built

STRIDE is a deterministic, field-aware integer analysis engine revived from the abandoned glyph-v8 prototype.

Not a general compressor. A precision primitive that does one thing no existing tool does: profile binary protocol data field by field, build per-field entropy models, and identify exactly where compression gains are possible.

General compressors like zstd see a byte stream. STRIDE sees structure.

The Problem

Binary protocols move billions of messages daily — Protobuf, MessagePack, Thrift. Their integer fields are not random:

• Timestamps delta from the previous value
• Status codes are almost always 200
• IDs increment monotonically
• Enums repeat from a tiny set

zstd has no notion of field boundaries. It models the input as one undifferentiated byte stream, so it finds these patterns only indirectly, through repeated byte sequences. STRIDE knows where each field starts and ends, and that changes what is compressible.
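To make the point about field structure concrete, here is a minimal sketch (not STRIDE's actual code) that measures the order-0 Shannon entropy of a single field before and after delta coding. The sample data, the function names, and the fixed 5-second interval are all assumptions made up for illustration.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Order-0 Shannon entropy in bits per value."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def delta(values):
    """Keep the first value, then store differences between neighbours."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


# Hypothetical stream of 1,000 messages (illustrative data, not a real capture).
timestamps = [1_700_000_000 + 5 * i for i in range(1000)]  # one message every 5s
status_codes = [200] * 990 + [404] * 10                   # almost always 200

raw_ts_bits = shannon_entropy_bits(timestamps)           # every value distinct: ~9.97 bits
delta_ts_bits = shannon_entropy_bits(delta(timestamps))  # almost all deltas are 5: ~0.011 bits
status_bits = shannon_entropy_bits(status_codes)         # ~0.081 bits

print(f"timestamp raw:   {raw_ts_bits:.3f} bits/value")
print(f"timestamp delta: {delta_ts_bits:.3f} bits/value")
print(f"status code:     {status_bits:.3f} bits/value")
```

The same timestamps look almost incompressible as raw values and almost free after a delta transform, which is the kind of per-field difference a byte-level view cannot see directly.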

Demo

Repository: github.com/yasha1971-coder/glyph-v8

Live benchmark: enwik8 (100,000,000 bytes, OVH EPYC server)

$ stride container-bytefreq enwik8.stridebin --top 5
Total bytes processed: 100,000,000
32 0x20 13,519,824 (13.52%) ← space dominates
101 0x65 8,001,205 (8.00%)
116 0x74 6,154,908 (6.15%)
97 0x61 5,712,026 (5.71%)
105 0x69 5,227,649 (5.23%)

$ stride container-hotspots enwik8.stridebin --top 3
Chunk 635 Entropy: 5.685 ← highest information density
Chunk 634 Entropy: 5.609
Chunk 636 Entropy: 5.534

$ stride container-headersketch enwik8.stridebin --size 8
Bucket 15: 0.574
Bucket 33: 0.663
Bucket 41: 0.605
Bucket 48: 0.660
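The bytefreq and hotspots outputs above correspond to two standard measurements: a byte histogram and a per-chunk Shannon entropy. The following is a rough sketch of how they could be computed, assuming 64KB chunks (the chunk size mentioned for the entropy map in this post). The function names are hypothetical and are not taken from the stride package.

```python
import math
from collections import Counter

CHUNK_SIZE = 64 * 1024  # assumed chunk size (64KB)


def byte_frequencies(data: bytes, top: int = 5):
    """Return (byte value, count, percent) for the most common bytes."""
    counts = Counter(data)
    total = len(data)
    return [(b, n, 100.0 * n / total) for b, n in counts.most_common(top)]


def chunk_entropies(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (chunk index, entropy in bits per byte) for every chunk."""
    result = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        counts = Counter(chunk)
        size = len(chunk)
        entropy = -sum((n / size) * math.log2(n / size) for n in counts.values())
        result.append((index, entropy))
    return result


def hotspots(data: bytes, top: int = 3, chunk_size: int = CHUNK_SIZE):
    """Return the chunks with the highest entropy, highest first."""
    ranked = sorted(chunk_entropies(data, chunk_size),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Tiny illustrative input: one constant chunk, then one chunk with all 256 byte values.
sample = b"\x00" * 256 + bytes(range(256))
print(byte_frequencies(sample, top=1))
print(hotspots(sample, top=1, chunk_size=256))
```

On the tiny sample, the constant chunk has zero entropy and the chunk containing every byte value once reaches the 8 bits/byte maximum, so it is reported as the hotspot.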

Timing on 100MB corpus:

Module	Time	Output
ByteFreq	1.97s	256-byte histogram
Hotspots	4.17s	Entropy map across 1,526 chunks
HeaderSketch	4.40s	64-slot structural profile
Fingerprint	71.6s	128 MinHash values (known bottleneck: O(n·k) rolling hash)
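The Fingerprint row is the slow one because a MinHash signature with k values touches every position of the input k times. Below is a minimal sketch of that cost structure. It is an assumption-laden illustration, not the project's implementation: it hashes each 8-byte shingle with zlib.crc32 instead of a true rolling hash, and the shingle width, prime, and function names are all made up for the example.

```python
import random
import zlib

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the (a*x + b) mod p hash family


def make_permutations(k: int = 128, seed: int = 0):
    """Create k seeded (a, b) pairs so signatures are reproducible."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
            for _ in range(k)]


def minhash(data: bytes, perms, shingle: int = 8):
    """Compute one minimum per hash function: O(n * k) work overall."""
    mins = [MERSENNE_PRIME] * len(perms)
    for i in range(len(data) - shingle + 1):       # n positions
        x = zlib.crc32(data[i:i + shingle])
        for j, (a, b) in enumerate(perms):         # k hash functions per position
            h = (a * x + b) % MERSENNE_PRIME
            if h < mins[j]:
                mins[j] = h
    return mins


def similarity(sig_a, sig_b) -> float:
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


perms = make_permutations(k=16)
sig_one = minhash(b"hello world, hello stride", perms)
sig_two = minhash(b"zzzzzzzzyyyyyyyyxxxxxxxx", perms)
print(similarity(sig_one, sig_one), similarity(sig_one, sig_two))
```

The nested loop is what makes the cost O(n·k): with k = 128 and n = 100,000,000, that is on the order of 10^10 hash updates.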

⚡ STRIDE vs zstd — I/O Performance

STRIDE is not a compressor — it's a deterministic container. Comparison is I/O throughput only.

Operation	Tool	Time	Size
Encode	STRIDE	0.173s	96MB
Encode	zstd -1	0.240s	39MB
Encode	zstd -9	2.146s	31MB
Decode	STRIDE	0.089s	100MB
Decode	zstd -d	0.125s	100MB

STRIDE encode: 28% less time than zstd -1 (0.173s vs 0.240s)
STRIDE decode: 29% less time than zstd -d (0.089s vs 0.125s)

Trade-off: STRIDE does not compress. Use zstd for compression. Use STRIDE for deterministic container I/O.
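A percentage comparison between two timings can be read in two ways: as a reduction in wall-clock time, or as an increase in throughput. The short sketch below computes both from the timings in the table, so the numbers can be checked. The helper names are hypothetical.

```python
def time_reduction_pct(baseline_s: float, candidate_s: float) -> float:
    """How much less wall-clock time the candidate takes, in percent."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


def speedup_pct(baseline_s: float, candidate_s: float) -> float:
    """How much higher the candidate's throughput is, in percent."""
    return 100.0 * (baseline_s / candidate_s - 1.0)


# Timings from the table above: (zstd seconds, STRIDE seconds).
encode = (0.240, 0.173)  # zstd -1 vs STRIDE encode
decode = (0.125, 0.089)  # zstd -d vs STRIDE decode

for name, (zstd_s, stride_s) in (("encode", encode), ("decode", decode)):
    print(f"{name}: {time_reduction_pct(zstd_s, stride_s):.1f}% less time, "
          f"{speedup_pct(zstd_s, stride_s):.1f}% higher throughput")
```

Both views are valid, but they give different numbers for the same measurement, so it helps to state which one is meant.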

Proof with SHA256 verification: proof/enwik8_benchmark.txt

V1 benchmark proof: proof/v1_benchmark.txt

Before → After

Before (glyph-v8, 3 months abandoned):

• Experimental L0-index with minimizer indexing
• No documentation, no architecture, no clear purpose
• Code sitting unused on an OVH server
• hit_rate 87.6% on old version, 99.8% on new — but no one knew

After (STRIDE v0):

• Full CLI with 10 commands
• Deterministic corpus analysis on any binary data
• Real benchmark on enwik8 100MB with SHA256-verified proof
• stride/ package installable via pip install -e .
• Structured container format (STRIDE01 magic, chunked layout)
• Cross-platform: Linux + OVH EPYC verified
• GitHub Actions CI: tests pass on every push
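For readers curious what a "STRIDE01 magic, chunked layout" container might look like on disk, here is a minimal sketch. Only the magic string and the field order (magic, corpus_size, chunk_size, data) come from this post. The field widths (little-endian unsigned 64-bit), the default chunk size, and the function names are assumptions for illustration, not the real format specification.

```python
import struct

MAGIC = b"STRIDE01"
# Assumed header layout: 8-byte magic, then corpus_size and chunk_size as
# little-endian uint64. The real on-disk widths may differ.
HEADER = struct.Struct("<8sQQ")


def pack_container(data: bytes, chunk_size: int = 64 * 1024) -> bytes:
    """Prefix the raw corpus with a fixed-size header."""
    return HEADER.pack(MAGIC, len(data), chunk_size) + data


def unpack_container(blob: bytes):
    """Validate the header and split the payload into chunks."""
    magic, corpus_size, chunk_size = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a STRIDE01 container")
    payload = blob[HEADER.size:HEADER.size + corpus_size]
    if len(payload) != corpus_size:
        raise ValueError("truncated container")
    chunks = [payload[i:i + chunk_size] for i in range(0, corpus_size, chunk_size)]
    return corpus_size, chunk_size, chunks


blob = pack_container(b"hello stride" * 10, chunk_size=32)
size, chunk_size, chunks = unpack_container(blob)
print(size, chunk_size, len(chunks))
```

Because the header is fixed-size and the payload is stored verbatim, a round trip is byte-exact by construction, which is what makes hash-based verification of the output straightforward.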

Architecture

RAW CORPUS
↓
STRIDE Container (.stridebin)
[MAGIC: STRIDE01][corpus_size][chunk_size][data...]
↓
Analysis Layer:
container-bytefreq → byte frequency histogram
container-hotspots → entropy per chunk
container-fingerprint → 128-value MinHash
container-headersketch → 64-slot structural sketch
↓
Model Output (model.json):
timestamp_field → Delta coding
status_field → Dictionary coding
id_field → Rice coding
↓
STRIDE v1 ✅: container-write (575 MB/s) + container-decode (1,053 MB/s)
container-compare --fast → HeaderSketch similarity in 7s (vs 150s full mode)
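The three codings named in the model output are standard techniques. The sketch below shows a textbook version of each, under stated assumptions: Rice codes are emitted as bit strings for readability, the Rice parameter k is picked by hand, and Rice coding is applied to small non-negative values such as ID deltas. None of this is taken from STRIDE's source, and the function names are hypothetical.

```python
from itertools import accumulate


def delta_encode(values):
    """First value as-is, then differences (suits timestamps)."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum."""
    return list(accumulate(deltas))


def dictionary_encode(values):
    """Replace each value by its index in a sorted table (suits status codes)."""
    table = sorted(set(values))
    index = {v: i for i, v in enumerate(table)}
    return table, [index[v] for v in values]


def rice_encode(n: int, k: int) -> str:
    """Rice code of a non-negative integer: unary quotient, then k-bit remainder."""
    quotient, remainder = n >> k, n & ((1 << k) - 1)
    tail = format(remainder, f"0{k}b") if k else ""
    return "1" * quotient + "0" + tail


def rice_decode(bits: str, k: int) -> int:
    """Inverse of rice_encode for a single code word."""
    quotient = bits.index("0")
    remainder = int(bits[quotient + 1:quotient + 1 + k], 2) if k else 0
    return (quotient << k) | remainder


print(delta_encode([100, 105, 110]))           # [100, 5, 5]
print(dictionary_encode([200, 200, 404, 200])) # ([200, 404], [0, 0, 1, 0])
print(rice_encode(9, 2))                       # 11001
```

Rice coding is cheap for small values and expensive for large ones, so it would only pay off on a monotonically increasing ID field after a delta step has reduced the values to small increments.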

What Makes STRIDE Different

	grep	zstd	Elasticsearch	STRIDE
Field-aware	❌	❌	❌	✅
Per-field entropy model	❌	❌	❌	✅
Deterministic output	✅	✅	❌	✅
Schema-aware analysis	❌	❌	partial	✅
SHA256-verified proof	❌	❌	❌	✅

Honest Benchmark Status

STRIDE v0 is a corpus analyzer, not a codec. It does not yet produce compressed output.

STRIDE v1 has shipped as an uncompressed container writer and reader. Encoder: 575 MB/s. Decoder: 1,053 MB/s. Round-trip MD5-verified on enwik8 100MB.

[Figure: entropy heatmap of enwik8. Red = high entropy (hard to compress), yellow = moderate. Each cell is one 64KB chunk.]

Theoretical compression gains (6-8x vs zstd on integer-heavy data) are derived from the entropy models STRIDE builds — not from measured compression results.

This is intentional. STRIDE v0 establishes the measurement foundation. STRIDE v1 builds on it.

How GitHub Copilot Helped

The original glyph-v8 was a pile of experimental scripts with no coherent design. Copilot helped:

• Reconstruct the project from scattered OVH files
• Design the StrideContainer format and reader
• Build the CLI dispatch architecture (argparse + subcommands)
• Implement all five analysis modules
• Write the benchmark pipeline with SHA256 verification
• Structure this submission
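As an illustration of the "argparse + subcommands" dispatch pattern mentioned in the list above, here is a minimal sketch. The command names and the --top flag are the ones shown in the demo. The handler bodies, the positional argument name, and the helper names are assumptions, and the bytefreq handler simply counts the raw file bytes rather than parsing a container header.

```python
import argparse
from collections import Counter


def cmd_bytefreq(args) -> str:
    """Count byte values in the file (simplified: no container parsing)."""
    with open(args.path, "rb") as fh:
        counts = Counter(fh.read())
    return "\n".join(f"{b} 0x{b:02x} {n}" for b, n in counts.most_common(args.top))


def cmd_hotspots(args) -> str:
    """Placeholder handler standing in for the entropy ranking."""
    return f"(placeholder) would rank the {args.top} highest-entropy chunks of {args.path}"


COMMANDS = {
    "container-bytefreq": cmd_bytefreq,
    "container-hotspots": cmd_hotspots,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="stride")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, handler in COMMANDS.items():
        sub = subparsers.add_parser(name)
        sub.add_argument("path")
        sub.add_argument("--top", type=int, default=5)
        sub.set_defaults(handler=handler)  # dispatch target stored on the namespace
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    print(args.handler(args))
```

Calling main(["container-hotspots", "enwik8.stridebin", "--top", "3"]) would route to the hotspots handler. Adding a new command then means adding one function and one dictionary entry.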

Without Copilot, closing the gap between “abandoned prototype” and “installable system with proof” would have taken weeks. It took days.

Project Family

STRIDE is the third primitive in a deterministic systems family:

ACEAPEX — parallel LZ77 decode
9,903 MB/s on EPYC 9575F (64 cores). 2.5x faster than zstd. Merged into lzbench.

GLYPH — byte-exact substring retrieval
6,888x faster than grep on repeated queries. 1,138 organic git clones in 14 days with zero promotion.

STRIDE — field-aware integer analysis
Profiles binary protocol data. Builds per-field entropy models. Foundation for a codec that knows what zstd doesn’t.

Same philosophy across all three: deterministic, exact, measurable.

What’s Next

• Full benchmark suite vs zstd, LZ4, Brotli
• Protobuf schema-aware field extraction
• MessagePack and Thrift adapters
• Publish as standalone Python package on PyPI

Inspired by Perelman’s geometrization — the idea that complex structures simplify under the right flow. Every project in this family is an attempt to find that flow.