What I Built
STRIDE is a deterministic, field-aware integer analysis engine revived from the abandoned glyph-v8 prototype.
Not a general compressor. A precision primitive that does one thing no existing tool does: profile binary protocol data field by field, build per-field entropy models, and identify exactly where compression gains are possible.
General compressors like zstd see a byte stream. STRIDE sees structure.
The Problem
Binary protocols move billions of messages daily — Protobuf, MessagePack, Thrift. Their integer fields are not random:
• Timestamps delta from the previous value
• Status codes are almost always 200
• IDs increment monotonically
• Enums repeat from a tiny set
zstd doesn’t know this. It compresses the whole stream as if every byte were unpredictable. STRIDE knows field boundaries — and that changes everything about what’s compressible.
Demo
Repository: github.com/yasha1971-coder/glyph-v8
Live benchmark: enwik8 (100,000,000 bytes, OVH EPYC server)
$ stride container-bytefreq enwik8.stridebin --top 5
Total bytes processed: 100,000,000
32 0x20 13,519,824 (13.52%) ← space dominates
101 0x65 8,001,205 (8.00%)
116 0x74 6,154,908 (6.15%)
97 0x61 5,712,026 (5.71%)
105 0x69 5,227,649 (5.23%)
$ stride container-hotspots enwik8.stridebin --top 3
Chunk 635 Entropy: 5.685 ← highest information density
Chunk 634 Entropy: 5.609
Chunk 636 Entropy: 5.534
$ stride container-headersketch enwik8.stridebin --size 8
Bucket 15: 0.574
Bucket 33: 0.663
Bucket 41: 0.605
Bucket 48: 0.660
Timing on 100MB corpus:
| Module | Time | Output |
|---|---|---|
| ByteFreq | 1.97s | 256-byte histogram |
| Hotspots | 4.17s | Entropy map across 1,526 chunks |
| HeaderSketch | 4.40s | 64-slot structural profile |
| Fingerprint | 71.6s | 128 MinHash values (known: O(n·k) rolling hash) |
⚡ STRIDE vs zstd — I/O Performance
STRIDE is not a compressor — it's a deterministic container. Comparison is I/O throughput only.
| Operation | Tool | Time | Size |
|---|---|---|---|
| Encode | STRIDE | 0.173s | 96MB |
| Encode | zstd -1 | 0.240s | 39MB |
| Encode | zstd -9 | 2.146s | 31MB |
| Decode | STRIDE | 0.089s | 100MB |
| Decode | zstd -d | 0.125s | 100MB |
STRIDE encode: 28% faster than zstd -1
STRIDE decode: 40% faster than zstd -d
Trade-off: STRIDE does not compress. Use zstd for compression. Use STRIDE for deterministic container I/O.
Proof with SHA256 verification: proof/enwik8_benchmark.txt
V1 benchmark proof: proof/v1_benchmark.txt
Before → After
Before (glyph-v8, 3 months abandoned):
• Experimental L0-index with minimizer indexing
• No documentation, no architecture, no clear purpose
• Code sitting unused on an OVH server
• hit_rate 87.6% on old version, 99.8% on new — but no one knew
After (STRIDE v0):
• Full CLI with 10 commands
• Deterministic corpus analysis on any binary data
• Real benchmark on enwik8 100MB with SHA256-verified proof
• stride/ package installable via pip install -e .
• Structured container format (STRIDE01 magic, chunked layout)
• Cross-platform: Linux + OVH EPYC verified
• GitHub Actions CI — tests pass on every push
Architecture
RAW CORPUS
↓
STRIDE Container (.stridebin)
[MAGIC: STRIDE01][corpus_size][chunk_size][data...]
↓
Analysis Layer:
container-bytefreq → byte frequency histogram
container-hotspots → entropy per chunk
container-fingerprint → 128-value MinHash
container-headersketch → 64-slot structural sketch
↓
Model Output (model.json):
timestamp_field → Delta coding
status_field → Dictionary coding
id_field → Rice coding
↓
STRIDE v1 ✅: container-write (575 MB/s) + container-decode (1,053 MB/s)
container-compare --fast → HeaderSketch similarity in 7s (vs 150s full mode)
What Makes STRIDE Different
| grep | zstd | Elasticsearch | STRIDE | |
|---|---|---|---|---|
| Field-aware | ❌ | ❌ | ❌ | ✅ |
| Per-field entropy model | ❌ | ❌ | ❌ | ✅ |
| Deterministic output | ✅ | ✅ | ❌ | ✅ |
| Schema-aware analysis | ❌ | ❌ | partial | ✅ |
| SHA256-verified proof | ❌ | ❌ | ❌ | ✅ |
Honest Benchmark Status
STRIDE v0 is a corpus analyzer, not a codec. It does not yet produce compressed output.
STRIDE v1 shipped. Encoder: 575 MB/s. Decoder: 1,053 MB/s. Round-trip MD5-verified on enwik8 100MB.
Red = high entropy (hard to compress) | Yellow = moderate | Each cell = 64KB chunk of enwik8
Theoretical compression gains (6-8x vs zstd on integer-heavy data) are derived from the entropy models STRIDE builds — not from measured compression results.
This is intentional. STRIDE v0 establishes the measurement foundation. STRIDE v1 builds on it.
How GitHub Copilot Helped
The original glyph-v8 was a pile of experimental scripts with no coherent design. Copilot helped:
• Reconstruct the project from scattered OVH files
• Design the StrideContainer format and reader
• Build the CLI dispatch architecture (argparse + subcommands)
• Implement all five analysis modules
• Write the benchmark pipeline with SHA256 verification
• Structure this submission
Without Copilot the gap between “abandoned prototype” and “installable system with proof” would have taken weeks. It took days.
Project Family
STRIDE is the third primitive in a deterministic systems family:
ACEAPEX — parallel LZ77 decode
9,903 MB/s on EPYC 9575F (64 cores). 2.5x faster than zstd. Merged into lzbench.
GLYPH — byte-exact substring retrieval
6,888x faster than grep on repeated queries. 1,138 organic git clones in 14 days with zero promotion.
STRIDE — field-aware integer analysis
Profiles binary protocol data. Builds per-field entropy models. Foundation for a codec that knows what zstd doesn’t.
Same philosophy across all three: deterministic, exact, measurable.
What’s Next
• Full benchmark suite vs zstd, LZ4, Brotli
• Protobuf schema-aware field extraction
• MessagePack and Thrift adapters
• Publish as standalone Python package on PyPI
Inspired by Perelman’s geometrization — the idea that complex structures simplify under the right flow. Every project in this family is an attempt to find that flow.

Top comments (1)
Happy to answer questions about the STRIDE pipeline,
entropy analysis on binary corpora, or how I revived
this abandoned prototype with Copilot.
Also open to feedback on the architecture —
container format, chunking strategy, or the MinHash fingerprint.