DEV Community

contour

Reviving glyph-v8: STRIDE, a Deterministic Field-Aware Integer Analyzer

This is a submission for the GitHub Finish-Up-A-Thon Challenge.

GitHub “Finish-Up-A-Thon” Challenge Submission

What I Built

STRIDE is a deterministic, field-aware integer analysis engine revived from the abandoned glyph-v8 prototype.

Not a general compressor. A precision primitive that does one thing no existing tool does: profile binary protocol data field by field, build per-field entropy models, and identify exactly where compression gains are possible.

General compressors like zstd see a byte stream. STRIDE sees structure.

The Problem

Binary protocols move billions of messages daily — Protobuf, MessagePack, Thrift. Their integer fields are not random:

• Timestamps delta from the previous value
• Status codes are almost always 200
• IDs increment monotonically
• Enums repeat from a tiny set

zstd has no notion of field boundaries. It models the stream as undifferentiated bytes, so it cannot apply a per-field transform such as delta coding to a timestamp field. STRIDE knows where the fields are, and that changes what is compressible.
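To make the timestamp case concrete, here is a minimal sketch of what knowing a field boundary buys you. It is not STRIDE's code: the function names, the synthetic millisecond timestamps, and the choice of unsigned LEB128 varints as the size measure are all assumptions made for illustration.

```python
def varint_len(n: int) -> int:
    """Bytes needed to store a non-negative integer as an unsigned LEB128 varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical field: millisecond timestamps arriving roughly every 20 ms.
timestamps = [1_700_000_000_000 + 20 * i + (i % 3) for i in range(1000)]
deltas = delta_encode(timestamps)
assert delta_decode(deltas) == timestamps  # the transform is lossless

raw_bytes = sum(varint_len(t) for t in timestamps)
delta_bytes = sum(varint_len(d) for d in deltas)
print(f"raw varints: {raw_bytes} bytes, delta varints: {delta_bytes} bytes")
```

In this synthetic field every raw timestamp needs 6 varint bytes, while every delta after the first fits in 1. The saving comes entirely from knowing which bytes belong to the same field.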

Demo

Repository: github.com/yasha1971-coder/glyph-v8

Live benchmark: enwik8 (100,000,000 bytes, OVH EPYC server)

$ stride container-bytefreq enwik8.stridebin --top 5
Total bytes processed: 100,000,000
32 0x20 13,519,824 (13.52%) ← space dominates
101 0x65 8,001,205 (8.00%)
116 0x74 6,154,908 (6.15%)
97 0x61 5,712,026 (5.71%)
105 0x69 5,227,649 (5.23%)

$ stride container-hotspots enwik8.stridebin --top 3
Chunk 635 Entropy: 5.685 ← highest information density
Chunk 634 Entropy: 5.609
Chunk 636 Entropy: 5.534

$ stride container-headersketch enwik8.stridebin --size 8
Bucket 15: 0.574
Bucket 33: 0.663
Bucket 41: 0.605
Bucket 48: 0.660

Timing on 100MB corpus:

| Module | Time | Output |
| --- | --- | --- |
| ByteFreq | 1.97s | 256-byte histogram |
| Hotspots | 4.17s | Entropy map across 1,526 chunks |
| HeaderSketch | 4.40s | 64-slot structural profile |
| Fingerprint | 71.6s | 128 MinHash values (known: O(n·k) rolling hash) |
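As a rough picture of what a 128-value MinHash fingerprint involves, here is a self-contained sketch. It is not the `container-fingerprint` implementation: it hashes each k-byte shingle with BLAKE2b instead of a rolling hash, and the shingle length, the prime modulus, and the way the 128 hash functions are derived are all assumptions chosen to keep the example short and deterministic.

```python
import hashlib

NUM_HASHES = 128
PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus (an assumption)


def _h64(data: bytes) -> int:
    """Deterministic 64-bit hash of a byte string."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


# Derive 128 (a, b) pairs for h_j(x) = (a*x + b) mod PRIME without any RNG state.
PARAMS = [
    (_h64(b"a%d" % j) % (PRIME - 1) + 1, _h64(b"b%d" % j) % PRIME)
    for j in range(NUM_HASHES)
]


def minhash(data: bytes, k: int = 8) -> list[int]:
    """Return 128 minimum hash values over the set of distinct k-byte shingles."""
    signature = [PRIME] * NUM_HASHES
    seen = set()
    for i in range(len(data) - k + 1):
        shingle = data[i:i + k]
        if shingle in seen:
            continue  # a repeated shingle cannot lower any minimum
        seen.add(shingle)
        x = _h64(shingle) % PRIME
        for j, (a, b) in enumerate(PARAMS):
            h = (a * x + b) % PRIME
            if h < signature[j]:
                signature[j] = h
    return signature


def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots, an estimate of Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


text = b"the quick brown fox jumps over the lazy dog " * 4
print(similarity(minhash(text), minhash(text)))
```

The inner loop does 128 updates per distinct shingle, which is why a fingerprint pass costs far more than a single histogram pass over the same bytes.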

⚡ STRIDE vs zstd — I/O Performance

STRIDE is not a compressor — it's a deterministic container. Comparison is I/O throughput only.

| Operation | Tool | Time | Size |
| --- | --- | --- | --- |
| Encode | STRIDE | 0.173s | 96MB |
| Encode | zstd -1 | 0.240s | 39MB |
| Encode | zstd -9 | 2.146s | 31MB |
| Decode | STRIDE | 0.089s | 100MB |
| Decode | zstd -d | 0.125s | 100MB |

STRIDE encode: 28% less time than zstd -1 (about 1.39x the throughput)
STRIDE decode: 29% less time than zstd -d (about 1.40x the throughput)

Trade-off: STRIDE does not compress. Use zstd for compression. Use STRIDE for deterministic container I/O.

Proof with SHA256 verification: proof/enwik8_benchmark.txt

V1 benchmark proof: proof/v1_benchmark.txt

Before → After

Before (glyph-v8, 3 months abandoned):

• Experimental L0-index with minimizer indexing
• No documentation, no architecture, no clear purpose
• Code sitting unused on an OVH server
• hit_rate 87.6% on old version, 99.8% on new — but no one knew

After (STRIDE v0):

• Full CLI with 10 commands
• Deterministic corpus analysis on any binary data
• Real benchmark on enwik8 100MB with SHA256-verified proof
• stride/ package installable via pip install -e .
• Structured container format (STRIDE01 magic, chunked layout)
• Cross-platform: Linux + OVH EPYC verified
• GitHub Actions CI: tests pass on every push

Architecture

RAW CORPUS

STRIDE Container (.stridebin)
[MAGIC: STRIDE01][corpus_size][chunk_size][data...]

Analysis Layer:
container-bytefreq → byte frequency histogram
container-hotspots → entropy per chunk
container-fingerprint → 128-value MinHash
container-headersketch → 64-slot structural sketch

Model Output (model.json):
timestamp_field → Delta coding
status_field → Dictionary coding
id_field → Rice coding

STRIDE v1 ✅: container-write (575 MB/s) + container-decode (1,053 MB/s)
container-compare --fast → HeaderSketch similarity in 7s (vs 150s full mode)
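The container layout above is simple enough to sketch in a few lines. The version below is an illustration under stated assumptions, not the actual `.stridebin` writer: the post only gives the field order `[MAGIC: STRIDE01][corpus_size][chunk_size][data...]`, so the field widths (two little-endian unsigned 64-bit integers), the function names, and the 64KB default chunk size are guesses.

```python
import hashlib
import io
import struct

MAGIC = b"STRIDE01"
# Assumed header layout: 8-byte magic, then corpus_size and chunk_size as
# little-endian unsigned 64-bit integers. The real field widths may differ.
HEADER = struct.Struct("<8sQQ")


def write_container(raw: bytes, out, chunk_size: int = 64 * 1024) -> None:
    """Write the header followed by the unmodified corpus bytes."""
    out.write(HEADER.pack(MAGIC, len(raw), chunk_size))
    out.write(raw)


def read_container(src) -> list[bytes]:
    """Validate the header and return the payload split into chunks."""
    header = src.read(HEADER.size)
    if len(header) != HEADER.size:
        raise ValueError("truncated header")
    magic, corpus_size, chunk_size = HEADER.unpack(header)
    if magic != MAGIC:
        raise ValueError("not a STRIDE01 container")
    if chunk_size == 0:
        raise ValueError("chunk_size must be positive")
    data = src.read(corpus_size)
    if len(data) != corpus_size:
        raise ValueError("truncated payload")
    return [data[i:i + chunk_size] for i in range(0, corpus_size, chunk_size)]


# Round trip through an in-memory buffer, checked with SHA256.
raw = bytes(range(256)) * 1000
buf = io.BytesIO()
write_container(raw, buf)
buf.seek(0)
chunks = read_container(buf)
restored = b"".join(chunks)
assert hashlib.sha256(restored).digest() == hashlib.sha256(raw).digest()
print(len(chunks), "chunks,", len(restored), "bytes restored")
```

Because the payload is stored verbatim, encode and decode reduce to a header write plus a bulk copy, which fits the I/O-bound timings reported for the container.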

What Makes STRIDE Different

grep zstd Elasticsearch STRIDE
Field-aware
Per-field entropy model
Deterministic output
Schema-aware analysis partial
SHA256-verified proof

Honest Benchmark Status

STRIDE v0 is a corpus analyzer, not a codec. It does not yet produce compressed output.

STRIDE v1 shipped. Encoder: 575 MB/s. Decoder: 1,053 MB/s. Round-trip MD5-verified on enwik8 100MB.

Entropy Heatmap

Red = high entropy (hard to compress) | Yellow = moderate | Each cell = 64KB chunk of enwik8

Theoretical compression gains (6-8x vs zstd on integer-heavy data) are derived from the entropy models STRIDE builds — not from measured compression results.

This is intentional. STRIDE v0 establishes the measurement foundation. STRIDE v1 builds on it.
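To show how a theoretical figure can come out of an entropy model without compressing anything, here is a small worked sketch. The status-code distribution is invented for illustration, the function name is hypothetical, and the comparison is against a fixed-width 32-bit encoding, not against zstd, so it says nothing about the 6-8x estimate itself.

```python
import math
from collections import Counter


def entropy_bits_per_value(values) -> float:
    """Empirical Shannon entropy of a field, in bits per value."""
    n = len(values)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


# Hypothetical status field: 97% 200, 2% 404, 1% 500.
status = [200] * 970 + [404] * 20 + [500] * 10

bits = entropy_bits_per_value(status)
fixed_width_bits = 32  # assumed on-the-wire width of the field

print(f"entropy: {bits:.3f} bits/value")
print(f"lower bound for this field: {bits * len(status) / 8:.0f} bytes")
print(f"fixed-width size: {fixed_width_bits * len(status) // 8} bytes")
```

The entropy is a lower bound for a coder that treats each value independently. A real codec pays overhead on top of that bound, which is why numbers like these stay labelled as theoretical until an encoder produces measured output.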

How GitHub Copilot Helped

The original glyph-v8 was a pile of experimental scripts with no coherent design. Copilot helped:

• Reconstruct the project from scattered OVH files
• Design the StrideContainer format and reader
• Build the CLI dispatch architecture (argparse + subcommands)
• Implement all five analysis modules
• Write the benchmark pipeline with SHA256 verification
• Structure this submission

Without Copilot the gap between “abandoned prototype” and “installable system with proof” would have taken weeks. It took days.

Project Family

STRIDE is the third primitive in a deterministic systems family:

ACEAPEX — parallel LZ77 decode
9,903 MB/s on EPYC 9575F (64 cores). 2.5x faster than zstd. Merged into lzbench.

GLYPH — byte-exact substring retrieval
6,888x faster than grep on repeated queries. 1,138 organic git clones in 14 days with zero promotion.

STRIDE — field-aware integer analysis
Profiles binary protocol data. Builds per-field entropy models. Foundation for a codec that knows what zstd doesn’t.

Same philosophy across all three: deterministic, exact, measurable.

What’s Next

• Full benchmark suite vs zstd, LZ4, Brotli
• Protobuf schema-aware field extraction
• MessagePack and Thrift adapters
• Publish as standalone Python package on PyPI

Inspired by Perelman’s geometrization — the idea that complex structures simplify under the right flow. Every project in this family is an attempt to find that flow.

Top comments (1)

contour

Happy to answer questions about the STRIDE pipeline,
entropy analysis on binary corpora, or how I revived
this abandoned prototype with Copilot.

Also open to feedback on the architecture —
container format, chunking strategy, or the MinHash fingerprint.