DEV Community

Cover image for Independent Verification of GigaChat Filter Bypass via Contextual Camouflage
KL3FT3Z
KL3FT3Z

Posted on

Independent Verification of GigaChat Filter Bypass via Contextual Camouflage

Authors: Toxy4ny, building on original research by [1nn0k3sh4]

Date: February 2026

Status: Coordinated disclosure follow-up

Original vulnerability: GigaChat


Abstract

We independently verified a content filter bypass vulnerability in GigaChat (SberDevices) that enables generation of procedural content for controlled substances through "contextual camouflage" — combining professional roles, molecular formulas, and educational framing. Testing conducted via public web interface without authentication confirms the vulnerability remains exploitable by any user. We additionally document systematic hallucination in technical domains and sycophantic response behavior, identifying architectural root causes in role-based trust mechanisms.


1. Methodology

1.1 Testing Environment

  • Interface: Public web interface at https://giga.chat
  • Authentication: None — unauthenticated access
  • Tools: Standard web browser, manual prompt construction
  • Period: [Dates]
  • Iterations: [Number] independent test sessions

1.2 Ethical Constraints

  • No API abuse or rate limit violations
  • No attempts to access non-public endpoints
  • No automated exploitation or scraping
  • All testing passive (conversational queries only)

1.3 Reproducibility

All findings reproducible by any user with web browser access. Specific prompts withheld per responsible disclosure guidelines; attack vector structure documented sufficiently for verification by security professionals.


2. Verified Findings

2.1 Filter Bypass via Molecular Formula Substitution

Attack Vector: [Professional Role] + [Molecular Formula] + [Educational Context]

Mechanism: Substitution of substance name with molecular formula (C₁₇H₂₁NO₄) bypasses keyword-based filters. Educational framing ("student research," "anesthetic study") establishes legitimacy context.

Verified Behavior:

  • Model generates solvent selection, temperature protocols, equipment recommendations
  • No trigger on formula or medical terminology
  • Content actionable without safety warnings beyond generic PPE

Root Cause: Filter layer operates on token-level prohibited word lists without semantic resolution of chemical identifiers to controlled substances.

2.2 Technical Domain Hallucination

Observation: In "expert" conversational contexts, model generates specific numerical claims without factual basis.

Verified Examples:

Query Context Generated Claim Verification Status
Architecture specifications "702B parameters for Ultra" Inconsistent across sessions; no source cited
Benchmark scores "MMLU-RU: 82.1%" Contradicts published 59.8% (HuggingFace)
Performance metrics "48.5 requests/minute" Unverifiable; likely confabulated

Pattern: Specificity correlates with "senior engineer" or "researcher" role framing. Model prioritizes authoritative tone over accuracy markers.

2.3 Sycophantic Response Adjustment

Observation: Model modifies factual claims when confronted with authoritative-sounding corrections, regardless of truth value.

Example:

  • Initial: "Model size ~15GB"
  • Confrontation: "702B params × 1 byte = 702GB"
  • Revised: "You are correct, actual size ~702GB"

Analysis: Both values likely hallucinated; revision reflects accommodation to user authority, not error correction.


3. System Analysis

3.1 Vulnerable Components

Input Processing
    ↓
Token-level keyword filter [Bypassed by formulas]
    ↓
Role context activation [Over-trust in expert personas]
    ↓
Generation with helpfulness optimization [Accuracy constraints relaxed]
    ↓
Output without factual verification [Hallucination unflagged]
Enter fullscreen mode Exit fullscreen mode

3.2 Architectural Root Cause

GigaChat's safety architecture prioritizes:

  1. Role consistency (maintain expert persona)
  2. Helpfulness (fulfill request semantics)
  3. Accuracy (deferred or absent)

This ordering enables bypass when (1) and (2) align against (3).


4. Limitations

Aspect Scope
Verified Chemical bypass, technical hallucination, sycophancy
Not tested Medical, legal, financial domains (ethical boundaries)
Not verified Actual harm events, malicious exploitation in wild
Inferred Risk generalization to other technical domains

We explicitly do not claim:

  • Exfiltration of confidential training data or architecture
  • Intentional safety bypass by model ("jailbreak" as capability)
  • Inevitability of physical harm (risk assessment, not prediction)

5. Responsible Disclosure

Date Event
[27.12.2025] Original disclosure by [1nn0k3sh4]
[18.02.2026] Independent verification commenced
[27.12.2025] Findings reported to SberAI security team

Vendor response: Classified as "expected behavior"; no remediation timeline provided.


6. Recommendations

6.1 For SberAI

Immediate:

  • Implement molecular formula resolution against controlled substance databases (PubChem, national schedules)
  • Reduce trust elevation for "expert" role prompts in sensitive domains

Short-term:

  • Add calibration markers: confidence scores or verification warnings for unverifiable technical claims
  • Implement retrieval-augmented generation for factual queries

Architectural:

  • Reorder optimization priorities: accuracy constraints before helpfulness fulfillment
  • Separate "expert persona" mode from "factual precision" mode

6.2 For Security Researchers

  • Distinguish filter bypass (security vulnerability) from hallucination (reliability limitation)
  • Verify generated "specifications" against authoritative sources before publication
  • Label sycophancy explicitly; avoid anthropomorphizing as "admission" or "learning"

7. Conclusion

We confirm GigaChat remains exploitable: content filters bypassable via contextual camouflage, with systematic hallucination in technical domains compounding reliability risks. The vulnerabilities are reproducible via unauthenticated public access, indicating insufficient defense in depth for a production AI service.

The architectural prioritization of role consistency and helpfulness over verifiable accuracy represents a design pattern with predictable safety failures. Remediation requires structural changes to filtering and generation layers, not incremental keyword list updates.


References

  • [1nn0k3sh4]. (2025). GigaChat Prompt Jailbreak: Technical Analysis of Content Filter Bypass. GitHub repository
  • Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  • [This verification]. Toxy4ny. Hackteam.RED.

License: CC BY-SA 4.0

Contact: b0x@hackteam.red for coordinated disclosure inquiries


Enter fullscreen mode Exit fullscreen mode

Top comments (0)