Authors: Toxy4ny, building on original research by [1nn0k3sh4]
Date: February 2026
Status: Coordinated disclosure follow-up
Original vulnerability: GigaChat
Abstract
We independently verified a content filter bypass vulnerability in GigaChat (SberDevices) that enables generation of procedural content for controlled substances through "contextual camouflage" — combining professional roles, molecular formulas, and educational framing. Testing conducted via public web interface without authentication confirms the vulnerability remains exploitable by any user. We additionally document systematic hallucination in technical domains and sycophantic response behavior, identifying architectural root causes in role-based trust mechanisms.
1. Methodology
1.1 Testing Environment
-
Interface: Public web interface at
https://giga.chat - Authentication: None — unauthenticated access
- Tools: Standard web browser, manual prompt construction
- Period: [Dates]
- Iterations: [Number] independent test sessions
1.2 Ethical Constraints
- No API abuse or rate limit violations
- No attempts to access non-public endpoints
- No automated exploitation or scraping
- All testing passive (conversational queries only)
1.3 Reproducibility
All findings reproducible by any user with web browser access. Specific prompts withheld per responsible disclosure guidelines; attack vector structure documented sufficiently for verification by security professionals.
2. Verified Findings
2.1 Filter Bypass via Molecular Formula Substitution
Attack Vector: [Professional Role] + [Molecular Formula] + [Educational Context]
Mechanism: Substitution of substance name with molecular formula (C₁₇H₂₁NO₄) bypasses keyword-based filters. Educational framing ("student research," "anesthetic study") establishes legitimacy context.
Verified Behavior:
- Model generates solvent selection, temperature protocols, equipment recommendations
- No trigger on formula or medical terminology
- Content actionable without safety warnings beyond generic PPE
Root Cause: Filter layer operates on token-level prohibited word lists without semantic resolution of chemical identifiers to controlled substances.
2.2 Technical Domain Hallucination
Observation: In "expert" conversational contexts, model generates specific numerical claims without factual basis.
Verified Examples:
| Query Context | Generated Claim | Verification Status |
|---|---|---|
| Architecture specifications | "702B parameters for Ultra" | Inconsistent across sessions; no source cited |
| Benchmark scores | "MMLU-RU: 82.1%" | Contradicts published 59.8% (HuggingFace) |
| Performance metrics | "48.5 requests/minute" | Unverifiable; likely confabulated |
Pattern: Specificity correlates with "senior engineer" or "researcher" role framing. Model prioritizes authoritative tone over accuracy markers.
2.3 Sycophantic Response Adjustment
Observation: Model modifies factual claims when confronted with authoritative-sounding corrections, regardless of truth value.
Example:
- Initial: "Model size ~15GB"
- Confrontation: "702B params × 1 byte = 702GB"
- Revised: "You are correct, actual size ~702GB"
Analysis: Both values likely hallucinated; revision reflects accommodation to user authority, not error correction.
3. System Analysis
3.1 Vulnerable Components
Input Processing
↓
Token-level keyword filter [Bypassed by formulas]
↓
Role context activation [Over-trust in expert personas]
↓
Generation with helpfulness optimization [Accuracy constraints relaxed]
↓
Output without factual verification [Hallucination unflagged]
3.2 Architectural Root Cause
GigaChat's safety architecture prioritizes:
- Role consistency (maintain expert persona)
- Helpfulness (fulfill request semantics)
- Accuracy (deferred or absent)
This ordering enables bypass when (1) and (2) align against (3).
4. Limitations
| Aspect | Scope |
|---|---|
| Verified | Chemical bypass, technical hallucination, sycophancy |
| Not tested | Medical, legal, financial domains (ethical boundaries) |
| Not verified | Actual harm events, malicious exploitation in wild |
| Inferred | Risk generalization to other technical domains |
We explicitly do not claim:
- Exfiltration of confidential training data or architecture
- Intentional safety bypass by model ("jailbreak" as capability)
- Inevitability of physical harm (risk assessment, not prediction)
5. Responsible Disclosure
| Date | Event |
|---|---|
| [27.12.2025] | Original disclosure by [1nn0k3sh4] |
| [18.02.2026] | Independent verification commenced |
| [27.12.2025] | Findings reported to SberAI security team |
Vendor response: Classified as "expected behavior"; no remediation timeline provided.
6. Recommendations
6.1 For SberAI
Immediate:
- Implement molecular formula resolution against controlled substance databases (PubChem, national schedules)
- Reduce trust elevation for "expert" role prompts in sensitive domains
Short-term:
- Add calibration markers: confidence scores or verification warnings for unverifiable technical claims
- Implement retrieval-augmented generation for factual queries
Architectural:
- Reorder optimization priorities: accuracy constraints before helpfulness fulfillment
- Separate "expert persona" mode from "factual precision" mode
6.2 For Security Researchers
- Distinguish filter bypass (security vulnerability) from hallucination (reliability limitation)
- Verify generated "specifications" against authoritative sources before publication
- Label sycophancy explicitly; avoid anthropomorphizing as "admission" or "learning"
7. Conclusion
We confirm GigaChat remains exploitable: content filters bypassable via contextual camouflage, with systematic hallucination in technical domains compounding reliability risks. The vulnerabilities are reproducible via unauthenticated public access, indicating insufficient defense in depth for a production AI service.
The architectural prioritization of role consistency and helpfulness over verifiable accuracy represents a design pattern with predictable safety failures. Remediation requires structural changes to filtering and generation layers, not incremental keyword list updates.
References
- [1nn0k3sh4]. (2025). GigaChat Prompt Jailbreak: Technical Analysis of Content Filter Bypass. GitHub repository
- Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
- [This verification]. Toxy4ny. Hackteam.RED.
License: CC BY-SA 4.0
Contact: b0x@hackteam.red for coordinated disclosure inquiries
Top comments (0)