Our AI proposal generator was hemorrhaging money. In month two of operation, our OpenAI bill hit $3,200 while generating only $1,800 in revenue. Our gross margin was negative 78%.
Something had to change, fast.
Six months later, we're processing 10x the volume at 27% of the original cost per request. Our margins are now healthy (+62%), response quality improved (4.3/5 → 4.6/5), and we've learned some expensive lessons about production AI.
This is the story of how we optimized our AI features without sacrificing quality—and the specific technical strategies you can use to do the same.
The Naive Implementation: When AI Looks Easy
Here's what our original AI proposal generator looked like:
// ❌ The expensive, naive approach
import OpenAI from 'openai';

async function generateProposal(request: ProposalRequest): Promise<string> {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const completion = await client.chat.completions.create({
    model: "gpt-4-turbo-preview", // Expensive!
    messages: [
      {
        role: "system",
        content: `You are an expert tender response writer for South African government procurement...
[3,200 tokens of system prompt]`
      },
      {
        role: "user",
        content: `Tender: ${request.tenderTitle}
Description: ${request.tenderDescription}
Company: ${request.companyProfile}
Requirements: ${request.requirements}
[Often 2,000+ tokens]`
      }
    ],
    temperature: 0.7,
    max_tokens: 2000, // Expensive!
  });

  return completion.choices[0].message.content;
}
This looks clean and simple. Ship it, right?
The costs (per request):
- Input tokens: ~5,200 tokens (system prompt + user input)
- Output tokens: ~2,000 tokens (response)
- Model: GPT-4 Turbo ($0.01/1K input, $0.03/1K output)
- Cost per request: $0.112
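In code, that arithmetic is just token counts times the per-1K rates above (a rough estimator for illustration, not exact billing):

// Rough per-request cost at GPT-4 Turbo rates (USD per 1K tokens)
const INPUT_RATE = 0.01;
const OUTPUT_RATE = 0.03;

function estimateRequestCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * INPUT_RATE + (outputTokens / 1000) * OUTPUT_RATE;
}

estimateRequestCost(5200, 2000); // ≈ $0.112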
We were generating 200-300 proposals per day. Monthly cost at 250 proposals/day:
250 requests/day × 30 days × $0.112 = $840/month
Except that's the average calculation. In reality:
- Some users regenerated 3-4 times (multiply by 4)
- Some prompts were much longer (2-3x tokens)
- Peak usage hours hit rate limits (wasted requests)
Actual monthly cost: $3,200
For a feature that generated $1,800 in premium subscriptions. Oops.
Strategy 1: Aggressive Caching (Saved 40% Immediately)
The first low-hanging fruit: identical requests.
Analysis showed that 37% of our requests were semantically identical or near-identical. Why? Because users would:
- Generate proposal for tender X
- Regenerate because they didn't like the tone
- Generate again with slight wording changes
Each regeneration cost us money, but the tender specification was the same.
Implementation: Content-Addressed Caching
import { createHash } from 'crypto';
import { redis } from '@/lib/redis';

async function generateProposal(request: ProposalRequest): Promise<string> {
  // Generate cache key from request content
  const cacheKey = generateCacheKey(request);

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    console.log('Cache hit, saved $0.112');
    return cached;
  }

  // Generate new
  const result = await callOpenAI(request);

  // Cache for 7 days
  await redis.setex(cacheKey, 604800, result);

  return result;
}

function generateCacheKey(request: ProposalRequest): string {
  // Normalize to handle minor variations
  const normalized = {
    tenderTitle: request.tenderTitle.toLowerCase().trim(),
    description: request.tenderDescription.toLowerCase().trim(),
    type: request.documentType,
    // Don't include user-specific data
  };

  const content = JSON.stringify(normalized);
  return `proposal:${createHash('sha256').update(content).digest('hex')}`;
}
Impact:
- Cache hit rate: 42% (better than expected!)
- Average savings: 42% × $0.112 ≈ $0.047 per request (a cache hit costs essentially nothing)
- Monthly savings: ~$530
- New monthly cost: $2,670
We were no longer losing money, but still not profitable. We needed more optimization.
Strategy 2: Prompt Compression (Saved 25% More)
Our 3,200-token system prompt was absurdly long. Did we really need all that context?
We analyzed the impact of system prompt length on output quality:
// Testing framework
async function testPromptVariations() {
  const prompts = [
    { name: 'Full (3200 tokens)', content: fullPrompt },
    { name: 'Medium (1200 tokens)', content: mediumPrompt },
    { name: 'Minimal (400 tokens)', content: minimalPrompt },
  ];

  const testCases = loadTestCases(10); // Real tender requests

  for (const prompt of prompts) {
    for (const testCase of testCases) {
      const output = await generate(prompt.content, testCase);
      const quality = await evaluateQuality(output);

      console.log({
        prompt: prompt.name,
        quality: quality.score,
        cost: calculateCost(prompt.content, output),
      });
    }
  }
}
Results:
- Full prompt (3,200 tokens): Quality 4.3/5, Cost $0.112
- Medium prompt (1,200 tokens): Quality 4.2/5, Cost $0.079
- Minimal prompt (400 tokens): Quality 3.8/5, Cost $0.048
The medium prompt offered the best quality-cost tradeoff. We were willing to sacrifice 0.1 quality points for 29% cost reduction.
The Optimized Prompt
// ✅ Compressed but effective
const buildSystemPrompt = (documentType: string, industryContext: string) =>
  `Expert SA govt tender writer. Generate ${documentType}.
Rules:
- SBD/MBD compliant
- Professional tone
- Address requirements explicitly
- 1-2 pages max
- No placeholders
Context: ${industryContext}`;
From 3,200 tokens to 80-120 tokens (depending on document type). 27x compression!
Impact:
- Input cost: $0.052 → $0.012 (77% reduction)
- Quality: 4.3/5 → 4.2/5 (acceptable trade-off)
- Monthly savings: ~$450
- New monthly cost: $2,220
We were getting close to profitability, but not there yet.
Strategy 3: Model Routing (Saved Another 35%)
GPT-4 is powerful but expensive. Do we always need it?
We analyzed our requests by complexity:
interface RequestComplexity {
  documentType: 'cover-letter' | 'executive-summary' | 'capability-statement';
  tenderLength: number; // tokens
  requirementsCount: number;
  complexity: 'simple' | 'medium' | 'complex';
}

function assessComplexity(request: ProposalRequest): RequestComplexity {
  const tenderLength = estimateTokens(request.tenderDescription);
  const requirementsCount = request.requirements?.length || 0;

  let complexity: 'simple' | 'medium' | 'complex';
  if (request.documentType === 'cover-letter' && tenderLength < 500) {
    complexity = 'simple';
  } else if (requirementsCount > 10 || tenderLength > 2000) {
    complexity = 'complex';
  } else {
    complexity = 'medium';
  }

  return { documentType: request.documentType, tenderLength, requirementsCount, complexity };
}
Distribution:
- Simple: 45% (cover letters, short tenders)
- Medium: 40% (standard proposals)
- Complex: 15% (detailed technical responses)
We could route simple requests to cheaper models!
The Routing Logic
async function generateProposal(request: ProposalRequest): Promise<string> {
  const complexity = assessComplexity(request);

  let model: string;
  switch (complexity.complexity) {
    case 'simple':
      model = 'gpt-3.5-turbo'; // $0.0015/1K in, $0.002/1K out
      break;
    case 'medium':
      model = 'gpt-4-turbo-preview'; // $0.01/1K in, $0.03/1K out
      break;
    case 'complex':
      model = 'gpt-4-turbo-preview';
      break;
  }

  console.log(`Routing to ${model} for ${complexity.complexity} request`);
  return await callOpenAI(request, model);
}
Cost Comparison:
- Simple (GPT-3.5): $0.008 per request
- Medium (GPT-4): $0.079 per request
- Complex (GPT-4): $0.112 per request
Blended cost:
- 45% × $0.008 = $0.0036
- 40% × $0.079 = $0.0316
- 15% × $0.112 = $0.0168
- Average: $0.052 per request
Down from $0.112! That's a 53% reduction with minimal quality impact.
Impact:
- Monthly cost: $2,220 → $1,450
- Quality: Still 4.2/5 average
- Margin: Now profitable!
But we weren't done optimizing.
Strategy 4: Smart Regeneration (Prevent Waste)
Users regenerating proposals 3-4 times was killing our margins. Why were they regenerating?
We added feedback tracking:
interface RegenerationFeedback {
  reason: 'tone' | 'length' | 'content' | 'format' | 'other';
  previousVersion: string;
  userComment?: string;
}

async function regenerateProposal(
  request: ProposalRequest,
  feedback: RegenerationFeedback
): Promise<string> {
  // Don't just regenerate blindly; modify the prompt based on feedback
  const modifiedRequest = applyFeedback(request, feedback);
  return await generateProposal(modifiedRequest);
}
Regeneration reasons:
- Tone (too formal/informal): 38%
- Length (too long/short): 27%
- Content (missing points): 22%
- Format (structure issues): 13%
We could address these without regenerating!
Smart Edits Instead of Regeneration
async function improveProposal(
  request: ProposalRequest,
  original: string,
  feedback: RegenerationFeedback
): Promise<string> {
  // For simple changes, use cheaper edit operations
  if (feedback.reason === 'tone' || feedback.reason === 'length') {
    return await editProposal(original, feedback);
  }

  // Only regenerate for content issues
  return await regenerateProposal(request, feedback);
}

async function editProposal(
  original: string,
  feedback: RegenerationFeedback
): Promise<string> {
  const editPrompt = `Original proposal:\n${original}\n\nModify to: ${feedback.reason}`;

  // Use GPT-3.5 for edits (cheaper)
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: editPrompt }],
    max_tokens: 500, // Edits are shorter
  });

  return completion.choices[0].message.content;
}
Impact:
- 65% of regenerations now use edit mode (much cheaper)
- Edit cost: $0.004 vs $0.052 regeneration
- Saved regeneration costs: ~$180/month
- New monthly cost: $1,270
Now we were solidly profitable.
Strategy 5: Prompt Versioning and A/B Testing
We weren't satisfied with 4.2/5 quality. Could we improve quality without increasing costs?
We implemented prompt versioning with A/B testing:
interface PromptVersion {
  version: string;
  prompt: string;
  modelConfig: {
    temperature: number;
    topP?: number;
    frequencyPenalty?: number;
  };
  activePercentage: number; // For A/B testing
}

const PROMPT_VERSIONS: PromptVersion[] = [
  {
    version: 'v1-baseline',
    prompt: COMPRESSED_PROMPT,
    modelConfig: { temperature: 0.7 },
    activePercentage: 50,
  },
  {
    version: 'v2-structured',
    prompt: STRUCTURED_PROMPT_V2,
    modelConfig: { temperature: 0.6, frequencyPenalty: 0.3 },
    activePercentage: 50,
  },
];

async function generateProposal(request: ProposalRequest): Promise<string> {
  const promptVersion = selectPromptVersion();
  const result = await callOpenAI(request, promptVersion);

  // Track performance
  await trackGeneration({
    version: promptVersion.version,
    requestId: request.id,
    cost: calculateCost(result),
  });

  return result;
}

function selectPromptVersion(): PromptVersion {
  const rand = Math.random() * 100;
  let cumulative = 0;
  for (const version of PROMPT_VERSIONS) {
    cumulative += version.activePercentage;
    if (rand < cumulative) return version;
  }
  return PROMPT_VERSIONS[0];
}
After 2 weeks of A/B testing with 500+ generations:
// Results
{
  'v1-baseline': {
    avgQuality: 4.2,
    avgCost: 0.052, // USD per request
    userRating: 4.2,
  },
  'v2-structured': {
    avgQuality: 4.6,
    avgCost: 0.054, // USD per request
    userRating: 4.6,
  }
}
v2 was slightly more expensive (+4%) but significantly better quality (+9.5%). We made it the default.
Impact:
- Quality: 4.2/5 → 4.6/5
- Cost increase: +4% ($1,270 → $1,320)
- Worth it! Better product = more users = more revenue
The Final Architecture
Here's what our production system looks like now:
// ✅ Optimized production implementation
async function generateProposal(request: ProposalRequest): Promise<string> {
  // 1. Cache check
  const cacheKey = generateCacheKey(request);
  const cached = await redis.get(cacheKey);
  if (cached) {
    await analytics.track('cache_hit');
    return cached;
  }

  // 2. Assess complexity
  const complexity = assessComplexity(request);

  // 3. Select model
  const model = routeToModel(complexity);

  // 4. Get prompt version
  const promptVersion = selectPromptVersion();

  // 5. Generate
  const result = await callOpenAI({
    model,
    prompt: promptVersion.prompt,
    request,
    config: promptVersion.modelConfig,
  });

  // 6. Cache
  await redis.setex(cacheKey, 604800, result);

  // 7. Track
  await analytics.track('generation', {
    complexity,
    model,
    promptVersion: promptVersion.version,
    cost: calculateCost(result),
  });

  return result;
}
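The routeToModel helper in step 3 is just the Strategy 3 switch pulled out into its own function; a minimal version:

// Map assessed complexity to the cheapest model that handles it well
function routeToModel(complexity: RequestComplexity): string {
  switch (complexity.complexity) {
    case 'simple':
      return 'gpt-3.5-turbo';
    case 'medium':
    case 'complex':
    default:
      return 'gpt-4-turbo-preview';
  }
}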
The Results: By the Numbers
Month 2 (Before Optimization):
- Requests: 7,500
- Average cost per request: $0.112
- Total cost: $3,200
- Revenue: $1,800
- Margin: -78%
Month 8 (After Optimization):
- Requests: 78,000 (10x growth!)
- Average cost per request: $0.030
- Total cost: $2,340
- Revenue: $14,400 (8x growth)
- Margin: +84%
Cost Breakdown:
- Caching saved: 42%
- Prompt compression saved: 27%
- Model routing saved: 35%
- Smart regeneration saved: 12%
- Total reduction: 73%
(Yes, the individual figures sum to more than 100%: each saving applies to an already-reduced base, so they compound rather than add.)
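A quick sanity check on the compounding (the individual percentages are rounded, so the product lands near, not exactly on, the measured 73%):

// Each saving multiplies what's left after the previous one
const savings = [0.42, 0.27, 0.35, 0.12]; // caching, compression, routing, regeneration
const remaining = savings.reduce((left, s) => left * (1 - s), 1);
console.log(`Total reduction ≈ ${Math.round((1 - remaining) * 100)}%`); // ≈ 76%, in the ballpark of the measured 73%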
Key Lessons Learned
1. Cache Everything You Can
Our cache hit rate is 42%. That's 42% of requests costing us literally nothing (Redis is cheap).
Pro tip: Use semantic hashing for cache keys. Minor wording differences shouldn't break cache hits:
function semanticHash(text: string): string {
  // Normalize
  const normalized = text
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .trim();

  // Hash
  return createHash('sha256').update(normalized).digest('hex');
}
2. Test Model Performance Rigorously
Don't assume GPT-4 is always necessary. We found:
- GPT-3.5 works great for 45% of requests
- GPT-4 only needed for complex tasks
- User satisfaction barely changed
Create a test suite:
// tests/ai/model-comparison.test.ts
describe('Model Quality Comparison', () => {
  const testCases = loadRealTenderRequests(50);

  it('should compare GPT-3.5 vs GPT-4 on simple requests', async () => {
    for (const testCase of testCases.filter(isSimple)) {
      const gpt35 = await generate(testCase, 'gpt-3.5-turbo');
      const gpt4 = await generate(testCase, 'gpt-4-turbo');

      const quality35 = await evaluateQuality(gpt35);
      const quality4 = await evaluateQuality(gpt4);

      console.log({
        quality35,
        quality4,
        costDiff: calculateCost(gpt4) - calculateCost(gpt35),
        qualityDiff: quality4 - quality35,
      });
    }
  });
});
3. Prompt Engineering Matters More Than Model Selection
We got bigger wins from better prompts than from better models.
Bad prompt (high cost, mediocre quality):
You are an expert tender response writer with 20 years of experience in South African government procurement. You have deep knowledge of PFMA, MFMA, PPPFA regulations, and all relevant compliance requirements. You understand the nuances of different government departments and their evaluation criteria...
[3,000 more tokens of background]
Good prompt (low cost, high quality):
SA tender expert. Generate ${type}.
Must: SBD format, professional tone, 1-2 pages.
Context: ${industry}
The second prompt is 30x shorter and produces comparable quality.
4. User Feedback > Your Assumptions
We thought users wanted longer, more detailed responses. They actually wanted shorter, more scannable content.
We discovered this by asking—not assuming:
async function collectFeedback(generationId: string) {
  return await showUserSurvey({
    questions: [
      'How would you rate the quality?',
      'Was the length appropriate?',
      'Did it address your needs?',
    ],
  });
}
5. Measure Everything
You can't optimize what you don't measure. We track:
interface GenerationMetrics {
  requestId: string;
  model: string;
  promptVersion: string;
  inputTokens: number;
  outputTokens: number;
  cost: number;
  latency: number;
  cacheHit: boolean;
  complexity: string;
  userRating?: number;
  regenerationCount: number;
}

await analytics.track('generation', metrics);
This data drove every optimization decision.
What's Next: Future Optimizations
We're not done. Here's what we're working on:
1. Fine-Tuned Models
Instead of using general-purpose GPT models, we're fine-tuning on our domain:
// Fine-tuning on our successful generations
const trainingData = await db.generation.findMany({
  where: { userRating: { gte: 4 } },
  select: { input: true, output: true },
  take: 10000,
});

// prepareTrainingFile converts rows to JSONL, uploads it via openai.files.create,
// and returns the uploaded file's ID
const fineTuned = await openai.fineTuning.jobs.create({
  model: 'gpt-3.5-turbo',
  training_file: await prepareTrainingFile(trainingData),
});
Expected: 20-30% cost reduction with comparable quality.
2. Local Model for Simple Tasks
For the simplest tasks (cover letters for short tenders), we're experimenting with locally-hosted models:
- Mistral 7B for simple generation
- Zero API costs
- Higher latency (acceptable for async jobs; see the sketch below)
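A rough sketch of how this would slot in, assuming the local model sits behind an OpenAI-compatible server (e.g., Ollama or vLLM); the base URL and model name here are placeholders, not our production setup:

// Sketch: local Mistral 7B behind an OpenAI-compatible endpoint
import OpenAI from 'openai';

const localClient = new OpenAI({
  baseURL: 'http://localhost:11434/v1', // Ollama-style OpenAI-compatible endpoint (placeholder)
  apiKey: 'not-needed-locally',
});

async function generateSimpleCoverLetter(prompt: string): Promise<string> {
  const completion = await localClient.chat.completions.create({
    model: 'mistral', // whatever name the local server registers (placeholder)
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 800,
  });
  return completion.choices[0].message.content ?? '';
}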
3. Streaming Responses
Better UX and perceived performance:
async function* streamProposal(request: ProposalRequest) {
  const stream = await openai.chat.completions.create({
    model,
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    yield chunk.choices[0]?.delta?.content || '';
  }
}
4. Intelligent Pre-fetching
Predict likely regenerations and pre-generate alternatives:
async function generateWithVariations(request: ProposalRequest) {
  const [base, formal, casual] = await Promise.all([
    generate(request, { tone: 'balanced' }),
    generate(request, { tone: 'formal' }),
    generate(request, { tone: 'casual' }),
  ]);

  // Cache all versions
  // User can switch tone without regenerating
}
Tools and Resources
Here are the tools that made optimization possible:
Cost Tracking:
- OpenAI usage dashboard (built-in)
- Custom analytics with Mixpanel
- Real-time cost alerts via Slack (sketched below)
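The cost alerts are nothing fancy: a daily spend check that posts to a Slack incoming webhook. A minimal sketch, with the threshold and webhook env var as placeholders:

// Post to Slack when the day's spend crosses a threshold (values are illustrative)
const DAILY_COST_ALERT_USD = 100;

async function checkDailySpend(todaysSpendUsd: number): Promise<void> {
  if (todaysSpendUsd < DAILY_COST_ALERT_USD) return;

  await fetch(process.env.SLACK_COST_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `⚠️ AI spend today is $${todaysSpendUsd.toFixed(2)} (threshold $${DAILY_COST_ALERT_USD})`,
    }),
  });
}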
Quality Evaluation:
// Automated quality scoring: an LLM-as-judge pass over a fixed rubric
// (scoreCriterion is a placeholder for the rubric scorer, e.g. a cheap model call
// that returns a 1-5 rating for one criterion)
async function evaluateQuality(output: string): Promise<number> {
  const criteria = [
    'addresses_requirements',
    'professional_tone',
    'appropriate_length',
    'sbd_compliance',
  ];

  const scores = await Promise.all(
    criteria.map((criterion) => scoreCriterion(output, criterion))
  );
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
Prompt Management:
- Version control in Git
- Feature flags for A/B testing
- Automated performance tracking
Conclusion: AI Can Be Profitable
When we started, AI features seemed like a necessary loss leader. We'd subsidize them with other revenue and hope to break even eventually.
But with systematic optimization:
- 73% cost reduction
- 10x volume growth
- +84% margins
AI features can be profitable and high-quality.
The key is treating AI like any other infrastructure: measure, optimize, iterate.
Final cost comparison:
- Naive implementation: $0.112/request
- Optimized implementation: $0.030/request
- 73% reduction while improving quality from 4.3/5 to 4.6/5
If you're building AI features and struggling with costs, start with these strategies:
- Cache aggressively (42% savings for us)
- Compress your prompts (27% savings)
- Route to cheaper models when possible (35% savings)
- Prevent unnecessary regenerations (12% savings)
- A/B test everything (improved quality)
The math works. AI can be profitable.
Questions? Want to share your AI optimization strategies?
Drop them in the comments! I'm always learning better approaches.
Follow for more articles on:
- Building cost-effective AI features
- Production LLM best practices
- Scaling developer tools
- South African tech startup journey
Check out our platform: tenders-sa.org
Currently building Tenders SA - South Africa's AI-powered government tender platform. We generate 2,500+ AI proposals monthly at profitable margins.