mobius-crypt
How We Cut AI Costs by 73% While Improving Quality: Building Cost-Effective LLM Features

Our AI proposal generator was hemorrhaging money. In month two of operation, our OpenAI bill hit $3,200 while generating only $1,800 in revenue. Our gross margin was negative 78%.

Something had to change, fast.

Six months later, we're processing 10x the volume at 27% of the original cost per request. Our margins are now healthy (+62%), response quality improved (4.3/5 → 4.6/5), and we've learned some expensive lessons about production AI.

This is the story of how we optimized our AI features without sacrificing quality—and the specific technical strategies you can use to do the same.

The Naive Implementation: When AI Looks Easy

Here's what our original AI proposal generator looked like:

// ❌ The expensive, naive approach
async function generateProposal(request: ProposalRequest): Promise<string> {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const completion = await client.chat.completions.create({
    model: "gpt-4-turbo-preview", // Expensive!
    messages: [
      {
        role: "system",
        content: `You are an expert tender response writer for South African government procurement...
        [3,200 tokens of system prompt]`
      },
      {
        role: "user",
        content: `Tender: ${request.tenderTitle}
        Description: ${request.tenderDescription}
        Company: ${request.companyProfile}
        Requirements: ${request.requirements}
        [Often 2,000+ tokens]`
      }
    ],
    temperature: 0.7,
    max_tokens: 2000, // Expensive!
  });

  return completion.choices[0].message.content ?? '';
}

This looks clean and simple. Ship it, right?

The costs (per request):

  • Input tokens: ~5,200 tokens (system prompt + user input)
  • Output tokens: ~2,000 tokens (response)
  • Model: GPT-4 Turbo ($0.01/1K input, $0.03/1K output)
  • Cost per request: $0.112

We were generating 200-300 proposals per day. Monthly cost at 250 proposals/day:

250 requests/day × 30 days × $0.112 = $840/month

Except that's the average calculation. In reality:

  • Some users regenerated 3-4 times (multiply by 4)
  • Some prompts were much longer (2-3x tokens)
  • Peak usage hours hit rate limits (wasted requests)

Actual monthly cost: $3,200

For a feature that generated $1,800 in premium subscriptions. Oops.
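Several snippets later in this post call a calculateCost helper that isn't shown anywhere in full. Here is a minimal sketch of what ours roughly does, using the GPT-4 Turbo and GPT-3.5 Turbo rates quoted above; later snippets call it loosely with the generated output, so treat those calls as schematic (the real helper reads token counts from the API's usage field):

// Rough per-request cost calculator. A sketch, not our exact helper:
// prices are USD per 1K tokens and match the rates quoted in this post.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4-turbo-preview': { input: 0.01, output: 0.03 },
  'gpt-3.5-turbo': { input: 0.0015, output: 0.002 },
};

function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  return (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
}

// The naive request above: calculateCost('gpt-4-turbo-preview', 5200, 2000)
// = 0.052 + 0.060 = $0.112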

Strategy 1: Aggressive Caching (Saved 40% Immediately)

The first low-hanging fruit: identical requests.

Analysis showed that 37% of our requests were semantically identical or near-identical. Why? Because users would:

  • Generate proposal for tender X
  • Regenerate because they didn't like the tone
  • Generate again with slight wording changes

Each regeneration cost us money, but the tender specification was the same.

Implementation: Content-Addressed Caching

import { createHash } from 'crypto';
import { redis } from '@/lib/redis';

async function generateProposal(request: ProposalRequest): Promise<string> {
  // Generate cache key from request content
  const cacheKey = generateCacheKey(request);

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    console.log('Cache hit, saved $0.112');
    return cached;
  }

  // Generate new
  const result = await callOpenAI(request);

  // Cache for 7 days
  await redis.setex(cacheKey, 604800, result);

  return result;
}

function generateCacheKey(request: ProposalRequest): string {
  // Normalize to handle minor variations
  const normalized = {
    tenderTitle: request.tenderTitle.toLowerCase().trim(),
    description: request.tenderDescription.toLowerCase().trim(),
    type: request.documentType,
    // Don't include user-specific data
  };

  const content = JSON.stringify(normalized);
  return `proposal:${createHash('sha256').update(content).digest('hex')}`;
}

Impact:

  • Cache hit rate: 42% (better than expected!)
  • Cost reduction: each cache hit saves the full $0.112; averaged over all requests, that's 42% × $0.112 ≈ $0.047 per request
  • Monthly savings: ~$530
  • New monthly cost: $2,670

We were no longer losing money, but still not profitable. We needed more optimization.
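The callOpenAI helper referenced above (and in later snippets) is just a thin wrapper around the client call from the naive implementation. It isn't shown in full in this post and its signature drifts between examples, so here's a sketch of the early form; buildUserPrompt and the inlined system prompt are stand-ins, not our exact code:

import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Placeholder for the system prompt discussed above (3,200 tokens originally,
// compressed later in Strategy 2).
const SYSTEM_PROMPT = '...';

// Hypothetical helper: flattens the request into the user message,
// mirroring the fields from the naive implementation.
function buildUserPrompt(request: ProposalRequest): string {
  return [
    `Tender: ${request.tenderTitle}`,
    `Description: ${request.tenderDescription}`,
    `Company: ${request.companyProfile}`,
    `Requirements: ${request.requirements}`,
  ].join('\n');
}

// Sketch of the callOpenAI wrapper the snippets above delegate to; later
// sections pass it a model name or an options object instead.
async function callOpenAI(
  request: ProposalRequest,
  model: string = 'gpt-4-turbo-preview'
): Promise<string> {
  const completion = await client.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: buildUserPrompt(request) },
    ],
    temperature: 0.7,
    max_tokens: 2000,
  });

  return completion.choices[0].message.content ?? '';
}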

Strategy 2: Prompt Compression (Saved 25% More)

Our 3,200-token system prompt was absurdly long. Did we really need all that context?

We analyzed the impact of system prompt length on output quality:

// Testing framework
async function testPromptVariations() {
  const prompts = [
    { name: 'Full (3200 tokens)', content: fullPrompt },
    { name: 'Medium (1200 tokens)', content: mediumPrompt },
    { name: 'Minimal (400 tokens)', content: minimalPrompt },
  ];

  const testCases = loadTestCases(10); // Real tender requests

  for (const prompt of prompts) {
    for (const testCase of testCases) {
      const output = await generate(prompt.content, testCase);
      const quality = await evaluateQuality(output);

      console.log({
        prompt: prompt.name,
        quality: quality.score,
        cost: calculateCost(prompt.content, output),
      });
    }
  }
}

Results:

  • Full prompt (3,200 tokens): Quality 4.3/5, Cost $0.112
  • Medium prompt (1,200 tokens): Quality 4.2/5, Cost $0.079
  • Minimal prompt (400 tokens): Quality 3.8/5, Cost $0.048

The medium prompt offered the best quality-cost tradeoff: we were willing to trade 0.1 quality points for a 29% cost reduction. We then kept trimming it, cutting anything the tests showed had no effect on quality, until we arrived at the version below.

The Optimized Prompt

// ✅ Compressed but effective (a builder, so documentType and industryContext are in scope)
const buildSystemPrompt = (documentType: string, industryContext: string) =>
  `Expert SA govt tender writer. Generate ${documentType}.

Rules:
- SBD/MBD compliant
- Professional tone
- Address requirements explicitly
- 1-2 pages max
- No placeholders

Context: ${industryContext}`;

From 3,200 tokens to 80-120 tokens (depending on document type). 27x compression!

Impact:

  • Input cost: $0.052 → $0.012 (77% reduction)
  • Quality: 4.3/5 → 4.2/5 (acceptable trade-off)
  • Monthly savings: ~$450
  • New monthly cost: $2,220

We were getting close to profitability, but not there yet.

Strategy 3: Model Routing (Saved Another 35%)

GPT-4 is powerful but expensive. Do we always need it?

We analyzed our requests by complexity:

interface RequestComplexity {
  documentType: 'cover-letter' | 'executive-summary' | 'capability-statement';
  tenderLength: number; // tokens
  requirementsCount: number;
  complexity: 'simple' | 'medium' | 'complex';
}

function assessComplexity(request: ProposalRequest): RequestComplexity {
  const tenderLength = estimateTokens(request.tenderDescription);
  const requirementsCount = request.requirements?.length || 0;

  let complexity: 'simple' | 'medium' | 'complex';

  if (request.documentType === 'cover-letter' && tenderLength < 500) {
    complexity = 'simple';
  } else if (requirementsCount > 10 || tenderLength > 2000) {
    complexity = 'complex';
  } else {
    complexity = 'medium';
  }

  return { documentType: request.documentType, tenderLength, requirementsCount, complexity };
}
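estimateTokens doesn't need to be a real tokenizer for routing purposes; a character-count heuristic is enough. A sketch (the ~4 characters per token ratio is a rough rule of thumb for English text, not our exact implementation):

// Rough token estimate: ~4 characters per token for English prose.
// A heuristic sketch; a real tokenizer would be more accurate, but
// routing only needs a ballpark figure.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}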

Distribution:

  • Simple: 45% (cover letters, short tenders)
  • Medium: 40% (standard proposals)
  • Complex: 15% (detailed technical responses)

We could route simple requests to cheaper models!

The Routing Logic

async function generateProposal(request: ProposalRequest): Promise<string> {
  const complexity = assessComplexity(request);

  let model: string;

  switch (complexity.complexity) {
    case 'simple':
      model = 'gpt-3.5-turbo'; // $0.0015/1K in, $0.002/1K out
      break;
    case 'medium':
      model = 'gpt-4-turbo-preview'; // $0.01/1K in, $0.03/1K out
      break;
    case 'complex':
      model = 'gpt-4-turbo-preview';
      break;
  }

  console.log(`Routing to ${model} for ${complexity.complexity} request`);

  return await callOpenAI(request, model);
}

Cost Comparison:

  • Simple (GPT-3.5): $0.008 per request
  • Medium (GPT-4): $0.079 per request
  • Complex (GPT-4): $0.112 per request

Blended cost:

  • 45% × $0.008 = $0.0036
  • 40% × $0.079 = $0.0316
  • 15% × $0.112 = $0.0168
  • Average: $0.052 per request

Down from the original $0.112! That's a 53% reduction from the naive baseline, with minimal quality impact.

Impact:

  • Monthly cost: $2,220 → $1,450
  • Quality: Still 4.2/5 average
  • Margin: Now profitable!

But we weren't done optimizing.

Strategy 4: Smart Regeneration (Prevent Waste)

Users regenerating proposals 3-4 times was killing our margins. Why were they regenerating?

We added feedback tracking:

interface RegenerationFeedback {
  reason: 'tone' | 'length' | 'content' | 'format' | 'other';
  previousVersion: string;
  userComment?: string;
}

async function regenerateProposal(
  request: ProposalRequest,
  feedback: RegenerationFeedback
): Promise<string> {
  // Don't just regenerate blindly—modify the prompt
  const modifiedRequest = applyFeedback(request, feedback);

  return await generateProposal(modifiedRequest);
}
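applyFeedback folds the complaint back into the request so the next generation actually changes. A minimal sketch; the instruction mapping and the additionalInstructions field are illustrative assumptions, not our exact implementation:

// Sketch: translate the regeneration reason into an extra instruction on the request.
// Assumes ProposalRequest has an optional additionalInstructions field that gets
// appended to the user prompt; both the field and the mapping are hypothetical.
function applyFeedback(
  request: ProposalRequest,
  feedback: RegenerationFeedback
): ProposalRequest {
  const instructions: Record<RegenerationFeedback['reason'], string> = {
    tone: 'Adjust the tone as requested in the user comment.',
    length: 'Adjust the length as requested in the user comment.',
    content: 'Cover the points the user says are missing.',
    format: 'Fix the structure and formatting issues raised by the user.',
    other: 'Apply the user feedback below.',
  };

  return {
    ...request,
    additionalInstructions: `${instructions[feedback.reason]}\n${feedback.userComment ?? ''}`,
  };
}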

Regeneration reasons:

  • Tone (too formal/informal): 38%
  • Length (too long/short): 27%
  • Content (missing points): 22%
  • Format (structure issues): 13%

We could address these without regenerating!

Smart Edits Instead of Regeneration

async function improveProposal(
  request: ProposalRequest,
  original: string,
  feedback: RegenerationFeedback
): Promise<string> {
  // For simple changes, use cheaper edit operations
  if (feedback.reason === 'tone' || feedback.reason === 'length') {
    return await editProposal(original, feedback);
  }

  // Only regenerate for content issues
  return await regenerateProposal(request, feedback);
}

async function editProposal(
  original: string,
  feedback: RegenerationFeedback
): Promise<string> {
  // `openai` is the shared OpenAI client created once at module load
  const editPrompt = `Original proposal:\n${original}\n\nRevise it to address this feedback (${feedback.reason}): ${feedback.userComment ?? ''}`;

  // Use GPT-3.5 for edits (cheaper)
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: editPrompt }],
    max_tokens: 500, // Edits are shorter
  });

  return completion.choices[0].message.content ?? original;
}

Impact:

  • 65% of regenerations now use edit mode (much cheaper)
  • Edit cost: $0.004 vs $0.052 regeneration
  • Saved regeneration costs: ~$180/month
  • New monthly cost: $1,270

Now we were solidly profitable.

Strategy 5: Prompt Versioning and A/B Testing

We weren't satisfied with 4.2/5 quality. Could we improve quality without increasing costs?

We implemented prompt versioning with A/B testing:

interface PromptVersion {
  version: string;
  prompt: string;
  modelConfig: {
    temperature: number;
    topP?: number;
    frequencyPenalty?: number;
  };
  activePercentage: number; // For A/B testing
}

const PROMPT_VERSIONS: PromptVersion[] = [
  {
    version: 'v1-baseline',
    prompt: COMPRESSED_PROMPT,
    modelConfig: { temperature: 0.7 },
    activePercentage: 50,
  },
  {
    version: 'v2-structured',
    prompt: STRUCTURED_PROMPT_V2,
    modelConfig: { temperature: 0.6, frequencyPenalty: 0.3 },
    activePercentage: 50,
  },
];

async function generateProposal(request: ProposalRequest): Promise<string> {
  const promptVersion = selectPromptVersion();

  const result = await callOpenAI(request, promptVersion);

  // Track performance
  await trackGeneration({
    version: promptVersion.version,
    requestId: request.id,
    cost: calculateCost(result),
  });

  return result;
}

function selectPromptVersion(): PromptVersion {
  const rand = Math.random() * 100;
  let cumulative = 0;

  for (const version of PROMPT_VERSIONS) {
    cumulative += version.activePercentage;
    if (rand < cumulative) return version;
  }

  return PROMPT_VERSIONS[0];
}
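One refinement worth noting: a pure Math.random() split can show the same user different versions on consecutive requests. Hashing a stable identifier keeps assignment consistent. A sketch (per-user bucketing is a refinement beyond the random split above, and hashing the user ID is an assumption about how you'd wire it in):

// Sketch: deterministic A/B assignment so a given user always gets the same prompt version.
// Hashing the user ID into [0, 100) replaces Math.random() in selectPromptVersion above.
import { createHash } from 'crypto';

function selectPromptVersionForUser(userId: string): PromptVersion {
  const bucket = createHash('sha256').update(userId).digest().readUInt32BE(0) % 100;

  let cumulative = 0;
  for (const version of PROMPT_VERSIONS) {
    cumulative += version.activePercentage;
    if (bucket < cumulative) return version;
  }

  return PROMPT_VERSIONS[0];
}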

After 2 weeks of A/B testing with 500+ generations:

// Results (averages over the test period)
{
  'v1-baseline': {
    avgQuality: 4.2,
    avgCost: 0.052, // USD
    userRating: 4.2,
  },
  'v2-structured': {
    avgQuality: 4.6,
    avgCost: 0.054, // USD
    userRating: 4.6,
  }
}

v2 was slightly more expensive (+4%) but significantly better quality (+9.5%). We made it the default.

Impact:

  • Quality: 4.2/5 → 4.6/5
  • Cost increase: +4% ($1,270 → $1,320)
  • Worth it! Better product = more users = more revenue

The Final Architecture

Here's what our production system looks like now:

// ✅ Optimized production implementation
async function generateProposal(request: ProposalRequest): Promise<string> {
  // 1. Cache check
  const cacheKey = generateCacheKey(request);
  const cached = await redis.get(cacheKey);
  if (cached) {
    await analytics.track('cache_hit');
    return cached;
  }

  // 2. Assess complexity
  const complexity = assessComplexity(request);

  // 3. Select model
  const model = routeToModel(complexity);

  // 4. Get prompt version
  const promptVersion = selectPromptVersion();

  // 5. Generate
  const result = await callOpenAI({
    model,
    prompt: promptVersion.prompt,
    request,
    config: promptVersion.modelConfig,
  });

  // 6. Cache
  await redis.setex(cacheKey, 604800, result);

  // 7. Track
  await analytics.track('generation', {
    complexity,
    model,
    promptVersion: promptVersion.version,
    cost: calculateCost(result),
  });

  return result;
}
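routeToModel is simply the Strategy 3 switch pulled out into a helper; for completeness, a sketch:

// Sketch of routeToModel: the same routing switch from Strategy 3, extracted as a helper.
function routeToModel(complexity: RequestComplexity): string {
  switch (complexity.complexity) {
    case 'simple':
      return 'gpt-3.5-turbo';
    case 'medium':
    case 'complex':
    default:
      return 'gpt-4-turbo-preview';
  }
}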

The Results: By the Numbers

Month 2 (Before Optimization):

  • Requests: 7,500
  • Average cost per request: $0.112
  • Total cost: $3,200
  • Revenue: $1,800
  • Margin: -78%

Month 8 (After Optimization):

  • Requests: 78,000 (10x growth!)
  • Average cost per request: $0.030
  • Total cost: $2,340
  • Revenue: $14,400 (8x growth)
  • Margin: +84%

Cost Breakdown:

  • Caching saved: 42%
  • Prompt compression saved: 27%
  • Model routing saved: 35%
  • Smart regeneration saved: 12%
  • Total reduction: 73%

(Yes, these add up to >100% because they compound)

Key Lessons Learned

1. Cache Everything You Can

Our cache hit rate is 42%. That's 42% of requests costing us literally nothing (Redis is cheap).

Pro tip: Use semantic hashing for cache keys. Minor wording differences shouldn't break cache hits:

function semanticHash(text: string): string {
  // Normalize
  const normalized = text
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .trim();

  // Hash
  return createHash('sha256').update(normalized).digest('hex');
}
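For example, these two inputs normalize to the same string and therefore hit the same cache entry:

// Different whitespace and casing, same normalized string, same hash
semanticHash('Supply of  Office Furniture ');  // same key as the line below
semanticHash('supply of office furniture');    // same key as the line above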

2. Test Model Performance Rigorously

Don't assume GPT-4 is always necessary. We found:

  • GPT-3.5 works great for 45% of requests
  • GPT-4 only needed for complex tasks
  • User satisfaction barely changed

Create a test suite:

// tests/ai/model-comparison.test.ts
describe('Model Quality Comparison', () => {
  const testCases = loadRealTenderRequests(50);

  it('should compare GPT-3.5 vs GPT-4 on simple requests', async () => {
    for (const testCase of testCases.filter(isSimple)) {
      const gpt35 = await generate(testCase, 'gpt-3.5-turbo');
      const gpt4 = await generate(testCase, 'gpt-4-turbo');

      const quality35 = await evaluateQuality(gpt35);
      const quality4 = await evaluateQuality(gpt4);

      console.log({
        quality35,
        quality4,
        costDiff: calculateCost(gpt4) - calculateCost(gpt35),
        qualityDiff: quality4 - quality35,
      });
    }
  });
});

3. Prompt Engineering Matters More Than Model Selection

We got bigger wins from better prompts than from better models.

Bad prompt (high cost, mediocre quality):

You are an expert tender response writer with 20 years of experience in South African government procurement. You have deep knowledge of PFMA, MFMA, PPPFA regulations, and all relevant compliance requirements. You understand the nuances of different government departments and their evaluation criteria...
[3,000 more tokens of background]

Good prompt (low cost, high quality):

SA tender expert. Generate ${type}.
Must: SBD format, professional tone, 1-2 pages.
Context: ${industry}

The second prompt is 30x shorter and produces comparable quality.

4. User Feedback > Your Assumptions

We thought users wanted longer, more detailed responses. They actually wanted shorter, more scannable content.

We discovered this by asking—not assuming:

async function collectFeedback(generationId: string) {
  return await showUserSurvey({
    questions: [
      'How would you rate the quality?',
      'Was the length appropriate?',
      'Did it address your needs?',
    ],
  });
}

5. Measure Everything

You can't optimize what you don't measure. We track:

interface GenerationMetrics {
  requestId: string;
  model: string;
  promptVersion: string;
  inputTokens: number;
  outputTokens: number;
  cost: number;
  latency: number;
  cacheHit: boolean;
  complexity: string;
  userRating?: number;
  regenerationCount: number;
}

await analytics.track('generation', metrics);

This data drove every optimization decision.
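A typical question we ask of this data, cost and rating per prompt version, is a single aggregation. A sketch using Prisma's groupBy, assuming the metrics land in the same generation table used elsewhere in this post and that its columns mirror GenerationMetrics:

// Sketch: average cost, rating, and latency per prompt version from the tracked metrics.
// Assumes a `generation` table whose columns mirror GenerationMetrics.
const byVersion = await db.generation.groupBy({
  by: ['promptVersion'],
  _avg: { cost: true, userRating: true, latency: true },
  _count: { _all: true },
});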

What's Next: Future Optimizations

We're not done. Here's what we're working on:

1. Fine-Tuned Models

Instead of using general-purpose GPT models, we're fine-tuning on our domain:

// Fine-tuning on our successful generations
const trainingData = await db.generation.findMany({
  where: { userRating: { gte: 4 } },
  select: { input: true, output: true },
  take: 10000,
});

// Upload the prepared JSONL file, then start the fine-tuning job
const trainingFile = await openai.files.create({
  file: prepareTrainingFile(trainingData), // converts rows to chat-format JSONL
  purpose: 'fine-tune',
});

const fineTuned = await openai.fineTuning.jobs.create({
  model: 'gpt-3.5-turbo',
  training_file: trainingFile.id,
});

Expected: 20-30% cost reduction with comparable quality.

2. Local Model for Simple Tasks

For the simplest tasks (cover letters for short tenders), we're experimenting with locally-hosted models (see the sketch after this list):

  • Mistral 7B for simple generation
  • Zero API costs
  • Higher latency (acceptable for async jobs)
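Since most local runtimes expose an OpenAI-compatible endpoint, the swap can be as small as pointing the existing client at a different base URL. A sketch assuming an Ollama-style server on localhost; the URL, model name, and placeholder API key are assumptions about a typical local setup, not our production config:

import OpenAI from 'openai';

// Sketch: reuse the OpenAI SDK against a local, OpenAI-compatible endpoint.
// Base URL and model name are placeholders for a typical Ollama-style setup.
const localClient = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'local', // ignored by local runtimes, but required by the SDK
});

async function generateLocally(prompt: string): Promise<string> {
  const completion = await localClient.chat.completions.create({
    model: 'mistral',
    messages: [{ role: 'user', content: prompt }],
  });

  return completion.choices[0].message.content ?? '';
}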

3. Streaming Responses

Better UX and perceived performance:

async function* streamProposal(request: ProposalRequest) {
  // model and messages come from the same routing + prompt-version logic shown earlier
  const stream = await openai.chat.completions.create({
    model,
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    yield chunk.choices[0]?.delta?.content || '';
  }
}

4. Intelligent Pre-fetching

Predict likely regenerations and pre-generate alternatives:

async function generateWithVariations(request: ProposalRequest) {
  const [base, formal, casual] = await Promise.all([
    generate(request, { tone: 'balanced' }),
    generate(request, { tone: 'formal' }),
    generate(request, { tone: 'casual' }),
  ]);

  // Cache all versions
  // User can switch tone without regenerating
}

Tools and Resources

Here are the tools that made optimization possible:

Cost Tracking:

  • OpenAI usage dashboard (built-in)
  • Custom analytics with Mixpanel
  • Real-time cost alerts via Slack (sketched below)
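The Slack alerts are just an incoming-webhook call when spend crosses a threshold. A minimal sketch; the webhook env var and the $150/day threshold are placeholders, not our real configuration:

// Sketch: ping Slack when the day's AI spend crosses a threshold.
// SLACK_WEBHOOK_URL and the threshold are placeholders.
async function checkCostAlert(dailySpendUsd: number): Promise<void> {
  const THRESHOLD_USD = 150;
  if (dailySpendUsd < THRESHOLD_USD) return;

  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `AI spend today is $${dailySpendUsd.toFixed(2)} (threshold: $${THRESHOLD_USD})`,
    }),
  });
}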

Quality Evaluation:

// Automated quality scoring
import { HumanEval } from '@anthropic-ai/eval';

async function evaluateQuality(output: string): Promise<number> {
  const criteria = [
    'addresses_requirements',
    'professional_tone',
    'appropriate_length',
    'sbd_compliance',
  ];

  const scores = await Promise.all(
    criteria.map(c => HumanEval.score(output, c))
  );

  return scores.reduce((a, b) => a + b) / scores.length;
}

Prompt Management:

  • Version control in Git
  • Feature flags for A/B testing
  • Automated performance tracking

Conclusion: AI Can Be Profitable

When we started, AI features seemed like a necessary loss leader. We'd subsidize them with other revenue and hope to break even eventually.

But with systematic optimization:

  • 73% cost reduction
  • 10x volume growth
  • +84% margins

AI features can be profitable and high-quality.

The key is treating AI like any other infrastructure: measure, optimize, iterate.

Final cost comparison:

  • Naive implementation: $0.112/request
  • Optimized implementation: $0.030/request
  • 73% reduction while improving quality from 4.3/5 to 4.6/5

If you're building AI features and struggling with costs, start with these strategies:

  1. Cache aggressively (42% savings for us)
  2. Compress your prompts (27% savings)
  3. Route to cheaper models when possible (35% savings)
  4. Prevent unnecessary regenerations (12% savings)
  5. A/B test everything (improved quality)

The math works. AI can be profitable.


Questions? Want to share your AI optimization strategies?

Drop them in the comments! I'm always learning better approaches.

Follow for more articles on:

  • Building cost-effective AI features
  • Production LLM best practices
  • Scaling developer tools
  • South African tech startup journey

Check out our platform: tenders-sa.org


Currently building Tenders SA - South Africa's AI-powered government tender platform. We generate 2,500+ AI proposals monthly at profitable margins.
