mobius-crypt
How We Cut AI Costs by 73% While Improving Quality: Building Cost-Effective LLM Features

Our AI proposal generator was hemorrhaging money. In month two of operation, our OpenAI bill hit $3,200 while generating only $1,800 in revenue. Our gross margin was negative 78%.

Something had to change, fast.

Six months later, we're processing 10x the volume at 27% of the original cost per request. Our margins are now healthy (+62%), response quality improved (4.3/5 → 4.6/5), and we've learned some expensive lessons about production AI.

This is the story of how we optimized our AI features without sacrificing quality—and the specific technical strategies you can use to do the same.

The Naive Implementation: When AI Looks Easy

Here's what our original AI proposal generator looked like:

// ❌ The expensive, naive approach
async function generateProposal(request: ProposalRequest): Promise<string> {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const completion = await client.chat.completions.create({
    model: "gpt-4-turbo-preview", // Expensive!
    messages: [
      {
        role: "system",
        content: `You are an expert tender response writer for South African government procurement...
        [3,200 tokens of system prompt]`
      },
      {
        role: "user",
        content: `Tender: ${request.tenderTitle}
        Description: ${request.tenderDescription}
        Company: ${request.companyProfile}
        Requirements: ${request.requirements}
        [Often 2,000+ tokens]`
      }
    ],
    temperature: 0.7,
    max_tokens: 2000, // Expensive!
  });

  return completion.choices[0].message.content ?? '';
}

This looks clean and simple. Ship it, right?

The costs (per request):

  • Input tokens: ~5,200 tokens (system prompt + user input)
  • Output tokens: ~2,000 tokens (response)
  • Model: GPT-4 Turbo ($0.01/1K input, $0.03/1K output)
  • Cost per request: $0.112

We were generating 200-300 proposals per day. Monthly cost at 250 proposals/day:

250 requests/day × 30 days × $0.112 = $840/month

Except that's the average calculation. In reality:

  • Some users regenerated 3-4 times (multiply by 4)
  • Some prompts were much longer (2-3x tokens)
  • Peak usage hours hit rate limits (wasted requests)

Actual monthly cost: $3,200

For a feature that generated $1,800 in premium subscriptions. Oops.
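Several snippets later in this post call a calculateCost helper that isn't shown anywhere in full. Here is a minimal sketch of what ours roughly does, using the GPT-4 Turbo and GPT-3.5 Turbo rates quoted above; later snippets call it loosely with the generated output, so treat those calls as schematic (the real helper reads token counts from the API's usage field):

// Rough per-request cost calculator. A sketch, not our exact helper:
// prices are USD per 1K tokens and match the rates quoted in this post.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4-turbo-preview': { input: 0.01, output: 0.03 },
  'gpt-3.5-turbo': { input: 0.0015, output: 0.002 },
};

function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  return (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
}

// The naive request above: calculateCost('gpt-4-turbo-preview', 5200, 2000)
// = 0.052 + 0.060 = $0.112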

Strategy 1: Aggressive Caching (Saved 40% Immediately)

The first low-hanging fruit: identical requests.

Analysis showed that 37% of our requests were semantically identical or near-identical. Why? Because users would:

  • Generate proposal for tender X
  • Regenerate because they didn't like the tone
  • Generate again with slight wording changes

Each regeneration cost us money, but the tender specification was the same.

Implementation: Content-Addressed Caching

import { createHash } from 'crypto';
import { redis } from '@/lib/redis';

async function generateProposal(request: ProposalRequest): Promise<string> {
  // Generate cache key from request content
  const cacheKey = generateCacheKey(request);

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    console.log('Cache hit, saved $0.112');
    return cached;
  }

  // Generate new
  const result = await callOpenAI(request);

  // Cache for 7 days
  await redis.setex(cacheKey, 604800, result);

  return result;
}

function generateCacheKey(request: ProposalRequest): string {
  // Normalize to handle minor variations
  const normalized = {
    tenderTitle: request.tenderTitle.toLowerCase().trim(),
    description: request.tenderDescription.toLowerCase().trim(),
    type: request.documentType,
    // Don't include user-specific data
  };

  const content = JSON.stringify(normalized);
  return `proposal:${createHash('sha256').update(content).digest('hex')}`;
}

Impact:

  • Cache hit rate: 42% (better than expected!)
  • Cost reduction: each cache hit saves the full $0.112; averaged over all requests, that's 42% × $0.112 ≈ $0.047 per request
  • Monthly savings: ~$530
  • New monthly cost: $2,670

We were no longer losing money, but still not profitable. We needed more optimization.
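The callOpenAI helper referenced above (and in later snippets) is just a thin wrapper around the client call from the naive implementation. It isn't shown in full in this post and its signature drifts between examples, so here's a sketch of the early form; buildUserPrompt and the inlined system prompt are stand-ins, not our exact code:

import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Placeholder for the system prompt discussed above (3,200 tokens originally,
// compressed later in Strategy 2).
const SYSTEM_PROMPT = '...';

// Hypothetical helper: flattens the request into the user message,
// mirroring the fields from the naive implementation.
function buildUserPrompt(request: ProposalRequest): string {
  return [
    `Tender: ${request.tenderTitle}`,
    `Description: ${request.tenderDescription}`,
    `Company: ${request.companyProfile}`,
    `Requirements: ${request.requirements}`,
  ].join('\n');
}

// Sketch of the callOpenAI wrapper the snippets above delegate to; later
// sections pass it a model name or an options object instead.
async function callOpenAI(
  request: ProposalRequest,
  model: string = 'gpt-4-turbo-preview'
): Promise<string> {
  const completion = await client.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: buildUserPrompt(request) },
    ],
    temperature: 0.7,
    max_tokens: 2000,
  });

  return completion.choices[0].message.content ?? '';
}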

Strategy 2: Prompt Compression (Saved 25% More)

Our 3,200-token system prompt was absurdly long. Did we really need all that context?

We analyzed the impact of system prompt length on output quality:

// Testing framework
async function testPromptVariations() {
  const prompts = [
    { name: 'Full (3200 tokens)', content: fullPrompt },
    { name: 'Medium (1200 tokens)', content: mediumPrompt },
    { name: 'Minimal (400 tokens)', content: minimalPrompt },
  ];

  const testCases = loadTestCases(10); // Real tender requests

  for (const prompt of prompts) {
    for (const testCase of testCases) {
      const output = await generate(prompt.content, testCase);
      const quality = await evaluateQuality(output);

      console.log({
        prompt: prompt.name,
        quality: quality.score,
        cost: calculateCost(prompt.content, output),
      });
    }
  }
}

Results:

  • Full prompt (3,200 tokens): Quality 4.3/5, Cost $0.112
  • Medium prompt (1,200 tokens): Quality 4.2/5, Cost $0.079
  • Minimal prompt (400 tokens): Quality 3.8/5, Cost $0.048

The medium prompt offered the best quality-cost tradeoff: we were willing to trade 0.1 quality points for a 29% cost reduction. We then kept trimming it, cutting anything the tests showed had no effect on quality, until we arrived at the version below.

The Optimized Prompt

// ✅ Compressed but effective (a builder, so documentType and industryContext are in scope)
const buildSystemPrompt = (documentType: string, industryContext: string) =>
  `Expert SA govt tender writer. Generate ${documentType}.

Rules:
- SBD/MBD compliant
- Professional tone
- Address requirements explicitly
- 1-2 pages max
- No placeholders

Context: ${industryContext}`;

From 3,200 tokens to 80-120 tokens (depending on document type). 27x compression!

Impact:

  • Input cost: $0.052 → $0.012 (77% reduction)
  • Quality: 4.3/5 → 4.2/5 (acceptable trade-off)
  • Monthly savings: ~$450
  • New monthly cost: $2,220

We were getting close to profitability, but not there yet.

Strategy 3: Model Routing (Saved Another 35%)

GPT-4 is powerful but expensive. Do we always need it?

We analyzed our requests by complexity:

interface RequestComplexity {
  documentType: 'cover-letter' | 'executive-summary' | 'capability-statement';
  tenderLength: number; // tokens
  requirementsCount: number;
  complexity: 'simple' | 'medium' | 'complex';
}

function assessComplexity(request: ProposalRequest): RequestComplexity {
  const tenderLength = estimateTokens(request.tenderDescription);
  const requirementsCount = request.requirements?.length || 0;

  let complexity: 'simple' | 'medium' | 'complex';

  if (request.documentType === 'cover-letter' && tenderLength < 500) {
    complexity = 'simple';
  } else if (requirementsCount > 10 || tenderLength > 2000) {
    complexity = 'complex';
  } else {
    complexity = 'medium';
  }

  return { documentType: request.documentType, tenderLength, requirementsCount, complexity };
}
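estimateTokens doesn't need to be a real tokenizer for routing purposes; a character-count heuristic is enough. A sketch (the ~4 characters per token ratio is a rough rule of thumb for English text, not our exact implementation):

// Rough token estimate: ~4 characters per token for English prose.
// A heuristic sketch; a real tokenizer would be more accurate, but
// routing only needs a ballpark figure.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}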

Distribution:

  • Simple: 45% (cover letters, short tenders)
  • Medium: 40% (standard proposals)
  • Complex: 15% (detailed technical responses)

We could route simple requests to cheaper models!

The Routing Logic

async function generateProposal(request: ProposalRequest): Promise<string> {
  const complexity = assessComplexity(request);

  let model: string;

  switch (complexity.complexity) {
    case 'simple':
      model = 'gpt-3.5-turbo'; // $0.0015/1K in, $0.002/1K out
      break;
    case 'medium':
      model = 'gpt-4-turbo-preview'; // $0.01/1K in, $0.03/1K out
      break;
    case 'complex':
      model = 'gpt-4-turbo-preview';
      break;
  }

  console.log(`Routing to ${model} for ${complexity.complexity} request`);

  return await callOpenAI(request, model);
}

Cost Comparison:

  • Simple (GPT-3.5): $0.008 per request
  • Medium (GPT-4): $0.079 per request
  • Complex (GPT-4): $0.112 per request

Blended cost:

  • 45% × $0.008 = $0.0036
  • 40% × $0.079 = $0.0316
  • 15% × $0.112 = $0.0168
  • Average: $0.052 per request

Down from the original $0.112! That's a 53% reduction from the naive baseline, with minimal quality impact.

Impact:

  • Monthly cost: $2,220 → $1,450
  • Quality: Still 4.2/5 average
  • Margin: Now profitable!

But we weren't done optimizing.

Strategy 4: Smart Regeneration (Prevent Waste)

Users regenerating proposals 3-4 times was killing our margins. Why were they regenerating?

We added feedback tracking:

interface RegenerationFeedback {
  reason: 'tone' | 'length' | 'content' | 'format' | 'other';
  previousVersion: string;
  userComment?: string;
}

async function regenerateProposal(
  request: ProposalRequest,
  feedback: RegenerationFeedback
): Promise<string> {
  // Don't just regenerate blindly—modify the prompt
  const modifiedRequest = applyFeedback(request, feedback);

  return await generateProposal(modifiedRequest);
}
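applyFeedback folds the complaint back into the request so the next generation actually changes. A minimal sketch; the instruction mapping and the additionalInstructions field are illustrative assumptions, not our exact implementation:

// Sketch: translate the regeneration reason into an extra instruction on the request.
// Assumes ProposalRequest has an optional additionalInstructions field that gets
// appended to the user prompt; both the field and the mapping are hypothetical.
function applyFeedback(
  request: ProposalRequest,
  feedback: RegenerationFeedback
): ProposalRequest {
  const instructions: Record<RegenerationFeedback['reason'], string> = {
    tone: 'Adjust the tone as requested in the user comment.',
    length: 'Adjust the length as requested in the user comment.',
    content: 'Cover the points the user says are missing.',
    format: 'Fix the structure and formatting issues raised by the user.',
    other: 'Apply the user feedback below.',
  };

  return {
    ...request,
    additionalInstructions: `${instructions[feedback.reason]}\n${feedback.userComment ?? ''}`,
  };
}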

Regeneration reasons:

  • Tone (too formal/informal): 38%
  • Length (too long/short): 27%
  • Content (missing points): 22%
  • Format (structure issues): 13%

We could address these without regenerating!

Smart Edits Instead of Regeneration

async function improveProposal(
  request: ProposalRequest,
  original: string,
  feedback: RegenerationFeedback
): Promise<string> {
  // For simple changes, use cheaper edit operations
  if (feedback.reason === 'tone' || feedback.reason === 'length') {
    return await editProposal(original, feedback);
  }

  // Only regenerate for content issues
  return await regenerateProposal(request, feedback);
}

async function editProposal(
  original: string,
  feedback: RegenerationFeedback
): Promise<string> {
  // `openai` is the shared OpenAI client created once at module load
  const editPrompt = `Original proposal:\n${original}\n\nRevise it to address this feedback (${feedback.reason}): ${feedback.userComment ?? ''}`;

  // Use GPT-3.5 for edits (cheaper)
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: editPrompt }],
    max_tokens: 500, // Edits are shorter
  });

  return completion.choices[0].message.content ?? original;
}

Impact:

  • 65% of regenerations now use edit mode (much cheaper)
  • Edit cost: $0.004 vs $0.052 regeneration
  • Saved regeneration costs: ~$180/month
  • New monthly cost: $1,270

Now we were solidly profitable.

Strategy 5: Prompt Versioning and A/B Testing

We weren't satisfied with 4.2/5 quality. Could we improve quality without increasing costs?

We implemented prompt versioning with A/B testing:

interface PromptVersion {
  version: string;
  prompt: string;
  modelConfig: {
    temperature: number;
    topP?: number;
    frequencyPenalty?: number;
  };
  activePercentage: number; // For A/B testing
}

const PROMPT_VERSIONS: PromptVersion[] = [
  {
    version: 'v1-baseline',
    prompt: COMPRESSED_PROMPT,
    modelConfig: { temperature: 0.7 },
    activePercentage: 50,
  },
  {
    version: 'v2-structured',
    prompt: STRUCTURED_PROMPT_V2,
    modelConfig: { temperature: 0.6, frequencyPenalty: 0.3 },
    activePercentage: 50,
  },
];

async function generateProposal(request: ProposalRequest): Promise<string> {
  const promptVersion = selectPromptVersion();

  const result = await callOpenAI(request, promptVersion);

  // Track performance
  await trackGeneration({
    version: promptVersion.version,
    requestId: request.id,
    cost: calculateCost(result),
  });

  return result;
}

function selectPromptVersion(): PromptVersion {
  const rand = Math.random() * 100;
  let cumulative = 0;

  for (const version of PROMPT_VERSIONS) {
    cumulative += version.activePercentage;
    if (rand < cumulative) return version;
  }

  return PROMPT_VERSIONS[0];
}
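One refinement worth noting: a pure Math.random() split can show the same user different versions on consecutive requests. Hashing a stable identifier keeps assignment consistent. A sketch (per-user bucketing is a refinement beyond the random split above, and hashing the user ID is an assumption about how you'd wire it in):

// Sketch: deterministic A/B assignment so a given user always gets the same prompt version.
// Hashing the user ID into [0, 100) replaces Math.random() in selectPromptVersion above.
import { createHash } from 'crypto';

function selectPromptVersionForUser(userId: string): PromptVersion {
  const bucket = createHash('sha256').update(userId).digest().readUInt32BE(0) % 100;

  let cumulative = 0;
  for (const version of PROMPT_VERSIONS) {
    cumulative += version.activePercentage;
    if (bucket < cumulative) return version;
  }

  return PROMPT_VERSIONS[0];
}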

After 2 weeks of A/B testing with 500+ generations:

// Results (averages over the test period)
{
  'v1-baseline': {
    avgQuality: 4.2,
    avgCost: 0.052, // USD
    userRating: 4.2,
  },
  'v2-structured': {
    avgQuality: 4.6,
    avgCost: 0.054, // USD
    userRating: 4.6,
  }
}

v2 was slightly more expensive (+4%) but significantly better quality (+9.5%). We made it the default.

Impact:

  • Quality: 4.2/5 → 4.6/5
  • Cost increase: +4% ($1,270 → $1,320)
  • Worth it! Better product = more users = more revenue

The Final Architecture

Here's what our production system looks like now:

// ✅ Optimized production implementation
async function generateProposal(request: ProposalRequest): Promise<string> {
  // 1. Cache check
  const cacheKey = generateCacheKey(request);
  const cached = await redis.get(cacheKey);
  if (cached) {
    await analytics.track('cache_hit');
    return cached;
  }

  // 2. Assess complexity
  const complexity = assessComplexity(request);

  // 3. Select model
  const model = routeToModel(complexity);

  // 4. Get prompt version
  const promptVersion = selectPromptVersion();

  // 5. Generate
  const result = await callOpenAI({
    model,
    prompt: promptVersion.prompt,
    request,
    config: promptVersion.modelConfig,
  });

  // 6. Cache
  await redis.setex(cacheKey, 604800, result);

  // 7. Track
  await analytics.track('generation', {
    complexity,
    model,
    promptVersion: promptVersion.version,
    cost: calculateCost(result),
  });

  return result;
}
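routeToModel is simply the Strategy 3 switch pulled out into a helper; for completeness, a sketch:

// Sketch of routeToModel: the same routing switch from Strategy 3, extracted as a helper.
function routeToModel(complexity: RequestComplexity): string {
  switch (complexity.complexity) {
    case 'simple':
      return 'gpt-3.5-turbo';
    case 'medium':
    case 'complex':
    default:
      return 'gpt-4-turbo-preview';
  }
}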

The Results: By the Numbers

Month 2 (Before Optimization):

  • Requests: 7,500
  • Average cost per request: $0.112
  • Total cost: $3,200
  • Revenue: $1,800
  • Margin: -78%

Month 8 (After Optimization):

  • Requests: 78,000 (10x growth!)
  • Average cost per request: $0.030
  • Total cost: $2,340
  • Revenue: $14,400 (8x growth)
  • Margin: +84%

Cost Breakdown:

  • Caching saved: 42%
  • Prompt compression saved: 27%
  • Model routing saved: 35%
  • Smart regeneration saved: 12%
  • Total reduction: 73%

(Yes, these add up to >100% because they compound)

Key Lessons Learned

1. Cache Everything You Can

Our cache hit rate is 42%. That's 42% of requests costing us literally nothing (Redis is cheap).

Pro tip: Use semantic hashing for cache keys. Minor wording differences shouldn't break cache hits:

function semanticHash(text: string): string {
  // Normalize
  const normalized = text
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .trim();

  // Hash
  return createHash('sha256').update(normalized).digest('hex');
}
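For example, these two inputs normalize to the same string and therefore hit the same cache entry:

// Different whitespace and casing, same normalized string, same hash
semanticHash('Supply of  Office Furniture ');  // same key as the line below
semanticHash('supply of office furniture');    // same key as the line above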

2. Test Model Performance Rigorously

Don't assume GPT-4 is always necessary. We found:

  • GPT-3.5 works great for 45% of requests
  • GPT-4 only needed for complex tasks
  • User satisfaction barely changed

Create a test suite:

// tests/ai/model-comparison.test.ts
describe('Model Quality Comparison', () => {
  const testCases = loadRealTenderRequests(50);

  it('should compare GPT-3.5 vs GPT-4 on simple requests', async () => {
    for (const testCase of testCases.filter(isSimple)) {
      const gpt35 = await generate(testCase, 'gpt-3.5-turbo');
      const gpt4 = await generate(testCase, 'gpt-4-turbo');

      const quality35 = await evaluateQuality(gpt35);
      const quality4 = await evaluateQuality(gpt4);

      console.log({
        quality35,
        quality4,
        costDiff: calculateCost(gpt4) - calculateCost(gpt35),
        qualityDiff: quality4 - quality35,
      });
    }
  });
});

3. Prompt Engineering Matters More Than Model Selection

We got bigger wins from better prompts than from better models.

Bad prompt (high cost, mediocre quality):

You are an expert tender response writer with 20 years of experience in South African government procurement. You have deep knowledge of PFMA, MFMA, PPPFA regulations, and all relevant compliance requirements. You understand the nuances of different government departments and their evaluation criteria...
[3,000 more tokens of background]

Good prompt (low cost, high quality):

SA tender expert. Generate ${type}.
Must: SBD format, professional tone, 1-2 pages.
Context: ${industry}

The second prompt is 30x shorter and produces comparable quality.

4. User Feedback > Your Assumptions

We thought users wanted longer, more detailed responses. They actually wanted shorter, more scannable content.

We discovered this by asking—not assuming:

async function collectFeedback(generationId: string) {
  return await showUserSurvey({
    questions: [
      'How would you rate the quality?',
      'Was the length appropriate?',
      'Did it address your needs?',
    ],
  });
}

5. Measure Everything

You can't optimize what you don't measure. We track:

interface GenerationMetrics {
  requestId: string;
  model: string;
  promptVersion: string;
  inputTokens: number;
  outputTokens: number;
  cost: number;
  latency: number;
  cacheHit: boolean;
  complexity: string;
  userRating?: number;
  regenerationCount: number;
}

await analytics.track('generation', metrics);

This data drove every optimization decision.
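A typical question we ask of this data, cost and rating per prompt version, is a single aggregation. A sketch using Prisma's groupBy, assuming the metrics land in the same generation table used elsewhere in this post and that its columns mirror GenerationMetrics:

// Sketch: average cost, rating, and latency per prompt version from the tracked metrics.
// Assumes a `generation` table whose columns mirror GenerationMetrics.
const byVersion = await db.generation.groupBy({
  by: ['promptVersion'],
  _avg: { cost: true, userRating: true, latency: true },
  _count: { _all: true },
});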

What's Next: Future Optimizations

We're not done. Here's what we're working on:

1. Fine-Tuned Models

Instead of using general-purpose GPT models, we're fine-tuning on our domain:

// Fine-tuning on our successful generations
const trainingData = await db.generation.findMany({
  where: { userRating: { gte: 4 } },
  select: { input: true, output: true },
  take: 10000,
});

// Upload the prepared JSONL file, then start the fine-tuning job
const trainingFile = await openai.files.create({
  file: prepareTrainingFile(trainingData), // converts rows to chat-format JSONL
  purpose: 'fine-tune',
});

const fineTuned = await openai.fineTuning.jobs.create({
  model: 'gpt-3.5-turbo',
  training_file: trainingFile.id,
});

Expected: 20-30% cost reduction with comparable quality.

2. Local Model for Simple Tasks

For the simplest tasks (cover letters for short tenders), we're experimenting with locally-hosted models (see the sketch after this list):

  • Mistral 7B for simple generation
  • Zero API costs
  • Higher latency (acceptable for async jobs)
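Since most local runtimes expose an OpenAI-compatible endpoint, the swap can be as small as pointing the existing client at a different base URL. A sketch assuming an Ollama-style server on localhost; the URL, model name, and placeholder API key are assumptions about a typical local setup, not our production config:

import OpenAI from 'openai';

// Sketch: reuse the OpenAI SDK against a local, OpenAI-compatible endpoint.
// Base URL and model name are placeholders for a typical Ollama-style setup.
const localClient = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'local', // ignored by local runtimes, but required by the SDK
});

async function generateLocally(prompt: string): Promise<string> {
  const completion = await localClient.chat.completions.create({
    model: 'mistral',
    messages: [{ role: 'user', content: prompt }],
  });

  return completion.choices[0].message.content ?? '';
}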

3. Streaming Responses

Better UX and perceived performance:

async function* streamProposal(request: ProposalRequest) {
  // model and messages come from the same routing + prompt-version logic shown earlier
  const stream = await openai.chat.completions.create({
    model,
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    yield chunk.choices[0]?.delta?.content || '';
  }
}

4. Intelligent Pre-fetching

Predict likely regenerations and pre-generate alternatives:

async function generateWithVariations(request: ProposalRequest) {
  const [base, formal, casual] = await Promise.all([
    generate(request, { tone: 'balanced' }),
    generate(request, { tone: 'formal' }),
    generate(request, { tone: 'casual' }),
  ]);

  // Cache all versions
  // User can switch tone without regenerating
}

Tools and Resources

Here are the tools that made optimization possible:

Cost Tracking:

  • OpenAI usage dashboard (built-in)
  • Custom analytics with Mixpanel
  • Real-time cost alerts via Slack (sketched below)
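The Slack alerts are just an incoming-webhook call when spend crosses a threshold. A minimal sketch; the webhook env var and the $150/day threshold are placeholders, not our real configuration:

// Sketch: ping Slack when the day's AI spend crosses a threshold.
// SLACK_WEBHOOK_URL and the threshold are placeholders.
async function checkCostAlert(dailySpendUsd: number): Promise<void> {
  const THRESHOLD_USD = 150;
  if (dailySpendUsd < THRESHOLD_USD) return;

  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `AI spend today is $${dailySpendUsd.toFixed(2)} (threshold: $${THRESHOLD_USD})`,
    }),
  });
}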

Quality Evaluation:

// Automated quality scoring
import { HumanEval } from '@anthropic-ai/eval';

async function evaluateQuality(output: string): Promise<number> {
  const criteria = [
    'addresses_requirements',
    'professional_tone',
    'appropriate_length',
    'sbd_compliance',
  ];

  const scores = await Promise.all(
    criteria.map(c => HumanEval.score(output, c))
  );

  return scores.reduce((a, b) => a + b) / scores.length;
}

Prompt Management:

  • Version control in Git
  • Feature flags for A/B testing
  • Automated performance tracking

Conclusion: AI Can Be Profitable

When we started, AI features seemed like a necessary loss leader. We'd subsidize them with other revenue and hope to break even eventually.

But with systematic optimization:

  • 73% cost reduction
  • 10x volume growth
  • +84% margins

AI features can be profitable and high-quality.

The key is treating AI like any other infrastructure: measure, optimize, iterate.

Final cost comparison:

  • Naive implementation: $0.112/request
  • Optimized implementation: $0.030/request
  • 73% reduction while improving quality from 4.3/5 to 4.6/5

If you're building AI features and struggling with costs, start with these strategies:

  1. Cache aggressively (42% savings for us)
  2. Compress your prompts (27% savings)
  3. Route to cheaper models when possible (35% savings)
  4. Prevent unnecessary regenerations (12% savings)
  5. A/B test everything (improved quality)

The math works. AI can be profitable.


Questions? Want to share your AI optimization strategies?

Drop them in the comments! I'm always learning better approaches.

Follow for more articles on:

  • Building cost-effective AI features
  • Production LLM best practices
  • Scaling developer tools
  • South African tech startup journey

Check out our platform: tenders-sa.org


Currently building Tenders SA - South Africa's AI-powered government tender platform. We generate 2,500+ AI proposals monthly at profitable margins.
