
slick phantom


Telegem's first public plugin

Extract PDFs, JSON & HTML in Your Telegram Bot with Telegem's New Plugin

Building a Telegram bot that processes documents? Stop wrestling with file parsing. Telegem's new FileExtractor plugin lets you extract text from PDFs, parse JSON, and process HTML files in 3 lines of Ruby.

🚀 What This Solves

Before: Your users send PDF or JSON files → You write 50+ lines of parsing code, install dependencies, and handle edge cases.

After: Your users send files → You call one method → You get clean extracted text.

📦 Installation

# In your Gemfile
gem 'telegem'

# Install the optional dependency for PDF support
gem 'pdf-reader'

🎯 Real-World Examples

  1. PDF Invoice Processor
bot.command('invoice') do |ctx|
  if ctx.message.document&.mime_type == 'application/pdf'
    extractor = Telegem::Plugins::FileExtractor.new(
      bot, 
      ctx.message.document.file_id
    )

    result = extractor.extract

    if result[:success]
      # Find amounts in invoice text
      amounts = result[:content].scan(/\$\d+\.\d{2}/)
      ctx.reply("📊 Found #{amounts.size} payment amounts")
    end
  end
end
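One thing to watch in the invoice example: the `\$\d+\.\d{2}` pattern skips amounts written with thousands separators, such as `$1,234.56`. Here is a minimal sketch of a more forgiving matcher that also converts each hit to integer cents so totals can be compared safely. `AMOUNT_PATTERN` and `extract_amounts_in_cents` are hypothetical names for illustration, not part of Telegem, and the sample text is made up.

```ruby
# Matches "$12.50" as well as "$1,234.56" (comma thousands separators).
AMOUNT_PATTERN = /\$(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})/

# Returns every dollar amount found in +text+ as an Integer number of cents.
def extract_amounts_in_cents(text)
  text.scan(AMOUNT_PATTERN).map do |whole, cents|
    whole.delete(',').to_i * 100 + cents.to_i
  end
end

sample = "Subtotal: $1,234.56\nShipping: $12.50\nTotal: $1,247.06"
amounts = extract_amounts_in_cents(sample)
puts "Found #{amounts.size} amounts, largest is #{amounts.max} cents"
```

In a handler you would pass `result[:content]` in place of `sample`.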
  2. JSON Config Validator
bot.on(:message, document: true) do |ctx|
  if ctx.message.document.file_name.end_with?('.json')
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id
    )

    config = extractor.extract

    if config[:success]
      ctx.reply("✅ Valid JSON with #{config[:content].keys.size} keys")
    else
      ctx.reply("❌ Invalid JSON: #{config[:error]}")
    end
  end
end
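The validator calls `.keys` on the content, which only works when the top-level JSON value is an object. A valid JSON file can also be an array or a bare value. Assuming `config[:content]` holds the parsed JSON (as the example implies), a small helper like the one below avoids a `NoMethodError` on those inputs. `describe_json` is a hypothetical name, not a Telegem method.

```ruby
require 'json'

# Hypothetical helper: summarises parsed JSON without assuming it is a Hash.
def describe_json(parsed)
  case parsed
  when Hash  then "object with #{parsed.size} keys"
  when Array then "array with #{parsed.size} items"
  else            "scalar value (#{parsed.inspect})"
  end
end

puts describe_json(JSON.parse('{"name": "bot", "debug": true}'))
puts describe_json(JSON.parse('[1, 2, 3]'))
```

The reply then becomes something like `ctx.reply("Valid JSON: #{describe_json(config[:content])}")`.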
  3. HTML to Markdown Converter
bot.command('html') do |ctx|
  if ctx.message.document&.mime_type == 'text/html'
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id
    )

    html = extractor.extract

    if html[:success]
      # Convert HTML to plain text (simplified)
      text = html[:content]
      ctx.reply("📝 Extracted #{text.length} characters")
    end
  end
end
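The "convert to plain text" step is left as a comment in the example. If `html[:content]` still contains markup, a rough way to strip it with nothing but core Ruby looks like this. It is a naive sketch, not what the plugin does internally: `html_to_text` and `BASIC_ENTITIES` are hypothetical names, only a handful of entities are decoded, and a real HTML parser is the better choice for messy pages.

```ruby
# Entities handled by this sketch; real HTML has many more.
BASIC_ENTITIES = {
  'amp' => '&', 'lt' => '<', 'gt' => '>',
  'quot' => '"', '#39' => "'", 'nbsp' => ' '
}.freeze

# Naive tag stripper: fine for simple pages, not a substitute for a parser.
def html_to_text(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, ' ') # drop non-visible blocks
  text = text.gsub(/<[^>]+>/, ' ')                        # drop remaining tags
  text = text.gsub(/&(amp|lt|gt|quot|#39|nbsp);/) do
    BASIC_ENTITIES[Regexp.last_match(1)]                  # decode in a single pass
  end
  text.gsub(/\s+/, ' ').strip                             # collapse whitespace
end

puts html_to_text('<h1>Invoice</h1><p>Total &amp; tax: <b>$12.50</b></p>')
```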

🔧 How It Works Under the Hood

The plugin handles the tedious parts for you:

  1. Downloads the file from Telegram's servers
  2. Auto-detects the file type
  3. Processes it with the appropriate library (PDF::Reader for PDFs, JSON.parse for JSON)
  4. Cleans up temp files automatically
  5. Returns a consistent hash format:
{
  success: true,
  content: "Extracted text here...",
  pages: 3,           # PDF only
  file_size: 45210    # All file types
}
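To make the five steps concrete, here is a stripped-down sketch of the same flow using only Ruby's standard library. This is not Telegem's actual implementation: `detect_type`, `extract_from_bytes`, and `UnsupportedTypeError` are hypothetical names, the Telegram download is replaced by bytes you pass in, and the PDF branch is left out because it needs the pdf-reader gem.

```ruby
require 'json'
require 'tempfile'

UnsupportedTypeError = Class.new(StandardError)

# Hypothetical type detection: PDF magic bytes first, then the file extension.
def detect_type(bytes, file_name)
  return :pdf if bytes.start_with?('%PDF-')

  case File.extname(file_name).downcase
  when '.json'         then :json
  when '.html', '.htm' then :html
  else                      :text
  end
end

# Hypothetical pipeline: write to a temp file, dispatch on type, clean up,
# and return the same hash shape on success and on failure.
def extract_from_bytes(bytes, file_name)
  Tempfile.create(['upload', File.extname(file_name)]) do |file|
    file.binmode
    file.write(bytes)
    file.flush

    raw  = File.binread(file.path)
    text = raw.dup.force_encoding('UTF-8')
    content =
      case detect_type(raw, file_name)
      when :json then JSON.parse(text)
      when :pdf  then raise UnsupportedTypeError, 'PDF parsing needs the pdf-reader gem'
      else            text
      end

    { success: true, content: content, file_size: raw.bytesize }
  end # Tempfile.create removes the file when the block exits
rescue JSON::ParserError, UnsupportedTypeError => e
  { success: false, error: e.message }
end

p extract_from_bytes('{"ok": true}', 'config.json')
p extract_from_bytes('{broken', 'config.json')
```

Returning a hash on failure instead of raising is what lets handlers branch on `result[:success]` the way the examples in this post do.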

⚠️ Important Security Notes

# ✅ SAFE - Use only Telegram-generated file_ids
extractor = Telegem::Plugins::FileExtractor.new(
  bot,
  ctx.message.document.file_id  # From Telegram context
)

# ❌ DANGEROUS - Never use user input
extractor = Telegem::Plugins::FileExtractor.new(
  bot,
  params[:user_input]  # Attacker-controlled: never let user input choose the file
)
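One way to enforce this rule in a single place is a small guard that only ever reads the `file_id` from the message's document and rejects unexpected types or oversized files before anything is downloaded. This is a sketch under assumptions: `vetted_file_id`, `ALLOWED_MIME_TYPES`, and `MAX_BYTES` are hypothetical names, and `FakeDocument` is a stand-in for the document object so the snippet runs on its own. The `file_id`, `mime_type`, and `file_size` fields mirror Telegram's Bot API `Document` object.

```ruby
# Stand-in for the document object on a Telegram message (fields per the Bot API).
FakeDocument = Struct.new(:file_id, :mime_type, :file_size, keyword_init: true)

ALLOWED_MIME_TYPES = %w[application/pdf application/json text/html].freeze
MAX_BYTES = 20 * 1024 * 1024 # Bot API download limit at the time of writing

# Hypothetical guard: returns the file_id only when the document passes the checks.
def vetted_file_id(document)
  return nil if document.nil?
  return nil unless ALLOWED_MIME_TYPES.include?(document.mime_type)
  return nil if document.file_size.to_i > MAX_BYTES

  document.file_id
end

doc = FakeDocument.new(file_id: 'abc123', mime_type: 'application/pdf', file_size: 45_210)
puts vetted_file_id(doc).inspect
```

In a handler this would read `file_id = vetted_file_id(ctx.message.document)`, followed by an early return when it is `nil`.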

🎨 Advanced: Processing Replies

# Extract from replied-to PDFs
bot.command('extract') do |ctx|
  if ctx.message.reply_to_message&.document
    file_id = ctx.message.reply_to_message.document.file_id

    extractor = Telegem::Plugins::FileExtractor.new(bot, file_id, file_type: :pdf)
    result = extractor.extract_pdf

    ctx.reply(result[:success] ? "✅ Done" : "❌ Failed: #{result[:error]}")
  end
end
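If you want to send the extracted text back instead of just "Done", keep in mind that Telegram caps a text message at 4096 characters, so long PDFs need to be split. A minimal sketch, with `chunk_text` and `TELEGRAM_MESSAGE_LIMIT` as hypothetical names of my own:

```ruby
TELEGRAM_MESSAGE_LIMIT = 4096

# Splits +text+ into consecutive pieces of at most +limit+ characters.
def chunk_text(text, limit = TELEGRAM_MESSAGE_LIMIT)
  text.scan(/.{1,#{limit}}/m)
end

chunks = chunk_text('a' * 5000)
puts chunks.map(&:size).inspect
```

Each chunk can then go out with its own `ctx.reply(chunk)`. This cuts mid-word and counts Ruby characters, so pass a smaller limit if the text is emoji-heavy or you want some headroom.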

📈 Why This Matters

Most bot frameworks make you handle file parsing manually. Telegem's approach:

· Reduces boilerplate from 50+ lines to 3
· Handles edge cases (encrypted PDFs, malformed JSON)
· Auto-cleans temp files (nothing left behind on disk)
· Works seamlessly with Telegem's async architecture

🚀 Get Started

# Install the gem
gem install telegem

# Check out the full example
git clone https://gitlab.com/ruby-telegem/telegem-examples

💬 Your Turn

What document processing tasks are you building with Telegram bots? Have you tried Telegem's new plugin? Share your use cases below!


Telegem is a modern, async-first Telegram Bot framework for Ruby. Built with ❤️ by @slick_phantom.
