Extract PDFs, JSON & HTML in Your Telegram Bot with Telegem's New Plugin
Building a Telegram bot that processes documents? Stop wrestling with file parsing. Telegem's new FileExtractor plugin lets you extract text from PDFs, parse JSON, and process HTML files in 3 lines of Ruby.
What This Solves
Before: Your users send PDF or JSON files → you write 50+ lines of parsing code, install dependencies, and handle edge cases.
After: Your users send files → you call one method → you get clean extracted text.
Installation
# In your Gemfile
gem 'telegem'
# Install the optional dependency for PDF support
gem 'pdf-reader'
Real-World Examples
- PDF Invoice Processor
bot.command('invoice') do |ctx|
  if ctx.message.document&.mime_type == 'application/pdf'
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id
    )
    result = extractor.extract
    if result[:success]
      # Find amounts in invoice text
      amounts = result[:content].scan(/\$\d+\.\d{2}/)
      ctx.reply("Found #{amounts.size} payment amounts")
    end
  end
end
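Note that the pattern above only matches amounts written like $12.50; a figure such as $1,234.56 is skipped because of the thousands separator. Here is a minimal sketch of a more forgiving scan that also totals what it finds, using only Ruby's standard library. The helper name `invoice_amounts` and the constant `AMOUNT_PATTERN` are hypothetical and not part of Telegem:

```ruby
require 'bigdecimal'

# Hypothetical pattern: dollar amounts with or without thousands separators.
AMOUNT_PATTERN = /\$(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})/

# Hypothetical helper (not part of Telegem): returns every dollar amount
# found in the text as a BigDecimal, so sums stay exact.
def invoice_amounts(text)
  text.scan(AMOUNT_PATTERN).map do |whole, cents|
    BigDecimal("#{whole.delete(',')}.#{cents}")
  end
end

text = "Subtotal: $1,234.56\nShipping: $12.50\nTotal: $1,247.06"
amounts = invoice_amounts(text)
puts amounts.size            # number of amounts found
puts amounts.sum.to_s('F')   # sum of all matches as a plain decimal string
```

BigDecimal is used instead of Float so that adding currency values does not pick up rounding errors.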
- JSON Config Validator
bot.on(:message, document: true) do |ctx|
  if ctx.message.document.file_name.end_with?('.json')
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id
    )
    config = extractor.extract
    if config[:success]
      ctx.reply("✅ Valid JSON with #{config[:content].keys.size} keys")
    else
      ctx.reply("❌ Invalid JSON: #{config[:error]}")
    end
  end
end
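One caveat: valid JSON is not always an object. A file whose top level is an array parses fine, but calling `.keys` on it raises NoMethodError. Below is a small sketch of a summary helper that handles both shapes. It assumes `:content` holds the value produced by JSON.parse (the example above implies this, but it is an assumption), and `describe_json` is a hypothetical name:

```ruby
require 'json'

# Hypothetical helper (not part of Telegem): summarizes a parsed JSON value
# without assuming the top level is an object.
def describe_json(value)
  case value
  when Hash  then "object with #{value.size} keys"
  when Array then "array with #{value.size} items"
  else            "scalar value (#{value.inspect})"
  end
end

puts describe_json(JSON.parse('{"name": "bot", "debug": true}'))
puts describe_json(JSON.parse('[1, 2, 3]'))
```

The returned string could be passed straight to `ctx.reply` in place of the `.keys.size` interpolation.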
- HTML Text Extractor
bot.command('html') do |ctx|
  if ctx.message.document&.mime_type == 'text/html'
    extractor = Telegem::Plugins::FileExtractor.new(
      bot,
      ctx.message.document.file_id
    )
    html = extractor.extract
    if html[:success]
      # Convert HTML to plain text (simplified)
      text = html[:content]
      ctx.reply("Extracted #{text.length} characters")
    end
  end
end
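If the extracted content still contains markup, a simplified HTML-to-text pass might look like the sketch below. It uses only the standard library, and `html_to_text` is a hypothetical helper rather than something the plugin is known to expose. A real HTML parser such as Nokogiri is more robust on malformed markup:

```ruby
require 'cgi'

# Hypothetical helper (not part of Telegem): a regex-based HTML-to-text pass.
def html_to_text(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, ' ') # drop script/style blocks
  text = text.gsub(/<[^>]+>/, ' ')                        # strip remaining tags
  text = CGI.unescapeHTML(text)                           # decode &amp;, &lt;, ...
  text.gsub(/\s+/, ' ').strip                             # collapse whitespace
end

puts html_to_text('<h1>Hello</h1><p>Tom &amp; Jerry</p><script>alert(1)</script>')
# prints "Hello Tom & Jerry"
```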
How It Works Under the Hood
The plugin handles the tedious parts for you:
- Downloads the file from Telegram's servers
- Auto-detects the file type
- Processes it with the appropriate library (PDF::Reader for PDFs, JSON.parse for JSON)
- Cleans up temp files automatically
- Returns a consistent hash format:
{
  success: true,
  content: "Extracted text here...",
  pages: 3,         # PDF only
  file_size: 45210  # All file types
}
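To make those steps concrete, here is a rough sketch of what such a pipeline could look like. This is not Telegem's actual implementation: `extract_bytes` is a hypothetical function, the `bytes` argument stands in for the file downloaded from Telegram, and detection by file extension is an assumption made to keep the example short.

```ruby
require 'json'
require 'tempfile'

# Hypothetical stand-in for the plugin's pipeline, NOT Telegem's real code.
def extract_bytes(bytes, file_name)
  ext = File.extname(file_name).downcase # naive type detection by extension
  Tempfile.create(['upload', ext]) do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush

    result =
      case ext
      when '.json'
        { success: true, content: JSON.parse(File.read(tmp.path)) }
      when '.pdf'
        require 'pdf-reader' # optional dependency, loaded only when needed
        reader = PDF::Reader.new(tmp.path)
        { success: true,
          content: reader.pages.map(&:text).join("\n"),
          pages: reader.page_count }
      else
        { success: true, content: File.read(tmp.path) }
      end
    result.merge(file_size: bytes.bytesize)
  end # Tempfile.create removes the temp file when the block exits
rescue StandardError, LoadError => e
  # Parse failures and a missing pdf-reader gem both become an error hash
  { success: false, error: e.message }
end

p extract_bytes('{"name": "demo"}', 'config.json')
p extract_bytes('{broken', 'config.json')[:success]
```

Using the block form of `Tempfile.create` means the temp file is deleted even when parsing raises, which is one way to get the automatic cleanup described above.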
⚠️ Important Security Notes
# ✅ SAFE - Use only Telegram-generated file_ids
extractor = Telegem::Plugins::FileExtractor.new(
  bot,
  ctx.message.document.file_id # From Telegram context
)
# ❌ DANGEROUS - Never pass raw user input as a file_id
extractor = Telegem::Plugins::FileExtractor.new(
  bot,
  params[:user_input] # Untrusted: a malicious user controls this value
)
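Even with a trusted file_id, it can be worth checking the document's metadata before downloading anything. Telegram's Document object carries optional `mime_type` and `file_size` fields, and the MIME type is whatever the sender's client reported, so treat it as a hint rather than proof. The sketch below shows one possible allowlist check. The `Doc` struct is a stand-in for `ctx.message.document`, and the helper name and the 5 MB limit are assumptions chosen for illustration:

```ruby
# Stand-in for ctx.message.document; field names mirror Telegram's Document object.
Doc = Struct.new(:file_id, :file_name, :mime_type, :file_size, keyword_init: true)

ALLOWED_MIME_TYPES = %w[application/pdf application/json text/html].freeze
MAX_FILE_SIZE = 5 * 1024 * 1024 # 5 MB, an arbitrary limit for illustration

# Hypothetical guard to run before handing a file_id to the extractor.
def extractable?(doc)
  return false if doc.nil?
  return false unless ALLOWED_MIME_TYPES.include?(doc.mime_type)

  size = doc.file_size
  !size.nil? && size <= MAX_FILE_SIZE
end

pdf = Doc.new(file_id: 'abc', file_name: 'invoice.pdf',
              mime_type: 'application/pdf', file_size: 45_210)
exe = Doc.new(file_id: 'def', file_name: 'setup.exe',
              mime_type: 'application/x-msdownload', file_size: 1_000)
puts extractable?(pdf) # true
puts extractable?(exe) # false
```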
Advanced: Processing Replies
# Extract from replied-to PDFs
bot.command('extract') do |ctx|
  if ctx.message.reply_to_message&.document
    file_id = ctx.message.reply_to_message.document.file_id
    extractor = Telegem::Plugins::FileExtractor.new(bot, file_id, file_type: :pdf)
    result = extractor.extract_pdf
    ctx.reply(result[:success] ? "✅ Done" : "❌ Failed: #{result[:error]}")
  end
end
Why This Matters
Most bot frameworks make you handle file parsing manually. Telegem's approach:
· Reduces boilerplate from 50+ lines to 3
· Handles edge cases (encrypted PDFs, malformed JSON)
· Auto-cleans temp files (no stray files left on disk)
· Works seamlessly with Telegem's async architecture
Get Started
# Install the gem
gem install telegem
# Check out the full example
git clone https://gitlab.com/ruby-telegem/telegem-examples
Your Turn
What document processing tasks are you building with Telegram bots? Have you tried Telegem's new plugin? Share your use cases below!
Telegem is a modern, async-first Telegram Bot framework for Ruby. Built with ❤️ by @slick_phantom.