Web Reader
LIVE
Multi-source article aggregator and analyzer. RSS scraping, auto-translation to English, deduplication via hash, JSONL output with HTML dashboard.
Key Numbers
At a Glance
Sources: Multi-Feed RSS
Translation: Auto-Detect
Deduplication: Hash-Based
Output: HTML + JSONL
Overview
About This Project
A multi-source article aggregation and analysis platform that automatically discovers, extracts, translates, and deduplicates content from diverse RSS feeds and web sources. The system monitors dozens of feeds across financial news, crypto markets, technology, and geopolitics, producing a unified, searchable corpus of translated articles.
The translation pipeline detects each article's source language and translates non-English content into English, opening up international news sources that would otherwise be inaccessible. Hash-based deduplication keeps a story from appearing more than once when several outlets carry it.
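A minimal sketch of how hash-based deduplication along these lines could work. The normalization steps, the choice of SHA-256, and the `title`/`body` field names are assumptions for illustration, not details taken from the project.

```python
import hashlib
import re
import unicodedata


def fingerprint(title: str, body: str) -> str:
    """Return a SHA-256 hex digest of the normalized title and body."""
    text = unicodedata.normalize("NFKC", f"{title}\n{body}").casefold()
    # Drop punctuation, then collapse whitespace, so trivial formatting
    # differences between feeds produce the same hash.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(articles):
    """Keep the first article seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        key = fingerprint(article["title"], article["body"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

An exact hash like this only matches articles whose text is identical after normalization, which suits syndicated copies. Independently written coverage of the same story would need a fuzzier fingerprint.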
The output is a clean HTML dashboard that presents articles in a scannable format with source attribution, publication timestamps, and relevance scoring. The underlying data is also available in JSONL format for programmatic consumption by other systems.
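For programmatic consumers, the JSONL side could look roughly like the sketch below: one JSON object per line, written and read with the standard library. The function names and any record fields a caller passes in are hypothetical; the project's actual schema is not shown here.

```python
import json


def write_jsonl(path, records):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load every non-blank line of a JSONL file as a dict."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each line is an independent JSON document, a downstream system can stream or tail the file without parsing it whole.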
Features
What It Does
Feed Discovery & Extraction
Automatic RSS feed parsing with content extraction that handles diverse feed formats, paywalled snippets, and varying HTML structures across publishers.
Language Detection & Translation
Automatic source language identification with translation to English, enabling monitoring of international financial and crypto news sources.
Hash-Based Deduplication
Content fingerprinting prevents duplicate articles from appearing when the same story is covered by multiple outlets or syndicated across feeds.
HTML Dashboard Generation
Clean, scannable dashboard presenting articles with source attribution, timestamps, and relevance scoring for rapid information consumption.
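The dashboard step described above can be sketched as a small renderer that turns article records into an HTML table. The field names (`title`, `url`, `source`, `published`, `relevance`), the table layout, and sorting by relevance are assumptions for illustration.

```python
from html import escape


def render_row(article):
    """Render one article as a table row, escaping every text field."""
    title = escape(article["title"])
    url = escape(article["url"], quote=True)
    source = escape(article["source"])
    published = escape(article["published"])
    relevance = float(article.get("relevance", 0.0))
    return (
        f'<tr><td><a href="{url}">{title}</a></td>'
        f"<td>{source}</td><td>{published}</td><td>{relevance:.2f}</td></tr>"
    )


def render_dashboard(articles):
    """Return a complete HTML page with the highest-relevance articles first."""
    ordered = sorted(articles, key=lambda a: a.get("relevance", 0.0), reverse=True)
    rows = "\n".join(render_row(a) for a in ordered)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Web Reader</title></head><body>\n'
        "<table>\n"
        "<tr><th>Article</th><th>Source</th><th>Published</th><th>Relevance</th></tr>\n"
        f"{rows}\n"
        "</table>\n</body></html>\n"
    )
```

Escaping matters here because titles and URLs come from third-party feeds and may contain markup.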
Architecture
How It Works
Challenges
What Made This Hard
RSS feeds are notoriously inconsistent in format, encoding, and content completeness. Building a robust extraction layer that handles partial content, encoding issues, malformed XML, and the wide variety of HTML structures across publishers required extensive edge-case handling. Translation quality varies significantly by source language and domain, requiring post-processing heuristics to catch common translation artifacts.
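One way to approach the malformed-XML and encoding problems is a parser that retries once after cleaning the input and gives up quietly if that also fails. This is a sketch under stated assumptions: it handles only RSS 2.0 `item` elements, uses the standard library's `xml.etree.ElementTree`, and the `parse_feed` name and retry strategy are illustrative rather than the project's actual extraction layer.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that are not allowed in XML 1.0 documents.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# A leading BOM and XML declaration, removed before re-parsing decoded text.
_XML_DECL = re.compile(r"^\ufeff?\s*<\?xml.*?\?>", re.DOTALL)


def parse_feed(raw: bytes):
    """Return a list of item dicts, or an empty list if the feed is unusable."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        # Retry after a lossy UTF-8 decode and removal of illegal characters.
        text = raw.decode("utf-8", errors="replace")
        text = _CONTROL_CHARS.sub("", _XML_DECL.sub("", text))
        try:
            root = ET.fromstring(text)
        except ET.ParseError:
            return []

    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "summary": item.findtext("description", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return items
```

Missing fields come back as empty strings, so later stages can decide whether a partial item is worth keeping.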
Stack