Web Reader

LIVE

Multi-source article aggregator and analyzer. RSS scraping, auto-translation to English, deduplication via hash, JSONL output with HTML dashboard.

Key Numbers

At a Glance

Sources: Multi-Feed RSS

Translation: Auto-Detect

Deduplication: Hash-Based

Output: HTML + JSONL

Overview

About This Project

A multi-source article aggregation and analysis platform that automatically discovers, extracts, translates, and deduplicates content from diverse RSS feeds and web sources. The system monitors dozens of feeds across financial news, crypto markets, technology, and geopolitics, producing a unified, searchable corpus of translated articles.

The translation pipeline automatically detects source language and translates non-English content, enabling monitoring of international news sources that would otherwise be inaccessible. Hash-based deduplication prevents the same story from appearing multiple times when covered by multiple outlets.

The output is a clean HTML dashboard that presents articles in a scannable format with source attribution, publication timestamps, and relevance scoring. The underlying data is also available in JSONL format for programmatic consumption by other systems.
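As a rough illustration of the JSONL side of the output, the sketch below writes one JSON object per line and reads it back. The field names (`source`, `title`, `published`, `relevance`) are assumptions chosen for the example, not the project's actual schema.

```python
import json
from io import StringIO


def write_jsonl(articles, fh):
    """Write one JSON object per line.

    ensure_ascii=False keeps non-English titles readable in the file.
    """
    for article in articles:
        fh.write(json.dumps(article, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Yield one dict per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the real schema may differ.
records = [
    {"source": "Example Wire", "title": "Märkte im Fokus",
     "published": "2024-01-01T00:00:00Z", "relevance": 0.8},
    {"source": "Example Tech", "title": "Chip supply update",
     "published": "2024-01-02T00:00:00Z", "relevance": 0.5},
]

buffer = StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
loaded = list(read_jsonl(buffer))
```

Because each line is an independent JSON document, a downstream consumer can stream the file and append new articles without rewriting earlier ones.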

Features

What It Does

Feed Discovery & Extraction

Automatic RSS feed parsing with content extraction that handles diverse feed formats, paywalled snippets, and varying HTML structures across publishers.
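A minimal sketch of the feed parsing step, assuming plain RSS 2.0 input and using only the standard library. The project lists BeautifulSoup in its stack, so its real extraction layer likely differs; `parse_rss` and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; the second item has no description,
# which is common for paywalled or teaser-only publishers.
SAMPLE_RSS = (
    '<rss version="2.0"><channel>'
    "<title>Example Feed</title>"
    "<item><title>First story</title>"
    "<link>https://example.com/a</link>"
    "<pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>"
    "<description>Short teaser</description></item>"
    "<item><title>No description</title>"
    "<link>https://example.com/b</link></item>"
    "</channel></rss>"
)


def parse_rss(xml_text):
    """Return a list of dicts, one per <item>, tolerating missing fields."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": item.findtext("pubDate"),  # None when absent
            "summary": (item.findtext("description") or "").strip(),
        })
    return items


items = parse_rss(SAMPLE_RSS)
```

Missing elements become empty strings or `None` rather than raising, so one incomplete item does not abort the whole feed.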

Language Detection & Translation

Automatic source language identification with translation to English, enabling monitoring of international financial and crypto news sources.
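The detect-then-translate flow might look roughly like the sketch below. The page does not name the detection or translation backend, so both are passed in as callables; `to_english`, `fake_detect`, and `fake_translate` are hypothetical names used only for illustration.

```python
def to_english(article, detect, translate):
    """Return a copy of `article` whose "text" is English.

    `detect(text) -> str` returns a language code and
    `translate(text, source_lang) -> str` returns English text. Both are
    injected so any detection or translation backend can be plugged in.
    """
    lang = detect(article["text"])
    if lang == "en":
        return {**article, "lang": "en", "translated": False}
    return {
        **article,
        "lang": lang,
        "original_text": article["text"],  # keep the source for auditing
        "text": translate(article["text"], lang),
        "translated": True,
    }


# Stand-in backends for demonstration only; a real pipeline would call an
# actual language-identification model and translation service here.
def fake_detect(text):
    return "de" if "ß" in text or "ü" in text else "en"


def fake_translate(text, source_lang):
    return f"[translated from {source_lang}] {text}"


german = to_english({"text": "Zinsen für Anleger"}, fake_detect, fake_translate)
english = to_english({"text": "Rates rise"}, fake_detect, fake_translate)
```

Keeping the original text alongside the translation makes it possible to re-translate later or to check suspicious output against the source.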

Hash-Based Deduplication

Content fingerprinting prevents duplicate articles from appearing when the same story is covered by multiple outlets or syndicated across feeds.
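One way to sketch hash-based fingerprinting is to normalize the text and hash the result. The normalization rules and the choice of SHA-256 are assumptions; the page does not say which hash or which fields the project uses.

```python
import hashlib
import re
import unicodedata


def fingerprint(title, body):
    """Hash a normalized form of the article so trivial differences
    (case, punctuation, spacing) map to the same key."""
    text = unicodedata.normalize("NFKC", f"{title}\n{body}").lower()
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(articles):
    """Keep the first article seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        key = fingerprint(article["title"], article["body"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


# Hypothetical articles: the first two differ only in case,
# spacing, and punctuation.
articles = [
    {"title": "Fed Raises Rates!", "body": "The central bank acted."},
    {"title": "fed raises  rates", "body": "The central bank acted"},
    {"title": "Chip supply update", "body": "Foundries expand capacity."},
]
unique = deduplicate(articles)
```

An exact hash like this only collapses verbatim or near-verbatim copies, such as syndicated stories. Catching independently written coverage of the same event would need a fuzzier fingerprint, which this sketch does not attempt.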

HTML Dashboard Generation

Clean, scannable dashboard presenting articles with source attribution, timestamps, and relevance scoring for rapid information consumption.
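A dashboard of this kind could be generated with a small template, as in the sketch below. The table layout, the sort by relevance, and the field names are assumptions; the key point is that feed content is escaped before it reaches the page.

```python
from html import escape

PAGE = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    "<title>Web Reader</title></head><body><table>"
    "<tr><th>Score</th><th>Title</th><th>Source</th><th>Published</th></tr>"
    "{rows}</table></body></html>"
)
ROW = (
    "<tr><td>{score:.2f}</td>"
    '<td><a href="{link}">{title}</a></td>'
    "<td>{source}</td><td>{published}</td></tr>"
)


def render_dashboard(articles):
    """Render articles as an HTML table, highest relevance first."""
    ordered = sorted(articles, key=lambda a: a["relevance"], reverse=True)
    rows = "".join(
        ROW.format(
            score=a["relevance"],
            link=escape(a["link"], quote=True),
            title=escape(a["title"]),  # feed text is untrusted input
            source=escape(a["source"]),
            published=escape(str(a["published"])),
        )
        for a in ordered
    )
    return PAGE.format(rows=rows)


# Hypothetical articles, one with markup in its title.
sample = [
    {"title": "Low priority", "link": "https://example.com/l",
     "source": "Example Wire", "published": "2024-01-01", "relevance": 0.2},
    {"title": "<script>High</script>", "link": "https://example.com/h",
     "source": "Example Tech", "published": "2024-01-02", "relevance": 0.9},
]
page = render_dashboard(sample)
```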

Architecture

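How It Works

Articles pass through four stages: feed parsing and content extraction, language detection and translation to English, hash-based deduplication, and output as JSONL plus an HTML dashboard.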

Challenges

What Made This Hard

RSS feeds are notoriously inconsistent in format, encoding, and content completeness. Building a robust extraction layer that handles partial content, encoding issues, malformed XML, and the wide variety of HTML structures across publishers required extensive edge-case handling. Translation quality varies significantly by source language and domain, requiring post-processing heuristics to catch common translation artifacts.
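Two of the edge cases named above, encoding issues and malformed XML, can be handled defensively along the lines of this sketch. The fallback encoding list and the decision to skip unparseable feeds are assumptions, not the project's documented behavior.

```python
import xml.etree.ElementTree as ET


def decode_feed(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn; fall back to lossy UTF-8 decoding."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def safe_parse(raw):
    """Return the parsed root element, or None if the feed is malformed."""
    text = decode_feed(raw).lstrip("\ufeff").strip()  # drop BOM and padding
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None  # caller can log the feed and move on


good = safe_parse(b"\xef\xbb\xbf<rss><channel></channel></rss>")
bad = safe_parse(b"<rss><channel><item></rss>")
legacy = decode_feed("café".encode("cp1252"))
```

Returning `None` instead of raising lets a batch run continue past one broken feed while still making the failure visible to the caller.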

Stack

Tech Stack

Python, RSS, BeautifulSoup, NLP
