Building NewsDiff: Tracking How News Changes After Publication

TL;DR

What it does: Polls RSS/Atom/JSON feeds, extracts article content, detects word-level changes, shows inline diffs on a web UI, posts to the fediverse (threaded AP bot) and Bluesky, generates diff card images server-side.

Stack: SvelteKit + Drizzle + BullMQ + Defuddle + Botkit/Fedify + @atproto/api + satori/sharp


The Itch

News articles change after publication. Headlines get softened, paragraphs get rewritten, context gets quietly added or removed. Sometimes it’s a correction. Sometimes it’s something else entirely. Either way, the public rarely notices.

The idea of tracking these changes isn’t new. At least three projects tried before:

  • NewsDiffs (2012) — Built at a Knight-Mozilla hackathon. Full article body diffing with a Django web UI. Tracked NYT, CNN, BBC, Politico, and the Washington Post. Python 2, Django 1.5, per-site HTML parsers that target 2012-era markup. Dead.

  • diffengine (~2017) — Ed Summers took a smarter approach: monitor any RSS feed, use Mozilla’s Readability to extract content automatically. No per-site parsers. Submit every version to the Internet Archive. Selenium for screenshots, Twitter for notifications. Twitter API broke in 2023. Dead.

  • NYTdiff (~2020) — Focused on NYT metadata changes and pioneered the social-first model: generate visual diff images, post them as threaded replies. Bluesky support added later. Still somewhat alive but fragile.

All three are sound concepts whose implementations are dead or fragile. The ideas are right. The code has expired.

I wanted something fresh and easy to maintain.

Choosing the Stack

The goal: take the best ideas from all three, build something that works today, and deploy it on infrastructure I already have.

I run a Cloudron server. That means PostgreSQL, Redis, OIDC auth, and a Docker-based deployment pipeline come for free. The app just needs to fit the mold.

The stack came together fast:

What             Why
SvelteKit        SSR, form actions, API routes: one framework for everything
Drizzle ORM      Type-safe SQL without the ORM overhead
BullMQ           Redis-backed job queue for feed polling and syndication
Defuddle         Newer content extractor (with Mozilla’s Readability as fallback)
jsdiff           Word-level diffing that shows exactly what changed
satori + sharp   Server-side image generation without a browser
Botkit/Fedify    ActivityPub federation: the bot is a native fediverse actor
@atproto/api     Bluesky posting

No Twitter. No Selenium. No per-site parsers. No Python 2.

Day One: The Skeleton

The first commit was a SvelteKit project with five database tables:

feeds → articles → versions → diffs → social_posts

Each feed gets polled on a schedule. Each article gets its content extracted via Readability (later Defuddle). Each new version is compared against the previous one. If something changed, a diff is created. If the diff isn’t “boring” (more on that later), it gets syndicated.
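The per-poll decision described above can be sketched as a small pure function. This is a minimal illustration, not the project's worker code: the outcome labels and the function name are hypothetical, and the boring check is passed in as a parameter.

```typescript
// Hypothetical outcome labels; the real worker may name or handle these differently.
type Outcome = "first-version" | "unchanged" | "boring" | "syndicate";

function classifyVersion(
  previous: string | undefined,
  current: string,
  isBoring: (before: string, after: string) => boolean,
): Outcome {
  if (previous === undefined) return "first-version"; // nothing to compare against yet
  if (previous === current) return "unchanged"; // identical text, no diff
  if (isBoring(previous, current)) return "boring"; // changed, but only noise
  return "syndicate"; // a real change worth posting
}
```

Keeping the decision free of I/O makes it easy to unit-test separately from the queue and the database.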

The feed poller was the first worker. It uses fetchWithUA with a proper User-Agent (some sites block bare fetches), follows redirects, and stores the final URL after any http → https redirects. Articles are upserted so duplicate URLs don’t cause crashes.
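A rough sketch of what a helper like fetchWithUA might look like. The User-Agent string, the return shape, and the injected fetch parameter are all assumptions made for illustration; only the name fetchWithUA comes from the project.

```typescript
// Minimal shapes so the sketch does not depend on one runtime's typings.
interface MinimalResponse {
  ok: boolean;
  status: number;
  url: string; // final URL after redirects, as in the Fetch API
  text(): Promise<string>;
}
type FetchLike = (
  url: string,
  init: { headers: Record<string, string>; redirect: "follow" },
) => Promise<MinimalResponse>;

// Hypothetical User-Agent value.
const USER_AGENT = "NewsDiffBot/0.1 (+https://example.org/about)";

async function fetchWithUA(url: string, fetchImpl: FetchLike) {
  const res = await fetchImpl(url, {
    headers: { "User-Agent": USER_AGENT },
    redirect: "follow", // let the runtime follow http -> https hops
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  // Store finalUrl rather than the feed's URL so later upserts key on one canonical address.
  return { finalUrl: res.url, body: await res.text() };
}
```

In a runtime with a global fetch, passing it as the second argument should work; injecting it also makes the helper testable without network access.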

The content extractor was initially just Readability + linkedom. It worked for most sites. Then we hit Politico.

The Politico Problem

Politico Belgium’s RSS feed points to real article URLs. Readability extracts… the sidebar. A list of article teasers with “5 MINS ago” and “2 mins read” timestamps. Not the article itself.

The “diff” between two polls was just 5 MINS → 1 HR across twenty teasers. Completely useless.

The fix came in two layers:

Defuddle replaced Readability as the primary extractor. It’s a newer library, from the Obsidian Web Clipper project, built to strip sidebars, related articles, and navigation. Mozilla’s Readability stays as a fallback.

On top of that, a feed-listing detector rejects content that has 3+ lines matching patterns like X MINS ago or N mins read. If it looks like a feed index instead of an article, skip it.
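A minimal sketch of such a detector, assuming the extracted text keeps one teaser per line. The exact pattern and the function name are illustrative, not the project's code; only the threshold of three comes from the description above.

```typescript
// Matches teaser metadata such as "5 MINS ago", "1 HR ago" or "2 mins read".
const TEASER_LINE = /\b\d+\s*(?:mins?|minutes?|hrs?|hours?)\s+(?:ago|read)\b/i;

function looksLikeFeedListing(text: string, threshold = 3): boolean {
  // Count lines that look like teaser metadata rather than article prose.
  const hits = text.split("\n").filter((line) => TEASER_LINE.test(line)).length;
  return hits >= threshold;
}
```

Requiring "ago" or "read" after the unit keeps ordinary sentences such as "spoke for 20 minutes" from counting as teasers.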

We also swapped linkedom for JSDOM at this point. Defuddle recommends it, and it handles more edge cases.

The “Boring” Problem

Not all changes are interesting. A news site might update an article just to change “Published 3 hours ago” to “Published 4 hours ago.” Or append “Updated 15:10” to a timestamp. Or change a view counter from “123 views” to “456 views.”

The boring detector strips time-related noise before comparing (the snippet below handles timestamps; a view counter would need its own pattern):

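// Strip time-related noise so two versions can be compared on substance.
// [ \t]* (not \s*) keeps every match on one line, so a timestamp cannot swallow the text below it.
const stripTime = (s: string) =>
  s
    // relative times and read-time badges: "3 hours ago", "5 MINS", "2 mins read"
    .replace(/\d+[ \t]*(?:hours?|hrs?|minutes?|mins?)(?:[ \t]*(?:ago|read))?/gi, '')
    // clock times: "15:10", "3:45 PM"
    .replace(/\b\d{1,2}:\d{2}(?:[ \t]*(?:AM|PM))?\b/gi, '')
    // "Updated ...", "Published: ..." labels, up to the next period or line end
    .replace(/[•·]?[ \t]*(?:updated|published|modified)[ \t]*:?[ \t]*[^\n.]*/gi, '')
    // time-zone suffixes: "GMT", "UTC+2"
    .replace(/\b(?:GMT|UTC|EST|CET)[+-]?\d*\b/gi, '');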

If the text is identical after stripping timestamps, the diff is boring. Skip it.
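A sketch of how the comparison itself could be wired up. stripTime is repeated here so the block runs on its own, written with horizontal whitespace ([ \t]) so a match cannot run across a line break. The whitespace normalization and the isBoring name are my own assumptions, not confirmed project code.

```typescript
const stripTime = (s: string) =>
  s
    .replace(/\d+[ \t]*(?:hours?|hrs?|minutes?|mins?)(?:[ \t]*(?:ago|read))?/gi, "")
    .replace(/\b\d{1,2}:\d{2}(?:[ \t]*(?:AM|PM))?\b/gi, "")
    .replace(/[•·]?[ \t]*(?:updated|published|modified)[ \t]*:?[ \t]*[^\n.]*/gi, "")
    .replace(/\b(?:GMT|UTC|EST|CET)[+-]?\d*\b/gi, "");

// Assumption: collapse leftover whitespace so removed timestamps leave no trace.
const normalize = (s: string) => stripTime(s).replace(/\s+/g, " ").trim();

function isBoring(before: string, after: string): boolean {
  return normalize(before) === normalize(after);
}
```

One caveat of this heuristic: everything after "updated", "published" or "modified" up to the next period is discarded, so an edit inside such a sentence would be missed.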

This caught most noise. The euronews “Updated 15:10” pattern was the last one to fall — it took four iterations of the regex to cover all the variants news sites use.

The Bot

The ActivityPub bot isn’t a bridge to someone else’s Mastodon account. It’s a native fediverse actor. It has its own keys, its own outbox, its own followers. When it posts a diff, that post originates from your server.

Botkit (from the Fedify team) handles the federation plumbing: WebFinger, actor profiles, inbox/outbox, HTTP signatures. It runs as a separate process on port 8001. nginx routes ActivityPub paths there and sends everything else to SvelteKit on port 3000.

The first diff for an article creates a root post. Every diff after that chains off the previous one via message.reply(), building one thread per article. Open the thread on Mastodon and you see the full edit history in sequence.

Getting threading to work was fiddly. Botkit doesn’t support inReplyTo as a publish option, so you can’t just pass a URI. You need the actual message object to call .reply() on, which means iterating the outbox until you find it:

async function findOutboxMessage(session, uri) {
  for await (const msg of session.getOutbox()) {
    if (msg.id?.href === uri) return msg;
  }
  return undefined;
}
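The root-or-reply decision can be sketched independently of Botkit. The Publisher interface below is a hypothetical stand-in for the bot session, not Botkit's API, and the in-memory map stands in for whatever lookup the social_posts table provides.

```typescript
// Hypothetical stand-in for the bot session; each method returns the new post's URI.
interface Publisher {
  publishRoot(text: string): Promise<string>;
  publishReply(parentUri: string, text: string): Promise<string>;
}

// articleId -> URI of the most recent post in that article's thread.
const lastPostByArticle = new Map<number, string>();

async function postDiff(pub: Publisher, articleId: number, text: string): Promise<string> {
  const parent = lastPostByArticle.get(articleId);
  const uri =
    parent === undefined
      ? await pub.publishRoot(text) // first diff: start the thread
      : await pub.publishReply(parent, text); // later diffs: chain off the previous post
  lastPostByArticle.set(articleId, uri);
  return uri;
}
```

Replying to the most recent post rather than the root is what produces a linear thread instead of a fan of replies under the first post.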

Bluesky gets the same content. Root posts use a website card embed (the article URL with a preview). Reply posts carry the diff card image at full size, with alt text and auto-linked URLs via RichText facets.
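For reference, the two embed shapes look roughly like this. The record layouts follow the app.bsky.embed.external and app.bsky.embed.images lexicons as I understand them; the helper names are hypothetical, and uploading the image blob (for example with the agent's uploadBlob) is left out of the sketch.

```typescript
interface ExternalEmbed {
  $type: "app.bsky.embed.external";
  external: { uri: string; title: string; description: string };
}
interface ImagesEmbed<B> {
  $type: "app.bsky.embed.images";
  images: { image: B; alt: string }[];
}

// Root post: link card pointing at the article.
function rootEmbed(articleUrl: string, title: string, description: string): ExternalEmbed {
  return { $type: "app.bsky.embed.external", external: { uri: articleUrl, title, description } };
}

// Reply post: the diff card image, with alt text for accessibility.
// B is whatever blob reference the upload step returned.
function replyEmbed<B>(cardBlob: B, alt: string): ImagesEmbed<B> {
  return { $type: "app.bsky.embed.images", images: [{ image: cardBlob, alt }] };
}
```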

The Version Mismatch

Profile updates were a headache. When you change the bot’s avatar or bio, remote Mastodon instances need an Update activity to refresh their cache. Simple in theory:

await ctx.sendActivity(
  { identifier: username },
  "followers",
  new Update({ actor: actorUri, object: actor })
);

Except Botkit uses Fedify v1.10.5 internally, while our project also has Fedify v2.1.1 installed (via @fedify/redis). The Update class from v2 rejects v1 Service objects. The Update class from v1 has a broken Object export in CJS (JavaScript’s global Object shadows it). And even constructing the activity with workarounds fails inside Fedify’s own signObject during sendActivity.

So profile updates don’t broadcast. Not until Botkit ships a Fedify v2 upgrade. We opened an issue upstream. In the meantime, profile changes propagate naturally when the bot posts the next diff — remote instances re-fetch the actor.

Syndication Rate Limiting

The first time a batch of 20 diffs arrived at once, the bot machine-gunned them into the fediverse in under a minute. Not a great look. BullMQ has a built-in worker limiter:

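import { Worker } from 'bullmq';

// `syndicate` is the job processor; getRedisConnection() is the app's own helper.
const worker = new Worker('syndicate', syndicate, {
  connection: getRedisConnection(),
  concurrency: 1, // one job at a time
  limiter: { max: 1, duration: 5 * 60 * 1000 } // at most one job per five minutes
});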

One post per five minutes. The rest queue up and trickle out. Configurable via SYNDICATE_RATE_MS if five minutes feels too slow or too fast.
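One way the SYNDICATE_RATE_MS override might be read, sketched as a pure function. The helper name and the fallback behavior for invalid values are assumptions; the returned object matches the shape BullMQ's limiter option expects (max and duration).

```typescript
const DEFAULT_RATE_MS = 5 * 60 * 1000; // five minutes

function limiterFromEnv(env: Record<string, string | undefined>) {
  const parsed = Number(env["SYNDICATE_RATE_MS"]);
  // Fall back to the default when the variable is missing, empty, non-numeric or not positive.
  const duration = Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_RATE_MS;
  return { max: 1, duration };
}
```

The result could be passed as the worker's limiter option, with process.env as the argument.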

The Image Pipeline

Each diff produces two images. The card is 800×418, fixed. It shows the first ~600 characters of the diff with red/green highlighting — enough for a social media post or an OG preview. The full diff is 800px wide but as tall as it needs to be. It renders everything and powers the download button. A typical one is 100–500KB.

The whole pipeline is satori (HTML → SVG) then sharp (SVG → PNG). No headless browser involved. Satori needs a font file, which was its own adventure — the Cloudron base image has DejaVuSans.ttf at a different path than Fedora, which is different again from Alpine. The font loader tries four paths and takes the first one that exists.
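The "first path that exists" lookup is simple enough to sketch. The candidate paths below are illustrative guesses at common distro layouts, not the project's actual list, and the existence check is injected so the function stays testable.

```typescript
// Assumed candidate locations; verify against the images you actually deploy on.
const FONT_CANDIDATES = [
  "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", // Debian/Ubuntu-style layout
  "/usr/share/fonts/dejavu-sans-fonts/DejaVuSans.ttf", // Fedora-style layout
  "/usr/share/fonts/dejavu/DejaVuSans.ttf", // Alpine-style layout
  "./static/fonts/DejaVuSans.ttf", // hypothetical bundled copy
] as const;

function firstExisting(paths: readonly string[], exists: (p: string) => boolean): string {
  const found = paths.find((p) => exists(p));
  if (found === undefined) {
    // Fail loudly with the full list so a missing font is easy to diagnose.
    throw new Error(`No font found; tried: ${paths.join(", ")}`);
  }
  return found;
}
```

In Node, existsSync from node:fs can serve as the exists argument, and the returned path can then be read into the buffer that satori's fonts option takes.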

Three Repos

Early on I made a decision that paid off: separate the app from its deployment.

rmdes/newsdiff is the application itself. SvelteKit, Drizzle, BullMQ, Botkit, everything. It has a generic Dockerfile and a GitHub Actions workflow that pushes to ghcr.io/rmdes/newsdiff:main on every commit. No deployment opinions baked in.

rmdes/cloudron-newsdiff wraps the app for Cloudron. It pulls the app source as a git submodule and adds the Cloudron-specific pieces: manifest, startup script, nginx config, env var mapping. Pre-built images are on GHCR, so Cloudron users can cloudron install --image and skip building entirely.

rmdes/newsdiff-deploy is a Docker Compose stack for everyone else. Six services: app, bot, nginx, postgres, redis, and a one-shot migrate container. Clone it, copy .env.example, docker compose up -d.

The Frontend

The homepage was noisy at first. An article edited 12 times meant 12 separate cards cluttering the feed. Grouping fixed that: one card per article, with a “5 changes” expand button for the history. Each source gets a stable color (hashed from its name) on the card border and filter tabs, so you can scan by outlet at a glance.
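Both ideas fit in a few lines. This is a sketch under my own assumptions: the row shape, the hash function, and the HSL saturation and lightness values are illustrative rather than what the site actually uses.

```typescript
interface DiffRow {
  articleId: number;
}

// Group diffs so each article renders as one card with its history attached.
function groupByArticle<T extends DiffRow>(diffs: T[]): Map<number, T[]> {
  const groups = new Map<number, T[]>();
  for (const d of diffs) {
    const bucket = groups.get(d.articleId);
    if (bucket) bucket.push(d);
    else groups.set(d.articleId, [d]);
  }
  return groups;
}

// Derive a stable hue from the source name: same name, same color, on every render.
function sourceColor(name: string): string {
  let hash = 0;
  for (let i = 0; i < name.length; i++) {
    hash = (hash * 31 + name.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return `hsl(${hash % 360}, 65%, 45%)`;
}
```

Hashing the name instead of storing a color means a newly added feed gets a color with no extra configuration.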

The diff view shows inline changes: <ins> and <del> tags, the way track changes should look on the web. There’s a share dropdown for Bluesky and Mastodon, and an “Also on” button that lets fediverse users interact with the diff’s AP post from their own instance. It asks for the instance URL once, saves it to localStorage, and uses authorize_interaction after that.
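Rendering the inline view from a word-level diff might look like the sketch below. The Change interface mirrors the objects jsdiff's diffWords returns (value, added, removed) and is redeclared here so the block has no dependencies; the function names are hypothetical.

```typescript
// Same shape as the change objects produced by jsdiff's diffWords().
interface Change {
  value: string;
  added?: boolean;
  removed?: boolean;
}

// Escape article text before wrapping it in markup.
const escapeHtml = (s: string) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

function renderInlineDiff(changes: Change[]): string {
  return changes
    .map((c) => {
      const text = escapeHtml(c.value);
      if (c.added) return `<ins>${text}</ins>`;
      if (c.removed) return `<del>${text}</del>`;
      return text; // unchanged run
    })
    .join("");
}
```

Escaping matters here because the extracted article text is untrusted input that ends up in the page as HTML.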

If you’d rather follow diffs in a feed reader than on the fediverse, there are Atom feeds at /feed.xml, /feed/{feedId}.xml, and /article/{id}/feed.xml. The RSS icon next to the source filter tabs points to whichever feed matches your current view.
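Picking the feed for the current view is a small mapping. The View type below is an assumed shape for the page state; the three paths are the ones listed above.

```typescript
type View =
  | { kind: "all" }
  | { kind: "feed"; feedId: string | number }
  | { kind: "article"; id: string | number };

function atomPathFor(view: View): string {
  switch (view.kind) {
    case "all":
      return "/feed.xml";
    case "feed":
      return `/feed/${view.feedId}.xml`;
    case "article":
      return `/article/${view.id}/feed.xml`;
  }
}
```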

What’s Next

The Politico Belgium problem exposed a real gap: some sites’ layouts defeat any automatic content extractor. A per-feed configurable CSS selector would help — “for this feed, the article is inside .article__body.”

diffengine’s best idea was archiving every version to the Wayback Machine. The plumbing is in the code, but the API isn’t returning archive URLs yet. Getting this working matters — permanent evidence of what changed is the whole point.

Botkit needs to upgrade to Fedify v2 so we can broadcast profile updates properly. The issue is open upstream.

87 commits, a live instance, and a bot that’s been catching edits all day. The glass newsroom is open.


NewsDiff is open source. The code is at github.com/rmdes/newsdiff. The live instance is at diff.rmendes.net. The bot is at @bot@diff.rmendes.net.

AI: Text Co-drafted · Code AI-generated · Claude
