
2026-05-14 · DataMesh Consulting

14 May — V3 self-healing scraping lands end-to-end, junk filter closed, six new extractors

A long sprint day. The V3 self-healing refactor (S1 → S6) merged in full — three dedup layers, an LLM-provider seam, nightly site-health audit, and auto-relearn/auto-disable on regression. Alongside that, the junk-title filter was bypassed at two ingest paths (now closed), description quality got a residue-stripping pass, six new portal extractors went live, and the operator dashboard finally has an Intelligence tab tracking Kimi spend per-site.

The shape of today

If yesterday was about visibility into the data (the world-map heatmap), today was about making the data pipeline notice when it's broken and try to fix itself before a human has to look.

The V3 refactor has been incubating in a feature branch for the better part of a week. It merged in six staged commits (S1–S6), plus assorted glue. The headline: every active portal is now audited nightly across five health dimensions, regressions auto-trigger a Hermes re-analysis, and catastrophic failures auto-disable the site and email an admin runbook.

V3 — Self-healing scraping in six stages

S1 — Telemetry foundation + TED/LU fix script

Every Kimi call (embed, match, summary, extract, site-learner) now writes an analytics.llm_call_logs row with model, tokens in/out, duration, cached flag, and the originating siteId where known. Without this, "what is each portal costing us?" was unanswerable. With it, the rest of V3 has a metric to optimise.
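To make the "metric to optimise" concrete, here is a minimal sketch of what one telemetry row and a per-site rollup could look like. The field names, the LlmCallLog type and the usageBySite helper are assumptions for illustration; only the list of recorded attributes (model, tokens in/out, duration, cached flag, siteId) comes from the description above.

```typescript
// Hypothetical shape of one analytics.llm_call_logs row (column names assumed).
type LlmCallKind = "embed" | "match" | "summary" | "extract" | "site-learner";

interface LlmCallLog {
  kind: LlmCallKind;
  model: string;
  tokensIn: number | null;
  tokensOut: number | null;
  durationMs: number;
  cached: boolean;
  siteId: string | null; // null when the originating site is not known
}

interface SiteUsage {
  calls: number;
  cachedCalls: number;
  tokensIn: number;
  tokensOut: number;
}

// Roll rows up per site; rows without a siteId land in an "unknown" bucket.
function usageBySite(rows: LlmCallLog[]): Record<string, SiteUsage> {
  const out: Record<string, SiteUsage> = {};
  for (const row of rows) {
    const key = row.siteId ?? "unknown";
    const usage =
      out[key] ?? (out[key] = { calls: 0, cachedCalls: 0, tokensIn: 0, tokensOut: 0 });
    usage.calls += 1;
    if (row.cached) usage.cachedCalls += 1;
    usage.tokensIn += row.tokensIn ?? 0; // NULL token counts count as zero
    usage.tokensOut += row.tokensOut ?? 0;
  }
  return out;
}
```

Keeping siteId nullable rather than required matches the "where known" caveat: a call with no site attribution is still counted, just not against any portal.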

A standalone TED/LU repair script ran alongside the telemetry shipment — TED had been quietly dropping records due to a date-parsing edge case in the source XML. Backfilled the missing window, retitled the existing rows where header drift had corrupted the title field.

S2 — Layer 3 tender-content dedup

The single most expensive thing the pipeline does is the per-tender Kimi embedding. ~30–50% of upserts on active portals return the same content (pagination overlap, listing-page re-renders, re-publication of unchanged amendments). The new ScrapeDedupService short-circuits the embed + match enqueue when upsertTender returns isNew=false, recording two cached-result telemetry rows so the saving shows up in the cost dashboard. Gated behind SCRAPER_DEDUP_ENABLED. Targets ~40% of total Kimi spend with no Hermes or parser changes.
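A minimal sketch of that short-circuit, under stated assumptions: afterUpsert, DedupDeps and the callback names are hypothetical, and we assume the two cached-result rows are one for the skipped embed and one for the skipped match. The real ScrapeDedupService is not shown in this post.

```typescript
interface UpsertResult {
  tenderId: string;
  isNew: boolean; // false when upsertTender saw unchanged content
}

interface CachedTelemetryRow {
  kind: "embed" | "match";
  cached: true;
  siteId: string;
}

// Hypothetical dependencies, injected so the decision logic stays testable.
interface DedupDeps {
  dedupEnabled: boolean; // mirrors SCRAPER_DEDUP_ENABLED
  enqueueEmbedAndMatch: (tenderId: string) => void;
  recordTelemetry: (row: CachedTelemetryRow) => void;
}

// Decide what happens after an upsert: skip the expensive LLM work for
// unchanged tenders, but still leave a telemetry trail of the saving.
function afterUpsert(
  siteId: string,
  result: UpsertResult,
  deps: DedupDeps,
): "enqueued" | "skipped" {
  if (deps.dedupEnabled && !result.isNew) {
    deps.recordTelemetry({ kind: "embed", cached: true, siteId });
    deps.recordTelemetry({ kind: "match", cached: true, siteId });
    return "skipped";
  }
  deps.enqueueEmbedAndMatch(result.tenderId);
  return "enqueued";
}
```

With the flag off, every upsert is enqueued exactly as before, which is what makes the gate safe to roll back.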

S3a — LlmProvider abstraction seam

Thin interface (LlmProvider), one concrete impl (KimiProvider wrapping KimiService), env-selected factory (LLM_PROVIDER, default kimi). Purely additive — existing call sites still inject KimiService directly. New call sites use the factory, so future swaps (LiteLLM gateway, OpenAI canary, local-model relearn) don't require touching every consumer.
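The seam can be sketched roughly as follows. Only the names LlmProvider, KimiProvider, KimiService and LLM_PROVIDER come from the description above; the embed method, the embedText stand-in and the createLlmProvider factory name are assumptions, and the KimiService here is a placeholder rather than the real class.

```typescript
// The thin interface new call sites depend on (method set is assumed).
interface LlmProvider {
  readonly name: string;
  embed(text: string): Promise<number[]>;
}

// Placeholder for the existing KimiService; its real methods are not shown in this post.
class KimiService {
  async embedText(text: string): Promise<number[]> {
    return [text.length]; // dummy vector so the sketch runs
  }
}

// The one concrete implementation: a wrapper, not a rewrite.
class KimiProvider implements LlmProvider {
  readonly name = "kimi";
  private readonly kimi: KimiService;

  constructor(kimi: KimiService) {
    this.kimi = kimi;
  }

  embed(text: string): Promise<number[]> {
    return this.kimi.embedText(text);
  }
}

// Env-selected factory: LLM_PROVIDER picks the implementation, defaulting to kimi.
function createLlmProvider(
  env: Record<string, string | undefined>,
  kimi: KimiService,
): LlmProvider {
  const selected = (env["LLM_PROVIDER"] ?? "kimi").toLowerCase();
  switch (selected) {
    case "kimi":
      return new KimiProvider(kimi);
    default:
      // Future providers would add a case here instead of touching consumers.
      throw new Error(`Unknown LLM_PROVIDER: ${selected}`);
  }
}
```

Failing loudly on an unknown value is a design choice of this sketch: a typo in the env var should not silently fall back to the default provider.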

S4 — Layer 2 page-fingerprint dedup

A second, durable dedup layer between the HTTP fetch and parseTenders(). Keyed on (siteId, contentHash) with a 24h TTL. The existing per-task "did this site's last task have the same body?" check missed constantly on paginated listings; the fingerprint table hits consistently across pagination and worker restarts. Wired into both ingest entry points (ScrapeWorker, ScrapeService) and prunes expired rows at 04:00 UTC.
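The lookup logic can be illustrated with an in-memory stand-in. This is only a sketch: the real layer is a durable table, and PageFingerprintStore, seenRecently and prune are hypothetical names. The key shape (siteId, contentHash) and the 24h TTL are taken from the description above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// In-memory stand-in for the fingerprint table, keyed on (siteId, contentHash).
class PageFingerprintStore {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = DAY_MS) {
    this.ttlMs = ttlMs;
  }

  private key(siteId: string, contentHash: string): string {
    return `${siteId}:${contentHash}`;
  }

  // True when this exact page body was already seen within the TTL,
  // in which case the caller can skip parseTenders() entirely.
  // A miss records the fingerprint so the next fetch hits.
  seenRecently(siteId: string, contentHash: string, nowMs: number): boolean {
    const k = this.key(siteId, contentHash);
    const expiry = this.expiresAt.get(k);
    if (expiry !== undefined && expiry > nowMs) return true;
    this.expiresAt.set(k, nowMs + this.ttlMs);
    return false;
  }

  // Counterpart of the nightly prune: drop expired rows, report how many.
  prune(nowMs: number): number {
    let removed = 0;
    this.expiresAt.forEach((expiry, k) => {
      if (expiry <= nowMs) {
        this.expiresAt.delete(k);
        removed += 1;
      }
    });
    return removed;
  }
}
```

Because the key does not include the task or page number, the same body fetched by a different task or after a worker restart still hits, which is the property the per-task check lacked.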

S5 — Site-health audit service

Nightly cron at 03:30 UTC. Evaluates every active site across five dimensions:

1. fetchSuccessRate: % of ScrapeTask rows with status=COMPLETED over the last full UTC day.
2. tenderCountDelta: recent count vs the 14d baseline. Sparse-site guards: a site with baseline < 1/day stays GREEN regardless ("baseline too sparse — not actionable") so a single zero day doesn't trip a portal that just has low natural volume.
3. titleRejectionRate: % of completed tasks that yielded zero tenders. Thresholds tuned high (98% with 5+ tasks) because dedup + pagination naturally produce empty pulls.
4. fieldCoverage: per-field coverage (org / desc / deadline / cpv) vs the 14d baseline.
5. domFingerprint: sha256 of bag-of-tags from the latest RawPage. CATASTROPHIC (zero parsable elements) → RED. Structural diff vs prior → ORANGE. Recovery from a prior CATASTROPHIC → treated as NONE (don't punish recovery).
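The domFingerprint dimension is the least obvious of the five, so here is a rough sketch of its decision rules. Everything in it is an assumption beyond the three rules listed above: bagOfTags uses a simple regex rather than a real HTML parser, the sha256 step is omitted to keep the example dependency-free (the canonical string is what would be hashed), and "structural diff" is reduced to plain inequality.

```typescript
type DomSeverity = "NONE" | "ORANGE" | "RED";

// Canonical bag-of-tags string: sorted "tag:count" pairs for every opening tag.
// An empty string means no parsable elements were found.
function bagOfTags(html: string): string {
  const counts: Record<string, number> = {};
  const openTag = /<([a-zA-Z][a-zA-Z0-9]*)[\s>\/]/g;
  let match: RegExpExecArray | null;
  while ((match = openTag.exec(html)) !== null) {
    const tag = match[1];
    if (!tag) continue;
    const name = tag.toLowerCase();
    counts[name] = (counts[name] ?? 0) + 1;
  }
  return Object.keys(counts)
    .sort()
    .map((name) => `${name}:${counts[name]}`)
    .join(",");
}

// Apply the three rules: catastrophic, recovery, structural diff.
function classifyDom(current: string, prior: string | null): DomSeverity {
  if (current === "") return "RED"; // CATASTROPHIC: zero parsable elements
  if (prior === null || prior === "") return "NONE"; // first sight, or recovery
  return current === prior ? "NONE" : "ORANGE"; // any change counts as a diff here
}
```

A bag of tags ignores text content and ordering, so a portal that publishes new tenders into the same layout keeps the same fingerprint, while a redesign changes it.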

Verdict is the max severity across the five. Incidents open on first ORANGE/RED and auto-close on first GREEN/YELLOW.
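Both rules are small enough to sketch directly. The Severity type, the rank table and the function names are assumptions; we also assume every dimension reports on the same four-level scale.

```typescript
type Severity = "GREEN" | "YELLOW" | "ORANGE" | "RED";

const SEVERITY_RANK: Record<Severity, number> = {
  GREEN: 0,
  YELLOW: 1,
  ORANGE: 2,
  RED: 3,
};

// The site verdict is the worst severity across all dimensions.
function verdict(dimensions: Severity[]): Severity {
  let worst: Severity = "GREEN";
  for (const s of dimensions) {
    if (SEVERITY_RANK[s] > SEVERITY_RANK[worst]) worst = s;
  }
  return worst;
}

// Incident lifecycle: open on the first ORANGE/RED, close on the first GREEN/YELLOW.
function incidentAction(
  incidentOpen: boolean,
  v: Severity,
): "open" | "close" | "none" {
  const bad = SEVERITY_RANK[v] >= SEVERITY_RANK.ORANGE;
  if (bad && !incidentOpen) return "open";
  if (!bad && incidentOpen) return "close";
  return "none";
}
```

For example, four GREEN dimensions plus one ORANGE give an ORANGE verdict, and a site that stays ORANGE on the next audit keeps its existing incident rather than opening a second one.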

The first dry-run on real prod data flagged 26 ORANGE + 4 RED, mostly false positives caused by mid-day audit windows. The sparse-site guards plus the switch to "the last full UTC day" (stable regardless of when the audit fires) brought it down to 4 ORANGE + 3 RED, all genuine. They include World Bank, TED, eProcure India, African Development Bank, Find a Tender (UK), and GETS New Zealand: sites with real baseline activity and today=0.

S6 — Auto-relearn on ORANGE, auto-disable on RED

S5 produces verdicts; S6 makes them consequential.

  • ORANGE → enqueue relearn-site BullMQ job → Hermes /analyze-site re-synthesizes the extractor selectors via the existing /extraction-agent/site-knowledge sync.
  • RED → RelearnProcessor.disableSiteOnRed: set isActive=false, email a recovery runbook to admin. Idempotent: no double-disable, no spam.
  • Manual → POST /v1/scrape/sites/:siteId/relearn (JWT-gated) runs runRelearn synchronously, bypassing the 6h cooldown so admins can force an analysis on demand.

Cooldown is Redis-backed and configurable (SCRAPER_RELEARN_COOLDOWN_SECONDS, default 6h). BullMQ jobId is fixed to relearn-${siteId} so duplicate enqueues collapse at the queue layer too. Hermes-side, if its own confidence gate trips (knowledgeSynced=false), the new ExtractionRule rows are NOT swapped in — we trust Hermes's gate, the backend just escalates with an admin email.
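The two duplicate-suppression layers and the confidence-gate handling can be sketched as below. This is an in-memory stand-in for the Redis-backed cooldown, not the real processor: RelearnGate, tryAcquire and onRelearnResult are hypothetical names, and whether a forced manual run resets the cooldown timer is our assumption.

```typescript
const DEFAULT_COOLDOWN_SECONDS = 6 * 60 * 60; // SCRAPER_RELEARN_COOLDOWN_SECONDS default

// Fixed job id per site, so duplicate enqueues collapse at the queue layer.
function relearnJobId(siteId: string): string {
  return `relearn-${siteId}`;
}

// In-memory stand-in for the Redis-backed cooldown.
class RelearnGate {
  private readonly lastRunMs = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownSeconds: number = DEFAULT_COOLDOWN_SECONDS) {
    this.cooldownMs = cooldownSeconds * 1000;
  }

  // Returns true when a relearn may start now. force=true models the manual
  // endpoint, which bypasses the cooldown (and, by assumption, restarts it).
  tryAcquire(siteId: string, nowMs: number, force: boolean = false): boolean {
    const last = this.lastRunMs.get(siteId);
    if (!force && last !== undefined && nowMs - last < this.cooldownMs) {
      return false;
    }
    this.lastRunMs.set(siteId, nowMs);
    return true;
  }
}

// Respect Hermes's own confidence gate: only swap rules when it synced.
function onRelearnResult(knowledgeSynced: boolean): "swap-rules" | "escalate-email" {
  return knowledgeSynced ? "swap-rules" : "escalate-email";
}
```

The cooldown and the fixed job id cover different failure modes: the cooldown stops repeated audits from re-triggering a relearn over hours, while the job id stops two near-simultaneous enqueues from both landing in the queue.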

Seven synthetic scenarios pass end-to-end (ORANGE enqueues, RED disables + emails, re-audit doesn't double-disable, HEALTHY does nothing, cooldown blocks duplicates, manual synced=true updates rules, manual synced=false escalates).

Closing the junk-title filter bypass

The day's other infrastructure win: the existing junk-title filter (anti-page-chrome residue, anti-numeric-stub) had been silently bypassed at two ingest paths — ScrapeWorker and ScrapeService.upsertTender. That meant ~40 tenders/day were sneaking into the corpus with titles like "Search results" or "PROC-00219387" verbatim. The filter is now applied uniformly at every upsert site.
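To show the kind of check involved, here is a rough sketch of a junk-title predicate covering the two classes named above. The actual filter rules are not published in this post, so the chrome phrase list, the numeric-stub regex and the isJunkTitle name are all illustrative assumptions.

```typescript
// Titles that are page chrome rather than tender names (illustrative list only).
const PAGE_CHROME_TITLES = ["search results", "skip to main content", "home", "login"];

// A numeric stub: an optional short alphabetic prefix, an optional separator,
// then only digits, e.g. "PROC-00219387". The thresholds are assumptions.
const NUMERIC_STUB = /^[A-Za-z]{0,6}[-_\/ ]?\d{4,}$/;

function isJunkTitle(raw: string): boolean {
  const title = raw.trim();
  if (title.length === 0) return true;
  if (PAGE_CHROME_TITLES.indexOf(title.toLowerCase()) !== -1) return true;
  return NUMERIC_STUB.test(title);
}
```

The point of the fix is less the predicate itself than calling one shared predicate from every upsert path, so a new ingest route cannot quietly skip it.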

Companion shipment: a description-quality cleaner. Strips repeated page-chrome residue (cookie banners, nav fragments, "Skip to main content" leakage) at insert time and backfills the existing corpus. A few hundred descriptions got materially shorter and more readable.

Six new portal extractors

The Hermes extractor catalog grew today:

  • EOJN HR — Croatia, via the new portal JSON API.
  • EJN BA — Bosnia & Herzegovina, JSON API.
  • eProcure India (CPPP) — session-aware Playwright extractor.
  • Malaysia, JETRO, CAPT — three more API/HTML extractors
in one shipment.

We also recommended deactivating two long-broken portals (RTA Dubai and KONEPS Korea — both had unstable HTML for weeks with no recovery path; the cost of attempting them outweighed the data return).

A small-but-load-bearing fix: per-keyword seeding is now skipped for URL-invariant API extractors. The TED Europa search API is date-only and ignores keyword arguments, so every per-keyword job (?search=agile, ?search=api, ...) was calling the same date-only API, returning the same 247 notices, and getting dedup-filtered after the fact. Ten+ sites went from 5 keyword jobs/pass to 1 each, eliminating ~40 wasted job creations per orchestrator cycle. A 15-minute API-extractor result cache was added as defence-in-depth.
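The seeding change boils down to one branch in job planning. In this sketch the urlInvariant flag, the SiteConfig and ScrapeJob shapes, and the planJobs name are hypothetical; the arithmetic matches the figures above (10 sites, each dropping from 5 jobs to 1, saves 40 job creations).

```typescript
interface SiteConfig {
  siteId: string;
  urlInvariant: boolean; // true when the API ignores keyword arguments
}

interface ScrapeJob {
  siteId: string;
  keyword: string | null; // null means one unkeyed job for the whole site
}

// Plan the jobs for one orchestrator pass over one site.
function planJobs(site: SiteConfig, keywords: string[]): ScrapeJob[] {
  if (site.urlInvariant || keywords.length === 0) {
    return [{ siteId: site.siteId, keyword: null }];
  }
  return keywords.map((keyword) => ({ siteId: site.siteId, keyword }));
}

// Jobs saved per cycle when a set of sites switches to url-invariant planning.
function jobsSaved(siteCount: number, keywordsPerSite: number): number {
  return siteCount * (keywordsPerSite - 1);
}
```

The result cache mentioned above is a separate safety net: it only matters if a keyed job for a URL-invariant site slips through anyway.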

Intelligence dashboard tab

The operator dashboard now has an Intelligence tab that consolidates "what we're spending on AI and where it's going" into one screen. Three new backend endpoints under /v1/dashboard/intelligence/{quotas,keywords,per-site} read from the analytics.llm_call_logs table that the V3 S1 telemetry populates. All three are cached in Redis for 60s.

Also closed the missing-siteId gaps in LLM telemetry so the per-site view is actually populated: embeddings, AI tender summaries, and enrichment extraction now all thread siteId through, and the Kimi-CLI match path (which had been silent) records telemetry properly. Plus a follow-up later in the day to plumb real token counts from Hermes's /analyze-site response through to the relearn processor — previously tokens_in/tokens_out were NULL for every site-learner row, which made the cost cards render as $0 / 0 tokens even though the calls were happening.

TED lookback bumped 1d → 3d + backfill runner

Two complementary changes for freshness/coverage:

  • TED Europa extractor: lookbackDays 1 → 3, maxPages 5 → 15. The 30-min orchestrator cycle was missing late-published amendments and corrigenda that landed within the previous day's window. The 15-min API-extractor cache introduced earlier today keeps cost flat: one POST per cycle, not per keyword job. Result: ~1500 records inspected per cycle instead of ~500.
  • One-shot backfill runner with presets for 14 API extractors (TED, UK FTS/CF, Etimad, CEJN, eTenders ZA, Marches MA, eNaročanje SI, e-Nabavki MK, eprocure BD, EJN BA, EOJN HR, EOP BG, APP Albania). First run: TED 14d / 60 pages inserted 3 290 of 3 291 records (the one drop was a malformed source URL; the canonical (siteId, sourceId) upsert dedups, so re-runs are safe).

Three refinements found while running that first backfill:

1. TED extractor: retry on 429/5xx with exponential backoff (2s/4s/8s/16s) and a configurable pageDelayMs. The previous extractor threw on first 429, breaking backfills at max-pages > 30. Backfill preset uses pageDelayMs=1500 to stay under TED's burst limit; the regular 30-min cycle stays at 0 because it only hits ≤15 pages.
2. Backfill DTO mapping: extractors emit { sourceUrl, publishedAt, value } but SubmitImmediateTenderDto expects { url, publishDate, estimatedValue }. The KimiBrowserAgent already translated at submit time; the backfill now does the same via toSubmitDto(). Without this, all 3 291 TED submissions returned 400 "url must be a URL address".
3. lookupSiteId now prefers siteType !== 'RSS' when multiple TenderSites match the pattern; otherwise the TED preset resolved to "TED.europa.eu RSS" instead of the canonical API entry.
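The first two refinements can be sketched as follows. The function names other than toSubmitDto, the HttpResult shape and the field types are assumptions; the fetch and sleep functions are injected so the sketch does not depend on any particular HTTP client or timer, and on exhaustion it returns the last response rather than throwing, which may differ from the real extractor.

```typescript
// Delay before retry number `attempt` (0-based): 2s, 4s, 8s, 16s.
function backoffDelayMs(attempt: number, baseMs: number = 2000): number {
  return baseMs * Math.pow(2, attempt);
}

// Retry only on rate limiting and server errors.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

interface HttpResult {
  status: number;
  body: string;
}

// Fetch one page, retrying up to maxRetries times with exponential backoff.
async function fetchPageWithRetry(
  fetchPage: () => Promise<HttpResult>,
  sleep: (ms: number) => Promise<void>,
  maxRetries: number = 4,
): Promise<HttpResult> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchPage();
    if (!isRetryable(res.status) || attempt >= maxRetries) return res;
    await sleep(backoffDelayMs(attempt));
  }
}

// Extractor output vs. the submit DTO: same data, different field names.
interface ExtractedTender {
  sourceUrl: string;
  publishedAt: string;
  value: number | null;
}

interface SubmitDtoFields {
  url: string;
  publishDate: string;
  estimatedValue: number | null;
}

function toSubmitDto(t: ExtractedTender): SubmitDtoFields {
  return { url: t.sourceUrl, publishDate: t.publishedAt, estimatedValue: t.value };
}
```

With four retries the worst case adds 30 seconds of waiting per page (2 + 4 + 8 + 16), which is why a per-page delay in the backfill preset is the cheaper way to avoid hitting the burst limit in the first place.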

SEO follow-up to yesterday's heatmap fix

Three Google Search Console issues addressed:

1. /tenders/[country]/notice/[id] was noindex'ing any tender whose raw t.title was junk, even when the displayTitle() helper rescued the page with a clean "Public tender from <organization>" fallback. ~30 of the 81 "Excluded by 'noindex' tag" entries were legit NZ / Oman tenders with usable org+description. Now we only noindex when title AND description AND organization are all unusable.
2. /countries/[country] rendered an 8-CPV sector-drill grid regardless of whether those CPVs had tenders in the country. That funneled Googlebot at /tenders/<c>/<cpv> pages that 404 by design, which was the discovery vector behind the 8.3K "Discovered – currently not indexed" backlog. The page now samples 100 tenders per country, derives the active CPV-2 prefix set, and only links to those.
3. robots.txt allowed /auth/ and /proposals/ to be crawled even though both are page-level noindex. Those routes accounted for ~50 of the 81 entries in the noindex bucket. They are now disallowed at robots level so Googlebot stops wasting crawl budget revisiting them.
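The first two fixes reduce to two small pure functions. This is a sketch with assumed names and shapes (TenderUsability, shouldNoindex, activeCpvPrefixes); how "usable" is decided for each field is not covered here.

```typescript
interface TenderUsability {
  titleUsable: boolean;
  descriptionUsable: boolean;
  organizationUsable: boolean;
}

// Only noindex when there is nothing usable to show at all.
function shouldNoindex(t: TenderUsability): boolean {
  return !t.titleUsable && !t.descriptionUsable && !t.organizationUsable;
}

// Derive the set of 2-digit CPV prefixes that actually occur in a sample
// of a country's tenders, so the page links only to non-empty sectors.
function activeCpvPrefixes(sampleCpvCodes: string[], sampleSize: number = 100): string[] {
  const seen: Record<string, true> = {};
  for (const code of sampleCpvCodes.slice(0, sampleSize)) {
    const digits = code.replace(/\D/g, ""); // tolerate "45000000-7" style codes
    if (digits.length >= 2) seen[digits.slice(0, 2)] = true;
  }
  return Object.keys(seen).sort();
}
```

A junk title with a usable organization now stays indexable, since the page can still render the fallback title. Sampling means a rare sector might be missed for a country, which errs on the side of fewer links rather than links to empty pages.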

Production safety — scraper kill switches

A subtle bug in the operator runbook: the canonical "stop production scraping" recipe (pause asistan-scrape-cron, set asistan-hermes min=0) left an undocumented third path live. The in-process ScrapeScheduler runs a @Cron('0 0 */2 * * *') inside any warm asistan-api instance, and the co-located ScrapeWorker consumes the resulting BullMQ jobs, so a warm API could keep scraping even with both kill-switches flipped.

Two new opt-in env flags, both default OFF in prod: SCRAPER_SCHEDULER_ENABLED and SCRAPER_WORKER_ENABLED. Local dev keeps both on. Prod now starts dormant-by-default on this redeploy. The full V3 flag matrix (SCRAPER_DEDUP / PAGE_DEDUP / HEALTH_AUDIT / SELF_HEAL / SCHEDULER / WORKER) is now in SYSTEM-STATE.md.
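A minimal sketch of how such opt-in flags might be resolved. The flag names come from the paragraph above; using NODE_ENV to distinguish prod from local dev, the accepted truthy strings, and the function names are assumptions.

```typescript
type Env = Record<string, string | undefined>;

// Parse a boolean env flag, falling back to a default when unset or blank.
function flagEnabled(env: Env, name: string, defaultValue: boolean): boolean {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return defaultValue;
  return ["1", "true", "on", "yes"].indexOf(raw.trim().toLowerCase()) !== -1;
}

interface ScraperSwitches {
  scheduler: boolean; // gates the in-process ScrapeScheduler cron
  worker: boolean; // gates the co-located ScrapeWorker consumer
}

// Default OFF in prod, ON in local dev; an explicit value always wins.
function scraperSwitches(env: Env): ScraperSwitches {
  const defaultOn = env["NODE_ENV"] !== "production";
  return {
    scheduler: flagEnabled(env, "SCRAPER_SCHEDULER_ENABLED", defaultOn),
    worker: flagEnabled(env, "SCRAPER_WORKER_ENABLED", defaultOn),
  };
}
```

Making the flags opt-in means a fresh prod deploy with no configuration does nothing, so forgetting to set a variable fails towards "not scraping" rather than towards the undocumented third path.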

Status & what's next

V3 is feature-complete on the backend: the audit produces verdicts, the verdicts trigger actions, the actions are gated and idempotent, and every step writes telemetry the Intelligence dashboard surfaces.

Remaining V3 work:

  • S7 — Admin dashboard surfaces in web + iOS for the
site-health verdicts and incident timeline.
  • S8 — Docs sweep (AGENTS.md, SYSTEM-STATE.md,
HERMES_PIPELINE.md updates).

Both are presentation layers. The pipeline itself is what we set out to build.

Methodology: drawn from the week ending 2026-05-14 tender corpus. Tender data sourced from public procurement portals worldwide; see our methodology for the extraction pipeline.