2026-05-14 · DataMesh Consulting
14 May — V3 self-healing scraping lands end-to-end, junk filter closed, six new extractors
A long sprint day. The V3 self-healing refactor (S1 → S6) merged in full — three dedup layers, an LLM-provider seam, nightly site-health audit, and auto-relearn/auto-disable on regression. Alongside that, the junk-title filter was bypassed at two ingest paths (now closed), description quality got a residue-stripping pass, six new portal extractors went live, and the operator dashboard finally has an Intelligence tab tracking Kimi spend per-site.
The shape of today
If yesterday was about visibility into the data (the world-map heatmap), today was about making the data pipeline notice when it's broken and try to fix itself before a human has to look.
The V3 refactor has been incubating in a feature branch for the better part of a week. It merged in six staged commits (S1–S6), plus assorted glue. The headline: every active portal is now audited nightly across five health dimensions, regressions auto-trigger a Hermes re-analysis, and catastrophic failures auto-disable the site and email an admin runbook.
V3 — Self-healing scraping in six stages
S1 — Telemetry foundation + TED/LU fix script
Every Kimi call (embed, match, summary, extract, site-learner)
now writes an analytics.llm_call_logs row with model, tokens
in/out, duration, cached flag, and the originating siteId where
known. Without this, "what is each portal costing us?" was
unanswerable. With it, the rest of V3 has a metric to optimise.
A standalone TED/LU repair script ran alongside the telemetry
shipment — TED had been quietly dropping records due to a
date-parsing edge case in the source XML. Backfilled the missing
window, retitled the existing rows where header drift had
corrupted the title field.
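As a rough illustration of the telemetry row described above, here is a minimal sketch. Only the list of recorded facts (model, tokens in/out, duration, cached flag, siteId where known) comes from the text; the type name `LlmCallLog`, the helper `buildLlmCallLog`, and all property names are hypothetical.

```typescript
// Hypothetical shape of one analytics.llm_call_logs row.
// Property names are assumed; only the recorded facts come from the write-up.
type LlmCallKind = "embed" | "match" | "summary" | "extract" | "site-learner";

interface LlmCallLog {
  kind: LlmCallKind;
  model: string;
  tokensIn: number | null; // null when the provider reported no usage
  tokensOut: number | null;
  durationMs: number;
  cached: boolean;
  siteId: string | null; // "where known"
}

interface LlmCallDetails {
  tokensIn?: number;
  tokensOut?: number;
  cached?: boolean;
  siteId?: string;
}

// Builds a row from call timing plus whatever details the caller knows.
function buildLlmCallLog(
  kind: LlmCallKind,
  model: string,
  startedAtMs: number,
  finishedAtMs: number,
  details: LlmCallDetails = {},
): LlmCallLog {
  return {
    kind,
    model,
    tokensIn: details.tokensIn ?? null,
    tokensOut: details.tokensOut ?? null,
    durationMs: Math.max(0, finishedAtMs - startedAtMs),
    cached: details.cached ?? false,
    siteId: details.siteId ?? null,
  };
}
```

Keeping unknown values as null rather than 0 makes "not measured" distinguishable from "measured as zero" when the rows are aggregated later.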
S2 — Layer 3 tender-content dedup
The single most expensive thing the pipeline does is the
per-tender Kimi embedding. ~30–50% of upserts on active portals
return the same content (pagination overlap, listing-page
re-renders, re-publication of unchanged amendments). The new
ScrapeDedupService short-circuits the embed + match enqueue
when upsertTender returns isNew=false, recording two cached-
result telemetry rows so the saving shows up in the cost
dashboard. Gated behind SCRAPER_DEDUP_ENABLED. Targets ~40%
of total Kimi spend with no Hermes or parser changes.
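The short-circuit can be pictured roughly as follows. This is a sketch under assumptions, not the ScrapeDedupService code: `planAfterUpsert`, the `Action` type, and the `UpsertResult` shape (beyond the `isNew` flag named above) are hypothetical.

```typescript
// Hypothetical result of upsertTender; only isNew is named in the write-up.
interface UpsertResult {
  tenderId: string;
  isNew: boolean;
}

const FOLLOW_UPS = ["embed", "match"] as const;
type FollowUp = (typeof FOLLOW_UPS)[number];

type Action =
  | { type: "enqueue"; job: FollowUp; tenderId: string }
  | { type: "telemetry"; kind: FollowUp; tenderId: string; cached: true };

// Decides what to do after an upsert. With dedup enabled and unchanged
// content, the LLM work is skipped and two cached-result telemetry rows are
// recorded instead, so the saving stays visible in the cost view.
function planAfterUpsert(result: UpsertResult, dedupEnabled: boolean): Action[] {
  if (dedupEnabled && !result.isNew) {
    return FOLLOW_UPS.map(
      (kind): Action => ({ type: "telemetry", kind, tenderId: result.tenderId, cached: true }),
    );
  }
  return FOLLOW_UPS.map(
    (job): Action => ({ type: "enqueue", job, tenderId: result.tenderId }),
  );
}
```

With the flag off, the function behaves exactly like the pre-dedup pipeline, which is the point of gating it behind SCRAPER_DEDUP_ENABLED.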
S3a — LlmProvider abstraction seam
Thin interface (LlmProvider), one concrete impl (KimiProvider
wrapping KimiService), env-selected factory (LLM_PROVIDER,
default kimi). Purely additive — existing call sites still
inject KimiService directly. New call sites use the factory,
so future swaps (LiteLLM gateway, OpenAI canary, local-model
relearn) don't require touching every consumer.
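A seam of this kind might look like the sketch below. The names LlmProvider, KimiProvider, and LLM_PROVIDER come from the text; the `embed` method, the `FakeKimiService` stand-in, and `createLlmProvider` are assumptions, since the real KimiService API is not shown here.

```typescript
// Thin provider interface. The method set is assumed for illustration.
interface LlmProvider {
  readonly name: string;
  embed(text: string): Promise<number[]>;
}

// Stand-in for the real KimiService, whose API is not shown in the write-up.
class FakeKimiService {
  async embedText(text: string): Promise<number[]> {
    return [text.length];
  }
}

// Adapter: wraps the existing service without changing it.
class KimiProvider implements LlmProvider {
  readonly name = "kimi";
  private readonly kimi: FakeKimiService;

  constructor(kimi: FakeKimiService) {
    this.kimi = kimi;
  }

  embed(text: string): Promise<number[]> {
    return this.kimi.embedText(text);
  }
}

// Env-selected factory. Unknown values fail loudly instead of silently
// falling back, so a typo in LLM_PROVIDER is caught at startup.
function createLlmProvider(env: Record<string, string | undefined>): LlmProvider {
  const selected = (env.LLM_PROVIDER ?? "kimi").toLowerCase();
  switch (selected) {
    case "kimi":
      return new KimiProvider(new FakeKimiService());
    default:
      throw new Error(`Unknown LLM_PROVIDER: ${selected}`);
  }
}
```

Adding another backend later would mean one new class and one new `case`, with no change to code that depends only on `LlmProvider`.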
S4 — Layer 2 page-fingerprint dedup
A second, durable dedup layer between the HTTP fetch and
parseTenders(). Keyed on (siteId, contentHash) with a 24h
TTL. The existing per-task "did this site's last task have
the same body?" check missed constantly on paginated listings;
the fingerprint table hits consistently across pagination and
worker restarts. Wired into both ingest entry points
(ScrapeWorker, ScrapeService) and prunes expired rows at
04:00 UTC.
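The fingerprint check can be sketched as below. An in-memory Map stands in for the durable table, and the class name, method names, and the choice not to refresh the TTL on a hit are all assumptions; only the (siteId, contentHash) key, the 24h TTL, and the pruning of expired rows come from the text.

```typescript
import { createHash } from "node:crypto";

// In-memory stand-in for the durable fingerprint table (assumption).
class PageFingerprintStore {
  private readonly expiresAtByKey = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  static contentHash(body: string): string {
    return createHash("sha256").update(body).digest("hex");
  }

  // Returns true when this (siteId, contentHash) pair was seen within the
  // TTL, meaning parsing can be skipped. A miss records the fingerprint.
  seenRecently(siteId: string, body: string, nowMs: number): boolean {
    const key = `${siteId}:${PageFingerprintStore.contentHash(body)}`;
    const expiresAt = this.expiresAtByKey.get(key);
    const hit = expiresAt !== undefined && expiresAt > nowMs;
    if (!hit) {
      this.expiresAtByKey.set(key, nowMs + this.ttlMs);
    }
    return hit;
  }

  // Deletes expired fingerprints and returns how many were removed.
  prune(nowMs: number): number {
    let removed = 0;
    this.expiresAtByKey.forEach((expiresAt, key) => {
      if (expiresAt <= nowMs) {
        this.expiresAtByKey.delete(key);
        removed += 1;
      }
    });
    return removed;
  }
}
```

Because the key includes the content hash rather than "the last task's body", two different pages of the same paginated listing each get their own fingerprint, which is why this layer hits where the per-task check missed.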
S5 — Site-health audit service
Nightly cron at 03:30 UTC. Evaluates every active site across five dimensions:
1. fetchSuccessRate — % of ScrapeTask rows with
status=COMPLETED over the last full UTC day.
2. tenderCountDelta — recent count vs the 14d baseline.
Sparse-site guards: a site with baseline < 1/day stays GREEN
regardless ("baseline too sparse — not actionable") so a
single zero day doesn't trip a portal that just has low
natural volume.
3. titleRejectionRate — % of completed tasks that yielded
zero tenders. Thresholds tuned high (98% with 5+ tasks)
because dedup + pagination naturally produce empty pulls.
4. fieldCoverage — per-field coverage (org / desc /
deadline / cpv) vs the 14d baseline.
5. domFingerprint — sha256 of bag-of-tags from the latest
RawPage. CATASTROPHIC (zero parsable elements) → RED.
Structural diff vs prior → ORANGE. Recovery from a prior
CATASTROPHIC → treated as NONE (don't punish recovery).
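The domFingerprint dimension could be approximated as in the sketch below. The regex-based tag scan is a simplification chosen to keep the example self-contained (a real implementation would more likely use an HTML parser), and the function names and the treatment of a missing prior fingerprint are assumptions.

```typescript
import { createHash } from "node:crypto";

type DomChange = "NONE" | "STRUCTURAL" | "CATASTROPHIC";

interface DomFingerprint {
  hash: string;
  tagCount: number;
}

// Hashes the bag (multiset) of opening tag names, ignoring text content,
// attributes, and tag order.
function fingerprintDom(html: string): DomFingerprint {
  const counts = new Map<string, number>();
  const tagPattern = /<([a-zA-Z][a-zA-Z0-9-]*)\b/g;
  let tagCount = 0;
  let m: RegExpExecArray | null;
  while ((m = tagPattern.exec(html)) !== null) {
    const tag = m[1].toLowerCase();
    counts.set(tag, (counts.get(tag) ?? 0) + 1);
    tagCount += 1;
  }
  const canonical = Array.from(counts.entries())
    .sort((x, y) => (x[0] < y[0] ? -1 : x[0] > y[0] ? 1 : 0))
    .map(([tag, n]) => `${tag}:${n}`)
    .join(",");
  return { hash: createHash("sha256").update(canonical).digest("hex"), tagCount };
}

// CATASTROPHIC maps to RED and STRUCTURAL to ORANGE in the audit.
// Assumption: a site with no prior fingerprint is treated as NONE.
function classifyDomChange(current: DomFingerprint, prior: DomFingerprint | null): DomChange {
  if (current.tagCount === 0) return "CATASTROPHIC";
  if (prior === null || prior.tagCount === 0) return "NONE"; // first sight, or recovery
  return current.hash === prior.hash ? "NONE" : "STRUCTURAL";
}
```

Because only tag names and counts are hashed, ordinary content churn (new tender titles in the same layout) leaves the fingerprint unchanged, while a redesigned listing page changes it.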
Verdict is the max severity across the five. Incidents open on first ORANGE/RED and auto-close on first GREEN/YELLOW.
The first dry-run on real prod data flagged 26 ORANGE + 4 RED, mostly false positives caused by mid-day audit windows. The sparse-site guards plus the switch to "the last full UTC day" (stable regardless of when the audit fires) brought it down to 4 ORANGE + 3 RED, all genuine. Among them: World Bank, TED, eProcure India, African Development Bank, Find a Tender (UK), and GETS New Zealand, all sites with real baseline activity and today=0.
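The verdict rule, the audit window, and the sparse-site guard can be sketched together as follows. The severity names and the "baseline < 1/day" guard come from the text; the function names and the incident-state helper are hypothetical.

```typescript
type Severity = "GREEN" | "YELLOW" | "ORANGE" | "RED";

const SEVERITY_ORDER: readonly Severity[] = ["GREEN", "YELLOW", "ORANGE", "RED"];

// Verdict = worst severity across the per-dimension results.
function maxSeverity(dimensions: readonly Severity[]): Severity {
  return dimensions.reduce<Severity>(
    (worst, s) => (SEVERITY_ORDER.indexOf(s) > SEVERITY_ORDER.indexOf(worst) ? s : worst),
    "GREEN",
  );
}

// The last full UTC day as a half-open window [start, end). The result is
// the same no matter what time of day the audit fires.
function lastFullUtcDay(now: Date): { start: Date; end: Date } {
  const end = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
  const start = new Date(end.getTime() - 24 * 60 * 60 * 1000);
  return { start, end };
}

// Sparse-site guard: below one tender/day the count delta is not actionable.
function isBaselineActionable(baselinePerDay: number): boolean {
  return baselinePerDay >= 1;
}

// Incidents open on the first ORANGE/RED and close on the first GREEN/YELLOW.
function nextIncidentOpen(verdict: Severity): boolean {
  return verdict === "ORANGE" || verdict === "RED";
}
```

For example, an audit firing at 15:42 UTC on 14 May evaluates 13 May 00:00 to 14 May 00:00, so a half-finished current day can never look like a drop in volume.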
S6 — Auto-relearn on ORANGE, auto-disable on RED
S5 produces verdicts; S6 makes them consequential.
- ORANGE → enqueue a relearn-site BullMQ job → Hermes
/analyze-site re-synthesizes the extractor selectors via
the existing /extraction-agent/site-knowledge sync.
- RED →
RelearnProcessor.disableSiteOnRed: set
isActive=false, email a recovery runbook to admin.
Idempotent — no double-disable, no spam.
- Manual →
POST /v1/scrape/sites/:siteId/relearn (JWT-[…])
runs runRelearn synchronously, bypassing the 6h
cooldown so admins can force an analysis on demand.
Cooldown is Redis-backed and configurable
(SCRAPER_RELEARN_COOLDOWN_SECONDS, default 6h). BullMQ
jobId is fixed to relearn-${siteId} so duplicate enqueues
collapse at the queue layer too. Hermes-side, if its own
confidence gate trips (knowledgeSynced=false), the new
ExtractionRule rows are NOT swapped in — we trust Hermes's
gate, the backend just escalates with an admin email.
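The gating described above can be sketched with an in-memory stand-in. In production the cooldown is Redis-backed (the usual primitive for this is `SET key value NX EX seconds`); the Map-based class, the method names, and the decision to record a run even when the cooldown is bypassed are assumptions made for illustration.

```typescript
// Fixed job id per site, so duplicate enqueues collapse at the queue layer.
function relearnJobId(siteId: string): string {
  return `relearn-${siteId}`;
}

// In-memory stand-in for the Redis-backed cooldown (assumption).
class RelearnCooldown {
  private readonly lastRunMs = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownSeconds: number = 6 * 60 * 60) {
    this.cooldownMs = cooldownSeconds * 1000;
  }

  // Returns true when a relearn may start now. The manual path passes
  // bypass=true to skip the check.
  tryAcquire(siteId: string, nowMs: number, bypass: boolean = false): boolean {
    const last = this.lastRunMs.get(siteId);
    const coolingDown = last !== undefined && nowMs - last < this.cooldownMs;
    if (coolingDown && !bypass) return false;
    this.lastRunMs.set(siteId, nowMs);
    return true;
  }
}

// Hermes's confidence gate decides whether new rules are swapped in.
function afterRelearn(knowledgeSynced: boolean): "swap-rules" | "escalate-to-admin" {
  return knowledgeSynced ? "swap-rules" : "escalate-to-admin";
}
```

The two layers are complementary: the cooldown stops repeated relearns over hours, while the fixed job id stops two enqueues that race within the same moment.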
Seven synthetic scenarios pass end-to-end (ORANGE enqueues, RED disables + emails, re-audit doesn't double-disable, HEALTHY does nothing, cooldown blocks duplicates, manual synced=true updates rules, manual synced=false escalates).
Closing the junk-title filter bypass
The day's other infrastructure win: the existing junk-title
filter (anti-page-chrome residue, anti-numeric-stub) had been
silently bypassed at two ingest paths — ScrapeWorker and
ScrapeService.upsertTender. That meant ~40 tenders/day were
sneaking into the corpus with titles like "Search results" or
"PROC-00219387" verbatim. The filter is now applied uniformly
at every upsert site.
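A filter of this kind might look like the sketch below. The two example titles come from the text; the chrome-phrase list, the numeric-stub pattern, and the function name are illustrative guesses, not the production rules.

```typescript
// Illustrative page-chrome phrases (assumption: the real list is longer).
const CHROME_TITLES = new Set<string>(["search results", "skip to main content"]);

// Illustrative numeric-stub pattern: an optional short alphabetic prefix,
// an optional separator, then at least four digits and nothing else.
const NUMERIC_STUB = /^[A-Za-z]{0,6}[-_/ ]?\d{4,}$/;

// Returns true when a title carries no usable information about the tender.
function isJunkTitle(raw: string | null | undefined): boolean {
  const title = (raw ?? "").trim().replace(/\s+/g, " ");
  if (title.length === 0) return true;
  if (CHROME_TITLES.has(title.toLowerCase())) return true;
  return NUMERIC_STUB.test(title);
}
```

Routing every ingest path through one predicate like this is what prevents a bypass: a new upsert call site cannot forget a check that lives inside the shared function.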
Companion shipment: a description-quality cleaner. Strips repeated page-chrome residue (cookie banners, nav fragments, "Skip to main content" leakage) at insert time and backfills the existing corpus. A few hundred descriptions got materially shorter and more readable.
Six new portal extractors
The Hermes extractor catalog grew today:
- EOJN HR — Croatia, via the new portal JSON API.
- EJN BA — Bosnia & Herzegovina, JSON API.
- eProcure India (CPPP) — session-aware Playwright extractor.
- Malaysia, JETRO, CAPT — three more API/HTML extractors
We also recommended deactivating two long-broken portals (RTA Dubai and KONEPS Korea — both had unstable HTML for weeks with no recovery path; the cost of attempting them outweighed the data return).
A small-but-load-bearing fix: per-keyword seeding is now
skipped for URL-invariant API extractors. The TED Europa
search API is date-only — it ignores keyword arguments. So
every per-keyword job (?search=agile, ?search=api, ...)
was calling the same date-only API, returning the same 247
notices, and getting dedup-filtered after the fact. Ten+
sites went from 5 keyword jobs/pass to 1 each — ~40 wasted
job creations eliminated per orchestrator cycle. Plus a 15-
minute API-extractor result cache as defence-in-depth.
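The seeding change can be sketched as a small planning function. The `urlInvariant` flag, the type names, and the example URL are hypothetical; the `?search=` parameter shape follows the examples above.

```typescript
interface SiteSeedConfig {
  siteId: string;
  baseUrl: string;
  // True when the extractor's API ignores keyword arguments (assumed flag).
  urlInvariant: boolean;
}

interface SeedJob {
  siteId: string;
  url: string;
  keyword: string | null;
}

// Plans the scrape jobs for one site in one orchestrator pass.
function planSeedJobs(site: SiteSeedConfig, keywords: readonly string[]): SeedJob[] {
  if (site.urlInvariant) {
    // Every keyword would hit the same endpoint, so one job is enough.
    return [{ siteId: site.siteId, url: site.baseUrl, keyword: null }];
  }
  return keywords.map((keyword) => ({
    siteId: site.siteId,
    url: `${site.baseUrl}?search=${encodeURIComponent(keyword)}`,
    keyword,
  }));
}
```

With five keywords, an invariant site drops from 5 jobs to 1, saving 4 per pass; across ten such sites that is the ~40 job creations mentioned above.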
Intelligence dashboard tab
The operator dashboard now has an Intelligence tab that
consolidates "what AI we're spending and where it's going"
into one screen. Three new backend endpoints under
/v1/dashboard/intelligence/{quotas,keywords,per-site} read
from the analytics.llm_call_logs table the V3 S1 telemetry
populates. All 60s-cached in Redis.
Also closed the missing-siteId gaps in LLM telemetry so the
per-site view is actually populated: embeddings, AI tender
summaries, and enrichment extraction now all thread siteId
through, and the Kimi-CLI match path (which had been silent)
records telemetry properly. Plus a follow-up later in the day
to plumb real token counts from Hermes's /analyze-site
response through to the relearn processor — previously
tokens_in/tokens_out were NULL for every site-learner row,
which made the cost cards render as $0 / 0 tokens even
though the calls were happening.
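A per-site rollup over the telemetry rows might look like the sketch below, which also shows how NULL token counts end up as "0 tokens" on a card while the call count stays non-zero. The row shape, the `(unattributed)` bucket, and the function name are assumptions.

```typescript
// Hypothetical subset of a telemetry row.
interface LogRow {
  siteId: string | null;
  tokensIn: number | null;
  tokensOut: number | null;
}

interface SiteUsage {
  calls: number;
  tokensIn: number;
  tokensOut: number;
}

// Groups rows by site. Rows without a siteId land in a shared bucket, which
// is why missing siteIds leave the per-site view looking empty.
function rollupBySite(rows: readonly LogRow[]): Map<string, SiteUsage> {
  const usage = new Map<string, SiteUsage>();
  for (const row of rows) {
    const key = row.siteId ?? "(unattributed)";
    const current = usage.get(key) ?? { calls: 0, tokensIn: 0, tokensOut: 0 };
    current.calls += 1;
    current.tokensIn += row.tokensIn ?? 0; // NULL contributes nothing
    current.tokensOut += row.tokensOut ?? 0;
    usage.set(key, current);
  }
  return usage;
}
```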
TED lookback bumped 1d → 3d + backfill runner
Two complementary changes for freshness/coverage:
- TED Europa extractor:
lookbackDays 1 → 3, maxPages
- One-shot backfill runner with presets for 14 API
(siteId, sourceId) upsert dedups, so re-runs are safe).
Three refinements found while running that first backfill:
1. TED extractor: retry on 429/5xx with exponential backoff
(2s/4s/8s/16s) and a configurable pageDelayMs. The
previous extractor threw on first 429, breaking backfills
at max-pages > 30. Backfill preset uses pageDelayMs=1500
to stay under TED's burst limit; the regular 30-min cycle
stays at 0 because it only hits ≤15 pages.
2. Backfill DTO mapping: extractors emit
{ sourceUrl, publishedAt, value } but
SubmitImmediateTenderDto expects
{ url, publishDate, estimatedValue }. The KimiBrowserAgent
already translated at submit time; the backfill now does
the same via toSubmitDto(). Without this, all 3 291 TED
submissions returned 400 "url must be a URL address".
3. lookupSiteId now prefers siteType !== 'RSS' when
multiple TenderSites match the pattern — otherwise the
TED preset resolved to "TED.europa.eu RSS" instead of the
canonical API entry.
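The first two refinements can be sketched as follows. The backoff schedule and the three field renames come from the text; the function signatures, the injected `attempt`/`sleep` callbacks, and the field types are assumptions, and the real toSubmitDto presumably maps more fields than the three shown.

```typescript
const BACKOFF_MS = [2000, 4000, 8000, 16000] as const;

function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Retries an HTTP-like call on 429/5xx, waiting 2s/4s/8s/16s between tries.
// `attempt` and `sleep` are injected so the logic is testable without a network.
async function fetchWithRetry<T>(
  attempt: () => Promise<{ status: number; body: T }>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let i = 0; ; i++) {
    const res = await attempt();
    if (res.status >= 200 && res.status < 300) return res.body;
    if (!isRetryable(res.status) || i >= BACKOFF_MS.length) {
      throw new Error(`HTTP ${res.status} after ${i + 1} attempt(s)`);
    }
    await sleep(BACKOFF_MS[i]);
  }
}

// Field names on both sides come from the write-up; the types are assumed.
interface ExtractedTender {
  sourceUrl: string;
  publishedAt: string;
  value: number | null;
}

interface SubmitFields {
  url: string;
  publishDate: string;
  estimatedValue: number | null;
}

// Translates extractor output into the field names the submit DTO expects.
function toSubmitDto(t: ExtractedTender): SubmitFields {
  return { url: t.sourceUrl, publishDate: t.publishedAt, estimatedValue: t.value };
}
```

With this schedule a persistently throttled request is tried five times (the initial call plus four retries) and waits 30 seconds in total before giving up.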
SEO follow-up to yesterday's heatmap fix
Three Google Search Console issues addressed:
1. /tenders/[country]/notice/[id] was noindex'ing any
tender whose raw t.title was junk — even when the
displayTitle() helper rescued the page with a clean
"Public tender from <organization>" fallback. ~30 of
the 81 "Excluded by 'noindex' tag" entries were legit
NZ / Oman tenders with usable org+description. Now we
only noindex when title AND description AND organization
are all unusable.
2. /countries/[country] rendered an 8-CPV sector-drill grid
regardless of whether those CPVs had tenders in the
country. That funneled Googlebot to /tenders/<c>/<cpv>
pages that 404 by design, the discovery vector behind
the 8.3K "Discovered – currently not indexed" backlog.
Now samples 100 tenders per country, derives the active
CPV-2 prefix set, and only links to those.
3. robots.txt allowed /auth/ and /proposals/ to be
crawled even though both are noindex at page level. Those
routes accounted for ~50 of the 81 entries in the noindex
bucket. They are now disallowed at robots level so
Googlebot stops wasting crawl budget revisiting them.
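The first two fixes reduce to two small rules, sketched below. The boolean inputs stand in for whatever usability checks the site already performs (for example via displayTitle()), and the function names and the sample record shape are hypothetical.

```typescript
interface TenderUsability {
  titleUsable: boolean;
  descriptionUsable: boolean;
  organizationUsable: boolean;
}

// noindex only when title AND description AND organization are all unusable.
function shouldNoindex(t: TenderUsability): boolean {
  return !t.titleUsable && !t.descriptionUsable && !t.organizationUsable;
}

// Derives the set of 2-digit CPV prefixes that actually occur in a sample of
// a country's tenders, so the sector grid links only to pages that exist.
function activeCpv2Prefixes(sample: readonly { cpvCodes: readonly string[] }[]): string[] {
  const prefixes = new Set<string>();
  for (const tender of sample) {
    for (const code of tender.cpvCodes) {
      const trimmed = code.trim();
      if (/^\d{2}/.test(trimmed)) {
        prefixes.add(trimmed.slice(0, 2));
      }
    }
  }
  return Array.from(prefixes).sort();
}
```

A tender with a junk raw title but a usable organization now stays indexable, and a country with no tenders under a given CPV division simply gets no link to that sector page.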
Production safety — scraper kill switches
A subtle bug in the operator runbook: the canonical "stop
production scraping" recipe (pause asistan-scrape-cron,
set asistan-hermes min=0) left an undocumented third
path live. The in-process ScrapeScheduler runs a
@Cron('0 0 */2 * * *') inside any warm asistan-api
instance, and the co-located ScrapeWorker consumes the
resulting BullMQ jobs — so a warm API could keep scraping
even with both kill-switches flipped.
Two new opt-in env flags, both default OFF in prod:
SCRAPER_SCHEDULER_ENABLED and SCRAPER_WORKER_ENABLED.
Local dev keeps both on. Prod now starts dormant-by-default
on this redeploy. The full V3 flag matrix
(SCRAPER_DEDUP / PAGE_DEDUP / HEALTH_AUDIT / SELF_HEAL /
SCHEDULER / WORKER) is now in SYSTEM-STATE.md.
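One way to read the two flags is sketched below. The flag names come from the text; deriving the default from NODE_ENV, the accepted truthy spellings, and the function names are assumptions, since the actual mechanism that keeps local dev on is not described here.

```typescript
type Env = Record<string, string | undefined>;

const TRUTHY = new Set<string>(["1", "true", "on", "yes"]);

// Reads a boolean flag, falling back to the default when unset or blank.
function flagEnabled(env: Env, name: string, defaultValue: boolean): boolean {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return defaultValue;
  return TRUTHY.has(raw.trim().toLowerCase());
}

// Opt-in in production, on by default elsewhere (assumed defaulting rule).
function scraperFlags(env: Env): { schedulerEnabled: boolean; workerEnabled: boolean } {
  const isProd = env.NODE_ENV === "production";
  return {
    schedulerEnabled: flagEnabled(env, "SCRAPER_SCHEDULER_ENABLED", !isProd),
    workerEnabled: flagEnabled(env, "SCRAPER_WORKER_ENABLED", !isProd),
  };
}
```

The scheduler and the worker would each check their flag at startup and skip registration when it is off, so a warm API instance stays dormant unless both are switched on deliberately.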
Status & what's next
V3 is feature-complete on the backend: the audit produces verdicts, the verdicts trigger actions, the actions are gated and idempotent, and every step writes telemetry the Intelligence dashboard surfaces.
Remaining V3 work:
- S7 — Admin dashboard surfaces in web + iOS for the
- S8 — Docs sweep (AGENTS.md, SYSTEM-STATE.md,
Both are presentation layers. The pipeline itself is what we set out to build.