DataMesh Consulting · 2026-05-12
12 May — SEO recovery sprint, sitemap drops 8 548 → 2 186 URLs, freshness pipeline live
A Google Search Console audit reported 9 189 URLs in our sitemap but only 21 crawled and 0 indexed. We traced it back to junk-titled tenders inflating the corpus with auto-generated-looking pages, then shipped a five-commit fix that cleans them out at the source, tightens the sitemap to high-confidence URLs only, and wires up an end-to-end freshness pipeline so future tenders reach Google within minutes.
The trigger
A Google Search Console snapshot of datameshconsulting.co.uk
showed the numbers we'd been dreading on a young domain:
- Sitemap: 9 189 URLs discovered
- Crawled: 21
- Indexed: 0
- 17 pages "Crawled — currently not indexed", 4 pages in a second not-indexed bucket
Over four weeks online, we'd asked Google to discover 9 000+ URLs and it had indexed none of them. The "Crawled — currently not indexed" bucket is the worst signal: it means Google looked at the page, decided the content didn't warrant indexing, and moved on. Accumulate enough of those and the whole-domain crawl budget shrinks; we were on track for permanent invisibility.
The diagnosis
We sampled tender titles across the corpus by country. The quality was wildly inconsistent:
| Country | Sample title | Verdict |
|---|---|---|
| GB | "Human Resources Support Services for Camborne Town Council" | ✅ Real |
| EU | "Poland – Medical equipments – Postępowanie…" | ✅ Real |
| OM | 134022360.00, 100780500.00, 16083.00 الأشجار | ❌ KPI values misextracted into the title field |
| DE | "Öffentliche Ausschreibungen" | ❌ Portal homepage title |
| FR | "Accueil \| Pages — boamp.fr" | ❌ Portal homepage title |
| ES | "Plataforma de Contratación del Estado" | ❌ Portal homepage title |
| NZ | "746-26-835-PS", "TMT017-2526-M" | ❌ Bare reference numbers |
| MK | "TEST-INGEST-1778320031830" | ❌ Test artefact in prod |
In total, 187 tenders in the public corpus had titles that read to Google as auto-generated, low-value content. Oman in particular was 100% junk — every single ACTIVE Oman tender was a numeric KPI misextracted by the scraper. Google was correctly flagging these, and the cumulative quality signal was dragging the whole domain down.
The fix — five commits over twelve hours
1 · Frontend junk filter (web/lib/quality.ts)
A new `isJunkTitle()` heuristic catches the six patterns above:
- Pure numeric titles (Oman) — `^\s*[0-9]+([.,][0-9]+)?\s*$`
- Portal homepage titles — substring match against a curated list
- Pure reference codes (NZ) — `^[A-Z0-9][A-Z0-9-]{4,40}$`
- Test-data artefacts — `TEST-INGEST-` prefix
- Sub-15-char stubs
- Self-doubling titles — `X — X` patterns where the description echoes the title
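For reference, a condensed sketch of what the filter looks like (the portal-title list here is illustrative, not the full curated list):

```ts
// web/lib/quality.ts (condensed sketch: portal list and thresholds are illustrative)
const PORTAL_HOMEPAGE_TITLES = [
  'Öffentliche Ausschreibungen',           // DE
  'Plataforma de Contratación del Estado', // ES
  'boamp.fr',                              // FR
];

const JUNK_PATTERNS: RegExp[] = [
  /^\s*[0-9]+([.,][0-9]+)?\s*$/, // pure numeric titles (Oman KPI values)
  /^[A-Z0-9][A-Z0-9-]{4,40}$/,   // bare reference codes (NZ)
  /TEST-INGEST-/,                // test-data artefacts
];

export function isJunkTitle(title: string | null | undefined): boolean {
  if (!title) return true;
  const t = title.trim();
  if (t.length < 15) return true; // sub-15-char stubs
  if (JUNK_PATTERNS.some((re) => re.test(t))) return true;
  if (PORTAL_HOMEPAGE_TITLES.some((p) => t.includes(p))) return true;
  const halves = t.split(' — '); // self-doubling "X — X" titles
  if (halves.length === 2 && halves[0] === halves[1]) return true;
  return false;
}
```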
Applied at every public surface: sitemap generator, homepage hero feed, /tenders listing, country×CPV programmatic pages, and the per-tender detail page. Detail pages with junk titles now emit `<meta name="robots" content="noindex, nofollow">` so any URL Google has already crawled gets deindexed on the next refresh, and the visible page substitutes a synthesised fallback (`Public tender from <Country>`) rather than the raw garbage.
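In Next.js App Router terms, the noindex plus fallback-title behaviour can live in `generateMetadata`. A minimal sketch, assuming a hypothetical `getTender()` loader:

```ts
// web/app/tenders/[id]/page.tsx (sketch: getTender() and its fields are assumptions)
import type { Metadata } from 'next';
import { isJunkTitle } from '@/lib/quality';

// Hypothetical loader standing in for the real data access layer.
declare function getTender(id: string): Promise<{ title: string; countryName: string }>;

export async function generateMetadata(
  { params }: { params: { id: string } },
): Promise<Metadata> {
  const tender = await getTender(params.id);
  const junk = isJunkTitle(tender.title);
  return {
    // Synthesised fallback replaces the raw garbage title.
    title: junk ? `Public tender from ${tender.countryName}` : tender.title,
    // Emits <meta name="robots" content="noindex, nofollow"> for junk pages.
    robots: junk ? { index: false, follow: false } : undefined,
  };
}
```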
2 · Backend insert-time filter
`DataQualityService.GARBAGE_TITLE_PATTERNS` in the backend's scrape module gained six new regex patterns mirroring the frontend filter. They're checked in tender-upsert.util.ts before the row hits Postgres, closing the loop so junk stops accumulating from new scrape cycles.
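Roughly, the guard looks like this (a sketch: the helper name and import path are assumptions; only `DataQualityService.GARBAGE_TITLE_PATTERNS` is from the codebase):

```ts
// scrape/tender-upsert.util.ts (sketch of the insert-time guard)
import { DataQualityService } from './data-quality.service'; // path assumed

// Runs before the row ever hits Postgres, so new scrape cycles stop adding junk.
export function passesTitleQuality(title: string): boolean {
  const t = title.trim();
  return !DataQualityService.GARBAGE_TITLE_PATTERNS.some((re: RegExp) => re.test(t));
}
```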
3 · DB archive of historical junk
A one-off Cloud Run Job ran the equivalent SQL across the prod
database. Flipped status = ARCHIVED on 187 tenders:
| Pattern | Count |
|---|---:|
| Pure numeric (Oman KPI) | 99 |
| Pure reference codes (NZ) | 11 |
| Portal homepages (DE/FR/ES/IT/PT/PL) | 9 |
| Test ingest artefacts | 1 |
| Too short (< 15 chars; overlaps with above) | 163 |
| Unique rows archived | 187 |
Oman's ACTIVE corpus went 99 → 0 in a single update — every
"tender" had been a KPI value, none were real.
4 · Sitemap tightening
Even after junk was out, 8 548 URLs was still too much. A young domain has an effective Google crawl budget of around 300 URLs per day — at 8 548 URLs in the sitemap, Google would spend a month grinding through before getting to most of our content.
The new `isSitemapEligible()` gate is stricter than the user-facing filter:
- Junk-title-free
- Description ≥ 50 chars
- `organization` field set
- `publishedAt` within 90 days
Tenders that fail this gate still exist as detail pages and are reachable via internal links from country/sector hubs and search — Google will crawl them naturally as the domain's authority builds. We just don't ask up front.
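A sketch of the gate, assuming these field names on the tender row (`isJunkTitle` is the commit-1 filter from the same file):

```ts
// web/lib/quality.ts (sketch: field names on the tender row are assumptions)
const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000;

interface TenderRow {
  title: string;
  description?: string | null;
  organization?: string | null;
  publishedAt?: string | null; // ISO timestamp
}

export function isSitemapEligible(t: TenderRow): boolean {
  if (isJunkTitle(t.title)) return false;                           // junk-title-free
  if (!t.description || t.description.length < 50) return false;    // description >= 50 chars
  if (!t.organization) return false;                                 // organization field set
  if (!t.publishedAt) return false;
  return Date.now() - Date.parse(t.publishedAt) <= NINETY_DAYS_MS;   // published within 90 days
}
```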
Result: sitemap dropped from 8 548 → 2 186 URLs (a ~4× reduction). New breakdown: 1 824 tender details + 109 country×sector landings + 127 sources + 63 countries + 45 sectors + 6 value bands + 12 static. At ~300 URLs a day, that's roughly a week of crawl budget instead of a month.
5 · End-to-end freshness pipeline
The sitemap tightening only matters if fresh content reaches Google quickly. Three layers now in place:
- Backend → web webhook (sub-second). Every tender upsert fires a POST to /api/revalidate with a shared REVALIDATE_SECRET, which calls revalidateTag('tenders-sitemap') so the next sitemap fetch rebuilds.
- Next ISR safety net (5 min). If the webhook ever silently fails, the sitemap still regenerates on its 5-minute ISR interval.
- IndexNow (minutes to Bing/Yandex). On insert, the backend submits the new tender URL to the IndexNow API; the INDEXNOW_KEY is now served by /api/indexnow-key for verification. Google has reportedly experimented with IndexNow signals as well.
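The web half of the webhook is a small route handler. A sketch: the secret header name is an assumption, while the tag and response shape come from the smoke test below.

```ts
// web/app/api/revalidate/route.ts (sketch: header name is an assumption)
import { revalidateTag } from 'next/cache';
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  if (req.headers.get('x-revalidate-secret') !== process.env.REVALIDATE_SECRET) {
    return NextResponse.json({ ok: false }, { status: 401 });
  }
  revalidateTag('tenders-sitemap'); // the next sitemap fetch rebuilds
  return NextResponse.json({
    ok: true,
    revalidated: { tag: 'tenders-sitemap', path: '/sitemap.xml' },
  });
}
```

And the backend's IndexNow submission, following the public IndexNow API (the tender URL format is illustrative):

```ts
// Backend IndexNow ping (sketch: tender URL format is illustrative)
async function pingIndexNow(tenderUrl: string): Promise<void> {
  await fetch('https://api.indexnow.org/indexnow', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json; charset=utf-8' },
    body: JSON.stringify({
      host: 'datameshconsulting.co.uk',
      key: process.env.INDEXNOW_KEY,
      keyLocation: 'https://datameshconsulting.co.uk/api/indexnow-key',
      urlList: [tenderUrl],
    }),
  });
}
```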
All three layers smoke-tested live: secrets set on both asistan-api and asistan-web Cloud Run services, /api/indexnow-key returns the key (HTTP 200), and /api/revalidate with the correct secret returns `{ ok: true, revalidated: { tag: 'tenders-sitemap', path: '/sitemap.xml' } }`.
Country hubs got real content too
The country pages (/countries/<code>) used to be thin —
choropleth map, a couple of stats, a sector grid. For the
~40 countries without an authored editorial, that wasn't
enough to earn an index slot. Two additions:
- A 5-question FAQ block per country, with FAQPage JSON-LD attached (sketch below)
- Three templated body paragraphs for un-editorialised countries
Combined with the FAQ, every country hub now has well over 300 words of unique body copy + FAQPage schema, which is the practical floor for Google's "this page deserves indexing" call.
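The schema side is a small component. A sketch; the component name, prop shape, and the sample question are illustrative:

```tsx
// Country hub FAQ schema (sketch: component and prop names are illustrative)
type Faq = { question: string; answer: string };

export function CountryFaqSchema({ faqs }: { faqs: Faq[] }) {
  const jsonLd = {
    '@context': 'https://schema.org',
    '@type': 'FAQPage',
    mainEntity: faqs.map((f) => ({
      '@type': 'Question',
      name: f.question,
      acceptedAnswer: { '@type': 'Answer', text: f.answer },
    })),
  };
  // Rendered into the page so Google sees the FAQPage markup alongside the visible FAQ.
  return (
    <script
      type="application/ld+json"
      dangerouslySetInnerHTML={{ __html: JSON.stringify(jsonLd) }}
    />
  );
}
```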
iOS distributed-proxy scraping → opt-in
Quietly related: the iOS app implements an "iOS-as-residential-proxy" pattern where opted-in devices fetch portal HTML using their home/cellular network, bypassing server-side rate limits. Previously the setting defaulted to ON, so every fresh install started fetching gov-portal HTML in the background on first launch. That's generous (crowd-sourced residential proxying) but unexpected.
Flipped the default to OFF. The Profile → Contribute toggle
remains for users who want to participate. Heartbeats still
fire from BGAppRefreshTask every ~15 min — the backend
keeps a roster of available devices so we can route proxy
jobs the moment we hit a portal that blocks Cloud Run egress.
~30 bytes per ping, cost negligible.
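Conceptually the roster is just device IDs against last-seen timestamps. An in-memory sketch (real storage, auth, and job routing omitted):

```ts
// Device heartbeat roster (sketch: in-memory for illustration only)
const lastSeen = new Map<string, number>();

export function recordHeartbeat(deviceId: string): void {
  lastSeen.set(deviceId, Date.now()); // the ~30-byte ping is effectively just an ID
}

export function availableDevices(maxAgeMs = 20 * 60 * 1000): string[] {
  const cutoff = Date.now() - maxAgeMs; // tolerate one missed ~15-min BGAppRefreshTask
  return [...lastSeen].filter(([, seen]) => seen >= cutoff).map(([id]) => id);
}
```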
What to expect
Index recovery on a young domain is slow even after you fix the root cause. Google's quality estimate of a domain takes 2-4 weeks to recover, longer if "Crawled — currently not indexed" has accumulated for a while.
- Week 1-2. The "Excluded by 'noindex' tag" count rises as Google re-crawls the junk detail pages and drops them.
- Week 2-3. "Crawled — currently not indexed" starts to shrink as the domain-level quality signal improves.
- Week 3-4. "Indexed" count starts climbing from 0.
- Beyond. Every new tender lands in the sitemap within minutes of upsert, with IndexNow pinging participating engines almost immediately.
Backlinks remain the big multiplier we haven't pulled yet — one or two moderately-authoritative inbound links would compound all of the above. That's the next focus.