DataMesh Consulting · 2026-05-12
12 May — SEO recovery sprint, sitemap drops 8 548 → 2 186 URLs, freshness pipeline live
A Google Search Console audit reported 9 189 URLs in our sitemap but only 21 crawled and 0 indexed. We traced it back to junk-titled tenders inflating the corpus with auto-generated-looking pages, then shipped a five-commit fix that cleans them out at the source, tightens the sitemap to high-confidence URLs only, and wires up an end-to-end freshness pipeline so future tenders reach Google within minutes.
The trigger
A Google Search Console snapshot of datameshconsulting.co.uk
showed the numbers we'd been dreading on a young domain:
- Sitemap: 9 189 URLs discovered
- Crawled: 21
- Indexed: 0
- 17 pages "Crawled — currently not indexed", 4 pages in a second not-indexed bucket
Over four weeks online, we'd asked Google to discover 9 000+ URLs and it had indexed none of them. The "Crawled — currently not indexed" bucket is the worst signal: it means Google looked at the page, decided the content didn't warrant indexing, and moved on. Accumulate enough of those and the whole-domain crawl budget shrinks; we were on track for permanent invisibility.
The diagnosis
We sampled tender titles across the corpus by country. The quality was wildly inconsistent:
| Country | Sample title | Verdict |
|---|---|---|
| GB | "Human Resources Support Services for Camborne Town Council" | ✅ Real |
| EU | "Poland – Medical equipments – Postępowanie…" | ✅ Real |
| OM | 134022360.00, 100780500.00, 16083.00 الأشجار | ❌ KPI values misextracted into the title field |
| DE | "Öffentliche Ausschreibungen" | ❌ Portal homepage title |
| FR | "Accueil \| Pages — boamp.fr" | ❌ Portal homepage title |
| ES | "Plataforma de Contratación del Estado" | ❌ Portal homepage title |
| NZ | "746-26-835-PS", "TMT017-2526-M" | ❌ Bare reference numbers |
| MK | "TEST-INGEST-1778320031830" | ❌ Test artefact in prod |
In total, 187 tenders in the public corpus had titles that read to Google as auto-generated, low-value content. Oman in particular was 100% junk — every single ACTIVE Oman tender was a numeric KPI misextracted by the scraper. Google was correctly flagging these, and the cumulative quality signal was dragging the whole domain down.
The fix — five commits over twelve hours
1 · Frontend junk filter (web/lib/quality.ts)
A new `isJunkTitle()` heuristic catches the six patterns above:
- Pure numeric titles (Oman) — `^\s*[0-9]+([.,][0-9]+)?\s*$`
- Portal homepage titles — substring match against a curated list
- Pure reference codes (NZ) — `^[A-Z0-9][A-Z0-9-]{4,40}$`
- Test-data artefacts — `TEST-INGEST-` prefix
- Sub-15-char stubs
- Self-doubling titles — `X — X` patterns where the description echoes the title
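For reference, a condensed sketch of what the filter looks like (the portal-title list here is illustrative, not the full curated list):

```ts
// web/lib/quality.ts (condensed sketch: portal list and thresholds are illustrative)
const PORTAL_HOMEPAGE_TITLES = [
  'Öffentliche Ausschreibungen',           // DE
  'Plataforma de Contratación del Estado', // ES
  'boamp.fr',                              // FR
];

const JUNK_PATTERNS: RegExp[] = [
  /^\s*[0-9]+([.,][0-9]+)?\s*$/, // pure numeric titles (Oman KPI values)
  /^[A-Z0-9][A-Z0-9-]{4,40}$/,   // bare reference codes (NZ)
  /TEST-INGEST-/,                // test-data artefacts
];

export function isJunkTitle(title: string | null | undefined): boolean {
  if (!title) return true;
  const t = title.trim();
  if (t.length < 15) return true; // sub-15-char stubs
  if (JUNK_PATTERNS.some((re) => re.test(t))) return true;
  if (PORTAL_HOMEPAGE_TITLES.some((p) => t.includes(p))) return true;
  const halves = t.split(' — '); // self-doubling "X — X" titles
  if (halves.length === 2 && halves[0] === halves[1]) return true;
  return false;
}
```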
Applied at every public surface: sitemap generator, homepage hero feed, /tenders listing, country×CPV programmatic pages, and the per-tender detail page. Detail pages with junk titles now emit `<meta name="robots" content="noindex, nofollow">` so any URL Google has already crawled gets deindexed on the next refresh, and the visible page substitutes a synthesised fallback (`Public tender from <Country>`) rather than the raw garbage.
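In Next.js App Router terms, the noindex plus fallback-title behaviour can live in `generateMetadata`. A minimal sketch, assuming a hypothetical `getTender()` loader:

```ts
// web/app/tenders/[id]/page.tsx (sketch: getTender() and its fields are assumptions)
import type { Metadata } from 'next';
import { isJunkTitle } from '@/lib/quality';

// Hypothetical loader standing in for the real data access layer.
declare function getTender(id: string): Promise<{ title: string; countryName: string }>;

export async function generateMetadata(
  { params }: { params: { id: string } },
): Promise<Metadata> {
  const tender = await getTender(params.id);
  const junk = isJunkTitle(tender.title);
  return {
    // Synthesised fallback replaces the raw garbage title.
    title: junk ? `Public tender from ${tender.countryName}` : tender.title,
    // Emits <meta name="robots" content="noindex, nofollow"> for junk pages.
    robots: junk ? { index: false, follow: false } : undefined,
  };
}
```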
2 · Backend insert-time filter
`DataQualityService.GARBAGE_TITLE_PATTERNS` in the backend's scrape module gained six new regex patterns mirroring the frontend filter. They're checked in tender-upsert.util.ts before the row hits Postgres, closing the loop so junk stops accumulating from new scrape cycles.
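Roughly, the guard looks like this (a sketch: the helper name and import path are assumptions; only `DataQualityService.GARBAGE_TITLE_PATTERNS` is from the codebase):

```ts
// scrape/tender-upsert.util.ts (sketch of the insert-time guard)
import { DataQualityService } from './data-quality.service'; // path assumed

// Runs before the row ever hits Postgres, so new scrape cycles stop adding junk.
export function passesTitleQuality(title: string): boolean {
  const t = title.trim();
  return !DataQualityService.GARBAGE_TITLE_PATTERNS.some((re: RegExp) => re.test(t));
}
```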
3 · DB archive of historical junk
A one-off Cloud Run Job ran the equivalent SQL across the prod
database. Flipped status = ARCHIVED on 187 tenders:
| Pattern | Count |
|---|---:|
| Pure numeric (Oman KPI) | 99 |
| Pure reference codes (NZ) | 11 |
| Portal homepages (DE/FR/ES/IT/PT/PL) | 9 |
| Test ingest artefacts | 1 |
| Too short (< 15 chars; overlaps with above) | 163 |
| Unique rows archived | 187 |
Oman's ACTIVE corpus went 99 → 0 in a single update — every
"tender" had been a KPI value, none were real.
4 · Sitemap tightening
Even after junk was out, 8 548 URLs was still too much. A young domain has an effective Google crawl budget of around 300 URLs per day — at 8 548 URLs in the sitemap, Google would spend a month grinding through before getting to most of our content.
The new `isSitemapEligible()` gate is stricter than the user-facing filter:
- Junk-title-free
- Description ≥ 50 chars
- `organization` field set
- `publishedAt` within 90 days
Tenders that fail this gate still exist as detail pages and are reachable via internal links from country/sector hubs and search — Google will crawl them naturally as the domain's authority builds. We just don't ask up front.
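A sketch of the gate, assuming these field names on the tender row (`isJunkTitle` is the commit-1 filter from the same file):

```ts
// web/lib/quality.ts (sketch: field names on the tender row are assumptions)
const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000;

interface TenderRow {
  title: string;
  description?: string | null;
  organization?: string | null;
  publishedAt?: string | null; // ISO timestamp
}

export function isSitemapEligible(t: TenderRow): boolean {
  if (isJunkTitle(t.title)) return false;                           // junk-title-free
  if (!t.description || t.description.length < 50) return false;    // description >= 50 chars
  if (!t.organization) return false;                                 // organization field set
  if (!t.publishedAt) return false;
  return Date.now() - Date.parse(t.publishedAt) <= NINETY_DAYS_MS;   // published within 90 days
}
```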
Result: sitemap dropped from 8 548 → 2 186 URLs (a ~4× reduction). New breakdown: 1 824 tender details + 109 country×sector landings + 127 sources + 63 countries + 45 sectors + 6 value bands + 12 static. At ~300 URLs a day, that's roughly a week of crawl budget instead of a month.
5 · End-to-end freshness pipeline
The sitemap tightening only matters if fresh content reaches Google quickly. Three layers now in place:
- Backend → web webhook (sub-second). Every tender upsert fires a POST to /api/revalidate with a shared REVALIDATE_SECRET, which calls revalidateTag('tenders-sitemap') so the next sitemap fetch rebuilds.
- Next ISR safety net (5 min). If the webhook ever silently fails, the sitemap still regenerates on its 5-minute ISR interval.
- IndexNow (minutes to Bing/Yandex). On insert, the backend submits the new tender URL to the IndexNow API; the INDEXNOW_KEY is now served by /api/indexnow-key for verification. Google has reportedly experimented with IndexNow signals as well.
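The web half of the webhook is a small route handler. A sketch: the secret header name is an assumption, while the tag and response shape come from the smoke test below.

```ts
// web/app/api/revalidate/route.ts (sketch: header name is an assumption)
import { revalidateTag } from 'next/cache';
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  if (req.headers.get('x-revalidate-secret') !== process.env.REVALIDATE_SECRET) {
    return NextResponse.json({ ok: false }, { status: 401 });
  }
  revalidateTag('tenders-sitemap'); // the next sitemap fetch rebuilds
  return NextResponse.json({
    ok: true,
    revalidated: { tag: 'tenders-sitemap', path: '/sitemap.xml' },
  });
}
```

And the backend's IndexNow submission, following the public IndexNow API (the tender URL format is illustrative):

```ts
// Backend IndexNow ping (sketch: tender URL format is illustrative)
async function pingIndexNow(tenderUrl: string): Promise<void> {
  await fetch('https://api.indexnow.org/indexnow', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json; charset=utf-8' },
    body: JSON.stringify({
      host: 'datameshconsulting.co.uk',
      key: process.env.INDEXNOW_KEY,
      keyLocation: 'https://datameshconsulting.co.uk/api/indexnow-key',
      urlList: [tenderUrl],
    }),
  });
}
```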
All three layers smoke-tested live: secrets set on both asistan-api and asistan-web Cloud Run services, /api/indexnow-key returns the key (HTTP 200), and /api/revalidate with the correct secret returns `{ ok: true, revalidated: { tag: 'tenders-sitemap', path: '/sitemap.xml' } }`.
Country hubs got real content too
The country pages (/countries/<code>) used to be thin —
choropleth map, a couple of stats, a sector grid. For the
~40 countries without an authored editorial, that wasn't
enough to earn an index slot. Two additions:
- A 5-question FAQ block per country, with FAQPage JSON-LD attached (sketch below)
- Three templated body paragraphs for un-editorialised countries
Combined with the FAQ, every country hub now has well over 300 words of unique body copy + FAQPage schema, which is the practical floor for Google's "this page deserves indexing" call.
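The schema side is a small component. A sketch; the component name, prop shape, and the sample question are illustrative:

```tsx
// Country hub FAQ schema (sketch: component and prop names are illustrative)
type Faq = { question: string; answer: string };

export function CountryFaqSchema({ faqs }: { faqs: Faq[] }) {
  const jsonLd = {
    '@context': 'https://schema.org',
    '@type': 'FAQPage',
    mainEntity: faqs.map((f) => ({
      '@type': 'Question',
      name: f.question,
      acceptedAnswer: { '@type': 'Answer', text: f.answer },
    })),
  };
  // Rendered into the page so Google sees the FAQPage markup alongside the visible FAQ.
  return (
    <script
      type="application/ld+json"
      dangerouslySetInnerHTML={{ __html: JSON.stringify(jsonLd) }}
    />
  );
}
```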
iOS distributed-proxy scraping → opt-in
Quietly related: the iOS app implements an "iOS-as-residential-proxy" pattern where opted-in devices fetch portal HTML using their home/cellular network, bypassing server-side rate limits. Previously the setting defaulted to ON, so every fresh install started fetching gov-portal HTML in the background on first launch. That's generous (crowd-sourced residential proxying) but unexpected.
Flipped the default to OFF. The Profile → Contribute toggle
remains for users who want to participate. Heartbeats still
fire from BGAppRefreshTask every ~15 min — the backend
keeps a roster of available devices so we can route proxy
jobs the moment we hit a portal that blocks Cloud Run egress.
~30 bytes per ping, cost negligible.
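Conceptually the roster is just device IDs against last-seen timestamps. An in-memory sketch (real storage, auth, and job routing omitted):

```ts
// Device heartbeat roster (sketch: in-memory for illustration only)
const lastSeen = new Map<string, number>();

export function recordHeartbeat(deviceId: string): void {
  lastSeen.set(deviceId, Date.now()); // the ~30-byte ping is effectively just an ID
}

export function availableDevices(maxAgeMs = 20 * 60 * 1000): string[] {
  const cutoff = Date.now() - maxAgeMs; // tolerate one missed ~15-min BGAppRefreshTask
  return [...lastSeen].filter(([, seen]) => seen >= cutoff).map(([id]) => id);
}
```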
What to expect
Index recovery on a young domain is slow even after you fix the root cause. Google's quality estimate of a domain takes 2-4 weeks to recover, longer if "Crawled — currently not indexed" has accumulated for a while.
- Week 1-2. The "Excluded by 'noindex' tag" count rises as Google re-crawls the junk detail pages and drops them.
- Week 2-3. "Crawled — currently not indexed" starts to shrink as the domain-level quality signal improves.
- Week 3-4. "Indexed" count starts climbing from 0.
- Beyond. Every new tender lands in the sitemap within minutes of upsert, with IndexNow pinging participating engines almost immediately.
Backlinks remain the big multiplier we haven't pulled yet — one or two moderately-authoritative inbound links would compound all of the above. That's the next focus.