When pages vanish from Google's index, guesswork costs time. This checklist walks you through the five most common bottlenecks — robots.txt blocks, noindex tags, canonical misdirection, crawl budget traps, and server response errors. Use it step by step or jump to the issue you suspect most.
Most diagnostics start with 'check the URL in Search Console.' That gives you a status — Crawled but not indexed, Discovered but not indexed, Excluded — but not the root cause. The real work begins when you correlate that status with your site's technical settings. In practice, when you see 13,000 pages discovered but not indexed in a 50,000-page site, the culprit is almost never a single noindex tag. It's a combination: a bloated sitemap, slow server responses on category pages, and a robots.txt that accidentally blocks the /blog/ path. Start by isolating which of the five zones below is failing.
Get the exact exclusion reason. Not 'not indexed' but the sub-status.
Use the live test tool. Look for unintended Disallow directives on the URL path.
View page source or use the URL Inspection API. Confirm noindex is absent and canonical is self-referencing.
Check total crawl requests, average response time, and by-content-type breakdown in Search Console.
Run curl -I for HTTP status, headers, and load time. 5xx or slow 200s block indexing.
Use site: search or the API. If still missing after fixes, request indexing.
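The page-source check in this sequence can be scripted for a first pass. The sketch below uses only Python's standard library to pull robots meta directives and the canonical link out of an HTML string. The function name `inspect_head` and the sample markup are illustrative assumptions, and the sketch does not look at the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collects robots meta directives and the canonical link from HTML."""

    def __init__(self):
        super().__init__()
        self.robots = []       # content values of robots/googlebot meta tags
        self.canonical = None  # href of the last rel=canonical link seen

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() in ("robots", "googlebot"):
            self.robots.append((a.get("content") or "").lower())
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")


def inspect_head(html, page_url):
    """Return noindex and canonical findings for one page (hypothetical helper)."""
    parser = HeadSignals()
    parser.feed(html)
    canonical = parser.canonical
    return {
        "noindex": any("noindex" in content for content in parser.robots),
        "canonical": canonical,
        # Trailing slashes are ignored here; tighten this if your site treats them as distinct.
        "canonical_is_self": canonical is not None
        and canonical.rstrip("/") == page_url.rstrip("/"),
    }
```

A page that reports `noindex: True` or `canonical_is_self: False` is a candidate for manual review in the URL Inspection tool before you change anything.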
Verify the page URL is not blocked by robots.txt using the <a href="https://developers.google.com/search/docs/monitor-debug/debugging-robots-txt">Google robots.txt debugging tool</a>. Focus on Disallow and Allow rules for the exact path.
Scan for accidental noindex tags in the page <code>&lt;head&gt;</code>. Use a crawler like Screaming Frog on a 5,000-URL sample to catch bulk errors.
Inspect the rel=canonical tag. Ensure it points to the exact URL you want indexed. A single wrong attribute across paginated category pages can cascade.
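Check server logs for 404, 410, and 5xx status codes on the target URL. Repeated 5xx or 503 responses make Googlebot slow its crawl rate, and URLs that keep returning errors over a longer period can be dropped from the index.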
Evaluate internal links pointing to the page. Pages with zero internal links are often left in 'Discovered but not indexed' limbo.
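For the robots.txt item in this checklist, a rough bulk pre-check is possible with Python's built-in `urllib.robotparser`. This is a sketch under assumptions: the helper name `blocked_urls` and the sample rules are hypothetical, and Python's parser does not implement Google's wildcard or longest-match rules, so confirm any hit with Google's own tooling.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical example: a leftover rule from a redesign.
sample_rules = """User-agent: *
Disallow: /product-category/
"""
sample_urls = [
    "https://example.com/product-category/shoes",
    "https://example.com/blog/launch-post",
]
print(blocked_urls(sample_rules, sample_urls))
```

Feeding the function the list of unindexed URLs exported from Search Console gives a quick shortlist of paths to test individually.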
| Diagnostic Zone | Concrete Check | Tools & Metrics | Hidden Failure Mode |
|---|---|---|---|
| robots.txt | Run a live test for the exact URL path. Look for Disallow: / or wildcard rules. | Google Search Console robots.txt tester; cURL with header inspection | A Disallow: /category/ rule aimed at staging accidentally blocks live product pages if paths overlap. |
| noindex tag | Check for a robots meta tag in the page source. Verify there is no X-Robots-Tag in the HTTP headers. | Screaming Frog (filter: noindex); Google URL Inspection API | WordPress SEO plugins sometimes add noindex to pagination pages by default. A site with 200 category pages can lose 600+ product URLs. |
| canonical tag | Confirm rel=canonical points to the exact page URL. Watch for HTTP-to-HTTPS mismatches. | Ahrefs Site Audit (canonical report); manual view of page source | On faceted navigation, a filter parameter like ?color=red may canonicalize to the main category, causing Google to drop the filtered URL. |
| crawl budget | Review Crawl Stats in Search Console. Check average response time per content type. | Search Console Crawl Stats report; log file analysis (e.g., Logz.io) | If 85% of crawl budget goes to /images/ or /tag/ pages, new product pages may wait weeks for a first crawl. |
| server response | Run curl -I -w '%{http_code} %{time_total}'. Look for load times over 3 seconds or 5xx responses. | cURL; Chrome DevTools Network tab; GTmetrix (waterfall) | A 200 response that takes 8 seconds to start sending body bytes can be treated like a timeout by Googlebot, and the page may drop out of the crawl queue. |
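The crawl budget row can be checked against your own logs. The sketch below assumes you have already filtered the access log down to Googlebot requests and extracted the request paths; the function name `crawl_share` and the sample paths are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def crawl_share(paths):
    """Return the share of crawl requests per top-level site section."""
    counts = Counter()
    for raw in paths:
        segments = [s for s in urlsplit(raw).path.split("/") if s]
        # Group by the first path segment, e.g. /images/a.png -> /images/
        section = f"/{segments[0]}/" if segments else "/"
        counts[section] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {section: hits / total for section, hits in counts.most_common()}


# Hypothetical sample of Googlebot request paths.
sample = ["/images/a.png", "/images/b.png", "/tag/sale", "/product/123"]
for section, share in crawl_share(sample).items():
    print(f"{section:12s} {share:.0%}")
```

If low-value sections dominate the output, that is the signal to review internal linking, sitemaps, or robots.txt rules for those paths.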
Context: A mid-size ecommerce site with 50,000 product URLs. Google had indexed only 12,400. The Page indexing report in Search Console showed 3,400 pages in 'Discovered but not indexed' and 1,200 in 'Crawled but not indexed.'
Step 1: We extracted the list of 4,600 unindexed URLs via the Search Console API.
Step 2: We ran the list through a bulk robots.txt tester and found that Disallow: /product-category/ was blocking 2,100 of those URLs, an old rule left over from a site redesign. The robots.txt fix took 10 minutes.
Step 3: Of the remaining 2,500 URLs, 1,800 had a noindex tag injected by the theme's SEO plugin on out-of-stock variants. We removed the tag via a functions.php filter.
Step 4: The final 700 URLs had a canonical tag pointing to the parent category page. We corrected each canonical to be self-referencing.
Result: After these three changes, Google re-crawled 3,100 URLs within 8 days, and the indexed count rose from 12,400 to 15,200.
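The triage order in this walkthrough (robots.txt first, then noindex, then canonical) can be expressed as a small bucketing routine. This is a minimal sketch, assuming each URL has already been annotated with the three findings; the record fields and function names are hypothetical, not the script used in the case.

```python
from collections import Counter


def first_failing_zone(page):
    """Return the first zone that explains why a page is not indexed."""
    if page["robots_blocked"]:
        return "robots"
    if page["noindex"]:
        return "noindex"
    if page["canonical"] != page["url"]:
        return "canonical"
    return "unexplained"


def triage(pages):
    """Count pages per first failing zone so the largest bucket is fixed first."""
    return Counter(first_failing_zone(p) for p in pages)


# Hypothetical annotated records.
pages = [
    {"url": "https://example.com/product-category/a", "robots_blocked": True,
     "noindex": False, "canonical": "https://example.com/product-category/a"},
    {"url": "https://example.com/p/1-red", "robots_blocked": False,
     "noindex": True, "canonical": "https://example.com/p/1-red"},
    {"url": "https://example.com/p/2", "robots_blocked": False,
     "noindex": False, "canonical": "https://example.com/category/"},
]
print(triage(pages).most_common())
```

Checking robots.txt first matters because a blocked page cannot be crawled, so any noindex or canonical tag on it is invisible to Google until the block is lifted.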
Empty results from Search Console API? Sometimes the API returns zero unindexed URLs for a site that clearly has missing pages. This usually means the date range is too narrow or the wrong property type is being queried (URL-prefix property vs domain property). Switch to a domain property and extend the range to 6 months.
Wrong filters on crawler reports. A common situation we see is a team running Screaming Frog with 'Ignore robots.txt' checked, then wondering why their noindex scan shows no blocked URLs. Always run two crawls: one that obeys robots.txt and one that ignores it. Compare the difference.
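Comparing the two crawls comes down to a set difference. The sketch below assumes each crawl has been exported as a plain list of URLs; the function name `robots_hidden_urls` is hypothetical.

```python
def robots_hidden_urls(obeying_crawl, ignoring_crawl):
    """URLs reached only when robots.txt is ignored, i.e. blocked for compliant crawlers."""
    return sorted(set(ignoring_crawl) - set(obeying_crawl))


# Hypothetical crawl exports.
obeying = ["https://example.com/", "https://example.com/blog/"]
ignoring = ["https://example.com/", "https://example.com/blog/",
            "https://example.com/product-category/shoes"]
print(robots_hidden_urls(obeying, ignoring))
```

Any noindex tag found on a URL in this list is moot until the robots.txt block is removed, because a compliant crawler never fetches the page to see it.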
Duplicate lists. When you export sitemaps from Google Search Console and combine them with a crawl export, you often get duplicate rows. Deduplicate by URL before running any analysis. A single duplicate can throw off count-based prioritization.
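A minimal deduplication sketch follows, assuming URLs should be treated as equal when they differ only in scheme or host casing, surrounding whitespace, or a fragment. The helper names are hypothetical, and trailing slashes and query parameter order are deliberately left untouched because many sites serve different content for them.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, and default an empty path to '/'."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def dedupe(urls):
    """Return normalized URLs in first-seen order with duplicates removed."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical merged export from a sitemap list and a crawl.
merged = ["https://Example.com/a", "https://example.com/a#top", "https://example.com/b "]
print(dedupe(merged))
```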
Slow vendors. If you rely on a third-party crawling service, check the refresh cadence. We've seen cases where a vendor's data was 72 hours stale, causing the team to 'fix' issues that were already resolved. Use live tools for final verification.
For a broader operational view of indexing health, refer to the Google Index Update Detection Checklist — it helps distinguish site-level issues from algorithm-driven index fluctuations.
Use the Google Search Console API to automate URL inspection across properties. Build a script that checks each site daily for new 'Excluded' URLs, then categorizes them by reason (robots, noindex, canonical). Prioritize fixes by the count of affected pages. For agencies, a 15-minute cron job per client saves hours of manual diagnosis.
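The daily comparison and prioritization step might look like the sketch below. It assumes each day's run produces a mapping of URL to exclusion reason; how that mapping is collected is out of scope here, and the function names and sample data are hypothetical.

```python
from collections import Counter


def new_exclusions(yesterday, today):
    """Return URLs excluded today that were not in yesterday's snapshot."""
    return {url: reason for url, reason in today.items() if url not in yesterday}


def prioritize(exclusions):
    """Rank exclusion reasons by the number of affected pages, largest first."""
    return Counter(exclusions.values()).most_common()


# Hypothetical snapshots: URL -> exclusion reason.
yesterday = {"https://example.com/old": "canonical"}
today = {
    "https://example.com/old": "canonical",
    "https://example.com/p/1": "noindex",
    "https://example.com/p/2": "noindex",
    "https://example.com/c/x": "robots",
}
print(prioritize(new_exclusions(yesterday, today)))
```

Storing one snapshot per property per day keeps the comparison cheap and makes it easy to see when a deploy introduced a new class of exclusions.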
No. If a page is blocked by robots.txt, Googlebot cannot crawl its content, regardless of backlinks. At most, Google may list the bare URL without a description, relying on anchor text from the linking pages. To get the page content indexed, first remove the Disallow rule, then request indexing via Search Console.
Use the Google Search Console URL Inspection API. Each request inspects a single URL, so loop over your URL list rather than batching. Store the inspectionResult.indexStatusResult.verdict field and separate 'PASS' (indexed) from 'FAIL' and 'NEUTRAL' (not indexed; excluded pages usually come back as 'NEUTRAL'). For sites over 50,000 URLs, iterate through sitemaps rather than random URL lists, and spread the calls over several days to stay within the per-property API quota.
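The sketch below shows the shape of one inspection call and how the verdict might be pulled from the response. It makes no network request: authentication (an OAuth 2.0 bearer token) and the HTTP client are left out, the sample response is hypothetical, and the endpoint and field names reflect the public API reference as I understand it, so verify them against the current documentation.

```python
import json

# Endpoint for the URL Inspection API; each POST carries exactly one page URL.
ENDPOINT = "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"


def build_request_body(site_url, page_url):
    """JSON body for a single inspection; site_url must match the verified property."""
    return json.dumps({"inspectionUrl": page_url, "siteUrl": site_url})


def summarize(response):
    """Extract the fields most useful for triage from a parsed API response."""
    status = response.get("inspectionResult", {}).get("indexStatusResult", {})
    return {
        "verdict": status.get("verdict", "VERDICT_UNSPECIFIED"),
        "coverage": status.get("coverageState"),
        "robots": status.get("robotsTxtState"),
        "indexing": status.get("indexingState"),
    }


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "inspectionResult": {
        "indexStatusResult": {
            "verdict": "NEUTRAL",
            "coverageState": "Excluded by 'noindex' tag",
            "robotsTxtState": "ALLOWED",
            "indexingState": "BLOCKED_BY_META_TAG",
        }
    }
}
print(build_request_body("sc-domain:example.com", "https://example.com/p/1"))
print(summarize(sample_response))
```

Keeping the robots and indexing states next to the verdict lets one pass over the results sort URLs into the robots, noindex, and canonical zones.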
First, check if the page was previously indexed using the Search Console date filter. If it lost index status, compare the page content against Google's helpful content guidelines. Then run the five-zone diagnostic (robots, noindex, canonical, crawl stats, server response). Often, a core update raises the quality bar, which makes thin pages fall out of the index.
Guest posts often face indexing issues because they live on domains with low crawl priority or have thin content. Ensure the post has at least 800 words, a unique image, and internal links from the host site's main content area. Ask the site owner to request indexing through the URL Inspection tool in Search Console; the Inspection API only reports status and cannot submit URLs. Avoid placing guest posts on sites with a high ratio of outbound links to body content.
Look for 'Server error (5xx)' and 'Redirect error' in the URL inspection result. A high count of 'Crawled but not indexed' with 200 status but slow load times (over 3 seconds) also points to server-side timeout issues. Check the Crawl Stats report for a rising average response time, especially on mobile-first crawl data.
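To apply these thresholds to a batch of URLs, the status and timing values can be classified in a few lines. This is a sketch, assuming each input line holds a status code and a total time in seconds, as produced by a curl write-out format such as '%{http_code} %{time_total}'. The function names and labels are hypothetical, and the 3-second cutoff is taken from the text above rather than from any published Googlebot limit.

```python
def parse_timing_line(line):
    """Split a 'status seconds' line, e.g. '200 8.2', into (int, float)."""
    code, seconds = line.split()
    return int(code), float(seconds)


def classify_response(status, seconds, slow_after=3.0):
    """Label a response by the server-side signals relevant to indexing."""
    if 500 <= status <= 599:
        return "server_error"
    if 300 <= status <= 399:
        return "redirect"
    if status in (404, 410):
        return "gone"
    if status == 200 and seconds > slow_after:
        return "slow_200"
    return "ok" if status == 200 else "other"


# Hypothetical measurements, one per URL.
for line in ["200 0.6", "200 8.2", "503 0.4"]:
    print(line, "->", classify_response(*parse_timing_line(line)))
```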
Google Search Console itself is the best free tool. Use the 'Pages' tab, filter by 'Not indexed,' and export the list. For a structured workflow, clone the Google Index Update Detection Checklist and cross-reference each URL against the five zones. No paid tool is required for the initial 80% of fixes.
Check whether the raw HTTP response already contains the page content. If the page returns a 200 with an essentially empty body (common with client-side rendering), Googlebot may see a blank page. Use the 'View crawled page' feature in Search Console. If the rendered HTML lacks content, implement server-side rendering or pre-rendering for critical pages.
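A rough way to spot an empty app shell is to count the visible words in the HTML you receive before any JavaScript runs. The sketch below uses Python's standard `html.parser`; the class and function names and the 20-word threshold are assumptions to tune per site.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Counts words outside script, style, noscript, and template elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())


def looks_empty(html, min_words=20):
    """True when the unrendered HTML carries fewer than min_words visible words."""
    parser = VisibleText()
    parser.feed(html)
    return parser.words < min_words


# Hypothetical client-side rendered shell.
shell = ('<html><head><title>Shop</title><script>var app = 1;</script></head>'
         '<body><div id="root"></div></body></html>')
print(looks_empty(shell))
```

Pages flagged here are worth comparing against the rendered HTML shown by 'View crawled page', since Google may still render the JavaScript successfully.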