A story about changing the sitemap generation logic of a blog and resubmitting it to Google/Bing

This article is a translation of the following article of mine:

 

 

* Translated automatically by Google.
* Please note that some links or referenced content in this article may be in Japanese.
* Comments in the code are basically in Japanese.

 

by bokumin

 

Introduction

 

Recently, I added English-language pages to this personal blog.
Until now it had been running with Japanese articles only, but since I occasionally saw traffic from overseas, I implemented multilingual support (i18n).

 

However, just rewriting the blog system (SSG) code and translating the articles is not enough. If you do not properly tell search engines that new language pages exist or that the URL structure has changed, the pages you created will not get indexed.

 

This time, I would like to share the thought process and work involved in redesigning the sitemap (sitemap.xml) to go along with this change.

 

To state the conclusion first: with this configuration change, new articles were reflected in search results as early as 24 hours (the next day) after publication.
However, this also rests on the prerequisite of “domain trustworthiness”, so I will explain that relationship as well.

 


 

1. Physical division and re-registration of sitemap

 

With this multilingual support, the URL structure changed (Japanese directly under the root, English under /en/, and so on).
Along with this, I changed the sitemap generation logic so that the physical files are split by language.

 

Why split the files?

 

The sitemap specification allows URLs for all languages to be listed in a single sitemap.xml, but I chose to split them for the following reasons.
The first is manageability. If I later want to exclude only the English articles or analyze index status per language, separate files make them easier to track and isolate in Search Console.
The second is parsing load. As the number of articles grows, serving the URLs in smaller chunks consumes fewer crawler resources than having one huge XML file parsed.

 

Specifically, I modified the build script to generate the following two files.

 

sitemap.xml — Japanese articles only (URLs not containing /en/)
sitemap-en.xml — English articles only (URLs under /en/)

 

When a sitemap holds too many URLs or becomes hard to manage, splitting it is the better approach.
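
To make the split concrete, here is a minimal sketch of how a build step could partition URLs by language. The field names (url, lang) and the write_sitemap helper are hypothetical stand-ins; this is not the blog's actual build script.

from xml.sax.saxutils import escape

# Post records; url/lang are placeholder fields for the blog's real metadata.
posts = [
    {"url": "https://example.com/posts/sitemap-rebuild/", "lang": "ja"},
    {"url": "https://example.com/en/posts/sitemap-rebuild/", "lang": "en"},
]

def write_sitemap(path, urls):
    # Write a bare-bones urlset containing only <loc> entries.
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for u in urls:
            f.write(f"  <url><loc>{escape(u)}</loc></url>\n")
        f.write("</urlset>\n")

# sitemap.xml: Japanese articles only (URLs not containing /en/)
write_sitemap("sitemap.xml", [p["url"] for p in posts if p["lang"] == "ja"])
# sitemap-en.xml: English articles only (URLs under /en/)
write_sitemap("sitemap-en.xml", [p["url"] for p in posts if p["lang"] == "en"])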

 

Resubmit to Google Search Console

 

Because the configuration has changed, the sitemaps need to be re-registered in Google Search Console. Leaving the old information in place would cause crawl errors (404s, etc.), so I reset the status with the following steps.

 

  1. Access the “Sitemap” menu on the admin screen.
  2. Select the existing (old) sitemap.xml and execute “Delete sitemap”.
  3. Submit each of the two generated files from the “Add new sitemap” form.
    • sitemap.xml
    • sitemap-en.xml
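
I did these steps by hand in the UI, but for reference the same submission can also be scripted through the Search Console (Webmasters v3) API. The following is only a hedged sketch: it assumes google-api-python-client is installed and that a service account added as an owner of the property is used; the credentials.json path and the site URL are placeholders.

from google.oauth2 import service_account
from googleapiclient.discovery import build

# Service account credentials; the JSON key path is a placeholder, and the
# account must be added as an owner of the Search Console property.
SCOPES = ["https://www.googleapis.com/auth/webmasters"]
creds = service_account.Credentials.from_service_account_file(
    "credentials.json", scopes=SCOPES
)
service = build("webmasters", "v3", credentials=creds)

SITE = "https://example.com/"
for path in ("sitemap.xml", "sitemap-en.xml"):
    # submit() registers (or re-registers) the sitemap for the property.
    service.sitemaps().submit(siteUrl=SITE, feedpath=SITE + path).execute()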

     

 

Don’t forget Bing Webmaster Tools

 

Just as with Google, I opened Bing Webmaster Tools and submitted the same two files from its sitemap feature.
Bingbot crawls on a different cycle than Googlebot, so explicitly telling it the endpoints reduces the initial latency before indexing.

 

[Supplement] Also register the feed (RSS/Atom)

 

Although often overlooked, Google Search Central recommends submitting both a sitemap and an RSS/Atom feed.

 

  • Sitemap: a “map” of the entire site that conveys the structure of all pages.
  • Feed (RSS/Atom): a “notification” of the latest updates that communicates new content quickly.

 

Since each has a different role, registering both in Search Console maximizes crawler coverage. This time, I registered the feed as well as the sitemap.

 


 

2. Infrastructure settings

 

In order to pass correct information to search engines, it is essential to optimize not only the application layer (sitemap generation) but also the infrastructure layer (distribution settings).

 

Updating robots.txt

 

robots.txt acts like an access control list (ACL) for crawlers.
If you do not declare the sitemap location here, crawlers other than the search engines you submitted it to directly may never discover the sitemap.

 

I declared the two sitemap locations as absolute URLs (FQDN), as shown below. In addition, system paths such as the admin screens are set to Disallow so that crawl resources are not wasted on them.

 

User-agent: *
Disallow: /admin/
Disallow: /drafts/

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-en.xml

 

Cloudflare Page Rules settings (cache bypass)

 

If you are building a blog using a static site generator (SSG) and using a CDN (Cloudflare, etc.), you need to be careful about “Sitemap cache inconsistency”.

 

Even if a new sitemap.xml is generated on the origin server, as long as the CDN edge nodes keep serving a cached copy, Googlebot receives the old list (stale content). If that happens, new articles are never discovered.

 

A sitemap is a lightweight text file, so the extra load on the origin from cache misses is negligible.
I therefore used Cloudflare’s Page Rules feature to apply the following settings.

 

  • URL pattern: *domain.com/sitemap*.xml
    • Use wildcards to match both Japanese and English versions.

     

  • Settings (Cache Level): Bypass (do not cache)

 

This allows the crawler to always retrieve the latest sitemap just generated from the origin.
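
A quick way to confirm that the bypass rule is actually in effect is to look at the CF-Cache-Status response header that Cloudflare attaches to responses (BYPASS when the rule applies, HIT when a cached copy was served). The check below is just a small standard-library sketch with a placeholder domain.

from urllib.request import urlopen

# example.com is a placeholder for the actual domain.
for path in ("/sitemap.xml", "/sitemap-en.xml"):
    with urlopen("https://example.com" + path) as resp:
        # Cloudflare reports its cache decision here; with the Page Rule in
        # place this should read BYPASS instead of HIT.
        print(path, "->", resp.headers.get("CF-Cache-Status", "n/a"))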

 

You can find the script used to generate these files below.

 

 


 

3. Generation logic and date (Lastmod) control

 

This is the logic part that I paid the most attention to this time.
The output logic for the <lastmod> (last updated date) tag in the XML sitemap has been branched according to the operational characteristics of each language.

 

Japanese articles (JA) → use the modified date

 

Japanese articles are original content. If you rewrite or add technical information, you should notify search engines of the update.
Therefore, updatedAt (modified date) in the article’s metadata is output to lastmod. This allows you to emphasize the freshness of your information on SERPs (search results pages).

 

English articles (EN) → use the publish date

 

On the other hand, for the newly created English articles I deliberately output publishedAt (publish date), and the policy is not to change the lastmod even if the internal modified date changes.

 

The reason is as follows.

 

  1. Coping with translation tuning:
    In the early stage of operation, minor fixes such as fine-tuning phrasing or correcting typos happen frequently.
  2. Stability of evaluation signals:
    If the date moves every time a few words are touched, the search algorithm sees content that is updated frequently but barely changed, which risks becoming noise in quality evaluation.

 

Since the original is the Japanese version and the English version mainly receives minor edits to improve translation quality, this is a safety measure: do not request a re-crawl unless it is a major rewrite.
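
As a rough sketch, this branching can be captured in a single helper like the one below; the metadata keys (lang, published_at, updated_at) and the sample dates are hypothetical stand-ins for whatever the actual front matter uses.

from datetime import date

def lastmod_for(post: dict) -> str:
    # JA: surface the modified date so updates are signalled to crawlers.
    # EN: pin lastmod to the publish date so translation tweaks stay silent.
    if post["lang"] == "ja":
        d = post.get("updated_at") or post["published_at"]
    else:
        d = post["published_at"]
    return d.isoformat()

post = {"lang": "en", "published_at": date(2024, 5, 1), "updated_at": date(2024, 6, 10)}
print(lastmod_for(post))  # 2024-05-01: minor translation fixes do not move lastmod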

 


 

4. Notification design considering crawl budget

 

Next is the update notification (ping) mechanism.
Aiming for “article published = indexed immediately” is important, but as an engineer you also need a design that does not waste the crawl budget (Googlebot’s resources).

 

Why the Google Indexing API was not adopted

 

You sometimes see claims that “the Indexing API gets you indexed immediately”, but this time I decided against it.
Google’s official documentation limits its use as follows.

 

Currently, the Indexing API can only be used to crawl pages with either JobPosting or BroadcastEvent embedded in a VideoObject.

 

 

In other words, it is an API dedicated to short-lived content that appears and disappears frequently, such as job postings and livestreams, and it is not intended for ordinary, permanent blog articles. Misusing the API risks violating the guidelines.

 

WebSub (formerly PubSubHubbub) and conditional notifications

 

Instead, I use the standard WebSub protocol.
However, here too the notifications need control that takes batch processing into account.

 

When adding multilingual support, batch jobs such as “translate and deploy all past articles in one go” happen.
If the notification function fires unconditionally at that moment, hundreds of pings are sent to Google at once. From the search engine’s perspective, it looks like this.

 

  1. Receive bulk update notifications and allocate crawler resources.
  2. If you check the sitemap, the date has not changed (due to the Lastmod settings mentioned above).
  3. Or, the content difference is extremely small.
  4. The domain will be determined to be a “domain that sends excessive notifications” and the subsequent crawl frequency will be reduced.

 

This is a waste of crawl budget.
Therefore, we implemented the following conditional branch in the sitemap generation script.

 

  • When a new article is published: Send WebSub notification
  • During batch processing such as regenerating all articles: Do not send WebSub notifications
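
A hedged sketch of this conditional notification is shown below. It assumes the pipeline can pass a flag saying whether genuinely new articles were published, and it pings Google's public WebSub hub (pubsubhubbub.appspot.com) with the feed URL; the flag name and feed URL are placeholders, not the blog's actual implementation.

from urllib.parse import urlencode
from urllib.request import Request, urlopen

HUB = "https://pubsubhubbub.appspot.com/"   # Google's public WebSub hub
FEED = "https://example.com/feed.xml"       # placeholder feed URL

def notify_websub(new_posts: bool) -> None:
    if not new_posts:
        # Bulk regeneration (e.g. translating old articles in one batch):
        # stay silent so the hub is not flooded with meaningless pings.
        return
    data = urlencode({"hub.mode": "publish", "hub.url": FEED}).encode()
    with urlopen(Request(HUB, data=data)) as resp:
        print("WebSub ping:", resp.status)  # the hub answers 204 No Content

notify_websub(new_posts=True)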

 


 

5. Correlation between actual results and domain trustworthiness

 

As a result of applying these settings, we observed cases in the Search Console logs where the crawler reached the article within a few minutes of publication, and the article was indexed the next day (within approximately 24 hours).

 

However, this speed cannot be achieved by technical settings alone. It matters that they are combined with “domain trustworthiness”.

 

Relationship between technical settings and domain evaluation

 

Googlebot’s crawl frequency is allocated based on the domain’s track record and content quality.

 

  • Technical settings (content of this article):
    Implementation to accurately send update signals to the crawler and minimize latency until discovery.
  • Domain trustworthiness:
    The algorithmic basis by which Google determines that this domain deserves priority allocation of crawl budget.

 

This “next-day indexing” result was possible because the blog has been running for a while and has earned a certain trust score from its past update history, so the technical measures (WebSub + an always-fresh sitemap) could work without hitting a bottleneck.

 

Conversely, no matter how good your CI/CD pipeline or notification setup is, you will not get the same speed on a brand-new domain on day one (because of the sandbox period and the like).
On the other hand, if indexing is still slow after a long period of operation, the cause is most likely the kind of technical debt covered here (broken cache settings or notification logic).

 


 

Summary

 

It is easy to think that “sitemaps are automatically generated and that’s it,” but when requirements such as multilingual support and SSG operation are added, there are a wide variety of engineering considerations.

 

  • Improved manageability by splitting physical files
  • Bypassing the CDN cache to ensure data integrity
  • Lastmod logic branching stabilizes evaluation
  • Resource optimization with WebSub conditional submission

 

These settings work automatically once you include them in your pipeline.
I hope this will be helpful to someone.

 

End