Make one change at a time, don't rush it, and maintain control throughout

Updates to the .nl zone need thorough preparation and very cautious implementation

Wednesday 22 November 2023

The .nl zone is hugely important to Dutch society and to the national economy. Updates to the zone's DNS(SEC) system therefore have to be prepared thoroughly and implemented with great caution. However, updates do need to be made promptly, because excessive delay would undermine stability and security.

Our engineers approach essential maintenance, such as the recent migration from DNSSEC algorithm 8 to algorithm 13, primarily from a technical perspective. "If you let yourself dwell on the social and economic consequences of the zone going down, you wouldn't dare do anything. Questions would certainly be asked in parliament, and, in the worst-case scenario, an outage could lead to intervention by the Ministry of Economic Affairs." Our engineers therefore make a point of studying all incident reports released by other registries – as well as learning all they can about things that went well elsewhere. "Incidents impact the internet as a whole. So sharing information about them is vital for prevention."

Over the last year, there have been several occasions when a top-level domain had an issue with its DNSSEC configuration, making the domain unreachable for validating resolver users, even if only briefly.

Inconsistent DNSSEC records in the .nz zone

The most serious incident involved New Zealand's .nz top-level domain, which experienced major reachability problems for a period of 15 to 20 hours on 29 and 30 May [1, 2]. The problems were caused by the annual KSK rollover for the zone's traditional (categorical) second-level domains, during which the old KSK signatures attached to the ZSK keys were deleted too soon. At the time of deletion, the old signatures were still needed according to the DS records held in the caches of some validating resolvers. Consequently, resolvers with cached records regarded the domain names in question as bogus and correctly blocked them. Although InternetNZ engineers quickly realised what was amiss, it was too late. The situation could be rectified only by waiting for the old DS records cached by validating resolvers to expire (at the end of the TTL) or by manually flushing them (as InternetNZ asked the main access providers to do).

Figure 1: Percentage of mobile devices experiencing validation problems over the course of the incident. Source: InternetNZ.

Human errors and unforeseen circumstances

The underlying cause of the incident was an oversight during deployment of the new InternetNZ Registry System (IRS) six months earlier: the OpenDNSSEC timing parameters weren't aligned with the new system, as they should have been. As a result, when the KSKs were rolled over, the OpenDNSSEC signer erroneously assumed that the old DS records had already been flushed from resolver caches around the internet, which wasn't the case. For a detailed account of the incident, see the technical report and external report.
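
The timing constraint behind the incident can be illustrated with a minimal Python sketch. It is not InternetNZ's or OpenDNSSEC's actual logic: the function names and all durations below are hypothetical, chosen only to show why the old KSK signatures have to outlive every cached copy of the old DS record.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_removal(
    old_ds_withdrawn_at: datetime,
    ds_ttl: timedelta,
    parent_propagation_delay: timedelta,
    safety_margin: timedelta,
) -> datetime:
    """Return the earliest moment at which the old KSK and its signatures
    can be removed without breaking validation.

    A resolver may have cached the old DS record just before the last parent
    name server stopped serving it, and may keep it for a full TTL.
    """
    return old_ds_withdrawn_at + parent_propagation_delay + ds_ttl + safety_margin


def is_safe_to_remove_old_ksk(now: datetime, earliest: datetime) -> bool:
    """True once the waiting period has fully elapsed."""
    return now >= earliest


# Hypothetical values, for illustration only.
withdrawn = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
earliest = earliest_safe_removal(
    old_ds_withdrawn_at=withdrawn,
    ds_ttl=timedelta(hours=24),
    parent_propagation_delay=timedelta(hours=1),
    safety_margin=timedelta(hours=2),
)
print(earliest.isoformat())
```

If a signer is configured with a DS TTL or propagation delay shorter than the real values, a calculation of this kind returns a moment that is too early, which is the type of mismatch described above.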

Venezuela's .ve top-level domain was hit by a more minor incident this summer, when it too briefly acquired 'bogus' status during a KSK rollover [1].

Finally and most recently, there were problems with the ripe.net domain belonging to RIPE NCC, the regional internet registry (RIR) for greater Europe and West Asia (including the Netherlands, of course). At the start of this month, ripe.net's domain names could not be validated (i.e. were viewed as bogus) because of a typing error in a single record, which consequently acquired a TTL exceeding the validity period of the DNSSEC signatures. As a result, the DNSSEC software (Knot) stopped re-signing the zone, and the issue wasn't detected until the old signatures expired. [1]
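
A pre-publication sanity check for this kind of typing error could look roughly like the Python sketch below. It is an illustration only, not how Knot or RIPE NCC handle the situation: the `Record` type, the example zone and the 14-day signature validity are all assumptions.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    name: str
    rtype: str
    ttl: int  # seconds


def records_outliving_signatures(
    records: List[Record], signature_validity: int
) -> List[Record]:
    """Return the records whose TTL is longer than the signature validity (seconds)."""
    return [r for r in records if r.ttl > signature_validity]


SIGNATURE_VALIDITY = 14 * 24 * 3600  # assumed: signatures are valid for 14 days

# Hypothetical zone contents; the second record has two zeros too many in its TTL.
zone = [
    Record("example.test.", "SOA", 3600),
    Record("www.example.test.", "A", 8640000),
    Record("mail.example.test.", "MX", 86400),
]

for record in records_outliving_signatures(zone, SIGNATURE_VALIDITY):
    print(f"{record.name} {record.rtype}: TTL {record.ttl}s exceeds signature validity")
```

Running a check like this before every publication would flag the suspect record while the existing signatures are still valid, rather than after they expire.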

A list of dozens of DNSSEC-related incidents is available from the IANIX website. A common feature linking many of the events is that the root cause was human error in combination with unforeseen circumstances not covered by the (automated) routines.

Valuable incident reports

Stefan Ubbink, DNS & Systems Engineer at SIDN

For our own engineers, the reports on such incidents make uncomfortable but valuable reading. "Of course, there are numerous best practice documents and RFCs [9364, 6781 and 6840] describing operational DNSSEC and key/algorithm rollovers," says SIDN DNS Engineer Stefan Ubbink. "However, incident reports are very helpful as well. Incidents are discussed both online and within CENTR [the European organisation for ccTLD registries]."

"It's important for everyone that we have complete transparency about what went wrong. After all, these incidents have repercussions for the internet as a whole. Everyone makes mistakes, and disclosure is the only way to prevent the same thing happening again."

By way of illustration, SIDN's Infrastructure and Security Architect Marc Groeneweg points to the DNSSEC problems encountered by Sweden's .se top-level domain early last year [1, 2]. "After reading their incident reports, we investigated whether we were at risk of anything similar happening to .nl. Fortunately, we concluded that .nl wasn't at risk, because we validate the entire zone file prior to publication."

Communication and testing

Information about successful transitions can be equally useful. This summer, we rolled over the .nl zone's KSK pair from algorithm 8 to algorithm 13. "We published details of the rollover before and after it was carried out," recalls Ubbink. "Blogs about it are available in both Dutch and English, so that as many people as possible have the opportunity to read them and ask questions. We're also available to do presentations on the process."

"The pre-rollover blog was important as a means of informing people who weren't directly involved that an unusual rollover was on the way. By contrast, I understand that InternetNZ didn't communicate widely about the KSK rollover for the .nz domain before it took place. As a result, the problems came as a complete surprise to many people."

"We also asked our SIDN Labs colleagues to actively monitor the .nl zone during the rollover. So, for example, we got to know about an upturn in TCP traffic to our name servers: a logical consequence of the zone temporarily being double-signed, meaning that DNS responses were too big for UDP transport."

One change at a time

A key principle in the context of critical transitions of this kind is that changes must be made one at a time. That allows for complete focus on the change being made, makes it much easier to trace the root cause of any problems that might arise, and prevents issues attributable to interactions and interdependencies. The importance of the principle is underscored by events in Canada in early August of this year. The country's .ca top-level domain was hit by a DNSSEC issue when the registry CIRA tried to replace its HSM equipment at the same time as performing an algorithm rollover [1].

A case can therefore be made for saying that preparations for our algorithm rollover actually began in 2021. "We already knew back then that we wanted to switch to algorithm 13," Ubbink explains. "So we replaced our HSMs with systems capable of supporting ECDSA cryptography."

"Then, at the end of last year, we set the opt-out bit for the .nl zone to 0, meaning that the zone now includes NSEC(3) records for all unsigned delegations as well. That's the way we handle critical updates: one step at a time."

Testing and acceptance

Preparations for July's algorithm rollover started in earnest in February of this year. "Our first step was to simulate the process in our test environment and document each step," Ubbink continues. "Then we did a dry run of the process in our acceptance environment, from which we learnt that the duration of the rollover combined with the memory requirements of the OpenDNSSEC system represented a challenge. The point is that the .nl zone is currently refreshed every half an hour, whereas, with the new configuration, the total time required for signing and validation was 45 minutes. We validate the entire zone before publishing a new version, but the validation of digital signatures takes considerably longer when the ECDSA algorithm is used. What's more, additional memory was needed during the transition, because a double-signature rollover [1] implies temporarily having twice as many signatures in the zone."

Task	Duration (minutes)
Retrieving zone information from database	6
Signing the zone	8
Validating the zone	5
Checking the zone	3
Publishing the zone on the primary server
Total	~22

Table 1: The existing DNSSEC signing process now takes 20 to 22 minutes.
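
The arithmetic behind the half-hour constraint can be sketched in a few lines of Python. The step durations are taken from Table 1 and the 45-minute figure from the dry run in the acceptance environment; the variable and function names are made up for this illustration, and the publishing step is left as unknown because the table gives no figure for it.

```python
from typing import Dict, Optional

PUBLISH_INTERVAL_MINUTES = 30  # the .nl zone is refreshed every half an hour

# Durations in minutes, from Table 1. None means: no figure given in the table.
STEP_MINUTES: Dict[str, Optional[int]] = {
    "Retrieving zone information from database": 6,
    "Signing the zone": 8,
    "Validating the zone": 5,
    "Checking the zone": 3,
    "Publishing the zone on the primary server": None,
}


def known_total(steps: Dict[str, Optional[int]]) -> int:
    """Sum the durations of all steps for which a figure is available."""
    return sum(minutes for minutes in steps.values() if minutes is not None)


def fits_interval(total_minutes: int, interval_minutes: int = PUBLISH_INTERVAL_MINUTES) -> bool:
    """A run fits if it finishes before the next scheduled refresh starts."""
    return total_minutes <= interval_minutes


current_total = known_total(STEP_MINUTES)
print(f"Current process: {current_total} min, fits: {fits_interval(current_total)}")
print(f"Dry run with new configuration: 45 min, fits: {fits_interval(45)}")
```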

Production playbook

Marc Groeneweg, Infrastructure & Security Architect at SIDN

"After upgrading the OpenDNSSEC system, we went back to our acceptance environment and went through the entire process 5 times. That enabled us to formulate a production playbook for the actual rollover. At 2 stages of the process, we were dependent on IANA. So we knew the sequence and duration of the internal stages, but not the overall duration. The 2 points where IANA's involvement was needed also served as checkpoints for us and as elements of our key ceremony."

"The production playbook identified decision points: points in the process when we have to decide whether or not to continue with the rollover. If the process was aborted at one of those points, it wouldn't be a disaster. In that scenario, the existing zone (serial number) would remain active, while we manually investigated and resolved whatever problems led to us halting the rollover. The only impact on users would be that register amendments wouldn't take effect within half an hour."

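The idea of decision points can be sketched as a small Python runner. This is a hypothetical illustration, not SIDN's playbook or tooling: the `Step` type, the step names and the checks are invented, and in practice the go/no-go decision is taken by engineers rather than by a lambda.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    # If set, this step is a decision point: the check must return True to continue.
    go_check: Optional[Callable[[], bool]] = None


def run_playbook(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first decision point whose check fails.

    Returns the names of the completed steps and the name of the step at which
    the run was halted (None if every step completed).
    """
    completed: List[str] = []
    for step in steps:
        if step.go_check is not None and not step.go_check():
            return completed, step.name  # halt: nothing further is changed
        step.action()
        completed.append(step.name)
    return completed, None


# Hypothetical playbook; each action only records that it ran.
log: List[str] = []
playbook = [
    Step("publish new key set", lambda: log.append("published")),
    Step("submit new DS to parent", lambda: log.append("submitted"), go_check=lambda: True),
    Step("remove old keys", lambda: log.append("removed"), go_check=lambda: False),
]

completed, halted_at = run_playbook(playbook)
print("completed:", completed)
print("halted at:", halted_at)
```

Because the check runs before the step's action, a "no go" leaves the system in the state reached by the previous step, which mirrors the description above of the existing zone remaining active while the problem is investigated.
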
However, emergency stops are rare: years can pass without a significant issue. The .nl zone has been incident-free for a long time, but the .politie zone – also managed by SIDN – did see an emergency stop about six months ago. "The last real DNSSEC issue with the .nl zone was more than a decade ago, back in 2012," recalls Groeneweg, who has been involved with the .nl zone since it was managed by KEMA 25 years ago. "That was when DNSSEC was only just finding its feet and there was very little validation taking place, so the incident had barely any impact."

Automation

Along with the phased, one-change-at-a-time approach, automation has a vital role to play in incident prevention. In practice, problems are often attributable to manual revisions and input errors. "OpenDNSSEC is very good in that respect, relieving us of a lot of work," says Groeneweg. "However, you do need to understand the timing properly and work it all out carefully. Fortunately, we have ample in-house DNSSEC know-how. And, if we need it, we can always turn to NLnet Labs for advice. They are the team behind OpenDNSSEC and other software. We've used OpenDNSSEC for signing the .nl zone from the start, and we actually contributed to version 2.0 of the software ourselves."

"One of the big advantages of OpenDNSSEC is that you've got direct access to the timing parameters and other details," adds Ubbink. "That was really useful in the test phase, because we could easily tweak this or that, or try out a different policy. If we'd wanted to do that with PowerDNS (also a Dutch product, but developed according to very different design principles), we would have needed to wait for the old configuration settings to be flushed from the cache before trying out each change."

Although the .nl domain's quarterly ZSK rollovers are completely automatic, the ordinary five-yearly KSK rollovers can't be fully automated because of the need for manual coordination with ICANN. Automation is technically possible, but would require the root zone to support the use of CDNSKEY/CDS records (as per RFC 8078), which it doesn't yet do. At the level beneath the root, Sweden's .se domain does support automated KSK rollovers for its domain names, but we haven't provided support in the Netherlands and don't currently have any plans to do so.

Close scrutiny

Groeneweg and Ubbink approach major transitions primarily from a technical perspective. "If you let yourself dwell on the .nl zone's importance to Dutch society and to the national economy, you wouldn't dare do anything," reasons Groeneweg. "An outage would certainly lead to questions being asked in parliament, and, in the worst-case scenario, to intervention by the Ministry of Economic Affairs." [1]

"In that context, it's good that, in its capacity as the registry for .nl, SIDN is designated an essential service provider under the EU's new NIS2 Directive. It means that we're under increasingly close scrutiny, but that keeps us on our toes." Nevertheless, Groeneweg points out, the new regulations have few operational implications for SIDN. "Most of the requirements have been covered by our processes and procedures for a long time anyway."