Jenkins weekly release outage

Olblak-2
Dear All,

I would like to do a quick postmortem of yesterday’s Jenkins weekly release outage, which lasted about 6h30. The weekly release Jenkins 2.263 was listed on www.jenkins.io as available but could not be downloaded.

Since April 2020, the weekly release is fully automated and triggered every Tuesday by this job 

It runs two Jenkins jobs from a specific Jenkins instance:
1) Build the Maven artifacts, then publish them on repo.jenkins-ci.org
2) Build the distribution packages using the jenkins.war from the Maven repository, then update our mirror infrastructure

Yesterday, the second stage failed on the Windows packaging step, which resulted in no distribution packages being published at all.

But because the first job had already published a new version on our Maven repository, every Jenkins instance was notified that a new weekly version was available. And because we didn't update our mirror infrastructure, nobody was able to fetch the update. It took us 6h30 to fix it. Fortunately, the second stage is pretty quick, about 15 min versus the 2h needed for the first stage, so we reran the job without the Windows packaging.

Remark: At the moment, the Windows package is still not published due to a Windows issue in the infrastructure.

This outage reminded us that we still have work to do and help is definitely more than welcome :)

Issues

* [INFRA-2538] -> To fix the Windows packaging issue
* We wrote a Python script to detect the latest version from maven-metadata.xml. For some reason, the metadata file we rely on still references the previous weekly release 2.262, while all the other maven-metadata.xml files are correct. :/
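
To illustrate the kind of mismatch described above, here is a minimal sketch (not the actual script) that reads a maven-metadata.xml document and compares the declared <latest> value with the highest entry in <versions>. The sample XML, the function names, and the assumption that versions are purely numeric and dot-separated are all illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like a maven-metadata.xml file. It reproduces the
# symptom from the post: <latest> still says 2.262 although 2.263 is listed.
SAMPLE_METADATA = """<metadata>
  <groupId>org.jenkins-ci.main</groupId>
  <artifactId>jenkins-war</artifactId>
  <versioning>
    <latest>2.262</latest>
    <release>2.262</release>
    <versions>
      <version>2.261</version>
      <version>2.262</version>
      <version>2.263</version>
    </versions>
  </versioning>
</metadata>
"""


def version_key(version):
    """Turn '2.263' into (2, 263) so versions compare numerically.

    Assumes purely numeric, dot-separated versions.
    """
    return tuple(int(part) for part in version.split("."))


def latest_versions(metadata_xml):
    """Return (declared_latest, highest_listed) from a metadata document."""
    root = ET.fromstring(metadata_xml)
    declared = root.findtext("versioning/latest")
    listed = [node.text for node in root.findall("versioning/versions/version")]
    highest = max(listed, key=version_key) if listed else None
    return declared, highest


if __name__ == "__main__":
    declared, highest = latest_versions(SAMPLE_METADATA)
    if declared != highest:
        print(f"stale metadata: <latest> is {declared}, highest listed is {highest}")
```

Comparing the two values instead of trusting <latest> alone would at least make a stale metadata file visible.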

Monitoring

6h30 is way too long to detect such an issue. Fatigue and habit are a thing, and we must detect when something goes wrong as fast as possible.
[INFRA-2027] -> I started working on a Python script that we could use with Datadog, but I haven't had the time to finish it yet.
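
As a rough idea of what such a check could do (this is a sketch, not the INFRA-2027 script), the following takes the version announced on the Maven side and verifies that every distribution package for it is actually downloadable. The URL templates, package names, and function names are hypothetical placeholders, and the reporting side (Datadog or anything else) is deliberately left out.

```python
import urllib.error
import urllib.request

# Hypothetical URL templates; the real mirror layout may differ.
PACKAGE_URLS = {
    "war": "https://mirror.example.org/war/{version}/jenkins.war",
    "windows": "https://mirror.example.org/windows/{version}/jenkins.msi",
}


def http_url_exists(url, timeout=10):
    """Return True if a HEAD request to the URL succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors (404, 5xx), DNS failures, and timeouts.
        return False


def find_missing_packages(version, package_urls, url_exists=http_url_exists):
    """Return the names of packages that cannot be fetched for this version."""
    missing = []
    for name, template in package_urls.items():
        if not url_exists(template.format(version=version)):
            missing.append(name)
    return missing
```

Run on a schedule, a non-empty result from find_missing_packages("2.263", PACKAGE_URLS) would be the signal to alert on. The url_exists parameter is there so the logic can be tested without network access.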

Artifact Promotion
 
While it would not have solved the current problem, we could have published the Maven release to a temporary Maven repository and only promoted the artifacts once every distribution package was available.
That way, people would not have been notified. Considering that we mainly rely on people for monitoring, this would probably have delayed the release even more. We already have that logic in place, as it's needed for the security release anyway; we just have to agree on a staging repository.
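
The promotion gate itself could be sketched as follows, under the assumption that the release job knows which packages it requires and which ones were built. The package names and the promote callback are hypothetical stand-ins for whatever actually moves artifacts from the staging repository to the public one.

```python
def promote_if_complete(version, required_packages, built_packages, promote):
    """Call promote(version) only if every required package was built.

    Returns (promoted, missing), where missing is the sorted list of
    required packages that are absent.
    """
    missing = sorted(set(required_packages) - set(built_packages))
    if missing:
        # Leave the artifacts in staging: nothing is announced to users.
        return False, missing
    promote(version)
    return True, []


if __name__ == "__main__":
    required = ["war", "debian", "redhat", "windows"]  # hypothetical names
    built = ["war", "debian", "redhat"]  # the Windows step failed
    promoted, missing = promote_if_complete("2.263", required, built, print)
    print(promoted, missing)
```

With a gate like this, a failed packaging step leaves the release unannounced instead of announced but not downloadable.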

We’ll be working on those improvements and will share our progress as they become available.

Cheers,

Olblak

--
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/4c4cc1b0-ee5e-4faf-ac7e-677e50dd36f5%40www.fastmail.com.