I have a problem with the slave JVM randomly dying on our larger
build machines. When these machines are under heavy load, the following
error tends to pop up at unpredictable times on the slave connection:
The likelihood of these errors increases with the number of builds running
on the slave. If I allow only 20 executors it occurs very rarely or never,
while with 100 executors it's quite likely for the slave to disconnect
once all executors have a job running on them.
The affected machines are dual-socket systems with two 18-core Xeon CPUs,
making for 36 cores and 72 threads (HT). 384GB RAM (or more) are installed,
of which 200GB are assigned to a ramdisk (tmpfs). This ramdisk is used for
the jenkins workspace.
As the OS we use Debian 8 (Jessie) with the 4.9 kernel from backports. The
Jenkins version is 2.55 and the installed Java version is OpenJDK
The running builds are mostly larger C projects compiled with gcc
and some LaTeX documentation.
Since it occurs only with many parallel builds running, this
suggests that we might be hitting some kind of limit that causes the slave
process to be terminated. However, there is nothing in the logs
(journalctl, dmesg) hinting at that, and as far as I know neither the OOM
killer nor ulimit uses SIGTERM for that purpose.
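
To narrow down the "some kind of limit" theory, one rough way to compare the slave process's resource limits with its current usage is sketched below. This is only an illustration, assuming a Linux /proc filesystem and that the script runs as the same user as the slave JVM (or as root). The helper names (parse_limits, thread_count, open_fd_count) are made up for this example and are not part of Jenkins. Note that "Max processes" is a per-user limit that also counts threads, so the per-process thread count shown here is only part of the picture.

```python
import os
import re
import sys


def to_number(value):
    """Return an int, or None when the limit is 'unlimited'."""
    return None if value == "unlimited" else int(value)


def parse_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # the first line is the column header
        # Columns are padded with runs of spaces; names contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue
        limits[fields[0]] = (to_number(fields[1]), to_number(fields[2]))
    return limits


def thread_count(pid):
    """Number of threads in the process, from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return None


def open_fd_count(pid):
    """Number of file descriptors currently open in the process."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    # Pass the PID of the slave JVM; defaults to this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    with open(f"/proc/{pid}/limits") as fh:
        limits = parse_limits(fh.read())
    print("threads:", thread_count(pid),
          "| Max processes (soft, hard):", limits.get("Max processes"))
    print("open fds:", open_fd_count(pid),
          "| Max open files (soft, hard):", limits.get("Max open files"))
```

Running this against the slave JVM's PID while all executors are busy would show whether the thread or file descriptor counts get anywhere near the configured limits.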