RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Basil Crow
I recently stabilized my plugin's test suite on ci.jenkins.io. The
following is my root cause analysis.

At present there are eight online Ubuntu EC2 agents on ci.jenkins.io.
Three of these are high memory and five of these are not:

• EC2 (aws) - High memory ubuntu 18.04 (i-067cdb5c4dd6bbc66)
• EC2 (aws) - High memory ubuntu 18.04 (i-09868363dd8e0e302)
• EC2 (aws) - High memory ubuntu 18.04 (i-0d3e670dcf9448827)
• EC2 (aws) - Ubuntu 18.04 LTS (i-0147db496a4c3205b)
• EC2 (aws) - Ubuntu 18.04 LTS (i-066509d2e6e564444)
• EC2 (aws) - Ubuntu 18.04 LTS (i-06b6dd7739f0fcad8)
• EC2 (aws) - Ubuntu 18.04 LTS (i-0c6752517c9e4dd86)
• EC2 (aws) - Ubuntu 18.04 LTS (i-0d7ea29c5c4d607c6)

Both the high memory and the regular memory agents have the "linux"
label, so the Linux branches of my plugin's tests may run on either
the high memory or the regular memory agents. I noticed that the
branches of my tests that happen to run on the high memory agents
usually pass, but the branches of my tests that happen to run on the
regular memory agents frequently time out.

I added logging and saw that the agent JVM launched by my tests was
sometimes running out of memory and crashing, which in turn caused my
test to time out waiting for the agent to connect. Why was the agent
JVM running out of memory?

I then logged memory usage by process during each test and discovered
that the regular memory agents have 2 GB of RAM. In the course of a
typical integration test, they run several JVMs:

• Remoting (with no -Xmx or -Xms)
• Maven (with no -Xmx or -Xms)
• Surefire (with -Xms768M -Xmx768M)
• The agent JVM launched by my tests (with no -Xmx or -Xms)

Further logging showed that when my test started (at which point the
only JVMs running were Remoting, Maven, and Surefire), only about
400 MB of RAM remained free on the system. It was therefore no
surprise that my agent JVMs frequently ran out of memory.
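
As a rough cross-check, here is a worst-case heap budget for a 2 GB agent. It is a sketch under stated assumptions: each JVM started without -Xmx is assumed to be allowed a maximum heap of one quarter of physical RAM (HotSpot's usual ergonomic default, exposed as -XX:MaxRAMPercentage=25 on JDK 10+), and only heap is counted, not metaspace, thread stacks, code cache, or the OS. The HeapBudget class is illustrative only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: worst-case heap demand on a 2 GB agent. ASSUMPTION: any JVM
// started without -Xmx may grow to 1/4 of physical RAM. Heap only.
class HeapBudget {
    static final long PHYSICAL_MB = 2048;

    // Assumed ergonomic default maximum heap: one quarter of physical memory.
    static long defaultMaxHeapMb(long physicalMb) {
        return physicalMb / 4;
    }

    // Sum the maximum heap each JVM may claim; a null value means "no -Xmx".
    static long worstCaseHeapMb(Map<String, Long> explicitXmxMb, long physicalMb) {
        long total = 0;
        for (Long xmx : explicitXmxMb.values()) {
            total += (xmx != null) ? xmx : defaultMaxHeapMb(physicalMb);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> jvms = new LinkedHashMap<>();
        jvms.put("Remoting", null);    // no -Xmx
        jvms.put("Maven", null);       // no -Xmx
        jvms.put("Surefire", 768L);    // -Xmx768M
        jvms.put("test agent", null);  // no -Xmx
        System.out.println("Original: " + worstCaseHeapMb(jvms, PHYSICAL_MB)
                + " MB of heap on a " + PHYSICAL_MB + " MB machine");

        // The workaround: Surefire capped at 256 MB, the test agent at 64 MB.
        jvms.put("Surefire", 256L);
        jvms.put("test agent", 64L);
        System.out.println("Workaround: " + worstCaseHeapMb(jvms, PHYSICAL_MB) + " MB");
    }
}
```

Under those assumptions the original configuration permits 2304 MB of heap on a 2048 MB machine before any native memory is counted, while the workaround settings bring the same sum down to 1344 MB.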

I worked around the problem by setting

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <argLine>-Xmx256m -Xms256m</argLine>
  </configuration>
</plugin>

in pom.xml, and by passing "-Xmx64m -Xms64m" to the agent JVMs launched
by my tests. With these settings my tests consistently pass, even on
the regular memory EC2 agents.
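
The agent half of the workaround amounts to putting the heap flags ahead of -jar when a test spawns its agent JVM. A minimal sketch, assuming the agent is started from a jar with plain ProcessBuilder; the AgentCommand class, the jar path, and any agent arguments are hypothetical placeholders, since the actual launch mechanism depends on the test harness:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds (but does not start) the command line for an
// agent JVM whose heap is pinned with -Xms/-Xmx, e.g. 64 MB as above.
class AgentCommand {
    static List<String> build(String agentJar, int heapMb, List<String> agentArgs) {
        // Reuse the same JRE as the current (test) JVM.
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(java);
        cmd.add("-Xms" + heapMb + "m"); // JVM options must come before -jar
        cmd.add("-Xmx" + heapMb + "m");
        cmd.add("-jar");
        cmd.add(agentJar);
        cmd.addAll(agentArgs);          // hypothetical agent-specific arguments
        return cmd;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = new ProcessBuilder(build("/tmp/agent.jar", 64, List.of())).inheritIO();
        System.out.println(String.join(" ", pb.command()));
        // pb.start() would launch the agent; omitted so the sketch runs standalone.
    }
}
```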

I suggest the Jenkins infrastructure team consider adding -Xmx and
-Xms options to the Remoting JVM and/or using EC2 instance types with
more memory.

--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/CAFwNDjq3HKsXiO-%2BBjgKgn1fjxSaJApQGUf2HyRwW2jM28p4Jw%40mail.gmail.com.

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Jesse Glick
Thank you for digging into this problem, which has been plaguing us
(INFRA-2548?). Your analysis sounds right. The next step would be PRs
to infrastructure repositories.

256 MB seems low for a Surefire JVM: it needs to run Jenkins and all
plugins plus whatever your test code is doing. Of course 2 GB is also a
bit tight for `JenkinsRule` tests. I agree that `agent.jar` should be
able to run in quite a bit less than whatever HotSpot ergonomics would
pick by default, and probably `mvn` could as well, leaving more room
for the Surefire JVM and any extra processes such as mock agents, Git,
Docker fixtures, etc.

I wonder if there is any way to have all the JVMs in this VM coöperate
to jointly use, say, 75% of available RAM in whatever proportion.
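
Jesse's idea could be approximated, though not truly coordinated, by giving each JVM a fixed share of a 75% total through HotSpot's -XX:MaxRAMPercentage flag (JDK 10+, also backported to 8u191), so each heap cap scales with the VM's RAM. A sketch only: the RamShares class and the weights are hypothetical, not measured requirements:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch: split a total heap budget (percent of RAM) across JVMs in
// proportion to hypothetical weights, emitting -XX:MaxRAMPercentage flags.
class RamShares {
    static Map<String, Double> split(Map<String, Integer> weights, double totalPercent) {
        int sum = 0;
        for (int w : weights.values()) {
            sum += w;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            shares.put(e.getKey(), totalPercent * e.getValue() / sum);
        }
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("agent.jar (Remoting)", 1); // hypothetical weights
        weights.put("mvn", 1);
        weights.put("Surefire", 4);
        weights.put("test agent", 2);
        // Locale.ROOT keeps a '.' decimal separator, which the JVM flag parser expects.
        split(weights, 75.0).forEach((jvm, pct) ->
                System.out.printf(Locale.ROOT, "%s: -XX:MaxRAMPercentage=%.2f%n", jvm, pct));
    }
}
```

With the hypothetical 1:1:4:2 weights the four JVMs get 9.375%, 9.375%, 37.5%, and 18.75% of RAM. Each JVM still enforces its own cap independently, so this is a static partition rather than dynamic cooperation, and non-heap memory comes on top.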

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Basil Crow
On Fri, Jun 5, 2020 at 1:24 PM Jesse Glick <[hidden email]> wrote:
> The next step would be PRs to infrastructure repositories.

I agree. Unfortunately I have spent too much time on this issue already and
cannot volunteer to become an infrastructure developer at present.

> 256Mb seems low for a Surefire JVM—this needs to run Jenkins and all
> plugins plus whatever your test code is doing.

I agree, which is why I said "I worked around the problem" rather than "I
solved the problem." Changing the JVM settings for Surefire and the agent
launched by my tests was easy because both those JVMs were completely within
my control. I agree that a long-term solution would involve setting -Xmx and
-Xms on all agent and Maven JVMs as well as possibly increasing the EC2
instance size for these nodes.
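
On the Maven side, the `mvn` launcher script takes JVM options from the MAVEN_OPTS environment variable (Maven 3.3.1+ can also read them from a project's .mvn/jvm.config). A sketch of how a wrapper might pin the Maven JVM's heap; the MavenWithHeap class, the 256 MB value, and the verify goal are illustrative assumptions, not a tested recommendation for ci.jenkins.io:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: run Maven with an explicit heap for the Maven JVM itself via
// MAVEN_OPTS. Forked Surefire JVMs are unaffected; they take their options
// from the Surefire argLine instead.
class MavenWithHeap {
    static ProcessBuilder mvn(int heapMb, List<String> goals) {
        List<String> cmd = new ArrayList<>();
        cmd.add("mvn");
        cmd.addAll(goals);
        ProcessBuilder pb = new ProcessBuilder(cmd);
        // The child process inherits this environment, which the mvn script reads.
        pb.environment().put("MAVEN_OPTS", "-Xms" + heapMb + "m -Xmx" + heapMb + "m");
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = mvn(256, List.of("verify")).inheritIO();
        System.out.println("MAVEN_OPTS=" + pb.environment().get("MAVEN_OPTS") + " " + pb.command());
        // pb.start().waitFor() would run the build; requires mvn on PATH.
    }
}
```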

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Matt Sicker
In reply to this post by Jesse Glick
It might be worth looking at OpenJ9. It has some nifty cloud-native
features that help reduce JVM resource usage. For example:

* https://www.eclipse.org/openj9/docs/jitserver/
* https://www.eclipse.org/openj9/docs/shrc/

Disclaimer: I've only seen a talk about this; I've never tried
configuring this in a real cloud environment. Looks nifty, though.
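
To make the two linked features concrete, here is a sketch of the OpenJ9 options involved: a shared classes cache that several JVMs on one agent could reuse, and JITServer client mode, which moves JIT compilation to a separate server process. These options are OpenJ9-specific and HotSpot will not start with them; the OpenJ9Options class, cache name, directory, and server host are hypothetical, and the exact options should be checked against the docs linked above for the OpenJ9 release in use:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: OpenJ9-only JVM options (hypothetical values) for sharing a class
// cache between JVMs on one agent and, optionally, offloading JIT compilation.
class OpenJ9Options {
    static List<String> options(String cacheName, String cacheDir, String jitServerHost) {
        List<String> opts = new ArrayList<>();
        // JVMs that use the same cache name and directory share class data.
        opts.add("-Xshareclasses:name=" + cacheName + ",cacheDir=" + cacheDir);
        if (jitServerHost != null) {
            // JITServer client mode: compilation requests go to a remote server.
            opts.add("-XX:+UseJITServer");
            opts.add("-XX:JITServerAddress=" + jitServerHost);
        }
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(options("ci-agents", "/tmp/j9-scc", null));
        System.out.println(options("ci-agents", "/tmp/j9-scc", "jitserver.example"));
    }
}
```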

--
Matt Sicker
Senior Software Engineer, CloudBees

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

slide
In reply to this post by Basil Crow
We are currently using t2.small instances on EC2 for the non-high-memory agents.

[image not preserved]

Going from t2.small to t2.medium would double the CPU credits per hour, though it also doubles the vCPU count and memory.
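
The credit arithmetic behind that trade-off can be sketched as follows. It assumes AWS's documented T2 model (one CPU credit = one vCPU at 100% for one minute) and the published figures of 12 credits/hour for t2.small (1 vCPU) and 24 credits/hour for t2.medium (2 vCPUs); the T2Baseline class is just for illustration:

```java
import java.util.Locale;

// Sketch: sustained (baseline) CPU of a burstable T2 instance, assuming
// one CPU credit = one vCPU at 100% utilization for one minute.
class T2Baseline {
    // Baseline for the whole instance, as a fraction of one vCPU.
    static double totalBaseline(int creditsPerHour) {
        return creditsPerHour / 60.0;
    }

    // Baseline per vCPU, as a fraction of that vCPU.
    static double perVcpuBaseline(int creditsPerHour, int vcpus) {
        return totalBaseline(creditsPerHour) / vcpus;
    }

    public static void main(String[] args) {
        // Assumed published figures: t2.small 1 vCPU/12 credits, t2.medium 2 vCPUs/24 credits.
        System.out.printf(Locale.ROOT, "t2.small:  %.0f%% per vCPU%n", 100 * perVcpuBaseline(12, 1));
        System.out.printf(Locale.ROOT, "t2.medium: %.0f%% per vCPU%n", 100 * perVcpuBaseline(24, 2));
    }
}
```

Both come out at a 20% baseline per vCPU, so t2.medium doubles the sustained CPU of the whole instance and doubles the RAM to 4 GiB, but in standard (non-unlimited) mode each vCPU is held to the same baseline once credits run out.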

The high-memory instances are using m5.adxlarge:

[image not preserved]

I don't know what the cost difference is between the t2 and m5a instances.

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Vlad Silverman
> I don't know what the cost difference is between the t2 and m5a instances.

I guess it depends on the region.
More details are at https://aws.amazon.com/ec2/pricing/on-demand/

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Matt Sicker
Looks like m5a are AMD and t2 are Intel (and burstable). If they cost
about the same, m5a sounds better.

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

slide
Just for reference...

[three images not preserved]

t2.medium may be the way to go

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Gavin Mogan
Remember that with more resources the tests can often run faster, which reduces how long the instance is needed.

It's never simple math.

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

slide
True, but I am not sure that doubling the memory (e.g., m5a.large over t2.medium) would make a big enough difference to justify almost double the cost. I could be wrong, though; I am definitely not an expert in Java optimization.
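
Gavin's and Slide's points combine into a simple break-even check, even if real builds rarely behave this neatly. A sketch with hypothetical numbers only (relative prices, not actual AWS rates, and made-up build times): cost per build is hourly price times build duration, so an instance that costs r times as much per hour breaks even only when builds finish r times faster. The BuildCost class is illustrative:

```java
import java.util.Locale;

// Sketch: compare cost per build on two instance types. Prices are relative,
// hypothetical units; durations would come from real build measurements.
class BuildCost {
    static double costPerBuild(double pricePerHour, double buildMinutes) {
        return pricePerHour * buildMinutes / 60.0;
    }

    // Speedup the pricier instance needs just to match the cheaper one's cost.
    static double breakEvenSpeedup(double cheapPrice, double pricierPrice) {
        return pricierPrice / cheapPrice;
    }

    public static void main(String[] args) {
        double small = 1.0, large = 1.9; // hypothetical: "almost double the cost"
        double minutesOnSmall = 30.0;    // hypothetical build time
        double minutesOnLarge = 20.0;    // hypothetical: 1.5x faster
        System.out.printf(Locale.ROOT, "break-even speedup: %.2fx%n", breakEvenSpeedup(small, large));
        System.out.printf(Locale.ROOT, "cost per build: small %.3f, large %.3f%n",
                costPerBuild(small, minutesOnSmall), costPerBuild(large, minutesOnLarge));
    }
}
```

With these made-up figures, a 1.5x speedup on an instance costing 1.9x as much still raises the cost per build (about 0.63 vs 0.5 price units). The arithmetic leaves out reruns of builds that time out on undersized agents.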

Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Tim Jacomb
Hi all,

I've done the following:

* linux docker - was t3.small, now t3a.large (2 cores, 8 GB)
* arm64 - was a1.medium, now t3a.large (2 cores, 8 GB)


Let's monitor and see how it goes, pricing- and performance-wise.

The high-memory agents might also benefit from a change; the AWS ones are much lower spec than the Azure ones. Thoughts?

Thanks
Tim

On Sat, 6 Jun 2020 at 00:28, Slide <[hidden email]> wrote:
True, but I am not super sure that double the memory (e.g., m5a.large over t2.medium) would make a big enough difference for almost double the cost. I could be wrong though, I am definitely not an expert in java optimization, etc.

On Fri, Jun 5, 2020 at 3:43 PM 'Gavin Mogan' via Jenkins Developers <[hidden email]> wrote:
Remember with more resources the tests can often run faster which reduces how much time the instance is needed for.

It's never straight simple math

On Fri., Jun. 5, 2020, 3:40 p.m. Slide, <[hidden email]> wrote:
Just for reference...

image.png
image.png
image.png

t2.medium may be the way to go

On Fri, Jun 5, 2020 at 2:32 PM Matt Sicker <[hidden email]> wrote:
Looks like m5a are AMD and t2 are Intel (and burstable). If they cost
similar, m5a sounds better.

On Fri, Jun 5, 2020 at 4:19 PM Vlad Silverman <[hidden email]> wrote:
>
> I don't know what the cost difference is between the t2 and m5a instances.
>
>
> I guess it depends on the region.
> More details are at https://aws.amazon.com/ec2/pricing/on-demand/
>
> On Jun 5, 2020, at 2:02 PM, Slide <[hidden email]> wrote:
>
> We are currently using t2.small instances on EC2 for the non-high memory instances
>
> <image.png>
>
> Going from t2.small to t2.medium would double the CPU Credits / hour, though it also doubles vCPU count and Mem.
>
> The high mem instances are using m5.adxlarge:
>
> <image.png>
>
> I don't know what the cost difference is between the t2 and m5a instances.
>
> On Fri, Jun 5, 2020 at 1:40 PM Basil Crow <[hidden email]> wrote:
>>
>> On Fri, Jun 5, 2020 at 1:24 PM Jesse Glick <[hidden email]> wrote:
>> > The next step would be PRs to infrastructure repositories.
>>
>> I agree. Unfortunately I have spent too much time on this issue already and
>> cannot volunteer to become an infrastructure developer at present.
>>
>> > 256 MB seems low for a Surefire JVM; this needs to run Jenkins and all
>> > plugins plus whatever your test code is doing.
>>
>> I agree, which is why I said "I worked around the problem" rather than "I
>> solved the problem." Changing the JVM settings for Surefire and the agent
>> launched by my tests was easy because both those JVMs were completely within
>> my control. I agree that a long-term solution would involve setting -Xmx and
>> -Xms on all agent and Maven JVMs as well as possibly increasing the EC2
>> instance size for these nodes.
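A rough heap budget shows why the four JVMs listed in the original report do not fit on a 2 GB agent. This Java sketch assumes HotSpot's default maximum heap of one quarter of physical RAM for any JVM started without -Xmx, and it ignores non-heap memory (metaspace, thread stacks, code cache), so the real footprint is larger than the estimate. The 256 MB case keeps the other three JVMs at their defaults, which is an assumption for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough maximum-heap budget for the JVMs on one CI agent.
// Assumption: a JVM without -Xmx gets one quarter of physical RAM as max heap.
// Non-heap memory and the OS itself are not counted.
class HeapBudget {

    static long defaultMaxHeapMb(long physicalRamMb) {
        return physicalRamMb / 4;
    }

    // A null value means "no -Xmx given", so the default applies.
    static long totalMaxHeapMb(Map<String, Long> xmxMbByJvm, long physicalRamMb) {
        long total = 0;
        for (Long xmx : xmxMbByJvm.values()) {
            total += (xmx != null) ? xmx : defaultMaxHeapMb(physicalRamMb);
        }
        return total;
    }

    // The four JVMs from the report; only Surefire has an explicit -Xmx.
    static Map<String, Long> agentJvms(long surefireXmxMb) {
        Map<String, Long> jvms = new LinkedHashMap<>();
        jvms.put("Remoting", null);
        jvms.put("Maven", null);
        jvms.put("Surefire", surefireXmxMb);
        jvms.put("agent launched by tests", null);
        return jvms;
    }

    public static void main(String[] args) {
        long ramMb = 2048;
        for (long surefire : new long[] {768, 256}) {
            long total = totalMaxHeapMb(agentJvms(surefire), ramMb);
            System.out.printf("Surefire -Xmx%dM: heaps may reach %d MB of %d MB RAM%n",
                    surefire, total, ramMb);
        }
    }
}
```

With Surefire at -Xmx768M the heaps alone may grow to 2304 MB on a 2048 MB machine; at 256 MB the total drops to 1792 MB, which still leaves little headroom once non-heap memory and the OS are counted. Setting -Xmx explicitly on every JVM is what makes such a budget enforceable.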
>>
>> --
>> You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
>> To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/CAFwNDjrjW2aA74pqoFMhca%2BD0YqL%3DWvNe46g7G1aM3Bmx5LrWQ%40mail.gmail.com.
>
>
>
> --
> Website: http://earl-of-code.com
>



--
Matt Sicker
Senior Software Engineer, CloudBees


Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Jesse Glick-4
On Tue, Jun 9, 2020 at 3:59 AM Tim Jacomb <[hidden email]> wrote:
> High mem could possibly do with a change, the AWS ones are much lower spec than the Azure ones, thoughts?

Not sure, but I just got an unexplained

 EC2 (aws) - High memory ubuntu 18.04  (i-0e7f3896526c7922e) was marked offline: Connection was broken: java.io.EOFException


Re: RCA of memory conditions on Ubuntu EC2 agents on ci.jenkins.io causing test instability

Tim Jacomb
Azure high mem:

Standard_D16_v3
16 vCPU
64 GiB memory

AWS high mem:
m5a.xlarge
4 vCPU
16 GiB memory

I've changed it to m5a.4xlarge (16 vCPU, 64 GiB RAM).
It's 4 times the cost, so we'll need to keep an eye on it.

Thanks
Tim


On Fri, 12 Jun 2020 at 15:20, Jesse Glick <[hidden email]> wrote:
On Tue, Jun 9, 2020 at 3:59 AM Tim Jacomb <[hidden email]> wrote:
> High mem could possibly do with a change, the AWS ones are much lower spec than the Azure ones, thoughts?

Not sure, but I just got an unexplained

 EC2 (aws) - High memory ubuntu 18.04  (i-0e7f3896526c7922e) was marked offline: Connection was broken: java.io.EOFException
