[Issue 4093] New - Ec2 plugin can take down hudson due to lack of error checking


jehenrik
https://hudson.dev.java.net/issues/show_bug.cgi?id=4093
                 Issue #|4093
                 Summary|Ec2 plugin can take down hudson due to lack of error checking
               Component|hudson
                 Version|current
                Platform|All
              OS/Version|All
                     URL|
                  Status|NEW
       Status whiteboard|
                Keywords|
              Resolution|
              Issue type|DEFECT
                Priority|P3
            Subcomponent|core
             Assigned to|issues@hudson
             Reported by|jehenrik






------- Additional comments from [hidden email] Fri Jul 24 00:20:42 +0000 2009 -------
While troubleshooting the ec2 plugin, I hit a fairly common failure mode in
which the plugin deletes a particular "Node" (subclass EC2Slave) but not the
corresponding "Computer" (subclass EC2Computer).  The damage goes fairly deep,
because the orphaned Computer triggers a null pointer exception in this core
code, which is called from many places in the system:

hudson/main/core/src/main/java/hudson/model/Hudson.java

    public Computer[] getComputers() {
        Computer[] r = computers.values().toArray(new Computer[computers.size()]);
        Arrays.sort(r,new Comparator<Computer>() {
            final Collator collator = Collator.getInstance();
            public int compare(Computer lhs, Computer rhs) {
                if(lhs.getNode()==Hudson.this)  return -1;
                if(rhs.getNode()==Hudson.this)  return 1;
                return collator.compare(lhs.getDisplayName(), rhs.getDisplayName());
            }
        });
        return r;
    }

My suggestion is to null-check lhs.getNode() and rhs.getNode() in the
comparator, and fall back to sorting any computers without a node to the end of
the list.  A Computer without a Node is not a good situation to be in, and
whatever upstream error caused it should definitely be fixed.  But as things
stand, ec2 can't even recover on its own without serious hackwork, because
Hudson.computers is used in very many places, including:

    /*package*/ Computer getComputer(Node n) {
        return computers.get(n);
    }

    public Computer getComputer(String name) {
        if(name.equals("(master)"))
            name = "";

        for (Computer c : computers.values()) {
            if(c.getNode().getNodeName().equals(name))
                return c;
        }
        return null;
    }
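
To make the suggestion concrete, here is a rough, self-contained sketch of the
null-tolerant ordering and lookup.  It is not a patch against Hudson: the Node
and Computer classes below are hypothetical stand-ins with just enough surface
to compile, the class and method names (NullSafeComputers, comparator,
findByName, sortedNames) are made up for illustration, and the master node is
passed in explicitly instead of being Hudson.this.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NullSafeComputers {
    /** Hypothetical stand-in for hudson.model.Node. */
    public static class Node {
        private final String name;
        public Node(String name) { this.name = name; }
        public String getNodeName() { return name; }
    }

    /** Hypothetical stand-in for hudson.model.Computer; a null node models an orphan. */
    public static class Computer {
        private final Node node;
        private final String displayName;
        public Computer(Node node, String displayName) {
            this.node = node;
            this.displayName = displayName;
        }
        public Node getNode() { return node; }
        public String getDisplayName() { return displayName; }
    }

    /** Master first, then by display name, with node-less computers last. */
    public static Comparator<Computer> comparator(final Node master) {
        final Collator collator = Collator.getInstance();
        return new Comparator<Computer>() {
            public int compare(Computer lhs, Computer rhs) {
                Node l = lhs.getNode();
                Node r = rhs.getNode();
                // Orphans never reach the name comparison; they sort to the end.
                if (l == null || r == null) {
                    if (l == null && r == null) return 0;
                    return l == null ? 1 : -1;
                }
                if (l == master && r == master) return 0;
                if (l == master) return -1;
                if (r == master) return 1;
                return collator.compare(lhs.getDisplayName(), rhs.getDisplayName());
            }
        };
    }

    /** Name lookup that skips computers whose node is gone instead of throwing. */
    public static Computer findByName(List<Computer> computers, String name) {
        if (name.equals("(master)"))
            name = "";
        for (Computer c : computers) {
            Node n = c.getNode();
            if (n != null && n.getNodeName().equals(name))
                return c;
        }
        return null;
    }

    /** Helper for trying the comparator: returns display names in sorted order. */
    public static List<String> sortedNames(Node master, Computer... computers) {
        Computer[] r = computers.clone();
        Arrays.sort(r, comparator(master));
        List<String> names = new ArrayList<String>();
        for (Computer c : r)
            names.add(c.getDisplayName());
        return names;
    }
}
```

In the stand-in, getDisplayName() is a stored string, so it cannot fail; in the
real code any accessor that goes through getNode() would presumably need the
same guard.  Treating all orphans as equal to each other keeps the comparator
consistent, so Arrays.sort should not complain about a broken contract.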
