Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Job hangs when daemon does not report back from remote machine
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-02-09 08:54:32


The default launcher is ssh - the "rsh" things you see are the name of
the particular component, not the name of the actual command being
used. That launcher looks for "ssh" first, and then falls back to
"rsh" if ssh isn't found.
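
As a rough illustration of that fallback (this is not Open MPI's actual code, only a sketch of the idea), the following Python walks a colon-separated agent list such as the default "ssh : rsh" and returns the first entry found on the PATH. The function name pick_launch_agent and its "which" parameter are hypothetical, introduced only for this example.

```python
import shutil


def pick_launch_agent(agent_spec="ssh : rsh", which=shutil.which):
    """Return the first agent in a colon-separated list that exists on PATH.

    Illustrative only: mimics the "try ssh first, then rsh" behaviour
    described above. It is NOT taken from the Open MPI sources.
    """
    for candidate in agent_spec.split(":"):
        name = candidate.strip()
        if not name:
            continue
        # Only the executable name is looked up; any arguments are kept.
        executable = name.split()[0]
        if which(executable):
            return name
    return None  # no listed agent is installed


if __name__ == "__main__":
    print(pick_launch_agent())
```

On a machine with ssh installed this would be expected to select "ssh", and "rsh" would only be chosen when ssh is missing.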

OMPI currently doesn't support restricted port ranges. We are working
on a new release that does, but it won't be out for a few weeks. Until
that time, my only suggestion would be to look at removing the
firewall on every node in favor of a firewall on the outside of the
cluster. I'm not sure any other solution is available just yet.
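
If you want to confirm that a node's firewall is what blocks the path back to the mpirun host, a small TCP probe along these lines may help. It is only a sketch under assumptions: the ports the daemons use are picked at run time and cannot be pinned in the current release, so this tests a single port that you open yourself, not the port mpirun will actually use. The function name can_connect is hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds.

    A refused connection, a timeout, or a name-lookup failure all
    count as "not reachable" and return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    import sys

    # Usage: python probe.py <host> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    print("reachable" if can_connect(host, port) else "blocked or closed")
```

For example, you could start any listener on an arbitrary port on the machine where mpirun runs, then run the probe from the remote node against that host and port. A failure with the firewall up and a success with it down would point at the firewall rather than at ssh.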

Ralph

On Feb 8, 2009, at 2:08 PM, Kersey Black wrote:

> Many thanks. The firewall is the issue.
>
> On Feb 9, 2009, at 5:56 AM, Ralph Castain wrote:
>> It sounds to me like TCP communication isn't getting through for
>> some reason. Try the following:
>>
>> mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
> black_at_ccn3:~/Documents/mp> mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
> [ccn3:26932] mca:base:select:( plm) Querying component [rsh]
> [ccn3:26932] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [ccn3:26932] mca:base:select:( plm) Querying component [slurm]
> [ccn3:26932] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [ccn3:26932] mca:base:select:( plm) Selected component [rsh]
> -----hangs here
>
> But, when I turn off the firewall for a moment on both machines,
> local and remote, everything works:
> black_at_ccn3:~/Documents/mp> mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
> [ccn3:26442] mca:base:select:( plm) Querying component [rsh]
> [ccn3:26442] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [ccn3:26442] mca:base:select:( plm) Querying component [slurm]
> [ccn3:26442] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [ccn3:26442] mca:base:select:( plm) Selected component [rsh]
> ccn3
> ccn4
>
> 2 Questions:
> 1) Is it really trying to use 'rsh', or is that just part of the
> language in the debugging reporting? I assume it is actually using
> ssh under the hood, but it is worth asking. I am using the default
> configuration on this.
> black_at_ccn3:~/Documents/mp> ompi_info --param all all | grep pls
> MCA plm: parameter "plm_rsh_agent" (current value: "ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
> 2) Since it is a firewall issue, I read what I could find, and it
> seems there is no means of restricting port ranges. Right now,
> each node in this small cluster is running its own firewall rather
> than all being hidden behind some other machine or switch. Any
> pointers for handling this most easily?
>
> Cheers, Kersey
>
>> You should see output from the receipt of a daemon callback for
>> each daemon, then the sending of the launch command. My guess is
>> that you won't see all the daemons call back, which is why you hang.
>>
>> This should tell you which node isn't getting a message back to
>> wherever mpirun is executing. You might then check to ensure no
>> firewalls are in the way to that node, there is a TCP path back
>> from it, etc.
>>
>> I can help with additional diagnostics once we get that far.
>> Ralph
>>
>> On Feb 7, 2009, at 2:40 PM, Kersey Black wrote: