Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI Hangs, No Error
From: Reuti (reuti_at_[hidden])
Date: 2010-07-06 18:36:00


On 06.07.2010 at 23:31, Ralph Castain wrote:

> Problem isn't with ssh - the problem is that the daemons need to
> open a TCP connection back to the machine where mpirun is running.
> If the firewall blocks that connection, then we can't run.
>
> If you can get a range of ports opened, then you can specify the
> ports OMPI should use for this purpose. If the sysadmin won't allow
> even that, then you are pretty well hosed.

Yes, MPI traffic often stays inside a cluster that sits on a private
subnet anyway, so there is no security impact at all. I have no
firewalls on my cluster nodes (only on the headnode), as they are not
connected to the outside world.
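
For reference, the port range Ralph mentions can be pinned with MCA
parameters, so the sysadmin only has to open a known range. A rough
sketch with made-up port numbers (the parameter names are what I would
expect for 1.4.x; check with "ompi_info --param oob tcp" and
"ompi_info --param btl tcp" whether your build exposes them):

   mpirun --mca oob_tcp_port_min_v4 10000 --mca oob_tcp_port_range_v4 100 \
          --mca btl_tcp_port_min_v4 10100 --mca btl_tcp_port_range_v4 100 \
          -hostfile hostfile -np 16 hello_c

The RHEL firewall on every node would then only need to accept inbound
TCP in that range, e.g. something along the lines of

   iptables -I INPUT -p tcp --dport 10000:10199 -j ACCEPT

(made persistent in /etc/sysconfig/iptables), which is a much smaller
request to the sysadmin than disabling the firewall.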

But just out of curiosity: at some point Open MPI chooses the ports. At
that point it might be possible to start two SSH tunnels per slave
node, one for each direction, so that the daemons contact "localhost"
on a specific port which is then tunneled to each slave. In principle
it should work, I think, but it's just not implemented for now.
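
As a rough sketch of what I mean, with purely made-up port numbers
(nothing on the Open MPI side sets this up today):

   ssh -f -N -L 11001:localhost:11001 -R 11000:localhost:11000 machine2

mpirun would then reach the daemon on machine2 via localhost:11001,
and the daemon would reach mpirun back via localhost:11000, so both
directions ride over a single SSH connection.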

Maybe it could be an addition to Open MPI for security-conscious
setups. I also wonder about the speed impact if SSH compression is
switched on by default in such a setup and you transfer large amounts
of data via Open MPI.
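
(For the compression question one could at least measure it first,
e.g. force it per host in ~/.ssh/config:

   Host machine*
       Compression yes

or simply pass "ssh -C" for the tunnels, before making it the default.)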

-- Reuti

> On Jul 6, 2010, at 2:23 PM, Robert Walters wrote:
>
>> Yes, there is a system firewall. I don't think the sysadmin will
>> allow it to be disabled. Each Linux machine has the built-in RHEL
>> firewall. SSH is allowed through the firewall, though.
>>
>> --- On Tue, 7/6/10, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> From: Ralph Castain <rhc_at_[hidden]>
>> Subject: Re: [OMPI users] OpenMPI Hangs, No Error
>> To: "Open MPI Users" <users_at_[hidden]>
>> Date: Tuesday, July 6, 2010, 4:19 PM
>>
>> It looks like the remote daemon is starting - is there a firewall
>> in the way?
>>
>> On Jul 6, 2010, at 2:04 PM, Robert Walters wrote:
>>
>>> Hello all,
>>>
>>> I am using OpenMPI 1.4.2 on RHEL. I have a cluster of AMD
>>> Opterons, and right now I am just working on getting OpenMPI
>>> itself up and running. I have a successful configure and make all
>>> install. The LD_LIBRARY_PATH and PATH variables were correctly edited.
>>> mpirun -np 8 hello_c works successfully on all machines. I have
>>> set up my two test machines with DSA key pairs that successfully
>>> work with each other.
>>>
>>> The problem comes when I use my hostfile to attempt to
>>> communicate across machines. The hostfile is set up correctly with
>>> <host_name> <slots> <max-slots>. When running with all verbose
>>> options enabled, "mpirun --mca plm_base_verbose 99 --debug-daemons
>>> --mca btl_base_verbose 30 --mca oob_base_verbose 99 --mca
>>> pml_base_verbose 99 -hostfile hostfile -np 16 hello_c", I receive
>>> the following text output.
>>>
>>> [machine1:03578] mca: base: components_open: Looking for plm
>>> components
>>> [machine1:03578] mca: base: components_open: opening plm components
>>> [machine1:03578] mca: base: components_open: found loaded
>>> component rsh
>>> [machine1:03578] mca: base: components_open: component rsh has no
>>> register function
>>> [machine1:03578] mca: base: components_open: component rsh open
>>> function successful
>>> [machine1:03578] mca: base: components_open: found loaded
>>> component slurm
>>> [machine1:03578] mca: base: components_open: component slurm has
>>> no register function
>>> [machine1:03578] mca: base: components_open: component slurm open
>>> function successful
>>> [machine1:03578] mca:base:select: Auto-selecting plm components
>>> [machine1:03578] mca:base:select:( plm) Querying component [rsh]
>>> [machine1:03578] mca:base:select:( plm) Query of component [rsh]
>>> set priority to 10
>>> [machine1:03578] mca:base:select:( plm) Querying component [slurm]
>>> [machine1:03578] mca:base:select:( plm) Skipping component
>>> [slurm]. Query failed to return a module
>>> [machine1:03578] mca:base:select:( plm) Selected component [rsh]
>>> [machine1:03578] mca: base: close: component slurm closed
>>> [machine1:03578] mca: base: close: unloading component slurm
>>> [machine1:03578] mca: base: components_open: Looking for oob
>>> components
>>> [machine1:03578] mca: base: components_open: opening oob components
>>> [machine1:03578] mca: base: components_open: found loaded
>>> component tcp
>>> [machine1:03578] mca: base: components_open: component tcp has no
>>> register function
>>> [machine1:03578] mca: base: components_open: component tcp open
>>> function successful
>>> Daemon was launched on machine2- beginning to initialize
>>> [machine2:01962] mca: base: components_open: Looking for oob
>>> components
>>> [machine2:01962] mca: base: components_open: opening oob components
>>> [machine2:01962] mca: base: components_open: found loaded
>>> component tcp
>>> [machine2:01962] mca: base: components_open: component tcp has no
>>> register function
>>> [machine2:01962] mca: base: components_open: component tcp open
>>> function successful
>>> Daemon [[1418,0],1] checking in as pid 1962 on host machine2
>>> Daemon [[1418,0],1] not using static ports
>>>
>>> At this point the system hangs indefinitely. While running top on
>>> the machine2 terminal, I see several things come up briefly. These
>>> items are: sshd (root), tcsh (myuser), orted (myuser), and
>>> mcstransd (root). I was wondering if sshd needs to be initiated by
>>> myuser? It is currently turned off in sshd_config through UsePAM
>>> yes. This was set up by the sysadmin, but it can be worked around if
>>> this is necessary.
>>>
>>> So in summary, mpirun works on each machine individually, but
>>> hangs when initiated through a hostfile or with the -host flag. ./
>>> configure with defaults and --prefix. LD_LIBRARY_PATH and PATH set
>>> up correctly. Any help is appreciated. Thanks!
>>>
>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users