
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] mpiexec seems to be resolving names on server instead of each node
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-07-20 09:43:02


Micha --

(re-digging up this really, really old issue because Manuel just pointed me at the Debian bug for the same issue: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=524553)

Can you confirm that this is still an issue on the latest Open MPI?

If so, it should probably piggyback onto these Open MPI tickets:

    https://svn.open-mpi.org/trac/ompi/ticket/2045
    https://svn.open-mpi.org/trac/ompi/ticket/2383
    https://svn.open-mpi.org/trac/ompi/ticket/1983

On Apr 17, 2009, at 8:45 PM, Micha Feigin wrote:

> I am having problems running Open MPI 1.3 on my cluster and I was wondering if
> anyone else is seeing this problem and/or can give hints on how to solve it.
>
> As far as I understand the error, mpiexec resolves host names on the master node
> it is run on instead of on each host separately. This works in an environment where
> each hostname resolves to the same address on every host (a cluster connected via a
> switch) but fails where it resolves to different addresses (ring/star setups, for
> example, where each computer is connected directly to all/some of the others).
>
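(For illustration only: the per-node /etc/hosts layout below is an assumption, built from the two addresses for fry that are quoted later in this message. It shows how the same name can legitimately resolve differently on each node, which is exactly the situation that breaks when resolution is done once on the master.)

    # /etc/hosts on hubert -- hypothetical layout; fry as seen from hubert
    192.168.1.2   fry

    # /etc/hosts on leela -- hypothetical layout; fry as seen from leela
    192.168.4.1   fry

With such a layout, an address resolved on hubert and handed to leela points at a network that leela cannot reach.
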
> I'm not 100% sure that this is the problem, as I'm seeing success in a single
> case where this should probably fail, but it is my best bet from the error message.
>
> Version 1.2.8 worked fine with the same simple program (a simple hello world that
> just communicates the computer name of each process).
>
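(The test program itself is not included in the message. A minimal sketch of the kind of hello world described, in which every rank sends its processor name to rank 0 and rank 0 prints the greetings seen in the output below, could look like the following; this is an illustration under those assumptions, not the poster's actual test_mpi source.)

    /* Minimal MPI "hello world" sketch: each rank reports its host name.
     * Illustrative only; not the original test_mpi source. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, name_len, i;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &name_len);

        if (rank == 0) {
            printf("Hello MPI from the server process of %d on %s!\n", size, name);
            for (i = 1; i < size; ++i) {
                char peer[MPI_MAX_PROCESSOR_NAME];
                MPI_Recv(peer, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, i, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("Hello MPI from process %d of %d on %s!\n", i, size, peer);
            }
        } else {
            /* The MPI_Send below is where the reported failure shows up: the TCP
             * BTL on leela tries to reach fry at the address fry resolves to on
             * the node where mpiexec was started. */
            MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }
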
> An example output:
>
> mpiexec is run on the master node hubert and is set to run the processes on two nodes,
> fry and leela. As I understand the error messages, leela tries to connect to
> fry at address 192.168.1.2, which is its address on hubert but not on leela (where it
> is 192.168.4.1).
>
> This is a four-node cluster, all interconnected:
>
> 192.168.1.1 192.168.1.2
> hubert ------------------------ fry
> | \ / | 192.168.4.1
> | \ / |
> | \ / |
> | \ / |
> | / \ |
> | / \ |
> | / \ |
> | / \ | 192.168.4.2
> hermes ----------------------- leela
>
> =================================================================
> mpiexec -np 8 -H fry,leela test_mpi
> Hello MPI from the server process of 8 on fry!
> [[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
>
> [[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
>
> [[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
>
> [leela:4436] *** An error occurred in MPI_Send
> [leela:4436] *** on communicator MPI_COMM_WORLD
> [leela:4436] *** MPI_ERR_INTERN: internal error
> [leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect] from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: Network is unreachable
>
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 1 with PID 4433 on
> node leela exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> [hubert:11312] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> =================================================================
>
> This seems to be a directional issue: running the program with -H fry,leela fails
> where -H leela,fry works. The same behaviour holds for all scenarios except those that
> include the master node (hubert), where it resolves the external IP (from an external DNS)
> instead of the internal IP (from the hosts file). Thus one direction fails (there is no
> external connection at the moment for all but the master) and the other causes a lockup.
>
> I hope that the explanation is not too convoluted.
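(A common workaround for this class of problem, offered here as an assumption rather than something confirmed in this thread, is to tell Open MPI's TCP components explicitly which interfaces to use on every node via the btl_tcp_if_include and oob_tcp_if_include MCA parameters. On the 1.3 series these take interface names; newer releases also accept CIDR subnets. The interface name below is a placeholder for whichever interface carries the 192.168.x.x links on each node.)

    mpiexec --mca btl_tcp_if_include eth1 \
            --mca oob_tcp_if_include eth1 \
            -np 8 -H fry,leela test_mpi

Whether this resolves the asymmetric name-resolution issue described above depends on how the interfaces are actually laid out on each node.
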
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/