Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problems with hostfile when doing MPMD
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-04-13 19:14:45


I believe this -should- work, but can't verify it myself. The most important
thing is to be sure you built with --enable-heterogeneous or else it will
definitely fail.
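
A quick way to double-check an existing installation (the exact wording of
the ompi_info output may vary between versions) is something like:

  ompi_info | grep -i hetero

and if that reports "no", reconfigure along these lines:

  ./configure --enable-heterogeneous [your other configure options]
  make all install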

Ralph

On 4/10/08 7:17 AM, "Rolf Vandevaart" <Rolf.Vandevaart_at_[hidden]> wrote:

>
> On a CentOS Linux box, I see the following:
>
>> grep 113 /usr/include/asm-i386/errno.h
> #define EHOSTUNREACH 113 /* No route to host */
>
> I have also seen folks do this to figure out the errno.
>
>> perl -e 'die$!=113'
> No route to host at -e line 1.
>
> I am not sure why this is happening, but you could also check the Open
> MPI User's Mailing List Archives where there are other examples of
> people running into this error. A search of "113" had a few hits.
>
> http://www.open-mpi.org/community/lists/users
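>
> Since 113 is EHOSTUNREACH ("No route to host"), a few quick things worth
> checking between the two nodes -- just a sketch, and "eth0" below is a
> placeholder for whichever interface actually connects them:
>
> ping aim-fanta4                            # from aim-plankton, and vice versa
> mpirun --mca btl_tcp_if_include eth0 ...   # point the TCP BTL at the right NIC
>
> A firewall on either node rejecting Open MPI's TCP connections can also
> produce this errno.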
>
> Also, I assume you would see this problem with or without the
> MPI_Barrier if you add this parameter to your mpirun line:
>
> --mca mpi_preconnect_all 1
>
> The MPI_Barrier is causing the bad behavior because, by default,
> connections are set up lazily. Therefore, only when the MPI_Barrier
> call is made and we actually start communicating and establishing
> connections do we start seeing the communication problems.
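>
> For example, with the hosts from your command line, a preconnect run would
> look roughly like this (untested sketch on my side):
>
> mpirun --mca mpi_preconnect_all 1 -np 3 --host aim-plankton ./MPITest : -np 3 --host aim-fanta4 ./MPITest64
>
> Since all connections are then opened during MPI_Init, the errno=113
> failure should show up right away instead of at the first MPI_Barrier.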
>
> Rolf
>
> jody wrote:
>> Rolf,
>> I was able to run hostname on the two nodes that way,
>> and a simplified version of my test program (without a barrier)
>> also works. Only MPI_Barrier shows bad behaviour.
>>
>> Do you know what this message means?
>> [aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> Does it give an idea what could be the problem?
>>
>> Jody
>>
>> On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
>> <Rolf.Vandevaart_at_[hidden]> wrote:
>>> This worked for me, although I am not sure how extensive our 32/64
>>> interoperability support is. I tested on Solaris using the TCP
>>> interconnect and a 1.2.5 version of Open MPI. Also, we configure with
>>> the --enable-heterogeneous flag, which may make a difference here.
>>> Note that this did not work for me over the sm btl.
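>>>
>>> (As an aside: if you ever end up with 32-bit and 64-bit ranks sharing a
>>> node, you can keep them off the sm path by dropping sm from the btl list,
>>> e.g. --mca btl tcp,self -- just a sketch, I have not retested that
>>> combination.)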
>>>
>>> By the way, can you run a simple /bin/hostname across the two nodes?
>>>
>>>
>>> burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32
>>> burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64
>>> burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
>>> [burl-ct-v20z-4]I am #0/6 before the barrier
>>> [burl-ct-v20z-5]I am #3/6 before the barrier
>>> [burl-ct-v20z-5]I am #4/6 before the barrier
>>> [burl-ct-v20z-4]I am #1/6 before the barrier
>>> [burl-ct-v20z-4]I am #2/6 before the barrier
>>> [burl-ct-v20z-5]I am #5/6 before the barrier
>>> [burl-ct-v20z-5]I am #3/6 after the barrier
>>> [burl-ct-v20z-4]I am #1/6 after the barrier
>>> [burl-ct-v20z-5]I am #5/6 after the barrier
>>> [burl-ct-v20z-5]I am #4/6 after the barrier
>>> [burl-ct-v20z-4]I am #2/6 after the barrier
>>> [burl-ct-v20z-4]I am #0/6 after the barrier
>>> burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
>>> mpirun (Open MPI) 1.2.5r16572
>>>
>>> Report bugs to http://www.open-mpi.org/community/help/
>>> burl-ct-v20z-4 65 =>
>>>
>>>
>>>
>>>
>>> jody wrote:
>>>> I narrowed it down:
>>>> The majority of processes get stuck in MPI_Barrier.
>>>> My test application looks like this:
>>>>
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include "mpi.h"
>>>>
>>>> int main(int iArgC, char *apArgV[]) {
>>>>     int iResult = 0;
>>>>     int iRank1;
>>>>     int iNum1;
>>>>
>>>>     char sName[256];
>>>>     gethostname(sName, 255);
>>>>
>>>>     MPI_Init(&iArgC, &apArgV);
>>>>
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
>>>>
>>>>     printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>     printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
>>>>
>>>>     MPI_Finalize();
>>>>
>>>>     return iResult;
>>>> }
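>>>>
>>>> For reference: MPITest32 is this program compiled natively on the 32-bit
>>>> machine and MPITest64 is the same source compiled on the 64-bit one,
>>>> roughly like this (the source file name below is just a placeholder):
>>>>
>>>> mpicc mpitest.c -o MPITest32     # on aim-plankton (32-bit)
>>>> mpicc mpitest.c -o MPITest64     # on aim-fanta4 (64-bit)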
>>>>
>>>>
>>>> If I make this call:
>>>> mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>
>>>> (run_gdb.sh is a script which starts gdb in an xterm for each
>>>> process)
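>>>>
>>>> Roughly, run_gdb.sh boils down to something like the following sketch
>>>> (the real script may differ in details):
>>>>
>>>> #!/bin/sh
>>>> # open one xterm per rank and run the given program (plus its
>>>> # arguments) under gdb inside it
>>>> exec xterm -e gdb --args "$@"
>>>>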
>>>> Process 0 (on aim-plankton) passes the barrier and gets stuck in
>>>> PMPI_Finalize; all other processes get stuck in PMPI_Barrier.
>>>> Process 1 (on aim-plankton) displays the message
>>>> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> Process 2 (on aim-plankton) displays the same message twice.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks Jody
>>>>
>>>> On Thu, Apr 10, 2008 at 1:05 PM, jody <jody.xha_at_[hidden]> wrote:
>>>>> Hi
>>>>> Using a more realistic application than a simple "Hello, world",
>>>>> even the --host version doesn't work correctly.
>>>>> Called this way:
>>>>>
>>>>> mpirun -np 3 --host aim-plankton ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>>>>>
>>>>> the application starts but seems to hang after a while.
>>>>>
>>>>> Running the application in gdb:
>>>>>
>>>>> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg -o bruzlopf -n 12 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>>>>>
>>>>> I can see that the processes on aim-fanta4 have indeed gotten stuck
>>>>> after a few initial outputs,
>>>>> and the processes on aim-plankton all show the message:
>>>>>
>>>>> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>>>
>>>>> If I use only aim-plankton alone or aim-fanta4 alone, everything
>>>>> runs as expected.
>>>>>
>>>>> BTW: I'm using Open MPI 1.2.2
>>>>>
>>>>> Thanks
>>>>> Jody
>>>>>
>>>>>
>>>>> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody.xha_at_[hidden]> wrote:
>>>>>> Hi
>>>>>> In my network I have some 32-bit machines and some 64-bit machines.
>>>>>> With --host I successfully call my application:
>>>>>> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>> (MPITest64 has the same code as MPITest, but was compiled on the
>>>>>> 64-bit machine)
>>>>>>
>>>>>> But when I use hostfiles:
>>>>>> mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>> all 6 processes are started on the 64-bit machine aim-fanta4.
>>>>>>
>>>>>> hosts32:
>>>>>> aim-plankton slots=3
>>>>>> hosts64:
>>>>>> aim-fanta4 slots
>>>>>>
>>>>>> Is this a bug or a feature? ;)
>>>>>>
>>>>>> Jody
>>>>>>
>>>>>
>>>
>>>
>>> --
>>>
>>> =========================
>>> rolf.vandevaart_at_[hidden]
>>> 781-442-3043
>>> =========================
>>>
>