Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problems with hostfile when doing MPMD
From: jody (jody.xha_at_[hidden])
Date: 2008-04-10 08:50:31


Rolf,
I was able to run hostname on the two nodes that way,
and a simplified version of my test program (without a barrier)
also works. Only MPI_Barrier shows the bad behaviour.
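
If I remember correctly, Open MPI's TCP btl opens connections lazily, only
when two ranks first exchange a message, so a run with no communication can
look fine even if some node pairs cannot actually reach each other;
MPI_Barrier is then the first call that forces traffic between the two
machines. Just as a sketch along the lines of my test program, a minimal
pairwise ping like the one below should show which pair of ranks fails to
connect, since the loop hangs at the first unreachable rank:

#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

/* rank 0 sends one int to every other rank and waits for an echo,
   so a hang shows exactly which connection cannot be established */
int main(int argc, char *argv[]) {
    int rank, size, i, token = 42;
    char host[256];
    MPI_Status st;

    gethostname(host, 255);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (i = 1; i < size; i++) {
            MPI_Send(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &st);
            printf("[%s] rank 0 <-> rank %d ok\n", host, i);
        }
    } else {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}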

Do you know what this message means?
[aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Does it give any idea of what the problem could be?
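
A guess on my part: errno values are platform-specific, but on Linux 113
should be EHOSTUNREACH, "No route to host", which would point at a firewall,
routing or wrong-interface problem between the two nodes rather than at MPI
itself. A two-line check of what 113 means on a given machine:

#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void) {
    /* print the human-readable text for errno 113; on Linux this
       should be "No route to host", i.e. EHOSTUNREACH */
    printf("errno 113: %s (EHOSTUNREACH is %d here)\n",
           strerror(113), EHOSTUNREACH);
    return 0;
}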

Jody

On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
<Rolf.Vandevaart_at_[hidden]> wrote:
>
> This worked for me, although I am not sure how extensive our 32/64
> interoperability support is. I tested on Solaris using the TCP
> interconnect and a 1.2.5 version of Open MPI. Also, we configure with
> the --enable-heterogeneous flag, which may make a difference here. Note
> that this did not work for me over the sm btl.
>
> By the way, can you run a simple /bin/hostname across the two nodes?
>
>
> burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32
> burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64
> burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
> [burl-ct-v20z-4]I am #0/6 before the barrier
> [burl-ct-v20z-5]I am #3/6 before the barrier
> [burl-ct-v20z-5]I am #4/6 before the barrier
> [burl-ct-v20z-4]I am #1/6 before the barrier
> [burl-ct-v20z-4]I am #2/6 before the barrier
> [burl-ct-v20z-5]I am #5/6 before the barrier
> [burl-ct-v20z-5]I am #3/6 after the barrier
> [burl-ct-v20z-4]I am #1/6 after the barrier
> [burl-ct-v20z-5]I am #5/6 after the barrier
> [burl-ct-v20z-5]I am #4/6 after the barrier
> [burl-ct-v20z-4]I am #2/6 after the barrier
> [burl-ct-v20z-4]I am #0/6 after the barrier
> burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
> mpirun (Open MPI) 1.2.5r16572
>
> Report bugs to http://www.open-mpi.org/community/help/
> burl-ct-v20z-4 65 =>
>
>
>
>
> jody wrote:
> > I narrowed it down:
> > The majority of processes get stuck in MPI_Barrier.
> > My test application looks like this:
> >
> > #include <stdio.h>
> > #include <unistd.h>
> > #include "mpi.h"
> >
> > int main(int iArgC, char *apArgV[]) {
> >     int iResult = 0;
> >     int iRank1;
> >     int iNum1;
> >
> >     char sName[256];
> >     gethostname(sName, 255);
> >
> >     MPI_Init(&iArgC, &apArgV);
> >
> >     MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
> >     MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
> >
> >     printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
> >     MPI_Barrier(MPI_COMM_WORLD);
> >     printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
> >
> >     MPI_Finalize();
> >
> >     return iResult;
> > }
> >
> >
> > If I make this call:
> > mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
> > ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
> > ./run_gdb.sh ./MPITest64
> >
> > (run_gdb.sh is a script which starts gdb in an xterm for each process.)
> > Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize;
> > all other processes get stuck in PMPI_Barrier.
> > Process 1 (on aim-plankton) displays the message
> > [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> > connect() failed with errno=113
> > Process 2 (on aim-plankton) displays the same message twice.
> >
> > Any ideas?
> >
> > Thanks Jody
> >
> > On Thu, Apr 10, 2008 at 1:05 PM, jody <jody.xha_at_[hidden]> wrote:
> >> Hi
> >> With a more realistic application than a simple "Hello, world",
> >> even the --host version doesn't work correctly.
> >> Called this way:
> >>
> >> mpirun -np 3 --host aim-plankton ./QHGLauncher
> >> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
> >> ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
> >>
> >> the application starts but seems to hang after a while.
> >>
> >> Running the application in gdb:
> >>
> >> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
> >> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
> >> -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
> >> -o bruzlopf -n 12
> >> --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
> >>
> >> I can see that the processes on aim-fanta4 have indeed gotten stuck
> >> after a few initial outputs,
> >> and the processes on aim-plankton all show the message:
> >>
> >> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> >> connect() failed with errno=113
> >>
> >> If I use aim-plankton alone or aim-fanta4 alone, everything runs
> >> as expected.
> >>
> >> BTW: I'm using Open MPI 1.2.2.
> >>
> >> Thanks
> >> Jody
> >>
> >>
> >> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody.xha_at_[hidden]> wrote:
> >> > Hi
> >> > In my network I have some 32-bit machines and some 64-bit machines.
> >> > With --host I successfully call my application:
> >> > mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
> >> > -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
> >> > (MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine)
> >> >
> >> > But when I use hostfiles:
> >> > mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
> >> > -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
> >> > all 6 processes are started on the 64-bit machine aim-fanta4.
> >> >
> >> > hosts32:
> >> > aim-plankton slots=3
> >> > hosts64:
> >> > aim-fanta4 slots
> >> >
> >> > Is this a bug or a feature? ;)
> >> >
> >> > Jody
> >> >
> >>
>
>
> --
>
> =========================
> rolf.vandevaart_at_[hidden]
> 781-442-3043
> =========================