Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problems with hostfile when doing MPMD
From: jody (jody.xha_at_[hidden])
Date: 2008-04-10 07:26:21


I narrowed it down:
The majority of processes get stuck in MPI_Barrier.
My test application looks like this:

#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
    int iResult = 0;
    int iRank1;
    int iNum1;

    char sName[256];
    gethostname(sName, 255);

    MPI_Init(&iArgC, &apArgV);

    MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
    MPI_Comm_size(MPI_COMM_WORLD, &iNum1);

    printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);

    MPI_Finalize();

    return iResult;
}
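
(For completeness: building the two binaries is just a matter of running Open MPI's mpicc wrapper once on each machine, e.g. with the source saved as MPITest.c:)

mpicc -o MPITest32 MPITest.c   # on the 32-bit machine (aim-plankton)
mpicc -o MPITest64 MPITest.c   # on the 64-bit machine (aim-fanta4)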

If I make this call:
mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : \
    -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64

(run_gdb.sh is a script which starts gdb in an xterm for each process)
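Roughly, run_gdb.sh does the equivalent of this minimal sketch (assuming xterm and gdb are available on every node):

#!/bin/sh
# start the given program under gdb in its own xterm window;
# "$@" is the program and its arguments as handed over by mpirun
exec xterm -e gdb --args "$@"
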
Process 0 (on aim-plankton) passes the barrier and then gets stuck in PMPI_Finalize;
all other processes get stuck in PMPI_Barrier.
Process 1 (on aim-plankton) displays the message
   [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
(on Linux, errno 113 is EHOSTUNREACH, "No route to host").
Process 2 (on aim-plankton) displays the same message twice.

Any ideas?

  Thanks, Jody

On Thu, Apr 10, 2008 at 1:05 PM, jody <jody.xha_at_[hidden]> wrote:
> Hi
> Using a more realistic application than a simple "Hello, world",
> even the --host version doesn't work correctly.
> Called this way:
>
> mpirun -np 3 --host aim-plankton ./QHGLauncher
> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
> ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>
> the application starts but seems to hang after a while.
>
> Running the application in gdb:
>
> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
> -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
> -o bruzlopf -n 12
> --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>
> I can see that the processes on aim-fanta4 have indeed gotten stuck
> after a few initial outputs,
> and the processes on aim-plankton all show this message:
>
> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
>
> If I use only aim-plankton or only aim-fanta4, everything runs
> as expected.
>
> BTW: I'm using Open MPI 1.2.2
>
> Thanks
> Jody
>
>
> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody.xha_at_[hidden]> wrote:
> > Hi
> > In my network I have some 32-bit machines and some 64-bit machines.
> > With --host I successfully call my application:
> > mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
> > -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
> > (MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine)
> >
> > But when I use hostfiles:
> > mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
> > -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
> > all 6 processes are started on the 64-bit machine aim-fanta4.
> >
> > hosts32:
> > aim-plankton slots=3
> > hosts64:
> > aim-fanta4 slots
> >
> > Is this a bug or a feature? ;)
> >
> > Jody
> >
>