Hi,
I am facing problems running OpenMPI-1.0.1 on a heterogeneous cluster.
I have a Linux machine and a SunOS machine in this cluster.
linux$ uname -a
Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004
i686 i686 i386 GNU/Linux
OpenMPI-1.0.1 is installed using
./configure --prefix=...
make all install
sunos$ uname -a
SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10
OpenMPI-1.0.1 is installed using
./configure --prefix=...
make all install
I use ssh. Both nodes are accessible without prompts for password.
I use the following simple application:
------------------------------------------------------------------------
#include <stdio.h>   /* needed for printf */
#include <mpi.h>

int main(int argc, char** argv)
{
    int rc, me;
    char pname[MPI_MAX_PROCESSOR_NAME];
    int plen;

    MPI_Init(
        &argc,
        &argv
    );

    /* Rank of this process in MPI_COMM_WORLD */
    rc = MPI_Comm_rank(
        MPI_COMM_WORLD,
        &me
    );
    if (rc != MPI_SUCCESS)
    {
        return rc;
    }

    /* Name of the host this process is running on */
    MPI_Get_processor_name(
        pname,
        &plen
    );

    printf("%s:Hello world from %d\n", pname, me);

    MPI_Finalize();
    return 0;
}
------------------------------------------------------------------------
It is compiled as follows:
linux$ mpicc -o mpiinit_linux mpiinit.c
sunos$ mpicc -o mpiinit_sunos mpiinit.c
My hosts file is
hosts.txt
---------
pg1cluster01 slots=2
csultra01 slots=1
My app file is
mpiinit_appfile
---------------
-np 2 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_linux
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
When I run it, I get the following error:
$ mpirun --hostfile hosts.txt --app mpiinit_appfile
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
fatal: relocation error: file
/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
symbol nanosleep: referenced symbol not found
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
fatal: relocation error: file
/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
symbol nanosleep: referenced symbol not found
I fixed this by passing the "-lrt" option to the linker when compiling:
sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt
However, when I run mpirun again, I get this error:
$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start
as expected.
[pg1cluster01:19858] ERROR: There may be more information available from
[pg1cluster01:19858] ERROR: the remote shell (see above).
[pg1cluster01:19858] ERROR: The daemon exited unexpectedly with status 255.
2 processes killed (possibly by Open MPI)
Sometimes I get this error instead:
$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with errno=28
[csultra01:06256] mca_mpool_sm_init: unable to create shared memory mapping
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
Could you please suggest how to resolve this problem? Let me know if
you need more details.
Regards,
Ravi.