Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] runtime error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-02-10 08:29:46


I typically see these kinds of errors when there's an Open MPI version mismatch between the nodes, and/or if there are slightly different flavors of Linux installed on each node (i.e., you're technically in a heterogeneous situation, but you're trying to run a single application binary). Can you verify:

1. that you have exactly the same version of Open MPI installed on all nodes? (and that your application was compiled against that exact version)

2. that you have exactly the same OS/update level installed on all nodes? (e.g., the same version of glibc)
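
Both checks can be scripted. The following is only a minimal sketch, not an official Open MPI tool. It assumes passwordless ssh to every host in the machinefile and one bare hostname per line (no slot counts). The function names collect and check_consistent are made up for illustration. Note that a non-interactive ssh session may see a different PATH than a login shell, so the result reflects what a remotely launched process would find.

```shell
#!/bin/sh
# Sketch: compare one property (e.g. the Open MPI version) across nodes.
# Assumptions: passwordless ssh works for every host in the machinefile,
# and the machinefile holds one bare hostname per line.

# Print "<host> <value>" per host, where <value> is the first line of
# output from running the given command on that host.
collect() {
    machinefile=$1
    shift
    while read -r host; do
        [ -n "$host" ] || continue
        # -n keeps ssh from swallowing the rest of the machinefile on stdin
        value=$(ssh -n "$host" "$@" 2>&1 | head -n 1)
        printf '%s %s\n' "$host" "$value"
    done < "$machinefile"
}

# Read "<host> <value>" lines on stdin, print each distinct value with the
# hosts that reported it, and return non-zero if more than one value was seen.
check_consistent() {
    awk '
        { host = $1; sub(/^[^ ]+ /, ""); seen[$0] = seen[$0] " " host }
        END {
            n = 0
            for (v in seen) { n++; printf "%s:%s\n", v, seen[v] }
            exit (n > 1)
        }'
}

# Possible usage (paths taken from the message quoted below):
#   collect /home/radic/mfile mpirun -version | check_consistent
#   collect /home/radic/mfile ldd --version   | check_consistent
```

If every node agrees, a single line is printed and the exit status is 0. More than one line means the nodes differ, and the host names after each value show which ones.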

On Feb 10, 2011, at 3:13 AM, Marcela Castro León wrote:

> Hello
> I have a program that has always worked fine, but I'm trying it on a new cluster and it fails when I run it on more than one machine.
> I mean, if I run it alone on each host, everything works fine:
> radic_at_santacruz:~/gaps/caso3-i1$ mpirun -np 3 ../test parcorto.txt
>
> But when I execute
> radic_at_santacruz:~/gaps/caso3-i1$ mpirun -np 3 -machinefile /home/radic/mfile ../test parcorto.txt
>
> I get this error:
>
> mpirun has exited due to process rank 0 with PID 2132 on
> node santacruz exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> Even when the machinefile (mfile) listed only one machine, the program failed.
> This is the current content:
>
> radic_at_santacruz:~/gaps/caso3-i1$ cat /home/radic/mfile
> santacruz
> chubut
>
> I've debugged the program, and the error occurs after proc0 does an
> MPI_Recv(&nomproc,lennomproc,MPI_CHAR,i,tag,MPI_COMM_WORLD,&Stat);
> from the remote process.
>
> I've run several tests, which I list here:
>
> 1) I changed the order in the machinefile:
> radic_at_santacruz:~/gaps/caso3-i1$ cat /home/radic/mfile
> chubut
> santacruz
>
> In that case, I get this error:
> [chubut:2194] *** An error occurred in MPI_Recv
> [chubut:2194] *** on communicator MPI_COMM_WORLD
> [chubut:2194] *** MPI_ERR_TRUNCATE: message truncated
> [chubut:2194] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> and then
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 2194 on
> node chubut exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> 2) I get the same error when launching from host chubut instead of santacruz.
> 3) Simple MPI programs like MPI_Hello world work fine, but I suppose those are very simple programs.
>
> radic_at_santacruz:~/gaps$ mpirun -np 3 -machinefile /home/radic/mfile MPI_Hello
> Hola Mundo Hola Marce 1
> Hola Mundo Hola Marce 0
> Hola Mundo Hola Marce 2
>
>
> This is the information you ask for with runtime problems.
> a) radic_at_santacruz:~$ mpirun -version
> mpirun (Open MPI) 1.4.1
> b) I'm using Ubuntu 10.04. I installed the packages using apt-get install, so I don't have a config.log.
> c) The output of ompi_info --all is in the file ompi_info.zip
> d) These are PATH and LD_LIBRARY_PATH
> radic_at_santacruz:~$ echo $PATH
> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
> radic_at_santacruz:~$ echo $LD_LIBRARY_PATH
>
>
> Thank you very much.
>
> Marcela.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/