Open MPI User's Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2007-10-15 10:52:08


This problem was caused by a couple of things.

First is a problem with the default MCA parameters. By default, the
global and local snapshot directories are both '/tmp', and the file
transfer mode is 'in_place'. The 'in_place' mode assumes that the
global snapshot directory is an NFS-mounted directory that all
machines can access. Typically '/tmp' is not such a directory. :(

I'll likely change the defaults (in the next day or so) to make the
default global snapshot directory $HOME or $CWD. Of course, all of
this behavior can be changed by setting the MCA parameters for the
global and local snapshot directories and the transfer mechanism. The
MCA parameters in question are described in the Checkpoint/Restart
user's guide at the link below:
   https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
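
If you want a quick sanity check before picking a global snapshot
directory, something like the sketch below may help. It is not part of
Open MPI; the helper name is_nfs_dir is made up for illustration, and
it assumes Linux (statfs() and NFS_SUPER_MAGIC are Linux-specific). It
only reports whether a path sits on an NFS mount on the node where it
runs.

```c
#include <stdio.h>
#include <sys/vfs.h>      /* statfs() (Linux-specific) */
#include <linux/magic.h>  /* NFS_SUPER_MAGIC */

/* Hypothetical helper, not an Open MPI function.
 * Returns 1 if `path` is on an NFS mount, 0 if it is on some other
 * filesystem, and -1 if statfs() fails (e.g. the path does not exist). */
int is_nfs_dir(const char *path)
{
    struct statfs fs;

    if (statfs(path, &fs) != 0) {
        perror(path);
        return -1;
    }
    return fs.f_type == NFS_SUPER_MAGIC ? 1 : 0;
}
```

Calling is_nfs_dir("/tmp") on each node would typically return 0,
which matches the failure described above. A result of 1 on every node
is still only a hint: it does not prove that all nodes mount the same
export, and other shared filesystems will report 0 here.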

Once we got around this problem, we discovered a second one:
restarting on a local machine without the aid of a resource manager
(e.g., SLURM or Torque) failed. This bug was fixed in r16433.

The combination of these two items fixed the problems that Hiep was
experiencing.

-- Josh

On Oct 10, 2007, at 11:04 AM, Josh Hursey wrote:

> For anyone following this thread: I am following up with Hiep
> offline. I'll reply back to the list once the issue is resolved.
>
> -- Josh
>
> On Oct 3, 2007, at 11:11 AM, Hiep Bui Hoang wrote:
>
>> Hi,
>> I found that the problem was the firewall on one of my
>> computers. When I set the firewall to allow TCP connections from
>> the other computers on ports 1024 to 4999, the connection errors
>> went away. But I still cannot checkpoint and restart my program.
>>
>> The error is:
>> $ mpirun -np 3 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
>> $ ompi-checkpoint 5693
>> --------------------------------------------------------------------------
>> Error: The application (PID = 5693) failed to checkpoint properly.
>> Returned -1.
>> --------------------------------------------------------------------------
>>
>> Only one local snapshot is created, on the computer where I run
>> mpirun and ompi-checkpoint. After that local snapshot is created,
>> the checkpoint terminates with the error above.
>> Can somebody help me solve this error?
>> Thanks.
>>
>> On 10/2/07, Hiep Bui Hoang <bhoanghiep_at_[hidden]> wrote:
>> Hi,
>> I set up Open MPI "trunk_16171" on 3 computers connected over a
>> LAN, set the environment variables, and configured passwordless ssh
>> for each node. I use Red Hat Enterprise Linux 5. The program I
>> tried is 'send_recv'. It runs successfully on those 3 nodes, and
>> checkpoint/restart works on 1 node. But I get an error when I try
>> to checkpoint/restart the program on 3 nodes.
>>
>> $ mpirun -np 4 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
>>
>> ....
>> Send 32 from rank 0
>> Receive 32 at rank 1
>> Send 33 from rank 0
>> Receive 33 at rank 1
>> [HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to
>> 172.28.11.40:3680 failed: Software caused connection abort (103)
>> [HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to
>> 172.28.11.40:3680 failed, connecting over all interfaces failed!
>> [node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to
>> 172.28.11.40:3680 failed: Software caused connection abort (103)
>> [node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to
>> 172.28.11.40:3680 failed, connecting over all interfaces failed!
>> Receive 34 at rank 1
>> Send 34 from rank 0
>> .....
>>
>> PID of above mpirun is 5693.
>> $ ompi-checkpoint 5693
>> --------------------------------------------------------------------------
>> Error: The application (PID = 5693) failed to checkpoint properly.
>> Returned -1.
>> --------------------------------------------------------------------------
>>
>> Does anybody know about this error?
>> Thanks.
>>
>> This is my 'send_recv' program:
>>
>> #include <stdio.h>
>> #include <unistd.h>   /* sleep() */
>> #include <mpi.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int node;
>>     int MAX = 1000;
>>     MPI_Status status;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &node);
>>
>>     /* Rank 0 sends the counter to rank 1 once per second;
>>        any other ranks just run through the loop. */
>>     int i = 0;
>>     while (i <= MAX) {
>>         if (0 == node) {
>>             MPI_Send(&i, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
>>             printf("Send %d from rank %d \n", i, node);
>>             sleep(1);
>>         }
>>         if (1 == node) {
>>             MPI_Recv(&i, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
>>             printf(" Receive %d at rank %d \n", i, node);
>>             sleep(1);
>>         }
>>         i++;
>>     }
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users