Open MPI User's Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2007-10-10 11:04:29


For anyone following this thread: I am following up with Hiep
offline. I'll reply to the list once the issue is resolved.

-- Josh

On Oct 3, 2007, at 11:11 AM, Hiep Bui Hoang wrote:

> Hi,
> I found that the problem was the firewall on one of my computers.
> Once I set the firewall to allow TCP connections from the other
> computers on ports 1024 to 4999, the connection errors went away.
> But I still cannot checkpoint and restart my program.
>
> The error is:
> $ mpirun -np 3 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
> $ ompi-checkpoint 5693
> --------------------------------------------
> Error: The application (PID = 5693) failed to checkpoint properly.
> Returned -1.
>
> --------------------------------------------------------------------------
>
> Only one local snapshot is created, on the computer where I run
> mpirun and ompi-checkpoint. After that local snapshot is created,
> the checkpoint terminates with the error above.
> Could somebody help me solve this error?
> Thanks.
>
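The firewall change described in the message above (allowing TCP on ports 1024 to 4999) can be checked from each node before launching mpirun. What follows is a minimal sketch in C and is not part of Open MPI: `tcp_port_reachable` is a hypothetical helper name, and the non-blocking connect with a timeout is an assumed approach chosen so that a firewall that silently drops packets cannot stall the probe.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function).
 * Returns  1 if a TCP connection to ip:port succeeds within timeout_ms,
 *          0 if it is refused, filtered, or times out,
 *         -1 if the address or port is invalid or no socket could be created. */
int tcp_port_reachable(const char *ip, int port, int timeout_ms)
{
    struct sockaddr_in addr;

    if (port < 1 || port > 65535)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Switch to non-blocking mode so connect() returns immediately. */
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    int ok = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        ok = 1;                          /* connected immediately */
    } else if (errno == EINPROGRESS) {
        /* Wait until the socket becomes writable or the timeout expires. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) == 1) {
            int err = 0;
            socklen_t len = sizeof(err);
            /* SO_ERROR holds the final result of the connect attempt. */
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err == 0)
                ok = 1;
        }
    }

    close(fd);
    return ok;
}
```

A call such as `tcp_port_reachable("172.28.11.40", 3680, 2000)` would probe the address and port that appear in the connection errors quoted below. A result of 0 does not by itself prove a firewall problem, because a port with no listening process is also reported as unreachable.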
> On 10/2/07, Hiep Bui Hoang <bhoanghiep_at_[hidden]> wrote:
> Hi,
> I have set up Open MPI "trunk_16171" on 3 computers connected over
> a LAN, set the environment parameters, and configured ssh so that
> no password is needed for each node. I use Red Hat Enterprise
> Linux 5. The program I tried is 'send_recv'. It runs successfully
> on those 3 nodes, and checkpoint/restart works on 1 node. But I
> get an error when I try to checkpoint/restart the program on 3
> nodes.
>
> $ mpirun -np 4 -host 172.28.11.40,172.28.11.28,172.28.11.18 -am ft-enable-cr send_recv
>
> ....
> Send 32 from rank 0
> Receive 32 at rank 1
> Send 33 from rank 0
> Receive 33 at rank 1
> [HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to
> 172.28.11.40:3680 failed: Software caused connection abort (103)
> [HNP:05700] [1,2]-[1,3] mca_oob_tcp_peer_try_connect: connect to
> 172.28.11.40:3680 failed, connecting over all interfaces failed!
> [node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to
> 172.28.11.40:3680 failed: Software caused connection abort (103)
> [node2:04837] [1,1]-[1,3] mca_oob_tcp_peer_try_connect: connect to
> 172.28.11.40:3680 failed, connecting over all interfaces failed!
> Receive 34 at rank 1
> Send 34 from rank 0
> .....
>
> The PID of the above mpirun is 5693.
> $ ompi-checkpoint 5693
> --------------------------------------------------------------------------
> Error: The application (PID = 5693) failed to checkpoint properly.
> Returned -1.
>
> --------------------------------------------------------------------------
>
> Does anybody know about this error?
> Thanks.
>
> This is my 'send_recv' program:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>   /* sleep() */
>
> int main(int argc, char **argv)
> {
>     int node;
>     int MAX = 1000;
>     MPI_Status status;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &node);
>
>     int i = 0;
>     while (i <= MAX) {
>         if (0 == node) {
>             /* Rank 0 sends the counter to rank 1, then waits a second. */
>             MPI_Send(&i, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
>             printf("Send %d from rank %d \n", i, node);
>             sleep(1);
>         }
>         if (1 == node) {
>             /* Rank 1 receives the counter from rank 0. */
>             MPI_Recv(&i, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
>             printf(" Receive %d at rank %d \n", i, node);
>             sleep(1);
>         }
>         i++;
>     }
>     MPI_Finalize();
>     return 0;
> }