Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem using openmpi with DMTCP
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-11-06 08:37:42


(Sorry for the excessive delay in replying)

I do not have any experience with the DMTCP project, so I can only
speculate on what might be going on here. If you are using DMTCP to
transparently checkpoint Open MPI, you will need to make sure that you
are not using any interconnect other than TCP.

If you are building an OPAL CRS component for DMTCP (actually, you
probably want their MTCP project, which is just the local
checkpoint/restart service), then what you might be seeing are the TCP
sockets that are left open across a checkpoint operation. As an
optimization for checkpoint->continue, we leave sockets open when we
checkpoint and close them only on restart. This works because most
checkpoint/restart services do not support sockets, so they skip over
the socket fds and take the checkpoint anyway. I suspect that DMTCP is
erroring out because it is trying to do something else with those fds.
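
As an aside, the numeric socket fields in the warnings quoted below can
be decoded into symbolic names, which may help when reporting the
problem. The following is a minimal sketch, not part of Open MPI or
DMTCP; the helper names family_name and type_name are made up for
illustration. The mapping is platform-specific: on Linux, domain 10 is
AF_INET6 and type 1 is SOCK_STREAM, so the warning would be complaining
about IPv6 stream sockets, but that reading is an inference from the log
and has not been confirmed with DMTCP.

```python
import socket


def family_name(value):
    """Return this platform's symbolic name for a numeric socket domain."""
    try:
        return socket.AddressFamily(value).name
    except ValueError:
        # The number is not a known address family on this platform.
        return f"unknown({value})"


def type_name(value):
    """Return this platform's symbolic name for a numeric socket type."""
    try:
        return socket.SocketKind(value).name
    except ValueError:
        return f"unknown({value})"


# Values copied from the quoted warning: _sockDomain = 10, _sockType = 1.
print(family_name(10), type_name(1))
```

Run this on the same kind of machine that produced the log, since the
numbers differ between operating systems.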

You may want to try using just the MTCP project, or ask the DMTCP
developers for a way to shut off the socket negotiation and simply
ignore the socket fds.
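
To illustrate what "ignore the socket fds" could mean in practice, here
is a rough sketch of how a process can tell its socket descriptors apart
from its other open descriptors. This is only an illustration under my
own assumptions; it is not how MTCP, DMTCP, or Open MPI actually do it,
and classify_fds and its max_fd limit are hypothetical.

```python
import os
import socket
import stat


def classify_fds(max_fd=1024):
    """Split this process's open fds into (sockets, everything else)."""
    sockets, others = [], []
    for fd in range(max_fd):
        try:
            mode = os.fstat(fd).st_mode
        except OSError:
            continue  # fd is not open in this process
        if stat.S_ISSOCK(mode):
            sockets.append(fd)
        else:
            others.append(fd)
    return sockets, others


# Open a connected socket pair so there is something to find.
left, right = socket.socketpair()
socket_fds, other_fds = classify_fds()
print("socket fds a checkpointer might skip:", socket_fds)
```

A checkpointer taking this approach would save state only for the
descriptors in the second list and leave the sockets for the application
to re-establish on restart.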

Let me know how it goes.

-- Josh

On Sep 28, 2009, at 9:55 AM, Kritiraj Sajadah wrote:

> Dear All,
> I am trying to integrate DMTCP with Open MPI. If I run a C
> application, it works fine. But when I execute the program using
> mpirun, it checkpoints the application but gives an error when
> restarting the application.
>
> #############
> [31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING
> ((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType ==
> SOCK_STREAM) failed'
> id() = 2ab3f248-30933-4ac0d75a(99007)
> _sockDomain = 10
> _sockType = 1
> _sockProtocol = 0
> Message: socket type not yet [fully] supported
> [31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING
> ((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType ==
> SOCK_STREAM) failed'
> id() = 2ab3f248-30943-4ac0d75c(99007)
> _sockDomain = 10
> _sockType = 1
> _sockProtocol = 0
> Message: socket type not yet [fully] supported
> [31013] WARNING at connection.cpp:87 in restartDup2; REASON='JWARNING
> (_real_dup2 ( oldFd, fd ) == fd) failed'
> oldFd = 537
> fd = 1
> (strerror((*__errno_location ()))) = Bad file descriptor
> [31013] WARNING at connectionmanager.cpp:627 in closeAll;
> REASON='JWARNING(_real_close ( i->second ) ==0) failed'
> i->second = 537
> (strerror((*__errno_location ()))) = Bad file descriptor
> [31015] WARNING at connectionmanager.cpp:627 in closeAll;
> REASON='JWARNING(_real_close ( i->second ) ==0) failed'
> i->second = 537
> (strerror((*__errno_location ()))) = Bad file descriptor
> [31017] WARNING at connectionmanager.cpp:627 in closeAll;
> REASON='JWARNING(_real_close ( i->second ) ==0) failed'
> i->second = 537
> (strerror((*__errno_location ()))) = Bad file descriptor
> [31007] WARNING at connectionmanager.cpp:627 in closeAll;
> REASON='JWARNING(_real_close ( i->second ) ==0) failed'
> i->second = 537
> (strerror((*__errno_location ()))) = Bad file descriptor
> MTCP: mtcp_restart_nolibc: mapping current version of /usr/lib/gconv/
> gconv-modules.cache into memory;
> _not_ file as it existed at time of checkpoint.
> Change mtcp_restart_nolibc.c:634 and re-compile, if you want
> different behavior.
> [31015] ERROR at connection.cpp:372 in restoreOptions;
> REASON='JASSERT(ret == 0) failed'
> (strerror((*__errno_location ()))) = Invalid argument
> fds[0] = 6
> opt->first = 26
> opt->second.size() = 4
> Message: restoring setsockopt failed
> Terminating...
> #############################################################
>
> Any suggestions are very welcome.
>
> regards,
>
> Raj