Prakash,
tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)
I ran into a similar problem before with OpenPBS, which also uses the
TM interface. It returned TM_ENOTCONNECTED (17002) when I tried to
call tm_init a second time (tm_init in turn calls tm_poll, which
returned that error).
I think you are calling tm_init from another node and connecting to
another mom, which I do not think is allowed. The TM module in Open MPI
has already called tm_init once. Why do you need to call tm_init
again?
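
To illustrate the point, here is a rough sketch of guarding the call so it
happens at most once per process. It does not link against the real TM
library: sim_tm_init() is a hypothetical stand-in that only mimics the
behavior described above (first call succeeds, later calls return 17002),
and tm_init_once() is a made-up helper name, not something in Open MPI or
PBS. The only values taken from this thread are 0 for success and 17002
for TM_ENOTCONNECTED.

```c
/* Values as reported in this thread. */
#define TM_SUCCESS        0
#define TM_ENOTCONNECTED  17002

/* Hypothetical stand-in for the real tm_init(). It only mimics the
 * behavior described above: the first call succeeds, any later call
 * fails with TM_ENOTCONNECTED. */
static int sim_init_calls = 0;

static int sim_tm_init(void)
{
    sim_init_calls++;
    return (sim_init_calls == 1) ? TM_SUCCESS : TM_ENOTCONNECTED;
}

/* Guard: run the init routine at most once per process and return
 * the cached result on every later request. */
static int init_done = 0;
static int init_result = TM_ENOTCONNECTED;

int tm_init_once(void)
{
    if (!init_done) {
        init_result = sim_tm_init();
        init_done = 1;
    }
    return init_result;
}
```

With a guard like this, code that is unsure whether the launcher has
already initialized TM would reuse the first result instead of triggering
a second tm_init. Whether that fits your case depends on why the second
call is needed.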
If you want to look at the PBS implementation, you can download the
source from openpbs.org. The relevant file in the OpenPBS source is:
v2.3.16/src/lib/Libifl/tm.c
--
Thanks,
- Pak Lui
pak.lui_at_[hidden]