
Open MPI User's Mailing List Archives


From: Prakash Velayutham (Prakash.Velayutham_at_[hidden])
Date: 2006-04-08 14:45:12


>>> Prakash.Velayutham_at_[hidden] 04/08/06 1:42 PM >>>
Hi Jeff,

>>> jsquyres_at_[hidden] 04/08/06 7:10 AM >>>
I am also curious as to why this would not work -- I was not under the
impression that tm_init() would fail from a non mother-superior node...?

What others say is that it will fail this way inside an Open MPI job
because Open MPI's RTE is holding the only TM connection available. But
the strange thing is that it works from Mother Superior without
Garrick's patch (actually, the behaviour is the same regardless of the
patch, but I have not rigorously tested the patch itself, so I cannot
comment on that), which I think should also have failed according to
the above contention.

FWIW: It has been our experience with both Torque and the various
flavors of PBS that you can repeatedly call tm_init() and tm_finalize()
within a single process, so I would be surprised if that was the issue.
Indeed, I'd have to double check, but I'm pretty sure that our MPI
processes do not call tm_init() (I believe that only mpirun does).
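[A rough sketch of the pattern described above, repeated tm_init()/tm_finalize()
pairs within a single process, might look like the following. This is my own
illustration, not code from the thread; it only uses the standard TM calls from
tm.h and would need to be linked against Torque's library and run inside a job
to do anything:]

```c
#include <stdio.h>
#include <tm.h>

/* Sketch only: open and close a TM connection several times from one
 * process. Per the observation above, this is expected to succeed under
 * Torque/PBS as long as every tm_init() is matched by a tm_finalize()
 * before the next tm_init(). */
int main(void)
{
        struct tm_roots roots;
        int i, ret;

        for (i = 0; i < 3; i++) {
                ret = tm_init(NULL, &roots);
                if (ret != TM_SUCCESS) {
                        printf("tm_init #%d failed: %d\n", i, ret);
                        return 1;
                }
                tm_finalize();
        }
        printf("three init/finalize cycles succeeded\n");
        return 0;
}
```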

But I am running my code using mpirun, so is this expected behaviour? I
am attaching my simple code below:

#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <string.h>     /* strlen() */
#include <unistd.h>     /* gethostname() */
#include <tm.h>
#include <mpi.h>

void do_check(int val, char *msg)
{
        if (TM_SUCCESS != val) {
                printf("ret is %d instead of %d: %s\n", val, TM_SUCCESS, msg);
                exit(1);
        }
}

int main(int argc, char *argv[])
{
        int size, rank, ret;
        MPI_Status status;
        /* was an uninitialized char ** being written through, i.e.
         * undefined behaviour; an initialized array fixes that */
        char *input[2] = { "/bin/echo", "Hello There" };
        struct tm_roots task_root;
        tm_event_t event;
        tm_task_id task_id;

        char hostname[64];
        char buf[] = "11000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000";

        gethostname(hostname, sizeof(hostname));

        ret = MPI_Init(&argc, &argv);
        if (ret != MPI_SUCCESS) {
                printf("Error: %d\n", ret);
                return 1;
        }
        ret = MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (ret != MPI_SUCCESS) {
                printf("Error: %d\n", ret);
                return 1;
        }
        ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (ret != MPI_SUCCESS) {
                printf("Error: %d\n", ret);
                return 1;
        }

        printf("First Hostname: %s node %d out of %d\n", hostname, rank, size);

        /* even ranks send to rank+1; with an odd size the last rank sits out */
        if (size % 2 && rank == size - 1)
                printf("Sitting out\n");
        else if (rank % 2 == 0)
                MPI_Send(buf, strlen(buf), MPI_BYTE, rank + 1, 11, MPI_COMM_WORLD);
        else
                MPI_Recv(buf, sizeof(buf), MPI_BYTE, rank - 1, 11, MPI_COMM_WORLD, &status);

        printf("Second Hostname: %s node %d out of %d\n", hostname, rank, size);

        if (rank == 1) {
                ret = tm_init(NULL, &task_root);
                do_check(ret, "tm_init failed");
                printf("Special Hostname: %s node %d out of %d\n", hostname, rank, size);
                task_id = 0xdeadbeef;
                event = 0xdeadbeef;
                printf("%s\t%s", input[0], input[1]);
                tm_finalize();
        }

        MPI_Finalize();

        return 0;
}

And the error I am getting is:

First Hostname: wins05 node 0 out of 4
First Hostname: wins03 node 1 out of 4
First Hostname: wins02 node 2 out of 4
First Hostname: wins01 node 3 out of 4
Second Hostname: wins05 node 0 out of 4
Second Hostname: wins02 node 2 out of 4
Second Hostname: wins03 node 1 out of 4
Second Hostname: wins01 node 3 out of 4
tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)

I am using Torque-2.0.0p7 and Open MPI-1.0.1.

Prakash: are you running an unmodified version of Torque 2.0.0p7?

I will test an unmodified version of 2.0.0p8 right now and let you know,
but I am positive that is not the issue.

TIA,
Prakash

> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of Prakash Velayutham
> Sent: Friday, April 07, 2006 10:13 AM
> To: Open MPI Users
> Cc: Pak.Lui_at_[hidden]
> Subject: Re: [OMPI users] Open MPI and Torque error
>
> Pak Lui wrote:
> > Prakash,
> >
> > tm_poll: protocol number dis error 11
> > ret is 17002 instead of 0: tm_init failed
> > 3 processes killed (possibly by Open MPI)
> >
> > I encountered a similar problem with OpenPBS before, which also uses
> > the TM interfaces. It returns TM_ENOTCONNECTED (17002) when I tried
> > to call tm_init for the second time (which in turn calls tm_poll and
> > returned that errno).
> >
> > I think what you did, starting tm_init from another node and
> > connecting to another MOM, is not allowed. The TM module in Open MPI
> > already called tm_init once. I am curious to know the reason that
> > you need to call tm_init again?
> >
> > If you are curious about the implementation for PBS, you can
> > download the source from openpbs.org. OpenPBS source:
> > v2.3.16/src/lib/Libifl/tm.c
> I am interested in getting this to work, as I am working on
> implementing support for dynamic scheduling in Torque. I want any node
> in an MPI-2 job (basically the Open MPI implementation) to be able to
> request more nodes from the Torque/PBS server. I am doing a little
> study on that right now. Instead of nodes talking directly to the
> server, I want them to be able to talk to Mother Superior, and MS in
> turn will talk to the server.
>
> Could you please explain why this does not work now? And why it works
> when I do the tm_init from MS, but fails from any other MOM?
>
> Thanks,
> Prakash

Hi All,

I have an update on this. This is with the unmodified version of
torque-2.0.0p8 and the same MPI code given above.

If you look at the code, I am doing tm_init() only on the node with
rank 1. What I see when I submit the job through Torque is that the MOM
daemon on the node that gets rank 1 dies for some reason. Maybe it died
even with my modified Torque code, and that was the reason I was getting
the ENOTCONNECTED errors all along. Now I guess I just have to debug why
the node's MOM daemon keeps dying on this code. I am fairly sure it is
dying during the tm_init call, but I still have to corroborate this
myself.
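
[A first debugging pass on the dying pbs_mom might look like the following.
These are my own suggestions, not from the thread; the hostname, log path, and
diagnostic level are examples from a typical Torque install and should be
adjusted for the site:]

```shell
# Is the MOM on the rank-1 node (wins03 in the run above) still answering?
momctl -d 2 -h wins03

# Tail that node's MOM log from around the time the job ran
# (default Torque log location; one file per day)
tail -50 /var/spool/torque/mom_logs/$(date +%Y%m%d)

# If the daemon really crashes while servicing tm_init, run it in the
# foreground under a debugger on the compute node (as root) and reproduce
gdb --args pbs_mom -D
```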

Thanks again,
Prakash