
Open MPI User's Mailing List Archives

From: hpetit_at_[hidden]
Date: 2006-11-02 03:10:05

Thank you for your support Ralph, I really appreciate it.

I now have a better understanding of your very first answer, which asked whether I had a NODES environment variable.
It was related to the fact that your platform is configured with LSF.
I have read some tutorials about LSF, and it seems that LSF provides a "llogin" command that creates an environment where the NODES variable is permanently defined.

Then, under this "llogin" environment, all jobs are automatically allocated to the nodes defined with NODES.

I think this is why spawning works fine under those conditions.

Unfortunately, LSF is commercial software, so I am not able to install it on my platform.
I am afraid I cannot do anything more on my side for now.

You proposed to concoct something over the next few days. I look forward to hearing from you.



Date: Tue, 31 Oct 2006 06:53:53 -0700
From: Ralph H Castain <rhc_at_[hidden]>
Subject: Re: [OMPI users] MPI_Comm_spawn multiple bproc support
To: "Open MPI Users <users_at_[hidden]>" <users_at_[hidden]>
Message-ID: <C16CA381.5759%rhc_at_[hidden]>
Content-Type: text/plain; charset="ISO-8859-1"

Aha! Thanks for your detailed information - that helps identify the problem.

See some thoughts below.

On 10/31/06 3:49 AM, "hpetit_at_[hidden]" <hpetit_at_[hidden]> wrote:

> Thank you for your quick reply Ralph,
> As far as I know, the NODES environment variable is created when a job is
> submitted to the bjs scheduler.
> The only way I know (but I am a bproc newbie) is to use the bjssub command.

That is correct. However, Open MPI requires that ALL of the nodes you are
going to use must be allocated in advance. In other words, you have to get
an allocation large enough to run your entire job - both the initial
application and anything you comm_spawn.
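
As a rough illustration of the sizing rule above, here is a minimal sketch that counts the entries in NODES and compares them with the number of processes the job will need. It assumes that NODES holds a comma-separated list of node identifiers and that each process needs its own node; both are simplifying assumptions, and the helper names parse_nodes and allocation_covers_job are hypothetical, not part of Open MPI or bjs.

```python
import os


def parse_nodes(nodes_value):
    """Split a NODES-style string such as "0,1,2" into a list of entries.

    Assumption: the scheduler exports NODES as a comma-separated list.
    """
    return [n.strip() for n in nodes_value.split(",") if n.strip()]


def allocation_covers_job(nodes_value, initial_procs, spawned_procs):
    """Return True if there is at least one allocated node per process.

    Assumption: one process per node, which is a simplification.
    """
    needed = initial_procs + spawned_procs
    return len(parse_nodes(nodes_value)) >= needed


if __name__ == "__main__":
    nodes = os.environ.get("NODES", "")
    # One process from "mpirun -np 1" plus, for example, two spawned children.
    if allocation_covers_job(nodes, initial_procs=1, spawned_procs=2):
        print("allocation looks large enough")
    else:
        print("allocation too small (or NODES not set): %r" % nodes)
```

Running this inside the allocated environment, before calling mpirun, gives a quick hint about whether the spawn could fail simply because the allocation is too small.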

I wish I could help you with the proper bjs commands to get an allocation,
but I am not familiar with bjs and (even after multiple Google searches)
cannot find any documentation on that code. Try doing a "bjs --help" and see
what it says.

> Then, I have retried my test with the following running command: "bjssub -i
> mpirun -np 1 main_exe".

> I guess, this problem comes from the way I set the parameters to the spawned
> program. Instead of giving instructions to spawn the program on a specific
> host, I should set parameters to spawn the program on a specific node.
> But I do not know how to do it.

What you did was fine. "host" is the correct field to set. I suspect two
possible issues:

1. The specified host may not be in the allocation. In the case you showed
here, I would expect it to be since you specified the same host we are
already on. However, you might try running mpirun with the "--nolocal"
option - this will force mpirun to launch the processes on a machine other
than the one you are on. Typically you are on the head node, and on many bproc
machines this node is not included in an allocation because the system admins
don't want you running MPI jobs on it.

2. We may have something wrong in our code for this case. I'm not sure how
well that has been tested, especially in the 1.1 code branch.

> Then, I have a bunch of questions:
> - when mpi is used together with bproc, is it necessary to use bjssub or bjs
> in general ?

You have to use some kind of resource manager to obtain a node allocation
for your use. At our site, we use LSF - other people use bjs. Anything that
sets the NODES variable is fine.

> - I was wondering if I had to submit to bjs the spawned program ? i.e. do I
> have to add 'bjssub' to the commands parameter of the MPI_Comm_spawn_multiple
> call ?

You shouldn't have to do so. I suspect, however, that bjssub is not getting
a large enough allocation for your combined mpirun + spawned job. I'm not
familiar enough with bjs to know for certain.

> As you can see, I am still not able to spawn a program and need some more
> help. Do you have some examples describing how to do it ?

Unfortunately, not in the 1.1 branch, nor do I have one for
comm_spawn_multiple that uses the "host" field. I can try to concoct
something over the next few days, though, and verify that our code is
working correctly.
