Yes...it would indeed.


On 7/23/07 9:03 AM, "Kelley, Sean" <Sean.Kelley@solers.com> wrote:

Would this logic be in the bproc pls component?
Sean


From: users-bounces@open-mpi.org on behalf of Ralph H Castain
Sent: Mon 7/23/2007 9:18 AM
To: Open MPI Users <users@open-mpi.org>
Subject: Re: [OMPI users] orterun --bynode/--byslot problem

No, byslot appears to be working just fine on our bproc clusters (it is the
default mode). As you probably know, bproc is a little strange in how we
launch - we have to launch the procs in "waves" that correspond to the
number of procs on a node.

In other words, the first "wave" launches a proc on all nodes that have at
least one proc on them. The second "wave" then launches another proc on all
nodes that have at least two procs on them, but doesn't launch anything on
any node that only has one proc on it.
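
A minimal sketch of the wave grouping described above, in Python. The function name `launch_waves` and the per-node counts (modeled on hostfile7 with -np 7) are illustrative assumptions; this is not the bproc pls code itself.

```python
def launch_waves(procs_per_node):
    """Group nodes into launch waves.

    Wave k contains every node that hosts at least k procs, so a node
    with only one proc appears in wave 1 and in no later wave.
    """
    waves = []
    max_procs = max(procs_per_node.values(), default=0)
    for k in range(1, max_procs + 1):
        waves.append([node for node, count in procs_per_node.items() if count >= k])
    return waves


# Hypothetical per-node proc counts: head has 1 proc, each blade has 3.
for i, nodes in enumerate(launch_waves({"head": 1, ".0": 3, ".1": 3}), start=1):
    print(f"wave {i}: {nodes}")
```

With these counts the head node takes part only in the first wave; the second and third waves cover the two blades alone.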

My guess here is that the system for some reason is insisting that your head
node be involved in every wave. I confess that we have never tested (to my
knowledge) a mapping that involves "skipping" a node somewhere in the
allocation - we always just map from the beginning of the node list, with
the maximum number of procs being placed on the first nodes in the list
(since in our machines, the nodes are all the same, so who cares?). So it is
possible that something in the code objects to skipping around nodes in the
allocation.
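
For reference, the two mapping policies under discussion can be modeled roughly as follows. This is a simplified sketch, assuming each node is just a (name, max_slots) pair; `map_byslot` and `map_bynode` are hypothetical helper names, not functions from the Open MPI mapper.

```python
def map_byslot(nodes, np):
    """Fill each node up to its slot limit before moving to the next node."""
    placement = []
    for name, slots in nodes:
        for _ in range(slots):
            if len(placement) == np:
                return placement
            placement.append(name)
    if len(placement) < np:
        raise ValueError("not enough slots for the requested number of procs")
    return placement


def map_bynode(nodes, np):
    """Place one proc per node in round-robin order, skipping full nodes."""
    if np > sum(slots for _, slots in nodes):
        raise ValueError("not enough slots for the requested number of procs")
    used = {name: 0 for name, _ in nodes}
    placement = []
    while len(placement) < np:
        for name, slots in nodes:
            if len(placement) == np:
                break
            if used[name] < slots:
                used[name] += 1
                placement.append(name)
    return placement


# Slot limits taken from hostfile7 in the original report ("-1" is the head node).
hostfile7 = [("head", 1), (".0", 3), (".1", 3)]
print(map_byslot(hostfile7, 7))
print(map_bynode(hostfile7, 7))
```

Under this model the bynode placement for hostfile7 is head, .0, .1, .0, .1, .0, .1, which matches the rank-to-host output reported for the --bynode run, while byslot places one proc on head and then fills .0 before .1.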

I will have to look and see where that dependency might lie - will try to
get to it this week.

BTW: that patch I sent you for head node operations will be in 1.2.4.

Ralph



On 7/23/07 7:04 AM, "Kelley, Sean" <Sean.Kelley@solers.com> wrote:

> Hi,
>
>      We are experiencing a problem with process allocation on our Open MPI
> cluster. We are using Scyld 4.1 (BPROC), the OFED 1.2 Topspin InfiniBand
> drivers, and Open MPI 1.2.3 + patch (to run processes on the head node). The
> hardware consists of a head node and N blades on private Ethernet and
> InfiniBand networks.
>
> The command run for these tests is a simple MPI program (called 'hn') which
> prints out the rank and the hostname. The hostname for the head node is 'head'
> and the compute nodes are '.0' ... '.9'.
>
> We are using the following hostfiles for this example:
>
> hostfile7
> -1 max_slots=1
> 0 max_slots=3
> 1 max_slots=3
>
> hostfile8
> -1 max_slots=2
> 0 max_slots=3
> 1 max_slots=3
>
> hostfile9
> -1 max_slots=3
> 0 max_slots=3
> 1 max_slots=3
>
> running the following commands:
>
> orterun --hostfile hostfile7 -np 7 ./hn
> orterun --hostfile hostfile8 -np 8 ./hn
> orterun --byslot --hostfile hostfile7 -np 7 ./hn
> orterun --byslot --hostfile hostfile8 -np 8 ./hn
>
> causes orterun to crash. However,
>
> orterun --hostfile hostfile9 -np 9 ./hn
> orterun --byslot --hostfile hostfile9 -np 9 ./hn
>
> works, outputting the following:
>
> 0 head
> 1 head
> 2 head
> 3 .0
> 4 .0
> 5 .0
> 6 .0
> 7 .0
> 8 .0
>
> However, running the following:
>
> orterun --bynode --hostfile hostfile7 -np 7 ./hn
>
> works, outputting the following:
>
> 0 head
> 1 .0
> 2 .1
> 3 .0
> 4 .1
> 5 .0
> 6 .1
>
> Is the '--byslot' crash a known problem? Does it have something to do with
> BPROC? Thanks in advance for any assistance!
>
> Sean
>
> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

