Open MPI User's Mailing List Archives


From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-07-23 11:07:12


Yes...it would indeed.

On 7/23/07 9:03 AM, "Kelley, Sean" <Sean.Kelley_at_[hidden]> wrote:

> Would this logic be in the bproc pls component?
> Sean
>
>
> From: users-bounces_at_[hidden] on behalf of Ralph H Castain
> Sent: Mon 7/23/2007 9:18 AM
> To: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] orterun --bynode/--byslot problem
>
> No, byslot appears to be working just fine on our bproc clusters (it is the
> default mode). As you probably know, bproc is a little strange in how we
> launch - we have to launch the procs in "waves" that correspond to the
> number of procs on a node.
>
> In other words, the first "wave" launches a proc on all nodes that have at
> least one proc on them. The second "wave" then launches another proc on all
> nodes that have at least two procs on them, but doesn't launch anything on
> any node that only has one proc on it.
>
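A minimal sketch of the wave scheme described above, using hypothetical names (illustrative only; this is not the actual bproc pls component code):

    /* Sketch of "wave" launching: wave w starts one process on every node
     * that was mapped more than w processes, so nodes with fewer procs
     * simply drop out of later waves.  Hypothetical types and names. */
    #include <stdio.h>

    struct node_map {
        const char *name;     /* node name, e.g. "head", ".0" */
        int procs_mapped;     /* procs the mapper assigned to this node */
    };

    static void launch_one(const char *node, int wave)
    {
        printf("wave %d: launch one proc on %s\n", wave, node);
    }

    int main(void)
    {
        struct node_map nodes[] = { { "head", 1 }, { ".0", 3 }, { ".1", 3 } };
        int nnodes = sizeof(nodes) / sizeof(nodes[0]);
        int more = 1;

        for (int wave = 0; more; ++wave) {
            more = 0;
            for (int i = 0; i < nnodes; ++i) {
                if (nodes[i].procs_mapped > wave) {
                    launch_one(nodes[i].name, wave);
                    if (nodes[i].procs_mapped > wave + 1)
                        more = 1;   /* this node still needs another wave */
                }
            }
        }
        return 0;
    }
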
> My guess here is that the system for some reason is insisting that your head
> node be involved in every wave. I confess that we have never tested (to my
> knowledge) a mapping that involves "skipping" a node somewhere in the
> allocation - we always just map from the beginning of the node list, with
> the maximum number of procs being placed on the first nodes in the list
> (on our machines the nodes are all the same, so who cares?). So it is
> possible that something in the code objects to skipping around nodes in the
> allocation.
>
> I will have to look and see where that dependency might lie - will try to
> get to it this week.
>
> BTW: that patch I sent you for head node operations will be in 1.2.4.
>
> Ralph
>
>
>
> On 7/23/07 7:04 AM, "Kelley, Sean" <Sean.Kelley_at_[hidden]> wrote:
>
>> > Hi,
>> >
>> > We are experiencing a problem with the process allocation on our Open MPI
>> > cluster. We are using Scyld 4.1 (BPROC), the OFED 1.2 Topspin Infiniband
>> > drivers, Open MPI 1.2.3 + patch (to run processes on the head node). The
>> > hardware consists of a head node and N blades on private ethernet and
>> > infiniband networks.
>> >
>> > The command run for these tests is a simple MPI program (called 'hn') which
>> > prints out the rank and the hostname. The hostname for the head node is 'head'
>> > and the compute nodes are '.0' ... '.9'.
>> >
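A minimal program of this kind might look like the following (the actual 'hn' source was not posted, so this is only an illustration):

    /* Print this process's MPI rank and the hostname it runs on,
     * matching the "rank hostname" output shown below. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        char host[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        gethostname(host, sizeof(host));
        printf("%d %s\n", rank, host);
        MPI_Finalize();
        return 0;
    }
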
>> > We are using the following hostfiles for this example:
>> >
>> > hostfile7
>> > -1 max_slots=1
>> > 0 max_slots=3
>> > 1 max_slots=3
>> >
>> > hostfile8
>> > -1 max_slots=2
>> > 0 max_slots=3
>> > 1 max_slots=3
>> >
>> > hostfile9
>> > -1 max_slots=3
>> > 0 max_slots=3
>> > 1 max_slots=3
>> >
>> > Running any of the following commands:
>> >
>> > orterun --hostfile hostfile7 -np 7 ./hn
>> > orterun --hostfile hostfile8 -np 8 ./hn
>> > orterun --byslot --hostfile hostfile7 -np 7 ./hn
>> > orterun --byslot --hostfile hostfile8 -np 8 ./hn
>> >
>> > causes orterun to crash. However,
>> >
>> > orterun --hostfile hostfile9 -np 9 ./hn
>> > orterun --byslot --hostfile hostfile9 -np 9 ./hn
>> >
>> > work, outputting the following:
>> >
>> > 0 head
>> > 1 head
>> > 2 head
>> > 3 .0
>> > 4 .0
>> > 5 .0
>> > 6 .0
>> > 7 .0
>> > 8 .0
>> >
>> > However, running the following:
>> >
>> > orterun --bynode --hostfile hostfile7 -np 7 ./hn
>> >
>> > works, outputting the following:
>> >
>> > 0 head
>> > 1 .0
>> > 2 .1
>> > 3 .0
>> > 4 .1
>> > 5 .0
>> > 6 .1
>> >
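A toy model of how the two placement policies would be expected to behave for hostfile7 (illustration only, not Open MPI's actual mapping code; it assumes "-1" denotes the head node, that --byslot fills each node's slots in hostfile order, and that --bynode round-robins one rank per node while skipping full nodes):

    /* Simplified model of --byslot vs. --bynode placement for
     * hostfile7: head=1 slot, .0=3 slots, .1=3 slots.  Assumes np
     * does not exceed the total number of slots. */
    #include <stdio.h>

    #define NNODES 3

    int main(void)
    {
        const char *name[NNODES]  = { "head", ".0", ".1" };
        int         slots[NNODES] = { 1, 3, 3 };
        int         used[NNODES]  = { 0, 0, 0 };
        int np = 7;

        /* --byslot: fill each node's slots before moving to the next node */
        printf("byslot:\n");
        for (int rank = 0, n = 0; rank < np; ++rank) {
            while (used[n] >= slots[n]) n = (n + 1) % NNODES;
            printf("%d %s\n", rank, name[n]);
            used[n]++;
        }

        /* --bynode: one rank per node per pass, skipping full nodes */
        printf("bynode:\n");
        for (int i = 0; i < NNODES; ++i) used[i] = 0;
        for (int rank = 0, n = 0; rank < np; ++rank, n = (n + 1) % NNODES) {
            while (used[n] >= slots[n]) n = (n + 1) % NNODES;
            printf("%d %s\n", rank, name[n]);
            used[n]++;
        }
        return 0;
    }

Under this model the --bynode placement matches the output shown above, while --byslot would be expected to put rank 0 on the head node and ranks 1-6 on .0 and .1.
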
>> > Is the '--byslot' crash a known problem? Does it have something to do with
>> > BPROC? Thanks in advance for any assistance!
>> >
>> > Sean
>> >