Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-15 18:47:54


Okay, George - this is fixed in r21690.

Thanks again
Ralph

On Jul 15, 2009, at 2:40 PM, Ralph Castain wrote:

> Ah - interesting scenario!
>
> Definitely a "bug" in the code, then. What it looks like, though, is
> that the jdata->num_procs is wrong. There shouldn't be any way that
> the num_procs in the node array is different than jdata->num_procs.
>
> My guess is that the rank_file mapper isn't correctly maintaining
> the bookkeeping when we map the procs beyond those in the rankfile.
> I'll dig into it - have to fix something for Lenny anyway.
>
> Meantime, this change looks fine regardless as it (a) is better code
> and (b) protects us against such errors.
>
> Thanks for catching it!
> Ralph
>
>
> On Wed, Jul 15, 2009 at 2:30 PM, George Bosilca
> <bosilca_at_[hidden]> wrote:
> I think I found a better solution (in r21688). Here is what I was
> trying to do.
>
> I have a more or less homogeneous cluster. In fact all processors
> are identical, except that some are quad core and some dual core. Of
> course I care how my processes are mapped on the quad cores, but not
> really on the dual cores.
>
> My approach was to use the following configuration files.
>
> In /home/bosilca/.openmpi/mca-params.conf I have:
>
> orte_default_hostfile=/home/bosilca/.openmpi/machinefile
> rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile
> rmaps_rank_file_priority = 100
>
> In /home/bosilca/.openmpi/machinefile I have the full description of
> the cluster. As an example:
> node01 slots=4
> node02 slots=4
> node03 slots=2
> node04 slots=2
>
> And in the /home/bosilca/.openmpi/rankfile file I have:
> rank 0=+n0 slot=0
> rank 1=+n0 slot=1
> rank 2=+n1 slot=0
> rank 3=+n1 slot=1
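Editorial illustration: the configuration above pins only ranks 0 through 3, so any job larger than 4 processes has ranks the rankfile says nothing about. The sketch below counts those unpinned ranks. It is a minimal illustration, not Open MPI's rankfile parser: `parse_rank_line` and `count_unpinned` are hypothetical names, only the exact line shape used in this thread is handled, and `+nN` is assumed to be a relative index into the list of allocated hosts.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one rankfile line of the form "rank R=+nN slot=S".
 * Hypothetical helper: only this line shape is recognized.
 * Returns 1 on success, 0 otherwise. */
int parse_rank_line(const char *line, int *rank, int *node_idx, int *slot)
{
    return sscanf(line, "rank %d=+n%d slot=%d", rank, node_idx, slot) == 3;
}

/* Count the ranks in [0, np) that no rankfile line pins to a valid host.
 * num_hosts is the number of entries in the hostfile (assumption: +nN is
 * an index into that list). */
int count_unpinned(const char *const *lines, int num_lines, int np,
                   int num_hosts)
{
    int unpinned = 0;
    assert(np >= 0);
    for (int r = 0; r < np; r++) {
        int pinned = 0;
        for (int i = 0; i < num_lines && !pinned; i++) {
            int rank, node, slot;
            if (parse_rank_line(lines[i], &rank, &node, &slot) &&
                rank == r && node >= 0 && node < num_hosts) {
                pinned = 1;
            }
        }
        if (!pinned) {
            unpinned++;
        }
    }
    return unpinned;
}
```

With the four rankfile lines and four hosts from this message, a 6-process job leaves two ranks for some other mapping policy to place, which is the "partial mapping" case under discussion.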
>
> As long as I spawned jobs with fewer than 4 processes, everything
> worked fine. But when I used more than 4 processes, orterun
> segfaulted. After debugging I found that the nodes, lrank and nrank
> arrays were allocated based on jdata->num_procs, but then filled
> based on the total number of processes in the jdata->nodes array.
> Since jdata->num_procs appears to be modified based on the number of
> entries in the rankfile, we end up writing outside the allocation
> and then segfault. With the latest patch, we can cope with such a
> scenario by packing only the known information (and thus not
> writing outside the allocated arrays).
>
> This might not be the best approach, but it is doing what I'm
> looking for ...
>
> george.
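Editorial illustration: the overrun described in the message above comes from sizing an array by one count (the job's num_procs) and filling it by another (the procs actually listed on the nodes). The sketch below shows the general shape of a guard that packs only what fits. It is not the code from r21688; `node_t`, `pack_node_ids`, and `node_of_rank` are hypothetical, simplified stand-ins for the real job and node structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a node's bookkeeping. */
typedef struct {
    int num_procs;     /* number of procs mapped onto this node */
    const int *ranks;  /* ranks of those procs */
} node_t;

/*
 * Fill node_of_rank[] from the per-node proc lists.
 * capacity is the number of entries that were allocated (the job's idea
 * of num_procs). Any rank outside [0, capacity) is skipped rather than
 * written, so a mismatch between the two counts cannot overrun the array.
 * Returns the number of skipped entries so the caller can detect the
 * inconsistency.
 */
int pack_node_ids(const node_t *nodes, int num_nodes,
                  int *node_of_rank, int capacity)
{
    int skipped = 0;
    assert(capacity >= 0);
    for (int n = 0; n < num_nodes; n++) {
        for (int p = 0; p < nodes[n].num_procs; p++) {
            int rank = nodes[n].ranks[p];
            if (rank < 0 || rank >= capacity) {
                skipped++;  /* would have been an out-of-bounds write */
                continue;
            }
            node_of_rank[rank] = n;
        }
    }
    return skipped;
}
```

A nonzero return value signals that the bookkeeping is inconsistent: the guard prevents the crash, but the underlying count still needs to be corrected, as the rest of the thread discusses.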
>
>
> On Jul 15, 2009, at 15:50 , Ralph Castain wrote:
>
> The routed comm system relies on each daemon having complete
> information as to where every process is located, so the expectation
> was that only full maps would ever be sent. Thus, the nidmap code is
> set up to always send a full map.
>
> I don't know how to even generate a "partial" map. I assume you are
> doing something offline? Is this to update changed info? If so,
> you'll also have to do something to update the daemon's maps or the
> comm system will break down.
>
> Ralph
>
> On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca
> <bosilca_at_[hidden]> wrote:
> I have a question regarding the mapping. How can I declare a partial
> mapping? In fact I only care about how some of the processes are
> mapped on some specific nodes. Right now if the rmaps doesn't
> contain information about all nodes, we give up (before this patch
> we segfaulted).
>
> Does it mean we always have to declare the whole mapping, or is it
> just that we overlooked this strange case?
>
> george.
>
> Begin forwarded message:
>
>
> Author: bosilca
> Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009)
> New Revision: 21686
> URL: https://svn.open-mpi.org/trac/ompi/changeset/21686
>
> Log:
> Reorder the nidmap encoding function. Add a check to make sure we
> don't write outside the boundaries of the allocated array.
>
> However, the problem is still there. If we have an rmaps file
> containing only partial information, num_procs gets set to the wrong
> value (the number of hosts in the rmaps file instead of the number of
> processes requested on the command line).
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel