Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Still troubles with 1.3 and MX
From: Scott Atchley (atchley_at_[hidden])
Date: 2009-01-22 12:15:45

On Jan 22, 2009, at 9:18 AM, Bogdan Costescu wrote:

> I'm still having some troubles using the newly released 1.3 with
> Myricom's MX. I've meant to send a message earlier, but the release
> candidates went so fast that I didn't have time to catch up and test.
> General details:
> Nodes with dual CPU, dual core Opteron 2220, 8 GB RAM
> Debian etch x86_64, self-compiled kernel, gcc-4.1
> Torque 2.1.10 (but this shouldn't make a difference)
> MX 1.2.7 with a tiny patch from Myricom
> OpenMPI 1.3
> IMB 3.1
> OpenMPI was configured with '--enable-shared --enable-static
> --with-mx=... --with-tm=...'
> In all cases, there were no options specified at runtime (either in
> files or on the command line) except for the PML and BTL selection.
> Problem 1:
> I still see hangs of collective functions when running on large
> number of nodes (or maybe ranks) with the default OB1+BTL. F.e. with
> 128 ranks distributed as nodes=32:ppn=4 or nodes=64:ppn=2, the IMB
> hangs in Gather.

Bogdan, this sounds similar to the issue you experienced in December,
which I thought had been fixed. I do not remember whether it was tied
to the default collective or to free list management.

Can you try a run with:

   -mca btl_mx_free_list_max 1000000

added to the command line?

After that, try additional runs without the above but with:

   --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm N

where N is 0, 1, 2, then 3 (one run for each value).
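
The sequence of runs suggested above can be sketched as a small dry-run script that only prints the command lines. The rank count, the PML/BTL selection string, and the `./IMB-MPI1 Gather` invocation are assumptions for illustration; substitute whatever you actually launch.

```shell
#!/bin/sh
# Dry run: print the suggested command lines instead of executing them.
# NP, SELECT and BENCH are assumptions, not taken from the original report.
NP=128
SELECT="--mca pml ob1 --mca btl mx,sm,self"
BENCH="./IMB-MPI1 Gather"

problem1_cmds() {
    # Run 1: raise the MX BTL free list limit.
    echo "mpirun -np $NP $SELECT -mca btl_mx_free_list_max 1000000 $BENCH"
    # Runs 2-5: drop that option and force each gather algorithm in turn.
    for n in 0 1 2 3; do
        echo "mpirun -np $NP $SELECT --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm $n $BENCH"
    done
}

problem1_cmds
```

Printing the commands first makes it easy to check each line before submitting the jobs through the batch system.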

> Problem 2:
> When using the CM+MTL with 128 ranks, it finishes fine when running
> on nodes=64:ppn=2, but on nodes=32:ppn=4 I get a stream of errors
> that I haven't seen before:
> Max retransmit retries reached (1000) for message
> Max retransmit retries reached (1000) for message
> type (2): send_medium
> state (0x14): buffered dead
> requeued: 1000 (timeout=510000ms)
> dest: 00:60:dd:47:89:40 (opt029:0)
> partner: peer_index=146, endpoint=3, seqnum=0x2944
> type (2): send_medium
> state (0x14): buffered dead
> requeued: 1000 (timeout=510000ms)
> dest: 00:60:dd:47:89:40 (opt029:0)
> partner: peer_index=146, endpoint=3, seqnum=0x2f9a
> matched_val: 0x0068002a_fffffff2
> slength=32768, xfer_length=32768
> matched_val: 0x0068002b_fffffff2
> slength=32768, xfer_length=32768
> seg: 0x2aaacc30f010,32768
> caller: 0x5b

These are two interleaved messages from the MX library. MX is unable
to send to opt029 (i.e., opt029 is not consuming messages).

> From the MX experts out there, I would also need some help to
> understand what is the source of these messages - I can only see
> opt029 mentioned,

Does anyone know whether 1.3 supports rank labeling of stdout? If so,
Bogdan should rerun with --display-map and the option that enables
labeling.

> so does it try to communicate intra-node ? (IOW the equivalent of
> "self" BTL in OpenMPI) This would be somehow consistent with running
> more ranks per node (4) than the successful job (with 2 ranks per
> node).

I am under the impression that the MTLs pass all messages to the
interconnect. If so, then MX is handling self, shared memory (shmem),
and host-to-host. Self, by the way, is a single rank (process)
communicating with itself. In your case, you are using shmem.
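
To make the three cases concrete, here is a minimal sketch that classifies a pair of ranks by where they run. It only illustrates the terminology; it is not how MX or Open MPI chooses a path internally. The rank numbers and the host name opt030 are hypothetical.

```shell
#!/bin/sh
# path_kind RANK_A HOST_A RANK_B HOST_B
# Illustration only: names the kind of traffic between two ranks.
path_kind() {
    if [ "$1" = "$3" ]; then
        echo "self"            # a rank talking to itself
    elif [ "$2" = "$4" ]; then
        echo "shmem"           # two different ranks on the same node
    else
        echo "host-to-host"    # ranks on different nodes
    fi
}

path_kind 5 opt029 5 opt029
path_kind 4 opt029 5 opt029
path_kind 4 opt029 9 opt030
```

With 4 ranks per node, each rank has three same-node peers instead of one, so more of the traffic falls into the shmem case than in the 2-ranks-per-node job.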

> At this point, the job hangs in Alltoallv. The strace output is the
> same as for OB1+BTL above.
> Can anyone suggest some ways forward ? I'd be happy to help in
> debugging if given some instructions.

I would suggest the same test as above with:

   -mca btl_mx_free_list_max 1000000

Additionally, try the following tuned collectives for alltoallv:

   --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm N

where N is 0, 1, then 2 (one run for each value).
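
As a sketch, the alltoallv runs could be generated with one log file per algorithm value so that the results are easy to compare afterwards. The rank count, the CM/MTL selection string, the `./IMB-MPI1 Alltoallv` invocation, and the log file names are all assumptions for illustration.

```shell
#!/bin/sh
# Dry run: print one command line per alltoallv algorithm value.
# NP, SELECT, BENCH and the log file names are assumptions.
NP=128
SELECT="--mca pml cm --mca mtl mx"
BENCH="./IMB-MPI1 Alltoallv"

alltoallv_cmds() {
    for n in 0 1 2; do
        echo "mpirun -np $NP $SELECT --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm $n $BENCH > alltoallv_alg_$n.log 2>&1"
    done
}

alltoallv_cmds
```

Keeping stderr in the same log preserves any MX retransmit messages next to the benchmark output for that run.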