Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
From: Richard Treumann (treumann_at_[hidden])
Date: 2010-08-23 19:39:52


It is hard to imagine how a total data load of 41,943,040 bytes could be a
problem. That is really not much data. By the time the BCAST is done, each
task (except root) will have received a single half-megabyte message from
one sender.
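
To put numbers on that, here is a quick arithmetic sketch. It assumes the
"80" in the quoted 524288 * 80 is the number of tasks, which the thread does
not state outright; the function name is mine, not anything from IMB.

```python
MSG_BYTES = 524288   # message size from the thread (512 KiB, i.e. half a MiB)
NUM_TASKS = 80       # assumption: the "80" in the thread is the task count


def bcast_volume(msg_bytes, num_tasks):
    """Return (bytes if every task holds one buffer, bytes received by non-root tasks)."""
    buffered = msg_bytes * num_tasks          # one buffer per task, root included
    received = msg_bytes * (num_tasks - 1)    # root sends, it does not receive
    return buffered, received


buffered, received = bcast_volume(MSG_BYTES, NUM_TASKS)
print(buffered)            # 41943040
print(received)            # 41418752
print(buffered / 2**20)    # 40.0 (MiB across the whole job)
```

Spread over the whole job, that is about 40 MiB in total, or half a MiB per
task.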

IMB does shift the root, so some tasks may be in iteration 9 while others
are still in iteration 8 or 7. But a half-megabyte message should use the
rendezvous protocol, so no message will be injected into the network until
the destination task is ready to receive it.
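
As a rough illustration of the eager/rendezvous distinction, here is a toy
model. The 64 KiB threshold is a made-up value for the sketch; real eager
limits depend on the MPI implementation and the transport. Real rendezvous
implementations also exchange small control messages, and some may send a
first fragment along with the request, which this model ignores.

```python
EAGER_LIMIT = 64 * 1024  # hypothetical threshold, not a real default


def choose_protocol(msg_bytes, eager_limit=EAGER_LIMIT):
    """Small messages go eagerly; large ones wait for the receiver."""
    return "eager" if msg_bytes <= eager_limit else "rendezvous"


def payload_injected(msg_bytes, receiver_ready, eager_limit=EAGER_LIMIT):
    """Payload bytes put on the network in this simplified model."""
    if choose_protocol(msg_bytes, eager_limit) == "eager":
        return msg_bytes                      # sent whether or not a receive is posted
    return msg_bytes if receiver_ready else 0  # rendezvous: hold until receiver is ready


print(choose_protocol(524288))            # rendezvous
print(payload_injected(524288, False))    # 0
print(payload_injected(524288, True))     # 524288
```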

Any task can be in only one MPI_Bcast at a time so the total active data
cannot ever exceed the 41,943,040 bytes, no matter how fast the MPI_Bcast
loop tries to iterate.

(There are MPI_Bcast algorithms that chunk the data into smaller messages
but even with those algorithms, the total concurrent load will not exceed
41,943,040 bytes.)
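
The bound can be checked with a small simulation. This is only a toy model
under my own assumptions, not how any real MPI_Bcast is implemented: each
task is in at most one broadcast at a time, receives it in fixed-size chunks,
and releases its buffer when the broadcast completes. Tasks advance in random
order, so they drift across iterations much as the shifting root allows. For
simplicity the root is treated like any other task.

```python
import random


def peak_active_bytes(num_tasks, msg_bytes, chunk_bytes, iterations, seed=0):
    """Peak of the summed per-task bytes held for the broadcast in progress."""
    rng = random.Random(seed)
    held = [0] * num_tasks   # bytes of the current broadcast held by each task
    done = [0] * num_tasks   # broadcasts completed by each task
    peak = 0
    while min(done) < iterations:
        t = rng.randrange(num_tasks)
        if done[t] == iterations:
            continue                       # this task has finished its loop
        held[t] += min(chunk_bytes, msg_bytes - held[t])  # receive one chunk
        peak = max(peak, sum(held))
        if held[t] == msg_bytes:           # broadcast complete on this task
            held[t] = 0
            done[t] += 1
    return peak


bound = 80 * 524288
peak = peak_active_bytes(num_tasks=80, msg_bytes=524288,
                         chunk_bytes=65536, iterations=5)
print(peak <= bound)   # True
```

However many iterations the loop runs and however small the chunks are, no
task ever holds more than one message's worth, so the sum stays at or below
tasks times message size.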

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

users-bounces_at_[hidden] wrote on 08/23/2010 05:09:56 PM:

> Re: [OMPI users] IMB-MPI broadcast test stalls for large core
> counts: debug ideas?
>
> From: Rahul Nabar
> To: Open MPI Users
> Date: 08/23/2010 05:11 PM
> Sent by: users-bounces_at_[hidden]
> Please respond to Open MPI Users
>

> On Sun, Aug 22, 2010 at 9:57 PM, Randolph Pullen
> <randolph_pullen_at_[hidden]> wrote:
>
> > It's a long shot, but could it be related to the total data volume?
> > i.e. 524288 * 80 = 41943040 bytes active in the cluster
> >
> > Can you exceed this 41943040 data volume with a smaller message
> > repeated more often or a larger one less often?
>
>
> Not so far, so your diagnosis could be right. The failures have been
> at the following data volumes:
>
> 41.9E6
> 4.1E6
> 8.2E6
>
> Unfortunately, I'm not sure I can change the repeat rate with the
> OFED/MPI tests. Can I do that? Didn't see a suitable flag.
>
> In any case, assuming it is related to the total data volume, what
> could be causing such a failure?
>
> --
> Rahul