Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SM backing file size
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2008-11-15 12:32:44


Ralph Castain wrote:

> I probably wasn't clear - see below
>
> On Nov 14, 2008, at 6:31 PM, Eugene Loh wrote:
>
>> Ralph Castain wrote:
>>
>>> I have two examples so far:
>>>
>>> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>>> node, 2ppn, with btl=openib,sm,self. The program started, but
>>> segfaulted on the first MPI_Send. No warnings were printed.
>>
>> Interesting. So far as I can tell, the actual memory consumption
>> (total number of allocations in the mmapped segment) for 2 local
>> processes should be a little more than half a Mbyte. The bulk of
>> that would be from fragments (chunks). There are
>> btl_sm_free_list_num=8 per process, each of
>> btl_sm_max_frag_size=32K. So, that's 8x2x32K=512Kbyte. Actually, a
>> little bit more. Anyhow, that accounts for most of the allocations,
>> I think. Maybe if you're sending a lot of data, more gets allocated
>> at MPI_Send time. I don't know.
>>
>> While only < 1 Mbyte is needed, however, mpool_sm_min_size=128M will
>> be mapped.
>
> Right - so then it sounds to me like this would fail (which it did)
> since /tmp was fixed to 10M - and the mpool would be much too large
> given a minimum size of 128M. Right?

That makes sense to me.

My analysis of how little of the mapped segment will actually be used is
probably irrelevant.
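
For concreteness, the back-of-the-envelope arithmetic I'm referring to
is roughly the following (a standalone snippet, not OMPI code; the real
allocator also adds headers and alignment, so treat it as a lower bound):

  /* Rough restatement of the free-list accounting quoted above.
   * Header/alignment overhead is ignored. */
  #include <stdio.h>

  int main(void)
  {
      unsigned long nprocs         = 2;                   /* local processes      */
      unsigned long frags_per_proc = 8;                   /* btl_sm_free_list_num */
      unsigned long frag_size      = 32UL * 1024;         /* btl_sm_max_frag_size */
      unsigned long min_mapped     = 128UL * 1024 * 1024; /* mpool_sm_min_size    */

      unsigned long used = nprocs * frags_per_proc * frag_size; /* 8x2x32K = 512K */

      printf("roughly allocated: %lu KB; mapped regardless: %lu MB\n",
             used / 1024, min_mapped / (1024 * 1024));
      return 0;
  }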

Here is what I think should happen:

*) The lowest ranking process on the node opens and ftruncates the
file. Since there isn't enough space, the ftruncate fails. This is in
mca_common_sm_mmap_init() in ompi/mca/common/sm/common_sm_mmap.c.

*) The value sm_inited==0 is broadcast from this process to all other
local processes.

*) Nobody tries to mmap the file.

*) On each local process, mca_common_sm_mmap_init() returns a NULL map
to mca_mpool_sm_init(). This, incidentally, is the function where the
size of the backing file is determined, bounded by those max/min parameters.

*) In turn, mca_mpool_sm_init() returns a NULL value.

*) Therefore, sm_btl_first_time_init() returns OMPI_ERROR.

*) Therefore, mca_btl_sm_add_procs() goes into "CLEANUP" and returns
OMPI_ERROR.

*) Therefore, mca_bml_r2_add_procs() gives up on this BTL and tries to
establish connections otherwise.
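
To make that sequence concrete, here is a toy paraphrase of the first
couple of steps -- not the actual common_sm_mmap.c code, just an
illustration of the control flow described above (the function name and
backing-file path are mine):

  /* Toy paraphrase of the expected flow: the lowest-ranking local
   * process sizes the backing file; if that fails, nobody mmaps and
   * the caller gets NULL back.  Simplified -- not the OMPI source. */
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static void *toy_sm_mmap_init(const char *path, size_t size,
                                int is_lowest_rank, int *sm_inited)
  {
      int fd = open(path, O_CREAT | O_RDWR, 0600);
      if (fd < 0)
          return NULL;

      if (is_lowest_rank) {
          /* Size the file; on failure, this 0/1 flag is what would get
           * broadcast to the other local processes. */
          *sm_inited = (ftruncate(fd, (off_t)size) == 0);
          if (!*sm_inited)
              fprintf(stderr, "ftruncate(%zu) failed: %s\n",
                      size, strerror(errno));
      }

      if (!*sm_inited) {        /* nobody mmaps; caller sees NULL */
          close(fd);
          return NULL;
      }

      void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      close(fd);
      return (seg == MAP_FAILED) ? NULL : seg;
  }

  int main(void)
  {
      int sm_inited = 0;
      /* 128M: the mpool_sm_min_size lower bound discussed above */
      size_t size = 128UL * 1024 * 1024;
      void *seg = toy_sm_mmap_init("/tmp/toy_sm_backing", size, 1, &sm_inited);
      printf("backing segment %s\n",
             seg ? "mapped" : "NOT mapped -- sm BTL would bow out");
      return 0;
  }

Whether the sizing step actually fails on a too-small /tmp, or whether
the failure only shows up later when the mapped pages are first touched,
presumably depends on the filesystem -- which is one reason the tmpfs
question below is interesting.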

I'm a little unclear what should happen next. But, to reiterate, all
local processes should fail and indicate to the BML that the sm BTL
wasn't going to work for them.

>> It doesn't make sense that this case would fail, but the next case
>> should run. Are you sure this is related to the SM backing file?
>
Sorry, let me take that back. It does make some sense that the first
case would fail. The possible exception is if the connections fail over
to another BTL (openib, I presume).

What's weird is that the second case runs.

>>> 2. again with a ramdisk, /tmp was reportedly set to 16MB
>>> (unverified - some uncertainty, could have been much larger).
>>> OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
>>> The program ran to completion without errors or warning. I don't
>>> know the communication pattern - could be no local comm was
>>> performed, though that sounds doubtful.
>>
> This case -did- run successfully. However, what puzzled me is that it
> seems like it shouldn't have run because the 128M minimum was still
> much larger than the available 16M.

Right. Weird.

> One point that was made on an earlier thread - I don't know if either
> of these cases had a tmpfs file system. I will try to find out. My
> guess is "no" based on what I have been told so far - i.e., in both
> cases, I was told that /tmp's size was "fixed", but that might not be
> technically accurate.
>
> As to whether we are sure about this being an SM backing file issue:
> no, we can't say with absolute certainty. However, I can offer two
> points of validation:
>
> 1. the test that failed (#1) ran perfectly when we set btl=^sm
>
> 2. the test that failed (#1) ran perfectly again after we increased
> /tmp to 512M
>
> The test that did not fail (#2) has never failed for sm reasons as
> far as we know. We have had IB problems on occasion, but we believe
> that is unrelated to this issue.
>
> My point here was simply that I have two cases, one that failed and
> one that didn't, that seem to me to be very similar. I don't
> understand the difference in behavior, and am concerned that users
> will be surprised - and spend a lot of energy trying to figure out
> what happened. The possibility Tim M raised about the tmpfs may
> explain the difference (if #2 used tmpfs and #1 didn't), and I will
> check that ASAP.

I share your surprise.

Incidentally, does the MPI program test the return value from MPI_Init?
Another thing I've wondered about is whether OMPI fails in MPI_Init() and
correctly indicates this to the user, but the user doesn't check the
MPI_Init() return value.

User: You were broken!
OMPI: Yes, I know! I TOLD you I was broken, but you didn't listen.
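
For what it's worth, all it takes on the user side is something like
this (generic MPI, nothing OMPI-specific about it):

  /* Minimal sketch of checking MPI_Init's return value rather than
   * assuming it succeeded. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rc = MPI_Init(&argc, &argv);
      if (rc != MPI_SUCCESS) {
          fprintf(stderr, "MPI_Init failed (error code %d); giving up\n", rc);
          return EXIT_FAILURE;  /* don't march on to MPI_Send and segfault */
      }

      /* ... the application's communication would go here ... */

      MPI_Finalize();
      return EXIT_SUCCESS;
  }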