Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problems with MPI_Issend and MX
From: Scott Atchley (atchley_at_[hidden])
Date: 2009-07-03 10:53:29


Using MX_CSUM should _not_ make a difference by itself. But it
requires the debug library, which may alter the timing enough to avoid
a race (in MX, OMPI, or the application).

Correct: if you use the MTL, then all messages are handled by MX
(internode, shared memory and self).
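
For reference, one way to run the two checks discussed in this thread;
the -np count and binary name here are just placeholders, and the
library path is the debug location mentioned in the quoted mail below:

  $ mpirun -x MX_CSUM=1 -x LD_LIBRARY_PATH=/opt/mx/lib/debug -np 4 ./a.out
  $ mpirun -mca pml cm -np 4 ./a.out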


On Jul 3, 2009, at 7:41 AM, 8mj6tc902_at_[hidden] wrote:

> Scott,
> Thanks for your advice! Good to know about the checksum debug
> functionality! Strangely enough, running with either "MX_CSUM=1" or
> "-mca pml cm" allows Murasaki to work normally, and makes the test
> case I attached in my previous mail work. Very suspicious, but at
> least this does make a functional solution (however, if I understand
> OpenMPI correctly, I shouldn't be able to use the CM PML over a
> network where some nodes have MX and some don't, correct?).
> Scott Atchley wrote:
>> Hi Kris,
>> I have not run your code yet, but I will try to this weekend.
>> You can have MX checksum its messages if you set MX_CSUM=1 and use
>> the MX debug library (e.g., point LD_LIBRARY_PATH to /opt/mx/lib/debug).
>> Do you have the problem if you use the MX MTL? To test it, modify your
>> mpirun as follows:
>> $ mpirun -mca pml cm ...
>> and do not specify any BTL info.
>> Scott
>> On Jul 2, 2009, at 6:05 PM, 8mj6tc902_at_[hidden] wrote:
>>> Hi. I've now spent many, many hours tracking down a bug that was
>>> causing my program to die, as though either its memory were getting
>>> corrupted or messages were getting clobbered while going through the
>>> network; I couldn't tell which. I really wish the checksum flag on
>>> btl_mx_flags were working. But anyway, I think I've managed to
>>> recreate the core of the problem in a small-ish test case, which
>>> I've attached. This usually segfaults at MPI_Issend after sending
>>> about 60-90 messages for me while using OpenMPI 1.3.2 with Myricom's
>>> mx-1.2.9 drivers on Linux using gcc 4.3.2. Disabling the mx btl
>>> (mpirun -mca btl ^mx) makes it work (likewise for my own larger
>>> project, Murasaki). The MPI_Ssend-using version also works with no
>>> problem over mx. So I suspect the issue lies in OpenMPI 1.3.2's
>>> handling of MPI_Issend over mx, but it's also possible I've horribly
>>> misunderstood something fundamental about MPI and it's just my
>>> fault, so if that's the case, please let me know (but both this test
>>> case and Murasaki work over mpichmx, so OpenMPI is definitely doing
>>> something different).
>>> Here's a brief description of the test case to make reading it
>>> easier (a rough sketch of the same structure follows the list):
>>> * given -np=N, half the nodes will be sending, half will be
>>>   receiving some number of messages (reps)
>>> * each message consists of buflen (5000) chars, set to some value
>>>   based on the sending node's rank and the sequence number of the
>>>   message
>>> * the receiving node starts an irecv for each sending node, tests
>>>   each request until a message arrives
>>> * the receiver then checks the contents of the message to make sure
>>>   it matches what was supposed to be in there (this is where my real
>>>   project, Murasaki, fails actually. I can't seem to replicate that
>>>   however).
>>> * the senders meanwhile keep sending messages and dequeuing them
>>>   when their request tests as completed.
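
A minimal sketch of the structure described above (this is not the
attached test case; buflen = 5000 and the idea of reps come from the
description, while the payload formula, the sender/receiver split, the
message count, and all names are assumptions):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* buflen comes from the description above; reps is an arbitrary choice */
  enum { BUFLEN = 5000, REPS = 64 };

  /* assumed payload: every byte of message 'seq' from 'sender' holds this value */
  static char expected(int sender, int seq) { return (char)(sender * 31 + seq); }

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int nsend = size / 2;                 /* first half sends, second half receives */
      int nrecv = size - nsend;

      if (rank < nsend) {
          /* Sender: post one MPI_Issend per (receiver, sequence number), then
           * keep testing the requests and dequeue each one as it completes. */
          int total = nrecv * REPS, done = 0;
          char **buf = malloc(total * sizeof *buf);
          MPI_Request *req = malloc(total * sizeof *req);
          for (int r = 0, k = 0; r < nrecv; r++)
              for (int i = 0; i < REPS; i++, k++) {
                  buf[k] = malloc(BUFLEN);
                  memset(buf[k], expected(rank, i), BUFLEN);
                  MPI_Issend(buf[k], BUFLEN, MPI_CHAR, nsend + r, i,
                             MPI_COMM_WORLD, &req[k]);
              }
          while (done < total)
              for (int k = 0; k < total; k++) {
                  int flag = 0;
                  if (!buf[k]) continue;            /* already dequeued */
                  MPI_Test(&req[k], &flag, MPI_STATUS_IGNORE);
                  if (flag) { free(buf[k]); buf[k] = NULL; done++; }
              }
          free(buf); free(req);
      } else {
          /* Receiver: keep one MPI_Irecv outstanding per sender, test until a
           * message arrives, verify its contents, and repost until every
           * sender has delivered REPS messages. */
          char **buf = malloc(nsend * sizeof *buf);
          MPI_Request *req = malloc(nsend * sizeof *req);
          int *got = calloc(nsend, sizeof *got);
          for (int s = 0; s < nsend; s++) {
              buf[s] = malloc(BUFLEN);
              MPI_Irecv(buf[s], BUFLEN, MPI_CHAR, s, MPI_ANY_TAG,
                        MPI_COMM_WORLD, &req[s]);
          }
          for (int remaining = nsend * REPS; remaining > 0; )
              for (int s = 0; s < nsend; s++) {
                  int flag = 0;
                  MPI_Status st;
                  if (got[s] == REPS) continue;     /* nothing more expected */
                  MPI_Test(&req[s], &flag, &st);
                  if (!flag) continue;
                  for (int b = 0; b < BUFLEN; b++)  /* verify every byte */
                      if (buf[s][b] != expected(s, st.MPI_TAG)) {
                          fprintf(stderr, "rank %d: bad byte %d from %d seq %d\n",
                                  rank, b, s, st.MPI_TAG);
                          break;
                      }
                  got[s]++; remaining--;
                  if (got[s] < REPS)                /* repost for the next message */
                      MPI_Irecv(buf[s], BUFLEN, MPI_CHAR, s, MPI_ANY_TAG,
                                MPI_COMM_WORLD, &req[s]);
              }
          for (int s = 0; s < nsend; s++) free(buf[s]);
          free(buf); free(req); free(got);
      }

      MPI_Finalize();
      return 0;
  }

Built with mpicc, a sketch like this can then be run with and without
"-mca btl ^mx", or with "-mca pml cm", as discussed elsewhere in the
thread, to compare behaviour.
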
>>> Testing out the current subversion trunk version, 1.4a1r21594: that
>>> seems to pass my test case, but it also tends to show errors like
>>> "mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)"
>>> on startup, and Murasaki still fails (messages turn into zeros about
>>> 132KB in), so something still isn't right...
>>> If anyone has any ideas about why this test case fails, or about my
>>> larger issue of messages turning into zeros after 132KB while on MX
>>> (though sadly sometimes it isn't at 132KB, but straight from 0KB,
>>> which is very confusing), I'd greatly appreciate it. Even a simple
>>> confirmation of "Yes, MPI_Issend/Irecv with MX has issues in 1.3.2"
>>> would help my sanity.
>>> --
>>> Kris Popendorf
>>> Keio University
>>> http://murasaki................... <- (Probably too cumbersome to
>>> expect most people to test, but if you feel daring, try putting in
>>> some Human/Mouse chromosomes over MX)
> --
> --Kris
> 叶ってしまう夢は本当の夢と言えん。
> [A dream that comes true can't really be called a dream.]