Using MX_CSUM should _not_ make a difference by itself. But it
requires the debug library which may alter the timing enough to avoid
a race (in MX, OMPI, or the application).
Correct: if you use the MTL, then all messages are handled by MX
(internode, shared memory, and self).
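To summarize the knobs mentioned in this thread in one place (the
application name and process count are placeholders):

```shell
# Use the MX MTL via the CM PML; do not specify any BTL options.
# All traffic (internode, shared memory, and self) then goes through MX.
mpirun -mca pml cm -np 4 ./your_app

# The workaround that made the test case pass: disable the MX BTL entirely.
mpirun -mca btl ^mx -np 4 ./your_app

# Have MX checksum its messages: MX_CSUM=1 plus the MX debug library.
# (-x exports the variables to the launched processes.)
mpirun -x MX_CSUM=1 -x LD_LIBRARY_PATH=/opt/mx/lib/debug -np 4 ./your_app
```

Note that with "-mca pml cm" the CM PML must be usable on every node, so
it is not an option on a cluster where only some nodes have MX.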
On Jul 3, 2009, at 7:41 AM, 8mj6tc902_at_[hidden] wrote:
> Thanks for your advice! Good to know about the checksum debug
> functionality! Strangely enough running with either "MX_CSUM=1" or "-
> pml cm" allows Murasaki to work normally, and makes the test case I
> attached in my previous mail work. Very suspicious, but at least this
> does make a functional solution (however, if I understand OpenMPI
> correctly, I shouldn't be able to use the CM PML over a network where
> some nodes have MX and some don't, correct?).
> Scott Atchley <atchley-at-myri.com> wrote:
>> Hi Kris,
>> I have not run your code yet, but I will try to this weekend.
>> You can have MX checksum its messages if you set MX_CSUM=1 and use
>> the MX debug library (e.g. set LD_LIBRARY_PATH to /opt/mx/lib/debug).
>> Do you have the problem if you use the MX MTL? To test it, modify
>> your mpirun as follows:
>> $ mpirun -mca pml cm ...
>> and do not specify any BTL info.
>> On Jul 2, 2009, at 6:05 PM, 8mj6tc902_at_[hidden] wrote:
>>> Hi. I've now spent many, many hours tracking down a bug that was
>>> causing my program to die, as though either its memory were getting
>>> corrupted or messages were getting clobbered while going through the
>>> network; I couldn't tell which. I really wish the checksum flag on
>>> btl_mx_flags were working. But anyway, I think I've managed to
>>> recreate the core of the problem in a small-ish test case which I've
>>> attached (verifycontent.cc). This usually segfaults at MPI_Issend
>>> after about 60-90 messages for me while using OpenMPI 1.3.2 with
>>> Myricom's mx-1.2.9 drivers on Linux using gcc 4.3.2. Disabling the
>>> mx btl (-mca btl ^mx) makes it work (likewise for my own larger
>>> project, Murasaki). The MPI_Ssend-using version
>>> (verifycontent-ssend.cc) also works no problem over MX. So I suspect
>>> the issue lies in OpenMPI 1.3.2's handling of MPI_Issend over MX,
>>> but it's also possible I've horribly misunderstood something
>>> fundamental about MPI and it's just my fault; if that's the case,
>>> please let me know (but both this test case and Murasaki work over
>>> mpichmx, so OpenMPI is definitely doing something different).
>>> Here's a brief description of verifycontent.cc to make reading it
>>> easier:
>>> * given -np=N, half the nodes will be sending, half will be
>>> receiving some number of messages (reps)
>>> * each message consists of buflen (5000) chars, set to some value
>>> based on the sending node's rank and the sequence number of the
>>> message
>>> * the receiving node starts an irecv for each sending node, then
>>> tests each request until a message arrives
>>> * the receiver then checks the contents of the message to make sure
>>> it matches what was supposed to be in there (this is where my real
>>> program, Murasaki, fails actually; I can't seem to replicate that,
>>> however)
>>> * the senders meanwhile keep sending messages and dequeuing them as
>>> their request tests complete
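For readers without the attachment, the pattern described above can be
sketched roughly like this. This is not the attached verifycontent.cc:
the buflen/reps values, the rank pairing, and the fill formula are
assumptions, and it simplifies to one outstanding send per sender:

```cpp
// Sketch of the Issend/Irecv + Test pattern from the description above.
#include <mpi.h>
#include <cstring>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int buflen = 5000;      // message size in chars (per the description)
    const int reps = 100;         // messages per sender (assumed)
    const int senders = size / 2; // first half send, second half receive

    if (rank < senders) {
        int dest = senders + rank;           // paired receiver (assumed pairing)
        std::vector<char> buf(buflen);
        for (int seq = 0; seq < reps; ++seq) {
            // fill the buffer with a value derived from rank and sequence number
            std::memset(buf.data(), (rank + seq) % 127, buflen);
            MPI_Request req;
            MPI_Issend(buf.data(), buflen, MPI_CHAR, dest, seq,
                       MPI_COMM_WORLD, &req);
            int done = 0;
            while (!done)                    // dequeue when the test completes
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    } else if (rank < 2 * senders) {
        int src = rank - senders;            // paired sender
        std::vector<char> buf(buflen);
        for (int seq = 0; seq < reps; ++seq) {
            MPI_Request req;
            MPI_Irecv(buf.data(), buflen, MPI_CHAR, src, seq,
                      MPI_COMM_WORLD, &req);
            int done = 0;
            while (!done)                    // test the request until it arrives
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
            // verify the contents match what the sender was supposed to put there
            const char expected = (char)((src + seq) % 127);
            for (int i = 0; i < buflen; ++i)
                if (buf[i] != expected)
                    MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Since each sender waits for its Issend to complete before reusing the
buffer, the sketch avoids the buffer-reuse hazard; the real test case
reportedly still segfaults inside MPI_Issend over the MX BTL.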
>>> Testing out the current subversion trunk version, 1.4a1r21594, it
>>> seems to pass my test case, but also tends to show errors like
>>> "mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)"
>>> on start up, and Murasaki still fails (messages turn into zeros
>>> about 132KB in), so something still isn't right...
>>> If anyone has any ideas about this test case failing, or my larger
>>> problem of messages turning into zeros after 132KB (though sadly
>>> sometimes it isn't at 132KB, but straight from 0KB, which is very
>>> confusing) while on MX, I'd greatly appreciate it. Even a simple
>>> confirmation of "Yes, MPI_Issend/Irecv with MX has issues in 1.3.2"
>>> would help my sanity.
>>> Kris Popendorf
>>> Keio University
>>> http://murasaki................... <- (Probably too cumbersome for
>>> most people to test, but if you feel daring, try putting in some
>>> Human/Mouse chromosomes over MX)
>>> users mailing list
> [A dream that comes true can't really be called a dream.]