Thanks for your advice! Good to know about the checksum debug
functionality! Strangely enough, running with either "MX_CSUM=1" or
"-mca pml cm" allows Murasaki to work normally, and also makes the test
case I attached in my previous mail work. Very suspicious, but at least
this gives me a functional workaround (however, if I understand OpenMPI
correctly, I shouldn't be able to use the CM PML over a network where
some nodes have MX and some don't, correct?).
Scott Atchley atchley-at-myri.com wrote:
> Hi Kris,
> I have not run your code yet, but I will try to this weekend.
> You can have MX checksum its messages if you set MX_CSUM=1 and use the
> MX debug library (e.g. set LD_LIBRARY_PATH to /opt/mx/lib/debug).
> Do you have the problem if you use the MX MTL? To test it, modify your
> mpirun as follows:
> $ mpirun -mca pml cm ...
> and do not specify any BTL info.
> On Jul 2, 2009, at 6:05 PM, 8mj6tc902_at_[hidden] wrote:
>> Hi. I've now spent many, many hours tracking down a bug that was causing
>> my program to die, as though either its memory were getting corrupted or
>> messages were getting clobbered while going through the network; I
>> couldn't tell which. I really wish the checksum flag on btl_mx_flags
>> were working. But anyway, I think I've managed to recreate the core of
>> the problem in a small-ish test case which I've attached
>> (verifycontent.cc). This usually segfaults at MPI_Issend after sending
>> about 60-90 messages for me while using OpenMPI 1.3.2 with Myricom's
>> mx-1.2.9 drivers on Linux using gcc 4.3.2. Disabling the mx btl (mpirun
>> -mca btl ^mx) makes it work (and the same goes for my own larger
>> project, Murasaki). The version using MPI_Ssend
>> (verifycontent-ssend.cc) also works without problems over mx. So I suspect the
>> issue lies in OpenMPI 1.3.2's handling of MPI_Issend over mx, but it's
>> also possible I've horribly misunderstood something fundamental about
>> MPI and it's just my fault, so if that's the case, please let me know
>> (but both this test case and Murasaki work over mpichmx, so OpenMPI
>> is definitely doing something different).
>> Here's a brief description of verifycontent.cc to make reading it easier:
>> * given -np N, half the nodes will be sending and half will be receiving
>> some number of messages (reps)
>> * each message consists of buflen (5000) chars, set to some value based
>> on the sending node's rank and the sequence number of the message
>> * the receiving node starts an irecv for each sending node and tests
>> each request until a message arrives
>> * the receiver then checks the contents of the message to make sure it
>> matches what was supposed to be in there (this is actually where my
>> real project, Murasaki, fails; I can't seem to replicate that in the
>> test case, however).
>> * the senders meanwhile keep sending messages and dequeuing them when
>> their request tests as completed.
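For anyone reading along without the attachment, here is a rough sketch
of the fill-and-verify idea from the list above. It is not the code in
verifycontent.cc: the names expected_byte, make_message and
first_mismatch are made up for illustration, and mixing the byte offset
into the value is my own assumption (the description only says the
value depends on rank and sequence number). It deliberately leaves out
all MPI calls, so it only shows how a receiver could detect a clobbered
or zeroed buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical payload rule: each byte is a letter 'A'..'Z' derived from the
// sender's rank, the message sequence number and (an assumption) the byte
// offset. No expected byte is ever 0, so a zeroed region always mismatches.
inline char expected_byte(int rank, int seq, std::size_t offset) {
    int mix = rank * 7 + seq * 13 + static_cast<int>(offset % 26);
    return static_cast<char>('A' + mix % 26);
}

// Sender side: build one message of buflen chars (5000, as in the description).
std::vector<char> make_message(int rank, int seq, std::size_t buflen = 5000) {
    std::vector<char> buf(buflen);
    for (std::size_t i = 0; i < buflen; ++i) {
        buf[i] = expected_byte(rank, seq, i);
    }
    return buf;
}

// Receiver side: return the offset of the first byte that differs from what
// the given rank and sequence number should have produced, or -1 if intact.
long first_mismatch(const std::vector<char>& buf, int rank, int seq) {
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] != expected_byte(rank, seq, i)) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

In a real test the sender would pass the result of make_message to
MPI_Issend, and the receiver would call first_mismatch on the buffer
once its request tests as complete. Reporting the offset rather than a
plain pass/fail is what would show whether corruption starts around
132KB or right at 0.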
>> The current Subversion trunk version, 1.4a1r21594, seems to pass my
>> test case, but it also tends to show errors like
>> "mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)" on
>> startup, and Murasaki still fails (messages turn into zeros about 132KB
>> in), so something still isn't right...
>> If anyone has any ideas about this test case failing, or my larger issue
>> of messages turning into zeros after 132KB (though sadly sometimes it
>> isn't at 132KB, but straight from 0KB, which is very confusing) while on
>> MX, I'd greatly appreciate it. Even a simple confirmation of "Yes,
>> MPI_Issend/Irecv with MX has issues in 1.3.2" would help my sanity.
>> Kris Popendorf
>> Keio University
>> http://murasaki................... <- (Probably too cumbersome to expect
>> most people to test, but if you feel daring, try putting in some
>> Human/Mouse chromosomes over MX)
>> users mailing list
[A dream that comes true can't really be called a dream.]