Open MPI User's Mailing List Archives

Subject: [OMPI users] Problems with MPI_Issend and MX
From: 8mj6tc902_at_[hidden]
Date: 2009-07-02 18:05:14


Hi. I've now spent many, many hours tracking down a bug that was causing
my program to die, as though either its memory were getting corrupted or
messages were getting clobbered while going through the network; I
couldn't tell which. I really wish the checksum flag on btl_mx_flags
were working. Anyway, I think I've managed to recreate the core of the
problem in a small-ish test case, which I've attached
(verifycontent.cc). For me it usually segfaults at MPI_Issend after
sending about 60-90 messages, using OpenMPI 1.3.2 with Myricom's
mx-1.2.9 drivers on Linux with gcc 4.3.2. Disabling the mx btl (mpirun
-mca btl ^mx) makes it work (and the same goes for my own larger
project, Murasaki). The version using MPI_Ssend
(verifycontent-ssend.cc) also works over mx with no problem. So I
suspect the issue lies in OpenMPI 1.3.2's handling of MPI_Issend over
mx, but it's also possible I've horribly misunderstood something
fundamental about MPI and it's just my fault; if that's the case, please
let me know. (Both this test case and Murasaki work over mpichmx,
though, so OpenMPI is definitely doing something different.)

Here's a brief description of verifycontent.cc to make reading it easier:
* given -np N, half the nodes will be sending and half will be receiving
some number of messages (reps)
* each message consists of buflen (5000) chars, set to some value based
on the sending node's rank and the sequence number of the message
* the receiving node starts an MPI_Irecv for each sending node, then
tests each request until a message arrives
* the receiver then checks the contents of the message to make sure it
matches what was supposed to be in there (this is actually where my real
project, Murasaki, fails, but I can't seem to replicate that failure in
the test case)
* the senders meanwhile keep sending messages, dequeuing each one when
its request tests as completed
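
For anyone who doesn't want to open the attachment, the fill-and-check
part of the steps above can be sketched roughly as follows. This is not
the code from verifycontent.cc: the names make_message, first_mismatch
and expected_byte, and the byte formula itself, are made up for
illustration. The only things taken from the description are the 5000
byte buffer and the idea that the contents depend on the sender's rank
and the message's sequence number. The MPI calls are left out so the
sketch compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Message size taken from the description (buflen = 5000).
constexpr std::size_t kBufLen = 5000;

// Hypothetical pattern: the byte expected at offset i of message number
// `seq` from sender `rank`. The result is always in 1..127, so a byte that
// has been zeroed in transit can never look valid.
inline char expected_byte(int rank, int seq, std::size_t i) {
    const std::size_t x = static_cast<std::size_t>(rank) * 31u
                        + static_cast<std::size_t>(seq) * 7u
                        + i;
    return static_cast<char>(1 + (x % 127));
}

// Sender side: build the buffer that would be handed to the send call.
std::vector<char> make_message(int rank, int seq, std::size_t len = kBufLen) {
    assert(rank >= 0 && seq >= 0);
    std::vector<char> buf(len);
    for (std::size_t i = 0; i < len; ++i) {
        buf[i] = expected_byte(rank, seq, i);
    }
    return buf;
}

// Receiver side: return the offset of the first byte that differs from the
// expected pattern, or -1 if the whole buffer is intact.
long first_mismatch(const std::vector<char>& buf, int rank, int seq) {
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] != expected_byte(rank, seq, i)) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

In an MPI program the sender would pass buf.data() to MPI_Issend and
keep the buffer alive and unmodified until MPI_Test reports the request
complete; the receiver would call first_mismatch once its MPI_Irecv
request completes. Reporting the offset rather than a plain pass/fail
makes it easier to see where in a message the contents go wrong.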

The current subversion trunk version, 1.4a1r21594, seems to pass my test
case, but it also tends to show errors like
"mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)" on
startup, and Murasaki still fails (messages turn into zeros about 132KB
in), so something still isn't right...

If anyone has any ideas about why this test case fails, or about my
larger issue of messages turning into zeros after 132KB on MX (though
sadly sometimes it isn't at 132KB but straight from 0KB, which is very
confusing), I'd greatly appreciate it. Even a simple confirmation of
"Yes, MPI_Issend/Irecv with MX has issues in 1.3.2" would help my sanity.

-- 
Kris Popendorf
Keio University
http://murasaki................... <- (Probably too cumbersome to expect
most people to test, but if you feel daring, try putting in some
Human/Mouse chromosomes over MX)