
Open MPI User's Mailing List Archives


From: Marcin Skoczylas (Marcin.Skoczylas_at_[hidden])
Date: 2007-10-02 14:20:02


Sorry I forgot to mention: Open MPI version 1.2.4

Marcin Skoczylas wrote:
> Hello,
>
> After a whole day of coding I'm struggling a bit with one small
> fragment that seems strange to me.
> For testing I have one head node and two worker nodes on localhost.
> Here is the code (with debug stuff added, like sleeps, barriers, etc.):
>
> void CImageData::SpreadToNodes()
> {
>     sleep(5);
>     logger->debug("CImageData::SpreadToNodes, w=%d h=%d type=%d",
>                   this->width, this->height, this->type);
>
>     logger->debug("head barrier");
>     MPI_Barrier(MPI_COMM_WORLD);
>     sleep(2);
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     // debug 'sync' test
>     logger->debug("head send SYNC str");
>     char buf[5];
>     buf[0] = 'S'; buf[1] = 'Y'; buf[2] = 'N'; buf[3] = 'C';
>     for (int nodeId = 1; nodeId < g_NumProcesses; nodeId++)
>     {
>         MPI_Send(buf, 4, MPI_CHAR, nodeId, DEF_MSG_TAG, MPI_COMM_WORLD);
>     }
>
>     logger->debug("head bcast width: %d", this->width);
>     MPI_Bcast(&(this->width), 1, MPI_INT, 0, MPI_COMM_WORLD);
>     logger->debug("head bcast height: %d", this->height);
>     MPI_Bcast(&(this->height), 1, MPI_INT, 0, MPI_COMM_WORLD);
>     logger->debug("head bcast type: %d", this->type);
>     MPI_Bcast(&(this->type), 1, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
>
>     logger->debug("head sleep 10s");
>     sleep(10);
>
>     logger->debug("finished CImageData::SpreadToNodes");
> }
>
> // this function is declared static:
> CImageData *CImageData::ReceiveFromHead()
> {
>     sleep(5);
>
>     logger->debug("CImageData::ReceiveFromHead");
>     MPI_Status status;
>     int _width;
>     int _height;
>     byte _type;
>
>     logger->debug("worker barrier");
>     MPI_Barrier(MPI_COMM_WORLD);
>     sleep(2);
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     char buf[5];
>     MPI_Recv(buf, 4, MPI_CHAR, HEAD_NODE, DEF_MSG_TAG, MPI_COMM_WORLD,
>              &status);
>     logger->debug("worker received sync str: '%c' '%c' '%c' '%c'",
>                   buf[0], buf[1], buf[2], buf[3]);
>
>     logger->debug("worker bcast width");
>     MPI_Bcast(&(_width), 1, MPI_INT, 0, MPI_COMM_WORLD);
>     logger->debug("worker bcast height");
>     MPI_Bcast(&(_height), 1, MPI_INT, 0, MPI_COMM_WORLD);
>     logger->debug("worker bcast type");
>     MPI_Bcast(&(_type), 1, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
>
>     logger->debug("width=%d height=%d type=%d", _width, _height, _type);
>
>     // TODO: create CImageData object, return...
>     return NULL;
> }
>
>
> That part of the code gives me the following error:
> RANK 0 -> PID 17115
> RANK 1 -> PID 17116
> RANK 2 -> PID 17117
>
> 2007-10-02 19:50:37,829 [17115] DEBUG: CImageData::SpreadToNodes, w=768
> h=576 type=1
> 2007-10-02 19:50:37,829 [17117] DEBUG: CImageData::ReceiveFromHead
> 2007-10-02 19:50:37,829 [17115] DEBUG: head barrier
> 2007-10-02 19:50:37,829 [17116] DEBUG: CImageData::ReceiveFromHead
> 2007-10-02 19:50:37,829 [17116] DEBUG: worker barrier
> 2007-10-02 19:50:37,829 [17117] DEBUG: worker barrier
> 2007-10-02 19:50:39,836 [17115] DEBUG: head send SYNC str
> 2007-10-02 19:50:39,836 [17115] DEBUG: head bcast width: 768
> 2007-10-02 19:50:39,836 [17115] DEBUG: head bcast height: 576
> 2007-10-02 19:50:39,836 [17115] DEBUG: head bcast type: 1
> 2007-10-02 19:50:39,836 [17115] DEBUG: head sleep 10s
> 2007-10-02 19:50:39,836 [17116] DEBUG: worker received sync str: 'S' 'Y'
> 'N' 'C'
> 2007-10-02 19:50:39,836 [17116] DEBUG: worker bcast width
> [pc801:17116] *** An error occurred in MPI_Bcast
> [pc801:17116] *** on communicator MPI_COMM_WORLD
> [pc801:17116] *** MPI_ERR_TRUNCATE: message truncated
> [pc801:17116] *** MPI_ERRORS_ARE_FATAL (goodbye)
> 2007-10-02 19:50:39,836 [17117] DEBUG: worker received sync str: 'S' 'Y'
> 'N' 'C'
> 2007-10-02 19:50:39,836 [17117] DEBUG: worker bcast width
> [pc801:17117] *** An error occurred in MPI_Bcast
> [pc801:17117] *** on communicator MPI_COMM_WORLD
> [pc801:17117] *** MPI_ERR_TRUNCATE: message truncated
> [pc801:17117] *** MPI_ERRORS_ARE_FATAL (goodbye)
> mpirun noticed that job rank 0 with PID 17115 on node pc801 exited on
> signal 15 (Terminated).
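>
> Since MPI_ERRORS_ARE_FATAL aborts the whole job at the first failure,
> I'm thinking of switching MPI_COMM_WORLD to MPI_ERRORS_RETURN and
> checking the return codes myself, to see exactly which call fails.
> Just a sketch of what I mean, not something that is in the project yet:
>
> MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>
> int err = MPI_Bcast(&(_width), 1, MPI_INT, 0, MPI_COMM_WORLD);
> if (err != MPI_SUCCESS)
> {
>     char msg[MPI_MAX_ERROR_STRING];
>     int len;
>     MPI_Error_string(err, msg, &len);
>     // log the error text instead of letting MPI abort the job
>     logger->debug("MPI_Bcast failed: %s", msg);
> }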
>
>
> Could it be that the data stream got out of sync somewhere before this
> part? The project is quite large and there is a lot of communication
> between processes before CImageData::SpreadToNodes(), so debugging the
> whole flow could take hours or days; however, the data flow before this
> particular fragment seems to be fine. How can it be that MPI_Send/MPI_Recv
> delivered the correct buffer (4 chars - SYNC) while the MPI_Bcast of an
> MPI_INT is truncated? I ran the code under Valgrind - it didn't complain
> and gave exactly the same result. Should I assume that I have a
> memory-access error somewhere before this part that corrupted the MPI
> structures? How exactly does MPI_Bcast work?
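>
> To check my understanding of how MPI_Bcast matches the data on the
> root and on the receivers, I wrote a tiny standalone test (made up
> just for this mail, it is not code from my project). The root passes
> a different count than the other ranks, which - as far as I
> understand - is the kind of mismatch that should produce
> MPI_ERR_TRUNCATE:
>
> // bcast_trunc_test.cpp - made-up example, not from my project
> #include <mpi.h>
> #include <cstdio>
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     int data[2] = { 768, 576 };
>     if (rank == 0)
>     {
>         // root broadcasts two ints...
>         MPI_Bcast(data, 2, MPI_INT, 0, MPI_COMM_WORLD);
>     }
>     else
>     {
>         // ...but the receivers only expect one -> truncated message
>         MPI_Bcast(data, 1, MPI_INT, 0, MPI_COMM_WORLD);
>     }
>
>     printf("rank %d: data[0]=%d\n", rank, data[0]);
>     MPI_Finalize();
>     return 0;
> }
>
> (I would compile it with mpic++ and run it with mpirun -np 3, the same
> way as my real program.)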
>
> Sorry for the disturbance, but I'm a little bit confused.
> Thank you & greetings, Marcin
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>