1.4.3 is fairly ancient.
Can you upgrade to 1.6.5?
On Jul 26, 2013, at 3:15 AM, Dusan Zoric <dusan.zoric_at_[hidden]> wrote:
> I am running an application that performs transformations of large matrices on a 7-node cluster. The nodes are connected via QDR 40 Gbit InfiniBand. Open MPI 1.4.3 is installed on the system.
> The matrix transformation requires a large data exchange between nodes: at each step of the algorithm, one node sends data and all the others receive. The number of processes is equal to the number of nodes used. I am relatively new to MPI, but MPI_Bcast seemed like the ideal way to do this.
> Everything worked fine for matrices that were not too large. However, as the matrix size increases, at some point the application hangs and stays there forever.
> I am not completely sure, but there seem to be no errors in my code. I traced it in detail to check whether any collective operations were left uncompleted before that specific MPI_Bcast call, but everything looks fine. Also, for that specific call, the root is set correctly in all processes, as are the message type and size, and, of course, MPI_Bcast is called in all processes.
> I also ran many scenarios (different matrix sizes and different numbers of processes) to figure out when this happens. I observed the following:
> - For a matrix of the same size, the application finishes successfully if I decrease the number of processes.
> - However, for a given number of processes, the application hangs for a slightly larger matrix.
> - For a matrix size and number of processes where the program hangs, if I halve the message size in each MPI_Bcast call (the result will of course be incorrect), it no longer hangs.
> So it seems to me that the problem could be in some buffers that MPI uses, and that maybe some default MCA parameter should be changed. But, as I said, I do not have much experience with MPI programming, and I have not found a solution. Has anyone had a similar problem? Could it be solved by setting an appropriate MCA parameter, or is there some other solution or explanation?
> Dusan Zoric