Open MPI User's Mailing List Archives

Subject: [OMPI users] help me understand these error msgs
From: Jure Pečar (pegasus_at_[hidden])
Date: 2013-01-16 10:41:10


Hello,

I have a large Fortran code that processes weather-forecast data. It runs fine with a smaller dataset, but with a larger dataset I get errors I've never seen before:

[node061:05144] [[55141,0],11]->[[55141,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 9]
[node061:05144] [[55141,0],11] routed:binomial: Connection to lifeline [[55141,0],0] lost

and

node084:7.0.Non-fatal temporary exhaustion of send tid dma descriptors
(elapsed=43.788s, source LID=0x49/context=11, count=1) (err=0)

I'm using QLogic software version 7.1.0.0.58 (OFED 1.5.4.1, Open MPI 1.4.3).

I launch the program with mpirun -mca btl openib,sm,self, so I don't understand what TCP has to do with the first error message.

I also traced the second error message to the PSM code, but it still appears even when I add -mca mtl ^psm to my mpirun arguments. Why?

Any help appreciated.

-- 
Jure Pečar
http://jure.pecar.org