Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-10-21 09:52:35


For those following this thread: there was off-list discussion about
this topic -- re-starting the Torque daemons *seemed* to fix the
problem.

On Oct 20, 2006, at 6:00 PM, Ogden, Jeffry Brandon wrote:

> We don't actually have the capability to test the mpiexec + MVAPICH
> launch at the moment. I was able to get a job to launch at 1920 and
> I'm
> waiting for it to finish. When it is done, I can at least try an
> mpiexec
> -comm=none launch to see how TM responds to it.
>
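[Editor's note: for anyone reproducing this diagnostic, OSC mpiexec's -comm=none mode starts processes through TM without wiring up any MPI communication library, which isolates the TM spawn path from the MPI stack. A hypothetical smoke test from inside a matching allocation; exact flags depend on your OSC mpiexec version, so verify against its man page:]

```shell
# From within a qsub allocation of 1920 nodes, 1 ppn.  -comm=none
# exercises only the TM spawn path; no MPI library is involved.
mpiexec -comm=none -pernode hostname
```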
>> -----Original Message-----
>> From: owner-tbird-admin_at_[hidden]
>> [mailto:owner-tbird-admin_at_[hidden]] On Behalf Of Jeff Squyres
>> Sent: Friday, October 20, 2006 1:17 PM
>> To: Open MPI Users
>> Cc: tbird-admin
>> Subject: Re: [OMPI users] OMPI launching problem using TM and
>> openib on 1920 nodes
>>
>> This message is coming from torque:
>>
>> [15:15] 69-94-204-35:~/Desktop/torque-2.1.2 % grep -r "out of space
>> in buffer and cannot commit message" *
>> src/lib/Libifl/tcp_dis.c: DBPRT(("%s: error! out of space in
>> buffer and cannot commit message (bufsize=%d, buflen=%d, ct=%d)\n",
>>
>> Are you able to use OSC mpiexec to launch over the same number of
>> nodes, perchance?
>>
>>
>> On Oct 20, 2006, at 12:23 PM, Ogden, Jeffry Brandon wrote:
>>
>>> We are having quite a bit of trouble reliably launching larger jobs
>>> (1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the
>> moment. The
>>> launches usually either just hang or fail with output like:
>>>
>>> Cbench numprocs: 1920
>>> Cbench numnodes: 1921
>>> Cbench ppn: 1
>>> Cbench jobname: xhpl-1ppn-1920
>>> Cbench joblaunchmethod: openmpi
>>>
>>> tcp_puts: error! out of space in buffer and cannot commit message
>>> (bufsize=262144, buflen=261801, ct=450)
>>>
>>> [cn1023:02832] pls:tm: start_procs returned error -1
>>> [cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at
>>> line
>>> 186
>>> [cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at
>>> line
>>> 490
>>> [cn1023:02832] orterun: spawn failed with errno=-1
>>> [dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>> [dn583:00606] [0,0,7]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>> [dn584:00606] [0,0,8]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>> [dn585:00604] [0,0,9]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>> [dn591:00606] [0,0,15]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>> [dn592:00604] [0,0,16]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>> [dn582:00607] [0,0,6]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>> [dn588:00605] [0,0,12]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>> [dn590:00606] [0,0,14]-[0,0,0] mca_oob_tcp_msg_recv: readv
>> failed with
>>> errno=104
>>>
>>> The OMPI environment parameters we are using are:
>>> %env | grep OMPI
>>> OMPI_MCA_oob_tcp_include=eth0
>>> OMPI_MCA_oob_tcp_listen_mode=listen_thread
>>> OMPI_MCA_btl_openib_ib_timeout=18
>>> OMPI_MCA_oob_tcp_listen_thread_max_time=100
>>> OMPI_MCA_oob_tcp_listen_thread_max_queue=100
>>> OMPI_MCA_btl_tcp_if_include=eth0
>>> OMPI_MCA_btl_openib_ib_retry_count=15
>>> OMPI_MCA_btl_openib_ib_cq_size=65536
>>> OMPI_MCA_rmaps_base_schedule_policy=node
>>>
>>> I have attached the full output generated with the following OMPI
>>> params:
>>> export OMPI_MCA_pls_tm_debug=1
>>> export OMPI_MCA_pls_tm_verbose=1
>>>
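[Editor's note: each OMPI_MCA_* environment variable above is interchangeable with a --mca argument to mpirun/orterun, so the same debug run could be expressed on the command line. A sketch, with node count taken from the job above and the binary name assumed:]

```shell
# Equivalent to exporting OMPI_MCA_pls_tm_debug=1 and
# OMPI_MCA_pls_tm_verbose=1 before launching:
mpirun --mca pls_tm_debug 1 --mca pls_tm_verbose 1 \
       --mca rmaps_base_schedule_policy node \
       -np 1920 ./xhpl
```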
>>> We are running Torque 2.1.2. I'm mostly suspicious of the tcp_puts
>>> error and the 262144 bufsize limit... Any ideas? Thanks.
>>> <xhpl-1ppn-1920..o127407>
>>> <xhpl-1ppn-1920..e127407>
>>> <mime-attachment.txt>
>>
>>
>> --
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems
>>
>>

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems