
Open MPI User's Mailing List Archives


This web mail archive is frozen.


No new mails have been added to this archive since July 2016.


From: Ogden, Jeffry Brandon (jbogden_at_[hidden])
Date: 2006-10-20 12:23:13


We are having quite a bit of trouble reliably launching larger jobs
(1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the moment. The
launches usually either just hang or fail with output like:

Cbench numprocs: 1920
Cbench numnodes: 1921
Cbench ppn: 1
Cbench jobname: xhpl-1ppn-1920
Cbench joblaunchmethod: openmpi

tcp_puts: error! out of space in buffer and cannot commit message
(bufsize=262144, buflen=261801, ct=450)

[cn1023:02832] pls:tm: start_procs returned error -1
[cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line
186
[cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line
490
[cn1023:02832] orterun: spawn failed with errno=-1
[dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104
[dn583:00606] [0,0,7]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104
[dn584:00606] [0,0,8]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104
[dn585:00604] [0,0,9]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104
[dn591:00606] [0,0,15]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104
[dn592:00604] [0,0,16]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104
[dn582:00607] [0,0,6]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104
[dn588:00605] [0,0,12]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104
[dn590:00606] [0,0,14]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with
errno=104

The OMPI environment parameters we are using are:
 %env | grep OMPI
 OMPI_MCA_oob_tcp_include=eth0
 OMPI_MCA_oob_tcp_listen_mode=listen_thread
 OMPI_MCA_btl_openib_ib_timeout=18
 OMPI_MCA_oob_tcp_listen_thread_max_time=100
 OMPI_MCA_oob_tcp_listen_thread_max_queue=100
 OMPI_MCA_btl_tcp_if_include=eth0
 OMPI_MCA_btl_openib_ib_retry_count=15
 OMPI_MCA_btl_openib_ib_cq_size=65536
 OMPI_MCA_rmaps_base_schedule_policy=node
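
For context, a minimal batch-script sketch of how these same MCA settings would be applied via the environment before launch (the script itself, the commented launch line, and the `xhpl` binary path are illustrative assumptions, not from the original job scripts):

```shell
#!/bin/sh
# Sketch only: exports the MCA parameters reported above so that
# mpirun/orterun inherits them. Paths and the launch line are placeholders.

# Restrict out-of-band (OOB) and TCP BTL traffic to eth0.
export OMPI_MCA_oob_tcp_include=eth0
export OMPI_MCA_btl_tcp_if_include=eth0

# Use a dedicated OOB listen thread for connection setup at scale.
export OMPI_MCA_oob_tcp_listen_mode=listen_thread
export OMPI_MCA_oob_tcp_listen_thread_max_time=100
export OMPI_MCA_oob_tcp_listen_thread_max_queue=100

# InfiniBand BTL tuning.
export OMPI_MCA_btl_openib_ib_timeout=18
export OMPI_MCA_btl_openib_ib_retry_count=15
export OMPI_MCA_btl_openib_ib_cq_size=65536

# Place ranks round-robin by node (matches the 1-ppn layout).
export OMPI_MCA_rmaps_base_schedule_policy=node

# Launch under Torque/TM; binary and rank count are placeholders:
# mpirun -np 1920 ./xhpl
```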

I have attached the full output generated with the following OMPI params:
 export OMPI_MCA_pls_tm_debug=1
 export OMPI_MCA_pls_tm_verbose=1

We are running Torque 2.1.2. I'm mostly suspicious of the tcp_puts
error and the 262144 bufsize limit... Any ideas? Thanks.