Open MPI Development Mailing List Archives

From: Andrew Friedley (afriedle_at_[hidden])
Date: 2007-08-29 15:15:54


Thanks for the suggestion, though that appears to hang with no output
whatsoever.

Andrew

Aurelien Bouteiller wrote:
> You should try mpirun -np 2 -bynode totalview ./NPmpi
>
> Aurelien
> On Aug 29, 2007, at 1:05 PM, Andrew Friedley wrote:
>
>> OK, I've never used totalview before. So doing some FAQ reading I got
>> an xterm on an Atlas node (odin doesn't have totalview AFAIK). Trying
>> a simple netpipe run just to get familiar with things results in this:
>>
>> $ mpirun -debug -np 2 -bynode -debug-daemons ./NPmpi
>> --------------------------------------------------------------------------
>> Internal error -- the orte_base_user_debugger MCA parameter was not
>> able to be found. Please contact the Open RTE developers; this should
>> not happen.
>> --------------------------------------------------------------------------
>>
>> Grepping for that param in ompi_info shows:
>>
>>   MCA orte: parameter "orte_base_user_debugger" (current value:
>>     "totalview @mpirun@ -a @mpirun_args@ :
>>      ddt -n @np@ -start @executable@ @executable_argv@ @single_app@ :
>>      fxp @mpirun@ -a @mpirun_args@")
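>>
>> (Since the value does show up there, maybe passing it explicitly on
>> the command line would sidestep whatever lookup is failing -- something
>> like
>>
>>   mpirun -mca orte_base_user_debugger "totalview @mpirun@ -a @mpirun_args@" \
>>       -debug -np 2 -bynode ./NPmpi
>>
>> though I haven't tried that.)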
>>
>> What's going on? I also tried running totalview directly, using a
>> line like this:
>>
>> totalview mpirun -a -np 2 -bynode -debug-daemons ./NPmpi
>>
>> Totalview comes up and seems to be debugging the mpirun process, with
>> only one thread. It doesn't seem to be aware that this is an MPI job
>> with other MPI processes... any ideas?
>>
>> Andrew
>>
>> George Bosilca wrote:
>>> The first step will be to figure out which version of the alltoall
>>> you're using. I assume you're using the default parameters, in which
>>> case the decision function in the tuned component will pick the
>>> linear alltoall. As the name states, this means that every node
>>> posts one receive from every other node and then starts sending each
>>> of them the respective fragment. This leads to a lot of outstanding
>>> sends and receives. I doubt that the receives can cause a problem,
>>> so I expect the problem is coming from the send side.
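>>>
>>> Roughly, the communication pattern looks like this standalone sketch
>>> (just to illustrate the load, not the actual tuned component code;
>>> 256-byte messages assumed):
>>>
>>>   #include <mpi.h>
>>>   #include <stdlib.h>
>>>
>>>   int main(int argc, char **argv)
>>>   {
>>>       int size, i, nreq = 0;
>>>       const int count = 256;               /* 256-byte messages */
>>>
>>>       MPI_Init(&argc, &argv);
>>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>
>>>       char *sbuf = calloc((size_t)size, count);
>>>       char *rbuf = calloc((size_t)size, count);
>>>       MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(*reqs));
>>>
>>>       /* post a receive from every peer... */
>>>       for (i = 0; i < size; i++)
>>>           MPI_Irecv(rbuf + (size_t)i * count, count, MPI_BYTE, i, 0,
>>>                     MPI_COMM_WORLD, &reqs[nreq++]);
>>>
>>>       /* ...then a send to every peer, so roughly 2 * nprocs requests
>>>        * are outstanding per rank at once */
>>>       for (i = 0; i < size; i++)
>>>           MPI_Isend(sbuf + (size_t)i * count, count, MPI_BYTE, i, 0,
>>>                     MPI_COMM_WORLD, &reqs[nreq++]);
>>>
>>>       MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
>>>
>>>       free(sbuf); free(rbuf); free(reqs);
>>>       MPI_Finalize();
>>>       return 0;
>>>   }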
>>>
>>> Do you have TotalView installed on odin? If so, there is a simple
>>> way to see how many sends are pending and where... That might
>>> pinpoint [at least] the process where you should look to see what's
>>> wrong.
>>>
>>> george.
>>>
>>> On Aug 29, 2007, at 12:37 AM, Andrew Friedley wrote:
>>>
>>>> I'm having a problem with the UD BTL and hoping someone might have
>>>> some input to help solve it.
>>>>
>>>> What I'm seeing is hangs when running alltoall benchmarks with
>>>> nbcbench or an LLNL program called mpiBench -- both hang exactly the
>>>> same way. With the code on the trunk, running nbcbench on IU's odin
>>>> using 32 nodes and a command line like this:
>>>>
>>>> mpirun -np 128 -mca btl ofud,self ./nbcbench -t MPI_Alltoall \
>>>>     -p 128-128 -s 1-262144
>>>>
>>>> hangs consistently when testing 256-byte messages. There are two
>>>> things I can do to make the hang go away until running at larger
>>>> scale. The first is to increase the 'btl_ofud_sd_num' MCA param from
>>>> its default value of 128. This allows you to run with more
>>>> procs/nodes before hitting the hang, but AFAICT doesn't fix the
>>>> actual problem. What this parameter does is control the maximum
>>>> number of outstanding send WQEs posted at the IB level -- when the
>>>> limit is reached, frags are queued on an opal_list_t and later sent
>>>> by progress as IB sends complete.
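>>>>
>>>> (For reference, bumping the limit is just an extra -mca argument,
>>>> e.g.
>>>>
>>>>   mpirun -np 128 -mca btl ofud,self -mca btl_ofud_sd_num 1024 \
>>>>       ./nbcbench -t MPI_Alltoall -p 128-128 -s 1-262144
>>>>
>>>> where 1024 is just an arbitrary larger value.)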
>>>>
>>>> The other way I've found is to play games with calling
>>>> mca_btl_ud_component_progress() in mca_btl_ud_endpoint_post_send().
>>>> In fact I replaced the CHECK_FRAG_QUEUES() macro used around
>>>> btl_ofud_endpoint.c:77 with a version that loops on progress until a
>>>> send WQE slot is available (as opposed to queueing). Same result --
>>>> I can run at larger scale, but still hit the hang eventually.
>>>>
>>>> It appears that when the job hangs, progress is being polled very
>>>> quickly, and after spinning for a while there are no outstanding
>>>> send WQEs or queued sends in the BTL. I'm not sure where further up
>>>> things are spinning/blocking, as I can't produce the hang at less
>>>> than 32 nodes / 128 procs and don't have a good way of debugging
>>>> that (suggestions appreciated).
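>>>>
>>>> (The best I've come up with is attaching gdb to a couple of stuck
>>>> ranks by hand and dumping stacks, roughly
>>>>
>>>>   gdb -batch -p <pid of a hung rank> -ex 'thread apply all bt'
>>>>
>>>> but that gets unwieldy at 128 procs.)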
>>>>
>>>> Furthermore, both the ob1 and dr PMLs result in the same behavior,
>>>> except that DR eventually trips a watchdog timeout, fails the BTL,
>>>> and terminates the job.
>>>>
>>>> Other collectives such as allreduce and allgather do not hang --
>>>> only alltoall. I can also reproduce the hang on LLNL's Atlas
>>>> machine.
>>>>
>>>> Can anyone else reproduce this (Torsten might have to make a copy of
>>>> nbcbench available)? Anyone have any ideas as to what's wrong?
>>>>
>>>> Andrew