From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-08-30 10:25:49


I already tried that. However, I'm trying it in a couple of different
ways and getting some mixed results. Let me formulate the error cases
and get back to you.

Cheers,
Josh

On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:

> Well, why don't you first try separating this from MTT? Just run the
> command manually in batch mode and see if it works. If that works, then
> the problem is with MTT. Otherwise, we have a problem with notification.
>
> Or are you saying that you have already done this?
> Ralph
>
>
> On 8/30/06 8:03 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>
>> Yet another point (sorry for the spam). This may not be an MTT issue
>> but a broken ORTE on the trunk :(
>>
>> When I try to run in an allocation (srun -N 16 -A), things run fine.
>> But if I try to run in batch mode (srun -N 16 -b myscript.sh), then I
>> see the same hang as in MTT. It seems that mpirun is not being
>> properly notified of the completion of the job. :(
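>>
>> For reference, reproducing this outside of MTT boils down to roughly the
>> following (the script name and test executable are just placeholders; the
>> actual mpirun line MTT uses shows up in the ps output further down the
>> thread):
>>
>>   # Case 1: interactive allocation -- run mpirun from the shell that
>>   # srun -A hands back.  This case works.
>>   srun -N 16 -A
>>   mpirun -mca btl tcp,self -np 32 ./some_test
>>
>>   # Case 2: batch mode -- myscript.sh wraps the same mpirun line.  This
>>   # is the case where mpirun hangs after the test finishes.
>>   cat > myscript.sh <<'EOF'
>>   #!/bin/sh
>>   mpirun -mca btl tcp,self -np 32 ./some_test
>>   EOF
>>   chmod +x myscript.sh
>>   srun -N 16 -b myscript.sh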
>>
>> I'll try to investigate a bit further today. Any thoughts on what
>> might be causing this?
>>
>> Cheers,
>> Josh
>>
>> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
>>
>>> Forgot this bit in my mail. With mpirun just hanging out there, I
>>> attached GDB and got the following stack trace:
>>> (gdb) bt
>>> #0  0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>>> #1  0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>>> #2  0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
>>> #3  0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:514
>>> #4  0x0000002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
>>> #5  0x000000000040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
>>> #6  0x0000000000402f52 in orterun (argc=9, argv=0x7fbffff0b8) at orterun.c:444
>>> #7  0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at main.c:13
>>>
>>> Seems that mpirun is waiting for things to complete :/
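>>>
>>> (For completeness, the attach itself was nothing fancy -- roughly the
>>> following, with the PID taken from ps; the pgrep shortcut is just one
>>> way to find it:)
>>>
>>>   gdb -p $(pgrep -u mpiteam mpirun)   # attach to the hung mpirun
>>>   (gdb) bt                            # the backtrace shown above
>>>   (gdb) detach
>>>   (gdb) quit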
>>>
>>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>>>
>>>>
>>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>>>
>>>>> On 8/29/06 8:57 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>>
>>>>>>> Does this apply to *all* tests, or only some of the tests (like
>>>>>>> allgather)?
>>>>>>
>>>>>> All of the tests: trivial and ibm. They all time out :(
>>>>>
>>>>> Blah. The trivial tests are simply "hello world", so they should
>>>>> take just
>>>>> about no time at all.
>>>>>
>>>>> Is this running under SLURM? I put the code in there to know how many
>>>>> procs to use in SLURM but have not tested it in eons. I doubt that's
>>>>> the problem, but that's one thing to check.
>>>>>
>>>>
>>>> Yep, it is in SLURM, and it seems that the 'number of procs' code is
>>>> working fine, as it changes with different allocations.
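>>>>
>>>> (A quick sanity check along those lines -- which of these variables the
>>>> MTT code actually reads is my guess, but they are what SLURM exports
>>>> inside an allocation or batch script:)
>>>>
>>>>   env | grep ^SLURM
>>>>   echo "nodes: $SLURM_NNODES  tasks per node: $SLURM_TASKS_PER_NODE"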
>>>>
>>>>> Can you set a super-long timeout (e.g., a few minutes), and while one
>>>>> of the trivial tests is running, do some ps's on the relevant nodes and
>>>>> see what, if anything, is running? E.g., mpirun, the test executable on
>>>>> the nodes, etc.
>>>>
>>>> Without setting a long timeout, it seems that mpirun is running, but
>>>> nothing else, and only on the launching node.
>>>>
>>>> When a test starts, you can see mpirun launching properly:
>>>> $ ps aux | grep ...
>>>> USER     PID    %CPU %MEM  VSZ    RSS   TTY  STAT START  TIME COMMAND
>>>> mpiteam  15117  0.5  0.8   113024 33680 ?    S    09:32  0:06
>>>>   perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch
>>>>   --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>> mpiteam  15294  0.0  0.0   0      0     ?    Z    09:32  0:00 [sh] <defunct>
>>>> mpiteam  28453  0.2  0.0   38072  3536  ?    S    09:50  0:00
>>>>   mpirun -mca btl tcp,self -np 32
>>>>   --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install
>>>>   collective/allgather_in_place
>>>> mpiteam  28454  0.0  0.0   41716  2040  ?    Sl   09:50  0:00
>>>>   srun --nodes=16 --ntasks=16
>>>>   --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007
>>>>   orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1
>>>>   --num_procs 16 --vpid_start 0
>>>>   --universe mpiteam_at_[hidden]:default-universe-28453
>>>>   --nsreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>   --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>> mpiteam  28455  0.0  0.0   23212  1804  ?    Ssl  09:50  0:00
>>>>   srun --nodes=16 --ntasks=16
>>>>   --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007
>>>>   orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1
>>>>   --num_procs 16 --vpid_start 0
>>>>   --universe mpiteam_at_[hidden]:default-universe-28453
>>>>   --nsreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>   --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>> mpiteam  28472  0.0  0.0   36956  2256  ?    S    09:50  0:00
>>>>   /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted
>>>>   --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1
>>>>   --num_procs 16 --vpid_start 0
>>>>   --universe mpiteam_at_[hidden]:default-universe-28453
>>>>   --nsreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>   --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>> mpiteam  28482  0.1  0.0   64296  3564  ?    S    09:50  0:00 collective/allgather_in_place
>>>> mpiteam  28483  0.1  0.0   64296  3564  ?    S    09:50  0:00 collective/allgather_in_place
>>>>
>>>> But once the test finishes, mpirun seems to just be hanging out.
>>>> $ ps aux | grep ...
>>>> USER     PID    %CPU %MEM  VSZ    RSS   TTY   STAT START  TIME COMMAND
>>>> mpiteam  15083  0.0  0.0   52760  1040  ?     S    09:31  0:00
>>>>   /bin/bash /var/tmp/slurmd/job148126/script
>>>> root     15086  0.0  0.0   42884  3172  ?     Ss   09:31  0:00 sshd: mpiteam [priv]
>>>> mpiteam  15088  0.0  0.0   43012  3252  ?     S    09:31  0:00 sshd: mpiteam_at_pts/1
>>>> mpiteam  15089  0.0  0.0   56680  1912  pts/1 Ss   09:31  0:00 -tcsh
>>>> mpiteam  15117  0.5  0.8   113024 33680 ?     S    09:32  0:06
>>>>   perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch
>>>>   --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>> mpiteam  15294  0.0  0.0   0      0     ?     Z    09:32  0:00 [sh] <defunct>
>>>> mpiteam  28453  0.0  0.0   38204  3568  ?     S    09:50  0:00
>>>>   mpirun -mca btl tcp,self -np 32
>>>>   --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install
>>>>   collective/allgather_in_place
>>>>
>>>> Thoughts?
>>>>
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> Server Virtualization Business Unit
>>>>> Cisco Systems
>>>>
>>>
>>
>