From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-08-30 10:36:33


I'm trying to replicate the MTT environment as much as possible, and
have a couple of questions.

Assume that neither mpirun nor the Open MPI libraries are in my PATH/
LD_LIBRARY_PATH when I start MTT. After MTT builds Open MPI, how does it
set these variables so that it can build the tests? And how does it set
them when it runs those tests (or does it rely solely on --prefix)?
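
What I would do by hand is roughly the following (just a sketch; the
install path is a placeholder, and I'm asking whether MTT does the
equivalent for both the test builds and the test runs):

  # hypothetical install prefix under the MTT scratch tree
  PREFIX=/path/to/mtt-scratch/installs/<some-install>/install
  export PATH=$PREFIX/bin:$PATH
  export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH

  # ...or skip the environment entirely and lean on --prefix:
  $PREFIX/bin/mpirun --prefix $PREFIX -np 2 ./some_test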

Cheers,
josh

On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:

> I already tried that. However, I'm trying it in a couple of different
> ways and getting some mixed results. Let me formulate the error cases
> and get back to you.
>
> Cheers,
> Josh
>
> On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:
>
>> Well, why don't you first try separating this from MTT? Just run the
>> command manually in batch mode and see if it works. If it does, then the
>> problem is with MTT. Otherwise, we have a problem with notification.
>>
>> Or are you saying that you have already done this?
>> Ralph
>>
>>
>> On 8/30/06 8:03 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>
>>> Yet another point (sorry for the spam). This may not be an MTT issue
>>> but rather a broken ORTE on the trunk :(
>>>
>>> When I try to run in an allocation (srun -N 16 -A), things run fine.
>>> But if I try to run in batch mode (srun -N 16 -b myscript.sh), then I
>>> see the same hang as in MTT; it seems that mpirun is not getting
>>> properly notified of the completion of the job. :(
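>>>
>>> Roughly, the two cases are (just a sketch; the install path is a
>>> placeholder, and myscript.sh contains only the mpirun line):
>>>
>>>   # works: interactive allocation, then mpirun inside it
>>>   srun -N 16 -A
>>>   mpirun -mca btl tcp,self -np 32 --prefix <install> ./some_test
>>>
>>>   # hangs: batch mode running the same mpirun command from a script
>>>   srun -N 16 -b myscript.sh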
>>>
>>> I'll try to investigate a bit further today. Any thoughts on what
>>> might be causing this?
>>>
>>> Cheers,
>>> Josh
>>>
>>> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
>>>
>>>> Forgot this bit in my mail: with mpirun just hanging out there, I
>>>> attached GDB and got the following stack trace:
>>>> (gdb) bt
>>>> #0 0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>>>> #1 0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>>>> #2 0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
>>>> #3 0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:514
>>>> #4 0x0000002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
>>>> #5 0x000000000040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
>>>> #6 0x0000000000402f52 in orterun (argc=9, argv=0x7fbffff0b8) at orterun.c:444
>>>> #7 0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at main.c:13
>>>>
>>>> Seems that mpirun is waiting for things to complete :/
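>>>>
>>>> (For anyone repeating this: I just attached to the hung mpirun and
>>>> dumped the stack, e.g.
>>>>
>>>>   gdb -p <mpirun pid>
>>>>   (gdb) bt
>>>>
>>>> while the job was stuck.)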
>>>>
>>>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>>>>
>>>>>
>>>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>>>>
>>>>>> On 8/29/06 8:57 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>>>
>>>>>>>> Does this apply to *all* tests, or only some of the tests (like
>>>>>>>> allgather)?
>>>>>>>
>>>>>>> All of the tests: Trivial and ibm. They all time out :(
>>>>>>
>>>>>> Blah. The trivial tests are simply "hello world", so they should take
>>>>>> just about no time at all.
>>>>>>
>>>>>> Is this running under SLURM? I put the code in there to know how many
>>>>>> procs to use in SLURM but have not tested it in eons. I doubt that's
>>>>>> the problem, but that's one thing to check.
>>>>>>
>>>>>
>>>>> Yep, it is in SLURM, and it seems that the 'number of procs' code is
>>>>> working fine, as it changes with different allocations.
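>>>>>
>>>>> (I assume it is picking the count up from the SLURM environment; e.g.,
>>>>> inside an allocation, something like
>>>>>
>>>>>   echo $SLURM_NNODES $SLURM_TASKS_PER_NODE
>>>>>
>>>>> tracks the allocation size, though I have not checked exactly which
>>>>> variables MTT reads.)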
>>>>>
>>>>>> Can you set a super-long timeout (e.g., a few minutes), and while one
>>>>>> of the trivial tests is running, do some ps's on the relevant nodes and
>>>>>> see what, if anything, is running? E.g., mpirun, the test executable on
>>>>>> the nodes, etc.
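>>>>>>
>>>>>> E.g., just a sketch (substitute the nodes from your allocation):
>>>>>>
>>>>>>   for n in <node1> <node2> ; do
>>>>>>     echo "== $n =="
>>>>>>     ssh $n 'ps -fu $USER'
>>>>>>   done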
>>>>>
>>>>> Without setting a long timeout: it seems that mpirun is running, but
>>>>> nothing else, and only on the launching node.
>>>>>
>>>>> When a test starts, you see mpirun launching properly:
>>>>> $ ps aux | grep ...
>>>>> USER     PID   %CPU %MEM  VSZ    RSS   TTY STAT START TIME COMMAND
>>>>> mpiteam  15117  0.5  0.8 113024 33680  ?   S    09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>> mpiteam  15294  0.0  0.0      0     0  ?   Z    09:32 0:00 [sh] <defunct>
>>>>> mpiteam  28453  0.2  0.0  38072  3536  ?   S    09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>> mpiteam  28454  0.0  0.0  41716  2040  ?   Sl   09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>> mpiteam  28455  0.0  0.0  23212  1804  ?   Ssl  09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>> mpiteam  28472  0.0  0.0  36956  2256  ?   S    09:50 0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>> mpiteam  28482  0.1  0.0  64296  3564  ?   S    09:50 0:00 collective/allgather_in_place
>>>>> mpiteam  28483  0.1  0.0  64296  3564  ?   S    09:50 0:00 collective/allgather_in_place
>>>>>
>>>>> But once the test finishes, mpirun seems to just be hanging out.
>>>>> $ ps aux | grep ...
>>>>> USER     PID   %CPU %MEM  VSZ    RSS   TTY   STAT START TIME COMMAND
>>>>> mpiteam  15083  0.0  0.0  52760  1040  ?     S    09:31 0:00 /bin/bash /var/tmp/slurmd/job148126/script
>>>>> root     15086  0.0  0.0  42884  3172  ?     Ss   09:31 0:00 sshd: mpiteam [priv]
>>>>> mpiteam  15088  0.0  0.0  43012  3252  ?     S    09:31 0:00 sshd: mpiteam_at_pts/1
>>>>> mpiteam  15089  0.0  0.0  56680  1912  pts/1 Ss   09:31 0:00 -tcsh
>>>>> mpiteam  15117  0.5  0.8 113024 33680  ?     S    09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>> mpiteam  15294  0.0  0.0      0     0  ?     Z    09:32 0:00 [sh] <defunct>
>>>>> mpiteam  28453  0.0  0.0  38204  3568  ?     S    09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> Server Virtualization Business Unit
>>>>>> Cisco Systems
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> mtt-users mailing list
> mtt-users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users