From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-08-30 11:36:06


(sorry -- been afk much of this morning)

MTT directly sets environment variables in its own environment (via
$ENV{whatever} = "foo") before using fork/exec to launch compiles and runs.
Hence, the forked children inherit the environment variables that we set
(e.g., PATH and LD_LIBRARY_PATH).

So if you source the env vars files that MTT drops, that should be
sufficient.
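
For illustration, the mechanism boils down to something like this (a
minimal sketch, not the actual MTT client code -- the install prefix and
the compile command here are made up):

  use strict;
  use warnings;

  # Hypothetical install prefix; MTT computes the real one per build.
  my $install_dir = "/path/to/ompi/install";

  # Set the variables in MTT's own environment...
  $ENV{PATH} = "$install_dir/bin:$ENV{PATH}";
  $ENV{LD_LIBRARY_PATH} = "$install_dir/lib:" . ($ENV{LD_LIBRARY_PATH} || "");

  # ...then fork/exec; the child inherits the modified environment.
  my $pid = fork();
  die "fork failed: $!" unless defined $pid;
  if ($pid == 0) {
      # Child: sees the PATH/LD_LIBRARY_PATH we just set.
      exec("mpicc", "-o", "hello", "hello.c") or die "exec failed: $!";
  }
  waitpid($pid, 0);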

As for setting the values on *remote* nodes, we do it solely via the
--prefix option. I wonder if --prefix is broken under SLURM...? That might
be something to check -- you might be inadvertently mixing installations of
OMPI...?
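
The idea is that mpirun, when given --prefix, tells each remote orted to
derive its own PATH and LD_LIBRARY_PATH from $prefix/bin and $prefix/lib,
so nothing needs to be exported to the remote shells by hand. A minimal
sketch of the run side (again illustrative, not the real MTT code;
$install_dir is the same made-up prefix as above):

  # Launch a test with --prefix so the remote nodes locate the right
  # installation without any hand-exported environment variables.
  my @cmd = ("mpirun", "--prefix", $install_dir,
             "-np", 32, "collective/allgather_in_place");
  system(@cmd) == 0 or warn "mpirun exited with status $?";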

On 8/30/06 10:36 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:

> I'm trying to replicate the MTT environment as much as possible, and
> have a couple of questions.
>
> Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start
> MTT. After MTT builds Open MPI, how does it export these variables so
> that it can build the tests? How does it export these when it runs
> those tests (solely via --prefix)?
>
> Cheers,
> josh
>
> On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:
>
>> I already tried that. However, I'm trying it in a couple of different
>> ways and getting some mixed results. Let me formulate the error cases
>> and get back to you.
>>
>> Cheers,
>> Josh
>>
>> On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:
>>
>>> Well, why don't you try first separating this from MTT? Just run the
>>> command manually in batch mode and see if it works. If that works, then
>>> the problem is with MTT. Otherwise, we have a problem with notification.
>>>
>>> Or are you saying that you have already done this?
>>> Ralph
>>>
>>>
>>> On 8/30/06 8:03 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>
>>>> Yet another point (sorry for the spam). This may not be an MTT issue
>>>> but rather a broken ORTE on the trunk :(
>>>>
>>>> When I try to run in an allocation (srun -N 16 -A), things run fine.
>>>> But if I try to run in batch mode (srun -N 16 -b myscript.sh), then I
>>>> see the same hang as in MTT. It seems that mpirun is not getting
>>>> properly notified of the completion of the job. :(
>>>>
>>>> I'll try to investigate a bit further today. Any thoughts on what
>>>> might be causing this?
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
>>>>
>>>>> Forgot this bit in my mail. With the mpirun just hanging out there, I
>>>>> attached GDB and got the following stack trace:
>>>>> (gdb) bt
>>>>> #0 0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>>>>> #1 0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>>>>> #2 0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
>>>>> #3 0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:514
>>>>> #4 0x0000002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
>>>>> #5 0x000000000040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
>>>>> #6 0x0000000000402f52 in orterun (argc=9, argv=0x7fbffff0b8) at orterun.c:444
>>>>> #7 0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at main.c:13
>>>>>
>>>>> It seems that mpirun is waiting for things to complete :/
>>>>>
>>>>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>>>>>
>>>>>>
>>>>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>>>>>
>>>>>>> On 8/29/06 8:57 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>>>>
>>>>>>>>> Does this apply to *all* tests, or only some of the tests (like
>>>>>>>>> allgather)?
>>>>>>>>
>>>>>>>> All of the tests: Trivial and ibm. They all time out :(
>>>>>>>
>>>>>>> Blah. The trivial tests are simply "hello world", so they should
>>>>>>> take just about no time at all.
>>>>>>>
>>>>>>> Is this running under SLURM? I put the code in there to know how
>>>>>>> many procs to use in SLURM but have not tested it in eons. I doubt
>>>>>>> that's the problem, but that's one thing to check.
>>>>>>>
>>>>>>
>>>>>> Yep, it is in SLURM, and it seems that the 'number of procs' code is
>>>>>> working fine, as it changes with different allocations.
>>>>>>
>>>>>>> Can you set a super-long timeout (e.g., a few minutes), and while
>>>>>>> one of the trivial tests is running, do some ps's on the relevant
>>>>>>> nodes and see what, if anything, is running? E.g., mpirun, the test
>>>>>>> executable on the nodes, etc.
>>>>>>
>>>>>> Without setting a long timeout, it seems that mpirun is running, but
>>>>>> nothing else, and only on the launching node.
>>>>>>
>>>>>> When a test starts, you see mpirun launching properly:
>>>>>> $ ps aux | grep ...
>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
>>>>>> mpiteam 28453 0.2 0.0 38072 3536 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>> mpiteam 28454 0.0 0.0 41716 2040 ? Sl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>> mpiteam 28455 0.0 0.0 23212 1804 ? Ssl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>> mpiteam 28472 0.0 0.0 36956 2256 ? S 09:50 0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>> mpiteam 28482 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
>>>>>> mpiteam 28483 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
>>>>>>
>>>>>> But once the test finishes, mpirun seems to just be hanging out.
>>>>>> $ ps aux | grep ...
>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>>>> mpiteam 15083 0.0 0.0 52760 1040 ? S 09:31 0:00 /bin/bash /var/tmp/slurmd/job148126/script
>>>>>> root 15086 0.0 0.0 42884 3172 ? Ss 09:31 0:00 sshd: mpiteam [priv]
>>>>>> mpiteam 15088 0.0 0.0 43012 3252 ? S 09:31 0:00 sshd: mpiteam_at_pts/1
>>>>>> mpiteam 15089 0.0 0.0 56680 1912 pts/1 Ss 09:31 0:00 -tcsh
>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
>>>>>> mpiteam 28453 0.0 0.0 38204 3568 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> Server Virtualization Business Unit
>>>>>>> Cisco Systems

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems