From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-08-30 14:28:02


So here are the results of my exploration; I have things running now.
The problem was that the user I am running under never sets the
LD_LIBRARY_PATH variable. So when MTT tries to export the variable, it
does:
if (0$LD_LIBRARY_PATH == 0) then
     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib
else
     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib:$LD_LIBRARY_PATH
endif

Since LD_LIBRARY_PATH is not defined, tcsh emits an "Undefined variable"
error on that first test, so the variable never gets set at all.
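
For illustration, here is a minimal way to reproduce that failure mode
in an interactive tcsh session (the exact error wording may vary a bit
between tcsh versions):

<code>
% unsetenv LD_LIBRARY_PATH
% if (0$LD_LIBRARY_PATH == 0) echo "would take the plain-setenv branch"
LD_LIBRARY_PATH: Undefined variable.
</code>

The test is rejected before either setenv branch can run, which is why
the variable ends up unset afterwards.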

I fixed this by always setting it to "" in the .cshrc file. However,
MTT could do a sanity check to see whether the variable is defined
before trying to test its value. Something like:

<code>
# Guarantee that LD_LIBRARY_PATH exists before its value is referenced
if (! $?LD_LIBRARY_PATH) then
       setenv LD_LIBRARY_PATH ""
endif

if (0$LD_LIBRARY_PATH == 0) then
     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib
else
     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib:$LD_LIBRARY_PATH
endif
</code>

or something of the sort.
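
Alternatively (just a sketch, using the same placeholder install path
as above), the existence check and the prepend could be folded into a
single test on $?LD_LIBRARY_PATH, so the value is never referenced when
the variable is undefined:

<code>
if ($?LD_LIBRARY_PATH) then
     # Variable already exists: prepend the install lib dir
     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib:${LD_LIBRARY_PATH}
else
     # Variable does not exist: create it with just the install lib dir
     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib
endif
</code>

(Note this variant does not special-case a defined-but-empty variable,
so in that corner case it would leave a trailing ":" in the value.)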

As another note, could we start a "How to debug MTT" Wiki page with
some of the information that Jeff sent in this message regarding the
dumping of env vars? I think that would be helpful when getting
things started.
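
As a concrete example of the kind of recipe such a page could capture,
here is a sketch of how one might manually pick up the environment from
a particular MTT install, based on Jeff's description below (the path
components in angle brackets are the same placeholders he uses):

<code>
# Pick up PATH / LD_LIBRARY_PATH from a particular MTT install
cd <scratch>/installs/<mpi get>/<mpi install>/<mpi_version>
source ./mpi_installed_vars.csh
# "which mpirun" should now resolve inside that install tree
which mpirun
</code>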

Thanks for all your help, I'm sure I'll have more questions in the
near future.

Cheers,
Josh

On Aug 30, 2006, at 12:31 PM, Jeff Squyres wrote:

> On 8/30/06 12:10 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>
>>> MTT directly sets environment variables in its own environment (via
>>> $ENV{whatever} = "foo") before using fork/exec to launch compiles
>>> and runs.
>>> Hence, the forked children inherit the environment variables that
>>> we set
>>> (E.g., PATH and LD_LIBRARY_PATH).
>>>
>>> So if you source the env vars files that MTT drops, that should be
>>> sufficient.
>>
>> Does it drop them to a file, or are they printed in the debugging output
>> anywhere? I'm having a bit of trouble finding these strings in the
>> output.
>
> It does not put these in the -debug output.
>
> The files that it drops are in the scratch dir. You'll need to go
> into
> scratch/installs, and then it depends on what your INI file section
> names
> are. You'll go to:
>
> <scratch>/installs/<mpi get>/<mpi install>/<mpi_version>/
>
> And there should be files named "mpi_installed_vars.[csh|sh]" that
> you can
> source, depending on your shell. They should set PATH and
> LD_LIBRARY_PATH.
>
> The intent of these files is for exactly this purpose -- for a
> human to test
> borked MPI installs inside the MTT scratch tree.
>
>>>
>>> As for setting the values on *remote* nodes, we do it solely via the
>>> --prefix option. I wonder if --prefix is broken under SLURM...?
>>> That might
>>> be something to check -- you might be inadvertently mixing
>>> installations of
>>> OMPI...?
>>
>> Yep I'll check it out.
>>
>> Cheers,
>> Josh
>>
>>>
>>>
>>> On 8/30/06 10:36 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>
>>>> I'm trying to replicate the MTT environment as much as possible,
>>>> and
>>>> have a couple of questions.
>>>>
>>>> Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start
>>>> MTT. After MTT builds Open MPI, how does it export these
>>>> variables so
>>>> that it can build the tests? How does it export these when it runs
>>>> those tests (solely via --prefix)?
>>>>
>>>> Cheers,
>>>> josh
>>>>
>>>> On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:
>>>>
>>>>> I already tried that. However I'm trying it in a couple different
>>>>> ways and getting some mixed results. Let me formulate the error
>>>>> cases
>>>>> and get back to you.
>>>>>
>>>>> Cheers,
>>>>> Josh
>>>>>
>>>>> On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:
>>>>>
>>>>>> Well, why don't you try first separating this from MTT? Just run
>>>>>> the command
>>>>>> manually in batch mode and see if it works. If that works,
>>>>>> then the
>>>>>> problem
>>>>>> is with MTT. Otherwise, we have a problem with notification.
>>>>>>
>>>>>> Or are you saying that you have already done this?
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On 8/30/06 8:03 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>>>
>>>>>>> yet another point (sorry for the spam). This may not be an MTT
>>>>>>> issue
>>>>>>> but a broken ORTE on the trunk :(
>>>>>>>
>>>>>>> When I try to run in an allocation (srun -N 16 -A) things run
>>>>>>> fine.
>>>>>>> But if I try to run in batch mode (srun -N 16 -b myscript.sh)
>>>>>>> then I
>>>>>>> see the same hang as in MTT. It seems that mpirun is not getting
>>>>>>> properly notified of the completion of the job. :(
>>>>>>>
>>>>>>> I'll try to investigate a bit further today. Any thoughts on
>>>>>>> what
>>>>>>> might be causing this?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Josh
>>>>>>>
>>>>>>> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
>>>>>>>
>>>>>>>> forgot this bit in my mail. With the mpirun just hanging out
>>>>>>>> there I
>>>>>>>> attached GDB and got the following stack trace:
>>>>>>>> (gdb) bt
>>>>>>>> #0 0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>>>>>>>> #1 0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0,
>>>>>>>> arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>>>>>>>> #2 0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0,
>>>>>>>> flags=5) at event.c:584
>>>>>>>> #3 0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:
>>>>>>>> 514
>>>>>>>> #4 0x0000002a956db7c7 in opal_progress () at runtime/
>>>>>>>> opal_progress.c:
>>>>>>>> 259
>>>>>>>> #5 0x000000000040334c in opal_condition_wait (c=0x509650,
>>>>>>>> m=0x509600) at ../../../opal/threads/condition.h:81
>>>>>>>> #6 0x0000000000402f52 in orterun (argc=9,
>>>>>>>> argv=0x7fbffff0b8) at
>>>>>>>> orterun.c:444
>>>>>>>> #7 0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at
>>>>>>>> main.c:13
>>>>>>>>
>>>>>>>> Seems that mpirun is waiting for things to complete :/
>>>>>>>>
>>>>>>>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>>>>>>>>
>>>>>>>>>> On 8/29/06 8:57 PM, "Josh Hursey" <jjhursey_at_[hidden]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>> Does this apply to *all* tests, or only some of the tests
>>>>>>>>>>>> (like
>>>>>>>>>>>> allgather)?
>>>>>>>>>>>
>>>>>>>>>>> All of the tests: Trivial and ibm. They all timeout :(
>>>>>>>>>>
>>>>>>>>>> Blah. The trivial tests are simply "hello world", so they
>>>>>>>>>> should
>>>>>>>>>> take just
>>>>>>>>>> about no time at all.
>>>>>>>>>>
>>>>>>>>>> Is this running under SLURM? I put the code in there to
>>>>>>>>>> know how
>>>>>>>>>> many procs
>>>>>>>>>> to use in SLURM but have not tested it in eons. I doubt
>>>>>>>>>> that's
>>>>>>>>>> the
>>>>>>>>>> problem,
>>>>>>>>>> but that's one thing to check.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yep, it is in SLURM, and it seems that the 'number of procs'
>>>>>>>>> code is
>>>>>>>>> working fine as it changes with different allocations.
>>>>>>>>>
>>>>>>>>>> Can you set a super-long timeout (e.g., a few minutes), and
>>>>>>>>>> while
>>>>>>>>>> one of the
>>>>>>>>>> trivial tests is running, do some ps's on the relevant nodes
>>>>>>>>>> and
>>>>>>>>>> see what,
>>>>>>>>>> if anything, is running? E.g., mpirun, the test
>>>>>>>>>> executable on
>>>>>>>>>> the
>>>>>>>>>> nodes,
>>>>>>>>>> etc.
>>>>>>>>>
>>>>>>>>> Without setting a long timeout, it seems that mpirun is running,
>>>>>>>>> but nothing else, and only on the launching node.
>>>>>>>>>
>>>>>>>>> When a test starts you see the mpirun launching properly:
>>>>>>>>> $ ps aux | grep ...
>>>>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME
>>>>>>>>> COMMAND
>>>>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06
>>>>>>>>> perl ./
>>>>>>>>> client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --
>>>>>>>>> file /u/
>>>>>>>>> mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00
>>>>>>>>> [sh]
>>>>>>>>> <defunct>
>>>>>>>>> mpiteam 28453 0.2 0.0 38072 3536 ? S 09:50 0:00
>>>>>>>>> mpirun
>>>>>>>>> -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/
>>>>>>>>> mtt-
>>>>>>>>> scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/
>>>>>>>>> 1.3a1r11497/
>>>>>>>>> install collective/allgather_in_place
>>>>>>>>> mpiteam 28454 0.0 0.0 41716 2040 ? Sl 09:50 0:00
>>>>>>>>> srun --nodes=16 --ntasks=16 --
>>>>>>>>> nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,
>>>>>>>>> odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007
>>>>>>>>> orted --no-daemonize --bootproxy 1 --ns-nds slurm --name
>>>>>>>>> 0.0.1 --
>>>>>>>>> num_procs 16 --vpid_start 0 --universe
>>>>>>>>> mpiteam_at_[hidden]:default-universe-28453 --
>>>>>>>>> nsreplica
>>>>>>>>> "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://
>>>>>>>>> 129.79.240.107:40904"
>>>>>>>>> mpiteam 28455 0.0 0.0 23212 1804 ? Ssl 09:50 0:00
>>>>>>>>> srun --nodes=16 --ntasks=16 --
>>>>>>>>> nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,
>>>>>>>>> odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007
>>>>>>>>> orted --no-daemonize --bootproxy 1 --ns-nds slurm --name
>>>>>>>>> 0.0.1 --
>>>>>>>>> num_procs 16 --vpid_start 0 --universe
>>>>>>>>> mpiteam_at_[hidden]:default-universe-28453 --
>>>>>>>>> nsreplica
>>>>>>>>> "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://
>>>>>>>>> 129.79.240.107:40904"
>>>>>>>>> mpiteam 28472 0.0 0.0 36956 2256 ? S 09:50
>>>>>>>>> 0:00 /
>>>>>>>>> san/
>>>>>>>>> homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/
>>>>>>>>> odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-
>>>>>>>>> daemonize --
>>>>>>>>> bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --
>>>>>>>>> vpid_start 0
>>>>>>>>> --universe mpiteam_at_[hidden]:default-
>>>>>>>>> universe-28453 --
>>>>>>>>> nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica
>>>>>>>>> "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>>> mpiteam 28482 0.1 0.0 64296 3564 ? S 09:50 0:00
>>>>>>>>> collective/allgather_in_place
>>>>>>>>> mpiteam 28483 0.1 0.0 64296 3564 ? S 09:50 0:00
>>>>>>>>> collective/allgather_in_place
>>>>>>>>>
>>>>>>>>> But once the test finishes, mpirun seems to just be hanging
>>>>>>>>> out.
>>>>>>>>> $ ps aux | grep ...
>>>>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME
>>>>>>>>> COMMAND
>>>>>>>>> mpiteam 15083 0.0 0.0 52760 1040 ? S 09:31
>>>>>>>>> 0:00 /
>>>>>>>>> bin/
>>>>>>>>> bash /var/tmp/slurmd/job148126/script
>>>>>>>>> root 15086 0.0 0.0 42884 3172 ? Ss 09:31 0:00
>>>>>>>>> sshd:
>>>>>>>>> mpiteam [priv]
>>>>>>>>> mpiteam 15088 0.0 0.0 43012 3252 ? S 09:31 0:00
>>>>>>>>> sshd:
>>>>>>>>> mpiteam_at_pts/1
>>>>>>>>> mpiteam 15089 0.0 0.0 56680 1912 pts/1 Ss 09:31
>>>>>>>>> 0:00 -
>>>>>>>>> tcsh
>>>>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06
>>>>>>>>> perl ./
>>>>>>>>> client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --
>>>>>>>>> file /u/
>>>>>>>>> mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00
>>>>>>>>> [sh]
>>>>>>>>> <defunct>
>>>>>>>>> mpiteam 28453 0.0 0.0 38204 3568 ? S 09:50 0:00
>>>>>>>>> mpirun
>>>>>>>>> -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/
>>>>>>>>> mtt-
>>>>>>>>> scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/
>>>>>>>>> 1.3a1r11497/
>>>>>>>>> install collective/allgather_in_place
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jeff Squyres
>>>>>>>>>> Server Virtualization Business Unit
>>>>>>>>>> Cisco Systems
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> mtt-users mailing list
>>>>>>>>> mtt-users_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> mtt-users mailing list
>>>>>>>> mtt-users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
>>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> mtt-users mailing list
>>>>> mtt-users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
>>>>
>>>> _______________________________________________
>>>> mtt-users mailing list
>>>> mtt-users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Server Virtualization Business Unit
>>> Cisco Systems
>>
>> _______________________________________________
>> mtt-users mailing list
>> mtt-users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
>
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems