From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-08-30 14:30:40


Bah!

This is the result of perl expanding $? to 0 -- it seems that I need to
escape $? so that it's not output as 0.
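
For the record, a minimal demo of the expansion (and the escape):

<code>
#!/usr/bin/env perl
# Inside a double-quoted string, perl interpolates $? (the exit status
# of the last child process, usually 0), so "$?LD_LIBRARY_PATH" comes
# out as "0LD_LIBRARY_PATH" in the generated csh file.
my $buggy = "if ($?LD_LIBRARY_PATH == 0) then\n";

# Escaping the dollar sign lets the literal text reach tcsh intact.
my $fixed = "if (\$?LD_LIBRARY_PATH == 0) then\n";

print $buggy;   # if (0LD_LIBRARY_PATH == 0) then
print $fixed;   # if ($?LD_LIBRARY_PATH == 0) then
</code>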

Sorry about that!

So does this fix just the sourced env-var files, or your overall
(hanging) problem?

On 8/30/06 2:28 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:

> So here are the results of my exploration. I have things running now.
> The problem was that the user I am running under does not set the
> LD_LIBRARY_PATH variable at any point. So when MTT tries to export
> the variable, it does:
> if (0LD_LIBRARY_PATH == 0) then
>     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib
> else
>     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib:$LD_LIBRARY_PATH
> endif
>
> So this causes tcsh to emit an error that LD_LIBRARY_PATH is not
> defined, and the variable never gets set.
>
> I fixed this by always setting LD_LIBRARY_PATH to "" in my .cshrc
> file. However, MTT could do a sanity check to see if the variable is
> defined before checking its value. Something like:
>
> <code>
> if (! $?LD_LIBRARY_PATH) then
>     setenv LD_LIBRARY_PATH ""
> endif
>
> if ("$LD_LIBRARY_PATH" == "") then
>     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib
> else
>     setenv LD_LIBRARY_PATH /san/<rest_of_path>/install/lib:$LD_LIBRARY_PATH
> endif
> </code>
>
> or something of the sort.
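>
> Incidentally, the emitting side could sidestep the escaping problem
> entirely by writing the csh block from a single-quoted heredoc, which
> perl never interpolates. A minimal sketch (not MTT's actual code):
>
> <code>
> # Nothing inside a <<'EOF' heredoc is interpolated, so $? and
> # $LD_LIBRARY_PATH pass through to the generated csh file untouched.
> open(my $fh, '>', 'mpi_installed_vars.csh') or die "open failed: $!";
> print $fh <<'EOF';
> if (! $?LD_LIBRARY_PATH) then
>     setenv LD_LIBRARY_PATH ""
> endif
> EOF
> close($fh);
> </code>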
>
> As another note, could we start a "How to debug MTT" Wiki page with
> some of the information that Jeff sent in this message regarding the
> dumping of env vars? I think that would be helpful when getting
> things started.
>
> Thanks for all your help; I'm sure I'll have more questions in the
> near future.
>
> Cheers,
> Josh
>
>
> On Aug 30, 2006, at 12:31 PM, Jeff Squyres wrote:
>
>> On 8/30/06 12:10 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>
>>>> MTT directly sets environment variables in its own environment (via
>>>> $ENV{whatever} = "foo") before using fork/exec to launch compiles
>>>> and runs. Hence, the forked children inherit the environment
>>>> variables that we set (e.g., PATH and LD_LIBRARY_PATH).
>>>>
>>>> So if you source the env vars files that MTT drops, that should be
>>>> sufficient.
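>>>>
>>>> In code, that pattern looks roughly like this (a sketch; the install
>>>> path and compile command are illustrative, not MTT's actual code):
>>>>
>>>> <code>
>>>> # Set vars in MTT's own environment; forked children inherit them.
>>>> $ENV{PATH} = "/san/<rest_of_path>/install/bin:$ENV{PATH}";
>>>> $ENV{LD_LIBRARY_PATH} = "/san/<rest_of_path>/install/lib";
>>>>
>>>> my $pid = fork();
>>>> die "fork failed: $!" unless defined $pid;
>>>> if ($pid == 0) {
>>>>     # Child: runs a compile with the modified environment.
>>>>     exec("mpicc", "hello.c", "-o", "hello") or die "exec failed: $!";
>>>> }
>>>> waitpid($pid, 0);
>>>> </code>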
>>>
>>> Does it drop them to a file, or is it printed in the debugging output
>>> anywhere? I'm having a bit of trouble finding these strings in the
>>> output.
>>
>> It does not put these in the -debug output.
>>
>> The files that it drops are in the scratch dir. You'll need to go into
>> scratch/installs, and then it depends on what your INI file section
>> names are. You'll go to:
>>
>> <scratch>/installs/<mpi get>/<mpi install>/<mpi_version>/
>>
>> And there should be files named "mpi_installed_vars.[csh|sh]" that
>> you can source, depending on your shell. They set PATH and
>> LD_LIBRARY_PATH.
>>
>> These files exist for exactly this purpose -- so a human can test
>> borked MPI installs inside the MTT scratch tree.
>>
>>>>
>>>> As for setting the values on *remote* nodes, we do it solely via the
>>>> --prefix option. I wonder if --prefix is broken under SLURM...? That
>>>> might be something to check -- you might be inadvertently mixing
>>>> installations of OMPI...?
>>>
>>> Yep I'll check it out.
>>>
>>> Cheers,
>>> Josh
>>>
>>>>
>>>>
>>>> On 8/30/06 10:36 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>
>>>>> I'm trying to replicate the MTT environment as much as possible,
>>>>> and have a couple of questions.
>>>>>
>>>>> Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start
>>>>> MTT. After MTT builds Open MPI, how does it export these variables
>>>>> so that it can build the tests? How does it export these when it
>>>>> runs those tests (solely via --prefix)?
>>>>>
>>>>> Cheers,
>>>>> josh
>>>>>
>>>>> On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:
>>>>>
>>>>>> I already tried that. However, I'm trying it in a couple of
>>>>>> different ways and getting some mixed results. Let me formulate
>>>>>> the error cases and get back to you.
>>>>>>
>>>>>> Cheers,
>>>>>> Josh
>>>>>>
>>>>>> On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:
>>>>>>
>>>>>>> Well, why don't you first try separating this from MTT? Just run
>>>>>>> the command manually in batch mode and see if it works. If it
>>>>>>> does, then the problem is with MTT. Otherwise, we have a problem
>>>>>>> with notification.
>>>>>>>
>>>>>>> Or are you saying that you have already done this?
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>> On 8/30/06 8:03 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> Yet another point (sorry for the spam). This may not be an MTT
>>>>>>>> issue but a broken ORTE on the trunk :(
>>>>>>>>
>>>>>>>> When I try to run in an allocation (srun -N 16 -A), things run
>>>>>>>> fine. But if I try to run in batch mode (srun -N 16 -b
>>>>>>>> myscript.sh), then I see the same hang as in MTT. It seems that
>>>>>>>> mpirun is not getting properly notified of the completion of
>>>>>>>> the job. :(
>>>>>>>>
>>>>>>>> I'll try to investigate a bit further today. Any thoughts on
>>>>>>>> what might be causing this?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Josh
>>>>>>>>
>>>>>>>> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
>>>>>>>>
>>>>>>>>> I forgot this bit in my mail. With the mpirun just hanging out
>>>>>>>>> there, I attached GDB and got the following stack trace:
>>>>>>>>> (gdb) bt
>>>>>>>>> #0 0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>>>>>>>>> #1 0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>>>>>>>>> #2 0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
>>>>>>>>> #3 0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:514
>>>>>>>>> #4 0x0000002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
>>>>>>>>> #5 0x000000000040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
>>>>>>>>> #6 0x0000000000402f52 in orterun (argc=9, argv=0x7fbffff0b8) at orterun.c:444
>>>>>>>>> #7 0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at main.c:13
>>>>>>>>>
>>>>>>>>> Seems that mpirun is waiting for things to complete :/
>>>>>>>>>
>>>>>>>>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>>>>>>>>>
>>>>>>>>>>> On 8/29/06 8:57 PM, "Josh Hursey" <jjhursey_at_[hidden]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Does this apply to *all* tests, or only some of the tests
>>>>>>>>>>>> (like
>>>>>>>>>>>> allgather)?
>>>>>>>>>>>>
>>>>>>>>>>>> All of the tests: Trivial and ibm. They all time out :(
>>>>>>>>>>>
>>>>>>>>>>> Blah. The trivial tests are simply "hello world", so they
>>>>>>>>>>> should take just about no time at all.
>>>>>>>>>>>
>>>>>>>>>>> Is this running under SLURM? I put the code in there to know
>>>>>>>>>>> how many procs to use in SLURM but have not tested it in
>>>>>>>>>>> eons. I doubt that's the problem, but that's one thing to
>>>>>>>>>>> check.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yep, it is in SLURM. And it seems that the 'number of procs'
>>>>>>>>>> code is working fine, as it changes with different
>>>>>>>>>> allocations.
>>>>>>>>>>
>>>>>>>>>>> Can you set a super-long timeout (e.g., a few minutes), and
>>>>>>>>>>> while one of the trivial tests is running, do some ps's on
>>>>>>>>>>> the relevant nodes and see what, if anything, is running?
>>>>>>>>>>> E.g., mpirun, the test executable on the nodes, etc.
>>>>>>>>>>
>>>>>>>>>> Without setting a long timeout, it seems that mpirun is
>>>>>>>>>> running, but nothing else, and only on the launching node.
>>>>>>>>>>
>>>>>>>>>> When a test starts you see the mpirun launching properly:
>>>>>>>>>> $ ps aux | grep ...
>>>>>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>>>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
>>>>>>>>>> mpiteam 28453 0.2 0.0 38072 3536 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>>>>>> mpiteam 28454 0.0 0.0 41716 2040 ? Sl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>>>> mpiteam 28455 0.0 0.0 23212 1804 ? Ssl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>>>> mpiteam 28472 0.0 0.0 36956 2256 ? S 09:50 0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>>>>>>>>> mpiteam 28482 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
>>>>>>>>>> mpiteam 28483 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
>>>>>>>>>>
>>>>>>>>>> But once the test finishes, mpirun seems to just be hanging
>>>>>>>>>> out.
>>>>>>>>>> $ ps aux | grep ...
>>>>>>>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>>>>>>>> mpiteam 15083 0.0 0.0 52760 1040 ? S 09:31 0:00 /bin/bash /var/tmp/slurmd/job148126/script
>>>>>>>>>> root 15086 0.0 0.0 42884 3172 ? Ss 09:31 0:00 sshd: mpiteam [priv]
>>>>>>>>>> mpiteam 15088 0.0 0.0 43012 3252 ? S 09:31 0:00 sshd: mpiteam_at_pts/1
>>>>>>>>>> mpiteam 15089 0.0 0.0 56680 1912 pts/1 Ss 09:31 0:00 -tcsh
>>>>>>>>>> mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>>>>>>>>> mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
>>>>>>>>>> mpiteam 28453 0.0 0.0 38204 3568 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>> Server Virtualization Business Unit
>>>>>>>>>>> Cisco Systems
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Server Virtualization Business Unit
>>>> Cisco Systems
>>>
>>
>>
>> --
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems
>

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems