Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Working directory isn't set properly on Linux cluster
From: Todd Gamblin (tgamblin_at_[hidden])
Date: 2008-06-23 10:17:40


That would be nice -- it might prevent morons like me from getting it
wrong. Repeatedly :-).

-Todd

On Jun 23, 2008, at 6:58 AM, Jeff Squyres wrote:

> We don't have a strong desire to fix this in 1.2.7 -- especially
> since you're the first person ever to run across this issue. :-)
>
> Looks like this is easy enough to put into v1.3, though.
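The fix being floated here would essentially amount to exporting PWD alongside the chdir() that the launcher already performs for the requested working directory. A minimal sketch of the idea in plain C (this is not Open MPI's actual launcher code, and the function name is hypothetical):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical launcher step: after chdir()ing to the requested working
 * directory (e.g. the -wdir argument), also refresh $PWD so that apps
 * which consult the environment see the same directory getcwd() reports. */
static int set_wdir_and_pwd(const char *wdir)
{
    char buf[BUFSIZ];

    if (chdir(wdir) != 0) {                  /* change the real working directory */
        perror("chdir");
        return -1;
    }
    if (getcwd(buf, sizeof(buf)) == NULL) {  /* canonical path of the new cwd */
        perror("getcwd");
        return -1;
    }
    return setenv("PWD", buf, 1);            /* overwrite the stale, inherited PWD */
}

With something like this run before the user app is exec'd, the $PWD check and the getcwd() check would agree.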
>
>
>
> On Jun 23, 2008, at 9:52 AM, Todd Gamblin wrote:
>
>> Thanks for pointing this out (I'm not sure how I got that wrong in
>> the test) -- making the test program do the right thing gives:
>>
>>> (merle):test$ mpirun -np 4 test
>>> before MPI_Init:
>>> PWD: /home/tgamblin
>>> getcwd: /home/tgamblin/test
>>> before MPI_Init:
>>> PWD: /home/tgamblin
>>> getcwd: /home/tgamblin/test
>>>
>>> etc...
>>
>> -Todd
>>
>>
>> On Jun 23, 2008, at 5:03 AM, Jeff Squyres wrote:
>>
>>> I think the issue here is that your test app is checking $PWD, not
>>> getcwd().
>>>
>>> If you call getcwd(), you'll get the right answer (see my tests
>>> below). But your point is noted that perhaps OMPI should be
>>> setting PWD to the correct value before launching the user app.
>>>
>>> [5:01] svbu-mpi:~/tmp % salloc -N 1 tcsh
>>> salloc: Granted job allocation 5311
>>> [5:01] svbu-mpi:~/tmp % mpirun -np 1 pwd
>>> /home/jsquyres/tmp
>>> [5:01] svbu-mpi:~/tmp % mpirun -np 1 -wdir ~/mpi pwd
>>> /home/jsquyres/mpi
>>> [5:01] svbu-mpi:~/tmp % cat foo.c
>>> #include <stdio.h>
>>> #include <unistd.h>
>>>
>>> int main() {
>>>     char buf[BUFSIZ];
>>>
>>>     getcwd(buf, BUFSIZ);
>>>     printf("CWD is %s\n", buf);
>>>     return 0;
>>> }
>>> [5:01] svbu-mpi:~/tmp % gcc foo.c -o foo
>>> [5:01] svbu-mpi:~/tmp % mpirun -np 1 foo
>>> CWD is /home/jsquyres/tmp
>>> [5:01] svbu-mpi:~/tmp % mpirun -np 1 -wdir ~/mpi ~/tmp/foo
>>> CWD is /home/jsquyres/mpi
>>> [5:01] svbu-mpi:~/tmp %
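The distinction Jeff is drawing: $PWD is just an environment variable that the shell set and the child process inherited; chdir() never touches it, while getcwd() always reflects the process's real current directory. A small standalone illustration (assuming a /tmp directory to chdir into):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    char buf[BUFSIZ];
    const char *pwd;

    chdir("/tmp");                     /* change the real working directory */
    getcwd(buf, sizeof(buf));          /* reflects the chdir() */
    pwd = getenv("PWD");               /* still whatever the parent shell exported */

    printf("getcwd: %s\n", buf);                   /* -> /tmp */
    printf("PWD:    %s\n", pwd ? pwd : "(unset)"); /* -> the directory we started in */
    return 0;
}

So a launcher can chdir() a process into the right place and still leave a stale $PWD behind, which is exactly the mismatch Todd's test tripped over.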
>>>
>>>
>>>
>>> On Jun 22, 2008, at 12:14 AM, Todd Gamblin wrote:
>>>
>>>> I'm having trouble getting OpenMPI to set the working directory
>>>> properly when running jobs on a Linux cluster. I made a test
>>>> program (at end of post) that recreates the problem pretty well
>>>> by just printing out the results of getcwd(). Here's output both
>>>> with and without using -wdir:
>>>>
>>>>> (merle):~$ cd test
>>>>> (merle):test$ mpirun -np 2 test
>>>>> before MPI_Init:
>>>>> PWD: /home/tgamblin
>>>>> getcwd: /home/tgamblin
>>>>> before MPI_Init:
>>>>> PWD: /home/tgamblin
>>>>> getcwd: /home/tgamblin
>>>>> after MPI_Init:
>>>>> PWD: /home/tgamblin
>>>>> getcwd: /home/tgamblin
>>>>> after MPI_Init:
>>>>> PWD: /home/tgamblin
>>>>> getcwd: /home/tgamblin
>>>>> (merle):test$ mpirun -np 2 -wdir /home/tgamblin/test test
>>>>> before MPI_Init:
>>>>> PWD: /home/tgamblin
>>>>> getcwd: /home/tgamblin
>>>>> before MPI_Init:
>>>>> PWD: /home/tgamblin
>>>>> getcwd: /home/tgamblin
>>>>> after MPI_Init:
>>>>> PWD: /home/tgamblin
>>>>> getcwd: /home/tgamblin
>>>>> after MPI_Init:
>>>>> PWD: /home/tgamblin
>>>>> getcwd: /home/tgamblin
>>>>
>>>>
>>>> Shouldn't these print out /home/tgamblin/test? Also, this is
>>>> even stranger:
>>>>
>>>>> (merle):test$ mpirun -np 2 pwd
>>>>> /home/tgamblin/test
>>>>> /home/tgamblin/test
>>>>
>>>>
>>>> I feel like my program should output the same thing as pwd.
>>>>
>>>> I'm using OpenMPI 1.2.6, and the cluster has 8 nodes, each with
>>>> two dual-core Woodcrest processors (32 cores total). There are two
>>>> TCP networks on this cluster: one that the head node uses to talk
>>>> to the compute nodes, and a Gigabit network that the compute nodes
>>>> use to reach each other (but not the head node). I have
>>>> "btl_tcp_if_include = eth2" in my MCA params file to make the
>>>> compute nodes use the fast interconnect to talk to each other, and
>>>> I've pasted ifconfig output for the head node and for one compute
>>>> node below. Also, if it helps, the home directories on this
>>>> machine are mounted via autofs.
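For reference, an MCA params file is plain "key = value" text; a minimal sketch of the setting described above, assuming the usual per-user location $HOME/.openmpi/mca-params.conf:

# $HOME/.openmpi/mca-params.conf  (per-user Open MPI MCA parameters)
# Restrict the TCP BTL to the compute nodes' Gigabit interface
btl_tcp_if_include = eth2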
>>>>
>>>> This is causing problems because I'm using apps that look for
>>>> their config file in the working directory. Please let me know if
>>>> you guys have any idea what's going on.
>>>>
>>>> Thanks!
>>>> -Todd
>>>>
>>>>
>>>> TEST PROGRAM:
>>>>> #include "mpi.h"
>>>>> #include <cstdlib>
>>>>> #include <iostream>
>>>>> #include <sstream>
>>>>> using namespace std;
>>>>>
>>>>> void testdir(const char *where) {
>>>>>     char buf[1024];
>>>>>     getcwd(buf, 1024);
>>>>>
>>>>>     ostringstream tmp;
>>>>>     tmp << where << ":" << endl
>>>>>         << "\tPWD:\t" << getenv("PWD") << endl
>>>>>         << "\tgetcwd:\t" << getenv("PWD") << endl;
>>>>>     cout << tmp.str();
>>>>> }
>>>>>
>>>>> int main(int argc, char **argv) {
>>>>>     testdir("before MPI_Init");
>>>>>     MPI_Init(&argc, &argv);
>>>>>     testdir("after MPI_Init");
>>>>>     MPI_Finalize();
>>>>> }
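The bug Jeff points out above is on the last stream-insertion line of testdir(): it prints getenv("PWD") a second time instead of the buffer that getcwd() filled in. The "do the right thing" version Todd refers to near the top of the thread presumably looks like:

void testdir(const char *where) {
    char buf[1024];
    getcwd(buf, 1024);

    ostringstream tmp;
    tmp << where << ":" << endl
        << "\tPWD:\t" << getenv("PWD") << endl
        << "\tgetcwd:\t" << buf << endl;   // print buf, not $PWD again
    cout << tmp.str();
}

With that change, the output matches Todd's corrected run at the top of the thread: PWD still shows the stale /home/tgamblin, while getcwd reports /home/tgamblin/test.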
>>>>
>>>> HEAD NODE IFCONFIG:
>>>>> eth0 Link encap:Ethernet HWaddr 00:18:8B:2F:3D:90
>>>>> inet addr:10.6.1.1 Bcast:10.6.1.255 Mask:255.255.255.0
>>>>> inet6 addr: fe80::218:8bff:fe2f:3d90/64 Scope:Link
>>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>> RX packets:1579250319 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:874273636 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:1000
>>>>> RX bytes:2361367146846 (2.1 TiB) TX bytes:85373933521 (79.5 GiB)
>>>>> Interrupt:169 Memory:f4000000-f4011100
>>>>>
>>>>> eth0:1 Link encap:Ethernet HWaddr 00:18:8B:2F:3D:90
>>>>> inet addr:10.6.2.1 Bcast:10.6.2.255 Mask:255.255.255.0
>>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>> Interrupt:169 Memory:f4000000-f4011100
>>>>>
>>>>> eth1 Link encap:Ethernet HWaddr 00:18:8B:2F:3D:8E
>>>>> inet addr:152.54.1.21 Bcast:152.54.3.255 Mask:255.255.252.0
>>>>> inet6 addr: fe80::218:8bff:fe2f:3d8e/64 Scope:Link
>>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>> RX packets:14436523 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:7357596 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:1000
>>>>> RX bytes:2354451258 (2.1 GiB) TX bytes:2218390772 (2.0 GiB)
>>>>> Interrupt:169 Memory:f8000000-f8011100
>>>>>
>>>>> lo Link encap:Local Loopback
>>>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>>>> inet6 addr: ::1/128 Scope:Host
>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>>>> RX packets:540889623 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:540889623 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:0
>>>>> RX bytes:63787539844 (59.4 GiB) TX bytes:63787539844 (59.4 GiB)
>>>>>
>>>>
>>>> COMPUTE NODE IFCONFIG:
>>>>> (compute-0-0):~$ ifconfig
>>>>> eth0 Link encap:Ethernet HWaddr 00:13:72:FA:42:ED
>>>>> inet addr:10.6.1.254 Bcast:10.6.1.255 Mask:255.255.255.0
>>>>> inet6 addr: fe80::213:72ff:fefa:42ed/64 Scope:Link
>>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>> RX packets:200637 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:165336 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:1000
>>>>> RX bytes:187105568 (178.4 MiB) TX bytes:26263945 (25.0 MiB)
>>>>> Interrupt:169 Memory:f8000000-f8011100
>>>>>
>>>>> eth2 Link encap:Ethernet HWaddr 00:15:17:0E:9E:68
>>>>> inet addr:10.6.2.254 Bcast:10.6.2.255 Mask:255.255.255.0
>>>>> inet6 addr: fe80::215:17ff:fe0e:9e68/64 Scope:Link
>>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>> RX packets:20 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:1000
>>>>> RX bytes:1280 (1.2 KiB) TX bytes:590 (590.0 b)
>>>>> Base address:0xdce0 Memory:fc3e0000-fc400000
>>>>>
>>>>> lo Link encap:Local Loopback
>>>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>>>> inet6 addr: ::1/128 Scope:Host
>>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>>>> RX packets:65 errors:0 dropped:0 overruns:0 frame:0
>>>>> TX packets:65 errors:0 dropped:0 overruns:0 carrier:0
>>>>> collisions:0 txqueuelen:0
>>>>> RX bytes:4376 (4.2 KiB) TX bytes:4376 (4.2 KiB)
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users