Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Working directory isn't set properly on Linux cluster
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-23 09:58:43


We don't have a strong desire to fix this in 1.2.7 -- especially since
you're the first person ever to run across this issue. :-)

Looks like this is easy enough to put into v1.3, though.
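
The fix is conceptually just to update PWD to match the directory the
app is chdir()'d into before the exec, so that $PWD and getcwd() agree.
A rough sketch of the idea (simplified, and not the actual Open MPI
launch code):

/* launch_in_wdir.c -- simplified sketch of the idea, not Open MPI's
 * real launch path.  Change into the requested working directory,
 * keep $PWD in sync, then exec the user's program. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    char cwd[4096];

    if (argc < 3) {
        fprintf(stderr, "usage: %s <wdir> <program> [args...]\n", argv[0]);
        return 1;
    }
    if (chdir(argv[1]) != 0) {              /* move into the requested wdir */
        perror("chdir");
        return 1;
    }
    if (getcwd(cwd, sizeof(cwd)) != NULL) { /* keep $PWD consistent with getcwd() */
        setenv("PWD", cwd, 1);
    }
    execvp(argv[2], &argv[2]);              /* hand off to the user's program */
    perror("execvp");                       /* only reached if exec fails */
    return 1;
}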

On Jun 23, 2008, at 9:52 AM, Todd Gamblin wrote:

> Thanks for pointing this out (I'm not sure how I got that wrong in
> the test) -- making the test program do the right thing gives:
>
>> (merle):test$ mpirun -np 4 test
>> before MPI_Init:
>> PWD: /home/tgamblin
>> getcwd: /home/tgamblin/test
>> before MPI_Init:
>> PWD: /home/tgamblin
>> getcwd: /home/tgamblin/test
>>
>> etc...
>
> -Todd
>
>
> On Jun 23, 2008, at 5:03 AM, Jeff Squyres wrote:
>
>> I think the issue here is that your test app is checking $PWD, not
>> getcwd().
>>
>> If you call getcwd(), you'll get the right answer (see my tests
>> below). But your point is noted that perhaps OMPI should be
>> setting PWD to the correct value before launching the user app.
>>
>> [5:01] svbu-mpi:~/tmp % salloc -N 1 tcsh
>> salloc: Granted job allocation 5311
>> [5:01] svbu-mpi:~/tmp % mpirun -np 1 pwd
>> /home/jsquyres/tmp
>> [5:01] svbu-mpi:~/tmp % mpirun -np 1 -wdir ~/mpi pwd
>> /home/jsquyres/mpi
>> [5:01] svbu-mpi:~/tmp % cat foo.c
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> int main() {
>>   char buf[BUFSIZ];
>>
>>   getcwd(buf, BUFSIZ);
>>   printf("CWD is %s\n", buf);
>>   return 0;
>> }
>> [5:01] svbu-mpi:~/tmp % gcc foo.c -o foo
>> [5:01] svbu-mpi:~/tmp % mpirun -np 1 foo
>> CWD is /home/jsquyres/tmp
>> [5:01] svbu-mpi:~/tmp % mpirun -np 1 -wdir ~/mpi ~/tmp/foo
>> CWD is /home/jsquyres/mpi
>> [5:01] svbu-mpi:~/tmp %
>>
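>> In the meantime, an app that really wants $PWD to be right can just
>> refresh it from getcwd() at startup -- a minimal sketch (same idea
>> as foo.c above, plus a setenv call):
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>>
>> int main() {
>>     char buf[BUFSIZ];
>>
>>     /* trust the kernel's notion of the cwd, not the inherited $PWD */
>>     if (getcwd(buf, BUFSIZ) != NULL) {
>>         setenv("PWD", buf, 1);
>>     }
>>     printf("PWD is %s\n", getenv("PWD"));
>>     return 0;
>> }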
>>
>>
>> On Jun 22, 2008, at 12:14 AM, Todd Gamblin wrote:
>>
>>> I'm having trouble getting OpenMPI to set the working directory
>>> properly when running jobs on a Linux cluster. I made a test
>>> program (at end of post) that recreates the problem pretty well by
>>> just printing out the results of getcwd(). Here's output both
>>> with and without using -wdir:
>>>
>>>> (merle):~$ cd test
>>>> (merle):test$ mpirun -np 2 test
>>>> before MPI_Init:
>>>> PWD: /home/tgamblin
>>>> getcwd: /home/tgamblin
>>>> before MPI_Init:
>>>> PWD: /home/tgamblin
>>>> getcwd: /home/tgamblin
>>>> after MPI_Init:
>>>> PWD: /home/tgamblin
>>>> getcwd: /home/tgamblin
>>>> after MPI_Init:
>>>> PWD: /home/tgamblin
>>>> getcwd: /home/tgamblin
>>>> (merle):test$ mpirun -np 2 -wdir /home/tgamblin/test test
>>>> before MPI_Init:
>>>> PWD: /home/tgamblin
>>>> getcwd: /home/tgamblin
>>>> before MPI_Init:
>>>> PWD: /home/tgamblin
>>>> getcwd: /home/tgamblin
>>>> after MPI_Init:
>>>> PWD: /home/tgamblin
>>>> getcwd: /home/tgamblin
>>>> after MPI_Init:
>>>> PWD: /home/tgamblin
>>>> getcwd: /home/tgamblin
>>>
>>>
>>> Shouldn't these print out /home/tgamblin/test? Also, this is even
>>> stranger:
>>>
>>>> (merle):test$ mpirun -np 2 pwd
>>>> /home/tgamblin/test
>>>> /home/tgamblin/test
>>>
>>>
>>> I feel like my program should output the same thing as pwd.
>>>
>>> I'm using OpenMPI 1.2.6, and the cluster has 8 nodes with two
>>> dual-core Woodcrest processors each (32 cores total). There are
>>> two TCP networks on this cluster: one that the head node uses to
>>> talk to the compute nodes, and a Gigabit network on which the
>>> compute nodes can reach each other (but not the head node). I
>>> have "btl_tcp_if_include = eth2" in my MCA params file to keep
>>> the compute nodes talking to each other over the fast
>>> interconnect, and I've pasted ifconfig output for the head node
>>> and for one compute node below. Also, if it helps, the home
>>> directories on this machine are mounted via autofs.
>>>
>>> This is causing problems because I'm using apps that look for
>>> their config files in the working directory. Please let me know
>>> if you guys have any idea what's going on.
>>>
>>> Thanks!
>>> -Todd
>>>
>>>
>>> TEST PROGRAM:
>>>> #include "mpi.h"
>>>> #include <cstdlib>
>>>> #include <iostream>
>>>> #include <sstream>
>>>> using namespace std;
>>>>
>>>> void testdir(const char *where) {
>>>>   char buf[1024];
>>>>   getcwd(buf, 1024);
>>>>
>>>>   ostringstream tmp;
>>>>   tmp << where << ":" << endl
>>>>       << "\tPWD:\t" << getenv("PWD") << endl
>>>>       << "\tgetcwd:\t" << getenv("PWD") << endl;
>>>>   cout << tmp.str();
>>>> }
>>>>
>>>> int main(int argc, char **argv) {
>>>> testdir("before MPI_Init");
>>>> MPI_Init(&argc, &argv);
>>>> testdir("after MPI_Init");
>>>> MPI_Finalize();
>>>> }
>>>
>>> HEAD NODE IFCONFIG:
>>>> eth0 Link encap:Ethernet HWaddr 00:18:8B:2F:3D:90
>>>> inet addr:10.6.1.1 Bcast:10.6.1.255 Mask:255.255.255.0
>>>> inet6 addr: fe80::218:8bff:fe2f:3d90/64 Scope:Link
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:1579250319 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:874273636 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:2361367146846 (2.1 TiB)  TX bytes:85373933521 (79.5 GiB)
>>>> Interrupt:169 Memory:f4000000-f4011100
>>>>
>>>> eth0:1 Link encap:Ethernet HWaddr 00:18:8B:2F:3D:90
>>>> inet addr:10.6.2.1 Bcast:10.6.2.255 Mask:255.255.255.0
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> Interrupt:169 Memory:f4000000-f4011100
>>>>
>>>> eth1 Link encap:Ethernet HWaddr 00:18:8B:2F:3D:8E
>>>> inet addr:152.54.1.21 Bcast:152.54.3.255 Mask:255.255.252.0
>>>> inet6 addr: fe80::218:8bff:fe2f:3d8e/64 Scope:Link
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:14436523 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:7357596 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:2354451258 (2.1 GiB)  TX bytes:2218390772 (2.0 GiB)
>>>> Interrupt:169 Memory:f8000000-f8011100
>>>>
>>>> lo Link encap:Local Loopback
>>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>>> inet6 addr: ::1/128 Scope:Host
>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>>> RX packets:540889623 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:540889623 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:0
>>>> RX bytes:63787539844 (59.4 GiB)  TX bytes:63787539844 (59.4 GiB)
>>>>
>>>
>>> COMPUTE NODE IFCONFIG:
>>>> (compute-0-0):~$ ifconfig
>>>> eth0 Link encap:Ethernet HWaddr 00:13:72:FA:42:ED
>>>> inet addr:10.6.1.254 Bcast:10.6.1.255 Mask:255.255.255.0
>>>> inet6 addr: fe80::213:72ff:fefa:42ed/64 Scope:Link
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:200637 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:165336 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:187105568 (178.4 MiB)  TX bytes:26263945 (25.0 MiB)
>>>> Interrupt:169 Memory:f8000000-f8011100
>>>>
>>>> eth2 Link encap:Ethernet HWaddr 00:15:17:0E:9E:68
>>>> inet addr:10.6.2.254 Bcast:10.6.2.255 Mask:255.255.255.0
>>>> inet6 addr: fe80::215:17ff:fe0e:9e68/64 Scope:Link
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:20 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:1280 (1.2 KiB) TX bytes:590 (590.0 b)
>>>> Base address:0xdce0 Memory:fc3e0000-fc400000
>>>>
>>>> lo Link encap:Local Loopback
>>>> inet addr:127.0.0.1 Mask:255.0.0.0
>>>> inet6 addr: ::1/128 Scope:Host
>>>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>>>> RX packets:65 errors:0 dropped:0 overruns:0 frame:0
>>>> TX packets:65 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:0
>>>> RX bytes:4376 (4.2 KiB) TX bytes:4376 (4.2 KiB)
>>>>
>>>
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>

-- 
Jeff Squyres
Cisco Systems