
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] testing for openMPI
From: Duke (duke.lists_at_[hidden])
Date: 2012-06-07 06:36:08


On 6/7/12 5:31 PM, TERRY DONTJE wrote:
> Can you get on one of the nodes and see the job's processes? If so
> can you then attach a debugger to it and get a stack? I wonder if the
> processes are stuck in MPI_Init?

Thanks, Terry, for your suggestion, but could you tell me how to do
that? I can ssh to the nodes, but how do I check the job's processes? I
am new to this.
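
For reference, Terry's suggestion boils down to something like the sketch below. It is hedged: "sleep" stands in for a possibly stuck MPI rank (on hp430a you would look for ./test/mpihello instead), and the gdb line assumes gdb is installed on the node.

```shell
# Find the job's processes on the node, then grab a stack trace.
# "sleep" is a stand-in here for a stuck MPI rank; on the real node,
# grep the ps output for mpihello instead.
sleep 60 &
pid=$!
ps -p "$pid" -o pid= -o comm=   # confirm the process is visible
# Attach a debugger and dump the stack (run on the node, as mpiuser);
# a rank stuck in MPI_Init shows MPI_Init near the top of the backtrace:
#   gdb -p "$pid" -batch -ex "thread apply all bt"
kill "$pid"
```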

Thanks,

D.

>
> --td
>
> On 6/7/2012 6:06 AM, Duke wrote:
>> Hi again,
>>
>> Somehow the verbose flag (-v) did not work for me. I tried
>> --debug-daemons and got:
>>
>> [mpiuser_at_fantomfs40a ~]$ mpirun --debug-daemons -np 3 --machinefile
>> /home/mpiuser/.mpi_hostfile ./test/mpihello
>> Daemon was launched on hp430a - beginning to initialize
>> Daemon [[34432,0],1] checking in as pid 3011 on host hp430a
>> <stuck here>
>>
>> The run got stuck while the daemon was checking in. The secure log
>> on hp430a showed that mpiuser logged in just fine:
>>
>> tail /var/log/secure
>> Jun 7 17:07:31 hp430a sshd[3007]: Accepted publickey for mpiuser
>> from 192.168.0.101 port 34037 ssh2
>> Jun 7 17:07:31 hp430a sshd[3007]: pam_unix(sshd:session): session
>> opened for user mpiuser by (uid=0)
>>
>> Any idea where to look or what to check next?
>>
>> Thanks,
>>
>> D.
>>
>> On 6/7/12 4:38 PM, Duke wrote:
>>> Hi Jingcha,
>>>
>>> On 6/7/12 4:28 PM, Jingcha Joba wrote:
>>>> Hello Duke,
>>>> Welcome to the forum.
>>>> By default, Open MPI schedules by slot: it fills all the slots on
>>>> one host before moving on to the next host.
>>>> Check this link for some info:
>>>> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
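
As a hedged illustration of that FAQ point: a hostfile that lists only the compute nodes keeps ranks off the head node, and mpirun's --bynode flag forces round-robin placement across hosts instead of the default by-slot packing. (The host names reuse the ones in this thread; the flag is assumed from Open MPI 1.5-era mpirun.)

```
# Hypothetical hostfile: compute nodes only, so no rank lands on fantomfs40a
hp430a slots=4
hp430b slots=4
# Example invocation, one rank per node:
#   mpirun -np 2 --bynode --machinefile hostfile ./test/mpihello
```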
>>>
>>> Thanks for the quick answer. I checked the FAQ and tried with more
>>> than 2 processes, but it stalled:
>>>
>>> [mpiuser_at_fantomfs40a ~]$ mpirun -v -np 4 --machinefile
>>> /home/mpiuser/.mpi_hostfile ./test/mpihello
>>> ^Cmpirun: killing job...
>>>
>>> I tried --host flag and it got stalled as well:
>>>
>>> [mpiuser_at_fantomfs40a ~]$ mpirun -v -np 4 --host hp430a,hp430b
>>> ./test/mpihello
>>>
>>>
>>> My configuration must be wrong somewhere. Any idea how I can check
>>> the system?
>>>
>>> Thanks,
>>>
>>> D.
>>>
>>>>
>>>>
>>>> --
>>>> Jingcha
>>>> On Thu, Jun 7, 2012 at 2:11 AM, Duke <duke.lists_at_[hidden]> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> Please be gentle with the newest member of the openMPI community; I
>>>> am totally new to this field. I just built a test cluster of 3 boxes
>>>> running Scientific Linux 6.2 and Open MPI 1.5.3, and I wanted to
>>>> test how the cluster works, but I can't figure out what is
>>>> happening. On my master node, I have the hostfile:
>>>>
>>>> [mpiuser_at_fantomfs40a ~]$ cat .mpi_hostfile
>>>> # The Hostfile for Open MPI
>>>> fantomfs40a slots=2
>>>> hp430a slots=4 max-slots=4
>>>> hp430b slots=4 max-slots=4
>>>>
>>>> To test, I used the following c code:
>>>>
>>>> [mpiuser_at_fantomfs40a ~]$ cat test/mpihello.c
>>>> /* program hello */
>>>> /* Adapted from mpihello.f by drs */
>>>>
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> #include <unistd.h> /* for gethostname() */
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>> int rank;
>>>> char hostname[256];
>>>>
>>>> MPI_Init(&argc, &argv);
>>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>> gethostname(hostname, 255);
>>>> printf("Hello world! I am process number: %d on host %s\n",
>>>> rank, hostname);
>>>> MPI_Finalize();
>>>> return 0;
>>>> }
>>>>
>>>> and then compiled and ran:
>>>>
>>>> [mpiuser_at_fantomfs40a ~]$ mpicc -o test/mpihello test/mpihello.c
>>>> [mpiuser_at_fantomfs40a ~]$ mpirun -np 2 --machinefile
>>>> /home/mpiuser/.mpi_hostfile ./test/mpihello
>>>> Hello world! I am process number: 0 on host fantomfs40a
>>>> Hello world! I am process number: 1 on host fantomfs40a
>>>>
>>>> Unfortunately the result was not what I wanted. I expected to see
>>>> something like:
>>>>
>>>> Hello world! I am process number: 0 on host hp430a
>>>> Hello world! I am process number: 1 on host hp430b
>>>>
>>>> Does anybody have any idea what I am doing wrong?
>>>>
>>>> Thank you in advance,
>>>>
>>>> D.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]
>
>
>
>
>