Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] testing for openMPI
From: TERRY DONTJE (terry.dontje_at_[hidden])
Date: 2012-06-07 07:04:49


Try: ps -elf | grep hello
This should list all the processes with "hello" in their command line
(your mpihello processes included).
The pid is in that output (it should be the 4th column); give that pid
to your debugger. For example, if the pid was 1234 you'd run
"gdb -p 1234", and then "bt" at the gdb prompt prints the stack.
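
If it is hard to tell which pid belongs to which rank, a common trick is to have each process print its own pid and host and then wait, so you know exactly what to hand to gdb. Below is a minimal sketch of that idea in plain C with no MPI calls. The names announce_pid and wait_for_debugger are made up for illustration and are not part of Open MPI.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "PID <pid> on <host> ready for attach"
 * to `out` and return the pid that was printed.
 */
long announce_pid(FILE *out)
{
    char hostname[256];
    long pid = (long)getpid();

    if (gethostname(hostname, sizeof(hostname)) != 0) {
        snprintf(hostname, sizeof(hostname), "unknown");
    }
    hostname[sizeof(hostname) - 1] = '\0';  /* guard against truncation */

    fprintf(out, "PID %ld on %s ready for attach\n", pid, hostname);
    fflush(out);
    return pid;
}

/*
 * Hypothetical helper: sleep in one-second steps until *flag becomes
 * nonzero (for example, set from inside the debugger) or max_seconds
 * have passed. Returns the final value of *flag.
 */
int wait_for_debugger(volatile int *flag, int max_seconds)
{
    for (int waited = 0; *flag == 0 && waited < max_seconds; waited++) {
        sleep(1);
    }
    return *flag;
}
```

One way to use it: near the top of main, call announce_pid(stderr), declare "volatile int attached = 0;" and call wait_for_debugger(&attached, 60). After attaching with gdb, select the frame that owns the flag, run "set var attached = 1", then "continue".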

Actually Jeff's suggestion of this being a firewall issue is something
to look into.

--td

On 6/7/2012 6:36 AM, Duke wrote:
> On 6/7/12 5:31 PM, TERRY DONTJE wrote:
>> Can you get on one of the nodes and see the job's processes? If so
>> can you then attach a debugger to it and get a stack? I wonder if
>> the processes are stuck in MPI_Init?
>
> Thanks Terry for your suggestion, but could you please tell me how to
> do it? I can ssh to the nodes, but how do I check the job's processes? I
> am new to this.
>
> Thanks,
>
> D.
>
>>
>> --td
>>
>> On 6/7/2012 6:06 AM, Duke wrote:
>>> Hi again,
>>>
>>> Somehow the verbose flag (-v) did not work for me. I tried
>>> --debug-daemons and got:
>>>
>>> [mpiuser@fantomfs40a ~]$ mpirun --debug-daemons -np 3 --machinefile /home/mpiuser/.mpi_hostfile ./test/mpihello
>>> Daemon was launched on hp430a - beginning to initialize
>>> Daemon [[34432,0],1] checking in as pid 3011 on host hp430a
>>> <stuck here>
>>>
>>> Somehow the program got stuck when checking on hosts. The secure log
>>> on hp430a showed that mpiuser logged in just fine:
>>>
>>> tail /var/log/secure
>>> Jun 7 17:07:31 hp430a sshd[3007]: Accepted publickey for mpiuser
>>> from 192.168.0.101 port 34037 ssh2
>>> Jun 7 17:07:31 hp430a sshd[3007]: pam_unix(sshd:session): session
>>> opened for user mpiuser by (uid=0)
>>>
>>> Any idea where/how/what to process/check?
>>>
>>> Thanks,
>>>
>>> D.
>>>
>>> On 6/7/12 4:38 PM, Duke wrote:
>>>> Hi Jingcha,
>>>>
>>>> On 6/7/12 4:28 PM, Jingcha Joba wrote:
>>>>> Hello Duke,
>>>>> Welcome to the forum.
>>>>> By default, Open MPI fills all the slots on a host before
>>>>> moving on to the next host.
>>>>> Check this link for some info:
>>>>> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
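
The "fill one host first" behaviour described above can be sketched as a small C function. This is a simplified model only, not Open MPI's actual mapper: it ignores max-slots and oversubscription, and the names host_entry and place_by_slot are hypothetical. The slot counts in the usage note are taken from the hostfile quoted in the original post.

```c
#include <stdio.h>

/* One hostfile entry: a host name and its slot count. */
struct host_entry {
    const char *name;
    int slots;
};

/*
 * Assign ranks 0..np-1 to hosts "by slot": use every slot on one host
 * before moving to the next. placement[r] receives the index of the
 * host that rank r lands on. Returns the number of ranks placed, which
 * is less than np if the hosts run out of slots (this sketch does not
 * model oversubscription).
 */
int place_by_slot(const struct host_entry *hosts, int nhosts,
                  int np, int *placement)
{
    int rank = 0;

    for (int h = 0; h < nhosts && rank < np; h++) {
        for (int s = 0; s < hosts[h].slots && rank < np; s++) {
            placement[rank++] = h;
        }
    }
    return rank;
}

/* Print one "rank N -> host" line for each placed rank. */
void print_placement(const struct host_entry *hosts,
                     const int *placement, int placed)
{
    for (int r = 0; r < placed; r++) {
        printf("rank %d -> %s\n", r, hosts[placement[r]].name);
    }
}
```

With hosts fantomfs40a (2 slots), hp430a (4 slots) and hp430b (4 slots) listed in that order, np=2 puts both ranks on fantomfs40a, and np=4 puts ranks 0 and 1 on fantomfs40a and ranks 2 and 3 on hp430a under this model.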
>>>>
>>>> Thanks for the quick answer. I checked the FAQ and tried with
>>>> more than 2 processes, but somehow it stalled:
>>>>
>>>> [mpiuser@fantomfs40a ~]$ mpirun -v -np 4 --machinefile /home/mpiuser/.mpi_hostfile ./test/mpihello
>>>> ^Cmpirun: killing job...
>>>>
>>>> I tried the --host flag and it stalled as well:
>>>>
>>>> [mpiuser@fantomfs40a ~]$ mpirun -v -np 4 --host hp430a,hp430b ./test/mpihello
>>>>
>>>>
>>>> My configuration must be wrong somewhere. Any idea how I can check
>>>> the system?
>>>>
>>>> Thanks,
>>>>
>>>> D.
>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jingcha
>>>>> On Thu, Jun 7, 2012 at 2:11 AM, Duke <duke.lists_at_[hidden]> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> Please be gentle to the newest member of Open MPI, I am totally
>>>>> new to this field. I just built a test cluster with 3 boxes on
>>>>> Scientific Linux 6.2 and Open MPI 1.5.3, and I
>>>>> wanted to test how the cluster works, but I can't figure out
>>>>> what is happening. On my master node, I have the hostfile:
>>>>>
>>>>> [mpiuser@fantomfs40a ~]$ cat .mpi_hostfile
>>>>> # The Hostfile for Open MPI
>>>>> fantomfs40a slots=2
>>>>> hp430a slots=4 max-slots=4
>>>>> hp430b slots=4 max-slots=4
>>>>>
>>>>> To test, I used the following C code:
>>>>>
>>>>> [mpiuser@fantomfs40a ~]$ cat test/mpihello.c
>>>>> /* program hello */
>>>>> /* Adapted from mpihello.f by drs */
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> #include <unistd.h>   /* gethostname() */
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int rank;
>>>>>     char hostname[256];
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     /* Leave room for a terminating NUL in case the name is truncated. */
>>>>>     gethostname(hostname, sizeof(hostname) - 1);
>>>>>     hostname[sizeof(hostname) - 1] = '\0';
>>>>>     printf("Hello world! I am process number: %d on host %s\n",
>>>>>            rank, hostname);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> and then compiled and ran:
>>>>>
>>>>> [mpiuser@fantomfs40a ~]$ mpicc -o test/mpihello test/mpihello.c
>>>>> [mpiuser@fantomfs40a ~]$ mpirun -np 2 --machinefile /home/mpiuser/.mpi_hostfile ./test/mpihello
>>>>> Hello world! I am process number: 0 on host fantomfs40a
>>>>> Hello world! I am process number: 1 on host fantomfs40a
>>>>>
>>>>> Unfortunately the result did not show what I wanted. I
>>>>> expected to see something like:
>>>>>
>>>>> Hello world! I am process number: 0 on host hp430a
>>>>> Hello world! I am process number: 1 on host hp430b
>>>>>
>>>>> Does anybody have any idea what I am doing wrong?
>>>>>
>>>>> Thank you in advance,
>>>>>
>>>>> D.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden] <mailto:users_at_[hidden]>
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle *- Performance Technologies*
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>
