Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi query
From: Nisha Dhankher -M.Tech(CSE) (nishadhankher-coaeseeit_at_[hidden])
Date: 2014-04-05 03:03:26


The same VM is used on all machines, namely virt-manager.

On Sat, Apr 5, 2014 at 12:32 PM, Nisha Dhankher -M.Tech(CSE) <
nishadhankher-coaeseeit_at_[hidden]> wrote:

> Open MPI version 1.4.3
>
>
> On Fri, Apr 4, 2014 at 8:13 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Okay, so if you run mpiBlast on all the non-name nodes, everything is
>> okay? What do you mean by "names nodes"?
>>
>>
>> On Apr 4, 2014, at 7:32 AM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>
>> No, it does not happen on names nodes.
>>
>>
>> On Fri, Apr 4, 2014 at 7:51 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Hi Nisha
>>>
>>> I'm sorry if my questions appear abrasive - I'm just a little frustrated
>>> at the communication bottleneck as I can't seem to get a clear picture of
>>> your situation. So you really don't need to keep calling me "sir" :-)
>>>
>>> The error you are hitting is very unusual - it means that the processes
>>> are able to make a connection, but are failing to correctly complete a
>>> simple handshake exchange of their process identifications. There are only
>>> a few ways that can happen, and I'm trying to get you to test for them.
>>>
>>> So let's try and see if we can narrow this down. You mention that it
>>> works on some machines, but not all. Is this consistent - i.e., is it
>>> always the same machines that work, and the same ones that generate the
>>> error? If you exclude the ones that show the error, does it work? If so,
>>> what is different about those nodes? Are they a different architecture?
>>>
>>>
>>> On Apr 3, 2014, at 11:09 PM, Nisha Dhankher -M.Tech(CSE) <
>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>
>>> Sir,
>>> The same virt-manager is being used by all PCs. No, I didn't enable
>>> heterogeneous support in Open MPI. Yes, the Open MPI version is the same
>>> on all of them, installed through the same kickstart file.
>>> OK... actually, sir, Rocks itself installed and configured Open MPI and
>>> MPICH on its own through the HPC roll.
>>>
>>>
>>> On Fri, Apr 4, 2014 at 9:25 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>>
>>>> On Apr 3, 2014, at 8:03 PM, Nisha Dhankher -M.Tech(CSE) <
>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>
>>>> Thank you, Ralph.
>>>> Yes, the cluster is heterogeneous...
>>>>
>>>>
>>>> And did you configure OMPI with --enable-heterogeneous? And are you
>>>> running it with --hetero-nodes? What version of OMPI are you using anyway?
>>>>
>>>> Note that we don't care if the host pc's are hetero - what we care
>>>> about is the VM. If all the VMs are the same, then it shouldn't matter.
>>>> However, most VM technologies don't handle hetero hardware very well -
>>>> i.e., you can't emulate an x86 architecture on top of a Sparc or Power chip
>>>> or vice versa.
>>>>
>>>>
>>>> And I haven't set up the compute nodes directly on the physical nodes
>>>> (PCs), because in college it is not possible to take a whole lab of 32 PCs
>>>> for your own work, so I ran them on VMs.
>>>>
>>>>
>>>> Yes, but at least it would let you test the setup to run MPI across
>>>> even a couple of pc's - this is simple debugging practice.
>>>>
>>>> In a Rocks cluster, the frontend gives the same kickstart file to all the
>>>> PCs, so the Open MPI version should be the same, I guess.
>>>>
>>>>
>>>> Guess? or know? Makes a difference - might be worth testing.
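
One way to turn the "guess" into a check is to ask every node which Open MPI it actually resolves. The following is only a sketch under assumptions that are not from the thread: passwordless ssh to each node works, `mpirun --version` prints a version banner on this installation, and the host names are placeholders. The helper names `remote_version` and `group_by_version` are hypothetical.

```python
import subprocess
from collections import defaultdict
from typing import Callable


def remote_version(host: str) -> str:
    """Run `mpirun --version` on a host over ssh and return its first output line."""
    result = subprocess.run(
        ["ssh", host, "mpirun", "--version"],
        capture_output=True, text=True, timeout=30,
    )
    # Some versions print the banner on stderr, so look at both streams.
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else "<no output>"


def group_by_version(hosts, get_version: Callable[[str], str] = remote_version):
    """Map each reported version string to the list of hosts that reported it."""
    groups = defaultdict(list)
    for host in hosts:
        groups[get_version(host)].append(host)
    return dict(groups)
```

Calling `group_by_version(hosts)` with the node names or addresses from the machine file should give a single key if every node reports the same version. More than one key, or a shell "command not found" line as a key, would point at a version or PATH difference. Because the check goes through a non-interactive ssh session, it also exercises the PATH that a remote launch is likely to see.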
>>>>
>>>> Sir,
>>>> mpiformatdb is a command to distribute database fragments to different
>>>> compute nodes after partitioning of the database.
>>>> And sir, have you used mpiBLAST?
>>>>
>>>>
>>>> Nope - but that isn't the issue, is it? The issue is with the MPI setup.
>>>>
>>>>
>>>>
>>>> On Fri, Apr 4, 2014 at 4:48 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>>> What is "mpiformatdb"? We don't have an MPI database in our system,
>>>>> and I have no idea what that command means
>>>>>
>>>>> As for that error - it means that the identifier we exchange between
>>>>> processes is failing to be recognized. This could mean a couple of things:
>>>>>
>>>>> 1. the OMPI version on the two ends is different - could be you aren't
>>>>> getting the right paths set on the various machines
>>>>>
>>>>> 2. the cluster is heterogeneous
>>>>>
>>>>> You say you have "virtual nodes" running on various PC's? That would
>>>>> be an unusual setup - VM's can be problematic given the way they handle TCP
>>>>> connections, so that might be another source of the problem if my
>>>>> understanding of your setup is correct. Have you tried running this across
>>>>> the PCs directly - i.e., without any VMs?
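
A quick way to test plain TCP connectivity between the nodes, independent of MPI, is to try a connection from one node to every host in the machine file. This is a minimal sketch under assumptions that are not from the thread: the machine file has one host per line (optionally followed by `slots=N`), and each node accepts connections on the ssh port 22. The helper names `parse_machinefile` and `probe` are hypothetical.

```python
import socket


def parse_machinefile(text: str) -> list[str]:
    """Return the host (first token) of every non-empty, non-comment line."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            hosts.append(line.split()[0])
    return hosts


def probe(host: str, port: int = 22, timeout: float = 3.0) -> tuple[bool, str]:
    """Try one TCP connection; return (success, detail including the errno on failure)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "ok"
    except OSError as exc:
        return False, f"errno {exc.errno}: {exc.strerror or exc}"
```

Running `probe(host)` for each entry returned by `parse_machinefile(open("mf").read())`, from each node in turn, would show whether the same two addresses fail with errno 113 outside of MPI as well. If they do, the problem is more likely in the VM networking or routing than in Open MPI itself.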
>>>>>
>>>>>
>>>>> On Apr 3, 2014, at 10:13 AM, Nisha Dhankher -M.Tech(CSE) <
>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>
>>>>> I first formatted my database with the mpiformatdb command, then I ran
>>>>> this command:
>>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o output.txt
>>>>> It then gave this error 113 from some hosts and continued to run on the
>>>>> others, but with no results even after 2 hours had elapsed. This is on a
>>>>> Rocks 6.0 cluster with 12 virtual nodes on PCs, 2 on each, created using
>>>>> virt-manager, with 1 GB of RAM each.
>>>>>
>>>>>
>>>>> On Thu, Apr 3, 2014 at 10:41 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>
>>>>>> I also made a machine file, which contains the IP addresses of all
>>>>>> compute nodes, plus an .ncbirc file with the path to mpiBLAST and the
>>>>>> shared and local storage paths.
>>>>>> Sir,
>>>>>> I ran the same mpirun command on my college supercomputer (8 nodes, each
>>>>>> having 24 processors), but it just keeps running; it gave no result for
>>>>>> up to 3 hours.
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 3, 2014 at 10:39 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>>
>>>>>>> I first formatted my database with the mpiformatdb command, then I ran
>>>>>>> this command:
>>>>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o output.txt
>>>>>>> It then gave this error 113 from some hosts and continued to run on the
>>>>>>> others, but with no results even after 2 hours had elapsed. This is on a
>>>>>>> Rocks 6.0 cluster with 12 virtual nodes on PCs, 2 on each, created using
>>>>>>> virt-manager, with 1 GB of RAM each.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 3, 2014 at 8:37 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> I'm having trouble understanding your note, so perhaps I am getting
>>>>>>>> this wrong. Let's see if I can figure out what you said:
>>>>>>>>
>>>>>>>> * your perl command fails with "no route to host" - but I don't see
>>>>>>>> any host in your cmd. Maybe I'm just missing something.
>>>>>>>>
>>>>>>>> * you tried running a couple of "mpirun", but the mpirun command
>>>>>>>> wasn't recognized? Is that correct?
>>>>>>>>
>>>>>>>> * you then ran mpiblast and it sounds like it successfully started
>>>>>>>> the processes, but then one aborted? Was there an error message beyond just
>>>>>>>> the -1 return status?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 2, 2014, at 11:17 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>> error btl_tcp_endpoint.c: 638 connection failed due to error 113 <http://biosupport.se/questions/696/error-btl_tcp_endpintc-638-connection-failed-due-to-error-113>
>>>>>>>>
>>>>>>>> In Open MPI: this error came up when I ran my mpiBLAST program on a
>>>>>>>> Rocks cluster. Connecting to hosts failed on IPs 10.1.255.236 and
>>>>>>>> 10.1.255.244. When I run the following command:
>>>>>>>>
>>>>>>>>   linux_shell$ perl -e 'die$!=113'
>>>>>>>>
>>>>>>>> this message comes: "No route to host at -e line 1."
>>>>>>>>
>>>>>>>> I also executed:
>>>>>>>>
>>>>>>>>   shell$ mpirun --mca btl ^tcp
>>>>>>>>   shell$ mpirun --mca btl_tcp_if_include eth1,eth2
>>>>>>>>   shell$ mpirun --mca btl_tcp_if_include 10.1.255.244
>>>>>>>>
>>>>>>>> but these commands were not recognized and aborted. What should I do?
>>>>>>>>
>>>>>>>> When I ran my mpiBLAST program for the first time, it gave an mpi_abort
>>>>>>>> error: bailing out of signal -1 on rank 2 processor. Then I removed my
>>>>>>>> public Ethernet cable, and after that it gave the btl_tcp_endpoint
>>>>>>>> error 113.
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
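
The perl one-liner in the original report translates the number 113 into its system error message. The same lookup can be sketched in Python, which also gives the symbolic name. The helper name `describe_errno` is hypothetical, and the mapping of 113 to EHOSTUNREACH is the Linux numbering; other operating systems number their errors differently.

```python
import errno
import os


def describe_errno(code: int) -> tuple[str, str]:
    """Return (symbolic name, human-readable message) for an OS error number."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 113 is the value reported in the btl_tcp_endpoint.c message.
    # On Linux it corresponds to EHOSTUNREACH ("No route to host").
    name, message = describe_errno(113)
    print(f"errno 113 -> {name}: {message}")
```

The error number comes from the operating system's TCP stack and not from Open MPI itself, so the lookup has to be run on the same kind of system that reported the error.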