
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi query
From: Nisha Dhankher -M.Tech(CSE) (nishadhankher-coaeseeit_at_[hidden])
Date: 2014-04-05 03:11:37


mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 -np 16
-machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o out.txt

was the command I executed on the cluster.

On Sat, Apr 5, 2014 at 12:34 PM, Nisha Dhankher -M.Tech(CSE) <
nishadhankher-coaeseeit_at_[hidden]> wrote:

> Sorry Ralph, my mistake - it's not "names". It should read: "it does not
> happen on the same nodes."
>
>
> On Sat, Apr 5, 2014 at 12:33 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
>
>> The same VM (set up with virt-manager) is used on all machines.
>>
>>
>> On Sat, Apr 5, 2014 at 12:32 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>
>>> Open MPI version 1.4.3
>>>
>>>
>>> On Fri, Apr 4, 2014 at 8:13 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> Okay, so if you run mpiBlast on all the non-name nodes, everything is
>>>> okay? What do you mean by "names nodes"?
>>>>
>>>>
>>>> On Apr 4, 2014, at 7:32 AM, Nisha Dhankher -M.Tech(CSE) <
>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>
>>>> No, it does not happen on the "names" nodes.
>>>>
>>>>
>>>> On Fri, Apr 4, 2014 at 7:51 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>>> Hi Nisha
>>>>>
>>>>> I'm sorry if my questions appear abrasive - I'm just a little
>>>>> frustrated at the communication bottleneck as I can't seem to get a clear
>>>>> picture of your situation. So you really don't need to keep calling me
>>>>> "sir" :-)
>>>>>
>>>>> The error you are hitting is very unusual - it means that the
>>>>> processes are able to make a connection, but are failing to correctly
>>>>> complete a simple handshake exchange of their process identifications.
>>>>> There are only a few ways that can happen, and I'm trying to get you to
>>>>> test for them.
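The failing handshake described here can be pictured with a toy sketch (this is not Open MPI's actual wire protocol; the function and identifiers below are made up purely for illustration): each side sends its process identification, and the connection is only kept when the peer's identifier checks out.

```shell
# Toy model of the identification handshake (hypothetical, for illustration
# only): each peer announces an identifier, and the receiving side rejects
# the connection when the identifier is not the one it expected.
handshake() {
    expected="$1"   # identifier this side expects from its peer
    received="$2"   # identifier the peer actually sent
    if [ "$received" = "$expected" ]; then
        echo "handshake ok"
    else
        echo "connection failed: got '$received', expected '$expected'"
    fi
}
```

In this toy model, TCP connects fine in both cases; the failure happens one layer up, when the exchanged identification does not match - which is the situation the error in this thread describes.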
>>>>>
>>>>> So let's try and see if we can narrow this down. You mention that it
>>>>> works on some machines, but not all. Is this consistent - i.e., is it
>>>>> always the same machines that work, and the same ones that generate the
>>>>> error? If you exclude the ones that show the error, does it work? If so,
>>>>> what is different about those nodes? Are they a different architecture?
>>>>>
>>>>>
>>>>> On Apr 3, 2014, at 11:09 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>
>>>>> Sir,
>>>>> The same virt-manager is being used by all PCs. No, I didn't enable
>>>>> openmpi-hetero. Yes, the Open MPI version is the same on all nodes,
>>>>> through the same kickstart file.
>>>>> Actually, sir, Rocks itself installed and configured Open MPI and
>>>>> MPICH on its own through the HPC roll.
>>>>>
>>>>>
>>>>> On Fri, Apr 4, 2014 at 9:25 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>
>>>>>>
>>>>>> On Apr 3, 2014, at 8:03 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>>
>>>>>> Thank you, Ralph.
>>>>>> Yes, the cluster is heterogeneous...
>>>>>>
>>>>>>
>>>>>> And did you configure OMPI with --enable-heterogeneous? And are you
>>>>>> running it with --hetero-nodes? What version of OMPI are you using, anyway?
>>>>>>
>>>>>> Note that we don't care if the host pc's are hetero - what we care
>>>>>> about is the VM. If all the VMs are the same, then it shouldn't matter.
>>>>>> However, most VM technologies don't handle hetero hardware very well -
>>>>>> i.e., you can't emulate an x86 architecture on top of a Sparc or Power chip
>>>>>> or vice versa.
>>>>>>
>>>>>>
>>>>>> And I haven't made compute nodes directly on the physical nodes (PCs),
>>>>>> because in college it is not possible to take a whole lab of 32 PCs for
>>>>>> your own work, so I ran on VMs.
>>>>>>
>>>>>>
>>>>>> Yes, but at least it would let you test the setup to run MPI across
>>>>>> even a couple of pc's - this is simple debugging practice.
>>>>>>
>>>>>> In a Rocks cluster, the frontend gives the same kickstart to all the
>>>>>> PCs, so the Open MPI version should be the same, I guess.
>>>>>>
>>>>>>
>>>>>> Guess? or know? Makes a difference - might be worth testing.
>>>>>>
>>>>>> Sir,
>>>>>> mpiformatdb is a command to distribute database fragments to different
>>>>>> compute nodes after partitioning the database.
>>>>>> And sir, have you used mpiBLAST?
>>>>>>
>>>>>>
>>>>>> Nope - but that isn't the issue, is it? The issue is with the MPI
>>>>>> setup.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 4, 2014 at 4:48 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>
>>>>>>> What is "mpiformatdb"? We don't have an MPI database in our system,
>>>>>>> and I have no idea what that command means
>>>>>>>
>>>>>>> As for that error - it means that the identifier we exchange between
>>>>>>> processes is failing to be recognized. This could mean a couple of things:
>>>>>>>
>>>>>>> 1. the OMPI version on the two ends is different - could be you
>>>>>>> aren't getting the right paths set on the various machines
>>>>>>>
>>>>>>> 2. the cluster is heterogeneous
>>>>>>>
>>>>>>> You say you have "virtual nodes" running on various PC's? That would
>>>>>>> be an unusual setup - VM's can be problematic given the way they handle TCP
>>>>>>> connections, so that might be another source of the problem if my
>>>>>>> understanding of your setup is correct. Have you tried running this across
>>>>>>> the PCs directly - i.e., without any VMs?
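Point 1 above (a different Open MPI version on the two ends) can be checked mechanically rather than guessed at. A minimal sketch, under the assumptions that passwordless ssh works and that `mf` is the machinefile used elsewhere in this thread: gather one `mpirun --version` line per host, then verify there is exactly one distinct value. The helper below performs just the comparison step on whatever version lines reach its stdin.

```shell
# Reads one version string per line on stdin; prints OK when all lines are
# identical (or there is no input at all), MISMATCH otherwise.
check_versions() {
    if [ "$(sort -u | wc -l)" -le 1 ]; then
        echo OK
    else
        echo MISMATCH
    fi
}

# Hypothetical cluster-wide use (assumes passwordless ssh and machinefile mf):
#   for h in $(cat mf); do ssh "$h" 'mpirun --version 2>&1 | head -n1'; done | check_versions
```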
>>>>>>>
>>>>>>>
>>>>>>> On Apr 3, 2014, at 10:13 AM, Nisha Dhankher -M.Tech(CSE) <
>>>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>>>
>>>>>>> I first formatted my database with the mpiformatdb command, then I ran:
>>>>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i
>>>>>>> query.fas -o output.txt
>>>>>>> But then it gave this error 113 from some hosts, and it continued to
>>>>>>> run on the others but produced no results even after 2 hours had
>>>>>>> elapsed... on a Rocks 6.0 cluster with 12 virtual nodes on PCs (2 on
>>>>>>> each, using virt-manager, 1 GB RAM each).
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 3, 2014 at 10:41 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> I also made a machinefile containing the IP addresses of all compute
>>>>>>>> nodes, plus a .ncbirc file with the path to mpiBLAST and the shared
>>>>>>>> and local storage paths...
>>>>>>>> Sir,
>>>>>>>> I ran the same mpirun command on my college supercomputer (8 nodes,
>>>>>>>> each having 24 processors), but it just kept running and gave no
>>>>>>>> result even after 3 hours...
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 3, 2014 at 10:39 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> I first formatted my database with the mpiformatdb command, then I ran:
>>>>>>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i
>>>>>>>>> query.fas -o output.txt
>>>>>>>>> But then it gave this error 113 from some hosts, and it continued to
>>>>>>>>> run on the others but produced no results even after 2 hours had
>>>>>>>>> elapsed... on a Rocks 6.0 cluster with 12 virtual nodes on PCs (2 on
>>>>>>>>> each, using virt-manager, 1 GB RAM each).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 3, 2014 at 8:37 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> I'm having trouble understanding your note, so perhaps I am
>>>>>>>>>> getting this wrong. Let's see if I can figure out what you said:
>>>>>>>>>>
>>>>>>>>>> * your perl command fails with "no route to host" - but I don't
>>>>>>>>>> see any host in your cmd. Maybe I'm just missing something.
>>>>>>>>>>
>>>>>>>>>> * you tried running a couple of "mpirun", but the mpirun command
>>>>>>>>>> wasn't recognized? Is that correct?
>>>>>>>>>>
>>>>>>>>>> * you then ran mpiblast and it sounds like it successfully
>>>>>>>>>> started the processes, but then one aborted? Was there an error message
>>>>>>>>>> beyond just the -1 return status?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Apr 2, 2014, at 11:17 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>>>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>>>>>>
>>>>>>>>>> error btl_tcp_endpoint.c:638: connection failed due to error 113
>>>>>>>>>> <http://biosupport.se/questions/696/error-btl_tcp_endpintc-638-connection-failed-due-to-error-113>
>>>>>>>>>>
>>>>>>>>>> In Open MPI: this error came when I ran my mpiBLAST program on a
>>>>>>>>>> Rocks cluster. Connecting to the hosts failed on IPs 10.1.255.236
>>>>>>>>>> and 10.1.255.244. And when I run the following command:
>>>>>>>>>> shell$ perl -e 'die $! = 113'
>>>>>>>>>> this message comes: "No route to host at -e line 1."
>>>>>>>>>> shell$ mpirun --mca btl ^tcp
>>>>>>>>>> shell$ mpirun --mca btl_tcp_if_include eth1,eth2
>>>>>>>>>> shell$ mpirun --mca btl_tcp_if_include 10.1.255.244
>>>>>>>>>> were also executed, but it did not recognize these commands and
>>>>>>>>>> aborted... What should I do? When I ran my mpiBLAST program for the
>>>>>>>>>> first time, it gave an MPI_ABORT error, bailing out on signal -1 on
>>>>>>>>>> rank 2; then I removed my public Ethernet cable, and then it gave
>>>>>>>>>> the btl_tcp_endpoint error 113...
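For reference, the perl one-liner above works because error 113 is simply the Linux errno value EHOSTUNREACH, whose message text is exactly the "No route to host" seen in the btl_tcp_endpoint error: the TCP BTL could not reach the peer's address at all, which points at routing, firewall, or interface-selection problems rather than an MPI bug. The mapping can be confirmed from any Linux shell (python3 is used here only as a convenient errno lookup; the perl trick shows the same thing):

```shell
# errno 113 on Linux is EHOSTUNREACH; its strerror text is the
# "No route to host" message seen in the btl_tcp_endpoint error.
python3 -c 'import errno, os; print(errno.errorcode[113], "-", os.strerror(113))'
```

On Linux this prints: EHOSTUNREACH - No route to host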
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> users_at_[hidden]
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users