Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi query
From: Nisha Dhankher -M.Tech(CSE) (nishadhankher-coaeseeit_at_[hidden])
Date: 2014-04-08 05:37:48


And thank you very much.

On Tue, Apr 8, 2014 at 3:07 PM, Nisha Dhankher -M.Tech(CSE) <
nishadhankher-coaeseeit_at_[hidden]> wrote:

> The latest Rocks (6.2) carries only this version.
>
>
> On Tue, Apr 8, 2014 at 3:49 AM, Jeff Squyres (jsquyres) <
> jsquyres_at_[hidden]> wrote:
>
>> Open MPI 1.4.3 is *ancient*. Please upgrade -- we just released Open MPI
>> 1.8 last week.
>>
>> Also, please look at this FAQ entry -- it steps you through a lot of
>> basic troubleshooting steps about getting basic MPI programs working.
>>
>> http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>
>> Once you get basic MPI programs working, then try with MPI Blast.
>>
>>
>>
>> On Apr 5, 2014, at 3:11 AM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>
>> > mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 -np 16 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o out.txt
>> >
>> > was the command I executed on the cluster.
>> >
>> >
>> >
>> > On Sat, Apr 5, 2014 at 12:34 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> > Sorry Ralph, my mistake: it is not "names". What I meant was "it does not happen on the same nodes."
>> >
>> >
>> > On Sat, Apr 5, 2014 at 12:33 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> > The same VM setup is used on all machines, namely virt-manager.
>> >
>> >
>> > On Sat, Apr 5, 2014 at 12:32 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> > Open MPI version 1.4.3
>> >
>> >
>> > On Fri, Apr 4, 2014 at 8:13 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> > Okay, so if you run mpiBlast on all the non-name nodes, everything is
>> okay? What do you mean by "names nodes"?
>> >
>> >
>> > On Apr 4, 2014, at 7:32 AM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >
>> >> no it does not happen on names nodes
>> >>
>> >>
>> >> On Fri, Apr 4, 2014 at 7:51 PM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> >> Hi Nisha
>> >>
>> >> I'm sorry if my questions appear abrasive - I'm just a little
>> frustrated at the communication bottleneck as I can't seem to get a clear
>> picture of your situation. So you really don't need to keep calling me
>> "sir" :-)
>> >>
>> >> The error you are hitting is very unusual - it means that the
>> processes are able to make a connection, but are failing to correctly
>> complete a simple handshake exchange of their process identifications.
>> There are only a few ways that can happen, and I'm trying to get you to
>> test for them.
>> >>
>> >> So let's try and see if we can narrow this down. You mention that it
>> works on some machines, but not all. Is this consistent - i.e., is it
>> always the same machines that work, and the same ones that generate the
>> error? If you exclude the ones that show the error, does it work? If so,
>> what is different about those nodes? Are they a different architecture?
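
One way to check whether it is always the same machines that fail is to probe every host in the machinefile from each node and compare the results between runs. The sketch below is only an illustration: it assumes the machinefile lists one host per line (optionally followed by fields such as slots=N), and it probes TCP port 22 (ssh) purely as a stand-in, since a successful ssh-port connection does not prove that the ports Open MPI picks are reachable. The function names are hypothetical.

```python
import socket
from typing import List


def parse_machinefile(text: str) -> List[str]:
    """Return the host names/IPs from machinefile text, skipping comments and blank lines."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            hosts.append(line.split()[0])     # first field is the host
    return hosts


def probe(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Try a plain TCP connection and report 'ok' or the OS error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:  # includes timeouts, DNS failures, "No route to host"
        return f"failed: {exc}"


def probe_all(machinefile_text: str, port: int = 22) -> dict:
    """Map each host in the machinefile to its probe result."""
    return {host: probe(host, port) for host in parse_machinefile(machinefile_text)}
```

Running probe_all(open("mf").read()) on each node and saving the output would show whether the failing hosts are consistent from run to run.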
>> >>
>> >>
>> >> On Apr 3, 2014, at 11:09 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>
>> >>> Sir,
>> >>> The same virt-manager is being used by all PCs. No, I didn't enable heterogeneous support in Open MPI. Yes, the Open MPI version is the same on all of them, through the same kickstart file.
>> >>> Actually, sir, Rocks itself installed and configured Open MPI and MPICH on its own through the HPC roll.
>> >>>
>> >>>
>> >>> On Fri, Apr 4, 2014 at 9:25 AM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> >>>
>> >>> On Apr 3, 2014, at 8:03 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>
>> >>>> Thank you, Ralph.
>> >>>> Yes, the cluster is heterogeneous...
>> >>>
>> >>> And did you configure OMPI --enable-heterogeneous? And are you
>> running it with --hetero-nodes? What version of OMPI are you using anyway?
>> >>>
>> >>> Note that we don't care if the host pc's are hetero - what we care
>> about is the VM. If all the VMs are the same, then it shouldn't matter.
>> However, most VM technologies don't handle hetero hardware very well -
>> i.e., you can't emulate an x86 architecture on top of a Sparc or Power chip
>> or vice versa.
>> >>>
>> >>>
>> >>>> And I haven't set up compute nodes directly on the physical machines (PCs) because in college it is not possible to take over a whole lab of 32 PCs for your own work, so I ran on VMs.
>> >>>
>> >>> Yes, but at least it would let you test the setup to run MPI across
>> even a couple of pc's - this is simple debugging practice.
>> >>>
>> >>>> In a Rocks cluster, the frontend gives the same kickstart to all the PCs, so the Open MPI version should be the same, I guess.
>> >>>
>> >>> Guess? or know? Makes a difference - might be worth testing.
>> >>>
>> >>>> Sir,
>> >>>> mpiformatdb is a command that distributes database fragments to the different compute nodes after partitioning the database.
>> >>>> And sir, have you used mpiblast?
>> >>>
>> >>> Nope - but that isn't the issue, is it? The issue is with the MPI
>> setup.
>> >>>
>> >>>>
>> >>>>
>> >>>> On Fri, Apr 4, 2014 at 4:48 AM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> >>>> What is "mpiformatdb"? We don't have an MPI database in our system,
>> and I have no idea what that command means
>> >>>>
>> >>>> As for that error - it means that the identifier we exchange between
>> processes is failing to be recognized. This could mean a couple of things:
>> >>>>
>> >>>> 1. the OMPI version on the two ends is different - could be you
>> aren't getting the right paths set on the various machines
>> >>>>
>> >>>> 2. the cluster is heterogeneous
>> >>>>
>> >>>> You say you have "virtual nodes" running on various PC's? That would
>> be an unusual setup - VM's can be problematic given the way they handle TCP
>> connections, so that might be another source of the problem if my
>> understanding of your setup is correct. Have you tried running this across
>> the PCs directly - i.e., without any VMs?
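
The first possibility above (different OMPI versions, or the wrong paths being picked up on some machines) can be tested rather than guessed. The following sketch assumes passwordless ssh to every node and that running mpirun --version in a non-interactive ssh session is representative of what mpirun itself would find there. The helper names are hypothetical.

```python
import subprocess
from collections import defaultdict
from typing import Dict, List


def remote_version(host: str, timeout: float = 10.0) -> str:
    """Return the first line printed by `mpirun --version` on `host` via ssh."""
    try:
        result = subprocess.run(
            ["ssh", host, "mpirun", "--version"],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"unreachable ({exc.__class__.__name__})"
    # Some releases print the version banner on stderr, so check both streams.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "no output"


def group_by_version(versions: Dict[str, str]) -> Dict[str, List[str]]:
    """Group hosts by the version string they reported."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}
```

Calling remote_version for every host in the machinefile and passing the results to group_by_version should yield a single group if all nodes agree. More than one group, or a "command not found" style message, would point at a version or PATH mismatch.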
>> >>>>
>> >>>>
>> >>>> On Apr 3, 2014, at 10:13 AM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>>
>> >>>>> I first formatted my database with the mpiformatdb command, then I ran:
>> >>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o output.txt
>> >>>>> It then gave this error 113 from some hosts and continued to run on the others, but produced no results even after 2 hours. This is on a Rocks 6.0 cluster with 12 virtual nodes on PCs, 2 on each, created with virt-manager, with 1 GB of RAM each.
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 3, 2014 at 10:41 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>>> I also made a machine file which contains the IP addresses of all the compute nodes, plus a .ncbirc file with the path to mpiblast and the shared and local storage paths.
>> >>>>> Sir,
>> >>>>> I ran the same mpirun command on my college supercomputer (8 nodes, each having 24 processors), but it just kept running and gave no result for up to 3 hours.
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 3, 2014 at 10:39 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>>> I first formatted my database with the mpiformatdb command, then I ran:
>> >>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas -o output.txt
>> >>>>> It then gave this error 113 from some hosts and continued to run on the others, but produced no results even after 2 hours. This is on a Rocks 6.0 cluster with 12 virtual nodes on PCs, 2 on each, created with virt-manager, with 1 GB of RAM each.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 3, 2014 at 8:37 PM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> >>>>> I'm having trouble understanding your note, so perhaps I am getting
>> this wrong. Let's see if I can figure out what you said:
>> >>>>>
>> >>>>> * your perl command fails with "no route to host" - but I don't see
>> any host in your cmd. Maybe I'm just missing something.
>> >>>>>
>> >>>>> * you tried running a couple of "mpirun", but the mpirun command
>> wasn't recognized? Is that correct?
>> >>>>>
>> >>>>> * you then ran mpiblast and it sounds like it successfully started
>> the processes, but then one aborted? Was there an error message beyond just
>> the -1 return status?
>> >>>>>
>> >>>>>
>> >>>>> On Apr 2, 2014, at 11:17 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>> >>>>>
>> >>>>>> error btl_tcp_endpoint.c:638 connection failed due to error 113
>> >>>>>>
>> >>>>>> In Open MPI, this error came up when I ran my mpiblast program on a Rocks cluster. Connecting to hosts failed on IPs 10.1.255.236 and 10.1.255.244. When I run the following command:
>> >>>>>> shell$ perl -e 'die$!=113'
>> >>>>>> this message appears: "No route to host at -e line 1."
>> >>>>>> I also executed:
>> >>>>>> shell$ mpirun --mca btl ^tcp
>> >>>>>> shell$ mpirun --mca btl_tcp_if_include eth1,eth2
>> >>>>>> shell$ mpirun --mca btl_tcp_if_include 10.1.255.244
>> >>>>>> but these commands were not recognized and aborted. What should I do? When I ran my mpiblast program for the first time, it gave an MPI_Abort error (bailing out of signal -1 on rank 2 processor). Then I removed my public Ethernet cable, and after that it gave the btl_tcp endpoint error 113.
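
The numeric code in that message is an operating-system errno value, which is why the Perl one-liner can translate it. As an alternative sketch, the same lookup can be done with Python's standard library. The helper name describe_errno is hypothetical, and the symbolic name and text for a given number depend on the platform.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'SYMBOL: message' for an OS error number on this platform."""
    symbol = errno.errorcode.get(code, "UNKNOWN")  # e.g. 'EHOSTUNREACH'
    return f"{symbol}: {os.strerror(code)}"


# 113 is the value reported in the error above; its meaning is platform-specific.
print(describe_errno(113))
```

On Linux, 113 should map to EHOSTUNREACH ("No route to host"), matching the Perl output quoted above. This suggests a routing, interface, or firewall problem between the nodes rather than a fault in mpiblast itself.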
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> users mailing list
>> >>>>>> users_at_[hidden]
>> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/