
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi query
From: Nisha Dhankher -M.Tech(CSE) (nishadhankher-coaeseeit_at_[hidden])
Date: 2014-04-04 10:32:26


No, it does not happen on the same nodes.

On Fri, Apr 4, 2014 at 7:51 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Hi Nisha
>
> I'm sorry if my questions appear abrasive - I'm just a little frustrated
> at the communication bottleneck as I can't seem to get a clear picture of
> your situation. So you really don't need to keep calling me "sir" :-)
>
> The error you are hitting is very unusual - it means that the processes
> are able to make a connection, but are failing to correctly complete a
> simple handshake exchange of their process identifications. There are only
> a few ways that can happen, and I'm trying to get you to test for them.
>
> So let's try and see if we can narrow this down. You mention that it works
> on some machines, but not all. Is this consistent - i.e., is it always the
> same machines that work, and the same ones that generate the error? If you
> exclude the ones that show the error, does it work? If so, what is
> different about those nodes? Are they a different architecture?
>
>
> On Apr 3, 2014, at 11:09 PM, Nisha Dhankher -M.Tech(CSE) <
> nishadhankher-coaeseeit_at_[hidden]> wrote:
>
> Sir,
> The same virt-manager is being used by all PCs. No, I didn't enable
> Open MPI's heterogeneous support. Yes, the Open MPI version is the same
> on all nodes, through the same kickstart file.
> Actually, sir, Rocks itself installed and configured Open MPI and MPICH
> on its own through the HPC roll.
>
>
> On Fri, Apr 4, 2014 at 9:25 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>>
>> On Apr 3, 2014, at 8:03 PM, Nisha Dhankher -M.Tech(CSE) <
>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>
>> Thank you, Ralph.
>> Yes, the cluster is heterogeneous...
>>
>>
>> And did you configure OMPI --enable-heterogeneous? And are you running it
>> with --hetero-nodes? What version of OMPI are you using anyway?
>>
>> Note that we don't care if the host pc's are hetero - what we care about
>> is the VM. If all the VMs are the same, then it shouldn't matter. However,
>> most VM technologies don't handle hetero hardware very well - i.e., you
>> can't emulate an x86 architecture on top of a Sparc or Power chip or vice
>> versa.
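One quick way to confirm that all the VMs really present the same architecture is to run a small check on each node and compare the output (a hypothetical check, not part of the original thread):

```python
# Print this node's architecture and OS; run it on every node
# (e.g. over ssh) and compare -- a mismatch would explain
# heterogeneous-cluster problems with Open MPI.
import platform

arch = platform.machine()    # e.g. 'x86_64'
system = platform.system()   # e.g. 'Linux'
print(f"{system} {arch}")
```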
>>
>>
>> And I haven't made compute nodes directly on the physical nodes (PCs),
>> because in college it is not possible to take the whole lab of 32 PCs for
>> your own work, so I ran on VMs.
>>
>>
>> Yes, but at least it would let you test the setup to run MPI across even
>> a couple of pc's - this is simple debugging practice.
>>
>> In a Rocks cluster, the frontend gives the same kickstart to all the PCs,
>> so the Open MPI version should be the same, I guess.
>>
>>
>> Guess? or know? Makes a difference - might be worth testing.
>>
>> Sir,
>> mpiformatdb is a command to distribute database fragments to different
>> compute nodes after partitioning of the database.
>> And sir, have you done mpiblast?
>>
>>
>> Nope - but that isn't the issue, is it? The issue is with the MPI setup.
>>
>>
>>
>> On Fri, Apr 4, 2014 at 4:48 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> What is "mpiformatdb"? We don't have an MPI database in our system, and
>>> I have no idea what that command means
>>>
>>> As for that error - it means that the identifier we exchange between
>>> processes is failing to be recognized. This could mean a couple of things:
>>>
>>> 1. the OMPI version on the two ends is different - could be you aren't
>>> getting the right paths set on the various machines
>>>
>>> 2. the cluster is heterogeneous
>>>
>>> You say you have "virtual nodes" running on various PC's? That would be
>>> an unusual setup - VM's can be problematic given the way they handle TCP
>>> connections, so that might be another source of the problem if my
>>> understanding of your setup is correct. Have you tried running this across
>>> the PCs directly - i.e., without any VMs?
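The "no route to host" symptom can be checked independently of MPI with a small TCP probe (a hedged sketch; the node IPs and SSH port below are placeholders based on the addresses mentioned later in the thread):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers 'No route to host' (EHOSTUNREACH), refused, timeout, etc.
        return False

# Example: probe the SSH port on each compute node (placeholder IPs).
for node in ["10.1.255.236", "10.1.255.244"]:
    print(node, "reachable" if can_connect(node, 22) else "unreachable")
```

If a node shows up unreachable here, the problem is in the network or firewall configuration, not in Open MPI itself.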
>>>
>>>
>>> On Apr 3, 2014, at 10:13 AM, Nisha Dhankher -M.Tech(CSE) <
>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>
>>> I first formatted my database with the mpiformatdb command, then I ran:
>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i query.fas
>>> -o output.txt
>>> but it gave this error 113 from some hosts and continued to run for the
>>> others, with no results even after 2 hours had elapsed... on a Rocks 6.0
>>> cluster with 12 virtual nodes on PCs (2 on each, using virt-manager, 1 GB
>>> of RAM each).
>>>
>>>
>>> On Thu, Apr 3, 2014 at 10:41 PM, Nisha Dhankher -M.Tech(CSE) <
>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>
>>>> I also made a machine file which contains the IP addresses of all
>>>> compute nodes, plus a .ncbirc file for the path to mpiblast and the
>>>> shared/local storage paths.
>>>> Sir,
>>>> I ran the same mpirun command on my college supercomputer (8 nodes,
>>>> each having 24 processors), but it just kept running and gave no
>>>> result up till 3 hours...
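For reference, an Open MPI machinefile such as the "mf" used above is a plain text file listing one host per line, optionally with a slot count; the IPs and slot counts below are placeholders based on the thread's setup:

```
# one host (or IP) per line; slots = number of processes to launch there
10.1.255.236 slots=2
10.1.255.244 slots=2
```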
>>>>
>>>>
>>>> On Thu, Apr 3, 2014 at 10:39 PM, Nisha Dhankher -M.Tech(CSE) <
>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>
>>>>> I first formatted my database with the mpiformatdb command, then I
>>>>> ran:
>>>>> mpirun -np 64 -machinefile mf mpiblast -d all.fas -p blastn -i
>>>>> query.fas -o output.txt
>>>>> but it gave this error 113 from some hosts and continued to run for
>>>>> the others, with no results even after 2 hours had elapsed... on a
>>>>> Rocks 6.0 cluster with 12 virtual nodes on PCs (2 on each, using
>>>>> virt-manager, 1 GB of RAM each).
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 3, 2014 at 8:37 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>
>>>>>> I'm having trouble understanding your note, so perhaps I am getting
>>>>>> this wrong. Let's see if I can figure out what you said:
>>>>>>
>>>>>> * your perl command fails with "no route to host" - but I don't see
>>>>>> any host in your cmd. Maybe I'm just missing something.
>>>>>>
>>>>>> * you tried running a couple of "mpirun", but the mpirun command
>>>>>> wasn't recognized? Is that correct?
>>>>>>
>>>>>> * you then ran mpiblast and it sounds like it successfully started
>>>>>> the processes, but then one aborted? Was there an error message beyond just
>>>>>> the -1 return status?
>>>>>>
>>>>>>
>>>>>> On Apr 2, 2014, at 11:17 PM, Nisha Dhankher -M.Tech(CSE) <
>>>>>> nishadhankher-coaeseeit_at_[hidden]> wrote:
>>>>>>
>>>>>> error btl_tcp_endpoint.c:638 connection failed due to error 113<http://biosupport.se/questions/696/error-btl_tcp_endpintc-638-connection-failed-due-to-error-113>
>>>>>>
>>>>>> In Open MPI: this error came when I ran my mpiblast program on a
>>>>>> Rocks cluster. Connecting to hosts failed on IPs 10.1.255.236 and
>>>>>> 10.1.255.244. And when I run the following command:
>>>>>> linux_shell$ perl -e 'die$!=113'
>>>>>> this message comes: "No route to host at -e line 1."
>>>>>> shell$ mpirun --mca btl ^tcp
>>>>>> shell$ mpirun --mca btl_tcp_if_include eth1,eth2
>>>>>> shell$ mpirun --mca btl_tcp_if_include 10.1.255.244
>>>>>> were also executed, but it did not recognize these commands and
>>>>>> aborted. What should I do? When I ran my mpiblast program for the
>>>>>> first time, it gave an mpi_abort error, bailing out on signal -1 on
>>>>>> the rank 2 processor; then I removed my public Ethernet cable, and
>>>>>> it gave the btl_tcp_endpoint error 113...
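For reference, error 113 on Linux is the standard errno EHOSTUNREACH ("No route to host"), which is why the perl one-liner above prints that message; the same lookup can be done in Python:

```python
import errno
import os

# On Linux, errno 113 is EHOSTUNREACH ("No route to host"): the kernel
# has no route to the destination, typically caused by a firewall rule
# or an interface/subnet mismatch between the nodes.
code = errno.EHOSTUNREACH
print(code, errno.errorcode[code], os.strerror(code))
```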
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
>
>