We recently upgraded the network of our six-node cluster with InfiniBand modules.
We have installed all the OFED drivers for our hardware.
We configured the network addresses as follows:
- eth : 192.168.1.0 / 255.255.255.0
- ib : 192.168.70.0 / 255.255.255.0
After the first tests everything seemed fine: the IB interfaces ping each
other, and ssh and other kinds of traffic over IB work well.
Then we started running our jobs through Open MPI (built with the
--with-openib option), and our first results were very bad.
After investigation, our system shows the following behaviour:
- the job starts over the ib network (a few packets are sent)
- the job switches to the eth network (all subsequent packets are sent over
  these interfaces)
We never specified the IP addresses of our eth interfaces.
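For reference, here is a minimal sketch of how the per-interface packet counts behind the observation above could be collected, assuming a Linux system that exposes counters under /sys/class/net. The SYSFS_NET variable and the function names are hypothetical and only there to make the sketch easy to try. One caveat: traffic sent natively over InfiniBand verbs bypasses the IP stack, so it would presumably not appear in the ib interface counters shown here; the HCA port counters under /sys/class/infiniband may be a better place to look for it.

```shell
#!/bin/sh
# Sketch: print the transmitted-packet counter of every network interface,
# so that a snapshot taken before a job can be compared with one taken after.
# SYSFS_NET is a hypothetical override; by default the Linux sysfs path is used.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

tx_packets() {
    # $1 = interface name, e.g. eth0 or ib0
    cat "$SYSFS_NET/$1/statistics/tx_packets" 2>/dev/null
}

snapshot() {
    for dir in "$SYSFS_NET"/*; do
        # Skip entries that are not interfaces with readable statistics.
        [ -r "$dir/statistics/tx_packets" ] || continue
        name=$(basename "$dir")
        printf '%s %s\n' "$name" "$(tx_packets "$name")"
    done
}

snapshot
```

Running the script once before and once after a job, and comparing the two outputs, shows which IP interfaces carried traffic in between.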
We tried launching our jobs with the following options:
- mpirun -hostfile hostfile.list -mca blt openib,self
- mpirun -hostfile hostfile.list -mca blt openib,sm,self
- mpirun -hostfile hostfile.list -mca blt openib,self -mca btl_tcp_if_exclude lo,eth0,eth1,eth2 /common_gfs2/script-test.sh
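A possible way to check which transport Open MPI actually selects is to raise the verbosity of the BTL framework. The sketch below only assembles and prints such a command line; it runs it only if RUN=1 is set and mpirun is found. It uses the MCA framework name btl, which is the spelling Open MPI documents for its byte transfer layer components, together with the btl_base_verbose parameter. The HOSTFILE, PROGRAM and RUN variables and the build_cmd function are hypothetical names, and the verbosity level 30 is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: build an mpirun command that restricts Open MPI to the openib and
# self BTLs and asks the BTL framework to log its component selection.
# HOSTFILE and PROGRAM default to the values quoted in the commands above.
HOSTFILE="${HOSTFILE:-hostfile.list}"
PROGRAM="${PROGRAM:-/common_gfs2/script-test.sh}"

build_cmd() {
    # -mca btl openib,self        : allow only the openib and self components
    # -mca btl_base_verbose 30    : make the BTL framework report what it does
    printf 'mpirun -hostfile %s -mca btl openib,self -mca btl_base_verbose 30 %s\n' \
        "$HOSTFILE" "$PROGRAM"
}

CMD="$(build_cmd)"
echo "$CMD"

# Execute only on request, and only where mpirun is installed.
if [ "${RUN:-0}" = "1" ] && command -v mpirun >/dev/null 2>&1; then
    $CMD
fi
```

The extra log output should then indicate whether the openib component was opened and used on each node.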
The final behaviour remains the same: the job is initiated over ib and then
runs over eth.
We grabbed the OSU performance tests (osu_bw and osu_latency) and got fairly
good results (see attached files).
We have tried plenty of different things, but we are stuck: we don't have any
Thanks in advance for your help.