p4_error, could not write to fd=5
Tornes, Ivan E
tornesi at BATTELLE.ORG
Mon Dec 13 14:04:28 EST 2004
I'm running a small 10-node Beowulf cluster on RedHat 9.0, kernel 2.4.20-8. I recently installed a SATA hard drive on my server, which is running Fedora Core 2. Each node has a single P4 processor with Hyper-Threading, so I have 20 virtual processors. I have a code that runs on all 20 processors and writes one file back to the server. This code, which I wrote, is very simple, and the file it writes is small. We have another code which, when run on multiple processors, writes one file to the server for each processor it runs on. Each of these files is initially about 5 MB. If I try to run this code on more than 16 processors, I get the following set of messages:
p18_7862: p4_error: : 10188
p5_8935: (16.022127) net_send: could not write to fd=5, errno = 32
p5_8935: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p9_7778: (16.013863) net_send: could not write to fd=5, errno = 32
p9_7778: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p12_6457: (16.008107) net_send: could not write to fd=5, errno = 32
p12_6457: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p18_7862: (15.995930) net_send: could not write to fd=5, errno = 32
p13_6458: (16.006609) net_send: could not write to fd=5, errno = 32
p4_error: latest msg from perror: Broken pipe
p13_6458: p4_error: net_send write: -1
p17_6533: (15.998589) net_send: could not write to fd=5, errno = 32
p17_6533: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p3_11299: (16.030212) net_send: could not write to fd=5, errno = 32
p3_11299: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p7_6234: (16.025097) net_send: could not write to fd=5, errno = 32
p7_6234: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p19_7863: (16.005727) net_send: could not write to fd=5, errno = 32
p19_7863: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p15_6557: (16.016101) net_send: could not write to fd=5, errno = 32
p15_6557: p4_error: net_send write: -1
p4_error: latest msg from perror: Broken pipe
p10_6717: p4_error: net_recv read: probable EOF on socket: 1
p11_6718: p4_error: net_recv read: probable EOF on socket: 1
bm_list_12496: (16.549245) net_send: could not write to fd=5, errno = 9
bm_list_12496: p4_error: net_send write: -1
p4_error: latest msg from perror: Bad file descriptor
p3_11299: (52.203735) net_send: could not write to fd=5, errno = 32
p17_6533: (52.172481) net_send: could not write to fd=5, errno = 32
p7_6234: (52.195939) net_send: could not write to fd=5, errno = 32
p5_8935: (52.200184) net_send: could not write to fd=5, errno = 32
p12_6457: (52.187123) net_send: could not write to fd=5, errno = 32
p9_7778: (52.194215) net_send: could not write to fd=5, errno = 32
p15_6557: (52.186913) net_send: could not write to fd=5, errno = 32
p13_6458: (52.195577) net_send: could not write to fd=5, errno = 32
p19_7863: (52.186255) net_send: could not write to fd=5, errno = 32
p10_6717: (52.302025) net_send: could not write to fd=5, errno = 32
p11_6718: (52.439764) net_send: could not write to fd=5, errno = 32
mpiexec: Warning: tasks 3,5,7,9-13,15,17-19 exited with status 1.
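For anyone triaging output like the above, here is a minimal sketch that tallies the net_send failures by errno symbol, assuming the log has been saved as plain text. The function name `summarize` and the regular expression are illustrative assumptions and are not part of MPICH, p4, or mpiexec. On Linux, errno 32 is EPIPE (a write to a socket whose peer has already closed it) and errno 9 is EBADF, so the many "Broken pipe" lines may be a symptom of some other process exiting first rather than the root cause.

```python
import errno
import re
from collections import Counter

# Hypothetical pattern matching the p4 "net_send" lines shown in the log above.
NET_SEND = re.compile(
    r"^(?P<proc>\w+)_(?P<pid>\d+): \((?P<t>[\d.]+)\) "
    r"net_send: could not write to fd=(?P<fd>\d+), errno = (?P<errno>\d+)"
)

def summarize(log_text):
    """Count net_send write failures, keyed by errno symbol (e.g. 'EPIPE')."""
    counts = Counter()
    for line in log_text.splitlines():
        m = NET_SEND.match(line.strip())
        if m:
            code = int(m.group("errno"))
            # Fall back to the raw number if the platform has no symbol for it.
            counts[errno.errorcode.get(code, str(code))] += 1
    return counts

# Three lines copied from the log above; only two are net_send failures.
sample = """\
p5_8935: (16.022127) net_send: could not write to fd=5, errno = 32
p5_8935: p4_error: net_send write: -1
bm_list_12496: (16.549245) net_send: could not write to fd=5, errno = 9
"""

result = summarize(sample)
for name, n in sorted(result.items()):
    print(name, n)
```

Run against the full log, a tally like this makes it easier to separate the large group of EPIPE failures from the few lines that differ, such as the initial `p4_error: : 10188` from p18 and the two "probable EOF on socket" messages.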
I'm not sure if this is a PBS problem, something with mpiexec, or SATA with Linux. Occasionally, when this code is run on 16 processors (or fewer), it dies with the same error at some random point in the middle of the run. However, I always get these messages when I try to run on more than 16. Whenever these large files are being written, the server seems to hang for the duration of the write. I never had this problem before, but I only had 16 processors then and I was using an IDE drive at the time. It could be a problem with Linux and SATA, but I guess it could also be an mpiexec or PBS problem. Any suggestions would be great. Thanks.
Ivan