Brian L Tierney et al 2007 J. Phys.: Conf. Ser. 78 012075 doi:10.1088/1742-6596/78/1/012075
Brian L Tierney, Dan Gunter and Jennifer M Schopf
Tracking failures and poor performance across a widely distributed system of resources has proven challenging for many ongoing DOE applications. An example is the Open Science Grid (OSG) project, which currently experiences a roughly 15% job failure rate. The problem affects not only Grid computing but anyone performing large-scale data transfers to remote machines, because of the large number of interconnected components and services involved.
As part of the Center for Enabling Distributed Petascale Science (CEDPS) project, we have been building an infrastructure that works with current middleware and existing system tools to make it easier to track failures and discover anomalous behavior. The infrastructure consists of a common logging format, an extension of syslog-ng for centralized collection of data, a data summarizer to make the volume of log data easier to manage, and an anomaly detection system that can connect to a warning system when unexpected behaviors occur. We are currently working with OSG to deploy a prototype of the full system. The initial logs gathered will be used to extend the analysis tools and to increase the reliability of the services for the SciDAC end user community.
07.05.Bx Computer systems: hardware, operating systems, computer languages, and utilities
Issue 1 (2007)