Abstracts for LCCWS2
BaBar Computing (Stephen J. Gowdy, SLAC)
BaBar has been taking data since 1999 and has accumulated almost 100
million BBbar decays (82 fb-1 on the Upsilon(4S) peak). In order to process
and analyse this data BaBar relies primarily on computing clusters located
in four "Tier-A" sites around the world. This talk will discuss how BaBar
uses these sites and issues related to a largely distributed collaboration.
CDF for Run II (Frank Wuerthwein, FNAL)
Run II at the Fermilab Tevatron Collider began in March 2001 and will
continue to probe the high energy frontier in particle physics until the
start of the LHC at CERN. It is expected that the CDF collaboration will
store up to 10 Petabytes of data onto tape by the end of Run II. Providing
efficient access to such a large volume of data for analysis by hundreds
of collaborators world-wide will require new ways of thinking about computing
in particle physics research. In this talk, I discuss the computing model
at CDF designed to address the physics needs of the collaboration. Particular
emphasis is placed on the current development of an O(1000)-node PC cluster at
Fermilab serving as the Central Analysis Facility for CDF and the vision
for incorporating this into a decentralized (GRID-like) framework.
D0 and SAM for Run II (Lee Lueking, FNAL)
SAM has been developed within the Computing Division at Fermilab as
a versatile, distributed, data management system. One of its many
features is its ability to control processing and manage a distributed cache
within a cluster of compute servers. Requirements, concepts, and features
of this system are described and issues involved in interfacing it to several
batch systems are discussed. The system is used within the
Dzero experimental collaboration to distribute hundreds of Terabytes of
data for processing and analysis around the world. Several hardware
configurations deployed at Fermilab are described. Data is currently
disseminated using this system to over two dozen sites worldwide, and this
number will grow to nearly one hundred in the coming years. The planned design
evolution to accommodate this growth is discussed, and the transition of
the system to grid standard middleware is described.
Managing a mature white box cluster at CERN (Tim Smith, CERN)
A large and diverse multi-vendor white box farm built up over the years
poses challenges which are uncharacteristic of new single-acquisition large
farms. The combination of diversity and scale prevents the application
of either of the extremes of 'vendor supplied' or 'hand-crafted' solutions
to the problems encountered. Vendor interventions are frequent, as are
acquisition cycles, which feeds a constant in- and out-flow of machines
involving a large number of people. Additionally, the provision of software
management tools has traditionally been distributed over multiple teams,
implying a large and diverse set of application delivery mechanisms. I will describe
how CERN is facing these challenges whilst reducing the number of hands
on the systems, through automation, standards, procedures, and the adoption
of workflows to manage machine life-cycles.
University Multidisciplinary Scientific Computing: Experience
and Plans (Alan Tackett, Vanderbilt University)
The significant investment made in information technology research over
the past two decades has been justified in large part by the desire to enable
great science through the "transformative power of computational resources."
This investment is having the desired effect, and high-end cyberinfrastructure
is becoming essential for researchers in many disciplines.
At Vanderbilt, researchers in genetics, structural biology, and high energy
physics who felt that their rate of discovery was constrained by the paucity
of available computational resources have established a campus scientific
computing center. This grass-roots initiative has been a significant
success, and new disciplines are joining. We will describe the organizational
and resource allocation models for this center, our recent experience, and
plans for the future.
Building a Computer Centre (Tony Cass, CERN)
Although building a computer centre looks simple, many sites find that they do not have the necessary
infrastructure to support a large number of PCs. In particular, the importance
of ensuring adequate air conditioning, and not just adequate floor space,
is often overlooked. The speaker will review such issues based on experience
gained in planning the upgrade of CERN's physical infrastructure to support
the needs of LHC computing.
A Pre-Production Update on the NSF TeraGrid (Remy Evard, ANL)
The TeraGrid Project is the National Science Foundation's next-generation
infrastructure in support of scientific computation. When completed,
the TeraGrid will include 13.6 teraflops of Linux cluster computing power
distributed at the four TeraGrid sites, facilities capable of managing and
storing more than 450 terabytes of data, high-resolution visualization environments,
and toolkits for grid computing.
While the TeraGrid will soon be expanded to include other sites, Argonne
National Laboratory is one of the four initial TeraGrid participants, along
with the National Center for Supercomputing Applications, the San Diego
Supercomputer Center, and the California Institute of Technology.
TeraGrid is scheduled for production in spring of 2003. In this talk,
I will describe the TeraGrid project, the planned infrastructure, and the
current status of the project, including a focus on the issues we've encountered
thus far.
Grid/Fabric interaction (discussion led by Bernd Panzer-Steindel,
CERN)
In principle the middleware would be a service layer 'above' the Fabric,
but it seems that in practice there are quite a few interactions imposing
possible constraints on the Fabric design itself: storage cache layers,
security (network visibility of worker and storage nodes), 'home' directory
requirements, support for different Linux versions in the Fabric and on the
Grid services, etc. Is this really a problem (compromises)? Who adapts to
whom? What is the experience so far? Can and should one influence the
developments? Should the Fabric impose strict rules?
European DataGrid Fabric Management (Olof Barring, CERN)
The fabric management workpackage of the EU DataGrid project has as an
objective to provide the tools to automate the management of large computer
fabrics. This talk, which provides an update on the status and plans I presented
at the LCCWS in May 2001, is structured as follows: A short introduction of
the EU DataGrid project will be followed by a detailed description of the
architecture of the automated fabric management. Thereafter, the status and
lessons learned from the first one and a half years of the project's three-year
duration will be presented. Finally, I will describe our current developments
and immediate plans up to March 2003.
Evaluation of a MOSIX cluster as a group analysis facility at CDF
(T. Kim, A. Korn, Ch. Paus, M. Neubauer, D. Waters, F. Wuerthwein; FNAL)
Faced with the task of choosing a model for group computing at CDF, we
built a 30-CPU MOSIX cluster using commodity PC hardware. MOSIX (Multi-computer
Operating System for UnIX) is an extension to the Linux kernel that allows
for transparent process migration and dynamic load balancing. It was
developed at the Hebrew University of Jerusalem School of Computer Science.
We have operated such a cluster for 1.5 years and it has been extensively
used for data analysis by a group of 10-15 people. We describe our
experiences with the current system, its advantages and shortcomings.
We comment on performance and scalability.
The CMS Tier 1 computing center at Fermilab (Hans Wenzel, FNAL)
Currently we are building a CMS Tier 1 computing center at Fermilab. Among
other activities we participate in distributed Monte Carlo Production,
provide resources for interactive and batch computing for user analysis,
prepare to be part of the emerging LHC computing grid and prepare to provide
computing in the LHC era.
We will describe the current installation at Fermilab and the
technology choices we made:
- the dCache system, which makes a multi-terabyte server farm look
like one coherent and homogeneous storage system and provides rate adaption
between the application and tertiary storage (Enstore);
- LVS and FBSNG, two approaches to providing load-balanced logins
to a farm of Linux computers for data analysis;
- the FBSNG batch system for farms;
- Disk Farm, which makes the data disks attached to individual farm nodes
look like one coherent and homogeneous storage system.
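The load-balanced login approach mentioned above can be illustrated with a minimal LVS (Linux Virtual Server) sketch using the standard ipvsadm tool; the addresses and scheduling choice below are hypothetical illustrations, not the actual Fermilab configuration:

```shell
# Minimal LVS sketch for load-balanced interactive logins, run as root on
# the director node. All IP addresses here are hypothetical examples.

# Create a virtual SSH service on the cluster's public address, dispatching
# new connections round-robin ("-s rr") across the real nodes.
ipvsadm -A -t 192.0.2.10:22 -s rr

# Register two analysis nodes behind the virtual service (NAT forwarding).
ipvsadm -a -t 192.0.2.10:22 -r 10.0.0.1:22 -m
ipvsadm -a -t 192.0.2.10:22 -r 10.0.0.2:22 -m

# Inspect the resulting virtual-server table.
ipvsadm -L -n
```

Each new SSH connection to the virtual address is then handed to one of the real nodes in turn; a production set-up would typically add weighted scheduling and health checking of the real servers.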
FBSNG and Disk Farm - parts of a large cluster infrastructure (Igor
Mandrichenko, FNAL)
FBSNG is a farm batch system designed and developed at FNAL for Run II
data processing on large PC Linux farms. It has been successfully used by
Run II and fixed-target experiments at FNAL for several years. Disk Farm is
a system which organizes disk space distributed over a large number of farm
nodes into a single logical name space. Disk Farm features include a UNIX
file-system-like user interface, data replication, and load balancing. The article
briefly presents the two systems and summarizes the current status of, and
plans for, making them Grid-aware.
The GridKa Installation for HEP Computing (Dr. Holger Marten,
Forschungszentrum Karlsruhe)
At the beginning of 2002, Forschungszentrum Karlsruhe put the first prototype
of the "Grid Computing Centre Karlsruhe" (GridKa) into operation. GridKa
is designed to attack compute- and data-intensive problems in high energy physics
and other natural sciences in an international network of Grid Computing Centres.
It currently serves as a test facility for LHC Computing, a testbed set-up
for the EU-project CrossGrid, and it provides a production environment for
BaBar, CDF, D0 and Compass. GridKa will reach its second prototype stage in
October 2002, with approximately 250 Linux processors, 50 TB of online and
100 TB of tape capacity. The talk gives an introduction to the current
set-up, first experiences,
and future upgrades of GridKa.