HEPiX Meeting at SLAC, 27 - 29 Oct 1993

Minutes



                               Alan Silverman

                             December 15, 1993



     The Fall 93 North American HEPiX meeting was held at SLAC
from October 27th to 29th. The attendance list had over 40 names, but
there were never more than 25-30 people in the meeting room at any one
time. However, it was nice to see quite a few university people present,
as well as a few actual users, a welcome development from previous
European and US meetings, which had been attended largely by system
providers, mostly from the major labs. The overheads of all sessions
are available in the Computer Science Library.



1     Introduction

Chuck Dickens, Director of the SLAC Computing Centre, gave the opening
keynote address and described how they had arrived at a new computing
model for SLAC. He said the rate of change of technology was so
rapid that they were wary of specifying any precise model for future
computing. Nevertheless, their goals were client/server computing;
a centrally-switched network with the eventual goal of ATM; and desktops
that were primarily mass-market based (which he clarified under
questioning as mainly PC and Mac, but not necessarily NT). He said the
lab was too small to build all their own systems and should concentrate
on only one or two types of fully-supported desktops. There should be
common applications and common access to servers and the network, and
these should also be mass-market based.

     The general strategy should focus on applications, not systems, and
they should select applications which run on multiple platforms. He
stated that they should "use popular objects from profitable vendors",
which would seem to exclude purchases from IBM and DEC! Solid support
structures were important, and outsourcing might be used for support.
Another quote was that they should "buy adequate technology, not
necessarily the best".



2     Site  Reports

2.1      SLAC

This was presented by Chuck Boeheim, Server Computing Group Manager.
Central UNIX support was provided by some 11 people offering UNIX
administration for the central servers and a certain number of
workstations, as well as support for local administrators. They had a
small prototype workstation farm of three RS/6000 systems connected to
six 10-tape Exabyte stackers. Their experiences so far, early in the
evaluation, were that code conversion from VM was not easy, that the
exercise was labour-intensive and that Exabyte was not yet production
quality. They were working with IBM's LoadLeveler, but some vital
features were still missing, such as fair sharing among users and a
policy-based resource allocation scheme. Work was continuing with IBM
and with other US labs on this product.

     They had established a first AFS pilot and were in the process of
acquiring ADSM from IBM for UNIX file backup. The testing of UNIX
systems connecting to tape silos was also underway and they would move
the ADSM master to an RS/6000 in due course as part of the migration
away from VM.

     SLD and the newly-approved Beauty Factory (official confirmation
came through during the second day of the meeting) would lead to
greatly-increased data rates, and they were upgrading to STK 4490
cartridge drives, giving 24 TB online. The next stage could be helical
scan devices from STK in 2-3 years, which could offer 600-900 TB if the
current 4 silos were re-equipped.

     ZTAGE from IN2P3 was currently in use, but NSL Unitree from
IBM would be evaluated for staging and managing experimental data,
and FATMEN for tape staging and file management.



2.2      TRIUMF

It was stated that having an AT&T source licence gives unlimited user
access to UNIX systems; this was to be verified in connection with the
recent Solaris licensing issue. [In checking with Dietrich Wiegandt, this
is in fact only true for the particular system mentioned in the contract
with AT&T.] TRIUMF took a very pragmatic approach to UNIX support
- select only well-tried packages and involve the end-users as much as
possible. Probably very wise for a small site. Also, choose a single
vendor (DEC in their case).



2.3      FNAL

See the Pisa notes. Use of LoadLeveler and evaluation of NSL Unitree
were again mentioned. AFS at FNAL is discussed later.



2.4      Yale

Yale was the first university to report; the speaker was openly seeking
answers to many questions (OSF, AFS, etc.) and pleading for help and
guidance from the major labs.



2.5      Notre Dame

A small US university site, partly VMS and partly UNIX, but with a large
AFS installation used by their students. They would soon install an SP1
with 16 nodes. They run experiments at FNAL and BNL.



2.6      BNL

This was presented by Ed McFadden. A total lab population of about
450 UNIX systems, two-thirds SUN, plus 80 VMS and over 1000 PCs;
most UNIX was around the Computing Centre (CCD). They had a
small computing farm, mostly RS/6000; an Alpha loaned by DEC had
been returned and a loaned SGI Challenge was likely to suffer the same
ignominious fate. The problem in each case was lack of user interest,
possibly caused by the fact that users were required to pay CCD for
services.

     They had bought a WORM device for tests and then discovered that
the platters were too expensive ($325 for 6 GB).



3     CVS

Paul Kunz made a plea for people to use CVS for program source control
instead of SCCS or vanilla RCS. It was now in wide use at SLAC and
also at LBL, FNAL, CERN (Phil Defert), etc. The Internet newsgroup
for CVS also showed a healthy interest from the commercial world; at
Thinking Machines, 100 software engineers worked on a project with a
single 1 GB source directory tree under CVS control.



4     GUI  Programming

Paul Lebrun presented some work done at FNAL to develop a sort of
style guide for Motif-like programming to promote simple-to-learn and
consistent graphical programming. It was implemented as a set of widget
libraries. In passing, Paul mentioned NEdit, a popular new WYSIWYG X
editor in the public domain.
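
     As an aside for readers who have not written raw Motif, the sketch
below illustrates the sort of boilerplate such wrapper libraries aim to
hide: a single push button with one callback already needs toolkit
initialisation, compound-string handling and an explicit event loop.
This is a minimal, hypothetical example assuming a standard Motif/Xt
installation, not FNAL's library; library paths in the build line will
vary between systems.

/* hello_button.c - minimal raw Motif: one push button that prints a line.
 * Build (library paths vary by system):
 *     cc hello_button.c -o hello_button -lXm -lXt -lX11
 */
#include <stdio.h>
#include <Xm/Xm.h>
#include <Xm/PushB.h>

/* Callback invoked when the button is activated. */
static void pushed(Widget w, XtPointer client_data, XtPointer call_data)
{
    printf("Button pressed\n");
}

int main(int argc, char **argv)
{
    XtAppContext app;
    Widget       toplevel, button;
    XmString     label;

    /* Initialise the Xt toolkit and create the application shell. */
    toplevel = XtVaAppInitialize(&app, "Hello", NULL, 0,
                                 &argc, argv, NULL, NULL);

    /* Create and manage a single push-button widget. */
    label  = XmStringCreateLocalized("Press me");
    button = XtVaCreateManagedWidget("button", xmPushButtonWidgetClass,
                                     toplevel, XmNlabelString, label, NULL);
    XmStringFree(label);

    /* Register the callback, realise the widgets and enter the event loop. */
    XtAddCallback(button, XmNactivateCallback, pushed, NULL);
    XtRealizeWidget(toplevel);
    XtAppMainLoop(app);
    return 0;
}

A wrapper library of the kind described would typically reduce the widget
creation and callback registration above to one or two calls with
consistent defaults, which is what makes a common style guide enforceable.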



5     AFS

5.1      CERN

I presented the AFS talk prepared by Rainer Tobbicke for the October
IBM SHARE Europe meeting. When we came to the issue of long-lived
tokens for batch jobs, Keith Rich of SSC mentioned a couple of points:
Kerberos had the notion of a "promise" of authorisation which could be
kept dormant until the job was actually scheduled, and he had written
a wrapper around the token which could be used to prolong its life
while the job executed. He agreed we could contact him for more
information. There was great interest in the AFS User Guide which
Rainer is currently writing.


5.2      SSC

Their prime servers were on ULTRIX but they also had SUNs. There
were a total of 7 servers with 41 GB of disc space, some 150 AFS clients,
and 3 AFS/NFS translators to which some 200 workstations attached from
time to time. No problems had been seen with AFS/NFS translation but,
like CERN, they had not stressed this.


5.3      FNAL

Severe problems to report. There were few real clients as AFS was
simply used to serve the CLUBS system via the NFS translator. When
this had been run on an RS/6000 AFS server, it had crashed under
load. Sometimes files had "looked" corrupt and, if they were re-written
in this state, they could actually end up being corrupted. K. Rich (SSC)
emphasised that the NFS translator should not be run on a server, for
reasons of load on a single system, and no one else in the audience had
seen a similar problem. FNAL had moved to using a beta test version
of the SGI port; a production version was not expected until SGI
released IRIX 5.1.1, now due 1Q94. The impression was given that FNAL
were alone in their bad experiences with AFS and had perhaps, with
hindsight, made some wrong choices in their setup.


5.4      SLAC

After initially (spring 92) deciding to wait for DFS (expected then to
be 6 to 12 months away), SLAC had recently decided to install a small
(6 GB) pilot and now planned to expand this next year, since DFS still
looked 6 to 12 months away (this became the key phrase of the meeting).
They used an RS/6000 as file server and SUNs as volume location
servers. They had even installed the NFS translator on the IBM and had
seen no problems yet, but it was little used. After the FNAL experience,
they would rethink this (HEPiX CAN be useful).


5.5      Other Points

There was a big demand from all AFS sites present for more and better
AFS documentation for administrators. ADM was used by some sites
for AFS administration tasks, and a new tool, sysctl, from the IBM
T.J. Watson Centre was mentioned; it might soon become freely available
(see my forthcoming LISA Conference trip report). Another worry was
the price which might be charged for DFS licences.



6     ASIS

Philippe Defert presented the plans around ASIS. During this, it was
pointed out that ASIS should NOT be described as offering "public
domain" software but rather "freely available" software.



7     FREEHEP

A repeat of the Annecy CHEP 92 session. Later in the meeting, methods
were discussed for cross-linking FREEHEP with some projected HEPiX
initiatives in building up a database of useful UNIX tools for HEP sites.



8     Farming  at  FNAL

8.1      Data Mining at FNAL

Large computational needs led to workstation farms, some for
CPU-intensive jobs, some more I/O oriented. The talk described some
useful tools used on their farms (see overheads), and some interesting
benchmarks were presented comparing farms with mainframes,
supercomputers and MPP systems.

     The CAP (Computing for Analysis Project) was an I/O-intensive
farm utilising parallel I/O for experiments with between 25 and 200
TB of data each, using robotics and HSM. The data mining aspect
came from searching for specific data samples. The prototype was based
on the SP1, and three were installed, to move later to SP2s. The data
cache was a large Exabyte robot device with 8 drives initially. Files
were transferred into the 8 I/O nodes over 20 MBps SCSI and fed to the
16 processing nodes. They expected an overall rate of 8 to 10 MBps for
effective tape reading into the batch nodes.

     In passing, the speaker noted that out of some 100 Exabyte drives
in the Centre, they "only" had about one drive repair per day. The
following speaker, from the same group, spoke of "many" problems with
Exabytes, one per day. Life viewed from opposite ends!


8.2      Operator Console at FNAL

This was a tool to help with tape drive allocation and to provide
instructions to the tape-mounting operators.


8.3      Robotic Tape Control

The other side of the coin, this was a utility to control tape mounting
inside robots, Exabyte and others. It was not (yet) linked to the
previous tool.



9     DNQS  at  BNL

BNL had taken DNQS from McGill and used it in CCD. But they
"missed" HEPVM batch and would love someone to port the best
features of this to UNIX. In the meantime, they were looking at DQS
(also from McGill) and LoadLeveler.



10       COSE

Judy Richards arrived hot-foot from San Jose further down Route 101
and presented her impression of the CDE initiative of COSE.



11       Vendor  Talks

11.1      SGI

The first of three invited vendor speakers, Forrest Baskett is a senior
VP of SGI (a founder?) and head of Research and Development. He
presented the company and its products before moving on to describe
why SGI believed in shared memory for parallel programming projects.
He described a cluster of 16 20-node Challenges which had been used for
a turbulence flow calculation with impressive results. He did, however,
state that SGI were also looking into distributed memory solutions using
High Performance Fortran for some future systems.


11.2      IBM

The most entertaining talk of the week was given by Merrit Jones of
IBM's Metro Systems Division in Houston. He is a systems integrator
who said he could hook any cluster configuration together as long as he
had the cables and the drivers. He would happily select any of the four
most popular methods of clustering - bus, star, ring or cross-bar; he
used a huge variety of links including VME, Ethernet, HIPPI, etc., etc.

     He was involved in the project at the National Storage Lab (NSL)
to develop a network-attached, high-throughput storage system. Many
vendors were involved, including IBM, Ampex, DISCOS, etc., plus user
sites such as LLNL and four US National Science Foundation
Supercomputing Centres. The inclusion of the Pittsburgh Centre implied
the use of AFS, which would thus have to be integrated with Unitree.
It was from the NSL initiative that IBM had developed and were making
available NSL Unitree.

     The initial target was a sustained rate of 50 MBps, moving later to
500 MBps. He could handle many kinds of robot, including IBM's 3480
cartridges, Exabytes, IGM (Summus) and "protein robots" - his term
for humans.

     A 128-node RS/6000 cluster was going into Argonne very soon, linked
by a HIPPI switch and using 4 file servers connected by Fibre Channel.

     His talk was very well presented and he was eager to impart
information. When asked about tapes, he explained that he was "not a
tape expert but (he) had once sat next to a guy on a plane who had known
one". He then launched into the relative merits of helical scan (good
for price and density, but you sacrifice shelf life and reliability)
versus longitudinal (long lasting and low error rates).


11.3      DEC

The last (poor) talk was given by a guy from the High Performance
Computing group in Maynard and was largely a marketing pitch about
why Alphas were so good for HEP farming. Strangely, it was almost the
first time anyone had mentioned using Alphas in that role in the whole
meeting.



12       Closing

The next meeting was provisionally scheduled to follow, immediately if
possible, the CHEP94 conference in San Francisco next April, and the
attendees were then freed to enjoy the rest of the day and its 30-degree
sunshine (a late-October record for San Francisco).
