Fundación General CSIC

Lychnos

Notebooks of the Fundación General CSIC. Digital Edition




Information Technology

ISABEL CAMPOS

Instituto de Física de Cantabria (CSIC and Universidad de Cantabria)

Scientific computing infrastructure

In this article the author describes how the history of scientific computing infrastructure has always been linked to a symbiosis between science and technology. Numerical simulation on computers has helped us understand some of the fundamental mechanisms of nature, by allowing theoretical models to be explored in cases where experiments are not feasible.

In 1985 Richard Feynman, probably the most brilliant physicist of the second half of the 20th century, gave a talk in the main lecture theatre at Bell Labs, where, in his provocative style, he told the audience of scientists and engineers: “I don’t believe in Computer Science.” Feynman argued that science is the study of the behaviour of nature, whereas engineering is the study of man-made things, so computing, as such, belongs to engineering.

Feynman was able to apply his enormous insight as a physicist to many areas of knowledge and was a multidisciplinary scientist. One of his last tasks was to sit on the Rogers Commission entrusted with analysing the causes of the Challenger shuttle disaster. Feynman showed, in a famous televised demonstration, how the O-rings of the shuttle’s booster motors became brittle at sub-zero temperatures (such as those recorded the night before the launch) and lost the resilience needed to seal the joints, resulting in the explosion that destroyed the shuttle, with the loss of all its crew, in 1986.

It is not widely known that it was Feynman, winner of the Nobel Prize in Physics for a theory explaining the behaviour of light quanta, something that in principle seemed quite remote from everyday applications, who resolved the puzzle of the Challenger tragedy.

In particular, in relation to the subject of this article, Feynman also worked on the design of the first parallel supercomputers which appeared in the late 1970s, and was a key member of the team building the Connection Machine, a massively parallel computer.


Computers in general
A computer is a device able to take information as input, process it, and return the result as output. A modern computer can perform tasks as varied as beating the world champion at chess or forecasting tomorrow’s weather. Nevertheless, all computers are basically the same: they all replicate the digital machine architecture designed in 1945 by the Hungarian-American mathematician John von Neumann (see Figure 1).

The so-called von Neumann machine includes a memory in which to store information, a control unit to manage the flow of instructions and data in and out, and an arithmetic-logic unit able to carry out arithmetic and logical operations on this information.
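
To make the idea concrete, the sketch below, written in C (one of the standard languages mentioned later in this article), mimics a toy von Neumann machine: a single array plays the role of memory holding both instructions and data, a program counter stands in for the control unit and an accumulator for the arithmetic unit. The three-instruction set is purely illustrative and does not correspond to any real processor.

    /* A toy von Neumann machine: shared memory, a control unit (the program
       counter and the fetch-decode loop) and an arithmetic unit (the
       accumulator). The three opcodes are hypothetical. */
    #include <stdio.h>

    enum { LOAD, ADD, HALT };                     /* illustrative opcodes */

    int main(void) {
        int memory[12] = { LOAD, 10, ADD, 11, HALT, 0,   /* the program       */
                           0, 0, 0, 0, 2, 3 };           /* data at 10 and 11 */
        int pc  = 0;                              /* program counter */
        int acc = 0;                              /* accumulator     */

        for (;;) {
            int opcode  = memory[pc];             /* fetch           */
            int operand = memory[pc + 1];
            pc += 2;
            switch (opcode) {                     /* decode, execute */
            case LOAD: acc = memory[operand]; break;
            case ADD:  acc += memory[operand]; break;
            case HALT: printf("result = %d\n", acc); return 0;
            }
        }
    }

Instructions and data live in the same memory and pass through the same fetch-decode-execute cycle; everything a real machine does is, at bottom, a vastly faster version of this loop.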

This technology is implemented by building digital transistors on a silicon substrate, and hardware designers have striven to find ever more compact solutions. Packing more transistors onto a chip allows more operations to be performed per second, because smaller transistors switch faster and so allow the processor clock frequency to be increased. And it is the clock speed that sets the rate at which the processor completes each logical operation.




Over the last 30 years these efforts have allowed the number of transistors on the chips making up the computer’s internal components to double roughly every two years; sustained over three decades, that amounts to about fifteen doublings, or a more than 30,000-fold increase. This is the so-called Moore’s law, named after Gordon Moore, one of the founders of Intel. However, it is a prediction rather than a natural law, based on Moore’s observation in 1965 of how transistor integrated-circuit technology was developing.

Moore’s prediction has proven to be reasonably accurate. However, for some years now it has not quite matched the reality. It was known from the start that there are practical limits to making a standalone computer reach sufficient speed to solve arbitrarily complex problems.

These limits relate to the maximum speed at which information can be transmitted: the speed of light (30 cm per nanosecond) or the speed at which signals travel along a copper wire (9 cm per nanosecond). At a clock frequency of 3 GHz, for example, light covers barely 10 cm in a single clock cycle. Moreover, at very small scales the rules of classical physics no longer hold, and electrons, the information carriers in transistors, stop behaving like particles and start to behave like waves.

This brings us into the field of research known as quantum computing, which is still in its infancy.

Miniaturising transistors faces economic limits as well as physical ones, as it pushes up costs all the time. Therefore, the most recent architectures have preferred to replicate internal CPU (central processing unit) structures rather than take miniaturisation to the limits. This is the philosophy of multicore architectures, which are now used in almost all CPUs.

Improvements in CPU performance no longer come from making chip areas smaller but from adding more than one calculating unit or core to each CPU so that the overall processing speed is raised by aggregation. Indeed, core frequencies have levelled off at around 3.4 GHz in recent years.

The downside, however, is that the economic savings a multicore architecture enables come at the cost of increased programming complexity: the workload has to be distributed between the cores. In other words, it means a lot more work for operating systems developers and programmers.
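
As a minimal sketch of what that extra work looks like, the following C fragment uses OpenMP, one common way of spreading a loop over the cores of a multicore CPU; the array size and the choice of OpenMP rather than explicit threads are arbitrary choices made for the example.

    /* Summing an array on a multicore CPU. The pragma asks the compiler to
       split the loop iterations among the available cores and to combine
       the partial sums at the end. Build with, e.g., gcc -fopenmp sum.c */
    #include <stdio.h>
    #include <omp.h>

    #define N 10000000

    static double a[N];

    int main(void) {
        double sum = 0.0;

        for (long i = 0; i < N; i++)
            a[i] = 1.0;                           /* trivial workload */

        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

The arithmetic here is trivial; the point is the division of labour. Without the pragma the program runs on a single core, and it is the programmer, not the hardware, who has to decide how the workload is shared.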


Parallel computers
Since the late 1970s it has been clear to the scientific community that solving very complex problems would require a number of computers working together to share the processing load of the simulation.

In the mid-80s, the Thinking Machines Corporation, a pioneer in the design of parallel computers, began to produce machines based on the aggregation of thousands of processors literally wired together. These so-called Connection Machines (CM) went through five generations, the last of which appeared in 1993. The processors transmitted the information necessary to cooperate on calculations over the cables linking them together.

The CM demonstrates that unity is strength: the individual processors did not run at very high frequencies (between 10 and 30 MHz), but their advantage lay in the powerful internal communications network that linked 64,000 processors and enabled them to cooperate in solving problems.

The router, the piece of hardware allowing the processors to communicate with each other, was analysed by Feynman, who was in charge of calculating, by algebraic methods, the optimal topology for a network of 64,000 interconnected processors (Hopkins networks), so as to prevent the system from crashing as a result of a data overload on the cables.

Why was Richard Feynman interested in a topic so remote from theoretical physics? The answer is that he wanted to study quantum chromodynamics (QCD), so he applied his skills to developing the computer he needed.

QCD is a theory describing the behaviour of the world of subatomic particles, such as protons. QCD makes it possible to calculate the values of physically measurable quantities, such as the mass of the proton, using computer simulations. In practice, calculating the mass of the proton would keep any supercomputer from Feynman’s time busy for many years, and would still take several years on one of today’s models.

Feynman wrote the code to simulate QCD on the CM. He used his knowledge of BASIC to develop a parallel version of the language, and then worked through the simulation “by hand” to estimate how long it would take on the computer.

Incidentally, the CM machines had a striking visual design. A CM-5 appeared in Jurassic Park (Photo 1) where it featured as the computer in the island’s control room.

The history of the Connection Machine illustrates a symbiosis that recurs throughout scientific computing: engineers and physicists team up to design hardware optimised for scientific applications. The most extreme form of this collaboration is the dedicated computer, a machine whose electronics are devoted to solving a specific scientific problem, at the price of being less efficient at more general calculations. The team efforts to design machines dedicated to solving QCD stand out: these include the APE group, based at the University of Rome, with an offshoot in Spain at the University of Zaragoza, and the QCDOC group at Columbia University in the US. These groups have been developing computers dedicated to solving QCD since the 1980s.

In the 80s and 90s CRAY also successfully produced supercomputers, although its approach was somewhat different, using fewer but more powerful processors. For example, the CRAY X-MP (1982-1985) was a machine with between one and four processors at most, running at a clock frequency of around 120 MHz.

In 1988 the state-owned company Construcciones Aeronáuticas S.A. (today, EADS España) bought Spain’s first supercomputer, a CRAY 1-S/2000. The Spanish scientific community was allowed to use 975 hours a year of computer time. The remainder of the time it was dedicated to studying aircraft aerodynamics as a substitute for extremely costly wind-tunnel experiments.

In 1985 a CRAY X-MP cost about 15 million dollars and had a calculating power of 420 million floating-point operations per second (420 megaflops). Obviously, a purchase of this type was only possible at national level.





  Photo 1.
Left, the Connection Machine CM-5. The red LEDs indicate active connections between processors.
Right, its contemporary, the CRAY X-MP.



The age of the megaclusters
Computers based on highly specialised hardware are at the cutting edge of technology. They are costly to build and, above all, limited to highly specific operations. Alongside the hardware, manufacturers had to develop an ad hoc programming language, compilers and even an operating system that took the hardware’s specifics into account.

It was therefore necessary to learn a new programming language for each machine and to rewrite the simulation code each time. It is not surprising, then, that the hardware and software of these systems have evolved towards less specialised designs, generally running UNIX-compatible operating systems and supporting standard programming languages such as C, although at the cost of a 10-20% performance penalty with respect to languages specially adapted to the hardware.

In parallel, software has evolved in the same direction. Two fundamental milestones were the development of open-source code in the GNU project and the spread of Linux as an operating system, which has now become the de facto standard in the scientific world.

This process of simplification was to a large extent the consequence of the boom in commercial computing that began in the mid-90s. This led to hardware becoming much cheaper to produce as a result of manufacturers competing to offer higher performance products, aimed at the computer games market in particular.

Scientific computing today is largely done using Beowulf clusters. This is the generic term for computers made by networking several commercially available commodity-grade units via a switch enabling them to exchange data. The first Beowulf cluster was put together in the US in 1994.

Manufacturers continue to include special features in their products aimed at customers in the scientific sector, above all focusing on improving the efficiency of floating-point calculations. But essentially all of today’s supercomputers are, like Beowulf clusters, collections of processors connected by a faster switch.
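
The programming model that accompanies this architecture is message passing, usually through the MPI library, the de facto standard on clusters. The fragment below is a minimal sketch rather than a real application: each process computes a placeholder partial result, and the pieces are combined across the network.

    /* Minimal MPI sketch: every process in the cluster computes its own
       partial result and the values are combined over the switch. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which process am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many are cooperating? */

        double partial = (double)(rank + 1);      /* placeholder workload */
        double total   = 0.0;

        /* Gather and add the partial results on process 0. */
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %.0f\n", size, total);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with, for example, mpirun -np 4, the same source runs unchanged on a laptop, on a departmental Beowulf cluster or on hundreds of thousands of cores; what changes is the switch in between.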

At the time of writing, the world’s biggest Beowulf-style cluster is the K Computer, at the RIKEN research centre in Japan. The K Computer comprises some 550,000 cores, able to reach a combined power of eight petaflops, which is to say, eight thousand trillion (8 × 10^15) floating-point operations per second.

In Spain, MareNostrum, the flagship computer at the Barcelona Supercomputing Center (the National Supercomputing Centre), has around 10,000 cores and is the biggest system in the country. Research centres with a significant computational component also house Beowulf clusters, typically comprising between 1,000 and 4,000 cores.





Distributed computing infrastructure
The progress of basic science usually runs parallel to the development of innovative solutions allowing scientists to tackle ever more complex problems. If scientists need infrastructure that is not currently available to run a scientific project, they often develop an ad hoc solution.

A paradigmatic example was the invention of the World Wide Web. In 1990 a group led by Tim Berners-Lee at the European Particle Physics Laboratory (CERN) designed a system to allow scientists to exchange files, which evolved into what we now know as the World Wide Web.

In the late 90s another visionary idea emerged in the field: just as the web could be used to share information via an Internet connection, why not use this same infrastructure to share computing power?

This led to the idea of creating a global distributed computing infrastructure, a Grid, that would initially serve the world’s particle physics researchers, who needed to analyse data from CERN’s new accelerator, the Large Hadron Collider (LHC).

The way in which the Grid’s infrastructure is organised has evolved: it would be simplistic to consider sharing information (the Web) and sharing computing resources (the Grid) as equivalent, given that the cost models are very different. What is important is that Grid technology makes it possible to share infrastructure if the need arises.

In Europe we have a single infrastructure supporting European scientists: the European Grid Infrastructure (EGI), which interconnects the national Grid infrastructures of 38 European countries.

The EGI comprises over 250,000 cores and more than 150 million gigabytes of storage. The Iberian Peninsula accounts for approximately 10% of this infrastructure. In 2011 the EGI infrastructure provided more than 1,500 million CPU hours. Among other things, this was used to analyse LHC data: the biggest machine ever built also requires the biggest computing infrastructure ever designed to help scientists look for the fundamental structure of matter.

Computer technology has progressed astonishingly over the last 30 years. This is borne out by the fact that a mobile phone today has as much computing power as the first supercomputer installed in Spain in 1988.

Looking to the future, I will end where I began, quoting Feynman as he conceptually introduced what we today call quantum computing: “… there is a lot of scope for making computers smaller; I have seen nothing in the laws of physics that prevents us from making a computer at the atomic level.”

Profile: Isabel Campos

Isabel Campos has a doctorate in Physics from the University of Zaragoza (1998) and has been a tenured CSIC scientist at the Cantabria Physics Institute since 2008. During her career she has worked at the Deutsches Elektronen Synchrotron (Hamburg) and the Brookhaven National Laboratory (New York) developing scientific applications in high performance computing environments. Before joining the CSIC she was in charge of managing scientific applications and Grid computing at the Leibniz Computing Centre, Munich, Germany’s largest and one of the leading centres of its kind in Europe. Her career has included contributions to the fields of simulation of particle physics, complex systems and nuclear physics.

She has published over 40 articles in high impact journals and delivered around 100 presentations at international conferences. She is currently director of the Spanish Grid Computing Infrastructure (es-NGI) and is on the executive board of the European Grid Infrastructure (EGI) foundation.

Published in No. 07

