HISTORY OF RISC
Even before computers became as ubiquitous as they are now, they occupied a
place in students’ hearts and a place in engineering buildings, although it was
usually under the stairs or in the basement. Before the advent of the personal computer,
mainframes dominated the 1980s, with vendors like Amdahl, Honeywell,
Digital Equipment Corporation (DEC), and IBM fighting it out for top billing in
engineering circles. One need only stroll through the local museum these days for
a glimpse at the size of these machines. Despite all the circuitry and fans, at the
heart of these machines lay processor architectures that evolved from the need
for faster operations and better support for more complicated operating systems.
The DEC VAX series of minicomputers and superminis (not quite mainframes, but larger than minicomputers) was quite popular, but like contemporary architectures such as the IBM System/38, the Motorola 68000, and the Intel iAPX-432, its processors were growing more complicated and more difficult to design efficiently. Teams of engineers would spend years trying to increase the processor’s
frequency (clock rate), add more complicated instructions, and increase the
amount of data that it could use. Designers are doing the same thing today, except
most modern systems also have to watch the amount of power consumed, especially
in embedded designs that might run on a single battery. Back then, power
wasn’t as much of an issue as it is now—you simply added larger fans and even
water to compensate for the extra heat!
The history of Reduced Instruction Set Computers (RISC) actually goes back
quite a few years in the annals of computing research. Arguably, some early work in
the field was done in the late 1960s and early 1970s by IBM, Control Data Corporation
and Data General. In 1981 and 1982, David Patterson and Carlo Séquin, both at the
University of California, Berkeley, investigated the possibility of building a processor
with fewer instructions (Patterson and Sequin 1982; Patterson and Ditzel 1980),
as did John Hennessy at Stanford (Hennessy et al. 1981) around the same time.
Their goal was to create a very simple architecture, one that broke with traditional
design techniques used in Complex Instruction Set Computers (CISCs), e.g., using
microcode (defined below) in the processor; using instructions that had different
lengths; supporting complex, multi-cycle instructions, etc. These new architectures
would produce a processor that had the following characteristics:
• All instructions executed in a single cycle. This was unusual in that many
instructions in processors of that time took multiple cycles. The trade-off
was that an instruction such as MUL (multiply) was available without having
to build it from shift/add operations, making it easier for a programmer,
but it was more complicated to design the hardware. Instructions in
mainframe machines were built from primitive operations internally, but
they were not necessarily faster than building the operation out of simpler
instructions. For example, the VAX processor actually had an instruction
called INDEX that would take longer than if you were to write the operation
in software out of simpler commands!
• All instructions were the same size and had a fixed format. The Motorola
68000 was a perfect example of a CISC, where the instructions themselves
were of varying length and capable of containing large constants along with
the actual operation. Some instructions were 2 bytes, some were 4 bytes.
Some were longer. This made it very difficult for a processor to decode the
instructions that got passed through it and ultimately executed.
• Instructions were very simple to decode. The register numbers needed for
an operation could be found in the same place within most instructions.
Having a small number of instructions also meant that fewer bits were
required to encode the operation.
• The processor contained no microcode. One of the factors that complicated
processor design was the use of microcode, which was a type of “software”
or commands within a processor that controlled the way data moved internally.
A simple instruction like MUL (multiply) could consist of dozens of
lines of microcode to make the processor fetch data from registers, move
this data through adders and logic, and then finally move the product into
the correct register or memory location. This type of design allowed fairly
complicated instructions to be created—a VAX instruction called POLY,
for example, would compute the value of an nth-degree polynomial for an
argument x, given the location of the coefficients in memory and a degree
n. While POLY performed the work of many instructions, it only appeared
as one instruction in the program code.
• It would be easier to validate these simpler machines. With each new generation
of processor, features were always added for performance, but that
only complicated the design. CISC architectures became very difficult to
debug and validate so that manufacturers could sell them with a high degree
of confidence that they worked as specified.
• The processor would access data from external memory with explicit
instructions—Load and Store. All other data operations, such as adds, subtracts,
and logical operations, used only registers on the processor. This differed
from CISC architectures where you were allowed to tell the processor
to fetch data from memory, do something to it, and then write it back to
memory using only a single instruction. This was convenient for the programmer, and especially useful to compilers, but arduous for the processor designer. A short sketch contrasting the two styles appears just after this list.
• For a typical application, the processor would execute more code. Program size was expected to increase because operations that had been single, complicated instructions on older architectures required several simpler RISC instructions to complete the same task. In
simulations using small programs, for example, the code size for the first
Berkeley RISC architecture was around 30% larger than the code compiled
for a VAX 11/780. The novel idea of a RISC architecture was that
by making the operations simpler, you could increase the processor frequency
to compensate for the growth in the instruction count. Although
there were more instructions to execute, they could be completed more
quickly.
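To make the load/store idea (and the earlier point about building a multiply from shifts and adds) concrete, here is a minimal sketch in the ARM assembly language covered later in the book. The label value, the choice of registers, and the multiply-by-10 operation are assumptions made up purely for illustration:

      LDR   r1, =value           ; put the ADDRESS of the data in r1 (value is a made-up label)
      LDR   r0, [r1]             ; explicit Load: bring the data itself into a register
      ADD   r0, r0, r0, LSL #2   ; r0 = r0 + (r0 << 2) = 5 * r0, built from a shift and an add
      MOV   r0, r0, LSL #1       ; r0 = 2 * r0, giving a multiply by 10 with no MUL at all
      STR   r0, [r1]             ; explicit Store: write the result back to memory

Every memory access is spelled out as a Load or a Store, and all of the arithmetic happens in registers; a memory-to-memory CISC machine could fold much of this into a single instruction that operates directly on the memory location.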
Turn the clock ahead 33 years, and these same ideas live on in nearly all modern
processor designs. But as with all commercial endeavors, there were good RISC
machines that never survived. Some of the more ephemeral designs included DEC’s
Alpha, which was regarded as cutting-edge in its time; the 29000 family from AMD;
and Motorola’s 88000 family, which never did well in industry despite being a fairly
powerful design. The acronym RISC has definitely outgrown its own moniker: the original idea of a Reduced Instruction Set, or removing complicated instructions from a processor, has been buried underneath a mountain of new, albeit useful, instructions. All manufacturers of RISC microprocessors are guilty of
doing this. More and more operations are added with each new generation of processor
to support the demanding algorithms used in modern equipment. This is referred
to as “feature creep” in the industry. So while most of the RISC characteristics found
in early processors are still around, one only has to compare the original Berkeley
RISC-1 instruction set (31 instructions) or the second ARM processor (46 operations)
with a modern ARM processor (several hundred instructions) to see that the
“R” in RISC is somewhat antiquated. With the introduction of Thumb-2, to be discussed
throughout the book, even the idea of a fixed-length instruction set has gone
out the window!
1.2.1 ARM Begins
The history of ARM Holdings PLC starts with a now-defunct company called
Acorn Computers, which produced desktop PCs for a number of years, adopted primarily by the educational market in the UK. A plan for the successor to the
popular BBC Micro, as it was known, included adding a second processor alongside
its 6502 microprocessor via an interface called the “Tube”. While developing an entirely new machine, to be called the Acorn Business Computer, the team considered existing architectures such as the Motorola 68000; but rather than continue to use the 6502 microprocessor, Acorn decided to design its own processor. Steve
Furber, who holds the position of ICL Professor of Computer Engineering at the
University of Manchester, and Sophie Wilson, who wrote the original instruction
set, began working within the Acorn design team in October 1983, with VLSI
Technology (bought later by Philips Semiconductor, now called NXP) as the silicon
partner who produced the first samples. The ARM1 arrived back from the fab
on April 26, 1985, using fewer than 25,000 transistors, which by today’s standards
would be fewer than the number found in a good integer multiplier. It’s worth noting
that the part worked the first time and executed code the day it arrived, which
in that time frame was quite extraordinary. Unless you’ve lived through the evolution
of computing, it’s also rather important to put another metric into context,
lest it be overlooked—processor speed. While today’s desktop processors routinely
run between 2 and 3.9 GHz in something like a 22 nanometer process, embedded
processors typically run anywhere from 50 MHz to about 1 GHz, partly for power
considerations. The original ARM1 was designed to run at 4 MHz (note that this is
three orders of magnitude slower) in a 3 micron process! Subsequent revisions to the
architecture produced the ARM2, as shown in Figure 1.3. While the processor still
had no caches (on-chip, localized memory) or memory management unit (MMU),
multiply and multiply-accumulate instructions were added to increase performance,
along with a coprocessor interface for use with an external floating-point accelerator.
More registers for handling interrupts were added to the architecture, and one
of the effective address types was actually removed. This microprocessor achieved
a typical clock speed of 12 MHz in a 2 micron process. Acorn used the device in
the new Archimedes desktop PC, and VLSI Technology sold the device (called the
VL86C010) as part of a processor chip set that also included a memory controller,
a video controller, and an I/O controller.
Registers
The register is the most fundamental storage area on the chip. You can put most
anything you like in one—data values, such as a timer value, a counter, or a coefficient
for an FIR filter; or addresses, such as the address of a list, a table, or a stack in
memory. Some registers are used for specific purposes. The ARM7TDMI processor
has a total of 37 registers, shown in Figure 2.2. They include
• 30 general-purpose registers, i.e., registers which can hold any value
• 6 status registers
• A Program Counter register
The general-purpose registers are 32 bits wide, and are named r0, r1, etc. The
registers are arranged in partially overlapping banks, meaning that you as a programmer
see a different register bank for each processor mode. This is a source of
confusion sometimes, but it shouldn’t be. At any one time, 15 general-purpose registers
(r0 to r14), one or two status registers, and the Program Counter (PC or r15) are
visible. You always call the registers the same thing, but depending on which mode
you are in, you are simply looking at different registers. Looking at Figure 2.2, you
can see that in User/System mode, you have registers r0 to r14, a Program Counter,
and a Current Program Status Register (CPSR) available to you. If the processor
were to suddenly change to Abort mode for whatever reason, it would swap, or bank
out, registers r13 and r14 with different r13 and r14 registers. Notice that the largest
number of registers swapped occurs when the processor changes to FIQ mode. The
reason becomes apparent when you consider what the processor is trying to do very
quickly: save the state of the machine. During an interrupt, it is normally necessary to drop everything you’re doing and begin to work on one task: saving the state of the machine and transitioning quickly to the interrupt-handling code. Rather
than moving data from all the registers on the processor to external memory, the
machine simply swaps certain registers with new ones to allow the programmer
access to fresh registers. This may seem a bit unusual until we come to the chapter
on exception handling. The banked registers are shaded in the diagram.
While most of the registers can be used for any purpose, there are a few registers
that are normally reserved for special uses. Register r13 (the stack pointer or SP)
holds the address of the stack in memory, and a unique stack pointer exists in each
mode (except System mode, which shares the User mode stack pointer). We’ll examine this register in more detail later.
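As a small, purely illustrative sketch of these ideas, the fragment below puts a data value in one register, an address in another, and uses r13 as the stack pointer to preserve working registers; the label table, the constant, and the particular registers are assumptions made up for this example:

      MOV    r0, #100            ; a data value, e.g., a counter, in a general-purpose register
      LDR    r1, =table          ; the address of a table in memory (table is a made-up label)
      LDR    r2, [r1]            ; use that address to fetch the first entry of the table
      STMFD  r13!, {r0-r2, r14}  ; push working registers and the link register onto the stack
      ; ... code here is free to reuse r0-r2 ...
      LDMFD  r13!, {r0-r2, r14}  ; pop them back off when finished

Whichever mode the processor happens to be in, the code refers to the same names r0 to r15; the banking described above simply determines which physical registers those names reach.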