Welcome to the lecture 2. Let us just go briefly through what we did
in the last lecture. We are mainly concerned with what might be called high level programming
languages of which there are three kinds. Imperative and functional programming languages
are the most important. Logic also can be used as a language but our main concern is
with the imperative and functional languages. They are also similar in certain respects.
Imperative languages might be called state-based languages where state updation is the main
action. However, functional languages are really value based languages much closer to
our mathematical languages. The notion of variables in functional languages is very
much like the notion of variables in mathematics whereas the notion of variables in the imperative
languages is more like quantities in physics which can change over time such as acceleration
and velocity. State based languages means that there is a state change along a time
axis. Unlike physics we are not talking of continuous time. We are talking of discrete
time in the case of functional languages. The notion of the variable is really a variable
in mathematics which means that a variable is just a name for a value and it cannot change
in time during the execution of a program.
If we look at the history of languages, most of the work on high level languages is really
concentrated on the imperative languages. There are hundreds and hundreds of programming
languages. So firstly, it is impossible to study all of them. If you look at history,
you will find that a large portion of the time during the fifties and sixties was concentrated
on what might be called the basic features. This means that these represent the basic
control structures in the basic data structures in order to obtain clean readable programs
efficiently. Later once they were fixed in the 70s and
80s, most of the exploration of programming languages was in terms of what might be called
features. If you were to take Modula, which is just
an extension of Pascal, its most important feature is that of a module.
If you were to take Ada, it combines the module features of Modula and adds more features
like concurrency as an important feature like in exception-handling, generics or polymorphism.
Then you have CLU, which is a module based language very much like Modula but with a
different syntax. But they are all extensions of Pascal in the sense that the basic control
structure remains the same and then you extend the language.
The basic control structures in these languages could be different too. They are not very
similar except that these arrow marks denote the decendency in terms of similarity of even
the basic control structures. Even though the Smalltalk-80 control structures
or syntax are different from that of Simula, the basic extensional feature comes from a
new feature of Simula which was extended and that was the notion of class or objects.
Most of the languages in the early 60s were sequential, whether concurrent or modular
distributed or parallel etc.
Nowadays by and large the exploration of most of the languages in the early 60s is mostly
in terms of new features. I have listed the various kinds of features and in the current
state of art, large amount of work is into trying to make them more efficient, more readable,
more comprehendible etc. The basic functional languages also have this feature that the
early functional languages were all an exploration of. They were very basic data structures and
control structures. Languages like ML and CAML actually signify the addition of new
features.
In fact the syntax of ML expressions etc. is much closer to that of Pascal than that
of LISP. But philosophically, it is a more LISP like language because it is an applicative
or functional language. It is not a state based language and it goes far beyond LISP
in the sense that there are new features like the introduction of modules, introduction
of exceptional handling, the introduction of very powerful data abstraction mechanisms
and a type checking. LISP has no type checking at all.
Let us first study the basic features of languages and look at the issues in language design.
Consider the kinds of issues you would really have to study if you were to design a language.
One very major issue that the implementation of the language Pascal has taught us is that
it is a good idea for a language to have a simple, clear and a small set of unified primitives
for expressing all your algorithms and data structures.
One of the most convenient characteristics about Pascal or in general about what are
known as Algol 60 like languages is that most algorithms are written in some crude variations
of the dialect of Pascal or Algol system. So, the Pascal is a small language easily
learnable which is why most people are taught Pascal initially. One issue could be that
you should start with a small language with a small set of primitive operations over which
you can build.
The next issue would be that you should have an absolutely clear syntax. There should be
no ambiguity and you should be able to get highly readable programs.
It is very important to have readability meaning that we have to be able to read the source
code of the program like a book and a very good reason for that is that no piece of software
that you write is ever permanently fixed. Most pieces of software have bugs in them.
Bugs might be detected years and years after their software has been commissioned. Somebody
else should be able to read the source code and modify it. In order to be able to read
the source code, one should be able to understand the algorithms of the source code contained.
In fact consideration of efficiency comes much later. Readability is the most important
aspect because it includes maintainability of the software. Normally, those who have
written this software are not likely to be present to maintain the software. Software
in general has to be maintained by somebody else, which means that the source code must
be readable and that comes first. The second point is that over the years the software
actually has more and more users who feel their need for its extension which is why
you have to be able to extend the software by adding new features or by adding more conveniences
for specific reasons.
Part of the maintainability of a piece of software is not just the detection and correction
of bugs but also the extensibility of the software as years go by and more and more
needs are felt. It is necessary for somebody who is usually not the original programmer
or the original team that wrote the software to be able to extend it. For all that, readability
is most important and secondly the language should provide what is known as a support
for abstraction.
You are already aware of the basic abstractions. Control abstractions are procedures, functions
in Pascal, loop statements etc. There are also data abstractions. The most primitive
kinds of data abstraction that Pascal provides are the record structures and the arrays which
almost all languages including the early languages have.
Arrays are one data abstraction. You think of a whole sequence of elements as a single
logical unit like records and variant records. Over time it was felt that you should have
data abstraction which allows you to not only take sets of data together as a single unit
but also to operate on them as a single unit.
There are combinations of operations and data abstractions on a module basis or further
abstractions for which you might require languages like ML and Ada that provide the notion of
generality or polymorphism where you can use types themselves as variables and change types
and instantiate the same kinds of algorithms. For example; in stacks, it does not matter
whether you are talking of a stack of integers, a stack of characters, a stack of reals or
a stack of some complicated data element. It may be a stack of some record or stacks
of arrays. The basic operation on stacks is like pop
push, checking for emptiness etc. They should all be available in one form and it should
not be necessary for us to repeat the code depending on the type of the element. An abstraction
should be one where the basic element of the stack itself is a variable which can be instantiated
to a different type, each time the same piece of code is carefully written, verified, or
thoroughly tested and the type instantiated. We get different kinds of stacks with the
same kinds of operations which is why the support for abstraction is an important modern
language design issue. The other kinds of issues that have come up are that you should
be able to verify your programs. So, the language should also provide support for verification,
provability of programs (not necessarily machine based provability but possibly hand based
provability) or a mixture or a user interactive provability of programs. Portability was felt
even in the 60s where a lot of time and effort was expended in the case of FORTRAN and COBOL
compilers.
Nowadays portability means that the language design should be oriented towards an end user.
It is not architecture or machine-specific simply because a certain machine has a certain
kind of assembly instruction. It does not mean that you make that available in the language.
Other machines may not have that. You should be able to provide as far as possible an abstract
form which is not related to the machine that uses only the basic instruction set available
in all machines. If you ensure that your language is really architecture or machine-independent
in the sense that your main concern is one of convenience of the user or the abstractions
required for the user, then they are not specific to particular machine instruction sets of
a particular architecture. You might have register based architectures or stack based
architectures. If you can design the language in such a fashion that it is not architecture-specific
or machine-specific then you can move the entire language to another machine with a
different architecture and with the minimum amount of effort.
There are certain machine specific details which still have to be changed when I move
an entire language implementation from one machine or architecture to another. But the
whole idea of the language design or the design of its implementation should be that the amount
of changes to be made should be minimal. Then there are considerations that include the
ability to easily implement it. You cannot compromise everything for the sake of portability.
Ease of implementation and the availability of ready algorithms for implementing the language
should be considerations. Ease of implementation was perhaps the most important reason for
this and the fact that it used very low level primitives is perhaps the most important reason
for the success of the C language. Efficiency is another important reason. C programs run
very fast and they are very efficient.
If you have a language and you want it to be generally acceptable then it should have
a clear definition of what its constructs do. The reason being that if it has to be
widely acceptable there might be many different implementers trying to implement it on various
machines and only if there is a common clear syntax, common clear semantics or clear specifications
of the effects of each language construct, you can expect to get wide applicability.
We will get to the notion of semantics later. Of course it has to be run time efficient.
By run time efficient I mean that the programs have to run efficiently. This is the efficiency
of the implementation which implies very often compile time efficiency. This means how fast
you can compile programs written in the language. Whereas this runtime efficiency implies that
it should have an excellent runtime support and the program should run as fast as the
programs written in the language. Ease of maintenance is about the maintenance
of programs as well as the maintenance of the language by an implementation of the language.
There might be bugs in the language implementation. Fast compilation translation and support for
extensibility are in fact the most important reasons why Pascal is used as the basic language
from which you add new features to get newer languages. This is one controversial feature,
which is the support for subsets. Most books on programming languages specified this feature.
It is that every language should support subsets of the language in the sense that it should
be necessary for you to use only a smaller set of operations or features of the language
than is actually necessary. It should be possible to divide up the language so that there is
a small kernel and large sorts of extensions. All programs written for the small kernel
run on all machines regardless of the subset supported. However, the language of Ada for
example in the 80s has clearly specified that there should be no support for subsets and
their reason is that it would then affect the portability of programs written in the
language.
Support for subsets is a controversial aspect. It is not at all clear whether it is desirable
especially in the most important language for embedded systems.
They are real time systems that control sensors, various kinds of hardware, ballistic missiles
etc. It could affect their portability because you would have a feature which is used in
a specific implementation and then you move the program to another implementation which
does not support that feature and your programs do not run.
The portability of actually programs written in the language gets affected if you allow
arbitrary subsets. If you look at all these features, there are hundred and hundreds of
programming languages and it is impossible to study all of them and moreover it is not
necessary to do so. Therefore we would like to divide up the study of programming languages
into a few small parts.
What does a theory of programming languages contain? A general theory of programming languages
is based on three aspects. What might be the most important aspect or the most basic aspect
is known as the syntax of the language. A programming language is really like a very
highly simplified natural language; it has a certain syntax which implies that it has
a grammar. Certain elements of a sentence can occur only in certain ways. You cannot
arbitrarily form sentences of the language. As I said, a program written in that language
is the sentence of the language like in a natural language. You take a sentence of a
natural language and you find it has various parts.
For example; in almost all natural languages any full sentence should have a predicate.
A syntactic category called predicate in general may not be words; a predicate could be a clause
or a phrase and so it might have a subject and an object.
Let us take a sentence in natural language. Supposing I say 'run'; that is a complete
sentence because it has a predicate. No complete sentence in the natural language is without
a predicate. So 'run' is a complete sentence because it has a predicate. Optionally, it
might have a subject and may be also an object clause. Subject clauses have to be of a grammatical
form called subject phrases or may be noun phrases which means they might have nouns
qualified by adjectives or an article etc.
You can divide up any grammatically correct sentence in a natural language into firstly
various clauses and each clause into various phrases and all these phrases have certain
grammatical properties in a very much similar manner in a greatly simplified form that might
be called the parts of speech of a program. Every programming language has a certain grammar,
which specifies various parts of speech. It has a certain vocabulary and it is possible
to take a sentence of this programming language and parse it. We will get into the meaning
of these points more clearly later where artificial languages have similarities with natural languages;
(a lot of the theory of the syntax was actually inspired by natural languages) where the construction
of artificial languages did not have a lot of problems in the natural language.
The next aspect is 'semantics', very often called the semantics of a programming language,
which is really specified by its reference manual.
In your study of Pascal programming one of the important references would be the ISO
standard Pascal reference manual by Janson Edward. That reference manual really specifies
for each syntactic entity of the language the effect to be expected by executing that
syntactic entity which is the meaning associated for each language construct.
So, we have to look at the notion of meanings. Unlike natural language the notion of meaning
can be specified mathematically in a machine-independent fashion.
When we are talking about the semantics of this programming language, we are talking
about a pure meaning associated with the language regardless of any machine on which the language
is implemented. In general we are talking about meanings in an abstract setting in the
sense that you assume for practical purposes that you have no restrictions on the word
lengths, no restrictions on memory and you have no restrictions on computational power
except that you can only do a finite number of operations at any instance.
The programming language itself can be thought of as a mathematical entity quite independent
from all its implementations and in general when you think of it that way then you are
thinking of some kind of an ideal machine which has no restrictions except one that
at any instant of time only a finite number of operations can be performed.
You cannot do an infinite number of operations at any instant of time. If you consider such
an ideal machine then you are really looking at the programming language as a mathematical
entity in some ideal environment and the meanings of the constructs of the language in that
ideal environment, form, what might be called the actual reference manual.
If you were to go through the reference manual of Pascal, most of the reference manual is
independent of any machine. There are certain specific machine paragraphs or what are known
as implementation dependent features. But if you look at the language itself, it is
capable of looking at the language as an entity devoid of any machine.
In general the semantics would follow the syntax. The syntax has a certain structure.
The syntax has got some basic elements and some compound forming operations. They are
connectives which allow you to form compound sentences from simpler ones.
The meanings in general should be such that you give the meanings of the basic elements
and then you give the meanings of the connectives in terms of the basic elements.
A programming language is really a finitary object which allows you the construction of
the infinite set of programs.
It is not in general possible to predict the behavior of the language of a program written
in the language unless you have this basic discipline where you express the effects of
connectives in terms of the effects of the basic elements of the language.
It is not in general possible to predict the behavior of the language of a program written
in the language unless you follow this discipline.
The basic feature of any kind of semantics is that it should allow for the derivation
of the meanings of an infinite number of programs which means the only way to derive the meaning
of the program should be from the meanings of its finitary elements.
You have certain basic elements from which complex elements are formed. So, the meaning
of the complex element should be derivable in terms of the meanings of the basic elements
and the meanings associated with the connectives that formed a complex element from the basic
elements. It means that semantics is going to be intimately related to syntax.
The syntax is really a finitary way of specifying an infinite set of allowable objects.
The infinite set of allowable objects are the programs of the language or the sentences
of the language and the syntax gives you a finitary mechanism for specifying all possible
allowable programs very much like a set theory notation, which allows you to give a finitary
specification for an infinite set. What is really analyzable for an arbitrary program
is its structure in terms of its syntax. The meaning of the program has to be derived
from its finitary specification to be expressed in terms of its finitary specification.
Lastly, do not worry about machine constraints or about architecture because the programming
language has to be portable. Do not worry about word lengths or about limits and also
do not worry about memory constraints. Assume infinite amount of memory available
so that you get into what are known as pragmatic considerations. Most of these implementation
dependant features are really pragmatics, for example; the Pascal reference manual does
not tell you how to associate a disc file with a file variable inside the program. That
is one example of an implementation dependant feature which will vary depending upon the
operating system interface you have. Then there are various simple implementations.
The pragmatics firstly involves really all the algorithms that are going to be used for
the implementation of the language and all kinds of machine and architectural constraints
for example the MAXINT. The maximum integer allowable in a Pascal program is a typically
implementation dependant feature because it really depends upon word length or byte length
or on whether it uses two bytes for representing integers etc.
The value of the MAXINT can vary from machine to machine. The amount of memory that is available
for a program can vary whether it is a stack based machine or a register based machine.
All those are implementation dependent features.
Normally, the way to make the language portable even in our implementation is to separate
the basic algorithms of implementation from the architectural specific nature of the implementation.
When we read through a small compiler you will see this happening. So, there are certain
basic algorithms which are really machine-independent and then there are some machine dependant
features and the actual code.
Then you have the OS interface, the nature of the input and output that is file-based,
terminal based, sensor based, signal based, etc. The OS interface also includes the file
server, how the language has to interact with the file server and how the language has to
interact in general with the directory service of the machine etc.
Lastly, what is to be done about errors introduced by users in their programs? Errors could be of a syntactic nature or of
runtime nature and we need to know what to do about them. There is a blanket policy where
you could just abort but you know that does not help anybody in particular. What is the
nature of error reporting? Is there some error correction that can be
done? Is there some way of recovering from errors so that as soon as the first error
comes, you just throw out the program? - No, if you can you should at least be able to
point out all the errors in the program or all possible errors so that the user gets
to know all the errors at one time. It reduces the amount of compilation effort and similarly
you cannot do that at run time but at least there should be some decent error reporting
mechanism.
But error is a very dicey subject and dicey policy. Different languages have taken different
attitudes and different implementations take different attitudes towards it. So, it is
a very pragmatic feature. I will just briefly go through the notion of syntax. There are
three issues and it is better to study all three issues separately. The semantic and
pragmatic issues will closely depend upon the syntax and it is preferable that they
depend upon the syntax. Let us look at what constitutes syntax.
Syntax has to do with a form or a physical representation of possibly an abstract object.
If you look at numbers, they really are not very physical. At least the 20th century attitude
towards numbers is that they are a conception of the mind and are not physical.
They do have a physical representation in the form of numerals. What you write out and
think of as a number is actually a numeral so while you have various ways of representing
the same number let us say the number 126 you can have what is known as the positional
representation which is what we normally use.
In the basic form all these three representations of the number 126 are really the same.
The roman representation differs from the DEVNAGRI representation in the sense that
the characters used are different. What might be called the basic alphabet for the positional
representation of numbers is different in the two.
It is just that the character set is different but that is incidental. It is a positional
representation but in basic forms all these are unified by the fact that they are all positional representations.
You go in units: 16s...64s... Then you have the roman numerals which is non positional.
Firstly, it has a different alphabet. You could represent the same number 126 with different
character sets but there is something unifying about this. There is some fundamental difference
about this representation.
If you disregard the change in character set what makes these two classes different is the grammar
of the numeral of the language used for representing them.
In both these cases the grammar is exactly identical, the character sets are different
and by and large the grammar of all these three is the same except for the character
sets but the grammar is different in the sense the form of representation is fundamentally
different. How you represent compound forms from simpler forms is different between the
roman and Arabic case. What that grammar is, is what we have to study.
Let us look at this in a more general setting of the programming language.
We might think of every programming language as having a vocabulary. A vocabulary is a
complete dictionary of words of the language. A word of a language is formed from a fixed
character set and you identify certain strings of characters as allowable words as part of
the vocabulary of the language. A complete dictionary of the language is what states
that vocabulary. If you take languages, for example natural languages like Konkani or
Sindhi they have different character sets but the same collection of words.
Sindhi for example is written by different people. Some people write in Devanagari, some
people write it in the Urdu script or in the Arabic script.
But the collection of the words is the same, so, a person who knows Devanagari can communicate
with the person who does not know Devanagari but knows the Urdu script by speech because
the words are the same although you cannot communicate by letters.
There is a certain fixed collection of words whose actual form might depend on the character
set. But given the words there are formation rules. Given a vocabulary there are ways of
combining words of the language to form sentences of the language and there is a finite set
of formation rules called productions which allow you to generate all possible sentences
in the language.
Let us quickly go through one example. The character set really depends upon the kind
of codes you use. Nowadays most of the time we use ASCII codes but then we have 8 bit
ASCII's and PC's and 7 bit ASCII's on the main frame.
The character sets are different. There are all these kinds of differences but let us
disregard them and look at what constitutes grammar.
I would say grammar is really a four tuple of objects where there is a set N which is
called the set of non terminals and this set of non terminals really specifies various
kinds of grammatical categories of the language like the parts of speech, noun phrase, verb
phrase, adjectival phrase, noun clause subject clauses, subject phrases, object clauses,
predicates etc. This set N consists of the basic grammatical
categories and all these sets are finite. Then the terminal T is the set of what are
known as terminal symbols or terminal words and T is the complete vocabulary of the language.
P is the collection of formation rules or what are known as productions and S is what
is called as the start symbol but S really represents a grammatical category in N called
as sentence. So, the basic element of the language is what
constitutes a sentence.
Here is a simple grammar specifying Boolean expressions. There is a start symbol S,
(every grammar should have a start symbol S) and some grammatical categories. The grammatical
categories that I have chosen are A, V and C which actually stand for an odd Boolean
expression or a complement expression.
The vocabulary of the language consists of all possible Boolean variables that we might
take including the two left and right parentheses and the three connectors and 'not' and 'or'.
Any Boolean variable is a Boolean expression meaning you take any b belonging to the set
of Boolean variables. The sentences of the language are Boolean expressions that are
more than fully parenthesized Boolean expressions.
Any sentence is either a complement of a Boolean expression or of the 2 Boolean expressions
'and' and 'or'. If an 'and' expression is one form which consists of two Boolean expressions
enclosed in parenthesis and separated by the word 'and', an 'or' Boolean expression similarly
is two Boolean expressions enclosed in parenthesis and separated by an 'or' and it is similarly
the case with 'not'. These productions are really replacement rules
so whenever you find an S you can replace it either by an A, a V or a C or one of the
Boolean variables. Whenever you find an A you have to replace
it by this. There is no other way you can replace A by anything else. Similarly, whenever
you see a C you have to replace it and so here is a simple sentence generation.
You start from S and one of the possibilities is that you can replace S by C. Whatever I
am replacing, I have circled in orange. C has to be replaced and now I am replacing
S leaving everything else intact. I have chosen here to replace S by A and once I have got
an A there, the only possibility is to replace it by this form S. I have chosen to replace
this S rather than this S first and I have chosen to replace it by a Boolean variable.
Let us assume that there are only two variables b1 and b2 and I proceed. Lastly, I get a complete
sentence of the language which consists of only the terminal symbols which is generated
by the grammar.
We talk of a language as being generated from a grammar as a set of all possible sentences
that may be generated from the start symbol S.
Important warnings are that the set of non terminals and the set of terminal symbols
are disjoint and a production is really a replacement. It replaces a non terminal by
a string consisting of terminals or non terminals and a sentence is just a string of terminal
symbols generated from the start symbol.