Building a Lexical Analyzer in Java

Building Your Own Programming Language: Part 1

If you are a computer geek like me, building your own programming language is probably on your bucket list. In this post I will go through the first steps in building a programming language.

Interpreter or Compiler

The first hurdle that we need to overcome in our quest to build a programming language is to decide whether we would like the language to be interpreted or compiled. What’s the difference? An interpreter is a program that takes in source code and executes it directly (it performs the actions itself). A compiler is a program that takes in source code and translates it into machine language. These are very basic definitions of the two concepts. The way I truly understood the difference was by realizing that anyone building a compiler needs to understand machine language, algorithms, and a lot of very complex computer science, whereas someone building an interpreter just needs to know how to program. Is the difference clearer? For my language I decided to build an interpreter, because I did not care about speed and optimization; I just wanted to have a working programming language. I will probably build a compiler in the future.
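To make the difference concrete, here is a toy sketch (the method names interpret and compile are purely illustrative, not part of any real project): given the expression 1 + 2, an interpreter produces the value right away, while a compiler produces an equivalent program in another language that computes it later.

// Purely illustrative sketch of the difference for the source "1 + 2".
// An interpreter executes the source itself and produces a value.
static int interpret() {
    return 1 + 2;                       // performs the addition itself: 3
}

// A compiler translates the source into another language instead.
static String compile() {
    // e.g. emit assembly-like code that performs the same addition
    return "mov eax, 1\nadd eax, 2";
}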

Parts of an Interpreter

  • Lexical Analyzer (Tokenizer)
  • Parser
  • Evaluator
  • Printer

I chose to build my interpreter in Java because it is the language I am most comfortable with, but you can apply these concepts to any language.
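Wired together, these four parts form a pipeline. Here is a rough sketch of what the top level of the interpreter could look like; the Parser, Evaluator, and Printer classes are placeholders for parts we have not built yet, so only the first two calls will exist by the end of this post.

// Hypothetical top-level pipeline; only readFile and lexer exist so far.
public static void main(String[] args) {
    String source = readFile(args[0]);           // read the source code
    ArrayList<Token> tokens = lexer(source);     // 1. lexical analyzer
    // Node ast = Parser.parse(tokens);          // 2. parser (future post)
    // Object result = Evaluator.eval(ast);      // 3. evaluator (future post)
    // Printer.print(result);                    // 4. printer (future post)
}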

Lexical Analyzer

The lexical analyzer takes the source code and turns it into tokens. For example:

[BinaryOperator, /]

These tokens are what we feed into our parser. Think about your favorite programming language (in my case, Java): every word or character of significance is a token. Take a look at the code below.

int num = 10;
boolean name = true;

We would want our lexer (short for lexical analyzer) to create a token for each important word or character so that our parser can understand them. The exact implementation is up to the developer and how they want the workflow of their programming language to be. Looking at the Java code above, here is an example of the tokens we would want our lexer to produce:

[Keyword, int]
[Variable, num]
[Assign, =]
[Number, 10]
[Semicolon, ;]
[Keyword, boolean]
[Variable, name]
[Assign, =]
[BooleanLiteral, true]
[Semicolon, ;]

Obviously there is more than one way to create a lexer, so your tokens might look slightly different, but in the end they should yield the same result.
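For instance, instead of regular expressions (the approach I take below), you could hand-roll a loop that scans the source one character at a time. Here is a minimal sketch of that style, handling just numbers and semicolons:

// Hand-rolled alternative: scan the source one character at a time.
static void scan(String source) {
    int i = 0;
    while (i < source.length()) {
        char c = source.charAt(i);
        if (Character.isDigit(c)) {
            int start = i;
            while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
            System.out.println("[Number, " + source.substring(start, i) + "]");
        } else if (c == ';') {
            System.out.println("[Semicolon, ;]");
            i++;
        } else {
            i++;   // skip anything else (whitespace, unhandled characters)
        }
    }
}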

My Lexical Analyzer: readFile

It took me about two days to build the lexer. The first thing I had to figure out was how to read a file. Java has several ways to read files; I chose a FileReader wrapped in a BufferedReader, because it let me read the file line by line and collect the whole thing into a single string. Here is my example:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileNotFoundException;
import java.io.IOException;

private static String readFile(String fileName) {
    StringBuilder result = new StringBuilder();

    // Try-with-resources closes the reader for us, even if an exception is thrown.
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(fileName))) {
        // FileReader reads text files in the default encoding;
        // always wrap FileReader in a BufferedReader.
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            result.append(line).append("\n");
        }
    } catch (FileNotFoundException ex) {
        System.out.println("Unable to open file '" + fileName + "'");
    } catch (IOException ex) {
        System.out.println("Error reading file '" + fileName + "'");
        // Or we could just do this:
        // ex.printStackTrace();
    }
    return result.toString();
}

I created a method named readFile that takes the name of the file containing the source code of the programming language and returns a string with all of its content. In this post I do not go through creating the rules and syntax for the programming language, because anyone trying to build their own language should work that out for themselves.
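Using it is a one-liner; assuming the source code lives in a file called program.txt (just a placeholder name), the whole file comes back as a single string:

// Example usage; "program.txt" is only a placeholder file name.
String source = readFile("program.txt");
System.out.println(source);   // prints the entire contents of the file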

My Lexical Analyzer: Tokens

Once we have read the file, the next step is to create the lexer that will produce the tokens for the words and characters. I decided to create a token class. Here is my example:

public class Token {
    protected TokenType type;
    protected String value;

    public Token(TokenType type, String value) {
        this.type = type;
        this.value = value;
    }

    public String toString() {
        return String.format("(%s %s)", this.type.name(), this.value);
    }

    public TokenType getTokenType() {
        return this.type;
    }

    public void setTokenType(TokenType type) {
        this.type = type;
    }

    public String getValue() {
        return this.value;
    }

    public void setValue(String value) {
        this.value = value;
    }

    // Each token type carries the regular expression that matches it.
    public enum TokenType {
        PRINT("(print)"),
        ASSIGN("[ \t\f\r\n](=)"),          // an '=' preceded by whitespace
        NAME("([a-zA-Z_]+)"),
        BOOLOP("(True)|(False)"),
        LOGICALOP("(&&)|(\\|\\|)|(!)"),
        RELATIONOP("(!=)|(>=)|(<=)|[=<>]"),
        NUMBER("-?[0-9]+"),
        BINARYOP("[-+*/%]"),
        WHITESPACE("[ \t\f\r\n]+"),
        COLON("[:]"),
        SEMICOLON("[;]");

        public final String pattern;

        private TokenType(String pattern) {
            this.pattern = pattern;
        }
    }
}

My Token class has a constructor that takes two parameters: a TokenType and a String value. The TokenType is an enum that I created inside the Token class. Like I said previously, every word or character that we need our parser to understand must be tokenized. The TokenType enum has every token type I need, and since the number and kinds of tokens will not change, Java's enum is a good fit. For those of you who have never used an enum before, it lets you define a fixed set of named constants; the TokenType enum will never hold any values other than the ones I defined. The get and set methods are there so that the fields can be accessed from outside the class. If you haven't noticed yet, the TokenTypes are defined using regular expressions. You could certainly define your token types without regular expressions; I just found that regular expressions were much easier than any other method.
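As a quick sanity check, here is how a token can be built by hand and how one of the enum patterns can be tested on its own (a small sketch, not part of the lexer itself):

// Creating a token by hand; toString() gives us "(NUMBER 10)".
Token token = new Token(Token.TokenType.NUMBER, "10");
System.out.println(token);

// Each TokenType carries its own regex, so it can be matched directly.
System.out.println("10".matches(Token.TokenType.NUMBER.pattern));   // true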

My Lexical Analyzer: Lexer

Now that we have made the skeletons for the tokens, we need to create a way to identify and export them. Remember the readFile method we created? Go back to that file, and we will create another method called lexer. This method will take a String input and return an ArrayList of Tokens. Here is my example:

import java.util.*;
import java.util.regex.*;
import java.io.*;

public static ArrayList<Token> lexer(String input) {
    // Tokens to return
    ArrayList<Token> tokens = new ArrayList<Token>();

    // Combine every TokenType pattern into one regex, each in its own named group
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (Token.TokenType tokenType : Token.TokenType.values())
        tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
    Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

    // Begin matching tokens
    Matcher matcher = tokenPatterns.matcher(input);
    while (matcher.find()) {
        if (matcher.group(Token.TokenType.NUMBER.name()) != null) {
            tokens.add(new Token(Token.TokenType.NUMBER, matcher.group(Token.TokenType.NUMBER.name())));
            continue;
        } else if (matcher.group(Token.TokenType.BINARYOP.name()) != null) {
            tokens.add(new Token(Token.TokenType.BINARYOP, matcher.group(Token.TokenType.BINARYOP.name())));
            continue;
        } else if (matcher.group(Token.TokenType.RELATIONOP.name()) != null) {
            tokens.add(new Token(Token.TokenType.RELATIONOP, matcher.group(Token.TokenType.RELATIONOP.name())));
            continue;
        } else if (matcher.group(Token.TokenType.LOGICALOP.name()) != null) {
            tokens.add(new Token(Token.TokenType.LOGICALOP, matcher.group(Token.TokenType.LOGICALOP.name())));
            continue;
        } else if (matcher.group(Token.TokenType.BOOLOP.name()) != null) {
            tokens.add(new Token(Token.TokenType.BOOLOP, matcher.group(Token.TokenType.BOOLOP.name())));
            continue;
        } else if (matcher.group(Token.TokenType.ASSIGN.name()) != null) {
            tokens.add(new Token(Token.TokenType.ASSIGN, matcher.group(Token.TokenType.ASSIGN.name())));
            continue;
        } else if (matcher.group(Token.TokenType.PRINT.name()) != null) {
            tokens.add(new Token(Token.TokenType.PRINT, matcher.group(Token.TokenType.PRINT.name())));
            continue;
        } else if (matcher.group(Token.TokenType.SEMICOLON.name()) != null) {
            tokens.add(new Token(Token.TokenType.SEMICOLON, matcher.group(Token.TokenType.SEMICOLON.name())));
            continue;
        } else if (matcher.group(Token.TokenType.COLON.name()) != null) {
            tokens.add(new Token(Token.TokenType.COLON, matcher.group(Token.TokenType.COLON.name())));
            continue;
        } else if (matcher.group(Token.TokenType.NAME.name()) != null) {
            tokens.add(new Token(Token.TokenType.NAME, matcher.group(Token.TokenType.NAME.name())));
            continue;
        } else if (matcher.group(Token.TokenType.WHITESPACE.name()) != null) {
            // Whitespace is matched but deliberately not added to the list
            continue;
        }
    }

    return tokens;
}

(As you can see, that is a lot of if statements; not very good, my professors would be mad!) When a String is passed into the method, we first build one big pattern that combines the regular expressions of every TokenType, each inside its own named group. Then, as the matcher walks the input string, every time a named group matches one of our TokenTypes we add the corresponding token to the ArrayList. Since I never intended for the language to be whitespace-sensitive, we do not add whitespace tokens to the ArrayList; but if we were creating a lexer for a language like Python, whitespace would be important and would need to be tokenized.
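To see the lexer in action, here is a small test (a sketch; the expected output follows from the token patterns above, including the leading space that the ASSIGN pattern captures):

// Quick test: tokenize one line of source and print every token.
ArrayList<Token> tokens = lexer("num = 10;");
for (Token token : tokens) {
    System.out.println(token);
}
// Prints:
// (NAME num)
// (ASSIGN  =)
// (NUMBER 10)
// (SEMICOLON ;)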

Till Next Time

Well, I built the whole lexical analyzer in about two days, so it should not be too much of a pain. I hope this was helpful on your journey to create your own programming language.
