Building a compiler in Python

Have you ever wondered how your high-level code gets transformed into machine instructions? Or perhaps pondered the magic behind turning…

Luis Soares

19 Sep 2023 — 3 min read

Have you ever wondered how your high-level code gets transformed into machine instructions? Or perhaps pondered the magic behind turning human-readable text into a format a machine understands and executes? If so, you’re in for an enlightening journey! Building a compiler is a rite of passage for computer scientists and an unparalleled learning experience for every developer.

Why delve into compiler construction?

Deepened Understanding: The art of compiler design goes beyond the surface, delving deep into the intricate ballet of how code is parsed, optimized, and translated. This deeper understanding makes you appreciate the nuances of programming languages, enabling you to write more efficient and cleaner code.
Problem Solving: Designing and implementing a compiler will test and refine your problem-solving skills. From lexical analysis to code generation, each stage presents its unique challenges and rewards.
Diverse Knowledge Gain: It’s not just about translation; it’s about optimization, data structure utilization, algorithms, and understanding the architecture of underlying machines. This diverse knowledge base can be applied in numerous software development and computer science areas.
Low-Level Appreciation: In an era dominated by high-level languages and frameworks, understanding the intricacies at a lower level sets you apart. You get a clearer picture of performance implications, memory management, and how different constructs affect machine instructions.

By the end of this guide, you won’t just have built a simple compiler; you’ll have embarked on a journey that brings clarity to many of the abstract concepts in software development. Even if you don’t end up professionally in the niche field of compiler design, the insights gained can elevate your coding prowess in myriad unexpected ways.

So, are you ready to embark on this illuminating journey? Let’s start by understanding the language we’ll be compiling!

Step 1: Choose a Language Specification

For this guide, we’ll define a simple language called MiniLang. It will support:

Integer literals
Addition and subtraction

The grammar is:

<expression> ::= <number> 
              | <expression> + <expression> 
              | <expression> - <expression>

Step 2: Lexical Analysis (Tokenizing)

This step converts source code text into a list of tokens.

import re 
 
def tokenize(code): 
    return re.findall(r'\d+|\+|-', code)

Step 3: Parsing

Convert the flat list of tokens into a structured format (often an Abstract Syntax Tree, AST). For simplicity, we’ll use nested lists.

def parse(tokens): 
    token = tokens.pop(0) 
     
    if token == '+': 
        return ['+', parse(tokens), parse(tokens)] 
    elif token == '-': 
        return ['-', parse(tokens), parse(tokens)] 
    elif token.isdigit(): 
        return ['INT', int(token)]

Step 4: Code Generation

We’ll target a hypothetical stack machine. Operations pop their operands off the stack and push results.

def generate(ast): 
    if ast[0] == 'INT': 
        return [f"PUSH {ast[1]}"] 
    else: 
        left_code = generate(ast[1]) 
        right_code = generate(ast[2]) 
        if ast[0] == '+': 
            return left_code + right_code + ["ADD"] 
        elif ast[0] == '-': 
            return left_code + right_code + ["SUB"]

Step 5: Bring It All Together

Create a compile function that combines the tokenizing, parsing, and code generation steps.

def compile(code): 
    tokens = tokenize(code) 
    ast = parse(tokens) 
    return generate(ast)

Example Usage:

code = "+ 5 - 3 2" 
print("\n".join(compile(code)))

This should produce:

PUSH 5 
PUSH 3 
PUSH 2 
SUB 
ADD

The output produces an assembly-like low-level code conceptually similar to what machine code looks like.

Final Notes:

This is a very basic and limited compiler. Real-world compilers have many more stages, handle more sophisticated language features, and produce more optimized output.
Error handling is omitted for clarity. Usually, you’d add checks to handle invalid syntax or other issues.
This compiler targets a hypothetical machine for simplicity. In a real scenario, you’d generate code for an actual or virtual machine, like the JVM or LLVM.
If you’re serious about writing compilers, consider diving into resources like the “Dragon Book” (Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman).

Stay tuned, and happy coding!

Visit my Blog for more articles, news, and software engineering stuff!

Follow me on Medium, LinkedIn, and Twitter.

All the best,