Graph Viz
I’ve been trying out a new approach for helping students debug their Abstract Syntax Trees in the Compiler Design and Implementation class where I’m a TA. The compiler they’re building is for a subset of C89 and the first two assignments are flex and bison which culminate in an AST for their compiler to perform the compilation. A method for visualizing the graph goes a long way in identifying where you have incorrect rules in flex or bison or the supporting code used to construct the tree from a simple program. I recalled that GraphViz had a simple language for representing a graph so I decided to try it out.
Installing GraphViz
I favor MacPorts and was happy to see that there’s an installer available for GraphViz as well as its GUI. The following was taken from the graphviz.org install page:
- log in & download Xcode an Xcode Command Line Tools from https://developer.apple.com/downloads/
- install Xcode and the Xcode Command Line Tools
- agree to Xcode license in terminal: sudo xcodebuild -license
- get MacPorts pkg installer for your version of osx from https://www.macports.org/install.php#installing
- install MacPorts for your version of osx
- in terminal: sudo port -v selfupdate
- install graphviz via MacPorts. in terminal: sudo port install graphviz-gui
- installed gui application can be found here: /Application/Macports/Graphviz.app
DOT
DOT is a text based graph representation. Each of the nodes has an entry where you can configure its shape and label when rendering. I kept the format really simple by using the default shape and the node type for the the label. In a few cases, I modified the label for something like an operator.
The hello world DOT example
digraph AST {
node1 [label="hello"]
node2 [label="world"]
node1 -> node2
}
Generate a PNG from this snippet using the dot command line like so:
dot -O -Tpng helloworld.dot
-O
tells DOT to create an output file with the same name as the input file and the appropriate suffix
-Tpng
tells DOT to set the type of the output file to PNG. There are a number of other formats supported, see their docs.
The first line tells the DOT program what type of graph to produce. I’m using a directed graph so the edges will have arrows on them.
The nodes are defined with an id and then their attributes withing the square brackets. If there isn’t a label for the node then it’ll use the id of the node for the label. I’ve created two nodes here node1
and node2
and applied labels to them.
The last bit is connecting the nodes. I have node1
pointing to node2
.
Converting AST to DOT
The first step is determining how to name the nodes. The only requirement here is for our nodes to have unique names. Since the parser is written in C, the easiest thing to do is to use the struct node * as the identifier. Thus, where I have node1, node2, etc in the Hello World example, I’ll use a simple printf and “node%p” in order to produce a unique name for the node.
Consider a simple binary operation like addition. The expression a + b
will result in a root node for the plus operation with a left node for the identifier a
and a right node for the identifier b
. This graph in DOT would be:
digraph AST {
node1 [label="+"]
node2 [label="a"]
node3 [label="b"]
node1 -> node2
node1 -> node3
}
A slightly larger example from C code:
int max(int a, int b) {
return a > b? a : b;
}
The resulting formalized version source (the parser fully specifies the types and adds some clarifying parens.
signed int max(signed int a, signed int b) {
return (a > b)? a : b;
}
The resulting graph and DOT file:
FYI - the AST is a rough draft at this point and needs some tweeking
How to read the graph:
Start from the top and move down and to the left. Here’s a traversal step by step:
- NODE_TUNIT : the root translation unit
- NODE_FUNC_DEF : the function definition for our simple max function
- NODE_SIGNED_INT : the return type for the function
- NODE_FUNC_DECL : the declaration of the function (which would be matched to its declaration if available)
- max : the identifier for the function (its name)
- NODE_PLIST : the list of formal parameters for the function
- NODE_PDECL : first param which is a signed int named
a
- NODE_PDECL : second param which is a signed int named
b
- NODE_SCOPE : the function body
- return : the return statement
- NODE_TERNARY : c’s ternary expression
- > : relational expression
- a : result of the ternary if its expression is true
- b : result of the ternary if its expression is false
digraph AST {
node0x7fc1cac2d5c0 [label="NODE_TUNIT"]
node0x7fc1cac2d490 [label="NODE_FUNC_DEF"]
node0x7fc1cac2bf30 [label="NODE_SIGNED_INT"]
node0x7fc1cac2d490 -> node0x7fc1cac2bf30
node0x7fc1cac2c9e0 [label="NODE_FUNC_DECL"]
node0x7fc1cac2c060 [label="max"]
node0x7fc1cac2c9e0 -> node0x7fc1cac2c060
node0x7fc1cac2c8b0 [label="NODE_PLIST"]
node0x7fc1cac2c3f0 [label="NODE_PDECL"]
node0x7fc1cac2c190 [label="NODE_SIGNED_INT"]
node0x7fc1cac2c3f0 -> node0x7fc1cac2c190
node0x7fc1cac2c2c0 [label="a"]
node0x7fc1cac2c3f0 -> node0x7fc1cac2c2c0
node0x7fc1cac2c8b0 -> node0x7fc1cac2c3f0
node0x7fc1cac2c780 [label="NODE_PDECL"]
node0x7fc1cac2c520 [label="NODE_SIGNED_INT"]
node0x7fc1cac2c780 -> node0x7fc1cac2c520
node0x7fc1cac2c650 [label="b"]
node0x7fc1cac2c780 -> node0x7fc1cac2c650
node0x7fc1cac2c8b0 -> node0x7fc1cac2c780
node0x7fc1cac2c9e0 -> node0x7fc1cac2c8b0
node0x7fc1cac2d490 -> node0x7fc1cac2c9e0
node0x7fc1cac2d360 [label="NODE_SCOPE"]
node0x7fc1cac2d230 [label="return"]
node0x7fc1cac2d100 [label="NODE_TERNARY"]
node0x7fc1cac2cd70 [label=">"]
node0x7fc1cac2cb10 [label="a"]
node0x7fc1cac2cc40 [label="b"]
node0x7fc1cac2cd70 -> node0x7fc1cac2cb10
node0x7fc1cac2cd70 -> node0x7fc1cac2cc40
node0x7fc1cac2cea0 [label="a"]
node0x7fc1cac2cfd0 [label="b"]
node0x7fc1cac2d100 -> node0x7fc1cac2cd70
node0x7fc1cac2d100 -> node0x7fc1cac2cea0
node0x7fc1cac2d100 -> node0x7fc1cac2cfd0
node0x7fc1cac2d230 -> node0x7fc1cac2d100
node0x7fc1cac2d360 -> node0x7fc1cac2d230
node0x7fc1cac2d490 -> node0x7fc1cac2d360
node0x7fc1cac2d5c0 -> node0x7fc1cac2d490
}
Next Steps
The above is the result of a couple of hours of playing with DOT and producing an alternate format for the AST. One obvious next step would be to improve the DOT file to include labels for the edges between nodes to better explain their relationships. For example, the function declaration’s label to its identifier could be labeled “name” and the ternary’s labels to its children could be “test”, “iftrue”, “iffalse”.
Check out the gallery in GraphViz, there’s a ton of formatting options!