Categories
Uncategorized

Stanford open class “compiler theory” study notes (2) recursive descent

table of Contents

  • 一. Parse阶段
    • CFG
    • Recursive Descent (recursive descent traversal)

  • 二. 递归下降遍历

      2.1 Preliminaries

      More than 2.2 lines roadmap statements

      2.3 Simple grammar definition

      2.4 grammar production transcoding

      2.5 parsed line by line

      2.6 Check calculation

  • III. Summary

Sample code is hosted: http: //www.github.com/dashnowords/blogs

Blog Park address: “the history of living in the big front-end” original blog directory

Huawei cloud community Address: [you want the front Daguai upgrade guide]

B station Address: [] compiler theory

Stanford open class: open class [Stanford University official website]

Curriculum content related to speaking of is still very clear, but in some places a little out of touch, with the advice of their own class classic “Compilers-priciples, Techniques and Tools” (that is, the famous dragon book) as supplementary reading.

A. Parse stage

Lexical analysis phase of the mission is to string to Token group, while the target is to phase Parse Token become Parse Tree, Benpian only the most basic part of this part.

CFG

CFG that is context free grammer, define a CFG grammar rules need to declare the following characteristics:

    A set of termination symbol set, also known as “lexical units”

    A set of non-terminating symbols, also known as “syntax variables”

    A set of starting symbols

    A number of production rules (Generator means that under the current CFG syntax, generating symbols -> left and right sides can be substituted for each other)

The basic CFG transformation process is as follows:

Start from the set belonging to S starts, try string nonterminals X alternative form of termination set (X-> Y1Y2 … Yn), this step is repeated until the character sequence contains no more nonterminals. This process is called Derivation (derivative), which is the sequence number of the transformation process, can be converted to a tree, the root of the tree is the initial set member S, each terminating in a sub-set of nodes after the conversion form mounted under the root node, the tree is called tree generated Parse tree, the final result can be seen that leaf node is actually Parse tree traversal results.

When the need to convert the characters have a plurality of non-terminal, needs to be deduced by one in a certain order, the process may be derived in accordance with left-most or right-most, but sometimes get different legal conversions tree, usually by modifying the conversion set syntax or set the priority to solve.

Recursive Descent (recursive descending traversal)

Recursive Descent is a parse tree traversal strategy is a typical recursive backtracking algorithm starts at the root of the tree, one by one attempts to generate a non-termination character rules in the current record on the parent node can support, and determine whether its child nodes in line with this form until the child nodes fits a certain production rules, and then recursively traverse the deep, trying to finish if all production rules in a non-terminating node can not continue down so that the sub-trees leaf nodes are in line with the termination symbol set, then back to the need to try the next node and a parent node of a production rule, so that the program can continue to be backwards loop. Curriculum with a lot of mathematical symbols defined process and pseudo code to describe recursive traversal, and if that is not well understood too abstract can skip. Note left recursive grammar will make recursive traversal fall into an infinite loop, the grammar should be designed to avoid, dragon book also provides a common resolution methods to solve this problem.

II. Recursive descending traversal

[Statement] in due course and did not see the whole picture from tokens to parse tree, and can only be gradually digested the basics. Process just below the author’s own understanding (especially in the form of progressive analysis, because it has not involve any structural syntax, so versatile yet to be considered), for reference, also welcomed the exchange correction. But for intuitive understanding the recursive descent method is sufficient.

2.1 Preparatory knowledge

This section uses JavaScript to implement recursive descent traversal object code remains on the sample code in a blog post:

var b3 = 2;
a = 1 + ( b3 + 4);
return a;

The following sequence of words can be obtained after the word separator in the previous section:

[ 'keywords', 'var' ],
[ 'id', 'b3' ],
[ 'assign', '=' ],
[ 'num', '2' ],
[ 'semicolon', ';' ],
[ 'id', 'a' ],
[ 'assign', '=' ],
[ 'num', '1' ],
[ 'plus', '+' ],
[ 'lparen', '(' ],
[ 'id', 'b3' ],
[ 'plus', '+' ],
[ 'num', '4' ],
[ 'rparen', ')' ],
[ 'semicolon', ';' ],
[ 'keywords', 'return' ],
[ 'id', 'a' ],
[ 'semicolon', ';' ]

The syntax analysis is based on grammatical rules, called grammar rules, generally refers to a series of production CFG represented, most developers do not have the ability to design a set of grammatical rules, draw directly here in Mozilla SpiderMonkey Javascript engine in grammar definition for basic production, due to the very large grammar Javascript language involved in this section only the selected portion of the simplified syntax rules associated with the target analytic formula (marked in blue part in the figure):

You can view the complete syntax rules [SpiderMonkey_ParserAPI] to understand.

More than 2.2 lines roadmap statements

We the above object is a piece of parsing code as Javascript code, when the top-down analysis, root Program type, it may be constituted by a plurality of nodes Statement (statement node), the present embodiment is simplified to SEMICOLON ( semicolon) as the batch processing morpheme boundary point, each part is read into a buffer between the two semicolons analysis, since a single line in the example sentence, it is relatively simple to understand.

In a more complex case, the code includes a conditional statement, there may be multi-line sentence when some structural loop of keywords such case the buffer can first morpheme queue basic structural analysis before recursive descent If the structure of the matching pattern found, read in the sequence of tokens into the buffer from the next row (or rows), until all the tokens in the buffer in compliance with a certain structure, before starting the recursive decline.

2.3 Simple grammar definition

To facilitate understanding, in this case we are abbreviated using keywords to indicate possible grammatical rule set, if you have some knowledge of Javascript language, they are very easy to understand

/**
 * 文法定义-生产规则
 * Program -> Statement
 * P -> S
 * 
 * 语句 -> 块状语句 | if语句 | return语句 | 声明 | 表达式 |......
 * Statement -> BlockStatement | IfStatement | ReturnStatement | Declaration | Expression |......
 * S -> B | I | R | D | E
 * 
 * B -> { Statement }
 * 
 * I -> if ( ExpressionStatement ) { Statement }
 * 
 * R -> return Expression | null
 * 
 * 声明 -> 函数声明 | 变量声明
 * Declaration -> FunctionDeclaration | VariableDeclaration
 * D -> F | V
 * 
 * F -> function ID ( SequenceExpression ) { ... }
 * 
 * V -> 'var | let | const' ID [= Expression | Null] ?
 * 
 * 表达式 -> 赋值表达式 | 序列表达式 | 一元运算表达式 | 二元运算表达式 |......
 * Expression -> AssignmentExpression | SequenceExpression | UnaryExpression | BinaryExpression | BracketExpression......
 * E -> A | Seq | U | BI | BRA |...
 * 
 * A -> E = E //赋值表达式
 * 
 * Seq -> ID,ID,ID//类似形式
 * 
 * //一元表达式
 * U -> "-" | "+" | "!" | "~" | "typeof" | "void" | "delete" E
 * 
 * //二元表达式
 * BI -> E "==" | "!=" | "===" | "!=="
         | "<" | "<=" | ">" | ">="
         | "<<" | ">>" | ">>>"
         | "+" | "-" | "*" | "/" | "%"
         | "|" | "^" | "&" | "in"
         | "instanceof" | ".."  E
 * 
 * //括号表达式
 * BRA -> ( E )
 * 
 * N -> null  
 */

Need extra attention to that expression Expression assignment expression AssignmentExpression of production, judge rules E’s need to determine A, and the logic A’s again called E, there is a kind of left recursion, if you do not perform any treatment, the code runs will be caught in an endless loop and then burst stack, which is earlier emphasized the need to eliminate left-recursive grammar scene in the production design. This is not to say spiderMonkey of parserAPI is wrong, because the elimination of left-recursive grammar transformation is equivalent to just a form of conversion, it is to prevent the production of infinite recursion (or enter an infinite recursion of the program to achieve an infinite loop) and one form of treatment done, the process of transformation may be just introduced an intermediate collection to eliminate the effects of this scenario, and will not have an impact on the final syntax ideographic.

The sample code below does not conduct rigorous “left recursion elimination”, but simply used a set of E_, some subtle differences distinguish the original E, thus avoiding an endless loop.

2.4 grammar production transcoding

The syntax rules in the previous section is translated code below (only a partial derivation production, complete code in this example may be obtained from a cartridge or code demo):

//判断是否为Statement
function S(tokens) {
    //把结尾的分号全部去除
    while(tokens[tokens.length - 1][0] === TT.semicolon){
        tokens.pop();
    }
    return B(tokens) || I(tokens) || R(tokens) || D(tokens) || E(tokens);
}

//判断是否为BlockStatement  B -> { Statement } (本例中并不涉及本方法,故暂不考虑末尾分号和文法递归的情况)
function B(tokens) {
     //本例中不涉及,直接返回false
    return false;
}

//判断是否为IfStatement I -> if ( ExpressionStatement ) { Statement }
function I(tokens) {
    //本例中不涉及,直接返回false
    return false;
}
//判断是否为ReturnStatement  R -> return Expression | null
function R(tokens) {
    return isReturn(tokens[0]) && (E(tokens.slice(1)) || N(tokens.slice(1)[0]));
}

//判断是否为声明语句 Declaration -> FunctionDeclaration | VariableDeclaration
function D(tokens) {
    return F(tokens) || V(tokens);
}

//判断是否为函数声明  F -> function ID ( SequenceExpression ) { ... }
function F(tokens) {
    //本例中不涉及,直接返回false
    return false;
}

//判断是否为变量声明  V -> 'var | let | const' ID [= Expression | Null] ?
function V(tokens) {
    //判断为1.单纯的声明 还是 2.带有初始值的声明
    if (tokens.length === 2) {
        return isVariableDeclarationKeywords(tokens[0]) && tokens[1][0] === TT.id;
    }
    return isVariableDeclarationKeywords(tokens[0]) && (A(tokens.slice(1))) || N(tokens.slice(1));
}

//....其他代码形式雷同,不再赘述

2.5 parsed line by line

The default parsing each represents the end of a statement, it has been previously mentioned encounters a semicolon ideas for dealing with multi-line statement. Only need to read a bit sequence of tokens into the buffer array implementing the method, and starting the analysis from the top S, to complete the top-down process of reasoning.

 /**parser */
function parse(tokens) {
    let buffer = nextStatement(tokens);
    let flag = true;

    while (buffer && flag){

       if (!S(buffer)) {
           console.log('检测到不符合语法的tokens序列');
           flag = false;
       } 
       buffer = nextStatement(tokens);
    }   
    
    //如果没有出错则提示正确
    flag && console.log('检测结束,被检测tokens序列是合法的代码段');
}

//将下一个Statement全部读入缓冲区
function nextStatement(tokens) {
    let result = [];
    let token;

    while(tokens.length) {
        token = tokens.shift();
        result.push(token);
        //如果不是换行符则
        if (token[0] === CRLF) {
            break;
        }
    }

    return result.length ? result : null;
}

2.6 Check calculation

Calculation step execution process can help us better understand the process of performing a recursive descent method:

Open a command line in the input directory demo: node –inspect-brk recursive-descent.js, then step through the code is easy to see how to implement backtracking recursion and in the implementation process:

III. Summary

When the prompt simply the end result of recursive descent only to find out the rules of grammar does not meet any statement or final statements are in line with all the rules of grammar, but did not get the object in a tree structure, nor the next link providing output, it remains to be exploring how to connect with the subsequent links in the compilation process.

Leave a Reply