Parsing Rules

    This latest version of Machine English is built around a multi-level rules engine. The natural language parsing logic is encapsulated in small rules that can easily be added, updated, or removed, which supports rapid development and customization.

    Currently there are four types of rules. SymbolRules operate at the character level. TokenRules govern how text is organized and interpreted at the level of words, numbers, complex words, and punctuation. MeaningRules add logic to choose among the multiple meanings a Token can have. Finally, AnswerRules apply logic to retrieve an answer based on the patterns built up in the TokenGraph.

    For experimentation purposes, you can turn individual Symbol and Token rules on and off and see how that affects the output of both the Demo page, where you can enter custom text, and the Tests page, where you can run a battery of tests. The site sets an anonymous profile cookie to remember your choices. Clearing your cookies for the site will remove your profile information.
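
    As a rough illustration of the idea, the following Python sketch runs a set of rules in ascending SortOrder and skips any rule that has been switched off. It is not the actual ME implementation: the Rule class, the run_rules function, and the use of a plain list as the state being processed are all assumptions made for this example. Rules that share a SortOrder are assumed to run in registration order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule record: a name, a SortOrder, and a function to apply."""
    name: str
    sort_order: float
    apply: Callable[[List[str]], List[str]]


def run_rules(rules: List[Rule], state: List[str],
              enabled: Optional[Set[str]] = None) -> List[str]:
    """Apply rules in ascending SortOrder; enabled=None means every rule is on."""
    # sorted() is stable, so rules with equal SortOrder keep registration order.
    for rule in sorted(rules, key=lambda r: r.sort_order):
        if enabled is None or rule.name in enabled:
            state = rule.apply(state)
    return state


def tag(name: str) -> Callable[[List[str]], List[str]]:
    """Build a toy rule body that records its own name in the state."""
    return lambda trace: trace + [name]


RULES = [
    Rule("Dash", 3.0, tag("Dash")),
    Rule("Comma", 1.0, tag("Comma")),
    Rule("EndMark", 2.0, tag("EndMark")),
]

print(run_rules(RULES, []))                             # all rules, by SortOrder
print(run_rules(RULES, [], enabled={"Comma", "Dash"}))  # EndMark switched off
```

    Keeping the enabled set outside the rules themselves is one simple way to let a per-user profile change behaviour without touching the rule definitions.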

     
    Symbol Rules

    When input text is first entered into ME, a SymbolGraph of characters is built. Each character, together with its properties, is called a Symbol and includes the whitespace following the character. A series of SymbolRules is run to determine as much information about each Symbol as possible. For instance, a '.' character could be a period, a decimal point, or part of an abbreviation. A decimal point needs to be treated as part of a number, while a period is given the property 'IsToken' and a property marking it as an end mark. SymbolRules do not change the structure of the SymbolGraph; they only add properties that describe the individual Symbols. These properties are used when building the TokenGraph and for processing further down the rule pipeline.
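
    The period example can be sketched in Python as follows. This is a minimal illustration, not ME's code: the Symbol class, build_symbols, and end_mark_rule are hypothetical, as are the property names 'IsEndMark' and 'IsNumericSymbol' (only 'IsToken' is named above). Abbreviations and ellipses are deliberately ignored, and leading whitespace is simply dropped.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Symbol:
    """One character plus the whitespace that follows it (hypothetical layout)."""
    char: str
    trailing_ws: str = ""
    props: Dict[str, bool] = field(default_factory=dict)


def build_symbols(text: str) -> List[Symbol]:
    """Attach each run of whitespace to the preceding non-space character."""
    symbols: List[Symbol] = []
    for ch in text:
        if ch.isspace():
            if symbols:                      # leading whitespace is dropped
                symbols[-1].trailing_ws += ch
        else:
            symbols.append(Symbol(ch))
    return symbols


def end_mark_rule(symbols: List[Symbol]) -> None:
    """Tag each '.' as a decimal point or an end mark; never restructure."""
    for i, sym in enumerate(symbols):
        if sym.char != ".":
            continue
        prev_digit = (i > 0 and symbols[i - 1].char.isdigit()
                      and not symbols[i - 1].trailing_ws)
        next_digit = (i + 1 < len(symbols) and not sym.trailing_ws
                      and symbols[i + 1].char.isdigit())
        if prev_digit and next_digit:
            sym.props["IsNumericSymbol"] = True   # assumed property name
        else:
            sym.props["IsToken"] = True
            sym.props["IsEndMark"] = True         # assumed property name


symbols = build_symbols("Pi is 3.14.")
end_mark_rule(symbols)
for sym in symbols:
    if sym.char == ".":
        print(sym.props)
```

    The rule only writes to props, which mirrors the point that SymbolRules describe Symbols without changing the graph.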

    Comma   SortOrder: 1.00
    The main function of this rule is to identify the role of a comma, either as a NumericSymbol within a number or as a word-break token.
    ComboPunctuation   SortOrder: 1.00
    DEV. This rule combines the Comma, EndMark, Dash, Apostrophe, and IsPunctuation rules.
    EndMark   SortOrder: 2.00
    This rule identifies whether the symbols '.', '?', and '!' are sentence EndMark tokens, or whether a period is part of an ellipsis or is a NumericSymbol acting as a decimal point.
    Dash   SortOrder: 3.00
    Identifies whether a Dash symbol is a NumericSymbol that is part of a number, a minus-operator MathSymbol, or a dash.
    Apostrophe   SortOrder: 4.00
    Determines whether the Apostrophe symbol is a single quote or an apostrophe. It is important that this rule runs before the PairedSymbolsRule to correctly group single quotes.
    Currency   SortOrder: 5.00
    Identifies that a symbol is a CurrencySymbol. Currently it is only set to identify the US currency symbols '$' and '¢'.
    Math   SortOrder: 6.00
    This rule identifies the MathSymbols '<', '>', '=', and '*', setting the property IsToken on the symbol so that it will be recognized as a single character Token.
    IsNumeric   SortOrder: 7.00
    A catch-all rule that picks up any number characters and identifies them as NumericSymbols.
    IsPunctuation   SortOrder: 8.00
    A catch-all rule that picks up any punctuation marks that have not yet been caught by other rules and identifies them as Punctuation and as single-character Tokens.
    PairedSymbols   SortOrder: 9.00
    This rule finds the ordinal of the first character of a paired symbol, then finds the ordinal of its match. It currently works with single and double quotes, three types of parentheses, and less-than / greater-than characters.
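
    One common way to pair opening and closing characters is a stack, sketched below in Python. This is a swapped-in textbook technique, not necessarily how the PairedSymbols rule is written. The function name match_pairs, the PAIRS table, and the choice of (), [], {} as the bracket types are assumptions. The apostrophes argument stands in for the output of the Apostrophe rule, which is why that rule has to run first.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed opener -> closer table; quotes use the same character for both ends.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
QUOTES = {'"', "'"}


def match_pairs(text: str,
                apostrophes: FrozenSet[int] = frozenset()) -> Dict[int, int]:
    """Map the ordinal of each opening character to the ordinal of its match.

    Ordinals listed in `apostrophes` are skipped, so "it's" does not open a
    single-quoted group. Unmatched openers and stray closers are ignored.
    """
    matches: Dict[int, int] = {}
    stack: List[Tuple[str, int]] = []   # (expected closer, ordinal of opener)
    for i, ch in enumerate(text):
        if i in apostrophes:
            continue
        if stack and ch == stack[-1][0]:
            _, start = stack.pop()      # closes the innermost open pair
            matches[start] = i
        elif ch in PAIRS:
            stack.append((PAIRS[ch], i))
        elif ch in QUOTES:
            stack.append((ch, i))
    return matches


print(match_pairs('He said "go (now)".'))
print(match_pairs("it's 'ok'", apostrophes=frozenset({2})))
```

    Without the apostrophes hint, the second call would pair the apostrophe in "it's" with the opening quote of 'ok', which is the mis-grouping the Apostrophe rule is there to prevent.
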
    Token Rules

    A TokenGraph is built on top of the SymbolGraph and is focused on determining the type and organization of a word or group of words. TokenRules take information from the SymbolGraph and use it to structure groups of Symbols as Tokens. Unlike Symbols, Tokens can be built up of other Tokens and can represent Words, Numbers, multi-word Complex Words, Punctuation, or raw (unidentified) Text. TokenRules begin identifying the primary meanings of each Token and add properties that describe its additional qualities.
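
    The step from Symbols to Tokens might look roughly like the Python sketch below. All names here (Symbol, Token, tag_symbols, build_text_tokens, identify_numeric) are hypothetical, and tag_symbols is a crude stand-in for the whole SymbolRule stage. The sketch groups Symbols into raw Text tokens, breaking on trailing whitespace and on Symbols flagged as single-character Tokens, then promotes all-numeric Text tokens to Numeric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Symbol:
    char: str
    trailing_ws: str = ""
    is_token: bool = False     # symbol stands alone as a one-character Token
    is_numeric: bool = False   # symbol is part of a number


@dataclass
class Token:
    kind: str                  # "Text", "Numeric", ...
    symbols: List[Symbol]

    @property
    def text(self) -> str:
        return "".join(s.char for s in self.symbols)


def tag_symbols(text: str) -> List[Symbol]:
    """Crude stand-in for the SymbolRule stage (assumption, not ME's logic)."""
    symbols: List[Symbol] = []
    for i, ch in enumerate(text):
        if ch.isspace():
            if symbols:
                symbols[-1].trailing_ws += ch
            continue
        between_digits = (0 < i < len(text) - 1
                          and text[i - 1].isdigit() and text[i + 1].isdigit())
        is_numeric = ch.isdigit() or (ch == "." and between_digits)
        is_token = not is_numeric and not ch.isalnum()
        symbols.append(Symbol(ch, is_token=is_token, is_numeric=is_numeric))
    return symbols


def build_text_tokens(symbols: List[Symbol]) -> List[Token]:
    """Group Symbols into Text tokens, splitting on whitespace and IsToken."""
    tokens: List[Token] = []
    current: List[Symbol] = []
    for sym in symbols:
        if sym.is_token:
            if current:
                tokens.append(Token("Text", current))
                current = []
            tokens.append(Token("Text", [sym]))
            continue
        current.append(sym)
        if sym.trailing_ws:
            tokens.append(Token("Text", current))
            current = []
    if current:
        tokens.append(Token("Text", current))
    return tokens


def identify_numeric(tokens: List[Token]) -> None:
    """Convert a Text token to Numeric when every symbol is numeric."""
    for tok in tokens:
        if tok.kind == "Text" and all(s.is_numeric for s in tok.symbols):
            tok.kind = "Numeric"


tokens = build_text_tokens(tag_symbols("Add 3.5 now."))
identify_numeric(tokens)
print([(t.kind, t.text) for t in tokens])
```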

    IdentifyNumeric   SortOrder: 1.00
    This rule converts a TextToken into a NumericToken if all symbols are NumericSymbols.
    IdentifyWords   SortOrder: 3.00
    If a TextToken is contained in the word dictionary, this rule converts it to a WordToken with its possible MeaningRules, then runs the MeaningRules to determine which meaning is best.
    IdentifyPlurals   SortOrder: 3.50
    Tests any TextTokens to see if they are the pluralized version of a word in the word dictionary.
    ComplexWord   SortOrder: 4.00
    Searches the WordTokens (only) in the TokenGraph and combines those that form a multi-word term into a ComplexWordToken with a single AbstractId. (Currently has a small vocabulary.)
    IdentifyMeanings   SortOrder: 5.00
    Loads the MeaningRule and any MeaningProps from the datasource corresponding to the MeaningRuleId of the WordToken or ComplexWordToken.
    NumberMeanings   SortOrder: 6.00
    Merges any NumericTokens and WordTokens that have a NumericValue and calculates the numeric DoubleValue of the combined Tokens, being mindful of PlaceValues such as a hundred, a thousand, a million, etc.
    IdentifyPunctuation   SortOrder: 7.00
    Converts any TextTokens that are single character punctuation Tokens into PunctuationTokens.
    SentenceEnd   SortOrder: 7.20
    Adds SentenceEnd values to appropriate PunctuationTokens, usually with EndMark symbols but occasionally quote marks or other Grouping tokens.
    RepeatingPunctuation   SortOrder: 7.50
    Merges consecutive PunctuationTokens of the same type into a single PunctuationToken.
    Domain   SortOrder: 8.00
    Identifies domains and combines other tokens into emails and URLs.
    Article   SortOrder: 9.00
    Identifies the articles "a", "an", and "the" and tags whether each is definite or not.
    CurrencyMeanings   SortOrder: 9.00
    Merges currency Abstracts with numeric values
    ZipCode   SortOrder: 10.00
    Identifies NumericTokens that represent a ZipCode or a ZipCode+4
    ListMembers   SortOrder: 11.00
    Groups together comma separated values (including and / or values) into a Members list
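
    The place-value arithmetic described for NumberMeanings can be illustrated with a small Python sketch. This is an assumed algorithm, not ME's actual one: the function combine_number_words and the UNITS and PLACE_VALUES tables are made up for the example, and the input is a plain list of strings rather than real Tokens. "Hundred" multiplies the running group, while "thousand" and "million" close the group and add it to the total.

```python
from typing import List

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
PLACE_VALUES = {"thousand": 1_000, "million": 1_000_000}


def combine_number_words(words: List[str]) -> float:
    """Combine digit strings and number words into a single float value."""
    total = 0.0      # value of the groups already closed by a place value
    current = 0.0    # value of the group being built
    for raw in words:
        word = raw.lower()
        if word in ("a", "and"):
            continue                       # "a hundred" is handled by max(.., 1)
        if word.replace(".", "", 1).isdigit():
            current += float(word)
        elif word in UNITS:
            current += UNITS[word]
        elif word == "hundred":
            current = max(current, 1.0) * 100
        elif word in PLACE_VALUES:
            total += max(current, 1.0) * PLACE_VALUES[word]
            current = 0.0
        else:
            raise ValueError(f"not a number word: {raw!r}")
    return total + current


print(combine_number_words(["two", "hundred", "five"]))
print(combine_number_words(["5", "million"]))
print(combine_number_words(["a", "thousand"]))
```
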
    Meaning Rules

    MeaningRules provide the last step in understanding the Meaning of a Token. Much of the structural meaning of the Tokens will have been established by this point. This is where we narrow down the most appropriate Meaning from the possible meanings that a Token can have. For instance, is the WordToken "HI" a greeting, or is it being used as the abbreviation of a US state?

    (0) Default Meaning
    HI UsStateHawaii (1411) Hawaii US State
    OH UsStateOhio (1435) Ohio US State
    OK UsStateOklahoma (1436) Oklahoma US State
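
    To make the "HI" example concrete, here is a hedged Python sketch of a MeaningRule that chooses between the default meaning and a US state. The rule names and ids are taken from the list above; everything else is an assumption, including the pick_meaning function and the heuristic itself (treat the word as a state when it follows a comma or precedes something shaped like a ZIP code). The real rules may use quite different context.

```python
import re
from typing import List, Tuple

Meaning = Tuple[str, int]   # (rule name, rule id)

DEFAULT_MEANING: Meaning = ("Default", 0)
US_STATE_MEANINGS = {
    "HI": ("UsStateHawaii", 1411),
    "OH": ("UsStateOhio", 1435),
    "OK": ("UsStateOklahoma", 1436),
}

# Five digits, optionally followed by a four-digit extension.
ZIP_LIKE = re.compile(r"^\d{5}(-\d{4})?$")


def pick_meaning(tokens: List[str], index: int) -> Meaning:
    """Pick a meaning for tokens[index] from its immediate neighbours."""
    word = tokens[index]
    if word not in US_STATE_MEANINGS:
        return DEFAULT_MEANING
    # Crude, assumed context test: "City, ST" or "ST 12345".
    after_comma = index > 0 and tokens[index - 1] == ","
    before_zip = (index + 1 < len(tokens)
                  and ZIP_LIKE.match(tokens[index + 1]) is not None)
    if after_comma or before_zip:
        return US_STATE_MEANINGS[word]
    return DEFAULT_MEANING


print(pick_meaning(["Honolulu", ",", "HI", "96815"], 2))
print(pick_meaning(["HI", "there"], 0))
```
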
    Answer Rules

    Once the TokenGraph is complete and the most likely Meanings have been calculated, we can generate a Pattern of the structure of the Tokens. Based on the best match with the list of Patterns in our data store, an appropriate AnswerRule is returned. This rule can run any arbitrary code and will return an appropriate string response.

    Default (0) : Default Answer Rule
    SimpleEquation (2) : Solve a simple equation.
    Currency (3) : Solve simple currency math.
    Equation (4) : Cancel out terms to balance the equation if needed.
    CommonType (5) : Find a common type for a member list
    WhatIs (6) : Identify what type something is
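
    The pattern-to-AnswerRule step can be sketched in Python as a lookup table. This is a simplification under stated assumptions: the pattern strings, the answer function, the fallback message, and the use of an exact match instead of a best match are all invented for the example. Only the rule names Default and SimpleEquation come from the list above.

```python
import operator
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]                     # (kind, text), a stand-in for real Tokens
AnswerRule = Callable[[List[Token]], str]

OPS = {"*": operator.mul, "-": operator.sub, "+": operator.add}


def default_answer(tokens: List[Token]) -> str:
    """Default rule: used when no pattern matches (message is hypothetical)."""
    return "No answer rule matched."


def simple_equation(tokens: List[Token]) -> str:
    """SimpleEquation rule: evaluate '<number> <operator> <number>'."""
    left, op, right = (text for _, text in tokens)
    if op not in OPS:
        return default_answer(tokens)
    return f"{OPS[op](float(left), float(right)):g}"


# Assumed pattern store: a pattern is the space-joined sequence of token kinds.
PATTERNS: Dict[str, AnswerRule] = {
    "Numeric Math Numeric": simple_equation,
}


def answer(tokens: List[Token]) -> str:
    """Build the pattern for a token list and run the matching AnswerRule."""
    pattern = " ".join(kind for kind, _ in tokens)
    rule = PATTERNS.get(pattern, default_answer)
    return rule(tokens)


print(answer([("Numeric", "6"), ("Math", "*"), ("Numeric", "7")]))
print(answer([("Word", "hello")]))
```

    Because each AnswerRule is just a callable that returns a string, adding a new rule in this sketch means writing one function and registering one pattern.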