MACHINE ENGLISH

This latest version of Machine English is built around a multi-level rules engine. The natural language parsing logic is encapsulated in small, self-contained rules that can be easily added, updated, or removed, which allows rapid development and customization.

Currently there are four types of rules. SymbolRules operate at the character level. TokenRules govern how text is organized and interpreted at the word, number, complex word, or punctuation level. MeaningRules add logic to pick among the multiple meanings a Token can have. Finally, AnswerRules apply logic to retrieve an answer based on the patterns built up in the TokenGraph.

For experimentation, you can turn individual Symbol and Token rules on and off and see how that affects the output of both the Demo page, where you can enter custom text, and the Tests page, where you can run a battery of tests. The site sets an anonymous profile cookie to remember your choices. Clearing your cookies for the site will remove your profile information.
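
As a rough illustration of how rules ordered by SortOrder could be run with some of them switched off, here is a minimal Python sketch. The names Rule, run_rules, and disabled are hypothetical and are not taken from the ME codebase; the rule bodies are placeholders that only record that they ran.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    sort_order: float
    apply: Callable[[list], None]  # mutates the graph in place


def run_rules(rules, graph, disabled=frozenset()):
    """Run enabled rules in ascending sort order; return the names that ran."""
    ran = []
    # sorted() is stable, so rules with equal sort orders keep their list order.
    for rule in sorted(rules, key=lambda r: r.sort_order):
        if rule.name in disabled:
            continue
        rule.apply(graph)
        ran.append(rule.name)
    return ran


# Placeholder rules: each one just appends its own name to the graph.
rules = [
    Rule("EndMark", 2.0, lambda g: g.append("EndMark")),
    Rule("Comma", 1.0, lambda g: g.append("Comma")),
    Rule("Dash", 3.0, lambda g: g.append("Dash")),
]

print(run_rules(rules, [], disabled={"EndMark"}))
```

Keeping the on/off choice outside the rule objects means the same rule list can be shared between profiles, with each profile supplying only its own set of disabled names.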

Symbol Rules

When input text is first entered into ME, a SymbolGraph of characters is built. Each character, together with its properties, is called a Symbol and includes the whitespace following the character. A series of SymbolRules is run to determine as much information about each Symbol as possible. For instance, a period character '.' could be an end mark, a decimal point, or an indication of an abbreviation. A decimal point needs to be considered part of a number, while a period that ends a sentence is given the property 'IsToken' and a property marking it as an end mark. SymbolRules do not change the structure of the SymbolGraph; they only add properties describing the individual Symbols. These properties are used when building the TokenGraph and in processing further down the rule pipeline.
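
To make the idea of a Symbol concrete, the sketch below builds a flat list of symbols in Python, attaching trailing whitespace to the preceding character and leaving an empty property bag for rules to fill in. The Symbol fields and the build_symbol_graph function are assumptions made for illustration, and leading whitespace is simply dropped here.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    char: str
    trailing_whitespace: str = ""
    props: dict = field(default_factory=dict)  # filled in later by rules


def build_symbol_graph(text):
    """Turn text into Symbols; whitespace is stored on the preceding Symbol."""
    symbols = []
    for ch in text:
        if ch.isspace():
            if symbols:  # leading whitespace has no Symbol to attach to
                symbols[-1].trailing_whitespace += ch
        else:
            symbols.append(Symbol(ch))
    return symbols


graph = build_symbol_graph("Hi. Ok")
# A rule only adds properties; it never inserts or removes Symbols.
graph[2].props["IsToken"] = True
print([(s.char, s.trailing_whitespace) for s in graph])
```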

Comma SortOrder: 1.00

This rule identifies the function of a comma: either a NumericSymbol within a number or a word-break token.
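
One plausible way to make that distinction is to look at the neighbouring characters. The Python sketch below assumes that a comma with a digit on both sides belongs to a number and that any other comma is a word break; the function name and the "WordBreak" label are hypothetical.

```python
def classify_commas(text):
    """Return (index, kind) for every comma in text."""
    results = []
    for i, ch in enumerate(text):
        if ch != ",":
            continue
        prev_is_digit = i > 0 and text[i - 1].isdigit()
        next_is_digit = i + 1 < len(text) and text[i + 1].isdigit()
        # Digit on both sides: treat the comma as part of a number.
        kind = "NumericSymbol" if prev_is_digit and next_is_digit else "WordBreak"
        results.append((i, kind))
    return results


print(classify_commas("1,000 apples, pears"))
```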

ComboPunctuation SortOrder: 1.00

DEV. This rule combines the Comma, EndMark, Dash, Apostrophe, and IsPunctuation rules.

EndMark SortOrder: 2.00

This rule identifies whether the symbols '.', '?', and '!' function as sentence EndMark tokens, or whether a period is part of an Ellipse (ellipsis) or is a NumericSymbol acting as a decimal point.
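
The three cases can be sketched with simple neighbour checks. This Python example is only an approximation under stated assumptions: a period next to another period is treated as part of an ellipsis, a period between two digits as a decimal point, and everything else as an end mark. Abbreviations are not handled, and the function and label names are hypothetical.

```python
def classify_end_marks(text):
    """Return (index, kind) for every '.', '?' or '!' in text."""
    results = []
    for i, ch in enumerate(text):
        if ch not in ".?!":
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "." and (prev == "." or nxt == "."):
            kind = "Ellipse"
        elif ch == "." and prev.isdigit() and nxt.isdigit():
            kind = "DecimalPoint"
        else:
            kind = "EndMark"
        results.append((i, kind))
    return results


print(classify_end_marks("Pi is 3.14... Really?"))
```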

Dash SortOrder: 3.00

Identifies whether a Dash symbol is a NumericSymbol that is part of a number, a minus-operator MathSymbol, or a plain dash.

Apostrophe SortOrder: 4.00

Determines whether the Apostrophe symbol is a single quote or an apostrophe. It is important that this rule runs before the PairedSymbolsRule to correctly group single quotes.

Currency SortOrder: 5.00

Identifies a symbol as a CurrencySymbol. Currently it only identifies the US currency symbols '$' and '¢'.

Math SortOrder: 6.00

This rule identifies the MathSymbols '<', '>', '=', and '*', setting the property IsToken on the symbol so that it will be recognized as a single character Token.

IsNumeric SortOrder: 7.00

A catch-all rule that picks up any number characters and identifies them as NumericSymbols.

IsPunctuation SortOrder: 8.00

A catch-all rule that picks up any punctuation marks that have not yet been caught by other rules and identifies them as Punctuation and as single-character Tokens.

PairedSymbols SortOrder: 9.00

This rule finds the ordinal of the opening character of a paired symbol, then finds the ordinal of its matching closing character. It currently works with single and double quotes, three types of parentheses, and less-than / greater-than characters.
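
Matching an opening symbol to its closing partner is commonly done with a stack, and the Python sketch below takes that approach; it is not necessarily how ME implements the rule. It assumes the three bracket types are (), [], and {}, and that apostrophes have already been excluded by an earlier rule, so any remaining single quote is a real quote mark. The function name find_pairs is hypothetical.

```python
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = {close: open_ for open_, close in BRACKETS.items()}
QUOTES = {'"', "'"}


def find_pairs(text):
    """Return sorted (open_ordinal, close_ordinal) pairs of matched symbols."""
    pairs = []
    stack = []  # (char, ordinal) of symbols that are still open
    for i, ch in enumerate(text):
        if ch in QUOTES:
            # A quote closes the innermost open quote of the same kind,
            # otherwise it opens a new one.
            if stack and stack[-1][0] == ch:
                pairs.append((stack.pop()[1], i))
            else:
                stack.append((ch, i))
        elif ch in BRACKETS:
            stack.append((ch, i))
        elif ch in CLOSERS:
            if stack and stack[-1][0] == CLOSERS[ch]:
                pairs.append((stack.pop()[1], i))
    return sorted(pairs)


print(find_pairs('("a")'))
```

Unmatched symbols are simply left on the stack and produce no pair, which lets later rules treat them as ordinary punctuation.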

Token Rules

A TokenGraph is built on top of the SymbolGraph and is focused on determining the type and organization of a word or group of words. TokenRules take information from the SymbolGraph and use it to structure groups of Symbols as Tokens. Unlike Symbols, Tokens can be built up of other Tokens and can represent Words, Numbers, multi-word Complex Words, Punctuation, or raw (unidentified) Text. TokenRules begin identifying the primary meanings of each Token and add properties that help describe its additional qualities.

IdentifyNumeric SortOrder: 1.00

This rule converts a TextToken into a NumericToken if all symbols are NumericSymbols.

IdentifyWords SortOrder: 3.00

If a TextToken is found in the word dictionary, this rule converts it to a WordToken with its possible MeaningRules, then runs the MeaningRules to determine the best meaning.

IdentifyPlurals SortOrder: 3.50

Tests any TextTokens to see if they are the pluralized version of a word in the word dictionary.
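
A simple way to test for a plural is to strip common English suffixes and look the result up in the dictionary. The Python sketch below assumes a tiny stand-in dictionary and only three suffix patterns ("ies", "es", "s"); irregular plurals are out of scope, and the names WORDS and singular_of are hypothetical.

```python
WORDS = {"apple", "box", "city"}  # stand-in for the word dictionary


def singular_of(text, dictionary=WORDS):
    """Return the dictionary word that text is a plural of, or None."""
    word = text.lower()
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")  # cities -> city
    if word.endswith("es"):
        candidates.append(word[:-2])        # boxes -> box
    if word.endswith("s"):
        candidates.append(word[:-1])        # apples -> apple
    for candidate in candidates:
        if candidate in dictionary:
            return candidate
    return None


print(singular_of("Apples"), singular_of("boxes"), singular_of("cities"))
```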

ComplexWord SortOrder: 4.00

Searches the WordTokens (only) in the TokenGraph and combines them into a ComplexWordToken with a single AbstractId. (Currently has a small vocabulary.)

IdentifyMeanings SortOrder: 5.00

Loads the MeaningRule and any MeaningProps from the data source corresponding to the MeaningRuleId of the WordToken or ComplexWordToken.

NumberMeanings SortOrder: 6.00

Merges any NumericTokens and WordTokens that have a NumericValue and calculates the numeric DoubleValue of the combined Tokens, being mindful of PlaceValues such as a hundred, a thousand, a million, etc.
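
The place-value arithmetic can be sketched with a running group total that is multiplied by each place value it meets. This Python example is an assumption about how such a merge might work, not the ME implementation; the SMALL and PLACE_VALUES tables are deliberately tiny and combined_value is a hypothetical name.

```python
SMALL = {"one": 1, "two": 2, "three": 3, "five": 5, "ten": 10, "twenty": 20}
PLACE_VALUES = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}


def combined_value(tokens):
    """Combine digit tokens and number words into a single float value."""
    total = 0.0    # value of the groups already closed by thousand/million
    current = 0.0  # value of the group being built
    for tok in tokens:
        t = tok.lower()
        if t in ("a", "and"):
            continue
        if t in PLACE_VALUES:
            scale = PLACE_VALUES[t]
            current = (current or 1.0) * scale  # "a hundred" -> 100
            if scale >= 1_000:
                total += current
                current = 0.0
        elif t in SMALL:
            current += SMALL[t]
        else:
            # Digit tokens such as "3" or "1,250"; raises ValueError otherwise.
            current += float(t.replace(",", ""))
    return total + current


print(combined_value(["two", "hundred", "thousand"]))
print(combined_value(["3", "million"]))
```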

IdentifyPunctuation SortOrder: 7.00

Converts any TextTokens that are single character punctuation Tokens into PunctuationTokens.

SentenceEnd SortOrder: 7.20

Adds SentenceEnd values to appropriate PunctuationTokens, usually with EndMark symbols but occasionally quote marks or other Grouping tokens.

RepeatingPunctuation SortOrder: 7.50

Merges consecutive punctuation marks of the same type into a single PunctuationToken.
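
Merging runs of identical punctuation is a natural fit for grouping consecutive equal items. The Python sketch below works on plain strings rather than real Tokens and uses the standard library's itertools.groupby; the function name merge_repeats is hypothetical.

```python
from itertools import groupby
import string


def merge_repeats(tokens):
    """Merge runs of the same single-character punctuation into one token."""
    merged = []
    for key, run in groupby(tokens):
        run = list(run)
        if len(key) == 1 and key in string.punctuation:
            merged.append(key * len(run))  # "?", "?" -> "??"
        else:
            merged.extend(run)             # repeated words stay separate
    return merged


print(merge_repeats(["What", "?", "?", "!", "no", "no"]))
```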

Domain SortOrder: 8.00

Identifies domains and combines other tokens into emails and URLs.

Article SortOrder: 9.00

Identifies the articles "a", "an", and "the" and tags whether each is definite or not.

CurrencyMeanings SortOrder: 9.00

Merges currency Abstracts with numeric values.

ZipCode SortOrder: 10.00

Identifies NumericTokens that represent a ZipCode or a ZipCode+4.
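
The two ZIP formats are easy to describe with a regular expression: five digits, optionally followed by a hyphen and four more digits. The Python sketch below checks only the shape of the text; whether a given five-digit number really is a ZIP code depends on context that this example ignores. The names ZIP_RE and zip_kind are hypothetical.

```python
import re

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def zip_kind(text):
    """Return 'ZipCode', 'ZipCode+4', or None based on the text's shape."""
    match = ZIP_RE.fullmatch(text)
    if match is None:
        return None
    return "ZipCode+4" if match.group(1) else "ZipCode"


print(zip_kind("90210"), zip_kind("90210-1234"), zip_kind("1234"))
```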

ListMembers SortOrder: 11.00

Groups together comma-separated values (including and / or values) into a Members list.
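
As a small illustration, the Python sketch below collects the members of a comma-separated list and remembers whether the list was joined with "and" or "or". It works on plain strings and assumes the whole input is one list; finding where a list starts and ends inside a sentence is not attempted. The function name list_members is hypothetical.

```python
def list_members(tokens):
    """Return (members, conjunction) for a comma-separated token sequence."""
    members = []
    conjunction = None
    for tok in tokens:
        low = tok.lower()
        if low in ("and", "or"):
            conjunction = low
        elif tok != ",":
            members.append(tok)
    return members, conjunction


print(list_members(["red", ",", "green", ",", "or", "blue"]))
```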

Meaning Rules

MeaningRules provide the last step of understanding for the Meaning of a Token. Much of the structural meaning of the Tokens will have been established by this point. This is where we further narrow down the most appropriate Meaning from the possible meanings that a Token can have. For instance, is the WordToken "HI" a greeting, or is it being used as the abbreviation of a US state?

(0) Default Meaning

HI UsStateHawaii (1411) Hawaii US State

OH UsStateOhio (1435) Ohio US State

OK UsStateOklahoma (1436) Oklahoma US State
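
To show the kind of logic a MeaningRule might contain, here is a Python sketch that decides between the default meaning and UsStateHawaii for the token "HI". The heuristic is purely an assumption for illustration: an upper-case "HI" is read as a state when it follows a comma or precedes something shaped like a ZIP code. The function name and its parameters are hypothetical.

```python
import re

ZIP_SHAPE = re.compile(r"\d{5}(-\d{4})?")


def hi_meaning(prev_token, token, next_token):
    """Return 'UsStateHawaii' or 'Default' for a token and its neighbours."""
    if token != "HI":  # only the upper-case form is a state candidate
        return "Default"
    follows_comma = prev_token == ","
    before_zip = next_token is not None and ZIP_SHAPE.fullmatch(next_token) is not None
    return "UsStateHawaii" if follows_comma or before_zip else "Default"


print(hi_meaning(",", "HI", "96801"))
print(hi_meaning(None, "Hi", "there"))
```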

Answer Rules

Once the TokenGraph is complete and the most likely Meanings have been calculated, we can generate a Pattern of the structure of the Tokens. Based on the best match with the list of Patterns in our data store, an appropriate AnswerRule is returned. This rule can run any arbitrary code and will return an appropriate string response.

Default (0) : Default Answer Rule

SimpleEquation (2) : Solve a simple equation.

Currency (3) : Solve simple currency math.

Equation (4) : Cancel out terms to balance the equation if needed.

CommonType (5) : Find a common type for a member list

WhatIs (6) : Identify what type something is
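
The flow from Tokens to a Pattern to an AnswerRule can be sketched as a lookup table of functions. This Python example is a simplification under stated assumptions: it uses an exact pattern match instead of a best match, classifies tokens with a toy token_type function, and supports only three operators. All names here are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def token_type(tok):
    """Toy classifier standing in for the real Token types."""
    if tok.replace(".", "", 1).isdigit():
        return "Number"
    if tok in OPS:
        return "Math"
    return "Word"


def pattern_of(tokens):
    return " ".join(token_type(t) for t in tokens)


def simple_equation(tokens):
    left, op, right = tokens
    return str(OPS[op](float(left), float(right)))


def default_answer(tokens):
    return "I don't know."


# Pattern -> rule; anything without an entry falls back to the default rule.
ANSWER_RULES = {"Number Math Number": simple_equation}


def answer(tokens):
    rule = ANSWER_RULES.get(pattern_of(tokens), default_answer)
    return rule(tokens)


print(answer(["2", "+", "3"]))
print(answer(["hello"]))
```

Because each rule is an ordinary function that returns a string, adding a new kind of answer only requires a new function and a new pattern entry.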

Parsing Rules