Thursday 31 March 2016

Owning a data format

When designing a software system, it's important to be careful with the dependencies we take on, especially those we don't own.
But what if we introduce a dependency that is not so easily recognizable?



Let's assume that the data is more important than the piece of software producing it. It is so for many projects, and it is definitely so for the project I'm working on right now.

It's not uncommon for a business to dispose of a piece of software but keep the data which is still valuable for the business.

Even more often, the data produced by a legacy piece of software must be used by newly developed software. Often enough, the new project uses a totally different platform.

From this perspective, a dependency on a technology brings the following two questions, among the more obvious ones:
  • How long will it last?
  • What does it take to get rid of it?

Every software project deals with various data formats, and those introduce dependencies.
If we don't own the source code for the parser for a data type, we take on an external dependency.
For example, if we store the data in XML format, we depend on an XML writer/parser. That is fine for the most part: XML is an established standard, and in almost any programming environment it's possible to replace that piece of software with another one. This kind of dependency is usually acceptable because the format can be treated as a commonwealth, something shared that nobody has to own.

Sometimes we take a dependency of a different kind, and this dependency is not so easily recognizable.

Let's take a look at a simple example.
Say we need a trivial scripting capability in our software. A client wants to provide a formula that customizes the computation of an Initial Payment value in a consumer loan system, and the formula differs between banking products:

InitialPayment >= (TotalCost - ApprovedCreditLimit) × 0.3

The formula is trivial, but it could be much more complicated, with multiple additional parameters and conditional logic in a ternary operator.

The simplest approach

The simplest approach that comes to mind is to use some kind of 'eval' function, as available in dynamic scripting languages:

var initialPaymentAcceptable = (bool) SuperScriptingLanguage.Eval(userInput, parameters);

Great, we have just introduced quite a few issues:

Injection Attack Surface

  • We are lucky if the package runs the expression code in some kind of sandbox that protects our program's process from an injection attack.
  • We are lucky if the interpreter can be run in a 'just an expression' mode and will not run infinite loop constructs in the user's code.

Side note (a quote from the OWASP Top 10 Threats and Mitigations Exam):

2) Your application is created using a language that does not support a clear distinction between code and data. Which vulnerability is most likely to occur in your application?
  1. Injection (correct)
  2. ..

Performance

  • We are lucky if the interpreter is performant enough. It could run multiple times for every transaction in our system, and a few milliseconds spent spinning up a language environment for every expression evaluation could add up to a major server performance issue.

Project Lifecycle and Maintenance

  • We are lucky if there are suitable packages for all the client platforms this code has to run on.

Well, if we compute the overall level of luck we need, it's pretty high. With a 33% chance of being lucky on each of the four bullets above, all four work out only about 1.2% of the time (0.33⁴), so we would have to be luckier than about 98.8% of projects.

Are you generally feeling like a 98.8% lucky person? I'd say, as a software developer you shouldn't.

Just score your dynamic language execution environment of choice against the bullets above. Enough of the joke.

The funny thing is that Grammarly suggests changing the word 'eval' to 'evil' as I type, and I agree.

TL;DR;

Now that we are aware of the dreaded consequences of taking on such a dependency, let's move on to the more severe aspects:

Data Format Dependency

What data type would you use to store a simple expression in a database?
Do you think a string data type would be the most appropriate?

Think twice.

Let's return to the two questions we already discussed, in a bit more detail. I find that answering them is mandatory when choosing things to introduce in a project:
  • How long will it last?
  • What does it take to get rid of it?

More precisely, would it be possible to replace a thing with an alternative later, especially if we don't know the alternatives yet?

For this to happen, the thing must be interpretable, so that it can be converted into whatever replaces it, and it must be no more complicated than the alternative.

AST - an interpretable thing

What makes things interpretable?

Consider the following simple example. A name and a surname are stored in a database in separate fields. This makes the data interpretable. We can easily derive a full name from it:

string.Format("{0} {1}", name, surname);

But if we choose to store the full name, it's not easy to split it back into a name and a surname, because it could be as complicated as Mobutu Sese Seko Kuku Ngbendu Wa Za Banga.

The same applies to an expression in a string: it's a full thing, and it's not trivial to interpret (remember, we needed an entire dreaded package to use an eval function).

At this point, it's time to think about the role the expression plays in our project. It's a means of providing a Specification. Looking at samples of the Specification pattern, we can see that such specification code is trivial to keep in our own project, with none of the maintainability issues discussed above. We also know that a specification can be represented by its separate building blocks, which in turn can be stored in a serialized form. For an expression used as a specification, its predicates can be represented by an Abstract Syntax Tree (AST), and an AST is trivial to compute.
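To make this a bit more tangible, here is a minimal sketch of what such building blocks could look like for the Initial Payment formula. The type names (Node, Number, Parameter, Binary, Evaluator) are my own assumptions for illustration, not part of any library, and the sample values are made up:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node types: plain data, no parser or interpreter package involved.
public abstract class Node { }

public sealed class Number : Node
{
    public Number(decimal value) { Value = value; }
    public decimal Value { get; }
}

public sealed class Parameter : Node
{
    public Parameter(string name) { Name = name; }
    public string Name { get; }
}

public sealed class Binary : Node
{
    public Binary(string op, Node left, Node right) { Op = op; Left = left; Right = right; }
    public string Op { get; }
    public Node Left { get; }
    public Node Right { get; }
}

public static class Evaluator
{
    // Computes an arithmetic subtree; parameter values are looked up by name.
    public static decimal Compute(Node node, IDictionary<string, decimal> values)
    {
        var number = node as Number;
        if (number != null) return number.Value;

        var parameter = node as Parameter;
        if (parameter != null) return values[parameter.Name];

        var binary = node as Binary;
        if (binary == null) throw new NotSupportedException("Unknown node type");

        var left = Compute(binary.Left, values);
        var right = Compute(binary.Right, values);
        switch (binary.Op)
        {
            case "+": return left + right;
            case "-": return left - right;
            case "*": return left * right;
            case "/": return left / right;
            default: throw new NotSupportedException("Not an arithmetic operator: " + binary.Op);
        }
    }

    // Checks the comparison at the root of a specification.
    public static bool IsSatisfied(Binary root, IDictionary<string, decimal> values)
    {
        var left = Compute(root.Left, values);
        var right = Compute(root.Right, values);
        switch (root.Op)
        {
            case ">=": return left >= right;
            case "<=": return left <= right;
            case ">": return left > right;
            case "<": return left < right;
            default: throw new NotSupportedException("Not a comparison operator: " + root.Op);
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        // InitialPayment >= (TotalCost - ApprovedCreditLimit) * 0.3
        var specification = new Binary(">=",
            new Parameter("InitialPayment"),
            new Binary("*",
                new Binary("-", new Parameter("TotalCost"), new Parameter("ApprovedCreditLimit")),
                new Number(0.3m)));

        var values = new Dictionary<string, decimal>
        {
            ["InitialPayment"] = 3000m,
            ["TotalCost"] = 20000m,
            ["ApprovedCreditLimit"] = 12000m
        };

        // (20000 - 12000) * 0.3 = 2400, and 3000 >= 2400
        Console.WriteLine(Evaluator.IsSatisfied(specification, values)); // True
    }
}
```

The whole "language" here is a closed set of node types and operators, so anything outside that set simply cannot be expressed, let alone executed.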

Using the AST as a model

Parsers are known to be non-trivial, so we should use them sparingly. For example, a parser could be used only at design time, when a user enters an expression in a code editor window. This way we limit the scope of the dependencies associated with the expression language to just the editor environment.

Here is what an AST could look like when the expression is parsed:

CompilationUnit                  : InitialPayment>=(TotalCost-ApprovedCreditLimit)*0.3
 GlobalStatement                 : InitialPayment>=(TotalCost-ApprovedCreditLimit)*0.3
  ExpressionStatement            : InitialPayment>=(TotalCost-ApprovedCreditLimit)*0.3
   GreaterThanOrEqualExpression  : InitialPayment>=(TotalCost-ApprovedCreditLimit)*0.3
    IdentifierName               : InitialPayment
    MultiplyExpression           : (TotalCost-ApprovedCreditLimit)*0.3
     ParenthesizedExpression     : (TotalCost-ApprovedCreditLimit)
      SubtractExpression         : TotalCost-ApprovedCreditLimit
       IdentifierName            : TotalCost
       IdentifierName            : ApprovedCreditLimit
     NumericLiteralExpression    : 0.3

Below is the exact code used to produce the listing, using the Roslyn parsing capabilities.
The expression is parsed as C#, whose syntax allows for simple math and boolean constructs.

using System.Diagnostics;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

namespace ParsingWithRoslyn
{
    class Program
    {
        static void Main(string[] args)
        {
            var expression = "InitialPayment>=(TotalCost-ApprovedCreditLimit)*0.3";

            // SourceCodeKind.Script lets a bare expression parse as a top-level statement.
            var tree = CSharpSyntaxTree.ParseText(
                expression,
                new CSharpParseOptions(
                    LanguageVersion.CSharp6,
                    DocumentationMode.Parse,
                    SourceCodeKind.Script));

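            // Send Debug.WriteLine output to the console; note that
            // Debug.WriteLine calls are only compiled into DEBUG builds.
            Debug.Listeners.Add(new ConsoleTraceListener());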

            VisitExpression(tree.GetRoot(), 0);
        }

        // Prints each node kind, indented by depth, next to the source text it covers.
        static void VisitExpression(SyntaxNode node, int level)
        {
            Debug.WriteLine("{0} : {1}", (new string(' ', level) + node.Kind()).PadRight(32), node.ToFullString());
            foreach (var childNode in node.ChildNodes())
            {
                VisitExpression(childNode, level + 1);
            }
        }
    }
}

For this program to compile we need to install a dependency:

PM> Install-Package Microsoft.CodeAnalysis.CSharp -Version 1.1.1
Attempting to resolve dependency 'Microsoft.CodeAnalysis.Common (= 1.1.1)'.
Attempting to resolve dependency 'System.Collections.Immutable (≥ 1.1.37)'.
Attempting to resolve dependency 'System.Reflection.Metadata (≥ 1.1.0)'.
Attempting to resolve dependency 'Microsoft.CodeAnalysis.Analyzers (≥ 1.1.0)'.

This is not a post about Roslyn or about any particular technology. The code is here just to show an example of working with expression trees. It is parser code, not associated with any kind of execution environment, so it is not subject to the injection attacks mentioned earlier. I will try to elaborate on the techniques for parsing and processing abstract syntax trees in one of the future posts. For now, let's assume that we can visit the parsed tree and produce something meaningful for the model, including extracting the specification in the form of an AST. A separate AST model is required for that, one without a dependency on a parser.
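As a rough sketch of that extraction step, the parsed nodes could be mapped at design time to a small XML vocabulary that we own. The element names (GreaterThanOrEqual, Multiply, Parameter, Number and so on) and the SpecificationExporter class are assumptions made up for this example; only the Roslyn syntax types and System.Xml.Linq are real APIs:

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class SpecificationExporter
{
    // Maps a Roslyn expression to our own (hypothetical) XML vocabulary.
    // Anything outside the whitelist below is rejected.
    public static XElement ToXml(ExpressionSyntax node)
    {
        var parenthesized = node as ParenthesizedExpressionSyntax;
        if (parenthesized != null)
            return ToXml(parenthesized.Expression); // grouping is implied by nesting

        var identifier = node as IdentifierNameSyntax;
        if (identifier != null)
            return new XElement("Parameter",
                new XAttribute("name", identifier.Identifier.ValueText));

        var literal = node as LiteralExpressionSyntax;
        if (literal != null && literal.Kind() == SyntaxKind.NumericLiteralExpression)
            return new XElement("Number",
                new XAttribute("value",
                    Convert.ToString(literal.Token.Value, CultureInfo.InvariantCulture)));

        var binary = node as BinaryExpressionSyntax;
        if (binary != null)
            return new XElement(ElementNameFor(binary.Kind()),
                ToXml(binary.Left), ToXml(binary.Right));

        throw new NotSupportedException("Unsupported syntax: " + node.Kind());
    }

    // Explicit mapping, so the stored names do not follow Roslyn's enum names.
    static string ElementNameFor(SyntaxKind kind)
    {
        switch (kind)
        {
            case SyntaxKind.AddExpression: return "Add";
            case SyntaxKind.SubtractExpression: return "Subtract";
            case SyntaxKind.MultiplyExpression: return "Multiply";
            case SyntaxKind.DivideExpression: return "Divide";
            case SyntaxKind.GreaterThanOrEqualExpression: return "GreaterThanOrEqual";
            case SyntaxKind.LessThanOrEqualExpression: return "LessThanOrEqual";
            case SyntaxKind.GreaterThanExpression: return "GreaterThan";
            case SyntaxKind.LessThanExpression: return "LessThan";
            default: throw new NotSupportedException("Unsupported operator: " + kind);
        }
    }
}

public static class ExportDemo
{
    public static void Main()
    {
        var syntax = SyntaxFactory.ParseExpression(
            "InitialPayment>=(TotalCost-ApprovedCreditLimit)*0.3");

        // Prints a GreaterThanOrEqual element wrapping the nested operands.
        Console.WriteLine(SpecificationExporter.ToXml(syntax));
    }
}
```

Method calls, assignments, loops and everything else a user might type never make it into the stored form, because the exporter throws on any syntax it does not explicitly recognize.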

You might ask: what's the difference? We just introduced a few external dependencies with the above code.

The difference is that now we are not introducing a dependency on a data format. The scope of this parsing technology is limited to design time - a session of the user entering the expression in the UI.

From now on we are not dealing with the expression code anymore.

What exactly can be achieved with such an approach?

  • We are not dealing with any code as data in our runtime environment, so our project is not subject to injection attacks. We are not taking a dependency on a scripting execution environment at all.
  • The specification is stored in our own format; we are free to transform it to any representation later.
  • We can interpret the AST to create its representation in another language. We can even give the user a choice of scripting languages, or provide a different kind of editor UI that works with a visual tree representation or a flowchart.
  • One useful property of ASTs is that they are really fast to visit, whether to compute a value or to compile to another representation. The result can get close to the performance of compiled code, so you can come up with a performant solution for a server component under heavy load.
  • You make an investment in your own code, avoiding a long-term dependency on an alien data format such as a scripting language, and possibly on a few evil third-party packages.
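To illustrate the performance point, here is a hedged sketch of compiling a stored specification into a delegate with System.Linq.Expressions, so the tree is walked once and then runs as compiled code. The XML vocabulary (GreaterThanOrEqual, Multiply, Subtract, Parameter, Number) and the SpecificationCompiler class are assumptions for this example; any serialized form of your own AST would work the same way:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Linq.Expressions;
using System.Xml.Linq;

public static class SpecificationCompiler
{
    // Builds a delegate that takes a "parameter name -> value" lookup
    // and returns whether the specification is satisfied.
    public static Func<Func<string, decimal>, bool> Compile(XElement specification)
    {
        var lookup = Expression.Parameter(typeof(Func<string, decimal>), "lookup");
        var body = Build(specification, lookup);
        return Expression.Lambda<Func<Func<string, decimal>, bool>>(body, lookup).Compile();
    }

    static Expression Build(XElement node, ParameterExpression lookup)
    {
        switch (node.Name.LocalName)
        {
            case "Number":
                return Expression.Constant(
                    decimal.Parse((string)node.Attribute("value"), CultureInfo.InvariantCulture));
            case "Parameter":
                return Expression.Invoke(lookup,
                    Expression.Constant((string)node.Attribute("name")));
        }

        var operands = node.Elements().Select(e => Build(e, lookup)).ToArray();
        if (operands.Length != 2)
            throw new NotSupportedException("Expected two operands in " + node.Name.LocalName);

        switch (node.Name.LocalName)
        {
            case "Add": return Expression.Add(operands[0], operands[1]);
            case "Subtract": return Expression.Subtract(operands[0], operands[1]);
            case "Multiply": return Expression.Multiply(operands[0], operands[1]);
            case "Divide": return Expression.Divide(operands[0], operands[1]);
            case "GreaterThanOrEqual": return Expression.GreaterThanOrEqual(operands[0], operands[1]);
            case "LessThanOrEqual": return Expression.LessThanOrEqual(operands[0], operands[1]);
            case "GreaterThan": return Expression.GreaterThan(operands[0], operands[1]);
            case "LessThan": return Expression.LessThan(operands[0], operands[1]);
            default: throw new NotSupportedException("Unknown element: " + node.Name.LocalName);
        }
    }
}

public static class CompileDemo
{
    public static void Main()
    {
        // InitialPayment >= (TotalCost - ApprovedCreditLimit) * 0.3, as it might be stored.
        var stored = XElement.Parse(@"
            <GreaterThanOrEqual>
              <Parameter name='InitialPayment' />
              <Multiply>
                <Subtract>
                  <Parameter name='TotalCost' />
                  <Parameter name='ApprovedCreditLimit' />
                </Subtract>
                <Number value='0.3' />
              </Multiply>
            </GreaterThanOrEqual>");

        var isAcceptable = SpecificationCompiler.Compile(stored); // compile once, reuse per transaction

        var values = new Dictionary<string, decimal>
        {
            ["InitialPayment"] = 3000m,
            ["TotalCost"] = 20000m,
            ["ApprovedCreditLimit"] = 12000m
        };

        // (20000 - 12000) * 0.3 = 2400, and 3000 >= 2400
        Console.WriteLine(isAcceptable(name => values[name])); // True
    }
}
```

The compiled delegate can be cached per banking product, so the per-transaction cost is a plain method call rather than spinning up an interpreter.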
