Validation Baked Into RDBMS? How Would That Work?

— The original post was written on Aug 5, 2015.  —

When you develop applications for the web (or software in general), there is one source of information that is always problematic (or at least annoying) to deal with: user input.

The Problem:

The vast majority of web applications are little more than CRUD interfaces between the user and the data store (in this case an RDBMS such as MySQL). Once the user enters some data into the application via a login form or a search box, that data is validated, sanitized, and operated on by the application as well as the RDBMS before some result is output back to the user. For that reason, most of our applications end up looking something like this:

public function registerUserAction($username, $password){
    if(strlen($username) >= 4 && ctype_alnum($username) /* & other requirements... */){
        if(strlen($password) >= 8 /* & other requirements... */){
            #do the DB insertion if the username meets the UNIQUE key requirement, else return an error
        }
        else{
            #output error msg: password does not meet the requirements
        }
    }
    else {
        #output error msg: username is not 4+ characters or not alphanumeric
    }
}

And in its simplest form, a user table schema would look something like this:

CREATE TABLE IF NOT EXISTS `user` (
  `username` varchar(40) NOT NULL,
  `password` varchar(60) NOT NULL,
  PRIMARY KEY (`username`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The data entered by the user goes through the sanitization and validation portion of the application and is later inserted into the table above. While this process is straightforward and rather easy to do, what we as developers end up doing is repeating ourselves throughout every project. We add the criteria each field needs to match, write a lot of maintenance code around it, and create error messages based on what, and most importantly why, an insertion failed (e.g. username must be 4+ characters). And god, let's not forget having to write tests for all this stuff. This process for me is rather tedious and annoying. For that reason, I have been thinking a lot about simplifying it and would like to suggest an idea that might change all of that for the better.

The Solution:

Sometimes, the validation and sanitization requirements are expressed using regular expressions. Using RegEx allows all those requirements to be captured inside a single if statement. Example:

public function registerUserAction($username, $password){
    if(preg_match("/^[a-z0-9_-]{4,20}$/", $username)){
        ...
    }
}

The question that I have been asking myself about the snippet above for a while now is: why can’t we put these RegEx requirements right into the RDBMS as column criteria?

We commonly use Varchar, Int, Text and other data types to determine what goes into each column and how the internal engine should process such data in the future. Wouldn’t it be cool if we could declare the database schema as follows instead?

CREATE TABLE IF NOT EXISTS `user` (
  `username` "/^[a-z0-9_-]{4,20}$/" NOT NULL,
  `password` "/^.{4,60}$/" NOT NULL,
  PRIMARY KEY (`username`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Or maybe we could create a completely new engine that runs off a JSON schema:

{
    "user": {
        "username": {
            "rules": ["/^[a-z0-9_-]{4,20}$/"],
            "index": true,
            "unique": true
        },
        "password": {
            "rules": ["/^.{4,60}$/"]
        }
    }
}
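To make the idea concrete, here is a minimal validator sketch (in Python for brevity; every name in it is hypothetical, not part of any existing engine) that compiles the regex rules from such a JSON schema once and then checks incoming records against them. Note that the PHP-style `/.../` delimiters are dropped, since Python's re module takes bare patterns:

```python
import json
import re

# The schema above, with bare patterns (no /.../ delimiters).
SCHEMA_JSON = """
{
    "user": {
        "username": {"rules": ["^[a-z0-9_-]{4,20}$"], "index": true, "unique": true},
        "password": {"rules": ["^.{4,60}$"]}
    }
}
"""

def compile_schema(schema_json):
    """Compile every column's rules once so each insert is cheap to check."""
    schema = json.loads(schema_json)
    return {
        table: {
            column: [re.compile(rule) for rule in props["rules"]]
            for column, props in columns.items()
        }
        for table, columns in schema.items()
    }

def validate(compiled, table, record):
    """Return (column, pattern) pairs that failed; an empty list means valid."""
    failures = []
    for column, rules in compiled[table].items():
        value = record.get(column, "")
        for rule in rules:
            if not rule.fullmatch(value):
                failures.append((column, rule.pattern))
    return failures

compiled = compile_schema(SCHEMA_JSON)
print(validate(compiled, "user", {"username": "john_doe", "password": "s3cret"}))  # []
print(validate(compiled, "user", {"username": "jd", "password": "s3cret"}))
```

The second call fails on the username rule, which gives the engine everything it needs to produce the "why did this insertion fail" error message for free.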

This approach assumes that all data, regardless of type, is handled as a string. That makes sense in the context of the web, despite what others might say, because all HTTP traffic is nothing more than strings being passed around cables from one side of the network to the other. We choose to type-cast these strings into various forms of data at the application layer. Putting RegEx requirements on each column/field right into the schema of the DB table would save us a whole lot of time when writing applications. How would this help, you ask? Consider the following:

Using the schema above, we can now rewrite the same code from the beginning of the article a little differently:

public function registerUserAction(){
    try{
        $DBinstance->InsertIntoTable("user", $_POST);
    } catch(Exception $e){
        #outputs error to the user: USERNAME Does Not Match /^[a-z0-9_-]{4,20}$/
        echo $e->getMessage();
    }
}

Regardless of how many columns, regex requirements, or fields you have, this will always work. The reason is that we pass everything that came in from the client side (via $_POST) straight to the database engine, which looks up each field and validates it against the rules in the schema specified above. So, in essence, the database engine will be doing something similar to this internally:

$validFields = [];
foreach($datasetSchema as $field => $properties){
    if(!isset($_POST[$field]))
        throw new DataKeyMissing("The key $field is missing!");

    foreach($properties["rules"] as $rule){
        if(preg_match($rule, $_POST[$field]) !== 1)
            throw new DataValueNotAllowed($_POST[$field] . " does not match " . $rule);
    }
    $validFields[$field] = $_POST[$field];
}

#Once every field has passed, the engine can store the data on disk!

What this does is simply loop through the schema and check for the existence of each field in $_POST. So if the $_POST["username"] field does not exist, the engine throws a DataKeyMissing exception. Using this approach, we also loop through the regex rules and apply each one to its column.

The reason we don’t do this today is that passing data straight into the storage engine without sanitizing and validating it first at the application layer causes major security problems. However, if the storage engine can handle said validation internally, without having to rely on the application layer to do so before passing the data, we would end up with enhanced security and less work on the part of the application developers.
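To make the injection half of that risk concrete, here is a small illustration (Python with the built-in sqlite3 module, standing in for any RDBMS) of why raw input spliced into SQL is dangerous, and how parameter binding already lets the storage layer receive raw values safely:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (username TEXT PRIMARY KEY, password TEXT)")
conn.execute("INSERT INTO user VALUES ('admin', 'secret')")

malicious = "' OR '1'='1"

# Unsafe: splicing the input into the SQL lets the attacker rewrite the query.
unsafe = "SELECT username FROM user WHERE username = '%s'" % malicious
print(conn.execute(unsafe).fetchall())  # every row leaks: [('admin',)]

# Safe: parameter binding treats the input as a value, never as SQL.
safe = "SELECT username FROM user WHERE username = ?"
print(conn.execute(safe, (malicious,)).fetchall())  # []
```

A regex-validating engine would sit one layer deeper than parameter binding: the bound value would additionally have to match the column's rules before being stored.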

If all of the data passes these validation checks, it is then saved to disk. Also, prior to insertion, we consult the schema to see whether any field is marked unique. We also check for and build hash / linked-list indexes for each specified column so that we get faster data access times. Basically, everything is the same as in a regular RDBMS; the only difference is that we specify, using RegEx, what we can and cannot enter into each column.
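The uniqueness check mentioned above could be sketched like this (Python, with a hypothetical in-memory hash index; a real engine would back this with an on-disk structure):

```python
class UniqueViolation(Exception):
    pass

# Hypothetical hash index: unique column -> set of values already stored.
unique_index = {"username": set()}

def insert(record):
    """Reject the record if any unique column already holds its value."""
    for column, seen in unique_index.items():
        if record[column] in seen:
            raise UniqueViolation(f"{column} = {record[column]!r} already exists")
    for column, seen in unique_index.items():
        seen.add(record[column])
    # ...only now would the engine write the record to disk...
    return True

insert({"username": "john_doe"})
try:
    insert({"username": "john_doe"})  # second insert hits the index
except UniqueViolation as e:
    print(e)
```

Like the regex failures, a uniqueness failure surfaces as an exception with the offending column and value, so the application gets its error message without writing any checking code of its own.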

The Benefits:

  1. Validation is taken care of by the RDBMS layer rather than the application layer.
  2. Flexible schema using regex, which means we can force a column to only accept a certain kind of content that is not tied to a native data type.
  3. Schema operations rather than data operations: if we wanted, for instance, to know all the columns that may contain a digit, all we need to do is look at the schema rather than search through the entire dataset. In MySQL you have to scan every row to find out whether a Varchar column contains any digits; with this method, we can determine whether a digit can even occur just by looking at the enforced RegEx rules.
  4. Performance-wise, my guess is that this approach could perform similarly to a traditional RDBMS. Insertion might be more costly due to the regex rules that have to be checked; however, access and read times should not be any different.
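Benefit 3 deserves a caveat: deciding in general whether a regex can match a digit requires real regex analysis. Still, even a crude textual heuristic over the schema illustrates the point of answering the question without touching the data. A sketch (Python; the schema and helper are hypothetical, and the password rule is tightened to letters only here so the two columns differ):

```python
# Hypothetical schema: column -> its enforced regex rule.
SCHEMA = {
    "username": r"^[a-z0-9_-]{4,20}$",
    "password": r"^[a-z_-]{4,60}$",
}

def may_contain_digit(pattern):
    """Crude textual heuristic: does the pattern admit digits anywhere?
    Only spots explicit digit ranges, \\d, and the match-anything dot;
    a real engine would analyze the regex properly."""
    return any(token in pattern for token in ("0-9", r"\d", "."))

digit_columns = [c for c, p in SCHEMA.items() if may_contain_digit(p)]
print(digit_columns)  # ['username'] -- found without scanning any rows
```

The whole answer comes from the schema: no table scan, no per-row regex matching, just an inspection of the rules the engine is already enforcing.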
