— The original post was written on May 7, 2015. —
For the past few days, I have been reading up on the underlying concepts behind the Linked Data movement [Wikipedia]. And I gotta tell ya folks, the more I read about it, the more I question whether JSON-LD will ever see wide adoption as a standard.
In this article I will try to describe what I believe a good Linked Data standard should look like. One that could be useful for ME as a developer and business owner to implement. Please keep in mind that these are just a few thoughts I have had for a while, so don't take my criticism of JSON-LD as absolute truth.
Before I dive into my solution to this problem, allow me to first state the reasons why I believe the JSON-LD format is not attractive enough.
1. Verbose Integration:
The main goal/requirement for JSON-LD was ease of integration. The idea here was that we (the developers) would be able to transform our JSON data into JSON-LD without much effort. However, the standard actually makes it harder for us to migrate our existing APIs. Consider the following:
Our API Endpoint GET https://someservice.com/event/10105124 currently produces the following output:
{
  "eventID": "10105124",
  "summary": "Lady Gaga Concert",
  "location": "New Orleans Arena, New Orleans, Louisiana, USA",
  "start": "2015-04-09 12:00"
}
If I wanted to rework this API response into the proposed JSON-LD format, I would have to restructure my JSON to look something like the following:
{
  "@context": {
    "ical": "http://www.w3.org/2002/12/cal/ical#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "ical:dtstart": {
      "@type": "xsd:dateTime"
    }
  },
  "ical:summary": "Lady Gaga Concert",
  "ical:location": "New Orleans Arena, New Orleans, Louisiana, USA",
  "ical:dtstart": "2011-04-09T20:00Z"
}
// This sample was taken from: http://json-ld.org/playground/index.html
Now, I don't know about you, but this syntax gives me a headache. Not only did we abstract a lot of the information away into the @context object, we also changed the naming of the fields and the way they behave. This is not, by any stretch of the imagination, what I would call "easy integration". Even with something like Fractal in the mix, we would end up breaking the API for every client currently using it.
Our only option here is to issue a new version of the API with these changes in place. Even if we keep our original key names (the renaming isn't actually mandatory according to the JSON-LD spec), all of our API clients and API documentation still need to be updated to reflect the changes in the schema.
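To make the migration cost concrete, here is roughly what the glue code for this would look like. A minimal sketch in Python; the to_json_ld function and its key mapping are my own guesses, not anything prescribed by the JSON-LD spec:

import json

# Hypothetical transformer: rewrap our existing event payload into the
# JSON-LD shape shown above. Every key gets renamed, which is exactly
# what breaks existing clients.
def to_json_ld(event):
    return {
        "@context": {
            "ical": "http://www.w3.org/2002/12/cal/ical#",
            "xsd": "http://www.w3.org/2001/XMLSchema#",
            "ical:dtstart": {"@type": "xsd:dateTime"},
        },
        "ical:summary": event["summary"],
        "ical:location": event["location"],
        # The date would also need converting to the xsd:dateTime format
        # ("2015-04-09T12:00Z" instead of "2015-04-09 12:00").
        "ical:dtstart": event["start"],
    }

legacy = {
    "eventID": "10105124",
    "summary": "Lady Gaga Concert",
    "location": "New Orleans Arena, New Orleans, Louisiana, USA",
    "start": "2015-04-09 12:00",
}
print(json.dumps(to_json_ld(legacy), indent=2))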
2. Payload:
The verbose nature of JSON-LD makes it extremely unattractive for high-payload APIs. Just take a look at some of the examples here and notice how an API response containing a few key-value pairs ends up carrying more information about the object than the actual data it serves. That in itself is not bad or wrong in any way. My concern, however, is that as your service scales and your endpoints start delivering more complex and nested data (such as with financial APIs), this becomes a maintenance nightmare. And yes, I know that nested data is not the best approach, but sometimes it is far better than having to deal with multiple inbound requests. For that reason, when it comes to large-payload APIs, the standard's syntax becomes a pain point in its own right that has to be managed, adding extra maintenance work and restrictions that developers will have to put up with.
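To put a rough number on that overhead, here is a quick back-of-the-envelope comparison of the two payloads from the previous section (nothing scientific, just counting bytes):

import json

plain = {
    "eventID": "10105124",
    "summary": "Lady Gaga Concert",
    "location": "New Orleans Arena, New Orleans, Louisiana, USA",
    "start": "2015-04-09 12:00",
}
linked = {
    "@context": {
        "ical": "http://www.w3.org/2002/12/cal/ical#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "ical:dtstart": {"@type": "xsd:dateTime"},
    },
    "ical:summary": "Lady Gaga Concert",
    "ical:location": "New Orleans Arena, New Orleans, Louisiana, USA",
    "ical:dtstart": "2011-04-09T20:00Z",
}

# The @context block alone outweighs the three data fields it describes.
for name, doc in (("plain", plain), ("json-ld", linked)):
    print(name, len(json.dumps(doc).encode("utf-8")), "bytes")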
3. Benefits:
What exactly is the benefit of going through the hassle of implementing JSON-LD? The benefits of linked data, as far as I can see, only emerge once the entire network uses the same standard. At that point it becomes attractive for me to adopt it too, in order to take advantage of other providers' data sets through a common structure/format. But this is a catch-22: everyone needs to be connected for any of us to benefit, yet no one will connect until everyone else is connected. The only short-term winners in this scenario are the search engines, which get to categorize and map your data in relation to the rest of the network's (provided you allow them to crawl it).
I might be wrong about all of this, though. My main concern is that it simply isn't feasible for most of us and our employers to go through the hassle of implementing this. With that in mind, I would like to suggest that we tackle the problem a bit differently.
The Solution:
First off, I would start by creating something similar to Packagist.org, but rather than listing packages, we would list namespaces of pre-defined JSON components. We need to add a little twist to it to make it functional and attractive, though. To illustrate, let's call our JSON component manager JSONPAC.org.
JSONPAC would start out with a bunch of predefined, simple components that look like the following:
std/user -->
    id : STRING | REQ
    email : STRING | REQ
    password : STRING | OPT
    reg_date : DATETIME | OPT
    ...

std/event -->
    id : STRING | REQ
    title : STRING | REQ
    location : STRING | REQ
    start : DATETIME | REQ
    sponsor : []std/user | OPT
    ...
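Purely for illustration, here is one way those manifests could be represented in machine-readable form. This format, and everything built on it in the sketches further down, is my own assumption, not a spec:

# Hypothetical machine-readable form of the manifests above. "[]std/user"
# marks an array whose elements must validate against the std/user component.
MANIFESTS = {
    "std/user": {
        "id":       {"types": ["STRING"],   "required": True},
        "email":    {"types": ["STRING"],   "required": True},
        "password": {"types": ["STRING"],   "required": False},
        "reg_date": {"types": ["DATETIME"], "required": False},
    },
    "std/event": {
        "id":       {"types": ["STRING"],     "required": True},
        "title":    {"types": ["STRING"],     "required": True},
        "location": {"types": ["STRING"],     "required": True},
        "start":    {"types": ["DATETIME"],   "required": True},
        "sponsor":  {"types": ["[]std/user"], "required": False},
    },
}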
The std namespace would contain a ton of these components, reflecting the basic set of objects any API might serve. To understand how this differs from the JSON-LD approach, let's look at the following example:
Bob owns an API with the endpoint: GET https://bobservice.com/event/123123. Bob’s response, after implementing this standard, could then look like this:
{
  "@namespace": "std/event",
  "id": "123123",
  "title": "Lady Gaga Concert",
  "location": "New Orleans Arena, New Orleans, Louisiana, USA",
  "start": "2015-04-09 12:00"
}
Meanwhile, Alice owns a different API with the following endpoint: GET https://aliceservice.com/event/111222. Alice’s response could then look like this:
{
  "@namespace": "std/event",
  "id": "111222",
  "title": "John Mayer in RIO",
  "location": "Somewhere in Canada",
  "start": "2015-04-19 12:00",
  "sponsor": [
    {
      "id": 13584,
      "email": "somedude@asd.com",
      "reg_date": "2011-01-01"
    },
    {
      "id": 471548,
      "email": "anotheruser@asd.com"
    }
  ]
}
Notice how Bob's API has no sponsor field in its response, because the sponsor field in JSONPAC's std/event is OPTional. Meanwhile, Alice's API uses an array of sponsors, each of which has the @namespace std/user implicitly defined in the JSONPAC schema. If Alice tries to embed anything into the sponsor field other than what the JSONPAC std/event --> sponsor manifest allows, she should get an error when checking the API response against the pre-defined schema. This way, we enforce that the integrity of the data goes hand in hand with the standardized schema set for us by JSONPAC.
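A field-level check of that rule could look something like the following. This is a rough sketch built on the hypothetical MANIFESTS format from above; a real implementation would need stricter datetime handling:

from datetime import datetime

def check_type(value, type_name, manifests):
    """Return True if value satisfies a single JSONPAC type expression."""
    if type_name == "STRING":
        return isinstance(value, str)
    if type_name == "DATETIME":
        for fmt in ("%Y-%m-%d %H:%M", "%Y-%m-%d"):
            try:
                datetime.strptime(value, fmt)
                return True
            except (TypeError, ValueError):
                pass
        return False
    if type_name.startswith("[]"):  # array of another component
        inner = manifests[type_name[2:]]
        return isinstance(value, list) and all(
            not validate(item, inner, manifests) for item in value
        )
    return False

def validate(obj, manifest, manifests):
    """Collect the errors for one object checked against one manifest."""
    errors = []
    for field, rule in manifest.items():
        if field not in obj:
            if rule["required"]:
                errors.append("missing required field: " + field)
            continue
        if not any(check_type(obj[field], t, manifests) for t in rule["types"]):
            errors.append("field '%s' matches none of %s" % (field, rule["types"]))
    return errors

# Using the hypothetical MANIFESTS from earlier; the embedded sponsor is
# rejected because it lacks the required email field.
bad = {
    "id": "123123",
    "title": "Lady Gaga Concert",
    "location": "New Orleans Arena, New Orleans, Louisiana, USA",
    "start": "2015-04-09 12:00",
    "sponsor": [{"id": "13584"}],
}
print(validate(bad, MANIFESTS["std/event"], MANIFESTS))
# -> ["field 'sponsor' matches none of ['[]std/user']"]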
Your question now is: but waaaaaaaaaait a minute, what if Alice wants to return her own type of sponsor that isn't the same as the []std/user schema? At that point, Alice should employ the OOP principle of Extend and Override: simply create your own namespace at JSONPAC, extend the std/event object, and override whatever you want. The JSONPAC library would then look like the following:
std/user -->
    id : STRING | INT | REQ
    email : STRING | REQ
    password : STRING | OPT
    reg_date : DATETIME | OPT
    ...

std/event -->
    id : STRING | INT | REQ
    title : STRING | REQ
    location : STRING | LONG-LAT | REQ
    start : DATETIME | REQ
    sponsor : STRING | []std/user | OPT
    ...

alice/event --> std/event
    sponsor : []alice/sponsor

alice/sponsor -->
    id : STRING | REQ
    name : STRING | REQ
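Resolving an extended namespace would then be little more than a recursive merge. Something like this, again assuming the manifest format I sketched earlier:

def resolve(name, manifests, parents):
    """Resolve a component manifest, applying child overrides on top of
    everything inherited from its parent."""
    manifest = dict(manifests[name])
    parent = parents.get(name)  # e.g. parents["alice/event"] == "std/event"
    if parent is not None:
        merged = resolve(parent, manifests, parents)
        merged.update(manifest)  # the child's fields win
        return merged
    return manifest

# Alice's additions to the hypothetical MANIFESTS registry. alice/event only
# declares the overridden sponsor field; id, title, location and start are
# all inherited from std/event via resolve().
parents = {"alice/event": "std/event"}
MANIFESTS["alice/sponsor"] = {
    "id":   {"types": ["STRING"], "required": True},
    "name": {"types": ["STRING"], "required": True},
}
MANIFESTS["alice/event"] = {
    "sponsor": {"types": ["[]alice/sponsor"], "required": False},
}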
After creating her own namespace at JSONPAC, Alice can now return the following JSON result:
{
  "@namespace": "alice/event",
  "id": "111222",
  "title": "John Mayer in RIO",
  "location": "Somewhere in Canada",
  "start": "2015-04-19 12:00",
  "sponsor": [
    {
      "id": "userID1",
      "name": "Dan Gilbert"
    },
    {
      "id": "userID2",
      "name": "Dan Gilbert The Second"
    }
  ]
}
As you can see, we extend and override namespaces, so JSONPAC can handle all linked-data relationships in one place. However, there is ONE problem with this approach: what if the developer declares the wrong @namespace, or forgets to include it altogether? What happens then? Well, an easy way to solve this would be to write a tiny little test executable. Once you are done building your API response, you run the tool from your terminal:
JSONPAC-TOOL validate https://aliceservice.com/event/111222
The tool fetches the JSON result from the URL and validates it against the declared namespace from JSONPAC. If all goes well, the tool displays a success message; otherwise it points the developer to the exact row/place where the error occurs. The tool should even recurse through every field, including the sponsor[] array, and check that each object matches alice/sponsor (the declared namespace). This adds another layer of testing, namely output testing. If the validation test passes, you have guaranteed your data structure's integrity and given all your API users a way to look up your objects on JSONPAC. All a developer would need to write in their documentation is: Endpoint X returns alice/event. Notice how this entire testing process needed only 1 line: @namespace : "alice/event".
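The heart of such a tool could be tiny. Assuming (hypothetically) that the manifests can be pulled down from JSONPAC, and reusing the validate and resolve sketches from earlier, the validate command might boil down to:

import json
import sys
import urllib.request

def validate_endpoint(url, manifests, parents):
    """Fetch a JSON response and validate it against its declared @namespace."""
    with urllib.request.urlopen(url) as resp:
        doc = json.load(resp)

    namespace = doc.get("@namespace")
    if namespace is None:
        sys.exit("FAIL: response declares no @namespace")
    if namespace not in manifests:
        sys.exit("FAIL: unknown namespace %r" % namespace)

    # validate() recurses through []... fields, so each embedded sponsor
    # object is checked against alice/sponsor automatically.
    errors = validate(doc, resolve(namespace, manifests, parents), manifests)
    if errors:
        sys.exit("FAIL:\n  " + "\n  ".join(errors))
    print("OK: response matches", namespace)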
The front-end developer implementing your API can also download the JSONPAC-TOOL and verify your data's integrity, so any developer would know for sure whether an API is reliable or not. In addition, the JSONPAC-TOOL could implement some cool and much-needed functionality such as:
JSONPAC-TOOL extract https://aliceservice.com/event/111222 alice/sponsor
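Such an extract subcommand could be a straightforward recursive walk. Again, just a sketch on top of the earlier helpers:

def extract(node, target, manifests, parents):
    """Recursively collect every object in a response that validates
    against the target component (e.g. "alice/sponsor")."""
    manifest = resolve(target, manifests, parents)
    found = []
    if isinstance(node, dict):
        if not validate(node, manifest, manifests):
            found.append(node)
        for value in node.values():
            found.extend(extract(value, target, manifests, parents))
    elif isinstance(node, list):
        for item in node:
            found.extend(extract(item, target, manifests, parents))
    return found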
As the sketch above suggests, this command could loop through the response from the API endpoint and extract all alice/sponsor objects, without you having to write a custom script every time you want to pull some data out of an API. I could go on and on about the features that could be implemented, but the main point is this:
This approach gives us easier integration, centralized management, data-structure integrity testing, output reliability, built-in data extraction, extensibility, and much more. On top of all that, the data structures are linked in one central place, with very little extra syntax.