
The boundary between "Data" and "Code" is blurring


In the corporate world, the usage of Excel files is still enormous. Many business tools still run in Excel, often extended with VBA macros and applications.

As an Excel fan myself, I can relate to that: nothing else lets you build a powerful and flexible tool, adapted to your specific business needs, in such a short time.

However, as a software engineer, I also see the issues with these decentrally managed business tools built in Excel. Typically there are four main issues with these tools:

  • The lack of central storage

  • The lack of collaboration possibilities offered by Excel (we have all tried to work on a shared workbook, but I don’t think anyone was really satisfied with the result)

  • The lack of versioning

  • The (non-)functional limitations of a tool like Excel. Numerous examples exist of an Excel tool growing continuously in complexity and size until, one day, it becomes impossible to maintain, it becomes unacceptably slow, and/or it hits functional limitations (e.g. certain logic that cannot be implemented) or technical ones (e.g. the row or column size limit).

Microsoft has already made considerable efforts to resolve these limitations (e.g. with Office 365), and other firms are trying to offer alternatives that keep the flexibility of a spreadsheet (e.g. Airtable or Trello), but some limitations are inherently linked to the way a spreadsheet works. This is caused by the fact that in a spreadsheet, data and software are not isolated from each other, which means the software part of a spreadsheet cannot be versioned, maintained, optimized and/or upgraded separately from the data.

In general, data and code scale very differently. When code scales, it can be modularized (via classes, methods, functions…​) at different abstraction levels, which keeps the scaling of code under control.
Data, however, needs to stay in one place, or at least a single access point to all data has to be provided (e.g. a database might store data in multiple files, even on different machines in a distributed architecture, but for the user this should be abstracted away).
This difference in scaling behavior means that data and code should be separated as much as possible.

Unfortunately, in many cases the boundary between data and code is not that clear (a whole grey zone exists), which is an issue not only for spreadsheets but for applications in general.

A good metric to determine an object’s position on this "data versus code" scale is its "production update frequency": the higher the frequency of production updates, the closer the object sits to "data".
This typically leads to three large categories, i.e. code, soft-code and data, although these categories still blur into each other.

  • Code: the actual programming language code (although domain-specific languages should also be considered), created by a software engineer and known at the moment of writing the code (known at design time). This is typically delivered to production according to a release schedule and through a controlled deployment pipeline.

  • Data: any data created by the application itself, so only known at the moment the application is running (i.e. not known at design time, so impossible to hard-code). It can stem from a user interaction (i.e. operational data) or be derived automatically by the application itself (e.g. data created by calculations on the operational data, or logging and monitoring data).

  • Soft-code: any mix between code and data, which can be adapted by a software engineer or a business product owner. The adaptation can be done directly in production, via an administration account, or can be delivered via the deployment pipeline. As the risk of regressions is lower, the deployment restrictions are often less severe than for code. The value of such a customization can be known at the moment of coding, only when releasing (e.g. environment-specific settings) or even only when the application is already running.

These boundaries are blurring more and more, making it increasingly difficult to define these terms precisely, due to evolutions like:

  • Self-evolving code and machine learning, where the behavior of the application itself adapts automatically based on the provided (training or production) data (cfr. blog "https://bankloch.blogspot.com/2020/09/ai-in-financial-services-buzzword-that.html")

  • DevOps, resulting in continuous delivery, meaning that code can also get a very high "production update frequency" (cfr. blog "https://bankloch.blogspot.com/2020/10/devops-for-dummies.html")

  • RPA, Low- and No-code platforms, which allow business people to develop complex tooling (cfr. blogs "https://bankloch.blogspot.com/2020/07/low-and-no-code-platforms-will-it.html" and "https://bankloch.blogspot.com/2020/05/rpa-miracle-solution-for-incumbent.html")

  • Business package software (like SAP, Salesforce…​), which allows extensive customization in the business application itself via parameterization, drag & drop options, scripting languages…​

  • …​

Nonetheless, the positioning of each object on the "code vs. data" axis should be carefully investigated, as it has some important consequences, like:

  • How a change is released to production: for "data" this is done by making a direct update in the production environment, while for "code" this is normally done via a deployment pipeline

  • Storage of the latest version: for "data" the latest version will always be in the (master) production database, while for "code" it is typically a development branch in a source code repository.

  • Testing: while code will typically transition through an (automated) testing cycle, "data" normally will not.

  • Collaboration: while code collaboration is usually done in a decentralized way, i.e. by checking out the code, making individual adaptations and then checking in and merging to a central repository (and potentially resolving merge conflicts), "data" collaboration is usually handled more centrally, via a locking mechanism that locks records while a user is updating them.

  • Search capabilities: as code is usually stored in text files, typically coding tools will work with text searches. Data on the other hand can be stored in different structures (like a structured data model, a JSON object…​), allowing more structured querying (like joins, grouping, complex filtering…​).

For the "Soft-code" category, usually a choice between the "data" and "code" extremes should be made for each of those above characteristics. These decisions will determine the overall positioning on the "data vs. code" scale.

Typical examples of elements in this "Soft-code" category are:

  • Parameterization (or configuration): this consists of a set of parameters or configuration values determining the specifics of an application. It can take different forms, from simple variables determining the behavior of the application all the way to the configuration of screens, reports, lists, table structures and meta-dictionary updates (like user-defined fields or tables). The result of a parameterization is stored either as data in a database, as structured files (e.g. in JSON, XML or YAML format) or as code written against an SDK.
    Usually this kind of soft-code is partially handled as code (i.e. the default configuration delivered by the product team) and partially handled as data (i.e. the specific modifications made by business owners to the behavior of the application in production).

  • Feature flags: with DevOps and Agile pushing more and more towards gradual roll-outs and data-driven product management (i.e. trying out new features on a small subset of users in production and monitoring the result), feature flags have become a vital part of a modern application. These flags allow enabling or disabling specific features of an application and often support segmenting users, so that a feature can be enabled only for some users (see the sketch after this list).

  • Referential, static data: these are usually tables storing static data. Typical examples are lists of countries, currencies, postal codes, MCC sector codes, languages and more specific code tables (enumerations). As there is often no application screen to manage this data and no end user owning it, the data will usually be updated by the product team via the standard release procedure. As a result, even though this is clearly data, it takes on a lot of the characteristics of "code".

  • Translations: almost any multilingual application works with hardcoded technical keys for each label and text appearing on the application screens. These technical keys are then replaced by the corresponding value from a translation file in the language of the user. As updating translations often requires complex manipulations and ideally some testing (e.g. an overly long translation might cause layout issues on a screen), the deployment of translations is usually done in the same way as "code". The actual translations are however typically owned by a business owner, like the marketing department, which gives them the characteristics of "data".

  • Static content: this consists of rich text (with layout), pictures and other media files, which appear on the application screens or are generated as output of the application (like a report, email, SMS, notification…​). This type of content is usually managed in a content management system, but often has a strong link with the code, as the content should be retrieved and included in the right place in the application. Furthermore, this content often includes variables, which are dynamic and therefore need to be replaced by their actual values at run-time (see the sketch after this list).
    Another typical issue is that the CMS tool produces static webpages, while business often wants dynamic behavior on these pages. As a result, software engineers need to plug specific code into the static pages to obtain this dynamic behavior.

  • AI models: these are both "code" and "data" as well. The models are "code", as they contain business logic, are usually created by software engineers and are normally delivered to the production environment in a controlled way. However, they are also "data", as these models are essentially coefficients in large matrices: they cannot be debugged like regular code and they can evolve directly in production based on new data (if desired).
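
To make the feature-flag mechanism concrete, below is a minimal sketch in Python. It is an illustration under assumptions, not the API of any specific feature-flag product: the FeatureFlag class, the percentage-based roll-out and the allow-list are all hypothetical.

    import hashlib
    from dataclasses import dataclass, field

    @dataclass
    class FeatureFlag:
        """A hypothetical feature flag with simple user segmentation."""
        name: str
        enabled: bool = False                            # global kill switch
        rollout_percent: int = 0                         # gradual roll-out (0-100)
        allowed_users: set = field(default_factory=set)  # explicit allow-list

        def is_enabled(self, user_id: str) -> bool:
            if not self.enabled:
                return False
            if user_id in self.allowed_users:
                return True
            # Deterministic bucketing: the same user always falls in the same
            # bucket, so a 10% roll-out stays stable across requests.
            bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
            return bucket < self.rollout_percent

    # Enable a new feature for 10% of users plus a named beta tester.
    flag = FeatureFlag("new_report", enabled=True, rollout_percent=10,
                       allowed_users={"beta_tester_1"})
    print(flag.is_enabled("user_42"))

Stored as a database row or a JSON file, such a flag can be flipped in production without a release, which is exactly what makes it soft-code.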
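
Similarly, the run-time variable replacement in static content can be sketched with Python’s standard string.Template; the placeholder names below are invented for the example.

    from string import Template

    # Content as managed in a CMS: layout plus placeholders, owned by business.
    email_body = Template(
        "Dear $first_name, your card ending in $last_digits "
        "was blocked on $block_date."
    )

    # The application fills in the dynamic values at run-time.
    print(email_body.substitute(first_name="Anna",
                                last_digits="1234",
                                block_date="2021-03-01"))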

The above examples are becoming more and more prominent in modern software applications. As such, the "Soft-code" category seems to be growing considerably, due to:

  • A lack of IT resources combined with an ever-growing demand for digitalization: as a result, business people want to be able to make certain changes themselves, allowing them to bypass the capacity issues within the IT department.

  • The ever-increasing need for flexibility: in the fast-changing world we live in, waiting for the next release is often simply not an option. As a result, more and more possibilities to modify the application in a controlled way at run-time are incorporated into modern applications.

  • A growing need to be as efficient as possible: the communication overhead between a business owner and a developer can be enormous (in many organizations there are three or more intermediaries facilitating this communication). This overhead leads to longer lead times, more work for everyone and errors due to miscommunication. As a result, business owners want to be able to do more themselves, and "Soft-code" can provide a solution to this.

Although moving more "Code" to the "Soft-code" category seems a good idea at first, it raises several concerns:

  • Definition of the master copy: while for "data" and "code" there is a single master copy (for "code" the source code versioning system, for "data" the production database), this is not the case for "customizations". On the one hand the production platform contains the current situation, while the source code management system and/or one of the test environments can contain the future situation, not to mention any environment-specific "customizations".

  • Collaboration: as there is no central, unique storage, collaboration also becomes difficult, as it is not possible to lock or check out a specific customization from a central, unique store.

  • Versioning: versioning also becomes very difficult due to this lack of a single master copy. As a result, it becomes very hard to analyse and reproduce bugs, audit changes (who made which change) and identify what has changed over time.
    Due to this lack of versioning, it regularly happens that IT releases a configuration change (a soft-code change) that overwrites something a business owner has done.

  • Non-functional limitations: "soft-code" is usually the result of an abstraction layer (like a CMS, translation labels…​), which of course also limits flexibility. Any requirement that does not match the prescriptions of the abstraction can result in a serious refactoring of both the "soft-code" and the code implementing the abstraction.
    Additionally, soft-code is usually not designed to support large volumes. When the volume grows, it not only becomes very hard to maintain, but performance issues also appear. "Code" will execute several factors faster than soft-code, as soft-code first needs to be retrieved, converted into application code and only then into machine instructions.

  • Operational risks due to misuse: in many organizations soft-code is also misused. As releasing software in many large organizations is an extremely hard process, which can only be done a few times a year, developers start building in all kinds of configuration and customization possibilities that allow modifying the behavior of the application without a release. This is however a very risky situation, as certain configuration changes can have such a severe impact on the application’s behavior that extensive testing is an absolute necessity. Usually, however, this step is skipped for this type of change.

  • Lack of documentation: due to the flexible nature of soft-code and the specific way it is stored, documenting soft-code is more complex than documenting code (for which all kinds of tools, notations and frameworks exist). As a result, soft-code is usually not documented at all, meaning it often becomes a source of errors and concern after a few years.

  • Flexibility has a price: as indicated above, supporting soft-code requires building an abstraction layer in your code base. Building such an abstraction layer, however, increases the complexity and cost of an application. It should therefore always be discussed whether the flexibility is really required and whether hard-coding is not an option as well. Usually, hard-coding will be several factors cheaper than supporting soft-code.

Clearly there is a danger in this growth of soft-code. As such, it would be better to push certain soft-code back towards "code" or "data".
For this to work, we need to find solutions for the causes of the rise of soft-code, i.e. the lack of IT capacity, the need for more flexibility and the difficulties in communication, but also work on making "data" more controllable.

A few ideas can be:

  • Treat soft-code more as "code" by hardcoding in the code or by enforcing the same governance to soft-code as to "code" (i.e. source control, versioning, testing, releasing…​). This means it would no longer be possible to change the soft-code elements directly in production via an administrative access, but instead everything should be included in the source control system and delivered via the standard deployment pipeline. In order for this to work, we need however

    • Continuous deployment: with DevOps allowing daily or even continuous deployments, "code" can get the same flexibility as soft-code, already resolving one of the reasons to introduce soft-code.

    • Easy ways for business people to commit certain changes into the source control system themselves (and thus trigger the CI/CD pipeline): of course this requires additional tooling, as business people are not acquainted with source control systems and the risk of corrupting the code is high. Some kind of user interface allowing source files to be updated in a controlled, restricted and convenient way is therefore a necessity.

    • Stronger integration of teams, i.e. the creation of cross-functional teams responsible for a business functionality. This allows business people to interact much more directly with IT; e.g. business could also use JIRA to create business tasks and as such easily assign a ticket to IT for an update of the soft-code.

  • Infrastructure as Code (IaC): some soft-code is the result of the need to run the same application in different environments. In that case, a number of parameters need to be adapted to the environment, such as disabling certain external communication on test environments or adapting URLs, tokens and ports of integrated applications and external partners.
    Obviously this should be handled via soft-code, as nobody wants to maintain multiple versions of the source code depending on the environment. It is important however to manage these environment-specific variables fully separately, as part of your IaC code base. Ideally they are handled via environment variables and secrets; a tool like Kubernetes has a complete setup to manage these types of environment-specific configurations correctly (see the environment-variable sketch after this list).

  • Soft-code duplication into a "code" and a "data" part: instead of handling soft-code purely as "code", you can also split soft-code into a "code" part and a "data" part. The "code" part contains the "default" settings and is fully managed by the IT product team. This ensures that the application can always function.
    Additionally there is a file or table, which can be adapted by a business owner, to override these defaults. This file/table only stores the overrides and is fully handled as "data" (see the defaults-and-overrides sketch after this list).
    This allows a good separation of concerns, but good regression and functional testing remains an issue, as the IT product team might not have the same setup of the application as exists in production.

  • Specialized tooling, which allows a business user to manage specific soft-code in an easy and controlled way, while providing easy methods for IT to deliver the changes to different environments in a controlled way. Examples are feature-flag software (such as Split, Optimizely, A/B Tasty, Adobe Target…​), translation management software (e.g. Transifex, Phrase, Lokalise, Loco, Crowdin…​) and CMS software (WordPress, Dreamweaver, Drupal, Joomla!, Contentful, Contentstack…​).

  • Treat soft-code more as "data": soft-code can also be treated more as data. Some of the concerns can then be resolved by:

    • Regular creation of snapshots (after every change or at a fixed frequency), by extracting all soft-code data and automatically storing the snapshot in a source control repository. The source control repository will automatically identify the differences compared to the previous snapshot, which the user can then document. As such, some form of versioning can be obtained (see the snapshot sketch after this list).

    • Easy replication of a (scrambled) production database to the development and test environments

  • Creation of IDE tooling specifically for data: such tooling allows creating data in a user-friendly way (e.g. via syntax highlighting, auto-completion, bulk updates…​), while providing all the tooling available for managing code, like versioning (through a source control system), deployment pipelines, debugging (e.g. for AI, immediately showing on which data your model performs worst, which can be a sign of bad labelling or of the model having insufficient data points) and syntax validation (e.g. easy syntax rules for structured data representing soft-code).

    • Documentation: develop standard mechanisms to easily document data, with the possibility to extract the full documentation or be automatically alerted when specific documentation should be updated (because the linked data element has been modified)

    • Automatic auditing: automatic storage of a full audit trail (who did which update, at which time and via which channel). A tool like Basedash, which allows support teams to interface with a database to make specialist updates while automatically auditing all those updates, can provide a solution here.

  • …​
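
As a sketch of the environment-specific configuration mentioned above: the code base stays identical in every environment and only the injected environment variables differ per deployment (in Kubernetes they would typically come from a ConfigMap or a Secret). The variable names below are assumptions for the example.

    import os

    # Identical code in every environment; only the injected variables differ.
    API_BASE_URL = os.environ.get("PARTNER_API_URL", "http://localhost:8080")
    API_TOKEN = os.environ.get("PARTNER_API_TOKEN", "")   # secret, never hard-coded
    MAIL_ENABLED = os.environ.get("OUTBOUND_MAIL_ENABLED", "false") == "true"

    def send_confirmation_mail(recipient: str) -> None:
        if not MAIL_ENABLED:
            # On test environments, external communication stays disabled.
            print(f"[dry-run] would mail {recipient}")
            return
        ...  # the real mail integration would live here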
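
The duplication of soft-code into a "code" part (defaults) and a "data" part (overrides) can be sketched as follows; the setting names and the override source are hypothetical.

    # Defaults are part of the code base, owned and released by the IT product team.
    DEFAULT_SETTINGS = {
        "max_login_attempts": 5,
        "session_timeout_minutes": 30,
        "welcome_message": "Welcome!",
    }

    def load_overrides() -> dict:
        """The 'data' part: a table or file edited by a business owner in
        production. Faked here with a literal; a real implementation would
        read a database table or a JSON file."""
        return {"welcome_message": "Welcome to our summer campaign!"}

    def effective_settings() -> dict:
        # Defaults guarantee the application always works; overrides win.
        return {**DEFAULT_SETTINGS, **load_overrides()}

    print(effective_settings()["welcome_message"])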
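
Finally, a sketch of the snapshot idea: extract all soft-code from production to a file and commit it to a plain git repository, so that diffs between commits give a basic form of versioning and an audit trail. The repository path and the shape of the extracted data are assumptions.

    import json
    import subprocess
    from pathlib import Path

    SNAPSHOT_REPO = Path("/srv/softcode-snapshots")  # an ordinary git repository

    def snapshot_softcode(settings: dict) -> None:
        """Dump the current soft-code (extracted from the production database
        in a real setup) and commit it, so every change becomes a diff."""
        (SNAPSHOT_REPO / "settings.json").write_text(
            json.dumps(settings, indent=2, sort_keys=True))
        subprocess.run(["git", "add", "-A"], cwd=SNAPSHOT_REPO, check=True)
        # 'git commit' exits non-zero when nothing changed; that is fine here.
        subprocess.run(["git", "commit", "-m", "soft-code snapshot"],
                       cwd=SNAPSHOT_REPO, check=False)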

Clearly there is a need for high flexibility in software, with business owners able to rapidly make changes to existing software. We can either make software engineering departments more Agile and faster, or we can hand over a number of IT tasks to business people in a controlled way. As the need is so high, both trends are happening in parallel. On the IT side, there is the rise of DevOps, Agile, microservices-based architecture and cloud-native architecture, all allowing IT departments to become more flexible and faster. On the business side, all kinds of tools help business people implement business logic and other changes in a controlled way (RPA tools, low- and no-code platforms, CMS tools, packaged software…​). These tools empower business people to be more independent of IT. Both trends work towards the same goal, but it will be interesting to see which trend takes the upper hand, and in which context.
