on duplicated code

This is a loaded topic. Duplicated code arises from multiple causes and takes different forms, which affect a code base in subtle but damaging ways. Not all duplication is bad per se; it depends on how it affects the development process.

Some of this duplication is introduced by authors trying to cut corners, or by people replicating the same work in different parts of the system.

Copy-paste is rampant, and it happens on any project and technology for multiple reasons: HTML tags that need to be repeated, SQL queries that share a structure but use different values, CSS definitions spread over multiple files. The problem is knowing that it is happening and that it is negatively affecting the team. Files can be statically analyzed and their abstract syntax trees compared with many tools, and some of those can be plugged in as checks in CI/CD pipelines, but this will only give you a number that must still be evaluated to know whether it is in a harmful range. For a given language and framework 10% duplication might be OK, while for others it will already be causing issues.
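To make "a number that must be evaluated" concrete, here is a minimal sketch of how such a percentage could be computed. It is not how any particular tool works: the function name `duplication_ratio`, the three-line window and the sample SQL files are all assumptions made for illustration, and real analyzers compare tokens or syntax trees rather than raw lines.

```python
from collections import defaultdict


def duplication_ratio(files: dict[str, str], window: int = 3) -> float:
    """Fraction of non-blank lines that sit inside a run of `window`
    consecutive lines appearing more than once across the given files.

    Hypothetical helper: a naive, line-based approximation only.
    """
    # Flatten every file into (file name, stripped line), skipping blanks.
    lines = []
    for name, text in files.items():
        for raw in text.splitlines():
            stripped = raw.strip()
            if stripped:
                lines.append((name, stripped))

    # Index every window of consecutive lines that stays inside one file.
    seen = defaultdict(list)
    for start in range(len(lines) - window + 1):
        chunk = lines[start:start + window]
        if len({name for name, _ in chunk}) != 1:
            continue  # window crosses a file boundary
        seen[tuple(text for _, text in chunk)].append(start)

    # Mark every line covered by a window that occurs at least twice.
    duplicated = set()
    for starts in seen.values():
        if len(starts) > 1:
            for start in starts:
                duplicated.update(range(start, start + window))

    return len(duplicated) / len(lines) if lines else 0.0


# Two made-up queries that share their first three lines.
sample = {
    "a.sql": "SELECT id\nFROM users\nWHERE active = 1\nORDER BY id",
    "b.sql": "SELECT id\nFROM users\nWHERE active = 1\nLIMIT 10",
}
print(duplication_ratio(sample))  # 0.75
```

The result is only a ratio. Whether 0.75, or 0.10, is harmful still depends on the language, the framework and the team.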

If copy-paste were the only problem, reducing duplication would be easy. It is never that simple, because the behavior is almost always Copy Paste Mutate. These small mutations make the problem harder to detect, as the variations come in many forms: order of operations, number of parameters, layers of code where the same pattern repeats over and over with some variation. Not all tools can catch this kind of mutation, and most likely it will take a person who groks the code and has a eureka moment while doing something completely different.
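As an illustration of why mutations are hard to catch, the sketch below uses Python's standard `ast` module to erase names and literal values before comparing two snippets. The `shape` function and the sample functions are hypothetical. It only catches the simplest mutations (renames and changed constants); reordered operations or extra parameters, as described above, would still slip through.

```python
import ast


class _Normalize(ast.NodeTransformer):
    """Replace identifiers and literal values with fixed placeholders."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)  # keep walking into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_Constant(self, node):
        node.value = 0
        node.kind = None
        return node


def shape(source: str) -> str:
    """Return a string describing the structure of `source`, ignoring
    variable names, function names and literal values (hypothetical helper)."""
    tree = _Normalize().visit(ast.parse(source))
    return ast.dump(tree)


original = "def total_price(items):\n    return sum(i.price for i in items) * 1.2\n"
mutated = "def total_cost(rows):\n    return sum(r.price for r in rows) * 1.1\n"
unrelated = "def count(rows):\n    return len(rows)\n"

print(shape(original) == shape(mutated))    # True
print(shape(original) == shape(unrelated))  # False
```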

Duplication can also happen outside a given component, for example in a new project with similar characteristics. Every time a new module, component or system is started from scratch, it is never really from a blank slate. Some of the foundation or bootstrapping that all components in an organization share is very similar, and it is very prone to mutations and improvements on each iteration.

Making it even more troublesome, these copies might live in different source code repositories or, after a while, be maintained by different teams, deviating even more from the original while keeping the same responsibility. It is harder to decide whether it makes sense to reduce duplication of this type.

Then there is also duplication caused by libraries and frameworks. For instance, hooks for web frameworks tend to force code duplication for each exposed endpoint. Inversion of control or dependency injection frameworks also tend to have the secondary effect of duplicating code in constructors or wiring functions. The one I've seen affect teams most seriously has to do with configuration files and the initialization of components per environment. It is a good practice to extract all configuration from deployed artifacts and have distinct means of configuring them for each environment or use case, but this most likely leaves you with different sets of configuration, with different percentages of duplication between them. In configuration files the danger lies in not propagating a required value to all configuration sets, thus causing very weird bugs.
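A check for the "value not propagated" danger can be quite small. The sketch below assumes each environment's configuration has already been loaded into a plain dictionary; the `missing_keys` function, the environment names and the keys are all made up for illustration.

```python
def missing_keys(configs: dict[str, dict]) -> dict[str, set[str]]:
    """For each environment, list the keys that exist in at least one
    other environment but not in this one (hypothetical helper)."""
    all_keys = set()
    for config in configs.values():
        all_keys.update(config)

    report = {}
    for env, config in configs.items():
        missing = all_keys - set(config)
        if missing:
            report[env] = missing
    return report


# Made-up configuration sets: "feature_x" was added to dev but never to prod.
configs = {
    "dev": {"db_url": "postgres://dev", "timeout": 5, "feature_x": True},
    "prod": {"db_url": "postgres://prod", "timeout": 30},
}
print(missing_keys(configs))  # {'prod': {'feature_x'}}
```

This treats every key as required everywhere, which is an assumption; a real setup would probably need a list of keys that are allowed to differ per environment.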

Now we are in a new era where infrastructure and tools are adding more duplication to code bases. Initialization scripts, container definitions and infrastructure as code will have many lines and blocks duplicated, mutated and maintained by different groups. Additionally, configuration files now have one more layer in which to live, and roles and permissions also become a target for duplication per environment. This is the new problem we are facing, and it is only becoming worse with code completion and GPT tools that make it even simpler to add code to an existing project.

How to tackle this is the crux. In the past we tended to push it down the stack to libraries and frameworks. CORBA/RMI/COM used to be among the blatant offenders on duplication, but then we became wiser and hid all the code generators beneath application servers and integration tools via instrumentation and reflection. Even though generated code solves lots of problems, it ends up being more lines that a human must maintain. This is harmful in the long term.

We need to add abstraction layers or anti-corruption layers where we can gather all these duplications and, if possible, create libraries and frameworks that make them unnecessary, again.
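As one small, hedged example of such a layer, per-environment configuration can be expressed as shared defaults plus only the values that really differ. The names `BASE`, `OVERRIDES` and `config_for`, and the values in them, are assumptions for illustration, not a recommendation of a specific tool.

```python
# Shared defaults live in exactly one place (made-up values).
BASE = {"timeout": 30, "retries": 3, "log_level": "INFO"}

# Each environment declares only what differs from the defaults.
OVERRIDES = {
    "dev": {"log_level": "DEBUG"},
    "prod": {"timeout": 10},
}


def config_for(env: str) -> dict:
    """Build the full configuration for `env` from the defaults plus
    its overrides (hypothetical helper)."""
    if env not in OVERRIDES:
        raise KeyError(f"unknown environment: {env}")
    return {**BASE, **OVERRIDES[env]}


print(config_for("dev"))
# {'timeout': 30, 'retries': 3, 'log_level': 'DEBUG'}
```

A new required value added to `BASE` reaches every environment at once, which is the property the duplicated configuration sets lack.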

Musings on IT and programming
