
on duplicated code





This is a loaded topic. Duplicated code arises from multiple causes and takes different forms, and it affects a code base in subtle but damaging ways. Not all duplication is bad per se; it depends on how it affects the development process.

Some of this duplication comes from authors cutting corners, or from the same work being replicated in different parts of the system.

Copy-paste is rampant and can show up in any project and technology for many reasons: HTML tags that need to be repeated, SQL queries with nearly the same structure and different values, CSS definitions spread across multiple files. The problem is knowing that it is happening and that it is hurting the team. Files can be statically analyzed and their abstract syntax trees compared with many tools, and some of those tools can be plugged into CI/CD pipelines as checks, but this only gives you a number that still has to be evaluated to decide whether it is in a harmful range. For a given language and framework 10% duplication might be OK, while for others that much will already be causing issues.
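To make "a number that still has to be evaluated" concrete, here is a minimal sketch of how such a figure could be computed. It is not how any particular tool works: the function name `duplication_ratio`, the three-line window and the whitespace-only normalization are all assumptions chosen for illustration.

```python
from collections import defaultdict


def duplication_ratio(sources, window=3):
    """Fraction of non-blank lines that sit inside a repeated window.

    `sources` maps a (hypothetical) file name to its text. A window is
    `window` consecutive non-blank lines, compared after stripping
    leading and trailing whitespace.
    """
    seen = defaultdict(list)  # window contents -> [(file name, start index)]
    normalized = {}
    for name, text in sources.items():
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        normalized[name] = lines
        for i in range(len(lines) - window + 1):
            seen[tuple(lines[i:i + window])].append((name, i))

    duplicated = set()  # (file name, line index) pairs seen more than once
    for locations in seen.values():
        if len(locations) > 1:
            for name, start in locations:
                for offset in range(window):
                    duplicated.add((name, start + offset))

    total = sum(len(lines) for lines in normalized.values())
    return len(duplicated) / total if total else 0.0


# Two queries with the same structure and a different last line.
queries = {
    "a.sql": "SELECT id\nFROM users\nWHERE active = 1\nORDER BY id",
    "b.sql": "SELECT id\nFROM users\nWHERE active = 1\nLIMIT 10",
}
print(duplication_ratio(queries))  # 0.75
```

The ratio alone says nothing about harm. Whether 0.75 on two small queries matters, or 0.10 across a whole code base does, is still a judgment the team has to make.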

If plain copy-paste were the main problem, reducing duplication would be easy. It is never that simple, because the behavior is almost always copy, paste, mutate. These small mutations make the problem harder to detect, as the variations come in many forms: order of operations, number of parameters, layers of code where the same pattern repeats over and over with some variation. Not all tools can catch this kind of mutation, and most likely it will take a person who groks the code and has a eureka moment while doing something completely different.
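As an illustration of why mutations are harder to catch, the following sketch uses Python's standard `ast` module to compare the shape of two functions while ignoring identifiers and literal values. The names `_Normalizer` and `shape` are made up for this example, and the approach only handles the simplest mutation (renaming and changing constants); it is an assumption about how one might start, not a description of an existing tool.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Erase identifiers and literal values so only the structure remains."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_Constant(self, node):
        node.value = None
        node.kind = None
        return node


def shape(source):
    """Return a string describing the structure of `source`, names erased."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)


original = "def total_price(items):\n    return sum(i.price for i in items) * 1.16\n"
mutated = "def total_cost(rows):\n    return sum(r.price for r in rows) * 1.21\n"
unrelated = "def count(rows):\n    return len(rows)\n"

print(shape(original) == shape(mutated))    # True
print(shape(original) == shape(unrelated))  # False
```

A copy that reorders operations or adds a parameter would already produce a different shape here, which is exactly the kind of mutation that tends to need a person to notice.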

Duplication can also happen outside a given component, for example in a new project with similar characteristics. Whenever a new module, component or system is started from scratch, it never really starts from a blank slate. The foundation or bootstrapping code that all components in an organization share is very similar, and it is very prone to mutations and improvements on each iteration.

Making it even more troublesome, the copies might live in different source code repositories or, after a while, be maintained by different teams, so they drift even further from the original while still carrying the same responsibility. It is harder to decide whether reducing this type of duplication makes sense.

Then there is duplication caused by libraries and frameworks. For instance, hooks in web frameworks tend to force duplicated code for each exposed endpoint. Inversion of control or dependency injection frameworks tend to have a side effect of duplicated code in constructors or wiring functions. The kind I have seen affect teams most seriously has to do with configuration files and per-environment initialization of components. It is good practice to extract all configuration from deployed artifacts and to configure them separately for each environment or use case, but the result is most likely several sets of configuration with different percentages of duplication among them. In configuration files the danger lies in not propagating a required value to all configuration sets, which causes very weird bugs.
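The missing-value problem in configuration sets is one of the few cases that is cheap to check mechanically. The sketch below assumes each environment's configuration has already been loaded into a flat dictionary; the function name `missing_keys`, the environment names and the keys are all hypothetical, and nested configuration would need flattening first.

```python
def missing_keys(config_sets):
    """Map each environment to the keys other environments define but it lacks.

    `config_sets` maps an environment name to a flat dict of settings.
    Environments that have every key are left out of the result.
    """
    all_keys = set().union(*(set(cfg) for cfg in config_sets.values()))
    report = {}
    for env, cfg in config_sets.items():
        absent = sorted(all_keys - set(cfg))
        if absent:
            report[env] = absent
    return report


configs = {
    "dev": {"db_url": "localhost", "timeout": 5, "feature_x": True},
    "staging": {"db_url": "staging-db", "timeout": 5},
    "prod": {"db_url": "prod-db"},
}
print(missing_keys(configs))
# {'staging': ['feature_x'], 'prod': ['feature_x', 'timeout']}
```

A check like this treats every key as required everywhere, which is not always true. Some settings legitimately exist in only one environment, so the report is a list of things to look at rather than a list of bugs.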

We are now in a new era in which infrastructure and tooling add even more duplication to code bases. Initialization scripts, container definitions and infrastructure as code will have many lines and blocks duplicated, mutated and maintained by different groups. Additionally, configuration files now have yet another layer to live in, and roles and permissions also become a target for duplication per environment. This is the new problem we face, and it is only getting worse thanks to code completion and GPT tools that make it even simpler to add code to an existing project.

How to tackle this is the crux. In the past we tended to push it down the stack into libraries and frameworks. CORBA/RMI/COM used to be among the most blatant offenders on duplication, but then we became wiser and hid all the code generators beneath application servers and integration tools via instrumentation and reflection. Even though generated code solves lots of problems, it ends up being more lines that a human must maintain. This is harmful in the long term.

We need to add abstraction layers or anti-corruption layers where all this duplication can live and, if possible, create libraries and frameworks that make it unnecessary, again.
