I've seen many routines (functions, methods, procedures... whatever) that are needlessly long due to overzealous use of the copy/paste technique. But once you eliminate that kind of stuff, I do agree that you can get to a point where you've abstracted too much -- and the actual thing you are trying to accomplish with your code becomes obscured.
I have not read these studies, but to me it seems that the defect rate could rise because it's no longer apparent what's interacting with what, and in what order. If I have to hop around among numerous different 5-line methods (often in numerous different source files), it becomes very hard to intuit whether the code is "correct" or even to have any idea what it's doing.
It's hard to say - I learnt FORTH at one stage. The thing with FORTH is that it's almost impossible to write long functions. You can, but it's brutal. So you write a lot of very short methods. I actually hated working like this, but I must admit the code was better.
The reason I didn't like doing this is that it disturbed my flow. I wasn't even sure how the function was going to look, let alone how to put together 10 functions just to flesh it out. You needed a strong design and clear intent before you started.
Today, of course, people rely a lot on refactoring for the same effect. I tend to like this better, but I'm not sure whether that's for any solid reason on my part, or whether I'm just lazy and want to hack instead of pontificating on the problem.
Now I try to follow a rule of thumb - I like functions that have a single verb and noun, with a single optional subject, adverb and adjective. (The verbs and nouns can be system concepts, naturally.) So ideally anything can be described in sub ~5 words... If I'm really serious I put together the standard nomenclature first.
Once it becomes hard to describe in those terms, or you need n+ nouns, you've probably split the atom and are working at too fine a level of granularity.
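For what it's worth, a tiny invented illustration of that rule (the names are hypothetical, not from any real code, and Python is used only for brevity):

# Invented names, purely to illustrate the verb+noun rule of thumb.

def parse_invoice(text):
    """Verb + noun: describable in a couple of words."""

def send_overdue_reminder(invoice):
    """Verb + adjective + noun: still within ~5 words."""

def parse_invoice_update_ledger_and_notify_user(text):
    """Needs several verbs and nouns just to name it - a hint that it's
    doing too much and probably wants splitting."""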
These findings might really be about module size, not function size.
For Java/C#, objects and packages are modular units that were not available in the languages used in those studies, and they would naturally share the role of modularity that functions performed there.
My opinion: what belongs in a module, and its size, is determined by the problem and its domain. The crucial criterion is to make things clearer, and uncluttered by irrelevant things. A module is a way to combine some things and separate other things. There can be several legitimate ways to divide up a task into modules, though it will make more sense if you choose just one organizing principle and stick to it. This makes your choice of modularity predictable.
You made me realize that there are so many different levels of expertise, that what is right for someone's situation might not be right for mine. It might not even make sense to me (if I don't happen to know their specific situation), even though it is right for them.
Generally, cut-and-paste isn't the issue; copy-and-paste is. Granted, you can cut and paste paste paste..., but cut-and-paste usually means moving things to a more sensible place.
The biggest problem is when a code blob is duplicated all over the place and (worse still!) the copies slowly drift out of sync, making them hard to reunite. I've been working on a program off and on that tries to detect this in a language-agnostic manner, but I've hit a couple dead ends. (I half expect somebody with a more orthodox CS education to take one look at it and go, "Oh, that's NP-complete," but I haven't let it stop me - it's been a great way to learn about graph algorithms.)
As you have discovered, the problem is complex because of the explosion of comparisons if you try to do it in any general way. It would be useful to be able to query "is there any other code like the section that I selected?" The copied code usually has changes in things like constants and arguments. If you abstracted away the constants, it might be easier to compare. Just a suggestion.
Right. Allowing preprocessing with a syntax-specific extension or (ideally) scanning on the AST would be more interesting - code can be surprisingly similar except for variable names, types, and other semantic annotations. Rather than calling them "patterns" and prizing them, sometimes abstracting them away would be a better approach.
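Something along these lines is what I have in mind - a minimal sketch, using Python's ast module purely as a stand-in for whatever parser the real tool would use, with invented names like fingerprint. Identifiers and constants get replaced with placeholders so that structurally identical fragments compare equal:

import ast

class Normalize(ast.NodeTransformer):
    """Replace identifiers and literal constants with placeholders so
    that structurally identical code compares equal."""
    def visit_FunctionDef(self, node):
        node.name = "_FN_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_ARG_"
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_ID_", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="_CONST_"), node)

def fingerprint(source):
    tree = Normalize().visit(ast.parse(source))
    return ast.dump(tree)  # canonical string for the normalized tree

a = "def f(x):\n    return x * 3 + 1\n"
b = "def g(total):\n    return total * 7 + 2\n"

# Same structure, different names and constants -> same fingerprint.
print(fingerprint(a) == fingerprint(b))  # True

The same idea extends to types and other annotations - anything you normalize away stops counting as a difference.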
My main focus is on curing the heartbreak of copy-and-paste-programming, but I'm sure it would have more general applications (if I ever get it out of quadratic performance, etc.). Unsurprisingly, what is "similar enough" can be a very slippery concept.
Doing it on the AST should help performance. Even if you're still using an N^4 algorithm, the AST should be significantly smaller than the string that represents the code.
And, as it sounds like you realize, this will catch cases where variable names have been changed, but the structure is the same.
Right, and there would also be more semantic boundaries, (possibly greatly) reducing the size of the overall problem space. If (by sheer coincidence) there were a token overlap between the last two lines of one function and the first two of another, that's probably not relevant in the same way a copied-and-pasted version of the same function in two places would be.
The AST also makes it easier to use heuristics based on semantic information to reduce the number of subtrees (rather than substrings) you have to consider.
For example, in a C-like language you could decide to only look at subtrees that are at least complete statements, or go to a coarser granularity and only look at compound statements. You could also eliminate common idioms, such as
for (i = 0; i < N; i++)
which will pop up all over the place and give false positives.
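A rough sketch of that filtering, in Python only because its ast module keeps the example short (statement_clones is an invented helper, and the size threshold is just a crude stand-in for "skip trivial idioms"). It buckets statement-level subtrees by their dumped form and reports any bucket with more than one occurrence:

import ast
from collections import defaultdict

# Statement types too trivial to be interesting on their own.
TRIVIAL = (ast.Pass, ast.Break, ast.Continue, ast.Import, ast.ImportFrom)

def statement_clones(source, min_size=4):
    """Group statement-level subtrees by their dumped form; min_size
    weeds out tiny one-liners that appear everywhere."""
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt) and not isinstance(node, TRIVIAL):
            if sum(1 for _ in ast.walk(node)) >= min_size:
                buckets[ast.dump(node)].append(node.lineno)
    return {dump: lines for dump, lines in buckets.items() if len(lines) > 1}

src = """
for i in range(10):
    total = total + values[i]

for i in range(10):
    total = total + values[i]
"""
for lines in statement_clones(src).values():
    print("possible clone at lines", lines)  # reports the loops and their bodies

Combined with the name/constant normalization sketched above, this would also catch renamed copies.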
If you're literally just looking for code that appears in two places, then there is an obvious polynomial time algorithm, no? (Check every substring against every other substring...)
Yeah, it's polynomial, but it's still really slow and not practical.
In a string of length N, there are N substrings of length 1, N-1 substrings of length 2, N-2 substrings of length 3, ..., and 1 substring of length N. This is the sum of the first N integers, which is N(N+1)/2. If we're doing big-O analysis, we can just say that's O(N^2).
Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).
Running an O(N^4) algorithm on nontrivial sizes of N (and nontrivial sizes of N are where you need it the most) will take a long time.
Disclaimer: it's been a long time since I've done any algorithm analysis, so please check my work.
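To make the counting concrete, here's a toy version of the naive approach (pure illustration in Python, nothing you'd run on real code; the helper names are made up):

def all_substrings(s):
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

def repeated_substrings(s, min_len=3):
    subs = all_substrings(s)          # O(N^2) substrings
    repeats = set()
    for i, a in enumerate(subs):
        if len(a) < min_len:
            continue
        for b in subs[i + 1:]:        # compared pairwise: O(N^4) comparisons
            if a == b:
                repeats.add(a)
    return repeats

s = "abcabc"
n = len(s)
print(len(all_substrings(s)), n * (n + 1) // 2)  # 21 21
print(repeated_substrings(s))                    # {'abc'}

Even this six-character string already needs 21 * 20 / 2 = 210 pair comparisons; the growth is brutal.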
My algorithms teacher does research in something similar. IIRC he has a program exactly for showing copy/pasted code. You may want to have a look at his papers on Clone Detection Tools, "A Novel Approach to Optimize Clone Refactoring Activity", etc.
> Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).
If you are just checking for equality, comparing each of N items to another group of N items is O(N). Use a hash table.
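That is, something like this sketch, where the "items" are whatever fingerprints you're checking (normalized lines here, but they could just as well be subtree dumps); expected O(N) rather than O(N^2) pairwise comparisons:

from collections import defaultdict

def exact_duplicates(items):
    seen = defaultdict(list)       # fingerprint -> positions, one hash lookup per item
    for pos, item in enumerate(items):
        seen[item].append(pos)
    return {item: positions for item, positions in seen.items() if len(positions) > 1}

lines = ["x = load()", "y = x + 1", "x = load()"]
print(exact_duplicates(lines))     # {'x = load()': [0, 2]}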