I've seen many routines (functions, methods, procedures... whatever) that are needlessly long due to overzealous use of the copy/paste technique. But once you eliminate that kind of stuff, I do agree that you can get to a point where you've abstracted too much -- and the actual thing you are trying to accomplish with your code becomes obscured.
I have not read these studies, but to me it seems that the defect rate could rise because it's no longer apparent what's interacting with what, and in what order. If I have to hop around among numerous different 5-line methods (often in numerous different source files), it becomes very hard to intuit whether the code is "correct" or even to have any idea what it's doing.
It's hard to say - I learnt FORTH at one stage. The thing with FORTH is that it's almost impossible to write long functions. You can, but it's brutal. So you write a lot of very short methods. I actually hated working like this, but I must admit the code was better.
The reason I didn't like doing this is that it disturbed my flow. I wasn't even sure how the function was going to look, let alone how to put together 10 functions just to flesh it out. You needed a strong design and clear intent before you started.
Today, of course, people rely a lot on refactoring for the same effect. I tend to like this better, but I'm not sure whether that's for any solid reason on my part, or whether I'm just lazy and want to hack instead of pontificating on the problem.
Now I try to follow a rule of thumb - I like functions that have a single verb and noun, with a single optional subject, adverb and adjective. (The verbs and nouns can be system concepts, naturally.) So ideally anything can be described in sub ~5 words... If I'm really serious I put together the standard nomenclature first.
Once it becomes hard to describe in those terms, or you need n+ nouns, you've probably split the atom and are working at too fine a level of granularity.
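For what it's worth, a tiny invented illustration of that rule (the names are hypothetical, not from any real code, and Python is used only for brevity):

# Invented names, purely to illustrate the verb+noun rule of thumb.

def parse_invoice(text):
    """Verb + noun: describable in a couple of words."""

def send_overdue_reminder(invoice):
    """Verb + adjective + noun: still within ~5 words."""

def parse_invoice_update_ledger_and_notify_user(text):
    """Needs several verbs and nouns just to name it - a hint that it's
    doing too much and probably wants splitting."""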
These findings might really be about module size, not function size.
For Java/C#, objects and packages are modular units that were not available in the languages used in those studies, and they would naturally share the role of modularity that functions performed there.
My opinion: what belongs in a module, and its size, is determined by the problem and its domain. The crucial criterion is to make things clearer, and uncluttered by irrelevant things. A module is a way to combine some things and separate other things. There can be several legitimate ways to divide up a task into modules, though it will make more sense if you choose just one organizing principle and stick to it. This makes your choice of modularity predictable.
You made me realize that there are so many different levels of expertise, that what is right for someone's situation might not be right for mine. It might not even make sense to me (if I don't happen to know their specific situation), even though it is right for them.
Generally, cut-and-paste isn't the issue; copy-and-paste is. Granted, you can cut and paste paste paste..., but cut-and-paste usually means moving things to a more sensible place.
The biggest problem is when a code blob is duplicated all over the place and (worse still!) the copies slowly drift out of sync, making them hard to reunite. I've been working on a program off and on that tries to detect this in a language-agnostic manner, but I've hit a couple dead ends. (I half expect somebody with a more orthodox CS education to take one look at it and go, "Oh, that's NP-complete," but I haven't let it stop me - it's been a great way to learn about graph algorithms.)
As you have discovered, the problem is complex because of the explosion of comparisons if you try to do it in any general way. It would be useful to be able to query "is there any other code like the section that I selected?" The copied code usually has changes in things like constants and arguments. If you abstracted away the constants, it might be easier to compare. Just a suggestion.
Right. Allowing preprocessing with a syntax-specific extension or (ideally) scanning on the AST would be more interesting - code can be surprisingly similar except for variable names, types, and other semantic annotations. Rather than calling them "patterns" and prizing them, sometimes abstracting them away would be a better approach.
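Something along these lines is what I have in mind - a minimal sketch, using Python's ast module purely as a stand-in for whatever parser the real tool would use, with invented names like fingerprint. Identifiers and constants get replaced with placeholders so that structurally identical fragments compare equal:

import ast

class Normalize(ast.NodeTransformer):
    """Replace identifiers and literal constants with placeholders so
    that structurally identical code compares equal."""
    def visit_FunctionDef(self, node):
        node.name = "_FN_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_ARG_"
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_ID_", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="_CONST_"), node)

def fingerprint(source):
    tree = Normalize().visit(ast.parse(source))
    return ast.dump(tree)  # canonical string for the normalized tree

a = "def f(x):\n    return x * 3 + 1\n"
b = "def g(total):\n    return total * 7 + 2\n"

# Same structure, different names and constants -> same fingerprint.
print(fingerprint(a) == fingerprint(b))  # True

The same idea extends to types and other annotations - anything you normalize away stops counting as a difference.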
My main focus is on curing the heartbreak of copy-and-paste-programming, but I'm sure it would have more general applications (if I ever get it out of quadratic performance, etc.). Unsurprisingly, what is "similar enough" can be a very slippery concept.
Doing it on the AST should help performance. Even if you're still using an N^4 algorithm, the AST should be significantly smaller than the string that represents the code.
And, as it sounds like you realize, this will catch cases where variable names have been changed, but the structure is the same.
Right, and there would also be more semantic boundaries, (possibly greatly) reducing the size of the overall problem space. If (by sheer coincidence) there were a token overlap between the last two lines of one function and the first two of another, that's probably not relevant in the same way a copied-and-pasted version of the same function in two places would be.
The AST also makes it easier to use heuristics based on semantic information to reduce the number of subtrees (rather than substrings) you have to consider.
For example, in a C-like language you could decide to only look at subtrees that are at least complete statements, or go to a coarser granularity and only look at compound statements. You could also eliminate common idioms, such as
for (i = 0; i < N; i++)
which will pop up all over the place and give false positives.
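A rough sketch of that filtering, in Python only because its ast module keeps the example short (statement_clones is an invented helper, and the size threshold is just a crude stand-in for "skip trivial idioms"). It buckets statement-level subtrees by their dumped form and reports any bucket with more than one occurrence:

import ast
from collections import defaultdict

# Statement types too trivial to be interesting on their own.
TRIVIAL = (ast.Pass, ast.Break, ast.Continue, ast.Import, ast.ImportFrom)

def statement_clones(source, min_size=4):
    """Group statement-level subtrees by their dumped form; min_size
    weeds out tiny one-liners that appear everywhere."""
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt) and not isinstance(node, TRIVIAL):
            if sum(1 for _ in ast.walk(node)) >= min_size:
                buckets[ast.dump(node)].append(node.lineno)
    return {dump: lines for dump, lines in buckets.items() if len(lines) > 1}

src = """
for i in range(10):
    total = total + values[i]

for i in range(10):
    total = total + values[i]
"""
for lines in statement_clones(src).values():
    print("possible clone at lines", lines)  # reports the loops and their bodies

Combined with the name/constant normalization sketched above, this would also catch renamed copies.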
If you're literally just looking for code that appears in two places, then there is an obvious polynomial time algorithm, no? (Check every substring against every other substring...)
Yeah, it's polynomial, but it's still really slow and not practical.
In a string of length N, there are N substrings of length 1, N-1 substrings of length 2, N-2 substrings of length 3, ..., and 1 substring of length N. This is the sum of the first N integers, which is N(N+1)/2. If we're doing big-O analysis, we can just say that's O(N^2).
Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).
Running an O(N^4) algorithm on nontrivial sizes of N (and nontrivial sizes of N are where you need it the most) will take a long time.
Disclaimer: it's been a long time since I've done any algorithm analysis, so please check my work.
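To make the counting concrete, here's a toy version of the naive approach (pure illustration in Python, nothing you'd run on real code; the helper names are made up):

def all_substrings(s):
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

def repeated_substrings(s, min_len=3):
    subs = all_substrings(s)          # O(N^2) substrings
    repeats = set()
    for i, a in enumerate(subs):
        if len(a) < min_len:
            continue
        for b in subs[i + 1:]:        # compared pairwise: O(N^4) comparisons
            if a == b:
                repeats.add(a)
    return repeats

s = "abcabc"
n = len(s)
print(len(all_substrings(s)), n * (n + 1) // 2)  # 21 21
print(repeated_substrings(s))                    # {'abc'}

Even this six-character string already needs 21 * 20 / 2 = 210 pair comparisons; the growth is brutal.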
My algorithms teacher does research in something similar. IIRC he has a program exactly for showing copy/pasted code. You may want to have a look at his papers on Clone Detection Tools, "A Novel Approach to Optimize Clone Refactoring Activity", etc.
> Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).
If you are just checking for equality, comparing each of N items to another group of N items is O(N). Use a hash table.
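That is, something like this sketch, where the "items" are whatever fingerprints you're checking (normalized lines here, but they could just as well be subtree dumps); expected O(N) rather than O(N^2) pairwise comparisons:

from collections import defaultdict

def exact_duplicates(items):
    seen = defaultdict(list)       # fingerprint -> positions, one hash lookup per item
    for pos, item in enumerate(items):
        seen[item].append(pos)
    return {item: positions for item, positions in seen.items() if len(positions) > 1}

lines = ["x = load()", "y = x + 1", "x = load()"]
print(exact_duplicates(lines))     # {'x = load()': [0, 2]}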