[darcs-users] Detecting hunk moves [was: Automatic detection of file renames]

Wed Aug 28 06:38:02 UTC 2013

I've reordered this a bit to hopefully put my replies in a more helpful
order.

On 28/08/2013 04:51, AntC wrote:
> Ganesh Sittampalam <ganesh <at> earth.li> writes:

>> If I understand your terminology correctly, then Darcs always follows
>> content/context. I don't think following file names and line
>> numbering would work well in general: merges would usually produce
>> bad results - e.g. merging a file change to A and a rename of A to B
>> should result in a file change to B, not to A.
>>
> Thank you Ganesh. In looking at the darcs theory stuff, most of its
> examples for commuting show 'shuffling' of line numbers from hunk
> deletes/inserts, as if darcs is following addresses rather than
> content.

Ah, yes. I think I misunderstood you, in that I thought you meant by
file name/line number that a darcs patch would refer to an absolute file
name and line number and never change that.

Perhaps the best short description of what darcs does is that it tracks
addresses internally and alters them as necessary so that the visible
result is that it follows content.

Darcs never stores the content of "context" lines in the same way that
diff/patch do. However it does ensure that content is kept in the
"correct" place, i.e. with its context, as it gets merged and commuted.
That's implemented by tracking the /current/ file name and line number
for the content but changing it as necessary as the overall context changes.
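
As a rough illustration of "tracking addresses but altering them as
necessary", here is a toy model in Python. It is not darcs's real patch
format: the Insert type, apply and commute are hypothetical names, and only
insert-only hunks that do not touch each other are handled.

```python
from typing import NamedTuple, Optional, Tuple

class Insert(NamedTuple):
    line: int      # 1-based line number before which the new lines go
    lines: tuple   # the inserted content

def apply(patch, text):
    """Return a new list of lines with the patch applied."""
    i = patch.line - 1
    return text[:i] + list(patch.lines) + text[i:]

def commute(first, second) -> Optional[Tuple[Insert, Insert]]:
    """Given 'first then second', return (second', first') with the same
    overall effect, or None if the two inserts touch each other."""
    n1, n2 = len(first.lines), len(second.lines)
    if second.line > first.line + n1:
        # second lands strictly below first's new lines: its address
        # shrinks by the number of lines first added
        return Insert(second.line - n1, second.lines), first
    if second.line < first.line:
        # second lands strictly above: first's address grows instead
        return second, Insert(first.line + n2, first.lines)
    return None  # adjacent or overlapping: not modelled here

text = list("abcdef")
first = Insert(2, ("X", "Y"))
second = Insert(6, ("Z",))
second2, first2 = commute(first, second)
one_way = apply(second, apply(first, text))
other_way = apply(first2, apply(second2, text))
```

Both orders give the same file, but the stored line number of the second
patch changes from 6 to 4 when it is moved in front of the first. The
content stays with its context even though only addresses are stored.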

>> Once a patch is inferred based on whatever heuristics and recorded, it
>> would be treated just like the user had recorded it by hand in future
>> merges. ...
> 
> Hmm? I guess that the heuristics have to record the patch in such a way 
> that when it is merged/commuted it not only follows "darcs rules", but 
> also has the same 'moral effect' (as darcs calls it). Could we reasonably 
> expect a user to understand how that works out in all possible contexts?

Well, I guess that's the idea of existing darcs patches - they have a
formal behaviour that tries to match user intuition. For example by
following renames "properly", and by maintaining the relative ordering
of content in a file.

Sometimes this will still be semantically incorrect because, for example,
cherry-picking will allow you to pull a patch that uses function X
without the patch that defines function X, but it should be sort of
obvious to users how it happened.

> So patches based on "similar content" are tricky: are they recorded against 
> the file/line number location, or against the content/context? -- wherever 
> that goes in the target repo. And what if that content is not pulled into 
> the target repo?

The patch itself wouldn't say anything about "similar content". Let's
suppose we start with this in file F:

Aardvark 15
Bison 28
Cougar 34
Dolphin 49
Elephant 58

and then the user edits F and creates G so that F contains

Aardvark 15
Elephant 58

and G contains

Bison 28
Cougar 35
Dolphin 49

The idea of the similar content heuristic would be that we would first
record a "hunk move" of lines 2,3,4 of F to line 1 of G, followed by a
hunk change of line 2 of G to replace "Cougar 34" by "Cougar 35".

So the heuristic would use the presence of "similar content" to make a
guess that the user actually moved the content and then changed it
slightly. If the user accepted that guess when recording, what would
actually be tracked by darcs is those two steps - a move of content
keeping it unchanged, followed by an edit to that content.

[It could also be recorded the other way round - edit first then move
content - but the two patches in this example should commute cleanly so
it won't make a significant difference.]
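
The F/G example can be replayed with a small Python sketch. This is a toy
model only: a repository is a dict of line lists, and hunk_move and
hunk_change are hypothetical helpers, not darcs patch types as implemented.

```python
def hunk_move(repo, src, start, count, dst, dst_line):
    """Move `count` lines starting at 1-based `start` of file `src`
    so that they begin at 1-based `dst_line` of file `dst`."""
    moved = repo[src][start - 1:start - 1 + count]
    del repo[src][start - 1:start - 1 + count]
    repo.setdefault(dst, [])[dst_line - 1:dst_line - 1] = moved

def hunk_change(repo, name, line, old, new):
    """Replace one line of file `name`, checking the old content first."""
    assert repo[name][line - 1] == old
    repo[name][line - 1] = new

def initial():
    return {"F": ["Aardvark 15", "Bison 28", "Cougar 34",
                  "Dolphin 49", "Elephant 58"]}

# Order 1: move lines 2,3,4 of F to line 1 of G, then edit line 2 of G.
move_first = initial()
hunk_move(move_first, "F", 2, 3, "G", 1)
hunk_change(move_first, "G", 2, "Cougar 34", "Cougar 35")

# Order 2: edit first, then move. The edit now addresses line 3 of F.
edit_first = initial()
hunk_change(edit_first, "F", 3, "Cougar 34", "Cougar 35")
hunk_move(edit_first, "F", 2, 3, "G", 1)
```

Both orders end with the same F and G. The only difference is the address
carried by the edit: line 2 of G in one order, line 3 of F in the other.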

>>
>>> Suppose we have this sequence in Repo A:
>>> + create file F
>>> + add hunk text H1 to F
>>> + insert hunk text H2 into F (into the middle of H1)
>>>
>>> The author knows (but darcs can't) that the content of H2 'links to' 
>>> H1.
>>> (For example, H2 is program code that refs names declared in H1.)
>>> (I'm not sure if H2 is dependent on H1 in a darcs sense, because you 
>>> can't commute the two hunk operations -- you'd have to split 
>>>  H1 into two 'hunklets' of text.)
>>>
>>> There's then this sequence in Repo B.
>>> + pull create file F
>>> + pull add hunk text H1
>>> + create file G
>>> + hunk move text H1 to file G (-cut+paste)
>>> - this leaves file F empty
>>> ? pull hunk text H2
>>>
>>> I'm guessing that darcs will put text H2 into file G, as the only 
>>> content -- then compile would fail(?) Would git do the same?
>>
>> Yes, that would be what I would expect to happen. If H2 depends on H1
>> then this seems like exactly the right thing to do and I don't see why
>> compilation would fail.
>>
>> Given what you say below you might have meant to say "file F" here;
> 
> Aargh! and big apologies for confusing you. You are quite right that I 
> meant to say "file F". I guessed that patch H2 would go into the same file 
> name as it came from (in the absence of file renames).
> 
> So thank you for correcting my guess. I'm interested in understanding how 
> (and why) darcs knows to target the content (where H1 has gone), rather 
> than the address (file name/line number).

In this scenario it would know to target the new location of H1 because
the patch that moved H1 would be recorded as "move lines 5-10 from file
F to line 1 of file G", and the patch that added H2 would initially be
recorded as "insert this content at line 7 in file F". Because 7 is
between 5 and 10, the merge rule for combining a hunk change and a hunk
move would define that after the merge, the patch would be "insert this
content at line 3 in file G". (3 = 1 + 7 - 5)
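
The arithmetic of that rule can be written as a short Python sketch.
translate_insert is a hypothetical name, and only inserts strictly inside
the moved range are handled. What happens exactly at the edges of the range
is left open here.

```python
def translate_insert(insert_line, move_start, move_end, dest_line):
    """Map 'insert at insert_line of the source file' to the destination
    file, for a move of lines move_start..move_end to dest_line.
    Returns None when the insert is not strictly inside the moved range."""
    if move_start < insert_line <= move_end:
        return dest_line + (insert_line - move_start)
    return None  # outside or on the boundary: not modelled in this sketch

new_line = translate_insert(7, 5, 10, 1)   # 1 + 7 - 5

# Cross-check on a 12-line file F, using plain list slicing.
F = [f"line {i}" for i in range(1, 13)]

# Path A: insert H2 at line 7 of F, then move the widened range 5-11 to G.
a_f = F[:6] + ["H2"] + F[6:]
a_g = a_f[4:11]

# Path B: move lines 5-10 to G first, then insert at the translated line.
b_g = F[4:10]
b_g = b_g[:new_line - 1] + ["H2"] + b_g[new_line - 1:]

same = a_g == b_g   # True: H2 ends up between "line 6" and "line 7" in G
```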

It's this rule that makes hunk move worthwhile as a patch type - without
it, we might as well just store a hunk move as the corresponding delete
and add.

> what if there was H3 inserted within H1 at exactly the point H2 
> wants to go? What is the 'right thing'? Insert before H3, or after H3, or 
> report a conflict?

One of the things needed for implementing hunk move is working out the
correct rules for this kind of situation, where two patches to be merged
or commuted touch each other. It's surprisingly subtle even with
existing "hunk change" patches. I think in the scenario you describe it
would have to be a conflict, intuitively because there's no answer
that's unambiguous.

> What if H3's text is the same (or similar) to H2's?

In general that doesn't change the answer; the patch that inserted H3
and the patch that inserted H2 are separate and should conflict.

However Darcs 2 patches have a special notion of "duplicate" where two
patches with identical content actually do get merged without a
conflict, so if we still have Darcs 2 merging semantics at the time we
get hunk move, then you would end up with just one copy of the content
and no conflict would be reported. I don't really like this behaviour
and my gut feeling is that we should remove it again for Darcs 3, but
it'll need further discussion before we commit to that.
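
The two answers above (conflict in general, a single copy under the Darcs 2
"duplicate" behaviour) can be contrasted in a toy Python sketch.
merge_inserts and its darcs2_duplicates flag are hypothetical, and the
model covers only two single-point inserts recorded against the same base
file. Real darcs merging is considerably more involved.

```python
def merge_inserts(a, b, darcs2_duplicates=False):
    """a and b are (line, lines) inserts recorded against the same base.
    Returns ("ok", patches to apply in order), ("duplicate", [a]) when
    identical inserts are collapsed, or ("conflict", None)."""
    (la, ca), (lb, cb) = a, b
    if la == lb:
        # Both want the same spot: no unambiguous order exists.
        if darcs2_duplicates and ca == cb:
            return ("duplicate", [a])   # one copy of the content survives
        return ("conflict", None)
    if la < lb:
        # b sits below a, so its address shifts by the lines a added.
        return ("ok", [a, (lb + len(ca), cb)])
    return ("ok", [a, b])               # b sits above a: address unchanged

h2 = (7, ("H2 text",))
h3 = (7, ("H3 text",))
different = merge_inserts(h2, h3)                             # conflict
same_strict = merge_inserts(h2, h2)                           # conflict
same_darcs2 = merge_inserts(h2, h2, darcs2_duplicates=True)   # duplicate
```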

Cheers,

Ganesh