
Python Compare

The difflib module provides classes and functions for comparing sequences. It can be used, for example, to compare files, and it can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.
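
As a quick illustration of the "unified diff" format mentioned above, here is a minimal sketch using difflib.unified_diff. The two line lists and the labels old.txt and new.txt are made up for the example; no files are read.

```python
import difflib

# Two small, made-up "files" as lists of lines (line endings kept)
old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "delta\n", "gamma\n"]

# fromfile/tofile are only labels for the diff header
diff_lines = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))
print("".join(diff_lines), end="")
```

Lines starting with "-" exist only in the first sequence, and lines starting with "+" only in the second.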

Difflib

Finding Matching String with SequenceMatcher

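class difflib.SequenceMatcher: the idea is to find the longest contiguous matching subsequence that contains no “junk” elements, and then to apply the same idea recursively to the pieces of the sequences to the left and to the right of that match (the original Ratcliff and Obershelp algorithm doesn’t address junk). The most basic methods are: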

find_longest_match
get_matching_blocks

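The get_opcodes method is built on top of these two methods: it turns the matching blocks into a list of edit operations (replace, delete, insert, equal) that transform the first sequence into the second.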
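
To make the relationship concrete, here is a small sketch that prints the opcodes for two short strings. Each opcode is a tuple (tag, i1, i2, j1, j2) that refers to a[i1:i2] and b[j1:j2].

```python
import difflib

a = ' abcd'
b = 'abcd abcd'

seq = difflib.SequenceMatcher(None, a, b)
opcodes = seq.get_opcodes()

# Show which slice of a and which slice of b each operation refers to
for tag, i1, i2, j1, j2 in opcodes:
    print("%-7s a[%d:%d]=%r  b[%d:%d]=%r" % (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
```

Here b is obtained from a by inserting 'abcd' in front of it, so the result is one insert followed by one equal.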

match ratio

Calculate match ratio of two strings:

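import difflib

a = ' abcd'
b = 'abcd abcd'

# None means no junk filter: every character takes part in the match
seq = difflib.SequenceMatcher(None, a, b)
# ratio() returns a float in [0, 1]; scale it to a percentage
rate = seq.ratio() * 100

# Round to 10 decimals for a stable, readable value
print(round(rate, 10))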

⇒ output:

71.4285714286
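
The value comes from the formula 2.0 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. A minimal sketch that recomputes it by hand from the matching blocks (the variable names are only illustrative):

```python
import difflib

a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)

# M: total size of all matching blocks; T: combined length of both sequences
matches = sum(block.size for block in seq.get_matching_blocks())
total = len(a) + len(b)
manual = 2.0 * matches / total

print(matches, total, manual)
```

With 5 matched characters out of 14, the ratio is 10/14, which is about 0.714.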

longest matching block

Find the longest matching substring:

Syntax:
```
find_longest_match(alo, ahi, blo, bhi)
```
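Find the longest matching block in a[alo:ahi] and b[blo:bhi] (lo: low, hi: high). Returns a named tuple Match(a=i, b=j, size=k) such that a[i:i+k] is equal to b[j:j+k] and k is as large as possible. Among equally long matches, the one that starts earliest in a is returned and, of those, the one that starts earliest in b.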

Example

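import difflib

a = ' abcd'
b = 'abcd abcd'

seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100

print(round(rate, 10))
# Longest block common to a[0:5] and b[0:9]
print(seq.find_longest_match(0, 5, 0, 9))
# Replace only the first sequence; the second one stays cached in the matcher
a = 'm abcd'
seq.set_seq1(a)
print(seq.find_longest_match(0, 6, 0, 9))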

⇒ output:

71.4285714286
Match(a=0, b=4, size=5)
Match(a=1, b=4, size=5)
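
The first argument of SequenceMatcher is the junk filter. As a sketch of its effect, the following compares the same two strings once without a filter and once with blanks declared as junk (the lambda is just one possible filter):

```python
import difflib

a = ' abcd'
b = 'abcd abcd'

plain = difflib.SequenceMatcher(None, a, b)
# Treat the blank as junk: a match may no longer start on it
no_blanks = difflib.SequenceMatcher(lambda ch: ch == ' ', a, b)

m1 = plain.find_longest_match(0, len(a), 0, len(b))
m2 = no_blanks.find_longest_match(0, len(a), 0, len(b))
print(m1)
print(m2)
```

Without a filter the match is ' abcd' at b[4]. With blanks as junk, the leading blank cannot anchor the match, so the leftmost 'abcd' in b is chosen instead.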

Get matching blocks

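get_matching_blocks() returns a list of triples (i, j, n), each meaning that a[i:i+n] is equal to b[j:j+n]. The last triple is always a dummy, (len(a), len(b), 0):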

import difflib

a = ' abcd'
b = 'abcd abcd'

seq = difflib.SequenceMatcher(None, a, b)

print('matching1:')
# Each block is a Match(a, b, size) named tuple, so it fills the three %d slots
for block in seq.get_matching_blocks():
    print("a[%d] and b[%d] match for %d elements" % block)
# Swap in a new first sequence and list the blocks again
a = 'abced abc'
seq.set_seq1(a)
print('matching2:')
for block in seq.get_matching_blocks():
    print("a[%d] and b[%d] match for %d elements" % block)

⇒ output:

matching1:
a[0] and b[4] match for 5 elements
a[5] and b[9] match for 0 elements
matching2:
a[0] and b[0] match for 3 elements
a[4] and b[3] match for 5 elements
a[9] and b[9] match for 0 elements

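"a[0] and b[4] match for 5 elements" means that the 5 elements starting at a[0] are ' abcd' and the 5 elements starting at b[4] are also ' abcd'. The final block of size 0 is the dummy entry that marks the end of both sequences.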

Match strings with multiple lines

import difflib

a = (" abcd \n"   # note the trailing space after "abcd"
     "abc pq\n"
     "ef abc\n")
b = ("abcd abcd\n"
     "ef\n")

# The strings are compared character by character; '\n' is just another character
seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100
print('rate: ', round(rate, 10))
print('longest_match: ', seq.find_longest_match(0, 20, 0, 9))
print('matching blocks:')
for block in seq.get_matching_blocks():
    print("a[%d] and b[%d] match for %d elements" % block)
    # Show the matched slice of each string
    print('>>>>', a[block[0]:(block[0] + block[2])])
    print('<<<<', b[block[1]:(block[1] + block[2])])

output:

rate:  52.9411764706
longest_match:  Match(a=0, b=4, size=5)
matching blocks:
a[0] and b[4] match for 5 elements
>>>>  abcd
<<<<  abcd
a[13] and b[9] match for 3 elements
>>>>
ef
<<<<
ef
a[20] and b[12] match for 1 elements
>>>>

<<<<

a[21] and b[13] match for 0 elements
>>>>
<<<<

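SequenceMatcher accepts any sequence of hashable items, not only strings. The sketch below compares two texts line by line instead of character by character. Note that the second text here is an assumption for illustration: it contains an extra 'abc pq' line so that one whole line matches.

```python
import difflib

a_lines = " abcd \nabc pq\nef abc\n".splitlines()
b_lines = "abcd abcd\nabc pq\nef\n".splitlines()

# Each list element (a whole line) is now one unit of comparison
seq = difflib.SequenceMatcher(None, a_lines, b_lines)
blocks = seq.get_matching_blocks()
line_ratio = seq.ratio()

for block in blocks:
    print("a[%d] and b[%d] match for %d lines" % block)
print(line_ratio)
```

Only the line 'abc pq' is identical in both texts, so one of the six lines on each side... that is, 2 * 1 / 6 of the material matches, even though many individual characters are shared.
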
SequenceMatcher with files

import difflib
from os import path

INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
# Read both files completely; they are compared character by character
with open(htmlfile1, 'r') as f:
    doc1 = f.read()
with open(htmlfile2, 'r') as f:
    doc2 = f.read()
seq = difflib.SequenceMatcher(None, doc1, doc2)
rate = seq.ratio() * 100
print(rate)
for block in seq.get_matching_blocks():
    print("a[%d] and b[%d] match for %d elements" % block)
    # Show the matched slice of each document
    print('>>>>', doc1[block[0]:(block[0] + block[2])])
    print('<<<<', doc2[block[1]:(block[1] + block[2])])
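
ratio() can be slow on large documents. SequenceMatcher also offers quick_ratio() and real_quick_ratio(), which return cheap upper bounds on ratio(). A possible way to use them as a pre-filter is sketched below; the helper name is_similar and the 0.6 threshold are assumptions made for this example.

```python
import difflib

def is_similar(doc1, doc2, threshold=0.6):
    """Hypothetical helper: True if ratio() >= threshold.

    The cheaper upper bounds are tried first, so ratio() is only
    computed when the pair could still reach the threshold.
    """
    seq = difflib.SequenceMatcher(None, doc1, doc2)
    return (seq.real_quick_ratio() >= threshold
            and seq.quick_ratio() >= threshold
            and seq.ratio() >= threshold)

print(is_similar("abcd", "abcd x"))
print(is_similar("abcd", "wxyz"))
```

Because Python's "and" short-circuits, the expensive ratio() call is skipped as soon as one of the bounds falls below the threshold.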

Finding Differences with the Differ Object

The Differ object builds on SequenceMatcher. The relevant SequenceMatcher methods are:

SequenceMatcher.get_opcodes (used by Differ.compare)
SequenceMatcher.get_grouped_opcodes (used by unified_diff and context_diff)

The basic functions to understand:

Differ.compare:

d = difflib.Differ()
result = d.compare(a, b)

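By default, ndiff passes charjunk=IS_CHARACTER_JUNK, so blanks and tabs (' ' and '\t') are treated as junk characters when lines are compared, whereas a Differ created without arguments applies no junk filter. Its definition is: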

def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
    return Differ(linejunk, charjunk).compare(a, b)
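
The equivalence can be checked directly. The sketch below runs ndiff and an explicitly configured Differ on the same made-up input and prints the delta. Lines starting with "? " are guides that point at the characters that differ within a line.

```python
import difflib

a = "one\ntwo\nthree\n".splitlines(keepends=True)
b = "ore\ntree\nemu\n".splitlines(keepends=True)

via_ndiff = list(difflib.ndiff(a, b))
# Same junk settings as ndiff's defaults, passed to Differ by hand
via_differ = list(difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK).compare(a, b))

print("".join(via_ndiff), end="")
print(via_ndiff == via_differ)
```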

Simple Diff

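import difflib
import sys

# Differ compares lists of lines, so split the text and keep the line endings
a = (" abcd \n"   # note the trailing space after "abcd"
     "abc pq\n"
     "ef abc\n").splitlines(keepends=True)
b = ("abcd abcd\n"
     "abc pq\n"
     "ef\n").splitlines(keepends=True)

d = difflib.Differ()
# compare() returns a generator of delta lines; materialize it as a list
result = list(d.compare(a, b))
sys.stdout.writelines(result)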

output:

-  abcd
+ abcd abcd
  abc pq
- ef abc
+ ef
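
A delta produced by Differ.compare (or ndiff) contains both inputs, so either one can be recovered with difflib.restore. A minimal sketch using the same two texts:

```python
import difflib

a = " abcd \nabc pq\nef abc\n".splitlines(keepends=True)
b = "abcd abcd\nabc pq\nef\n".splitlines(keepends=True)

delta = list(difflib.Differ().compare(a, b))

# which=1 rebuilds the first sequence, which=2 the second
restored_a = list(difflib.restore(delta, 1))
restored_b = list(difflib.restore(delta, 2))

print(restored_a == a, restored_b == b)
```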

Diff two files

Compare two files with a plain Differ:

import difflib
from os import path

INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
# Read each file as a list of lines, keeping the line endings
with open(htmlfile1, 'r') as f:
    doc1 = f.read().splitlines(keepends=True)
with open(htmlfile2, 'r') as f:
    doc2 = f.read().splitlines(keepends=True)

d = difflib.Differ()
result = d.compare(doc1, doc2)
# The delta is plain text (despite the .html name), so open the file in text mode
with open('compare.html', 'w') as f:
    f.writelines(result)

Compare two files with ndiff, which treats blanks and tabs as junk characters (IS_CHARACTER_JUNK):

import difflib
from os import path

INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
# Read each file as a list of lines, keeping the line endings
with open(htmlfile1, 'r') as f:
    doc1 = f.read().splitlines(keepends=True)
with open(htmlfile2, 'r') as f:
    doc2 = f.read().splitlines(keepends=True)

# ndiff is Differ.compare with charjunk=IS_CHARACTER_JUNK
result = difflib.ndiff(doc1, doc2)
# The delta is plain text (despite the .html name), so open the file in text mode
with open('compare.html', 'w') as f:
    f.writelines(result)
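
The two scripts above write a plain-text delta. If an actual HTML report is wanted, difflib.HtmlDiff can render a side-by-side table. The sketch below uses two made-up line lists in place of the files, and the descriptions "old" and "new" are only labels.

```python
import difflib

# Made-up stand-ins for the lines of the two files
old_lines = ["<h1>Home</h1>\n", "<p>Welcome</p>\n"]
new_lines = ["<h1>Login</h1>\n", "<p>Welcome</p>\n"]

# make_file returns a complete HTML document as a single string
html = difflib.HtmlDiff().make_file(old_lines, new_lines,
                                    fromdesc="old", todesc="new")
print(len(html))
```

The returned string can then be written to a file opened in text mode, for example with open('compare.html', 'w').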
