====== Python Compare ======

The **difflib** module provides classes and functions for comparing sequences. It can be used, for example, **for comparing files**, and it can produce difference information in various formats, including HTML and context and unified diffs. For **comparing directories and files**, see also the **filecmp** module.

===== Difflib =====

==== Finding Matching Strings with SequenceMatcher ====

**class difflib.SequenceMatcher**

The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk). The most basic functions are:
* **find_longest_match**
* **get_matching_blocks**

The function **get_opcodes** builds on these two functions to describe how the first sequence can be turned into the second.\\
A SequenceMatcher is created from two inputs, which can be **two strings or two lists**.

=== match ratio ===

* Calculate the match ratio of two strings:


import difflib

a = ' abcd'
b = 'abcd abcd'

seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100

print rate

=> output:


71.4285714286
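
The ratio above can be reproduced by hand. The sketch below assumes the documented definition of ratio(), 2.0 * M / T, where M is the number of matched elements and T is the total length of both inputs; the variable names are only illustrative. It uses single-argument print() so it should run under both Python 2.7 and Python 3.

```python
import difflib

a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)

# M: total number of matched elements, summed over all matching blocks
matched = sum(block.size for block in seq.get_matching_blocks())
# T: total number of elements in both sequences
total = len(a) + len(b)
manual_ratio = 2.0 * matched / total

print(matched)  # 5
print(total)    # 14
print(abs(manual_ratio - seq.ratio()) < 1e-12)  # True
```

Here 2.0 * 5 / 14 is about 0.714, which is the 71.4 percent printed above.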

=== longest match ratio ===

* Find the longest matching substring:
* Syntax:


find_longest_match(alo, ahi, blo, bhi)

Find the longest matching block in a[alo:ahi] and b[blo:bhi] (lo: low, hi: high). Returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where k is the length of the longest such match.

* Simple example:


import difflib

a = ' abcd'
b = 'abcd abcd'

seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100

print rate
print seq.find_longest_match(0, 5, 0, 9)
a = 'm abcd'
seq.set_seq1(a)
print seq.find_longest_match(0, 6, 0, 9)

=> output:


71.4285714286
Match(a=0, b=4, size=5)
Match(a=1, b=4, size=5)
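
To make the returned triple concrete, here is a small sketch that slices both strings with the Match fields (a, b and size play the roles of i, j and k). The names sub_a and sub_b are illustrative only.

```python
import difflib

a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)

# Search the full range of both strings
m = seq.find_longest_match(0, len(a), 0, len(b))

# Slice out the matched block from each input
sub_a = a[m.a:m.a + m.size]
sub_b = b[m.b:m.b + m.size]

print(repr(sub_a))     # ' abcd'
print(sub_a == sub_b)  # True
```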

* Example find_longest_match with isjunk option:


import difflib
 
a = ' abcd'
b = 'abcd abcd'
 
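# No junk: every character can take part in a match
seq = difflib.SequenceMatcher(None, a, b)
# Spaces are junk: a match cannot start on a space
seq2 = difflib.SequenceMatcher(lambda x: x==" ", a, b)
# IS_LINE_JUNK is meant for whole lines; on single characters it flags whitespace and '#'
seq3 = difflib.SequenceMatcher(difflib.IS_LINE_JUNK, a, b)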

print seq.find_longest_match(0, 5, 0, 9)
print seq2.find_longest_match(0, 5, 0, 9)
print seq3.find_longest_match(0, 5, 0, 9)

output:


Match(a=0, b=4, size=5)
Match(a=1, b=0, size=4)
Match(a=1, b=0, size=4)
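
The junk function also changes the matching blocks and therefore the ratio. The following sketch compares a matcher without junk against one that treats spaces as junk, using the same two strings; the variable names are illustrative.

```python
import difflib

a = ' abcd'
b = 'abcd abcd'

plain = difflib.SequenceMatcher(None, a, b)
no_space = difflib.SequenceMatcher(lambda ch: ch == ' ', a, b)

# Convert Match objects to plain tuples for easy comparison
plain_blocks = [tuple(m) for m in plain.get_matching_blocks()]
junk_blocks = [tuple(m) for m in no_space.get_matching_blocks()]

print(plain_blocks)  # [(0, 4, 5), (5, 9, 0)]
print(junk_blocks)   # [(1, 0, 4), (5, 9, 0)]
print(round(plain.ratio(), 3))     # 0.714
print(round(no_space.ratio(), 3))  # 0.571
```

With spaces marked as junk, the leading space of a no longer anchors the match, so only 'abcd' (4 elements) is matched and the ratio drops.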

=== Get matching blocks ===

* Get matching blocks:


import difflib
 
a = ' abcd'
b = 'abcd abcd'
 
seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100

print 'matching1:'
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
a = 'abced abc'
seq.set_seq1(a)
print 'matching2:'
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block

=> output:


matching1:
a[0] and b[4] match for 5 elements
a[5] and b[9] match for 0 elements
matching2:
a[0] and b[0] match for 3 elements
a[4] and b[3] match for 5 elements
a[9] and b[9] match for 0 elements

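**a[0] and b[4] match for 5 elements:** the 5 elements starting at a[0] are ' abcd', and the 5 elements starting at b[4] are ' abcd'. The final block of size 0 is a sentinel that marks the end of both sequences.

=== get_opcodes ===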


import difflib
import sys
 
a = """ abcd
abc pq
ef abc
mn
""".splitlines(1)
b = """abcd abcd
ef
mn
""".splitlines(1)
print 'a = ', a
print 'b = ', b
seq = difflib.SequenceMatcher(None, a, b)
print '*******************************'
for tag, alo, ahi, blo, bhi in seq.get_opcodes():
    print '- ', tag, alo, ahi, blo, bhi, ':'
    print '--from:'
    sys.stdout.writelines(a[alo:ahi])  # the lines of a covered by this opcode
    print '--to:'
    sys.stdout.writelines(b[blo:bhi])  # the lines of b covered by this opcode
result = list(difflib.ndiff(a, b))
print '*******************************'
print 'normal diff:'
sys.stdout.writelines(result)

output:


a =  [' abcd\n', 'abc pq\n', 'ef abc\n', 'mn\n']
b =  ['abcd abcd\n', 'ef\n', 'mn\n']
*******************************
-  replace 0 3 0 2 :
--from:
 abcd
abc pq
ef abc
--to:
abcd abcd
ef
-  equal 3 4 2 3 :
--from:
mn
--to:
mn
*******************************
normal diff:
-  abcd
+ abcd abcd
? ++++
+ ef
- abc pq
- ef abc
  mn
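
The opcodes can be read as an edit script. As a minimal sketch, the hypothetical helper apply_opcodes below rebuilds b from a using only the (tag, alo, ahi, blo, bhi) tuples; the sample strings are the ones used in the standard library documentation for get_opcodes.

```python
import difflib

def apply_opcodes(a, b):
    """Rebuild b by walking the opcodes that turn a into b (illustrative helper)."""
    out = []
    for tag, alo, ahi, blo, bhi in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag == 'equal':
            out.append(a[alo:ahi])   # unchanged part, copied from a
        elif tag in ('replace', 'insert'):
            out.append(b[blo:bhi])   # new material, taken from b
        # 'delete': a[alo:ahi] is dropped, nothing to append
    return ''.join(out)

a = 'qabxcd'
b = 'abycdf'
tags = [op[0] for op in difflib.SequenceMatcher(None, a, b).get_opcodes()]
rebuilt = apply_opcodes(a, b)

print(tags)     # ['delete', 'equal', 'replace', 'equal', 'insert']
print(rebuilt)  # abycdf
```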

=== Match string with multilines ===


import difflib
import sys
 
a = """ abcd 
abc pq
ef abc
mn
"""
b = """abcd abcd
ef
mn
"""
 
seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100
print '*************************'
print 'rate1: ',rate
print 'longest_match1: ', seq.find_longest_match(0, 20, 0, 9)
print 'matching blocks1:'
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', a[block[0]:(block[0] + block[2])]
    print '<<<<', b[block[1]:(block[1] + block[2])]

a = a.splitlines(1)
b = b.splitlines(1)
seq2 = difflib.SequenceMatcher(None, a, b)
rate = seq2.ratio() * 100  # ratio of the line-based matcher, not the character-based one
print '*************************'
print 'rate2: ',rate
print 'longest_match2: ', seq2.find_longest_match(0, 4, 0, 3)
print 'matching blocks2:'
for block in seq2.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', a[block[0]:(block[0] + block[2])]
    print '<<<<', b[block[1]:(block[1] + block[2])]


d = difflib.Differ()
result = list(d.compare(a, b))
print 'normal diff:'
sys.stdout.writelines(result)

output:


*************************
rate1:  60.0
longest_match1:  Match(a=0, b=4, size=5)
matching blocks1:
a[0] and b[4] match for 5 elements
>>>>  abcd
<<<<  abcd
a[13] and b[9] match for 3 elements
>>>>
ef
<<<<
ef
a[20] and b[12] match for 4 elements
>>>>
mn

<<<<
mn

a[24] and b[16] match for 0 elements
>>>>
<<<<
*************************
rate2:  28.5714285714
longest_match2:  Match(a=3, b=2, size=1)
matching blocks2:
a[3] and b[2] match for 1 elements
>>>> ['mn\n']
<<<< ['mn\n']
a[4] and b[3] match for 0 elements
>>>> []
<<<< []
normal diff:
+ abcd abcd
+ ef
-  abcd
- abc pq
- ef abc
  mn

=== SequenceMatcher with files ===


import difflib
from os import path 
 
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    doc1 = f.read()
with open(htmlfile2, 'r') as f:
    doc2 = f.read()    
seq = difflib.SequenceMatcher(None, doc1, doc2)
rate = seq.ratio() * 100
print rate
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', doc1[block[0]:(block[0] + block[2])]
    print '<<<<', doc2[block[1]:(block[1] + block[2])]
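
Comparing whole files character by character can be slow. One possible shortcut, sketched below, is to compare lists of lines and to check the cheap upper bounds real_quick_ratio() and quick_ratio() before calling ratio(). The helper name similarity_at_least and the sample documents are assumptions made for this example.

```python
import difflib

def similarity_at_least(a, b, threshold):
    """Return True if ratio() >= threshold, trying the cheaper upper bounds first."""
    seq = difflib.SequenceMatcher(None, a, b)
    # Both quick variants are upper bounds on ratio(), so failing them is conclusive
    if seq.real_quick_ratio() < threshold:
        return False
    if seq.quick_ratio() < threshold:
        return False
    return seq.ratio() >= threshold

# Stand-ins for the contents of two files, split into lines
doc1 = 'line one\nline two\nline three\n'.splitlines(True)
doc2 = 'line one\nline 2\nline three\n'.splitlines(True)

print(similarity_at_least(doc1, doc2, 0.5))  # True
print(similarity_at_least(doc1, doc2, 0.9))  # False
```

Two of the three lines match, so the line-based ratio is 2 * 2 / 6, about 0.667.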

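==== Finding Differences between Strings with the Differ Object ====

The Differ object builds on the SequenceMatcher API for comparing:
* SequenceMatcher.get_opcodes (used by Differ.compare)
* SequenceMatcher.get_grouped_opcodes (used by unified_diff and context_diff rather than by Differ itself)

Some basic functions to understand:
* Differ.compare: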


d = difflib.Differ()
result = d.compare(a, b)

* By default, ndiff treats the characters matched by IS_CHARACTER_JUNK (space and tab, ' \t') as junk when comparing:


def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
    return Differ(linejunk, charjunk).compare(a, b)

=== Simple Diff ===


import difflib
from os import path
from pprint import pprint
import sys 
 
a = """ abcd 
abc pq
ef abc
 mpq
""".splitlines(1)
b = """abcd abcd
abc pq
ef
mpq
""".splitlines(1)
 
d = difflib.Differ()
result = list(d.compare(a, b))
print 'normal diff:'
sys.stdout.writelines(result)

print 'diff with charjunk = difflib.IS_CHARACTER_JUNK:'
result = difflib.ndiff(a, b)
sys.stdout.writelines(result)

output:


-  abcd
+ abcd abcd
  abc pq
- ef abc
+ ef
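
For a more compact, patch-style result, the same two line lists can be passed to difflib.unified_diff. This is only a sketch: 'a.txt' and 'b.txt' are placeholder labels for the header, not real files.

```python
import difflib

a = ' abcd \nabc pq\nef abc\n mpq\n'.splitlines(True)
b = 'abcd abcd\nabc pq\nef\nmpq\n'.splitlines(True)

# fromfile and tofile only label the '---' and '+++' header lines
diff_lines = list(difflib.unified_diff(a, b, fromfile='a.txt', tofile='b.txt'))

print(''.join(diff_lines))
```

Removed lines are prefixed with '-', added lines with '+', and unchanged context lines with a single space.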

=== diff 2 files ===

* Compare 2 files with the default Differ:


import difflib
from os import path
from pprint import pprint
import sys 

INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    doc1 = f.read().splitlines(1)
with open(htmlfile2, 'r') as f:
    doc2 = f.read().splitlines(1)
     
d = difflib.Differ()
result = d.compare(doc1, doc2)
# Text mode works with str on both Python 2 and 3; writelines consumes the generator
with open('compare.html', 'w') as f:
    f.writelines(result)

* Compare while treating the characters matched by IS_CHARACTER_JUNK (' \t') as junk:


import difflib
from os import path
from pprint import pprint
import sys 

INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    doc1 = f.read().splitlines(1)
with open(htmlfile2, 'r') as f:
    doc2 = f.read().splitlines(1)
     
result = difflib.ndiff(doc1, doc2)
# Text mode works with str on both Python 2 and 3; writelines consumes the generator
with open('compare.html', 'w') as f:
    f.writelines(result)

* Compare 2 html files:


import difflib
from os import path
from pprint import pprint
import sys, re 
 
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
def normalize(content):
    # Helper introduced here to avoid repeating the same substitutions for both files.
    # It collapses whitespace around tags so formatting-only changes do not show up in the diff.
    content = re.sub(r'\s+/>', '/>', content)
    content = re.sub(r'\s+>', '>', content)
    content = re.sub(r'>\s+<', '>\n<', content)
    content = re.sub(r'\s*\n\s*', '\n', content)
    return content.splitlines(1)

with open(htmlfile1, 'r') as f:
    doc1 = normalize(f.read())
with open(htmlfile2, 'r') as f:
    doc2 = normalize(f.read())

result = difflib.ndiff(doc1, doc2)
# Text mode works with str on both Python 2 and 3; writelines consumes the generator
with open('compare.html', 'w') as f:
    f.writelines(result)
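
The examples above write a plain-text diff into compare.html. If an actual HTML report is wanted, difflib.HtmlDiff can render a side-by-side table. The sketch below uses made-up sample lines and the placeholder column labels 'old' and 'new'.

```python
import difflib

# Sample input lines, invented for this example
old_lines = ['<p>Add to cart</p>', '<p>Add to Wish List</p>', '<p>simple</p>']
new_lines = ['<p>Add to cart</p>', '<p>Add to Wish List change</p>']

# make_file returns a complete HTML page as a string
report = difflib.HtmlDiff().make_file(old_lines, new_lines, 'old', 'new')

print('<table' in report)  # True
```

The returned string can then be written to a file such as compare.html and opened in a browser.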

==== lxml.html.diff for comparing HTML files ====

lxml.html.diff uses 2 basic libraries:
* difflib for comparing 2 files
* etree for parsing HTML

Examples for lxml.html.diff:
* Simple diff:


from os import path
import sys, re
from lxml.html import diff, etree, HTMLParser
import codecs
import StringIO
doc1 = '''

     
        Add to cart
    
 

    
    
    Add to Wish List
    
    simple

'''
doc2 = '''

     
        Add to cart
    
 

    
    
    Add to Wish List change
    

'''
diffcontent = diff.htmldiff(doc1, doc2)
diffcontent = codecs.encode(diffcontent, 'utf-8')
print diffcontent

output:


Add to cart  
  Add to Wish List change   simple

* diff 2 HTML files:


from os import path
import sys, re
from lxml.html import diff
import codecs
 
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    content = f.read()
    doc1 = content
with open(htmlfile2, 'r') as f:
    content = f.read()
    doc2 = content
diffcontent = diff.htmldiff(doc1, doc2)
diffcontent = codecs.encode(diffcontent, 'utf-8')
print diffcontent

===== filecmp =====
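
A minimal sketch of comparing files and directories with filecmp, assuming two throwaway directories created with tempfile; the file names and the write helper are invented for the example.

```python
import filecmp
import os
import shutil
import tempfile

def write(path, text):
    # Small helper for creating the sample files
    with open(path, 'w') as f:
        f.write(text)

left = tempfile.mkdtemp()
right = tempfile.mkdtemp()
try:
    write(os.path.join(left, 'same.txt'), 'abcd\n')
    write(os.path.join(right, 'same.txt'), 'abcd\n')
    write(os.path.join(left, 'changed.txt'), 'abcd\n')
    write(os.path.join(right, 'changed.txt'), 'abcd abcd\n')
    write(os.path.join(left, 'only_left.txt'), 'mn\n')

    # shallow=False forces a comparison of the file contents
    same = filecmp.cmp(os.path.join(left, 'same.txt'),
                       os.path.join(right, 'same.txt'), shallow=False)
    changed = filecmp.cmp(os.path.join(left, 'changed.txt'),
                          os.path.join(right, 'changed.txt'), shallow=False)

    # dircmp compares the two directory listings
    dc = filecmp.dircmp(left, right)
    left_only = sorted(dc.left_only)
    diff_files = sorted(dc.diff_files)
    same_files = sorted(dc.same_files)
finally:
    shutil.rmtree(left)
    shutil.rmtree(right)

print(same)        # True
print(changed)     # False
print(left_only)   # ['only_left.txt']
print(diff_files)  # ['changed.txt']
print(same_files)  # ['same.txt']
```

filecmp only reports whether files differ; to see what changed inside a file, the difflib functions above can be applied to its lines.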