====== Python Compare ======
The **difflib** module provides classes and functions for comparing sequences. It can be used, for example, **to compare files**, and it can produce difference information in various formats, including HTML and context and unified diffs. For **comparing directories and files**, see also the **filecmp** module.
===== Difflib =====
==== Finding Matching Strings with SequenceMatcher ====
**class difflib.SequenceMatcher**
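The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the original Ratcliff and Obershelp algorithm doesn’t address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the match. The most basic functions are: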
* **find_longest_match**
* **get_matching_blocks**
The function **get_opcodes** builds on the two functions above to describe how to turn the first sequence into the second.\\
Create a SequenceMatcher from two sequences, for example **two strings or two lists**.
=== match ratio ===
* Calculate the match ratio of two strings (as a percentage):
import difflib
a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100
print rate
=> output:
71.4285714286
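The ratio is defined as 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matched elements. A minimal sketch that recomputes it from the matching blocks (written for Python 3, unlike the Python 2 examples on this page; the variable names are only illustrative):

```python
import difflib

a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)

# M: total number of matched elements, summed over all matching blocks
matched = sum(block.size for block in seq.get_matching_blocks())
# T: total number of elements in both sequences
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(matched, total)    # 5 14
print(manual_ratio)      # same value as seq.ratio(), i.e. 2.0 * 5 / 14
```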
=== longest match ===
* Find the longest matching substring:
* Syntax:
find_longest_match(alo, ahi, blo, bhi)
Finds the longest matching block in a[alo:ahi] and b[blo:bhi] (lo: low, hi: high). Returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi, blo <= j <= j+k <= bhi, and k is as large as possible.
* Simple Example
import difflib
a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100
print rate
print seq.find_longest_match(0, 5, 0, 9)
a = 'm abcd'
seq.set_seq1(a)
print seq.find_longest_match(0, 6, 0, 9)
=> output:
71.4285714286
Match(a=0, b=4, size=5)
Match(a=1, b=4, size=5)
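The four arguments limit the search to a slice of each sequence. A small sketch (Python 3 syntax, same strings as above) that restricts the search in b to its first four elements, so the leading space of a can no longer take part in the match:

```python
import difflib

a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)

# Search all of a, but only b[0:4], which is 'abcd'.
m = seq.find_longest_match(0, len(a), 0, 4)
print(m)    # Match(a=1, b=0, size=4)

# The returned triple always describes two equal slices.
slice_a = a[m.a:m.a + m.size]
slice_b = b[m.b:m.b + m.size]
print(repr(slice_a), repr(slice_b))    # 'abcd' 'abcd'
```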
* Example find_longest_match with isjunk option:
import difflib
a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)
seq2 = difflib.SequenceMatcher(lambda x: x==" ", a, b)
seq3 = difflib.SequenceMatcher(difflib.IS_LINE_JUNK, a, b)
print seq.find_longest_match(0, 5, 0, 9)
print seq2.find_longest_match(0, 5, 0, 9)
print seq3.find_longest_match(0, 5, 0, 9)
output:
Match(a=0, b=4, size=5)
Match(a=1, b=0, size=4)
Match(a=1, b=0, size=4)
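In the example above, IS_LINE_JUNK gives the same result as the lambda only because a single space looks like a blank line to it. For character sequences, IS_CHARACTER_JUNK is probably the more natural predicate. A short sketch (Python 3 syntax) of what the two built-in predicates accept:

```python
import difflib

# IS_CHARACTER_JUNK expects a single character.
print(difflib.IS_CHARACTER_JUNK(' '))     # True
print(difflib.IS_CHARACTER_JUNK('\t'))    # True
print(difflib.IS_CHARACTER_JUNK('a'))     # False

# IS_LINE_JUNK expects a whole line: blank lines and lines holding a single '#'.
print(difflib.IS_LINE_JUNK('\n'))         # True
print(difflib.IS_LINE_JUNK('  #   \n'))   # True
print(difflib.IS_LINE_JUNK('hello\n'))    # False

# Same search as above, with the character-level predicate.
seq = difflib.SequenceMatcher(difflib.IS_CHARACTER_JUNK, ' abcd', 'abcd abcd')
match = seq.find_longest_match(0, 5, 0, 9)
print(match)                              # Match(a=1, b=0, size=4)
```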
=== Get matching blocks ===
* Get matching blocks:
import difflib
a = ' abcd'
b = 'abcd abcd'
seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100
print 'matching1:'
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
a = 'abced abc'
seq.set_seq1(a)
print 'matching2:'
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
=> output:
matching1:
a[0] and b[4] match for 5 elements
a[5] and b[9] match for 0 elements
matching2:
a[0] and b[0] match for 3 elements
a[4] and b[3] match for 5 elements
a[9] and b[9] match for 0 elements
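**a[0] and b[4] match for 5 elements:** the 5 elements starting at a[0] are ' abcd' and the 5 elements starting at b[4] are ' abcd'. The last block, with 0 elements, is a dummy that always marks the end of both sequences.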
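To see what each block covers, the triples can be turned back into slices. A minimal sketch for the second pair of strings (Python 3 syntax; ''pairs'' is just an illustrative name):

```python
import difflib

a = 'abced abc'
b = 'abcd abcd'
blocks = difflib.SequenceMatcher(None, a, b).get_matching_blocks()

# Each block (i, j, n) says that a[i:i+n] == b[j:j+n].
pairs = [(a[i:i + n], b[j:j + n]) for i, j, n in blocks]
for (i, j, n), (part_a, part_b) in zip(blocks, pairs):
    print(i, j, n, repr(part_a), repr(part_b))
# 0 0 3 'abc' 'abc'
# 4 3 5 'd abc' 'd abc'
# 9 9 0 '' ''
```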
=== get_opcodes ===
import difflib
import sys
a = """ abcd
abc pq
ef abc
mn
""".splitlines(1)
b = """abcd abcd
ef
mn
""".splitlines(1)
print 'a = ', a
print 'b = ', b
seq = difflib.SequenceMatcher(None, a, b)
print '*******************************'
for tag, alo, ahi, blo, bhi in seq.get_opcodes():
    print '- ', tag, alo, ahi, blo, bhi, ':'
    print '--from:'
    for i in range(alo, ahi):
        sys.stdout.write(a[i])
    print '--to:'
    for i in range(blo, bhi):
        sys.stdout.write(b[i])
result = list(difflib.ndiff(a, b))
print '*******************************'
print 'normal diff:'
sys.stdout.writelines(result)
output:
a = [' abcd\n', 'abc pq\n', 'ef abc\n', 'mn\n']
b = ['abcd abcd\n', 'ef\n', 'mn\n']
*******************************
- replace 0 3 0 2 :
--from:
abcd
abc pq
ef abc
--to:
abcd abcd
ef
- equal 3 4 2 3 :
--from:
mn
--to:
mn
*******************************
normal diff:
- abcd
+ abcd abcd
? ++++
+ ef
- abc pq
- ef abc
mn
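Each opcode is a 5-tuple (tag, alo, ahi, blo, bhi), where the tag is one of 'replace', 'delete', 'insert' or 'equal'. One way to check the understanding of the opcodes is to follow them and rebuild b from a. The helper below is a hypothetical illustration (Python 3 syntax), not part of difflib:

```python
import difflib

a = [' abcd\n', 'abc pq\n', 'ef abc\n', 'mn\n']
b = ['abcd abcd\n', 'ef\n', 'mn\n']

def apply_opcodes(a, b):
    """Rebuild b from a by following the opcodes (illustrative helper)."""
    rebuilt = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, alo, ahi, blo, bhi in matcher.get_opcodes():
        if tag == 'equal':
            rebuilt.extend(a[alo:ahi])      # keep the unchanged part of a
        elif tag in ('replace', 'insert'):
            rebuilt.extend(b[blo:bhi])      # take the new elements from b
        # 'delete': a[alo:ahi] is dropped, nothing to add
    return rebuilt

rebuilt = apply_opcodes(a, b)
print(rebuilt == b)    # True
```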
=== Match string with multilines ===
import difflib
import sys
a = """ abcd
abc pq
ef abc
mn
"""
b = """abcd abcd
ef
mn
"""
seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100
print '*************************'
print 'rate1: ',rate
print 'longest_match1: ', seq.find_longest_match(0, 20, 0, 9)
print 'matching blocks1:'
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', a[block[0]:(block[0] + block[2])]
    print '<<<<', b[block[1]:(block[1] + block[2])]
a = a.splitlines(1)
b = b.splitlines(1)
seq2 = difflib.SequenceMatcher(None, a, b)
rate = seq2.ratio() * 100
print '*************************'
print 'rate2: ',rate
print 'longest_match2: ', seq2.find_longest_match(0, 4, 0, 3)
print 'matching blocks2:'
for block in seq2.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', a[block[0]:(block[0] + block[2])]
    print '<<<<', b[block[1]:(block[1] + block[2])]
d = difflib.Differ()
result = list(d.compare(a, b))
print 'normal diff:'
sys.stdout.writelines(result)
output:
*************************
rate1: 60.0
longest_match1: Match(a=0, b=4, size=5)
matching blocks1:
a[0] and b[4] match for 5 elements
>>>> abcd
<<<< abcd
a[13] and b[9] match for 3 elements
>>>>
ef
<<<<
ef
a[20] and b[12] match for 4 elements
>>>>
mn
<<<<
mn
a[24] and b[16] match for 0 elements
>>>>
<<<<
*************************
rate2: 28.5714285714
longest_match2: Match(a=3, b=2, size=1)
matching blocks2:
a[3] and b[2] match for 1 elements
>>>> ['mn\n']
<<<< ['mn\n']
a[4] and b[3] match for 0 elements
>>>> []
<<<< []
normal diff:
+ abcd abcd
+ ef
- abcd
- abc pq
- ef abc
mn
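The same text gives different ratios depending on whether it is compared character by character or line by line, because a line only counts as a match when it is identical as a whole. A small sketch (Python 3 syntax; the strings are spelled with explicit "\n" here, so their exact whitespace may differ from the example above):

```python
import difflib

a_text = ' abcd\nabc pq\nef abc\nmn\n'
b_text = 'abcd abcd\nef\nmn\n'

# Character level: every character is one element.
char_ratio = difflib.SequenceMatcher(None, a_text, b_text).ratio()

# Line level: every line (with its line ending kept) is one element.
a_lines = a_text.splitlines(True)
b_lines = b_text.splitlines(True)
line_ratio = difflib.SequenceMatcher(None, a_lines, b_lines).ratio()

print(len(a_lines), len(b_lines))    # 4 3
print(line_ratio)                    # only 'mn\n' matches: 2.0 * 1 / 7
print(char_ratio > line_ratio)       # True
```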
=== SequenceMatcher with files ===
import difflib
from os import path
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    doc1 = f.read()
with open(htmlfile2, 'r') as f:
    doc2 = f.read()
seq = difflib.SequenceMatcher(None, doc1, doc2)
rate = seq.ratio() * 100
print rate
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', doc1[block[0]:(block[0] + block[2])]
    print '<<<<', doc2[block[1]:(block[1] + block[2])]
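Comparing whole files character by character can be slow for large documents, since SequenceMatcher may need quadratic time in the worst case. Comparing line by line is usually cheaper. The helper below is a hypothetical sketch (Python 3 syntax); ''file_similarity'' and the temporary file names are assumptions made for the demo:

```python
import difflib
import os
import tempfile

def file_similarity(path1, path2):
    """Return the SequenceMatcher ratio (0.0 to 1.0) of two text files, line by line."""
    with open(path1, 'r') as f:
        lines1 = f.readlines()
    with open(path2, 'r') as f:
        lines2 = f.readlines()
    return difflib.SequenceMatcher(None, lines1, lines2).ratio()

# Demo with two small temporary files (made-up content).
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, 'a.txt')
path_b = os.path.join(tmpdir, 'b.txt')
with open(path_a, 'w') as f:
    f.write('one\ntwo\nthree\n')
with open(path_b, 'w') as f:
    f.write('one\n2\nthree\n')

similarity = file_similarity(path_a, path_b)
print(similarity)    # two of the three lines match: 2.0 * 2 / 6
```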
==== Finding Differences with the Differ Object ====
The Differ object uses the SequenceMatcher API for comparing:
* SequenceMatcher.get_opcodes, which Differ.compare calls directly
* SequenceMatcher.get_grouped_opcodes, which the related unified_diff and context_diff functions use
Some basic functions:
* Differ.compare:
d = difflib.Differ()
result = d.compare(a, b)
* By default, ndiff passes IS_CHARACTER_JUNK (which matches a space or a tab) as charjunk, so those characters are treated as junk when lines are compared character by character:
def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
    return Differ(linejunk, charjunk).compare(a, b)
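The relationship between ndiff and Differ can be checked directly, and a delta produced by either of them can be turned back into the original sequences with difflib.restore. A minimal sketch (Python 3 syntax, with made-up input lines):

```python
import difflib

a = ['one\n', 'two\n', 'three\n']
b = ['one\n', 'tree\n', 'emu\n']

# ndiff is a thin wrapper around Differ with charjunk=IS_CHARACTER_JUNK.
via_ndiff = list(difflib.ndiff(a, b))
via_differ = list(difflib.Differ(None, difflib.IS_CHARACTER_JUNK).compare(a, b))
print(via_ndiff == via_differ)    # True

# Every delta line starts with a two-character code:
# '- ' only in a, '+ ' only in b, '  ' in both, '? ' hint line.
codes = sorted(set(line[:2] for line in via_ndiff))
print(codes)

# restore() extracts sequence 1 or 2 again from the delta.
restored_a = list(difflib.restore(via_ndiff, 1))
restored_b = list(difflib.restore(via_ndiff, 2))
print(restored_a == a, restored_b == b)    # True True
```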
=== Simple Diff ===
import difflib
from os import path
from pprint import pprint
import sys
a = """ abcd
abc pq
ef abc
mpq
""".splitlines(1)
b = """abcd abcd
abc pq
ef
mpq
""".splitlines(1)
d = difflib.Differ()
result = list(d.compare(a, b))
print 'normal diff:'
sys.stdout.writelines(result)
print 'diff with charjunk = difflib.IS_CHARACTER_JUNK:'
result = difflib.ndiff(a, b)
sys.stdout.writelines(result)
output:
- abcd
+ abcd abcd
abc pq
- ef abc
+ ef
=== diff 2 files ===
* Compare two files with the default Differ:
import difflib
from os import path
from pprint import pprint
import sys
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    doc1 = f.read().splitlines(1)
with open(htmlfile2, 'r') as f:
    doc2 = f.read().splitlines(1)
d = difflib.Differ()
result = d.compare(doc1, doc2)
with open('compare.html', 'wb') as f:
    for line in result:
        f.write(line)
* Compare two files with ndiff, which treats the IS_CHARACTER_JUNK characters (space and tab) as junk:
import difflib
from os import path
from pprint import pprint
import sys
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    doc1 = f.read().splitlines(1)
with open(htmlfile2, 'r') as f:
    doc2 = f.read().splitlines(1)
result = difflib.ndiff(doc1, doc2)
with open('compare.html', 'wb') as f:
    for line in result:
        f.write(line)
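For long files, the ndiff output lists every line, changed or not. A unified diff shows only the changed lines plus a few lines of context. A minimal sketch with difflib.unified_diff (Python 3 syntax; the input lines and the names a.txt and b.txt are made up for the demo):

```python
import difflib
import sys

a = ['one\n', 'two\n', 'three\n']
b = ['one\n', '2\n', 'three\n']

# fromfile/tofile only label the header lines; no files are read here.
diff = list(difflib.unified_diff(a, b, fromfile='a.txt', tofile='b.txt'))
sys.stdout.writelines(diff)
# --- a.txt
# +++ b.txt
# @@ -1,3 +1,3 @@
#  one
# -two
# +2
#  three
```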
* Compare two HTML files after normalizing the whitespace around tags:
import difflib
from os import path
import re
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
def read_normalized_lines(filename):
    # Read an HTML file and normalize whitespace so that differences in
    # formatting alone do not show up in the diff.
    with open(filename, 'r') as f:
        content = f.read()
    content = re.sub(r'\s+/>', '/>', content)      # drop whitespace before "/>"
    content = re.sub(r'\s+>', '>', content)        # drop whitespace before ">"
    content = re.sub(r'>\s+<', '>\n<', content)    # put each tag on its own line
    content = re.sub(r'\s*\n\s*', '\n', content)   # trim whitespace around newlines
    return content.splitlines(1)
doc1 = read_normalized_lines(htmlfile1)
doc2 = read_normalized_lines(htmlfile2)
result = difflib.ndiff(doc1, doc2)
with open('compare.html', 'wb') as f:
    for line in result:
        f.write(line)
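The examples above write a plain-text delta into compare.html. If an actual HTML report is wanted, difflib also provides the HtmlDiff class, which renders a side-by-side table. A minimal sketch (Python 3 syntax; the input lines, descriptions and output location are assumptions for the demo):

```python
import difflib
import os
import tempfile

a = ['one\n', 'two\n', 'three\n']
b = ['one\n', '2\n', 'three\n']

# make_file() returns a complete HTML document as a string.
html = difflib.HtmlDiff().make_file(a, b, fromdesc='a.txt', todesc='b.txt')

# Write the report to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'compare.html')
with open(out_path, 'w') as f:
    f.write(html)
print(out_path)
```

Opening the resulting file in a browser shows both versions next to each other with the changed parts highlighted.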
==== lxml.html.diff for comparing HTML files ====
lxml.html.diff uses two basic libraries:
* difflib for comparing 2 files
* etree for parsing HTML
Examples for lxml.html.diff:
* Simple diff:
from os import path
import sys, re
from lxml.html import diff, etree, HTMLParser
import codecs
import StringIO
doc1 = ''''''
doc2 = ''''''
diffcontent = diff.htmldiff(doc1, doc2)
diffcontent = codecs.encode(diffcontent, 'utf-8')
print diffcontent
output:
* diff 2 HTML files:
from os import path
import sys, re
from lxml.html import diff
import codecs
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    content = f.read()
doc1 = content
with open(htmlfile2, 'r') as f:
    content = f.read()
doc2 = content
diffcontent = diff.htmldiff(doc1, doc2)
diffcontent = codecs.encode(diffcontent, 'utf-8')
print diffcontent
===== filecmp =====
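The filecmp module answers the simpler question of whether files or directories are equal, without producing a delta. A minimal sketch (Python 3 syntax) using filecmp.cmp and filecmp.dircmp; the directory layout and file names are made up for the demo:

```python
import filecmp
import os
import tempfile

# Build two small directories with one identical and one different file.
dir1 = tempfile.mkdtemp()
dir2 = tempfile.mkdtemp()
for directory, text in ((dir1, 'short\n'), (dir2, 'a longer text\n')):
    with open(os.path.join(directory, 'same.txt'), 'w') as f:
        f.write('identical\n')
    with open(os.path.join(directory, 'changed.txt'), 'w') as f:
        f.write(text)

# shallow=False compares the file contents, not only the os.stat() signature.
same = filecmp.cmp(os.path.join(dir1, 'same.txt'),
                   os.path.join(dir2, 'same.txt'), shallow=False)
changed = filecmp.cmp(os.path.join(dir1, 'changed.txt'),
                      os.path.join(dir2, 'changed.txt'), shallow=False)
print(same, changed)    # True False

# dircmp compares the files that the two directories have in common.
comparison = filecmp.dircmp(dir1, dir2)
print(comparison.same_files)    # ['same.txt']
print(comparison.diff_files)    # ['changed.txt']
```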