User Tools

Site Tools


python:compare

This is an old revision of the document!


Python Compare

This module Difflib provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also, the filecmp module.

Difflib

Finding Matching String with SequenceMatcher

class difflib.SequenceMatcher The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk). The most basic functions:

  • find_longest_match
  • get_matching_blocks

The function get_opcodes using above these functions for parsing

match ratio

  • Calculate match ratio of two strings:
    import difflib
     
    a = ' abcd'
    b = 'abcd abcd'
     
    seq = difflib.SequenceMatcher(None, a, b)
    rate = seq.ratio() * 100
     
    print rate

    ⇒ output:

    71.4285714286

longest match ratio

  • Find substring with longest match ratio:
    • Syntax:
      find_longest_match(alo, ahi, blo, bhi)

      Find longest matching block in a[alo:ahi] and b[blo:bhi].(lo: low, hi: high). Returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k] with longest match ratio

    • Simple Example
      import difflib
       
      a = ' abcd'
      b = 'abcd abcd'
       
      seq = difflib.SequenceMatcher(None, a, b)
      rate = seq.ratio() * 100
       
      print rate
      print seq.find_longest_match(0, 5, 0, 9)
      a = 'm abcd'
      seq.set_seq1(a)
      print seq.find_longest_match(0, 6, 0, 9)

      ⇒ output:

      71.4285714286
      Match(a=0, b=4, size=5)
      Match(a=1, b=4, size=5)
    • Example find_longest_match with isjunk option:
      import difflib
       
      a = ' abcd'
      b = 'abcd abcd'
       
      seq = difflib.SequenceMatcher(None, a, b)
      seq2 = difflib.SequenceMatcher(lambda x: x==" ", a, b)
      seq3 = difflib.SequenceMatcher(difflib.IS_LINE_JUNK, a, b)
       
      print seq.find_longest_match(0, 5, 0, 9)
      print seq2.find_longest_match(0, 5, 0, 9)
      print seq3.find_longest_match(0, 5, 0, 9)

      output:

      Match(a=0, b=4, size=5)
      Match(a=1, b=0, size=4)
      Match(a=1, b=0, size=4)

Get matching blocks

  • Get matching blocks:
    import difflib
     
    a = ' abcd'
    b = 'abcd abcd'
     
    seq = difflib.SequenceMatcher(None, a, b)
    rate = seq.ratio() * 100
     
    print 'matching1:'
    for block in seq.get_matching_blocks():
        print "a[%d] and b[%d] match for %d elements" % block
    a = 'abced abc'
    seq.set_seq1(a)
    print 'matching2:'
    for block in seq.get_matching_blocks():
        print "a[%d] and b[%d] match for %d elements" % block

    ⇒ output:

    matching1:
    a[0] and b[4] match for 5 elements
    a[5] and b[9] match for 0 elements
    matching2:
    a[0] and b[0] match for 3 elements
    a[4] and b[3] match for 5 elements
    a[9] and b[9] match for 0 elements

    a[0] and b[4] match for 5 elements: 5 elements from a[0] are ' abcd' and 5 elements from b[9] are ' abcd'

Match string with multilines

import difflib
import sys
 
a = """ abcd 
abc pq
ef abc
mn
"""
b = """abcd abcd
ef
mn
"""
 
seq = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100
print '*************************'
print 'rate1: ',rate
print 'longest_match1: ', seq.find_longest_match(0, 20, 0, 9)
print 'matching blocks1:'
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', a[block[0]:(block[0] + block[2])]
    print '<<<<', b[block[1]:(block[1] + block[2])]
 
a = a.splitlines(1)
b = b.splitlines(1)
seq2 = difflib.SequenceMatcher(None, a, b)
rate = seq.ratio() * 100
print '*************************'
print 'matching blocks2:'
for block in seq2.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', a[block[0]:(block[0] + block[2])]
    print '<<<<', b[block[1]:(block[1] + block[2])]
 
 
d = difflib.Differ()
result = list(d.compare(a, b))
print 'normal diff:'
sys.stdout.writelines(result)

output:

*************************
rate1:  60.0
longest_match1:  Match(a=0, b=4, size=5)
matching blocks1:
a[0] and b[4] match for 5 elements
>>>>  abcd
<<<<  abcd
a[13] and b[9] match for 3 elements
>>>>
ef
<<<<
ef
a[20] and b[12] match for 4 elements
>>>>
mn

<<<<
mn

a[24] and b[16] match for 0 elements
>>>>
<<<<
*************************
matching blocks2:
a[3] and b[2] match for 1 elements
>>>> ['mn\n']
<<<< ['mn\n']
a[4] and b[3] match for 0 elements
>>>> []
<<<< []
normal diff:
+ abcd abcd
+ ef
-  abcd
- abc pq
- ef abc
  mn

SequenceMatcher with files

import difflib
from os import path 
 
INPUT_DIR = 'opencart_47066'
htmlfile1 = path.join(INPUT_DIR, 'index.html')
htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
with open(htmlfile1, 'r') as f:
    doc1 = f.read()
with open(htmlfile2, 'r') as f:
    doc2 = f.read()    
seq = difflib.SequenceMatcher(None, doc1, doc2)
rate = seq.ratio() * 100
print rate
for block in seq.get_matching_blocks():
    print "a[%d] and b[%d] match for %d elements" % block
    print '>>>>', doc1[block[0]:(block[0] + block[2])]
    print '<<<<', doc2[block[1]:(block[1] + block[2])]

Finding Diffing String with Differ Object

Differ object using APIs of SequenceMatcher for comparing:

  • SequenceMatcher.get_opcodes
  • And SequenceMatcher.get_grouped_opcodes

Understand some basic function:

  • Differ.compare:
    d = difflib.Differ()
    result = d.compare(a, b)
  • ndiff default will ignore characters IS_CHARACTER_JUNK(' \t') when comparing :
    def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
        return Differ(linejunk, charjunk).compare(a, b)

Simple Diff

import difflib
from os import path
from pprint import pprint
import sys 
 
a = """ abcd 
abc pq
ef abc
 mpq
""".splitlines(1)
b = """abcd abcd
abc pq
ef
mpq
""".splitlines(1)
 
d = difflib.Differ()
result = list(d.compare(a, b))
print 'normal diff:'
sys.stdout.writelines(result)
 
print 'diff with charjunk = difflib.IS_CHARACTER_JUNK:'
result = difflib.ndiff(a, b)
sys.stdout.writelines(result)

output:

-  abcd
+ abcd abcd
  abc pq
- ef abc
+ ef

diff 2 files

  • compare 2 files normal:
    import difflib
    from os import path
    from pprint import pprint
    import sys 
     
    INPUT_DIR = 'opencart_47066'
    htmlfile1 = path.join(INPUT_DIR, 'index.html')
    htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
    with open(htmlfile1, 'r') as f:
        doc1 = f.read().splitlines(1)
    with open(htmlfile2, 'r') as f:
        doc2 = f.read().splitlines(1)
     
    d = difflib.Differ()
    result = d.compare(doc1, doc2)
    with open('compare.html', 'wb') as f:
        for line in result:
            f.writelines(line)
  • compare ignore characters IS_CHARACTER_JUNK(' \t') when comparing
    import difflib
    from os import path
    from pprint import pprint
    import sys 
     
    INPUT_DIR = 'opencart_47066'
    htmlfile1 = path.join(INPUT_DIR, 'index.html')
    htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
    with open(htmlfile1, 'r') as f:
        doc1 = f.read().splitlines(1)
    with open(htmlfile2, 'r') as f:
        doc2 = f.read().splitlines(1)
     
    result = difflib.ndiff(doc1, doc2)
    with open('compare.html', 'wb') as f:
        for line in result:
            f.writelines(line)
  • Compare 2 html files:
    import difflib
    from os import path
    from pprint import pprint
    import sys, re 
     
    INPUT_DIR = 'opencart_47066'
    htmlfile1 = path.join(INPUT_DIR, 'index.html')
    htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
    with open(htmlfile1, 'r') as f:
        content = f.read()
        content = re.sub('[\s\t]+/>', '/>', content)
        content = re.sub('[\s\t]+>', '>', content)
        content = re.sub('>[\s\t]+<', '>\n<', content)
        content = re.sub('[\s\t]*\n[\s\t]*', '\n', content)
        doc1 = content.splitlines(1)
    with open(htmlfile2, 'r') as f:
        content = f.read()
        content = re.sub('[\s\t]+/>', '/>', content)
        content = re.sub('[\s\t]+>', '>', content)
        content = re.sub('>[\s\t]+<', '>\n<', content)
        content = re.sub('[\s\t]*\n[\s\t]*', '\n', content)
        doc2 = content.splitlines(1)
     
    result = difflib.ndiff(doc1, doc2)
    with open('compare.html', 'wb') as f:
        for line in result:
            f.writelines(line)

lxml.html.diff for comparing HTML files

xml.html.diff using 2 basic libraries:

  • difflib for comparing 2 files
  • etree for parsing HTML

Examples for lxml.html.diff:

  • Simple diff HTML:
    from os import path
    import sys, re
    from lxml.html import diff
    import codecs
     
    INPUT_DIR = 'opencart_47066'
    htmlfile1 = path.join(INPUT_DIR, 'index.html')
    htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html')
    with open(htmlfile1, 'r') as f:
        content = f.read()
        doc1 = content
    with open(htmlfile2, 'r') as f:
        content = f.read()
        doc2 = content
    diffcontent = diff.htmldiff(doc1, doc2)
    diffcontent = codecs.encode(diffcontent, 'utf-8')
    print diffcontent 

filecmp

python/compare.1406012835.txt.gz · Last modified: 2022/10/29 16:15 (external edit)