This is an old revision of the document!
Table of Contents
Python Compare
This module Difflib provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also, the filecmp module.
Difflib
Finding Matching String with SequenceMatcher
class difflib.SequenceMatcher The idea is to find the longest contiguous matching subsequence that contains no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk). The most basic functions:
- find_longest_match
- get_matching_blocks
The function get_opcodes using above these functions for parsing
match ratio
- Calculate match ratio of two strings:
import difflib a = ' abcd' b = 'abcd abcd' seq = difflib.SequenceMatcher(None, a, b) rate = seq.ratio() * 100 print rate
⇒ output:
71.4285714286
longest match ratio
- Find substring with longest match ratio:
- Syntax:
find_longest_match(alo, ahi, blo, bhi)
Find longest matching block in a[alo:ahi] and b[blo:bhi].(lo: low, hi: high). Returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k] with longest match ratio
- Simple Example
import difflib a = ' abcd' b = 'abcd abcd' seq = difflib.SequenceMatcher(None, a, b) rate = seq.ratio() * 100 print rate print seq.find_longest_match(0, 5, 0, 9) a = 'm abcd' seq.set_seq1(a) print seq.find_longest_match(0, 6, 0, 9)
⇒ output:
71.4285714286 Match(a=0, b=4, size=5) Match(a=1, b=4, size=5)
- Example find_longest_match with isjunk option:
import difflib a = ' abcd' b = 'abcd abcd' seq = difflib.SequenceMatcher(None, a, b) seq2 = difflib.SequenceMatcher(lambda x: x==" ", a, b) seq3 = difflib.SequenceMatcher(difflib.IS_LINE_JUNK, a, b) print seq.find_longest_match(0, 5, 0, 9) print seq2.find_longest_match(0, 5, 0, 9) print seq3.find_longest_match(0, 5, 0, 9)
output:
Match(a=0, b=4, size=5) Match(a=1, b=0, size=4) Match(a=1, b=0, size=4)
Get matching blocks
- Get matching blocks:
import difflib a = ' abcd' b = 'abcd abcd' seq = difflib.SequenceMatcher(None, a, b) rate = seq.ratio() * 100 print 'matching1:' for block in seq.get_matching_blocks(): print "a[%d] and b[%d] match for %d elements" % block a = 'abced abc' seq.set_seq1(a) print 'matching2:' for block in seq.get_matching_blocks(): print "a[%d] and b[%d] match for %d elements" % block
⇒ output:
matching1: a[0] and b[4] match for 5 elements a[5] and b[9] match for 0 elements matching2: a[0] and b[0] match for 3 elements a[4] and b[3] match for 5 elements a[9] and b[9] match for 0 elements
a[0] and b[4] match for 5 elements: 5 elements from a[0] are ' abcd' and 5 elements from b[9] are ' abcd'
Match string with multilines
import difflib import sys a = """ abcd abc pq ef abc mn """ b = """abcd abcd ef mn """ seq = difflib.SequenceMatcher(None, a, b) rate = seq.ratio() * 100 print '*************************' print 'rate1: ',rate print 'longest_match1: ', seq.find_longest_match(0, 20, 0, 9) print 'matching blocks1:' for block in seq.get_matching_blocks(): print "a[%d] and b[%d] match for %d elements" % block print '>>>>', a[block[0]:(block[0] + block[2])] print '<<<<', b[block[1]:(block[1] + block[2])] a = a.splitlines(1) b = b.splitlines(1) seq2 = difflib.SequenceMatcher(None, a, b) rate = seq.ratio() * 100 print '*************************' print 'matching blocks2:' for block in seq2.get_matching_blocks(): print "a[%d] and b[%d] match for %d elements" % block print '>>>>', a[block[0]:(block[0] + block[2])] print '<<<<', b[block[1]:(block[1] + block[2])] d = difflib.Differ() result = list(d.compare(a, b)) print 'normal diff:' sys.stdout.writelines(result)
output:
************************* rate1: 60.0 longest_match1: Match(a=0, b=4, size=5) matching blocks1: a[0] and b[4] match for 5 elements >>>> abcd <<<< abcd a[13] and b[9] match for 3 elements >>>> ef <<<< ef a[20] and b[12] match for 4 elements >>>> mn <<<< mn a[24] and b[16] match for 0 elements >>>> <<<< ************************* matching blocks2: a[3] and b[2] match for 1 elements >>>> ['mn\n'] <<<< ['mn\n'] a[4] and b[3] match for 0 elements >>>> [] <<<< [] normal diff: + abcd abcd + ef - abcd - abc pq - ef abc mn
SequenceMatcher with files
import difflib from os import path INPUT_DIR = 'opencart_47066' htmlfile1 = path.join(INPUT_DIR, 'index.html') htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html') with open(htmlfile1, 'r') as f: doc1 = f.read() with open(htmlfile2, 'r') as f: doc2 = f.read() seq = difflib.SequenceMatcher(None, doc1, doc2) rate = seq.ratio() * 100 print rate for block in seq.get_matching_blocks(): print "a[%d] and b[%d] match for %d elements" % block print '>>>>', doc1[block[0]:(block[0] + block[2])] print '<<<<', doc2[block[1]:(block[1] + block[2])]
Finding Diffing String with Differ Object
Differ object using APIs of SequenceMatcher for comparing:
- SequenceMatcher.get_opcodes
- And SequenceMatcher.get_grouped_opcodes
Understand some basic function:
- Differ.compare:
d = difflib.Differ() result = d.compare(a, b)
- ndiff default will ignore characters IS_CHARACTER_JUNK(' \t') when comparing :
def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK): return Differ(linejunk, charjunk).compare(a, b)
Simple Diff
import difflib from os import path from pprint import pprint import sys a = """ abcd abc pq ef abc mpq """.splitlines(1) b = """abcd abcd abc pq ef mpq """.splitlines(1) d = difflib.Differ() result = list(d.compare(a, b)) print 'normal diff:' sys.stdout.writelines(result) print 'diff with charjunk = difflib.IS_CHARACTER_JUNK:' result = difflib.ndiff(a, b) sys.stdout.writelines(result)
output:
- abcd + abcd abcd abc pq - ef abc + ef
diff 2 files
- compare 2 files normal:
import difflib from os import path from pprint import pprint import sys INPUT_DIR = 'opencart_47066' htmlfile1 = path.join(INPUT_DIR, 'index.html') htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html') with open(htmlfile1, 'r') as f: doc1 = f.read().splitlines(1) with open(htmlfile2, 'r') as f: doc2 = f.read().splitlines(1) d = difflib.Differ() result = d.compare(doc1, doc2) with open('compare.html', 'wb') as f: for line in result: f.writelines(line)
- compare ignore characters IS_CHARACTER_JUNK(' \t') when comparing
import difflib from os import path from pprint import pprint import sys INPUT_DIR = 'opencart_47066' htmlfile1 = path.join(INPUT_DIR, 'index.html') htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html') with open(htmlfile1, 'r') as f: doc1 = f.read().splitlines(1) with open(htmlfile2, 'r') as f: doc2 = f.read().splitlines(1) result = difflib.ndiff(doc1, doc2) with open('compare.html', 'wb') as f: for line in result: f.writelines(line)
- Compare 2 html files:
import difflib from os import path from pprint import pprint import sys, re INPUT_DIR = 'opencart_47066' htmlfile1 = path.join(INPUT_DIR, 'index.html') htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html') with open(htmlfile1, 'r') as f: content = f.read() content = re.sub('[\s\t]+/>', '/>', content) content = re.sub('[\s\t]+>', '>', content) content = re.sub('>[\s\t]+<', '>\n<', content) content = re.sub('[\s\t]*\n[\s\t]*', '\n', content) doc1 = content.splitlines(1) with open(htmlfile2, 'r') as f: content = f.read() content = re.sub('[\s\t]+/>', '/>', content) content = re.sub('[\s\t]+>', '>', content) content = re.sub('>[\s\t]+<', '>\n<', content) content = re.sub('[\s\t]*\n[\s\t]*', '\n', content) doc2 = content.splitlines(1) result = difflib.ndiff(doc1, doc2) with open('compare.html', 'wb') as f: for line in result: f.writelines(line)
lxml.html.diff for comparing HTML files
xml.html.diff using 2 basic libraries:
- difflib for comparing 2 files
- etree for parsing HTML
Examples for lxml.html.diff:
- Simple diff HTML:
from os import path import sys, re from lxml.html import diff import codecs INPUT_DIR = 'opencart_47066' htmlfile1 = path.join(INPUT_DIR, 'index.html') htmlfile2 = path.join(INPUT_DIR, 'index.php@route=account%2Flogin.html') with open(htmlfile1, 'r') as f: content = f.read() doc1 = content with open(htmlfile2, 'r') as f: content = f.read() doc2 = content diffcontent = diff.htmldiff(doc1, doc2) diffcontent = codecs.encode(diffcontent, 'utf-8') print diffcontent