Twisted, lxml and re packages

Twisted Package

Twisted is an event-driven networking engine written in Python and licensed under the open source MIT license.

Refer:

reactor: core of event loop

The event loop at the core of your program.

Some basic functions of the core reactor, defined in twisted/internet/base.py:

event dispatching: Scheduling and Deferreds

Scheduling

Timeouts, repeated events, and more: when you want things to happen later.

Deferreds

Like callback functions, only a lot better. Deferreds are Twisted’s preferred mechanism for controlling the flow of asynchronous code: they give us a way of saying “do this only when that has finished”.

Below are some basic functions of the defer API:

communication protocol

TCP servers, TCP clients, UDP networking and using processes

using Threads

lxml package

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt:

The two most basic classes in the lxml package for parsing XML and HTML:

Parsing xml and html to Etree Object

refer: http://lxml.de/parsing.html
etree.parse returns an lxml.etree._ElementTree object.
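A short sketch of the difference between etree.parse (returns a tree) and etree.fromstring (returns the root element), plus HTML parsing through etree.HTMLParser. The sample markup is invented for illustration.

```python
from io import BytesIO
from lxml import etree

xml_bytes = b"<root><item id='1'>first</item><item id='2'>second</item></root>"

# parse() takes a filename or file-like object and returns an _ElementTree.
tree = etree.parse(BytesIO(xml_bytes))
root = tree.getroot()                    # the root _Element of the tree

# fromstring() takes a string/bytes and returns the root _Element directly.
same_root = etree.fromstring(xml_bytes)

# Broken or partial HTML can be parsed by passing an HTMLParser;
# the missing <html> and <body> wrappers are added automatically.
html_root = etree.fromstring(b"<p>Hello<br>world", etree.HTMLParser())
```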

XPath for xml and html

XPath for python:

XPath Syntax

refer: http://www.w3schools.com/XPath/xpath_syntax.asp

XPath Expressions:

XPath function return:

Examples
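A few hedged examples of the xpath() method on an lxml element, showing that the return type depends on the expression: a list of elements, a list of strings, a float or a bool. The sample document is invented for illustration.

```python
from lxml import etree

root = etree.fromstring(
    b"<library>"
    b"<book year='2001'><title>A</title></book>"
    b"<book year='2010'><title>B</title></book>"
    b"</library>"
)

# text() and @attribute steps return lists of strings.
titles = root.xpath("//book/title/text()")
years = root.xpath("//book/@year")

# A node-selecting expression returns a list of elements.
recent = root.xpath("//book[@year > 2005]")

# XPath functions return Python scalars: count() a float, boolean() a bool.
count = root.xpath("count(//book)")
has_old = root.xpath("boolean(//book[@year < 2005])")
```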

lxml.html

HTMLParser

Create tree or root Element with lxml.html

refer: http://lxml.de/lxmlhtml.html
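A minimal sketch of getting a root element from HTML text with lxml.html. The page content is made up; for a file or URL, lxml.html.parse returns a tree object instead, from which getroot() gives the root element.

```python
import lxml.html

page = (
    "<html><body><h1>News</h1>"
    "<a href='/a'>First</a> <a href='/b'>Second</a>"
    "</body></html>"
)

# fromstring() on a full document returns the <html> root element.
root = lxml.html.fromstring(page)
links = root.xpath("//a/@href")
heading = root.findtext(".//h1")

# fragment_fromstring() parses a single element without html/body wrappers.
fragment = lxml.html.fragment_fromstring("<div><p>partial markup</p></div>")
text = fragment.text_content()
```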

Build xml using Etree
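A small sketch of building a document with etree.Element and etree.SubElement and serializing it with etree.tostring. The element names and values are illustrative.

```python
from lxml import etree

# Create the root element, then attach children to it.
root = etree.Element("catalog")
book = etree.SubElement(root, "book", id="b1")   # keyword args become attributes
title = etree.SubElement(book, "title")
title.text = "Twisted Basics"

# Serialize to bytes, with an XML declaration and indentation.
xml_bytes = etree.tostring(
    root, pretty_print=True, xml_declaration=True, encoding="UTF-8"
)
```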

Custom Functions

re Package(Regular Expression)

To use the re package, we first need to import it:

import re

Regular Expression Language

A regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern.
refer:

Match Character

Anchors: Anchors cause a match to succeed or fail depending on the current position in the string.
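A quick sketch of the common anchors in Python's re syntax: ^ (start), $ (end) and \b (word boundary). The sample text is invented.

```python
import re

text = "cat concat\ncat"

# ^ matches only at the start of the string by default.
at_start = re.findall(r"^cat", text)

# With re.MULTILINE, ^ also matches at the start of each line.
at_line_starts = re.findall(r"^cat", text, re.MULTILINE)

# $ matches at the end of the string.
at_end = re.findall(r"cat$", text)

# \b matches at a word boundary, so the "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", text)
```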

Grouping constructs: Grouping constructs delineate subexpressions of a regular expression and typically capture substrings of an input string.
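A sketch of the three grouping forms used most often in Python: capturing groups (...), named groups (?P<name>...) and non-capturing groups (?:...). The sample strings are illustrative.

```python
import re

# Capturing groups: each (...) captures a substring, numbered from 1.
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released 2015-03-09")
whole = date.group(0)     # the entire match
parts = date.groups()     # all captured substrings as a tuple

# Named groups: captured substrings can be looked up by name.
addr = re.search(r"(?P<user>[\w.-]+)@(?P<host>[\w.-]+)", "alice@example.com")
user = addr.group("user")
fields = addr.groupdict()

# Non-capturing group: groups "ab" for the quantifier without capturing it,
# so findall() returns the whole matches.
repeats = re.findall(r"(?:ab)+", "ababab ab")
```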

Quantifiers: A quantifier specifies how many instances of the previous element (which can be a character, a group, or a character class) must be present in the input string for a match to occur.
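A few illustrative quantifiers in Python's re syntax: + (one or more), ? (zero or one), {m,n} (between m and n), and the lazy form +? that matches as little as possible. The sample strings are made up.

```python
import re

# + : one or more of the previous element
one_or_more = re.findall(r"go+gle", "ggle gogle google gooogle")

# ? : zero or one of the previous element
optional = re.findall(r"colou?r", "color colour")

# {m,n} : at least m and at most n repetitions
two_or_three_digits = re.findall(r"\b\d{2,3}\b", "1 22 333 4444")

# Quantifiers are greedy by default; adding ? makes them lazy.
greedy = re.findall(r"<.+>", "<a><b>")
lazy = re.findall(r"<.+?>", "<a><b>")
```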

re Package

refer: http://www.pythonforbeginners.com/regex/regular-expressions-in-python

re Flags:

# flags
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode locale
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
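A short sketch of how the flags above change matching. Flags are passed as the last argument and combined with the | operator. The sample text and the phone-like pattern are invented for illustration.

```python
import re

text = "Python\npython rocks\nPYTHON"

# IGNORECASE: letters match regardless of case.
any_case = re.findall(r"python", text, re.IGNORECASE)

# MULTILINE: ^ and $ also match at line boundaries. Flags combine with |.
line_starts = re.findall(r"^python", text, re.IGNORECASE | re.MULTILINE)

# DOTALL: "." also matches a newline.
without_dotall = re.search(r"Python.python", text)
with_dotall = re.search(r"Python.python", text, re.DOTALL)

# VERBOSE: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", re.VERBOSE)
number_match = number.search("call 555-1234 now")
```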

re.findall

findall: findall() is probably the single most powerful function in the re module. It returns a list of all non-overlapping matches of a pattern in a string.

  1. Example 1:
    # The addresses below are placeholders
    text = 'purple alice@example.com, blah monkey bob@example.org blah dishwasher'

    ## Here re.findall() returns a list of all the found email strings
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', text) ## ['alice@example.com', 'bob@example.org']

    for email in emails:
        # do something with each found email string
        print(email)

    Understanding the pattern syntax above:
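One way to read the pattern [\w\.-]+@[\w\.-]+ is to write it out with re.VERBOSE and a comment for each piece. This is only a sketch; the name EMAIL and the placeholder addresses are not from the original example.

```python
import re

EMAIL = re.compile(r"""
    [\w.-]+   # local part: one or more word characters, dots or hyphens
    @         # a literal at sign
    [\w.-]+   # domain: one or more word characters, dots or hyphens
""", re.VERBOSE)

sample = "purple alice@example.com, blah monkey bob@example.org blah dishwasher"
found = EMAIL.findall(sample)
```

Inside a character class the dot needs no backslash, so [\w.-] and [\w\.-] are equivalent. The pattern is deliberately loose: it would also accept a trailing dot or hyphen, so it is a quick extraction tool rather than an email validator.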

  2. Example 2:
    # Open the file; the with block closes it automatically
    with open('test.txt', 'r') as f:
        # Feed the file text into findall(); it returns a list of all the found strings
        strings = re.findall(r'some pattern', f.read())

re.search, re.match
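A minimal sketch of the difference between the two: re.match only tries to match at the beginning of the string, while re.search scans the whole string for the first match. Both return a match object, or None when nothing matches. The sample text is illustrative.

```python
import re

text = "order 66 shipped"

# match() fails here because the string does not start with a digit.
at_beginning = re.match(r"\d+", text)

# search() finds the first match anywhere in the string.
found = re.search(r"\d+", text)
number = found.group()     # the matched text
position = found.span()    # (start, end) indexes of the match

# match() succeeds when the pattern fits at position 0.
prefix = re.match(r"order", text)
```

Because both can return None, the result is usually checked with an if before calling group().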

re.sub and re.compile
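A short sketch: re.compile turns a pattern into a reusable pattern object, and sub replaces every match with a replacement string or with the result of a function. The sample strings are made up.

```python
import re

# Compile once, reuse many times.
spaces = re.compile(r"\s+")
collapsed = spaces.sub(" ", "a   b \t c")

# Backreferences (\1, \2, ...) in the replacement refer to captured groups.
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
reordered = date.sub(r"\3/\2/\1", "2015-03-09")

# The replacement can also be a function that receives each match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

# count limits how many matches are replaced.
first_only = spaces.sub("-", "a b c", count=1)
```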

re.split
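A few illustrative uses of re.split, which splits a string wherever the pattern matches. The sample strings are invented.

```python
import re

# Split on a comma or semicolon followed by optional whitespace.
fields = re.split(r"[,;]\s*", "a, b;c,  d")

# If the pattern contains a capturing group, the separators are kept.
with_separators = re.split(r"(,)", "a,b")

# maxsplit limits the number of splits.
head_and_rest = re.split(r"\s+", "one two  three", maxsplit=1)
```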