====== Twisted, lxml and re package ======
===== Twisted Package =====
Twisted is an **event-driven networking engine** written in Python and licensed under the open source MIT license:
  * As a platform, **Twisted should be focused on integration**. Ideally, all functionality will be accessible through all protocols. Failing that, all functionality should be configurable through at least one protocol, with a seamless and consistent user-interface.
  * The next phase of development will focus strongly on a **configuration system which will unify many disparate pieces of the current infrastructure**, and **allow them to be tacked together by a non-programmer**.
Refer:
  * http://twistedmatrix.com/documents/current/core/
  * http://krondo.com/wp-content/uploads/2009/08/twisted-intro.html
==== reactor: core of event loop ====
The reactor is the event loop at the core of your program.
  * reactor basics: http://twistedmatrix.com/documents/current/core/howto/reactor-basics.html
  * reactor core: https://twistedmatrix.com/documents/14.0.0/api/twisted.internet.interfaces.IReactorCore.html
Some basic functions of the core reactor in **twisted/internet/base.py**:
  * class reactor init<code python>
class ReactorBase(object):
    def __init__(self):
        self.threadCallQueue = []
        self._eventTriggers = {}
        self._pendingTimedCalls = []
        self._newTimedCalls = []
        self._cancellations = 0
        self.running = False
        self._started = False
        self._justStopped = False
        self._startedBefore = False
        # reactor internal readers, e.g. the waker.
        self._internalReaders = set()
        self.waker = None

        # Arrange for the running attribute to change to True at the right time
        # and let a subclass possibly do other things at that time (eg install
        # signal handlers).
        self.addSystemEventTrigger(
            'during', 'startup', self._reallyStartRunning)
        self.addSystemEventTrigger('during', 'shutdown', self.crash)
        self.addSystemEventTrigger('during', 'shutdown', self.disconnectAll)

        if platform.supportsThreads():
            self._initThreads()
        self.installWaker()
</code>

  * class reactor run: **fires the 'startup' system events**, moves the reactor **to the 'running' state**, then runs the main loop **until it is stopped with stop() or crash()**<code python>
@implementer(IReactorCore, IReactorTime, IReactorPluggableResolver)
class ReactorBase(object):
    def fireSystemEvent(self, eventType):
        """See twisted.internet.interfaces.IReactorCore.fireSystemEvent.
        """
        event = self._eventTriggers.get(eventType)
        if event is not None:
            event.fireEvent()

    def startRunning(self):
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
            raise error.ReactorNotRestartable()
        self._started = True
        self._stopped = False
        if self._registerAsIOThread:
            threadable.registerAsIOThread()
        self.fireSystemEvent('startup')

class _SignalReactorMixin(object):
    def run(self, installSignalHandlers=True):
        self.startRunning(installSignalHandlers=installSignalHandlers)
        self.mainLoop()

    def mainLoop(self):
        while self._started:
            try:
                while self._started:
                    # Advance simulation time in delayed event
                    # processors.
                    self.runUntilCurrent()
                    t2 = self.timeout()
                    t = self.running and t2
                    self.doIteration(t)
            except:
                log.msg("Unexpected error in main loop.")
                log.err()
            else:
                log.msg('Main loop terminated.')

    def runUntilCurrent(self):
        """Run all pending timed calls.
        """
        if self.threadCallQueue:
            # Keep track of how many calls we actually make, as we're
            # making them, in case another call is added to the queue
            # while we're in this loop.
            count = 0
            total = len(self.threadCallQueue)
            for (f, a, kw) in self.threadCallQueue:
                try:
                    f(*a, **kw)
                except:
                    log.err()
                count += 1
                if count == total:
                    break
            del self.threadCallQueue[:count]
            if self.threadCallQueue:
                self.wakeUp()

        # insert new delayed calls now
        self._insertNewDelayedCalls()

        now = self.seconds()
        while self._pendingTimedCalls and (self._pendingTimedCalls[0].time <= now):
            call = heappop(self._pendingTimedCalls)
            if call.cancelled:
                self._cancellations-=1
                continue

            if call.delayed_time > 0:
                call.activate_delay()
                heappush(self._pendingTimedCalls, call)
                continue

            try:
                call.called = 1
                call.func(*call.args, **call.kw)
            except:
                log.deferr()
                if hasattr(call, "creator"):
                    e = "\n"
                    e += " C: previous exception occurred in " + \
                         "a DelayedCall created here:\n"
                    e += " C:"
                    e += "".join(call.creator).rstrip().replace("\n","\n C:")
                    e += "\n"
                    log.msg(e)


        if (self._cancellations > 50 and
             self._cancellations > len(self._pendingTimedCalls) >> 1):
            self._cancellations = 0
            self._pendingTimedCalls = [x for x in self._pendingTimedCalls
                                       if not x.cancelled]
            heapify(self._pendingTimedCalls)

        if self._justStopped:
            self._justStopped = False
            self.fireSystemEvent("shutdown")
</code>

  * class reactor callLater<code python>
    def callLater(self, _seconds, _f, *args, **kw):
        """See twisted.internet.interfaces.IReactorTime.callLater.
        """
        assert callable(_f), "%s is not callable" % _f
        assert _seconds >= 0, \
               "%s is not greater than or equal to 0 seconds" % (_seconds,)
        tple = DelayedCall(self.seconds() + _seconds, _f, args, kw,
                           self._cancelCallLater,
                           self._moveCallLaterSooner,
                           seconds=self.seconds)
        self._newTimedCalls.append(tple)
        return tple
</code>
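How callLater and runUntilCurrent cooperate can be sketched with a toy scheduler. This is a simplified illustration and not Twisted's code: the class ''ToyScheduler'', its method names and the explicit ''now'' parameter are hypothetical, cancellation and thread calls are left out, and the snippet is written for Python 3.<code python>
import heapq
import itertools

class ToyScheduler(object):
    """Hypothetical, simplified stand-in for the reactor's timed-call queue."""

    def __init__(self):
        self._pending = []                 # heap of (time, sequence, func, args)
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def call_later(self, now, delay, func, *args):
        # Like callLater: remember *when* to run func, do not run it yet.
        assert delay >= 0, "delay must not be negative"
        heapq.heappush(self._pending,
                       (now + delay, next(self._counter), func, args))

    def run_until_current(self, now):
        # Like runUntilCurrent: pop and run every call whose time has come.
        while self._pending and self._pending[0][0] <= now:
            _, _, func, args = heapq.heappop(self._pending)
            func(*args)

fired = []
scheduler = ToyScheduler()
scheduler.call_later(0.0, 5.0, fired.append, "late")
scheduler.call_later(0.0, 1.0, fired.append, "early")

scheduler.run_until_current(now=2.0)   # only the 1-second call is due
print(fired)                           # ['early']
scheduler.run_until_current(now=6.0)   # now the 5-second call is due too
print(fired)                           # ['early', 'late']
</code>
The heap keeps the earliest call at index 0, so each pass of the loop only has to look at the front of the queue, which is the same idea the real reactor uses with ''_pendingTimedCalls''.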

==== event dispatching: Scheduling and Deferreds ====
=== Scheduling ===
Timeouts, repeated events, and more: when you want things to happen later.
  * Scheduling API:
    * https://twistedmatrix.com/documents/14.0.0/api/twisted.internet.task.html
    * https://twistedmatrix.com/documents/14.0.0/api/twisted.internet.interfaces.IReactorTime.html
  * example codes: http://twistedmatrix.com/documents/current/core/howto/time.html
=== Deferreds ===
Like **callback functions**, only a lot better. Deferreds are Twisted's preferred mechanism for **controlling the flow of asynchronous code**: they give us a way of saying **"do this only when that has finished"**.
  * introduce defer: http://twistedmatrix.com/documents/current/core/howto/defer-intro.html
  * defer API:
    * https://twistedmatrix.com/documents/14.0.0/api/twisted.internet.defer.html
    * https://twistedmatrix.com/documents/14.0.0/api/twisted.internet.defer.Deferred.html
  * example codes: http://twistedmatrix.com/documents/current/core/howto/defer.html
Below are some basic functions of the defer API:
  * defer.py: **maybeDeferred**<code python>
def succeed(result):
    d = Deferred()
    d.callback(result)
    return d
def fail(result=None):
    d = Deferred()
    d.errback(result)
    return d
def maybeDeferred(f, *args, **kw):
    try:
        result = f(*args, **kw)
    except:
        return fail(failure.Failure(captureVars=Deferred.debug))

    if isinstance(result, Deferred):
        return result
    elif isinstance(result, failure.Failure):
        return fail(result)
    else:
        return succeed(result)
</code>

  * class Deferred: **addCallback, callback**<code python>
    def __init__(self, canceller=None):
        self.callbacks = []
        self._canceller = canceller
        if self.debug:
            self._debugInfo = DebugInfo()
            self._debugInfo.creator = traceback.format_stack()[:-1]

    def addCallbacks(self, callback, errback=None,
                     callbackArgs=None, callbackKeywords=None,
                     errbackArgs=None, errbackKeywords=None):
        assert callable(callback)
        assert errback == None or callable(errback)
        cbs = ((callback, callbackArgs, callbackKeywords),
               (errback or (passthru), errbackArgs, errbackKeywords))
        self.callbacks.append(cbs)

        if self.called:
            self._runCallbacks()
        return self
    def addCallback(self, callback, *args, **kw):
        return self.addCallbacks(callback, callbackArgs=args,
                                 callbackKeywords=kw)
    def addErrback(self, errback, *args, **kw):
        return self.addCallbacks(passthru, errback,
                                 errbackArgs=args,
                                 errbackKeywords=kw)
    def callback(self, result):
        assert not isinstance(result, Deferred)
        self._startRunCallbacks(result)
</code>
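The chaining behaviour of addCallback and callback can be illustrated with a toy class. ''MiniDeferred'' below is a hypothetical, heavily reduced sketch (no errbacks, no cancellation, no nested Deferreds) and not Twisted's implementation; it only shows that each callback receives the previous callback's return value, and that a callback added after the result has arrived runs immediately.<code python>
class MiniDeferred(object):
    """Hypothetical toy version of a Deferred: success path only."""

    def __init__(self):
        self.callbacks = []   # functions still waiting to run
        self.called = False   # becomes True once callback() has fired
        self.result = None

    def addCallback(self, func, *args, **kw):
        self.callbacks.append((func, args, kw))
        if self.called:       # result already known: run right away
            self._run()
        return self           # returning self allows chaining

    def callback(self, result):
        assert not self.called, "a result can only be delivered once"
        self.called = True
        self.result = result
        self._run()

    def _run(self):
        # Feed the current result through every queued callback in order.
        while self.callbacks:
            func, args, kw = self.callbacks.pop(0)
            self.result = func(self.result, *args, **kw)

d = MiniDeferred()
d.addCallback(lambda text: text.upper()).addCallback(lambda text: text + "!")
d.callback("done")      # fires the chain: upper(), then append "!"
print(d.result)         # DONE!
d.addCallback(len)      # added late, so it runs immediately
print(d.result)         # 5
</code>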

  * defer.py: **inlineCallbacks(f)**: a decorator that helps you write Deferred-using code that looks like a regular sequential function.
==== communication protocol ====
TCP servers, TCP clients, UDP networking and using processes
==== using Threads ====
  * Threads API: https://twistedmatrix.com/documents/14.0.0/api/twisted.internet.interfaces.IReactorThreads.html
  * example codes: http://twistedmatrix.com/documents/current/core/howto/threading.html
===== lxml package =====
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt:
  * http://lxml.de/index.html
  * http://lxml.de/api/index.html
  * https://docs.python.org/2/library/xml.etree.elementtree.html
The two most basic classes in the lxml package for parsing xml and html:
  * ElementTree: https://docs.python.org/2/library/xml.etree.elementtree.html#elementtree-objects
  * HTMLElement: https://docs.python.org/2/library/xml.etree.elementtree.html#element-objects
==== Parsing xml and html to Etree Object ====
refer: http://lxml.de/parsing.html\\
etree.parse returns an **lxml.etree._ElementTree** object
  * Parsing XML
    * XML String<code python>
import StringIO
from lxml import etree
f = StringIO.StringIO('<foo><bar></bar></foo>')
tree = etree.parse(f)
</code>

    * XML file<code python>
from lxml import etree

tree = etree.parse("doc/test.xml")
</code>

    * Iterate Parsing xml<code python>
import StringIO
from lxml import etree
f = StringIO.StringIO('<foo><bar></bar></foo>')
context = etree.iterparse(f)
for action, elem in context:
    print("%s: %s" % (action, elem.tag))
</code>output:<code>
end: bar
end: foo
</code>

  * Parsing HTML
    * HTML String<code python>
import StringIO
from lxml import etree
from lxml.html import HTMLParser

broken_html = "test<body><h1>page title</h3>"
parser = HTMLParser()
tree   = etree.parse(StringIO.StringIO(broken_html), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)
</code>
    * HTML file<code python>
from lxml import etree
from lxml.html import HTMLParser

parser = HTMLParser()
tree   = etree.parse('index.html', parser)
</code>
    * Iterate Parsing HTML<code python>
import StringIO
from lxml import etree

with open('index.html', 'r') as f:
    htmlcontent = f.read()
    context = etree.iterparse(StringIO.StringIO(htmlcontent), html = True)
    for action, elem in context:
        print("%s: %s" % (action, elem.tag))
</code>
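iterparse reports only ''end'' events by default, which is why the XML example above prints bar before foo. A minimal sketch of requesting ''start'' events as well, written against the standard library's xml.etree.ElementTree (its iterparse takes the same ''events'' argument as lxml's) and for Python 3, where io.BytesIO replaces StringIO:<code python>
import io
from xml.etree import ElementTree as ET

source = io.BytesIO(b'<foo><bar></bar></foo>')
seen = []
# Ask for both kinds of event instead of the default ('end',)
for action, elem in ET.iterparse(source, events=('start', 'end')):
    seen.append("%s: %s" % (action, elem.tag))
print("\n".join(seen))
</code>
A ''start'' event fires as soon as the opening tag has been read, so the element's children and text may not be available yet; code that needs the complete element should act on ''end''.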
==== XPath for xml and html ====
XPath for python: 
  * https://docs.python.org/2/library/xml.etree.elementtree.html#elementtree-xpath
  * http://lxml.de/xpathxslt.html
=== XPath Syntax ===
refer: http://www.w3schools.com/XPath/xpath_syntax.asp

XPath Expressions:
  * Selecting Nodes
  * Predicates: Predicates are used to **find a specific node** or a node that **contains a specific value**
  * Selecting Unknown Nodes
  * Selecting Several Paths
XPath function return values:
  * xpath() returns a list of element objects (HtmlElement objects for an HTML tree)
  * If the expression ends with a **text()** step, for example **<nowiki>//</nowiki>text()**, xpath() returns a list of strings instead
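The difference between element results and string results can be sketched with the standard library alone. This assumes only the limited XPath subset of xml.etree.ElementTree, which has no text() step, so the ''.text'' attribute and findtext() stand in for it (lxml's xpath() supports the full syntax). The sample markup is made up for illustration, and the snippet is written for Python 3.<code python>
from xml.etree import ElementTree as ET

# Hypothetical sample data, modelled on the buyer/price page used further down.
root = ET.fromstring(
    '<page>'
    '<div title="buyer-name">Carson Busses</div>'
    '<div title="buyer-name">Earl E. Byrd</div>'
    '<span class="item-price">$29.95</span>'
    '</page>'
)

# Selecting nodes with a predicate returns a list of Element objects.
buyers = root.findall(".//div[@title='buyer-name']")
print([elem.tag for elem in buyers])     # ['div', 'div']

# Taking .text of each element gives the list of strings instead.
names = [elem.text for elem in buyers]
print(names)                             # ['Carson Busses', 'Earl E. Byrd']

# findtext() returns the text of the first match directly.
price = root.findtext(".//span[@class='item-price']")
print(price)                             # $29.95
</code>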
=== Examples ===
  * Simple Example for xml<code python>
import StringIO
from lxml import etree
f = StringIO.StringIO('<foo><bar></bar></foo>')
tree = etree.parse(f)
r = tree.xpath('/foo/bar')   # absolute path, evaluated from the document
print r[0].tag
r = tree.xpath('bar')        # relative path, evaluated from the root element
print r[0].tag
</code>=>output:<code>
bar
bar
</code>
  * xpath with html
      * Markup of the tags to extract:<code>
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>
</code>
      * parsing code <code python>
from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print 'Buyers: ', buyers
print 'Prices: ', prices
</code>=>output:<code>
Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'nt', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', ' 0.09']
</code>
==== lxml.html ====
=== HTMLParser ===
  * parse elements with the basic HTMLParser from the Python standard library<code python>
from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('''<html><head><title>Test</title></head>
            <body><h1>Parse me!</h1></body></html>''')
</code>output:<code>
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered some data  :

Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
</code>
=== Create tree or root Element with lxml.html ===
refer: http://lxml.de/lxmlhtml.html
  * parse html from request:<code python>
from lxml import html

tree = html.parse('http://www.google.com')
tree.write('index.html')
</code> Or <code python>
from lxml import html
import requests

page = requests.get('http://www.google.com')
tree = html.fromstring(page.text)
r = tree.xpath('//title')
print r[0].text
</code>
  * parse html from local file<code python>
from lxml import html as HTML

tree = HTML.parse('index.html')
r = tree.xpath('//title')
print r[0].tag
print r[0].text
</code>
  * parse html with HTMLParser:<code python>
import StringIO
from lxml import etree
from lxml.html import HTMLParser

broken_html = "test<body><h1>page title</h3>"
parser = HTMLParser()
tree   = etree.parse(StringIO.StringIO(broken_html), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)
</code>
==== Build xml using Etree ====
  * Build xml using xml.etree.ElementTree:<code python>
from xml.etree import ElementTree as ET
'''
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>
'''
data = ET.Element('data')

country1 = ET.SubElement(data, 'country', {'name':'Liechtenstein'})
rank1 = ET.SubElement(country1, 'rank')
rank1.text = '1'
year1 = ET.SubElement(country1, 'year')
year1.text = '2008'

country2 = ET.SubElement(data, 'country', {'name':'Singapore'})
rank2 = ET.SubElement(country2, 'rank')
rank2.text = '4'
year2 = ET.SubElement(country2, 'year')
year2.text = '2011'
print ET.tostring(data)
</code> output:<code>
<data><country name="Liechtenstein"><rank>1</rank><year>2008</year></country><country name="Singapore"><rank>4</rank><year>2011</year></country></data>
</code>
  * Build xml using lxml.etree:<code python>
from lxml import etree as ET
'''
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>
'''
data = ET.Element('data')

country1 = ET.SubElement(data, 'country', {'name':'Liechtenstein'})
rank1 = ET.SubElement(country1, 'rank')
rank1.text = '1'
year1 = ET.SubElement(country1, 'year')
year1.text = '2008'

country2 = ET.SubElement(data, 'country', {'name':'Singapore'})
rank2 = ET.SubElement(country2, 'rank')
rank2.text = '4'
year2 = ET.SubElement(country2, 'year')
year2.text = '2011'
print ET.tostring(data)
</code> output: <code>
<data><country name="Liechtenstein"><rank>1</rank><year>2008</year></country><country name="Singapore"><rank>4</rank><year>2011</year></country></data>
</code>
==== Custom Functions ====
  * Load html from local file<code python>
from lxml.html import HtmlElement
from lxml import etree
from lxml import html as HTML
 
tree = HTML.parse('index.html')
r = tree.xpath('//div[@id="content"]');
print(etree.tostring(r[0], pretty_print=True, encoding='utf-8'))
</code>

  * Print HTMLElement<code python>
from lxml.html import HtmlElement
from lxml import etree
from lxml import html as HTML

tree = HTML.parse('index.html')
r = tree.xpath('//div[@id="content"]');
print(etree.tostring(r[0], pretty_print=True, method="html"))
</code>
  * Write html from tree<code python>
from lxml import html, etree
from lxml.html import HTMLParser
import StringIO
import requests
 
page = requests.get('http://shop.babies.vn')
parser = HTMLParser()
tree   = etree.parse(StringIO.StringIO(page.text), parser)
tree.write('index.html', method = 'html')
</code> Or <code python>
from lxml import html, etree
from lxml.html import HTMLParser
import StringIO
import requests
 
tree = html.parse('http://shop.babies.vn')
tree.write('index.html', method = 'html')
</code>
===== re Package (Regular Expression) =====
To use re package, we need to import it:<code python>
import re
</code>
==== Regular Expression Language ====
A regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern\\
refer: 
  * http://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
  * python: https://docs.python.org/2/library/re.html#regular-expression-syntax

**Match Character**
  * Character Escapes: The backslash character (\) in a regular expression indicates that the character that follows it either is **a special character** (as shown in the following list), or should be interpreted literally
      * **\t** Match a tab: pattern **(\w+)\t** match **"item1\t", "item2\t"** in **"item1\titem2\t"**
      * **\n** Match a new line 
      * **\d** Matches any decimal digit
      * **\D** Matches any character other than a decimal digit.
  * Character Classes: A character class **matches any one of a set of characters**.
      * **[character_group]** Matches any single character in character_group:pattern **[ae]** match **"a"** in **"gray"** and **"a", "e"** in **"lane"**
      * **[^character_group]** Negation: Matches any single character that is not in character_group
      * **[first-last]** Character range: Matches any single character in the range from first to last.
      * **\w** Matches any word character
      * **\W** Matches any non-word character.
      * **\s** Matches any white-space character: pattern **\w\s** match **"D "** in **"ID A1.3"**
**Anchors**: cause a match to succeed or fail depending on the **current position in the string**
  * **^** The match must start at the beginning of the string (or of a line when the MULTILINE flag is set): pattern **^\d{3}** match **"901"** in **"901-333-"**
  * **$** The match must occur at the end of the string or before \n at the end of the string (or of a line when the MULTILINE flag is set): pattern **-\d{3}$** match **"-333"** in **"-901-333"**
  * **\A** The match must occur at the start of the string: pattern **\A\d{3}** match **"901"** in **"901-333-"**
  * **\Z** The match must occur at the very end of the string; unlike in .NET, Python's \Z does not match before a trailing \n: pattern **-\d{3}\Z** match **"-333"** in **"-901-333"**
**Grouping constructs**: Grouping constructs delineate **subexpressions of a regular expression** and typically capture **substrings of an input string**
  * **(subexpression)** Captures the matched subexpression and assigns it a one-based ordinal number: pattern **(\w)\1** match **"ee"** in **"deep"**
**Quantifier**: A quantifier specifies **how many instances** of the previous element (which can be a character, a group, or a character class) must be present in the input string for a match to occur
  * <nowiki>*</nowiki> Matches the previous element zero or more times: pattern **\d*\.\d** match **".0", "19.9", "219.9"**
  * + Matches the previous element one or more times: pattern **"be+"** match **"bee"** in **"been"**, **"be"** in **"bent"**
  * ? Matches the previous element zero or one time: pattern **"rai?n"** match **"ran", "rain"**
  * {n} Matches the previous element exactly n times: pattern **",\d{3}"** match **",043"** in **"1,043.6"** and **",876", ",543", ",210"** in **"9,876,543,210"**
  * {n,} Matches the previous element at least n times: pattern **"\d{2,}"** match **"166", "29", "1930"**
  * {n,m} Matches the previous element at least n times, but no more than m times: pattern **"\d{3,5}"** match **"166", "17668"** and **"19302"** in **"193024"**
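Several of the patterns listed above can be checked directly in Python. The following is a small sketch for Python 3 that reuses the sample strings from the lists; only the combined input ''been bent'' is made up here.<code python>
import re

# Character class: every "a" or "e" in the input
print(re.findall(r'[ae]', 'lane'))              # ['a', 'e']
# Anchor: three digits at the start of the string only
print(re.findall(r'^\d{3}', '901-333-'))        # ['901']
# Backreference: \1 repeats whatever group 1 captured
print(re.search(r'(\w)\1', 'deep').group(0))    # ee
# Quantifiers
print(re.findall(r'be+', 'been bent'))          # ['bee', 'be']
print(re.findall(r',\d{3}', '9,876,543,210'))   # [',876', ',543', ',210']
print(re.findall(r'\d{3,5}', '193024'))         # ['19302']
</code>
Note that ''{3,5}'' is greedy: it takes five digits when it can, and the single digit left over is too short to form another match.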
==== re Package ====
refer: http://www.pythonforbeginners.com/regex/regular-expressions-in-python

re Flags:<code python>
# flags
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode locale
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments
</code>
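What the most common flags change can be seen in a short sketch. The sample text and the variable names are made up for illustration, and the snippet is written for Python 3.<code python>
import re

text = "First line\nsecond LINE"

# IGNORECASE: "line" also matches "LINE"
ignore_case = re.findall(r'line', text, re.IGNORECASE)
print(ignore_case)      # ['line', 'LINE']

# MULTILINE: ^ matches at the start of every line, not just of the string
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
print(line_starts)      # ['First', 'second']

# DOTALL: . also matches the newline character
across_lines = re.search(r'line.second', text, re.DOTALL) is not None
print(across_lines)     # True

# Flags are combined with the | operator
combined = re.findall(r'^SECOND', text, re.IGNORECASE | re.MULTILINE)
print(combined)         # ['second']

# VERBOSE: whitespace and comments inside the pattern are ignored
date = re.compile(r"""
    (\d{2}) /     # month
    (\d{2}) /     # day
    (\d{2,4})     # year
    """, re.VERBOSE)
parts = date.search("hello 10/15/99").groups()
print(parts)            # ('10', '15', '99')
</code>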
=== re.findall ===
findall: findall() is probably the single most powerful function in the re module: it returns a list of all non-overlapping matches of the pattern in the string
  - Example 1: <code python>
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text) ## ['alice@google.com', 'bob@abc.com']

for email in emails:
    # do something with each found email string
    print email
</code>Understanding the pattern syntax above:
  * [\w\.-]+ => one or more (sign: +) characters from the set (sign: []): a word character (sign: \w), the character **.** (sign: \.) or the character **-**
  * @[\w\.-]+ => followed by the character @ and one or more characters from the same set: [word character, **.** , **-**]
  - Example 2: <code python>
import re

# Open the file; the with-block closes it again when done
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'some pattern', f.read())
</code>
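When the pattern contains groups, findall() returns the groups rather than the whole match: with two or more groups each result is a tuple. A short sketch for Python 3, reusing the e-mail text from Example 1 (the variable names are chosen here for illustration):<code python>
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Two groups: each match becomes a (user, host) tuple.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)   # [('alice', 'google.com'), ('bob', 'abc.com')]

for user, host in pairs:
    print("%s is at %s" % (user, host))
</code>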
=== re.search, re.match ===
  * re.search: The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string<code python>
import re

text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# If-statement after search() tests if it succeeded
if match:
    print 'found', match.group() ## 'found word:cat'
else:
    print 'did not find'
</code>The example below uses the | operator, which matches if any one of the listed alternatives matches.<code python>
import re
programming = ["Python", "Perl", "PHP", "C++"]
pat = "^B|^P|i$|H$"
for lang in programming:
    if re.search(pat,lang,re.IGNORECASE):
        print lang , "FOUND"
    else:
        print lang, "NOT FOUND"
</code>The output of above script will be:<code>
Python FOUND
Perl FOUND
PHP FOUND
C++ NOT FOUND
</code>
  * re.match: matches only at the beginning of the string<code python>
import re

text = "The Attila the Hun Show"

# a single character
m = re.match(".", text)
if m: print repr("."), "=>", repr(m.group(0))

# any string of characters
m = re.match(".*", text)
if m: print repr(".*"), "=>", repr(m.group(0))

# a string of letters (at least one)
m = re.match("\w+", text)
if m: print repr("\w+"), "=>", repr(m.group(0))

# a string of digits (the text does not start with one, so nothing is printed)
m = re.match("\d+", text)
if m: print repr("\d+"), "=>", repr(m.group(0))

</code>output:<code>
'.' => 'T'
'.*' => 'The Attila the Hun Show'
'\\w+' => 'The'
</code>
  * re.search versus re.match: match() only succeeds at the start of the string, search() scans the whole string<code python>
import re

print '**********************************'
text ="10/15/99"

print "match1:"
m = re.match("(\d{2})/(\d{2})/(\d{2,4})", text)
if m:
    print m.group(1, 2, 3)

print "search1:"
s = re.search("(\d{2})/(\d{2})/(\d{2,4})", text)
if s:
    print s.group(1, 2, 3)    

print '**********************************'
text ="hello 10/15/99"
print "match2:"
m = re.match("(\d{2})/(\d{2})/(\d{2,4})", text)
if m:
    print m.group(1, 2, 3)

print "search2:"
s = re.search("(\d{2})/(\d{2})/(\d{2,4})", text)
if s:
    print s.group(1, 2, 3)    

</code>output:<code>
**********************************
match1:
('10', '15', '99')
search1:
('10', '15', '99')
**********************************
match2:
search2:
('10', '15', '99')
</code>
=== re.sub and re.compile ===
  * re.sub: The re.sub() function in the re module can be used to replace substrings
      * first example<code python>
import re
text = "Python for beginner is a very cool website"
text2 = re.sub("cool", "good", text)
print text2
</code>output<code>
Python for beginner is a very good website
</code>
      * Here is another example (taken from Google's Python class) which searches for all the email addresses and changes them to keep the user (\1) but use yo-yo-dyne.com as the host.<code python>
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## re.sub(pat, replacement, text) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@yo-yo-dyne.com', text)
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
</code>output:<code>
purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
</code>
  * re.compile: With the re.compile() function we can compile a pattern into a pattern object, which has methods for various operations such as searching for pattern matches or performing string substitutions.
    * The first example checks that the input from the user contains only letters, whitespace or **.** (no digits). Any other character is rejected.<code python>
import re

# matches any character that is NOT a letter, whitespace or "."
name_check = re.compile(r"[^A-Za-z\s.]")
name = raw_input ("Please, enter your name: ")
while name_check.search(name):
    print "Please enter your name correctly!"
    name = raw_input ("Please, enter your name: ")
</code>
    * The second example checks that the input from the user contains only digits, parentheses, whitespace or hyphens (no letters). Any other character is rejected.<code python>
import re

# matches any character that is NOT a digit, whitespace, "-", "(" or ")"
phone_check = re.compile(r"[^0-9\s\-()]")
phone = raw_input ("Please, enter your phone: ")
while phone_check.search(phone):
    print "Please enter your phone correctly!"
    phone = raw_input ("Please, enter your phone: ")
</code> The output of above script will be:<code>
Please, enter your phone: s
Please enter your phone correctly!
</code> It will continue to ask until the input contains only digits, whitespace, parentheses or hyphens.
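The same checks can also be written as a positive pattern that the whole input has to match, which avoids the double negative of searching for a forbidden character. A minimal sketch for Python 3, assuming the same allowed character sets as above; the function name ''is_valid'' and the sample inputs are made up for illustration, and the interactive prompt is left out so the snippet runs unattended.<code python>
import re

# Compile once, reuse many times.
name_ok = re.compile(r"^[A-Za-z\s.]+$")      # letters, whitespace, dots
phone_ok = re.compile(r"^[0-9\s\-()]+$")     # digits, whitespace, - ( )

def is_valid(pattern, value):
    """Return True if the compiled pattern matches the whole value."""
    return pattern.match(value) is not None

print(is_valid(name_ok, "J. R. Smith"))      # True
print(is_valid(name_ok, "R2-D2"))            # False
print(is_valid(phone_ok, "(04) 555-0123"))   # True
print(is_valid(phone_ok, "s"))               # False
</code>
Unlike the loops above, this variant also rejects empty input, because ''+'' requires at least one allowed character.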
=== re.split ===