
BeautifulSoup

Parsing XML/XHTML

Let's see how BeautifulSoup works:

License, download, documentation, credits, etc.: http://www.crummy.com/software/BeautifulSoup/

Parsing XHTML

Take the file teste_parser.html:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>teste do parser</title>
</head>
<body>
<table id="table1">
        <tr><td>linha1 celula1</td><td>linha1 celula2</td></tr>
        <tr><td>linha2 celula1</td><td>linha2 celula2</td></tr>
        <tr><td>linha3 celula1</td><td>linha3 celula2</td></tr>
</table>
<form action="f1" method="post">
<input type="text" name="texto1" size="10" maxlength="10" value=""/> 
<input type="text" name="texto2" size="10" maxlength="10" value=""/> 
<input type="text" name="texto3" size="10" maxlength="10" value=""/>
<select name="sel1" size="2">
<option value="1" label="1"></option>
<option value="2" label="2"></option>
<option value="3" label="3"></option>
</select> 
</form>
</body>
</html>

In IDLE:

Python 2.4.1 (#2, May  5 2005, 11:32:06) 
[GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
Type "copyright", "credits" or "license()" for more information.

    ****************************************************************
    Personal firewall software may warn about the connection IDLE
    makes to its subprocess using this computer's internal loopback
    interface.  This connection is not visible on any external
    interface and no data is sent to or received from the Internet.
    ****************************************************************
    
IDLE 1.1.1      
>>> from BeautifulSoup import BeautifulSoup
>>> arq=file('teste_parser.html')
>>> tree=BeautifulSoup(arq.read())
>>> tree('title')
[<title>teste do parser</title>]
>>> tree('title')[0]
<title>teste do parser</title>
>>> tree('title')[0].string
'teste do parser'
>>> len(tree('table')[0]('td'))
6
>>> it=len(tree('table')[0]('td'))
>>> it
6
>>> for i in range(it):
        print tree('table')[0]('td')[i].string

        
linha1 celula1
linha1 celula2
linha2 celula1
linha2 celula2
linha3 celula1
linha3 celula2
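
The loop above can also be written without indexes. This is a minimal sketch, assuming Python 3 and the newer `bs4` package (Beautiful Soup 4) instead of the BeautifulSoup 3 module used in this session. The table markup is inlined (an assumption, copied from teste_parser.html) so the snippet runs on its own:

```python
from bs4 import BeautifulSoup

# Inline copy of the table from teste_parser.html, so no file is needed.
HTML = """
<table id="table1">
  <tr><td>linha1 celula1</td><td>linha1 celula2</td></tr>
  <tr><td>linha2 celula1</td><td>linha2 celula2</td></tr>
  <tr><td>linha3 celula1</td><td>linha3 celula2</td></tr>
</table>
"""

# "html.parser" is the parser that ships with Python itself.
tree = BeautifulSoup(HTML, "html.parser")

# find() returns the first matching tag; find_all() returns every match.
table = tree.find("table")
cells = [td.get_text() for td in table.find_all("td")]

# Iterate over the cells directly instead of over range(len(...)).
for text in cells:
    print(text)
```

Iterating over the result list directly avoids searching the whole tree again on every pass, which is what `tree('table')[0]('td')[i]` does.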
>>> # exploring attributes
>>> tree('table')[0]['id']
'table1'
>>> # exploring a <form>
>>> for i in range(len(tree('form'))):
        print tree('form')[i]

        
<form action="f1" method="post">
<input type="text" name="texto1" size="10" maxlength="10" value="" />
<input type="text" name="texto2" size="10" maxlength="10" value="" />
<input type="text" name="texto3" size="10" maxlength="10" value="" />
<select name="sel1" size="2">
<option value="1" label="1"></option>
<option value="2" label="2"></option>
<option value="3" label="3"></option>
</select>
</form>
>>>

>>> for i in range(len(tree('form'))):
        for j in range(len(tree('form')[i]('select'))):
            for k in range(len(tree('form')[i]('select')[j]('option'))):
                print tree('form')[i]('select')[j]('option')[k]['value']

                        
1
2
3
>>> # of course, the madness above is begging for some refactoring
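
One possible refactoring of the three nested index loops, as a hedged sketch: it assumes Python 3 and the `bs4` package (not the BeautifulSoup 3 module from this session), and the helper name `option_values` is made up for illustration. The form markup is inlined so the snippet is self-contained:

```python
from bs4 import BeautifulSoup

# Inline copy of the <form> from teste_parser.html (text inputs omitted).
HTML = """
<form action="f1" method="post">
<select name="sel1" size="2">
<option value="1" label="1"></option>
<option value="2" label="2"></option>
<option value="3" label="3"></option>
</select>
</form>
"""

tree = BeautifulSoup(HTML, "html.parser")


def option_values(soup):
    """Collect the value of every <option> in every <select> of every <form>.

    Hypothetical helper: it walks the same form/select/option nesting as the
    original loops, but iterates over the tags themselves, not over indexes.
    """
    return [
        option["value"]
        for form in soup.find_all("form")
        for select in form.find_all("select")
        for option in select.find_all("option")
    ]


values = option_values(tree)
print(values)
```

The same idea carries over to the Python 2 session: the lists returned by `tree('form')` can be looped over directly (`for form in tree('form'): ...`), which removes every `range(len(...))`.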
>>> # changing an attribute
>>> tree('form')[0]['method']
'post'
>>> tree('form')[0]['method']='get'
>>> tree('form')[0]['method']
'get'
>>> # adding an attribute
>>> tree('form')[0]['enctype']='multipart/form-data'
>>> print tree('form')[0]
<form action="f1" method="get" enctype="multipart/form-data">
<input type="text" name="texto1" size="10" maxlength="10" value="" />
<input type="text" name="texto2" size="10" maxlength="10" value="" />
<input type="text" name="texto3" size="10" maxlength="10" value="" />
<select name="sel1" size="2">
<option value="1" label="1"></option>
<option value="2" label="2"></option>
<option value="3" label="3"></option>
</select>
</form>
>>> # power!!!! as Bidú (my brother-in-law) likes to say
>>>
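
For reference, here is the same attribute editing as a small sketch under different assumptions: Python 3 and the `bs4` package rather than the BeautifulSoup 3 module shown above. The markup is a shortened, inlined version of the form, and the `target` attribute is only there to show a lookup that finds nothing:

```python
from bs4 import BeautifulSoup

# Shortened inline version of the form from teste_parser.html.
HTML = '<form action="f1" method="post"><input type="text" name="texto1"/></form>'

tree = BeautifulSoup(HTML, "html.parser")
form = tree.find("form")

# A tag behaves like a dictionary of its attributes.
form["method"] = "get"                     # change an existing attribute
form["enctype"] = "multipart/form-data"    # add a new attribute

# get() returns None for a missing attribute instead of raising KeyError.
missing = form.get("target")

# attrs exposes all attributes of the tag as a plain dict.
print(form.attrs)
print(form)
```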
>>>#Luiz Antonio de Campos
>>> # the BeautifulSoup module has 3 other subclasses - check the URL above
>>> # also try it with malformed xhtml (such as <td>xxxxxxxxxx</td<td>xxxx</td>); it will fix it
>>> # as long as the markup is not too badly mangled
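
A quick way to try the malformed-markup suggestion, sketched with Python 3 and the `bs4` package (an assumption; the session above uses BeautifulSoup 3). How the broken `</td<td>` gets repaired depends on the underlying parser and its version, so the snippet only prints the result and makes no claim about the exact output:

```python
from bs4 import BeautifulSoup

# The malformed fragment from the comment above, wrapped in a table row.
BROKEN = "<table><tr><td>xxxxxxxxxx</td<td>xxxx</td></tr></table>"

# "html.parser" ships with Python; other parsers (lxml, html5lib) may
# repair the same input differently.
tree = BeautifulSoup(BROKEN, "html.parser")

repaired = str(tree)      # the markup as the parser understood it
text = tree.get_text()    # the text content, with all tags stripped

print(repaired)
print(text)
```

Comparing `repaired` across different parsers is a simple way to see how forgiving each one is.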