parseando xml/xhtml
BeautifulSoup vamos ver como funciona:
licença, download, documentação, créditos etc... http://www.crummy.com/software/BeautifulSoup/
parseando xhtml
seja o arquivo teste_parser.html
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <title>teste do parser</title> </head> <body> <table id="table1"> <tr><td>linha1 celula1</td><td>linha1 celula2</td></tr> <tr><td>linha2 celula1</td><td>linha2 celula2</td></tr> <tr><td>linha3 celula1</td><td>linha3 celula2</td></tr> </table> <form action="f1" method="post"> <input type="text" name="texto1" size="10" maxlength="10" value=""/> <input type="text" name="texto2" size="10" maxlength="10" value=""/> <input type="text" name="texto3" size="10" maxlength="10" value=""/> <select name="sel1" size="2"> <option value="1" label="1"></option> <option value="2" label="2"></option> <option value="3" label="3"></option> </select> </form> </body> </html>
no idle
Python 2.4.1 (#2, May 5 2005, 11:32:06) [GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2 Type "copyright", "credits" or "license()" for more information. **************************************************************** Personal firewall software may warn about the connection IDLE makes to its subprocess using this computer's internal loopback interface. This connection is not visible on any external interface and no data is sent to or received from the Internet. **************************************************************** IDLE 1.1.1 >>> from BeautifulSoup import BeautifulSoup >>> arq=file('teste_parser.html') >>> tree=BeautifulSoup(arq.read()) >>> tree('title') [<title>teste do parser</title>] >>> tree('title')[0] <title>teste do parser</title> >>> tree('title')[0].string 'teste do parser' >>> len(tree('table')[0]('td')) 6 >>> it=len(tree('table')[0]('td')) >>> it 6 >>> for i in range(it): print tree('table')[0]('td')[i].string linha1 celula1 linha1 celula2 linha2 celula1 linha2 celula2 linha3 celula1 linha3 celula2 >>> #explorando atributos >>> tree('table')[0]['id'] 'table1' >>> #explorando um <form> >>> for i in range(len(tree('form'))): print tree('form')[i] <form action="f1" method="post"> <input type="text" name="texto1" size="10" maxlength="10" value="" /> <input type="text" name="texto2" size="10" maxlength="10" value="" /> <input type="text" name="texto3" size="10" maxlength="10" value="" /> <select name="sel1" size="2"> <option value="1" label="1"></option> <option value="2" label="2"></option> <option value="3" label="3"></option> </select> </form> >>> >>> for i in range(len(tree('form'))): for j in range(len(tree('form')[i]('select'))): for k in range(len(tree('form')[i]('select')[j]('option'))): print tree('form')[i]('select')[j]('option')[k]['value'] 1 2 3 >>>#é claro que cabe uma 'refatoração' na loucura acima >>>#alterando um atributo >>> tree('form')[0]['method'] 'post' >>> tree('form')[0]['method']='get' >>> tree('form')[0]['method'] 'get' >>>#inserindo um atributo >>> tree('form')[0]['enctype']='multipart/form-data' >>> print tree('form')[0] <form action="f1" method="get" enctype="multipart/form-data"> <input type="text" name="texto1" size="10" maxlength="10" value="" /> <input type="text" name="texto2" size="10" maxlength="10" value="" /> <input type="text" name="texto3" size="10" maxlength="10" value="" /> <select name="sel1" size="2"> <option value="1" label="1"></option> <option value="2" label="2"></option> <option value="3" label="3"></option> </select> </form> >>>#potência!!!! como diz o Bidú (meu cunhado) >>> >>>#Luiz Antonio de Campos >>>#no módulo BeautifulSoup existem outras 3 subclasses - verifique na URL acima >>>#tente também com xhtml malformado (tipo <td>xxxxxxxxxx</td<td>xxxx</td>) ele vai acertar se >>>#a malformação não for muito 'porca'