Parsing Data with Python's lxml

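In Python, lxml can parse both HTML and XML, so it is a convenient way to pull data out of web pages: once a crawler hands the page to lxml, collecting the fields you need becomes much easier. lxml is also fast, because it is a binding to the C libraries libxml2 and libxslt.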
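As a quick illustration of the "both HTML and XML" point, here is a minimal sketch that parses one XML snippet with lxml.etree and one HTML snippet with lxml.html. The sample documents and the helper name parse_both are made up for this example and are not part of the original article.

```python
from lxml import etree, html

# Hypothetical sample inputs, used only for illustration.
XML_SAMPLE = b"<catalog><book id='1'><title>Dive In</title></book></catalog>"
HTML_SAMPLE = "<html><body><p>Hello <b>lxml</b></p></body></html>"


def parse_both():
    """Parse one XML and one HTML document and return a few values from each."""
    xml_root = etree.fromstring(XML_SAMPLE)    # strict XML parser
    html_root = html.fromstring(HTML_SAMPLE)   # lenient HTML parser

    # ElementPath lookup on the XML tree.
    title = xml_root.findtext("book/title")
    # XPath lookup on the HTML tree; text_content() flattens nested tags.
    paragraph = html_root.xpath("//p")[0].text_content()
    return xml_root.tag, title, paragraph


if __name__ == "__main__":
    print(parse_both())
```

The XML parser rejects malformed input, while lxml.html tolerates the broken markup that real pages often contain, which is why the HTML module is the usual choice for scraping.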

The workflow for extracting web page data with lxml

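Parsing a site's data with lxml takes only two steps:

1. Parse the page with lxml. For this step we usually use lxml.html.
2. Query the parsed tree with XPath and collect the data you need.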
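The two steps above can be sketched as follows. The page markup, the XPath expression, and the function name extract_links are assumptions chosen for the example; a real site needs an expression that matches its own structure.

```python
from lxml import html

# Hypothetical page source standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
  <ul id="articles">
    <li><a href="/post/1">First post</a></li>
    <li><a href="/post/2">Second post</a></li>
  </ul>
</body></html>
"""


def extract_links(page_source):
    """Return (title, href) pairs for every article link in the page."""
    # Step 1: parse the page into an element tree.
    tree = html.fromstring(page_source)

    # Step 2: select the nodes with XPath and collect the wanted fields.
    links = []
    for anchor in tree.xpath('//ul[@id="articles"]/li/a'):
        links.append((anchor.text_content().strip(), anchor.get("href")))
    return links


if __name__ == "__main__":
    for title, href in extract_links(SAMPLE_PAGE):
        print(title, href)
```

Keeping the XPath expression in one place makes it easy to adjust when the site changes its layout.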

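When you extract data from an entire site, the bottleneck of the crawl is network performance, not program performance. Making the program itself faster with asynchronous code therefore does not help: asynchronous fetching raises the request rate, which makes it more likely that the site will throttle or block you.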
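One way to act on this is to cap the request rate on purpose. The sketch below is a minimal throttle that enforces a minimum interval between requests; the class name Throttle, the 2-second interval, and the URLs are assumptions for illustration, and the right interval depends on the target site.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the interval; return the delay used."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = Throttle(min_interval=2.0)
    # Hypothetical URLs; a real crawler would download each page here.
    for url in ["http://example.com/a", "http://example.com/b"]:
        waited = throttle.wait()
        print(f"waited {waited:.1f}s before requesting {url}")
```

Calling wait() before every request keeps the crawl rate predictable no matter how quickly the parsing code runs.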

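Let's walk through an example. The code below covers the download side: it fetches a URL through a proxy with aiohttp and prints the response status and body. For a real page, the HTML body is what you would then pass to lxml: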

    # -*- coding: utf-8 -*-
    import asyncio

    import aiohttp

    targetUrl = "http://httpbin.org/ip"

    # Proxy server (product site: www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # Proxy credentials
    proxyUser = "username"
    proxyPass = "password"

    proxyServer = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }

    userAgent = "Chrome/83.0.4103.61"


    async def entry():
        # ssl=False disables certificate verification
        # (it replaces the deprecated verify_ssl=False).
        conn = aiohttp.TCPConnector(ssl=False)

        async with aiohttp.ClientSession(headers={"User-Agent": userAgent}, connector=conn) as session:
            # Send the request through the proxy and read the raw response bytes.
            async with session.get(targetUrl, proxy=proxyServer) as resp:
                body = await resp.read()

                print(resp.status)
                print(body)


    # Run the coroutine to completion, then exit instead of blocking forever.
    asyncio.run(entry())
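The example above stops at printing the raw body. As a hedged sketch of the next step, the bytes returned by resp.read() for an HTML page could be handed to lxml as shown here. The function name parse_title and the sample bytes are assumptions for illustration; note that http://httpbin.org/ip itself returns JSON rather than HTML, so this applies to ordinary web pages only.

```python
from lxml import html

# Hypothetical response body, standing in for the bytes read from a real page.
SAMPLE_BODY = (
    b"<html><head><title>Example Domain</title></head>"
    b"<body><h1>Hi</h1></body></html>"
)


def parse_title(body):
    """Parse downloaded HTML bytes and return the page title, or None if absent."""
    tree = html.fromstring(body)       # accepts bytes as well as str
    return tree.findtext(".//title")   # None when the page has no <title>


if __name__ == "__main__":
    print(parse_title(SAMPLE_BODY))
```

Passing bytes rather than a decoded string lets lxml apply its own encoding detection to the page.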


厉飞雨
