Scrape data from an AJAX webpage with Python

Question {#heading}

I'm having this issue: I need to scrape a dynamic table's data from this webpage: https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb

This webpage uses AJAX to generate the table that I want to fetch. I have inspected the element and it seems straightforward: I have the request URL with its parameters. But when I send the request, I get response code 200 and an empty response body.

I must be doing something wrong, but I'm not sure how to fetch this data in Python, even though it seems fairly straightforward. Could anyone help me out?

I want to get the same table as the one that is displayed on the website.

Answer 1 {#1}

Score: 3

Actually, this page turned out to be a pretty cool challenge!

Breakdown:

- The link to the report sits in the source HTML. The table itself is rendered dynamically by JavaScript, but you can easily scoop the link out.
- The safeargs_data value

  5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31

  is just a silly way of obfuscating this value in hex:

  __report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

- I've decoded it for ease of readability and editing, e.g. the data key.
- Finally, I use the table_link, the payload data, and the updated headers to make a POST request.
- Then it's easy to get the table out of the JSON response and parse it with pandas.
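The hex round trip described in the breakdown can be sketched with the standard library alone. This is a minimal illustration, not part of the original answer: the hex string is the one quoted above (split per parameter for readability), and the replacement date 2023-08-11 is a hypothetical value. Whether the server accepts a changed date or a reused __batchID is an assumption that has not been checked here.

```python
import binascii
from urllib.parse import parse_qsl, urlencode

# The safeargs_data value quoted above, split after each "26" ("&") byte.
safeargs_data = (
    "5f5f7265706f72743d504c5f5553455f52424226"          # __report=PL_USE_RBB&
    "5f5f63616c6c547970653d7026"                        # __callType=p&
    "646174613d323032332d30382d313026"                  # data=2023-08-10&
    "5f737667737570706f72743d7472756526"                # _svgsupport=true&
    "7265736f7572636549443d72656e646572696e6755524c26"  # resourceID=renderingURL&
    "5f5f706167654e756d6265723d3126"                    # __pageNumber=1&
    "5f5f626174636849443d31383964663635613634642d31"    # __batchID=189df65a64d-1
)

# Hex -> bytes -> plain query string.
decoded = bytes.fromhex(safeargs_data).decode("ascii")
print(decoded)

# Query string -> dict, edit one key, then back to hex.
params = dict(parse_qsl(decoded))
params["data"] = "2023-08-11"  # hypothetical: ask for a different day
re_encoded = binascii.hexlify(urlencode(params).encode()).decode()
print(re_encoded)
```

Because none of the keys or values contain characters that urlencode would escape, encoding the unmodified parameters reproduces the original hex string exactly.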

By the way, if you convert the hex value from the URL and add the safeargs_data to it, you'll still get your report.

Here's a full, decoded URL:

https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

Here's my take on it:

import binascii
from io import StringIO
from urllib.parse import urlencode

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = (
    "https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
    "raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
)

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
}

with requests.Session() as session:
    # The report link is present in the static HTML of the page.
    table_link = (
        BeautifulSoup(session.get(url, headers=headers).content, "lxml")
        .select_one("a[class='vui-generic-url']")
        .get("href")
    )

    headers.update({"X-Requested-With": "XMLHttpRequest"})

    payload_data = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": "2023-08-10",
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": "189df65a64d-1",
    }

    # URL-encode the payload, then hex-encode it to build safeargs_data.
    hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()

    table_data = session.post(
        table_link,
        data={"safeargs_data": hex_it},
        headers=headers,
    )

    df = pd.read_html(
        # .replace() is used to get rid of NBSPs; StringIO avoids the
        # deprecation warning newer pandas raises for literal HTML strings
        StringIO(table_data.json()["reportContent"].replace("\xa0", "")),
        flavor="lxml",
        skiprows=[0],
    )[1]
    df.dropna(how="all", inplace=True)
    df.to_csv("PL_USE_RBB.csv", index=False)

    print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))

This should save a .csv file PL_USE_RBB.csv and then print this:

+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
|   ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34')   | ('Koniec', '2023-08-10 15:16:06')   |   ('Początek', '17') |   ('Koniec', '24') |   ('[MWh]', '47561,000') |
|----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
|                         23 | 2023-08-10 14:15:42                   | 2023-08-10 14:16:16                 |                   16 |                 24 |              4.88788e+07 |
|                         22 | 2023-08-10 13:15:39                   | 2023-08-10 13:16:24                 |                   15 |                 24 |              4.50884e+07 |
|                         21 | 2023-08-10 12:15:36                   | 2023-08-10 12:16:10                 |                   14 |                 24 |              4.09294e+07 |
|                         20 | 2023-08-10 11:15:33                   | 2023-08-10 11:16:15                 |                   13 |                 24 |              3.12136e+07 |
|                         19 | 2023-08-10 10:15:41                   | 2023-08-10 10:16:07                 |                   12 |                 24 |              2.55946e+07 |
|                         18 | 2023-08-10 09:15:40                   | 2023-08-10 09:16:05                 |                   11 |                 24 |              2.26086e+07 |
|                         17 | 2023-08-10 08:15:40                   | 2023-08-10 08:16:00                 |                   10 |                 24 |              1.58324e+07 |
|                         16 | 2023-08-10 07:15:35                   | 2023-08-10 07:15:56                 |                    9 |                 24 |              1.11414e+07 |
|                         15 | 2023-08-10 06:15:33                   | 2023-08-10 06:15:52                 |                    8 |                 24 |              1.11796e+07 |
|                         14 | 2023-08-10 05:15:32                   | 2023-08-10 05:15:52                 |                    7 |                 24 |              9.639e+06   |
|                         13 | 2023-08-10 04:15:41                   | 2023-08-10 04:16:11                 |                    6 |                 24 |              9.0502e+06  |
|                         12 | 2023-08-10 03:15:36                   | 2023-08-10 03:15:55                 |                    5 |                 24 |              7.871e+06   |
|                         11 | 2023-08-10 02:15:35                   | 2023-08-10 02:16:03                 |                    4 |                 24 |              8.395e+06   |
|                         10 | 2023-08-10 01:15:41                   | 2023-08-10 01:16:04                 |                    3 |                 24 |              7.8954e+06  |
|                          9 | 2023-08-10 00:15:37                   | 2023-08-10 00:15:55                 |                    2 |                 24 |              8.2582e+06  |
|                          8 | 2023-08-09 23:15:03                   | 2023-08-09 23:15:24                 |                    1 |                 24 |              6.6784e+06  |
|                          7 | 2023-08-09 22:15:08                   | 2023-08-09 22:15:16                 |                    1 |                 24 |         603200           |
|                          6 | 2023-08-09 21:15:12                   | 2023-08-09 21:15:22                 |                    1 |                 24 |              0           |
|                          5 | 2023-08-09 20:15:06                   | 2023-08-09 20:15:12                 |                    1 |                 24 |              0           |
|                          4 | 2023-08-09 19:15:04                   | 2023-08-09 19:15:14                 |                    1 |                 24 |              0           |
|                          3 | 2023-08-09 18:15:11                   | 2023-08-09 18:15:32                 |                    1 |                 24 |              0           |
|                          2 | 2023-08-09 17:15:11                   | 2023-08-09 17:15:22                 |                    1 |                 24 |              0           |
|                          1 | 2023-08-09 16:15:09                   | 2023-08-09 16:15:31                 |                    1 |                 24 |              0           |
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
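One detail of the printed table is worth a second look. The last header cell shows '47561,000', while the rows below it show values such as 4.88788e+07. That pattern suggests the report writes numbers with a decimal comma and that the comma was read as a thousands separator. This is an inference from the output above, not something the answer states. A minimal sketch of the conversion, using a hypothetical helper named parse_decimal_comma:

```python
def parse_decimal_comma(text: str) -> float:
    """Convert a number written with a decimal comma (e.g. '47561,000') to float.

    Hypothetical helper for illustration. It assumes the only grouping
    characters are optional spaces or non-breaking spaces.
    """
    cleaned = text.replace("\xa0", "").replace(" ", "")
    return float(cleaned.replace(",", "."))


# Value taken from the header row of the printed table.
print(parse_decimal_comma("47561,000"))  # 47561.0

# What you get if the comma is treated as a thousands separator instead.
print(float("47561,000".replace(",", "")))  # 47561000.0
```

pandas.read_html also takes decimal and thousands arguments, so passing decimal="," there might make this post-processing unnecessary. That has not been tried against the live report.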
