Scrape data from AJAX webpage with python
Question {#heading}
I'm having this issue: I need to scrape a dynamic table's data from this webpage: https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb
This webpage uses AJAX to generate the table that I want to fetch. I inspected the element and it seems straightforward: I have the request URL with its parameters, but when I send the request I get response code 200 and an empty response body.
I must be doing something wrong, but I'm not sure how to fetch this data in Python, even though it seems fairly straightforward. Could anyone help me out?
I want to get the same table as the one displayed on the website.
Answer 1 {#1}
Score: 3
Actually, this page turned out to be a pretty cool challenge!
Breakdown:
- The link to the report sits in the source HTML. The table itself is rendered dynamically by JavaScript, but you can easily scoop the link out.
- The safeargs_data value
  5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31
  is just a silly way of obfuscating this value in hex:
  __report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1
- I've decoded it for ease of readability and editing, e.g. the data key.
- Finally, I use the table_link, the payload data, and the updated headers to make a POST request.
- Then it's easy to get the table out of the JSON and parse it with pandas.
By the way, if you convert the hex value from the URL and add the safeargs_data to it, you'll still get your report.
Here's a full, decoded URL:
https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1
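The hex obfuscation described in the breakdown can be tried out on its own, without touching the site. The sketch below is only an illustration: decode_safeargs and encode_safeargs are hypothetical helper names, not part of the page or of any library, and the replacement date is an arbitrary example. Whether the server accepts a different date together with the same __batchID is an assumption that would need checking against the live page.

```python
import binascii
from urllib.parse import parse_qsl, urlencode


def decode_safeargs(hex_value: str) -> dict:
    """Turn a hex-encoded safeargs_data string back into a dict of parameters."""
    query = binascii.unhexlify(hex_value).decode()
    return dict(parse_qsl(query))


def encode_safeargs(params: dict) -> str:
    """Hex-encode a dict of parameters as a URL-encoded query string."""
    return binascii.hexlify(urlencode(params).encode()).decode()


# The decoded query string quoted in the breakdown above.
original = (
    "__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true"
    "&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1"
)
hex_value = binascii.hexlify(original.encode()).decode()

params = decode_safeargs(hex_value)
print(params["data"])  # 2023-08-10

# Hypothetical edit: request a different day, then re-encode for the POST body.
params["data"] = "2023-08-11"
new_hex = encode_safeargs(params)
```

The round trip is lossless here because none of the keys or values contain characters that urlencode would escape.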
Here's my take on it:
import binascii
from io import StringIO
from urllib.parse import urlencode

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = (
    "https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
    "raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
)
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
}

with requests.Session() as session:
    # The report endpoint is exposed as a link in the static HTML.
    table_link = (
        BeautifulSoup(session.get(url, headers=headers).content, "lxml")
        .select_one("a[class='vui-generic-url']")
        .get("href")
    )
    # Mark the follow-up request as an AJAX call.
    headers.update({"X-Requested-With": "XMLHttpRequest"})
    # Decoded form of safeargs_data; edit e.g. "data" to change the date.
    payload_data = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": "2023-08-10",
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": "189df65a64d-1",
    }
    # Re-encode the parameters as the hex string the endpoint expects.
    hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()
    table_data = session.post(
        table_link,
        data={"safeargs_data": hex_it},
        headers=headers,
    )
    df = pd.read_html(
        # .replace() is used to get rid of NBSPs; StringIO avoids the
        # warning newer pandas versions emit for literal HTML strings.
        StringIO(table_data.json()["reportContent"].replace("\xa0", "")),
        flavor="lxml",
        skiprows=[0],
    )[1]
    df.dropna(how="all", inplace=True)
    df.to_csv("PL_USE_RBB.csv", index=False)
    print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))
This should save a .csv file, PL_USE_RBB.csv, and then print this:
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
| ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34') | ('Koniec', '2023-08-10 15:16:06') | ('Początek', '17') | ('Koniec', '24') | ('[MWh]', '47561,000') |
|----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
| 23 | 2023-08-10 14:15:42 | 2023-08-10 14:16:16 | 16 | 24 | 4.88788e+07 |
| 22 | 2023-08-10 13:15:39 | 2023-08-10 13:16:24 | 15 | 24 | 4.50884e+07 |
| 21 | 2023-08-10 12:15:36 | 2023-08-10 12:16:10 | 14 | 24 | 4.09294e+07 |
| 20 | 2023-08-10 11:15:33 | 2023-08-10 11:16:15 | 13 | 24 | 3.12136e+07 |
| 19 | 2023-08-10 10:15:41 | 2023-08-10 10:16:07 | 12 | 24 | 2.55946e+07 |
| 18 | 2023-08-10 09:15:40 | 2023-08-10 09:16:05 | 11 | 24 | 2.26086e+07 |
| 17 | 2023-08-10 08:15:40 | 2023-08-10 08:16:00 | 10 | 24 | 1.58324e+07 |
| 16 | 2023-08-10 07:15:35 | 2023-08-10 07:15:56 | 9 | 24 | 1.11414e+07 |
| 15 | 2023-08-10 06:15:33 | 2023-08-10 06:15:52 | 8 | 24 | 1.11796e+07 |
| 14 | 2023-08-10 05:15:32 | 2023-08-10 05:15:52 | 7 | 24 | 9.639e+06 |
| 13 | 2023-08-10 04:15:41 | 2023-08-10 04:16:11 | 6 | 24 | 9.0502e+06 |
| 12 | 2023-08-10 03:15:36 | 2023-08-10 03:15:55 | 5 | 24 | 7.871e+06 |
| 11 | 2023-08-10 02:15:35 | 2023-08-10 02:16:03 | 4 | 24 | 8.395e+06 |
| 10 | 2023-08-10 01:15:41 | 2023-08-10 01:16:04 | 3 | 24 | 7.8954e+06 |
| 9 | 2023-08-10 00:15:37 | 2023-08-10 00:15:55 | 2 | 24 | 8.2582e+06 |
| 8 | 2023-08-09 23:15:03 | 2023-08-09 23:15:24 | 1 | 24 | 6.6784e+06 |
| 7 | 2023-08-09 22:15:08 | 2023-08-09 22:15:16 | 1 | 24 | 603200 |
| 6 | 2023-08-09 21:15:12 | 2023-08-09 21:15:22 | 1 | 24 | 0 |
| 5 | 2023-08-09 20:15:06 | 2023-08-09 20:15:12 | 1 | 24 | 0 |
| 4 | 2023-08-09 19:15:04 | 2023-08-09 19:15:14 | 1 | 24 | 0 |
| 3 | 2023-08-09 18:15:11 | 2023-08-09 18:15:32 | 1 | 24 | 0 |
| 2 | 2023-08-09 17:15:11 | 2023-08-09 17:15:22 | 1 | 24 | 0 |
| 1 | 2023-08-09 16:15:09 | 2023-08-09 16:15:31 | 1 | 24 | 0 |
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
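One thing to watch in the output above: the header cell of the last column shows '47561,000', while the rows below show values like 4.88788e+07. This looks like the site using a comma as the decimal mark, which pandas treats as a thousands separator by default. That is an inference from the printed output, not something the answer states. If it is right, pd.read_html accepts decimal and thousands arguments that could be adjusted, or the raw strings can be converted by hand. A minimal sketch of the by-hand route, using a hypothetical helper name:

```python
def parse_decimal_comma(text: str) -> float:
    """Convert a string such as '47561,000' (comma as decimal mark) to a float."""
    # Drop non-breaking and regular spaces that may be used for digit grouping.
    cleaned = text.replace("\xa0", "").replace(" ", "")
    # Swap the decimal comma for a dot so float() can parse it.
    return float(cleaned.replace(",", "."))


print(parse_decimal_comma("47561,000"))  # 47561.0
```

This assumes the values never use a comma for digit grouping; if they do, the helper would need a stricter rule.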