Scrape data from an AJAX webpage with Python

Question {#heading}

I'm having this issue: I need to scrape a dynamic table's data from this webpage: https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb

This webpage uses AJAX to generate the table that I want to fetch. I inspected the element and it seems straightforward: I have the request URL with its parameters, but when I send the request I get response code 200 and an empty response body.

I must be doing something wrong, but I'm not sure how to fetch this data in Python, even though it seems fairly straightforward. Could anyone help me out?

I want to get the same table as the one that is displayed on the website.

Answer 1 {#1}

Score: 3

Actually, this page turned out to be a pretty cool challenge!

Breakdown:

  • The link to the report sits in the source HTML. The table itself is rendered dynamically by JavaScript, but you can easily scoop it out.

  • The safeargs_data value

    5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31

    is just a silly way of obfuscating this value in hex:

    __report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

  • I've decoded it for readability and ease of editing, e.g. the data key.
  • Finally, I use the table_link, the payload data, and the updated headers to make a POST request.
  • Then it's easy to get the table out of the JSON and parse it with pandas.
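
The hex obfuscation described in the list above can be sketched with the standard library alone. This is a minimal illustration, not the site's own code: the helper names encode_safeargs and decode_safeargs are hypothetical, and whether the server accepts a different date or batch ID is an assumption that is not verified here.

```python
import binascii
from urllib.parse import parse_qsl, urlencode


def encode_safeargs(params: dict) -> str:
    """URL-encode the parameters, then hex-encode the resulting bytes."""
    return binascii.hexlify(urlencode(params).encode()).decode()


def decode_safeargs(hex_value: str) -> dict:
    """Reverse the steps: hex -> query string -> dict of parameters."""
    query = binascii.unhexlify(hex_value).decode()
    return dict(parse_qsl(query, keep_blank_values=True))


# Parameters taken from the decoded value shown above.
params = {
    "__report": "PL_USE_RBB",
    "__callType": "p",
    "data": "2023-08-10",
    "_svgsupport": "true",
    "resourceID": "renderingURL",
    "__pageNumber": "1",
    "__batchID": "189df65a64d-1",
}

hex_value = encode_safeargs(params)
decoded = decode_safeargs(hex_value)

# Editing a key and re-encoding; the new date is a hypothetical example.
decoded["data"] = "2023-08-09"
new_hex = encode_safeargs(decoded)

print(hex_value[:38])
print(decode_safeargs(new_hex)["data"])
```

Keeping the payload as a dict and encoding it only at the last moment makes it easy to change a single key, such as data, without touching the hex string by hand.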

By the way, if you decode the hex value from the URL and append the decoded safeargs_data to it, you'll still get your report.

Here's a full, decoded URL:

https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

Here's my take on it:

import binascii
from io import StringIO
from urllib.parse import urlencode

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = (
    "https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
    "raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
)

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
}

with requests.Session() as session:
    # The link to the report is present in the static HTML of the page.
    table_link = (
        BeautifulSoup(session.get(url, headers=headers).content, "lxml")
        .select_one("a[class='vui-generic-url']")
        .get("href")
    )

    # Mark the next request as an AJAX call.
    headers.update({"X-Requested-With": "XMLHttpRequest"})

    # Decoded form of the safeargs_data value; edit e.g. "data" as needed.
    payload_data = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": "2023-08-10",
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": "189df65a64d-1",
    }

    # Re-encode the payload: URL-encode it, then convert the bytes to hex.
    hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()

    table_data = session.post(
        table_link,
        data={"safeargs_data": hex_it},
        headers=headers,
    )

    df = pd.read_html(
        # StringIO is used because newer pandas versions deprecate passing
        # literal HTML strings; .replace() is used to get rid of NBSPs
        StringIO(table_data.json()["reportContent"].replace("\xa0", "")),
        flavor="lxml",
        skiprows=[0],
    )[1]
    df.dropna(how="all", inplace=True)
    df.to_csv("PL_USE_RBB.csv", index=False)

    print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))

This should save a .csv file PL_USE_RBB.csv and then print this:

+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
|   ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34')   | ('Koniec', '2023-08-10 15:16:06')   |   ('Początek', '17') |   ('Koniec', '24') |   ('[MWh]', '47561,000') |
|----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
|                         23 | 2023-08-10 14:15:42                   | 2023-08-10 14:16:16                 |                   16 |                 24 |              4.88788e+07 |
|                         22 | 2023-08-10 13:15:39                   | 2023-08-10 13:16:24                 |                   15 |                 24 |              4.50884e+07 |
|                         21 | 2023-08-10 12:15:36                   | 2023-08-10 12:16:10                 |                   14 |                 24 |              4.09294e+07 |
|                         20 | 2023-08-10 11:15:33                   | 2023-08-10 11:16:15                 |                   13 |                 24 |              3.12136e+07 |
|                         19 | 2023-08-10 10:15:41                   | 2023-08-10 10:16:07                 |                   12 |                 24 |              2.55946e+07 |
|                         18 | 2023-08-10 09:15:40                   | 2023-08-10 09:16:05                 |                   11 |                 24 |              2.26086e+07 |
|                         17 | 2023-08-10 08:15:40                   | 2023-08-10 08:16:00                 |                   10 |                 24 |              1.58324e+07 |
|                         16 | 2023-08-10 07:15:35                   | 2023-08-10 07:15:56                 |                    9 |                 24 |              1.11414e+07 |
|                         15 | 2023-08-10 06:15:33                   | 2023-08-10 06:15:52                 |                    8 |                 24 |              1.11796e+07 |
|                         14 | 2023-08-10 05:15:32                   | 2023-08-10 05:15:52                 |                    7 |                 24 |              9.639e+06   |
|                         13 | 2023-08-10 04:15:41                   | 2023-08-10 04:16:11                 |                    6 |                 24 |              9.0502e+06  |
|                         12 | 2023-08-10 03:15:36                   | 2023-08-10 03:15:55                 |                    5 |                 24 |              7.871e+06   |
|                         11 | 2023-08-10 02:15:35                   | 2023-08-10 02:16:03                 |                    4 |                 24 |              8.395e+06   |
|                         10 | 2023-08-10 01:15:41                   | 2023-08-10 01:16:04                 |                    3 |                 24 |              7.8954e+06  |
|                          9 | 2023-08-10 00:15:37                   | 2023-08-10 00:15:55                 |                    2 |                 24 |              8.2582e+06  |
|                          8 | 2023-08-09 23:15:03                   | 2023-08-09 23:15:24                 |                    1 |                 24 |              6.6784e+06  |
|                          7 | 2023-08-09 22:15:08                   | 2023-08-09 22:15:16                 |                    1 |                 24 |         603200           |
|                          6 | 2023-08-09 21:15:12                   | 2023-08-09 21:15:22                 |                    1 |                 24 |              0           |
|                          5 | 2023-08-09 20:15:06                   | 2023-08-09 20:15:12                 |                    1 |                 24 |              0           |
|                          4 | 2023-08-09 19:15:04                   | 2023-08-09 19:15:14                 |                    1 |                 24 |              0           |
|                          3 | 2023-08-09 18:15:11                   | 2023-08-09 18:15:32                 |                    1 |                 24 |              0           |
|                          2 | 2023-08-09 17:15:11                   | 2023-08-09 17:15:22                 |                    1 |                 24 |              0           |
|                          1 | 2023-08-09 16:15:09                   | 2023-08-09 16:15:31                 |                    1 |                 24 |              0           |
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
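
One thing to watch in the output above: the [MWh] values (for example 4.88788e+07) look about 1000 times larger than the 47561,000 in the header row. That would be consistent with pandas treating the comma as a thousands separator, which is the default for pd.read_html. pd.read_html accepts decimal and thousands arguments that can change this. As a library-independent sketch, the conversion can also be done by hand. The helper name parse_decimal_comma is hypothetical, and every sample value except 47561,000 is an assumed reconstruction of what the page shows.

```python
def parse_decimal_comma(text: str) -> float:
    """Convert a number written with a decimal comma, e.g. '47561,000', to float.

    Assumes the comma is the decimal separator and that no '.' is used
    for grouping thousands; spaces and NBSPs are stripped first.
    """
    cleaned = text.replace("\xa0", "").replace(" ", "").replace(",", ".")
    return float(cleaned)


# '47561,000' appears in the header row above; the other values are
# assumed examples in the same format.
values = ["47561,000", "48878,800", "0,000"]
parsed = [parse_decimal_comma(v) for v in values]
print(parsed)
```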
