Python - turn a string into a dictionary where keys are subheadings and values are links
Question {#heading}
In the middle of some text I have the following.

    Some random text before.

    ----CAPITAL WORDS:
    first subheading
    https://link1
    https://link2

    second subheading
    https://link3

    third subheading
    https://link4
    https://link5
    https://link6
    https://link7

    ----MORE CAPITAL WORDS:
    Some random text after.

I would like to extract the string between `----CAPITAL WORDS:` and `----MORE CAPITAL WORDS:` and store it in a dictionary as follows:

    {
        'first subheading': ["https://link1", "https://link2"],
        'second subheading': ["https://link3"],
        'third subheading': ["https://link4", "https://link5", "https://link6", "https://link7"]
    }
Attempt {#attempt}
    import re

    pattern = r"CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
    matches = re.search(pattern, descriptions[0], re.DOTALL)
    lines = matches.group(1).strip().split("\n")

    link_dict = {}
    for line in lines:
        if line:
            pass  # unsure how to continue
Answer 1 {#1}
Score: 3

Given that you have already processed `lines` to be:

    ['first subheading', 'https://link1', 'https://link2', '',
     'second subheading', 'https://link3', '',
     'third subheading', 'https://link4', 'https://link5',
     'https://link6', 'https://link7']

You can do this concisely using `itertools.groupby` (https://docs.python.org/3/library/itertools.html#itertools.groupby):

    from itertools import groupby

    {next(g): [*g] for k, g in groupby(lines, key=bool) if k}
    # {'first subheading': ['https://link1', 'https://link2'],
    #  'second subheading': ['https://link3'],
    #  'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
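To see why the one-liner works, it may help to spell out what `groupby(lines, key=bool)` produces: consecutive lines are grouped by truthiness, so each run of non-empty lines forms one group (key `True`) and each blank separator forms another (key `False`). The sketch below unrolls the comprehension into a plain loop. The helper name `group_links` is made up for illustration, and it assumes every block starts with its subheading.

```python
from itertools import groupby


def group_links(lines):
    """Map each subheading to the links that follow it.

    Hypothetical helper: assumes blocks are separated by empty strings
    and that the first line of each block is the subheading.
    """
    result = {}
    for is_block, group in groupby(lines, key=bool):
        if not is_block:
            continue  # a run of blank separator lines
        heading = next(group)          # first line of the block
        result[heading] = list(group)  # remaining lines are the links
    return result


lines = ['first subheading', 'https://link1', 'https://link2', '',
         'second subheading', 'https://link3', '',
         'third subheading', 'https://link4', 'https://link5',
         'https://link6', 'https://link7']

link_dict = group_links(lines)
print(link_dict)
```

The explicit loop also avoids relying on a dict comprehension evaluating its key before its value, which is only guaranteed from Python 3.8 onward.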
Answer 2
===
Score: 1

```python
example = """Some random text before.
----CAPITAL WORDS:
first subheading
https://link1
https://link2
second subheading
https://link3
third subheading
https://link4
https://link5
https://link6
https://link7
----MORE CAPITAL WORDS:
Some random text after."""

out = {}
current_heading = None
for line in example.splitlines():
    if line.startswith('----'):
        # Section markers carry no data, so skip them.
        pass
    elif line.startswith('http'):
        # A link belongs to the most recent subheading.
        out[current_heading].append(line)
    elif line.islower():
        # An all-lowercase line is treated as a new subheading.
        current_heading = line
        out[current_heading] = []
print(out)
```

Output:

```python
{'first subheading': ['https://link1', 'https://link2'],
 'second subheading': ['https://link3'],
 'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
```

Answer 3 {#3}
Score: 1

To match `----CAPITAL WORDS:` and process all following parts without crossing either `----CAPITAL WORDS:` or `----MORE CAPITAL WORDS:`, you could make use of the [PyPI regex module](https://pypi.org/project/regex/) and `captures()` for repeated capture groups.

```
(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*
```

The pattern matches:

- `(?:` Non-capture group
  - `^----CAPITAL WORDS:\n` Match `----CAPITAL WORDS:` literally at the start of a line (the code below uses `regex.MULTILINE`), followed by a newline
  - `|` Or
  - `\G` Assert the current position at the end of the previous match
- `)` Close the non-capture group
- `(?!` Negative lookahead, assert that what is directly to the right is not
  - `(----(?:MORE )?CAPITAL WORDS:)` Capture in **group 1**, matching `----CAPITAL WORDS:` with an optional leading `MORE ` part
- `)` Close the negative lookahead
- `(?P<sub>\S.*)` Capture group **sub**, match the single-line subheading (starting with at least one non-whitespace char to prevent matching empty lines)
- `(?:` Non-capture group to repeat as a whole
  - `\n` Match a newline
  - `(?!(?1))` Assert that the pattern of group 1 is not directly to the right
  - `(?P<val>\S.*)` Capture in group **val** the single-line values like "https://link1"
- `)+` Close the non-capture group and repeat it 1+ times
- `\s*` Match optional whitespace chars

See a [regex demo](https://regex101.com/r/piFIiu/1) and a [Python demo](https://tio.run/##jZJdT8IwFIbv@ytqr1rEIeJHsqgLUWJIQAhgvGBoytaxRtYtbSEY9bfPw8cFCk561z7vefvmnJO92zhVtTyXSZZqi7WYiAVCGbdWaIVvsCbUc19O4NzVu81BvYWfO737vuurT/@BUe@ILhlo2p1eAzPvp4qBonttZuNbv@@U4AJ1yxqvuiZzPl0TduybEkHIwJeU9NNEYM1VmCbYioXFYxGlWji@8hVBeOuQfcl@SSKpjcUQIhY8lGqyI4itzYxbqUyleqsW0rM9CYwIUhUe6l/b42BjqQ82OC@kF4X0spBe/dHe1WT/6fHOxHgEC@QQhlDCbRCL5WBXy@VEUoUSIN0sWRmb8ga1n1qDZqv52GAoDCxUfHwhGDx@LeMES4WFmiVCcyvoxpS5qxggHiZOwDM708JQAq0kbHg6GoHF9jusGyTKtFSWQg3L828).

```python
import regex

pattern = r"(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*"

s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\n"
     "https://link1\n"
     "https://link2\n\n"
     "second subheading\n"
     "https://link3\n\n"
     "third subheading\n"
     "https://link4\n"
     "https://link5\n"
     "https://link6\n"
     "https://link7\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

matches = regex.finditer(pattern, s, regex.MULTILINE)
dct = {}
for m in matches:
    # captures() returns every value a repeated group matched, not only the last one.
    dct[m.captures("sub")[0]] = m.captures("val")
print(dct)
```

Output:

```python
{
    'first subheading': ['https://link1', 'https://link2'],
    'second subheading': ['https://link3'],
    'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']
}
```
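If installing the third-party `regex` module is not an option, roughly the same result can be had with the standard library alone. The following is only a sketch and not the method of any answer above: it assumes the section appears once, that blocks are separated by blank lines, and that each block starts with its subheading. The function name `parse_section` is invented here.

```python
import re


def parse_section(text):
    """Return {subheading: [links]} for the text between the two markers.

    Hypothetical helper; assumes blank-line-separated blocks whose first
    line is the subheading.
    """
    match = re.search(
        r"^----CAPITAL WORDS:\n(.*?)(?=^----MORE CAPITAL WORDS:|\Z)",
        text,
        re.DOTALL | re.MULTILINE,
    )
    if match is None:
        return {}
    result = {}
    # Split the captured section into blocks at one or more blank lines.
    for block in re.split(r"\n\s*\n", match.group(1).strip()):
        if not block:
            continue  # the section was empty
        heading, *links = block.splitlines()
        result[heading.strip()] = [link.strip() for link in links]
    return result


s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\n"
     "https://link1\n"
     "https://link2\n\n"
     "second subheading\n"
     "https://link3\n\n"
     "third subheading\n"
     "https://link4\n"
     "https://link5\n"
     "https://link6\n"
     "https://link7\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

parsed = parse_section(s)
print(parsed)
```

Splitting on blank lines keeps the regular expression short, at the cost of depending on those separators being present in the input.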