Python - turn a string into a dictionary where keys are subheadings and values are links

Question {#heading}


In the middle of some text I have the following.

    Some random text before.

    ----CAPITAL WORDS:
    first subheading
    https://link1
    https://link2

    second subheading
    https://link3

    third subheading
    https://link4
    https://link5
    https://link6
    https://link7

    ----MORE CAPITAL WORDS:
    Some random text after.

I would like to extract the text between `----CAPITAL WORDS:` and `----MORE CAPITAL WORDS:` and store it in a dictionary as follows:

    {
        'first subheading': ["https://link1", "https://link2"],
        'second subheading': ["https://link3"],
        'third subheading': ["https://link4", "https://link5", "https://link6", "https://link7"]
    }

Attempt {#attempt}

    import re

    pattern = r"CAPITAL WORDS:(.*?)(?:\n----MORE CAPITAL WORDS:|$)"
    matches = re.search(pattern, descriptions[0], re.DOTALL)
    lines = matches.group(1).strip().split("\n")

    link_dict = {}
    for line in lines:
        if line:
            pass  # unsure how to continue

Answer 1 {#1}

Score: 3

Given that you have already processed `lines` to be:

    ['first subheading', 'https://link1', 'https://link2', '',
     'second subheading', 'https://link3', '',
     'third subheading', 'https://link4', 'https://link5',
     'https://link6', 'https://link7']

You can do this concisely using [itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby):

    from itertools import groupby

    # key=bool splits the list into runs of non-empty and empty lines;
    # in each non-empty run, next(g) takes the subheading and [*g] the remaining links.
    {next(g): [*g] for k, g in groupby(lines, key=bool) if k}
    # {'first subheading':  ['https://link1', 'https://link2'],
    #  'second subheading': ['https://link3'],
    #  'third subheading':  ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
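To see what the comprehension in Answer 1 is doing, it may help to unroll it into an explicit loop. The following is only a sketch of the same idea, not part of the original answer; `group_links` is a hypothetical helper name, and `lines` is assumed to look like the list shown in that answer. Note that the one-liner relies on the key expression `next(g)` being evaluated before the value `[*g]`, which is the order dict comprehensions follow from Python 3.8 onward; the loop below does not depend on that.

```python
from itertools import groupby

lines = ['first subheading', 'https://link1', 'https://link2', '',
         'second subheading', 'https://link3', '',
         'third subheading', 'https://link4', 'https://link5',
         'https://link6', 'https://link7']


def group_links(lines):
    """Hypothetical helper: unrolled form of the groupby dict comprehension."""
    result = {}
    # key=bool labels each line as truthy (non-empty) or falsy (empty);
    # groupby yields one group per run of consecutive lines with the same label.
    for is_text, group in groupby(lines, key=bool):
        if not is_text:
            continue  # a run of blank separator lines
        heading = next(group)          # first line of the run is the subheading
        result[heading] = list(group)  # the remaining lines are its links
    return result


link_dict = group_links(lines)
print(link_dict)
```

A subheading with no links under it ends up with an empty list, and consecutive blank lines are handled as a single separator.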


Answer 2 {#2}

Score: 1

```python
example = """Some random text before.

----CAPITAL WORDS:
first subheading
https://link1
https://link2

second subheading
https://link3

third subheading
https://link4
https://link5
https://link6
https://link7

----MORE CAPITAL WORDS:
Some random text after."""


out = {}
current_heading = None
for line in example.splitlines():
    if line.startswith('----'):
        pass  # marker lines are skipped
    elif line.startswith('http'):
        out[current_heading].append(line)  # a link under the current subheading
    elif line.islower():
        # an all-lowercase line is treated as a new subheading
        current_heading = line
        out[current_heading] = []
```

Output:

```python
{'first subheading': ['https://link1', 'https://link2'],
 'second subheading': ['https://link3'],
 'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']}
```

Answer 3 {#3}

Score: 1

To match `----CAPITAL WORDS:` and process all following parts without crossing either `----CAPITAL WORDS:` or `----MORE CAPITAL WORDS:`, you could make use of the [PyPI regex module](https://pypi.org/project/regex/) and `captures()` for repeated capture groups.

```
(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*
```

The pattern matches:

- `(?:` Non-capturing group
  - `^----CAPITAL WORDS:\n` Match `----CAPITAL WORDS:` literally at the start of a line (the code below passes `regex.MULTILINE`), followed by a newline
  - `|` Or
  - `\G` Assert the current position at the end of the previous match
- `)` Close the non-capturing group
- `(?!` Negative lookahead, assert that what is directly to the right is not
  - `(----(?:MORE )?CAPITAL WORDS:)` Capture in **group 1**, matching `CAPITAL WORDS:` with an optional leading `MORE ` part
- `)` Close the negative lookahead
- `(?P<sub>\S.*)` Capture group **sub**, match the single-line subheading (starting with at least one non-whitespace character to prevent matching empty lines)
- `(?:` Non-capturing group, repeated as a whole
  - `\n` Match a newline
  - `(?!(?1))` Assert that the pattern of group 1 is not directly to the right
  - `(?P<val>\S.*)` Capture in group **val** a single-line value such as "https://link1"
- `)+` Close the non-capturing group and repeat it 1+ times
- `\s*` Match optional whitespace characters

See a [regex demo](https://regex101.com/r/piFIiu/1) and a [Python demo](https://tio.run/##jZJdT8IwFIbv@ytqr1rEIeJHsqgLUWJIQAhgvGBoytaxRtYtbSEY9bfPw8cFCk561z7vefvmnJO92zhVtTyXSZZqi7WYiAVCGbdWaIVvsCbUc19O4NzVu81BvYWfO737vuurT/@BUe@ILhlo2p1eAzPvp4qBonttZuNbv@@U4AJ1yxqvuiZzPl0TduybEkHIwJeU9NNEYM1VmCbYioXFYxGlWji@8hVBeOuQfcl@SSKpjcUQIhY8lGqyI4itzYxbqUyleqsW0rM9CYwIUhUe6l/b42BjqQ82OC@kF4X0spBe/dHe1WT/6fHOxHgEC@QQhlDCbRCL5WBXy@VEUoUSIN0sWRmb8ga1n1qDZqv52GAoDCxUfHwhGDx@LeMES4WFmiVCcyvoxpS5qxggHiZOwDM708JQAq0kbHg6GoHF9jusGyTKtFSWQg3L828).

```python
import regex

pattern = r"(?:^----CAPITAL WORDS:\n|\G)(?!(----(?:MORE )?CAPITAL WORDS:))(?P<sub>\S.*)(?:\n(?!(?1))(?P<val>\S.*))+\s*"

s = ("Some random text before.\n\n"
     "----CAPITAL WORDS:\n"
     "first subheading\n"
     "https://link1\n"
     "https://link2\n\n"
     "second subheading\n"
     "https://link3\n\n"
     "third subheading\n"
     "https://link4\n"
     "https://link5\n"
     "https://link6\n"
     "https://link7\n\n"
     "----MORE CAPITAL WORDS:\n"
     "Some random text after.")

matches = regex.finditer(pattern, s, regex.MULTILINE)
dct = {}
for m in matches:
    # captures() returns every value a repeated group matched, not only the last one
    dct[m.captures("sub")[0]] = m.captures("val")
print(dct)
```

Output

```python
{
'first subheading': ['https://link1', 'https://link2'],
'second subheading': ['https://link3'],
'third subheading': ['https://link4', 'https://link5', 'https://link6', 'https://link7']
}
```
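If installing the third-party `regex` package is not an option, something similar can probably be done with the standard-library `re` module alone. The sketch below is not the approach from Answer 3: it swaps the `\G` / `captures()` technique for a simpler two-step one (cut out the section between the two markers, then split it on blank lines). `parse_section` and its `start` / `end` parameters are hypothetical names, and the code assumes both markers are present and that the first line of each block is the subheading.

```python
import re

text = (
    "Some random text before.\n\n"
    "----CAPITAL WORDS:\n"
    "first subheading\n"
    "https://link1\n"
    "https://link2\n\n"
    "second subheading\n"
    "https://link3\n\n"
    "third subheading\n"
    "https://link4\n"
    "https://link5\n"
    "https://link6\n"
    "https://link7\n\n"
    "----MORE CAPITAL WORDS:\n"
    "Some random text after."
)


def parse_section(text, start="----CAPITAL WORDS:", end="----MORE CAPITAL WORDS:"):
    """Hypothetical helper: map each subheading to the links listed under it."""
    # Step 1: lazily capture everything between the two marker lines.
    section = re.search(
        rf"^{re.escape(start)}\n(.*?)^{re.escape(end)}",
        text,
        flags=re.DOTALL | re.MULTILINE,
    )
    if section is None:
        return {}  # assumption: a missing marker means nothing to parse

    result = {}
    # Step 2: one or more blank lines separate the blocks.
    for block in re.split(r"\n\s*\n", section.group(1).strip()):
        if not block.strip():
            continue  # empty section
        heading, *links = block.splitlines()
        result[heading.strip()] = [link.strip() for link in links]
    return result


link_dict = parse_section(text)
print(link_dict)
```

Unlike the asker's original pattern, this version returns an empty dictionary when the closing marker is missing rather than reading to the end of the string; whether that is the right behaviour depends on the data.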