51工具盒子

依楼听风雨
笑看云卷云舒,淡观潮起潮落

合并两个或多个表,使用Python检查唯一值。

英文:

Merge two or more tables checking the unique value using python

问题 {#heading}

以下是翻译好的部分:

我是Python的初学者。
我必须检查2或3个表格,看看是否有一些数据缺失。
这些数据是`pandas.core.frame.DataFrame`。
我有这些表格:

    start	end	val	accn	fy	fp	form	filed	frame
    0	2016-01-01	2016-12-31	90272000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	CY2016
    6	2017-01-01	2017-12-31	110855000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	NaN
    7	2017-01-01	2017-12-31	110855000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2017
    18	2018-01-01	2018-12-31	136819000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	NaN
    19	2018-01-01	2018-12-31	136819000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2018
    30	2019-01-01	2019-12-31	161857000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2019
    41	2020-01-01	2020-12-31	182527000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2020
    52	2021-01-01	2021-12-31	257637000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2021
    61	2022-01-01	2022-12-31	282836000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2022




和


    start	end	val	accn	fy	fp	form	filed	frame
    0	2006-12-31	2007-12-29	39474000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	CY2007
    4	2007-12-30	2008-12-27	43251000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    5	2007-12-30	2008-12-27	43251000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	NaN
    13	2008-12-28	2009-12-26	43232000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	CY2009
    15	2008-12-28	2009-12-26	43232000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    16	2008-12-28	2009-12-26	43232000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	NaN
    27	2009-12-27	2010-12-25	57838000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	CY2010
    28	2009-12-27	2010-12-25	57838000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	NaN
    30	2009-12-27	2010-12-25	57838000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    41	2010-12-26	2011-12-31	66504000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	NaN
    42	2010-12-26	2011-12-31	66504000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	NaN
    43	2010-12-26	2011-12-31	66504000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	CY2011
    54	2012-01-01	2012-12-29	65492000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	NaN
    56	2012-01-01	2012-12-29	65492000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    57	2012-01-01	2012-12-29	65492000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	NaN
    68	2012-12-30	2013-12-28	66415000000	0000077476-16-000066	2015	FY	10-K	2016-02-11	CY2013
    70	2012-12-30	2013-12-28	66415000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    71	2012-12-30	2013-12-28	66415000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	NaN
    82	2013-12-29	2014-12-27	66683000000	0000077476-17-000010	2016	FY	10-K	2017-02-15	CY2014
    83	2013-12-29	2014




\<details\>
\<summary\>英文:\</summary\>


I am a beginner with Python.
I have to check 2 or 3 tables and see if some data is missing.
The data are `pandas.core.frame.DataFrame`.
I have these tables:


    start	end	val	accn	fy	fp	form	filed	frame
    0	2016-01-01	2016-12-31	90272000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	CY2016
    6	2017-01-01	2017-12-31	110855000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	NaN
    7	2017-01-01	2017-12-31	110855000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2017
    18	2018-01-01	2018-12-31	136819000000	0001652044-19-000004	2018	FY	10-K	2019-02-05	NaN
    19	2018-01-01	2018-12-31	136819000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2018
    30	2019-01-01	2019-12-31	161857000000	0001652044-20-000008	2019	FY	10-K	2020-02-04	CY2019
    41	2020-01-01	2020-12-31	182527000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2020
    52	2021-01-01	2021-12-31	257637000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2021
    61	2022-01-01	2022-12-31	282836000000	0001652044-23-000016	2022	FY	10-K	2023-02-03	CY2022




and


    start	end	val	accn	fy	fp	form	filed	frame
    0	2006-12-31	2007-12-29	39474000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	CY2007
    4	2007-12-30	2008-12-27	43251000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    5	2007-12-30	2008-12-27	43251000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	NaN
    13	2008-12-28	2009-12-26	43232000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	CY2009
    15	2008-12-28	2009-12-26	43232000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    16	2008-12-28	2009-12-26	43232000000	0001193125-10-036385	2009	FY	10-K	2010-02-22	NaN
    27	2009-12-27	2010-12-25	57838000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	CY2010
    28	2009-12-27	2010-12-25	57838000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	NaN
    30	2009-12-27	2010-12-25	57838000000	0001193125-11-040427	2010	FY	10-K	2011-02-18	NaN
    41	2010-12-26	2011-12-31	66504000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	NaN
    42	2010-12-26	2011-12-31	66504000000	0001193125-12-081822	2011	FY	10-K	2012-02-27	NaN
    43	2010-12-26	2011-12-31	66504000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	CY2011
    54	2012-01-01	2012-12-29	65492000000	0001445305-13-000278	2012	FY	10-K	2013-02-21	NaN
    56	2012-01-01	2012-12-29	65492000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    57	2012-01-01	2012-12-29	65492000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	NaN
    68	2012-12-30	2013-12-28	66415000000	0000077476-16-000066	2015	FY	10-K	2016-02-11	CY2013
    70	2012-12-30	2013-12-28	66415000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    71	2012-12-30	2013-12-28	66415000000	0000077476-14-000007	2013	FY	10-K	2014-02-14	NaN
    82	2013-12-29	2014-12-27	66683000000	0000077476-17-000010	2016	FY	10-K	2017-02-15	CY2014
    83	2013-12-29	2014-12-27	66683000000	0000077476-16-000066	2015	FY	10-K	2016-02-11	NaN
    85	2013-12-29	2014-12-27	66683000000	0000077476-15-000012	2014	FY	10-K	2015-02-12	NaN
    96	2014-12-28	2015-12-26	63056000000	0000077476-18-000012	2017	FY	10-K	2018-02-13	CY2015
    97	2014-12-28	2015-12-26	63056000000	0000077476-17-000010	2016	FY	10-K	2017-02-15	NaN
    98	2014-12-28	2015-12-26	63056000000	0000077476-16-000066	2015	FY	10-K	2016-02-11	NaN
    109	2015-12-27	2016-12-31	62799000000	0000077476-18-000012	2017	FY	10-K	2018-02-13	CY2016
    110	2015-12-27	2016-12-31	62799000000	0000077476-17-000010	2016	FY	10-K	2017-02-15	NaN
    117	2017-01-01	2017-12-30	63525000000	0000077476-18-000012	2017	FY	10-K	2018-02-13	CY2017




So the target is to check and merge the data from all years.
So basically I have to check the 2 tables and create a new table with all CY Years.
For example the first table start from 2016, but the second from 2007. So I need to create a table with `val` and `frame` merged and unique values. The table basically should have all years found from an API.


Example result:


    39474000000 2007
    43232000000  2009
    57838000000 2010
    ...
    110855000000     2017
    etc etc




There is a thing I have to check if both tables contain data for `the same year`(example `2017`), then I have to check which `val` is bigger and insert this in the new table.
How can  I do an iter for loo year by year and check the val value between the two tables?


\</details\>


答案1
===



得分: 1


如果您想忽略`frame`列中包含`NaN`的行,您可以在应用`max`之前,简单地连接您过滤后的数据框并按`frame`进行分组:


```python
df1 = df1.dropna(subset='frame')[['val', 'frame']]
df2 = df2.dropna(subset='frame')[['val', 'frame']]

pd.concat([
    df1[~df1['frame'].str.contains('Q')],
    df2[~df2['frame'].str.contains('Q')]
]).groupby('frame').max()
</code></pre>
 <p>输出:</p>
 <pre tabindex="0" style="color:#f8f8f2;background-color:#272822;"><code><span style="display:flex;"><span>                 val
</span></span><span style="display:flex;"><span>frame
</span></span><span style="display:flex;"><span>CY2007   39474000000
</span></span><span style="display:flex;"><span>CY2009   43232000000
</span></span><span style="display:flex;"><span>CY2010   57838000000
</span></span><span style="display:flex;"><span>CY2011   66504000000
</span></span><span style="display:flex;"><span>CY2013   66415000000
</span></span><span style="display:flex;"><span>CY2014   66683000000
</span></span><span style="display:flex;"><span>CY2015   63056000000
</span></span><span style="display:flex;"><span>CY2016   90272000000
</span></span><span style="display:flex;"><span>CY2017  110855000000
</span></span><span style="display:flex;"><span>CY2018  136819000000
</span></span><span style="display:flex;"><span>CY2019  161857000000
</span></span><span style="display:flex;"><span>CY2020  182527000000
</span></span><span style="display:flex;"><span>CY2021  257637000000
</span></span><span style="display:flex;"><span>CY2022  282836000000
</span></span></code></pre>
 <p>编辑:不包括<code>frame</code>单元格中包含<code>'Q'</code>的行。</p>
 <details>
  <summary>英文:</summary>
  <p>If you want to ignore rows where <code>frame</code> is <code>NaN</code>, you can simply concat your filtered dfs and group by <code>frame</code> before applying <code>max</code>:</p>
  <pre tabindex="0" style="color:#f8f8f2;background-color:#272822;"><code><span style="display:flex;"><span>df1 <span style="color:#f92672">=</span> df1<span style="color:#f92672">.</span>dropna(subset<span style="color:#f92672">=&amp;</span><span style="color:#75715e">#39;frame&amp;#39;)[[&amp;#39;val&amp;#39;, &amp;#39;frame&amp;#39;]]</span>
</span></span><span style="display:flex;"><span>df2 <span style="color:#f92672">=</span> df2<span style="color:#f92672">.</span>dropna(subset<span style="color:#f92672">=&amp;</span><span style="color:#75715e">#39;frame&amp;#39;)[[&amp;#39;val&amp;#39;, &amp;#39;frame&amp;#39;]]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>pd<span style="color:#f92672">.</span>concat([
</span></span><span style="display:flex;"><span>    df1[<span style="color:#f92672">~</span>df1[<span style="color:#f92672">&amp;</span><span style="color:#75715e">#39;frame&amp;#39;].str.contains(&amp;#39;Q&amp;#39;)], </span>
</span></span><span style="display:flex;"><span>    df2[<span style="color:#f92672">~</span>df2[<span style="color:#f92672">&amp;</span><span style="color:#75715e">#39;frame&amp;#39;].str.contains(&amp;#39;Q&amp;#39;)]</span>
</span></span><span style="display:flex;"><span>])<span style="color:#f92672">.</span>groupby(<span style="color:#f92672">&amp;</span><span style="color:#75715e">#39;frame&amp;#39;).max()</span>
</span></span></code></pre>
  <p>Output:</p>
  <pre><code>                 val
frame
CY2007   39474000000
CY2009   43232000000
CY2010   57838000000
CY2011   66504000000
CY2013   66415000000
CY2014   66683000000
CY2015   63056000000
CY2016   90272000000
CY2017  110855000000
CY2018  136819000000
CY2019  161857000000
CY2020  182527000000
CY2021  257637000000
CY2022  282836000000
</code></pre>
  <p>Edit: excluding <code>frame</code> cells containing <code>&amp;#39;Q&amp;#39;</code>.</p>
 </details>
 <p></p>
</div>

```
赞(1)
未经允许不得转载:工具盒子 » 合并两个或多个表,使用Python检查唯一值。