What is the shortest way to drop partial duplicates from a list of tuples in Python without using Pandas?

Question {#heading}

I have a list of tuples where each tuple is structured like this: (Name, Age, City). The list holds at most about 30 tuples.

There are no exact duplicates. However, the Name and Age pair is sometimes repeated.

Example input would be something like this:

lst = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"), ("Frank", 56, "Berlin"),  ("Frank", 40, "Berlin")]

I would like to remove partial duplicates, where the subset to compare is Name and Age but not City. Ideally I'd like to keep the first occurrence of each duplicate. An example makes this easier to understand:

Expected output:

expected_lst = [("Dave", 20, "Dublin"), ("Lisa", 20, "Monaco"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]

Dave's and Lisa's duplicates were removed, but not Frank's, since the Age does not match.

What I have tried so far:

I checked these posts:

But they do not seem to match what I'm asking for, and I didn't manage to work out how to apply their solutions to my case.

I did find a solution that seems to work: convert my list to a pandas DataFrame, then drop duplicates using the drop_duplicates() function and its subset parameter:

import pandas as pd
df = pd.DataFrame(lst, columns=["Name", "Age", "City"]).drop_duplicates(subset=["Name", "Age"])

And then use itertuples to convert it back to a list:

expected_lst = list(df.itertuples(index=False, name=None))

However, I do not need pandas for any of the other steps of my code, so changing the type of my data just for this seems a bit "much".

I was therefore wondering whether there is a better way to get my expected output, one that is either quicker to run or shorter to write. I'm not an expert, but I assume that converting a list to a pandas DataFrame and then back to a list is not very efficient.

Answer 1 {#1}

Score: 2

You can use the tuple of the "unique" elements (name, age) as a dict key, where the value is the full tuple. Each name+age pair then appears only once.

A dict comprehension keeps the last value written for each key, so to keep the first entry you need to check whether (name, age) is already in temp before inserting it. Edit: or just reverse the list first, as MatBailie suggested.

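`data = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]
# Iterating in reverse means the first occurrence is written last and therefore wins.
temp = {(name, age): (name, age, city) for name, age, city in reversed(data)}
for unique_item in temp.values():
    print(unique_item)
`

Output:

('Frank', 40, 'Berlin')
('Frank', 56, 'Berlin')
('Lisa', 20, 'Monaco')
('Dave', 20, 'Dublin')

Each name+age pair keeps its first city, but the tuples come out in reverse input order. Use list(temp.values())[::-1] if you need the original order back.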
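The "check before inserting" variant mentioned above can be sketched as follows. This is only a minimal illustration, not code from the original answer: the function name `drop_partial_duplicates` and the `key_len` parameter are made up for the example, and it relies on dicts preserving insertion order (guaranteed since Python 3.7).

```python
def drop_partial_duplicates(rows, key_len=2):
    """Keep the first tuple seen for each key (the first key_len fields)."""
    first_seen = {}
    for row in rows:
        # setdefault stores the row only if this key has not been seen yet,
        # so later partial duplicates are ignored.
        first_seen.setdefault(row[:key_len], row)
    # Dict values come back in insertion order, i.e. the original input order.
    return list(first_seen.values())


data = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"),
        ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"),
        ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]

result = drop_partial_duplicates(data)
print(result)
# [('Dave', 20, 'Dublin'), ('Lisa', 20, 'Monaco'), ('Frank', 56, 'Berlin'), ('Frank', 40, 'Berlin')]
```

Compared with reversing the list, this keeps both the first occurrence and the original ordering in a single pass, at the cost of a few more lines.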

Answer 2 {#2}

Score: 0

You could make use of itertools.groupby, using the first 2 elements of your tuples as a key (you first need to sort the data, since groupby operates on consecutive entries):

from itertools import groupby
filtered_data = [next(g) for k,g in groupby(sorted(data), key=lambda tup: tup[:2])]

[('Dave', 20, 'Dublin'), ('Frank', 40, 'Berlin'), ('Frank', 56, 'Berlin'), ('Lisa', 20, 'London')]


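Of course, this only works if the initial order of the tuples doesn't matter to you: the result is sorted, and because the sort also compares City, the tuple kept for each group is the one with the alphabetically smallest city (London rather than Monaco for Lisa), not necessarily the first one in the input. Otherwise, @KennyOstrom's answer preserves the original order.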
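If the groupby approach is still preferred, one possible tweak (an assumption on my part, not something the answer proposes) is to sort on the name and age only. Python's sort is stable, so within each group the tuples stay in input order and `next(group)` returns the first occurrence. The groups themselves still come out sorted rather than in input order.

```python
from itertools import groupby

data = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"),
        ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"),
        ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]


def name_age(tup):
    # Grouping key: the first two fields (name, age).
    return tup[:2]


# Sorting by the same key that groupby uses keeps equal keys adjacent;
# stability preserves the input order inside each group.
filtered_data = [next(group) for _key, group in groupby(sorted(data, key=name_age), key=name_age)]
print(filtered_data)
# [('Dave', 20, 'Dublin'), ('Frank', 40, 'Berlin'), ('Frank', 56, 'Berlin'), ('Lisa', 20, 'Monaco')]
```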

