What is the shortest way to drop partial duplicates from a list of tuples in Python without using Pandas?
Question {#heading}
I have a list of tuples where each tuple is structured like this: (Name, Age, City). There are at most about 30 tuples in my list.
There are no exact duplicates. However, Name and Age are sometimes duplicated together.
Example input would be something like this:
lst = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]
I would like to remove partial duplicates, where the subset that defines a duplicate is Name and Age but not City. Ideally I'd like to keep the first occurrence. I guess an example would make it easier to understand:
Expected output:
expected_lst = [("Dave", 20, "Dublin"), ("Lisa", 20, "Monaco"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]
Dave's and Lisa's duplicates were removed, but Frank's were not, since the Age does not match.
What I have tried so far:
I checked these posts:
- https://stackoverflow.com/questions/64779544/python-remove-partial-duplicates-from-a-list
- https://stackoverflow.com/questions/51380270/removing-elements-that-have-consecutive-partial-duplicates-in-python
- https://stackoverflow.com/questions/64122700/efficiently-remove-partial-duplicates-in-a-list-of-tuples
But they do not seem to match what I'm asking for, and I didn't manage to understand how to apply the solutions to my case.
I did find a solution that seems to work, which is to convert my list to a pandas DataFrame and then drop duplicates using the drop_duplicates() function and its subset parameter:
import pandas as pd
df = pd.DataFrame(lst, columns=["Name", "Age", "City"]).drop_duplicates(subset=["Name", "Age"])
And then using itertuples to convert it back to a list:
expected_lst = list(df.itertuples(index=False, name=None))
However, I do not need pandas for any of the other steps of my code. Changing the type of my data seems a bit "much".
I was therefore wondering whether there is a better way to get my expected output, one that would be either quicker or shorter to write? I'm not an expert, but I assume that converting a list to a pandas DataFrame and then back to a list is not very efficient?
Answer 1 {#1}
Score: 2
You can use the tuple of the "unique" elements (name, age) as the dict key, with the full tuple as the value. Because dict keys are unique, each name+age combination appears only once.
To keep the first entry for each key, check whether (name, age) is already in temp before inserting it. Edit: alternatively, iterate over the reversed list, as MatBailie suggested, so that the earliest entry is written last and wins.
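data = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"), ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"), ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]
# Iterating in reverse means the earliest tuple for each (name, age) key is written last and overwrites the later ones.
temp = {(name, age): (name, age, city) for name, age, city in reversed(data)}
# The keys were inserted in reverse order, so reverse the values again to restore the input order.
for unique_item in reversed(list(temp.values())):
    print(unique_item)
Output:
('Dave', 20, 'Dublin')
('Lisa', 20, 'Monaco')
('Frank', 56, 'Berlin')
('Frank', 40, 'Berlin')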
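The explicit check described in this answer (only store a tuple when its (name, age) key is not in the dict yet) could be sketched as follows. This is a minimal sketch rather than code from the original answer: the function name drop_partial_duplicates and the key_len parameter are made up for illustration, and it relies on dicts preserving insertion order (Python 3.7+).

```python
def drop_partial_duplicates(rows, key_len=2):
    """Keep the first tuple seen for each distinct prefix of length key_len.

    Hypothetical helper for illustration; not taken from the original answer.
    """
    first_seen = {}
    for row in rows:
        # setdefault stores the row only if the key is not present yet,
        # so later tuples with the same (name, age) are ignored.
        first_seen.setdefault(row[:key_len], row)
    # Dicts keep insertion order, so the input order is preserved.
    return list(first_seen.values())


lst = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"),
       ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"),
       ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]

result = drop_partial_duplicates(lst)
print(result)
```

No reversing is needed with this variant, because the first tuple for each key is the one that gets stored.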
Answer 2 {#2}
Score: 0
You could make use of itertools.groupby, using the first 2 elements of your tuples as a key (you first need to sort the data, since groupby operates on consecutive entries):
from itertools import groupby
filtered_data = [next(g) for k,g in groupby(sorted(data), key=lambda tup: tup[:2])]
[('Dave', 20, 'Dublin'), ('Frank', 40, 'Berlin'), ('Frank', 56, 'Berlin'), ('Lisa', 20, 'London')]
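Of course, this only works if the initial order of tuples doesn't matter to you: the result comes back sorted, and because sorted(data) compares whole tuples, the tuple kept for each group is the one with the alphabetically smallest city rather than the first one in the input (for Lisa, London rather than Monaco). Otherwise, @KennyOstrom's answer preserves the original order.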
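If the groupby approach is still preferred, one possible variant is to sort on the same (name, age) key that is used for grouping. Python's sort is stable, so tuples with equal keys stay in their input order and next(g) then returns the first occurrence from the original list. This is a sketch under that assumption, not part of the original answer; the helper name name_age is made up here, and the result is still ordered by key rather than by input position.

```python
from itertools import groupby

data = [("Dave", 20, "Dublin"), ("Dave", 20, "Paris"),
        ("Lisa", 20, "Monaco"), ("Lisa", 20, "London"),
        ("Frank", 56, "Berlin"), ("Frank", 40, "Berlin")]


def name_age(tup):
    """Grouping key: the first two fields (name, age)."""
    return tup[:2]


# sorted() is stable: tuples sharing a key keep their relative input order,
# so the first element of each group is the first occurrence in data.
filtered_data = [next(group) for _, group in groupby(sorted(data, key=name_age), key=name_age)]
print(filtered_data)
```

With the sample data this keeps ('Lisa', 20, 'Monaco') instead of ('Lisa', 20, 'London').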