Guide: removing subsequent duplicates in Python

Learn several methods to remove duplicate values from a Python list

In today's article, you will learn several different ways to remove duplicate values from a Python list. We will consider two scenarios:

The given list is in sorted order
The given list is not sorted

Let's get started!

Table of Contents:
· Removing Duplicates From a Sorted List
· Removing Duplicates From an Unsorted List
  ∘ Using for loop
  ∘ Using set
  ∘ Using OrderedDict
  ∘ Using Numpy
  ∘ Using Pandas

Removing Duplicates From a Sorted List

A sorted list gives us an advantage: duplicates always appear next to each other, so we only need to compare each value with its neighbor. Consider the list below:

lst = [1, 1, 2, 3, 4, 4, 4, 4, 5, 5, 7, 11, 11, 11, 21, 21]

To remove duplicates from this list, we will loop through the entire list, compare the elements next to each other, and store the unique elements in another list.
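The original code for this step was not preserved, so the following is a minimal sketch of the idea described above. The function name `dedupe_sorted` is an assumption introduced for illustration.

```python
lst = [1, 1, 2, 3, 4, 4, 4, 4, 5, 5, 7, 11, 11, 11, 21, 21]


def dedupe_sorted(values):
    """Return a new list with adjacent duplicates removed.

    Assumes `values` is sorted, so equal items sit next to each other.
    """
    unique = []
    for value in values:
        # Keep the value only if it differs from the last one we kept.
        if not unique or value != unique[-1]:
            unique.append(value)
    return unique


result = dedupe_sorted(lst)
print(result)
```

Each element is compared only with the most recently kept value, so the list is traversed once.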

If you run the code, you will see duplicate values were removed.

Output:
[1, 2, 3, 4, 5, 7, 11, 21]

Removing Duplicates From an Unsorted List

If the list is not sorted, we can't compare neighboring values, because duplicates may appear anywhere in the list. In that case, we can use several methods:

Using for loop
Using set
Using OrderedDict
Using Numpy
Using Pandas

Using for loop

The approach is to insert elements into a temporary list one by one. Before inserting a value, we check whether it is already in the temporary list; if it is, we skip it.
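A minimal sketch of this approach follows. The input list is a hypothetical sample (the article's original unsorted list was not preserved), chosen so that the first occurrences appear in the order shown in the output below.

```python
# Hypothetical unsorted sample input; not taken from the original article.
lst = [5, 1, 3, 1, 2, 21, 4, 5, 7, 11, 4, 21, 11, 3]

unique = []
for value in lst:
    # Insert the value only if we have not stored it already.
    if value not in unique:
        unique.append(value)

print(unique)
```

The membership test scans the temporary list each time, so this is quadratic in the worst case, but it preserves the order of first appearance.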

Output:
[5, 1, 3, 2, 21, 4, 7, 11]

Using list comprehension, we can do it using fewer lines of code:
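The original comprehension was not preserved. One possible version, sketched here on the same hypothetical sample list, keeps an element only if it does not occur earlier in the list:

```python
# Hypothetical unsorted sample input; not taken from the original article.
lst = [5, 1, 3, 1, 2, 21, 4, 5, 7, 11, 4, 21, 11, 3]

# Keep lst[i] only when the same value is absent from everything before it.
unique = [value for i, value in enumerate(lst) if value not in lst[:i]]

print(unique)
```

This has no side effects inside the comprehension, at the cost of slicing the list for every element.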

Using set

Sets in Python only contain unique values: adding a value that is already present has no effect. So the trick is to copy the given list into a set, which discards the duplicates, and then copy the set's values back into a list. Note that a set does not preserve the original order of the list.
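A minimal sketch of the two-step version, using a hypothetical sample list. Because set iteration order is not guaranteed, the example sorts the result before printing it.

```python
# Hypothetical unsorted sample input; not taken from the original article.
lst = [5, 1, 3, 1, 2, 21, 4, 5, 7, 11, 4, 21, 11, 3]

unique_values = set(lst)      # duplicates are discarded here
unique = list(unique_values)  # back to a list, in no guaranteed order

print(sorted(unique))
```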

Output:
[1, 2, 3, 4, 5, 7, 11, 21]

Or we can do it in one line.
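The one-line form presumably nests the two conversions. A sketch on the same hypothetical sample list:

```python
# Hypothetical unsorted sample input; not taken from the original article.
lst = [5, 1, 3, 1, 2, 21, 4, 5, 7, 11, 4, 21, 11, 3]

unique = list(set(lst))  # order of the result is not guaranteed
```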

Using OrderedDict

We can remove duplicates from the given list by using OrderedDict from collections. We need to import it first.
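A minimal sketch, assuming the original code used `OrderedDict.fromkeys` (the usual way to do this). The sample list is hypothetical.

```python
from collections import OrderedDict

# Hypothetical unsorted sample input; not taken from the original article.
lst = [5, 1, 3, 1, 2, 21, 4, 5, 7, 11, 4, 21, 11, 3]

# Dictionary keys are unique, and OrderedDict remembers insertion order,
# so the keys are the distinct values in order of first appearance.
unique = list(OrderedDict.fromkeys(lst))

print(unique)
```

On Python 3.7 and later, a plain `dict.fromkeys(lst)` behaves the same way, because regular dictionaries also preserve insertion order.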

Output:
[5, 1, 3, 2, 21, 4, 7, 11]

Using Numpy

NumPy provides a unique() function. It returns the sorted unique values of its input, so we can use it to remove duplicates from a list.
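A minimal sketch on a hypothetical sample list. `np.unique` returns a NumPy array, so the example converts it back to a Python list with `tolist()`.

```python
import numpy as np

# Hypothetical unsorted sample input; not taken from the original article.
lst = [5, 1, 3, 1, 2, 21, 4, 5, 7, 11, 4, 21, 11, 3]

# np.unique returns the distinct values in sorted order.
unique = np.unique(lst).tolist()

print(unique)
```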

Output:
[1, 2, 3, 4, 5, 7, 11, 21]

Using Pandas

Pandas also has a unique() method that can help us remove duplicates.
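A minimal sketch on a hypothetical sample list. It wraps the list in a `Series` first and calls `Series.unique()`, which is one of several ways to reach this functionality and may differ from the original code.

```python
import pandas as pd

# Hypothetical unsorted sample input; not taken from the original article.
lst = [5, 1, 3, 1, 2, 21, 4, 5, 7, 11, 4, 21, 11, 3]

# Series.unique() returns the distinct values in order of first appearance.
unique = pd.Series(lst).unique().tolist()

print(unique)
```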

Output:
[5, 1, 3, 2, 21, 4, 7, 11]

In today's article, I discussed different ways to remove duplicates from a Python list. If you look closely, you will see that some methods preserve the order of the original list and others change it. Take set, for example: if you use a set to remove duplicates from a list, the original order may be lost. So you need to decide which method suits your needs.

And that’s it for today. I hope you find it helpful. Thanks for reading.

I have a df that looks like the following:

event_name   |user_id|time_event             |time_install
ProfileScreen|1111   |2021-05-01 11:31:00.679|2021-05-01 11:31:00.679
ProfileScreen|1111   |2021-05-01 11:35:22.273|2021-05-01 11:31:00.679 <--- Delete
WalletScreen |1111   |2021-05-01 11:37:00.329|2021-05-01 11:31:00.679
ProfileScreen|1111   |2021-05-01 11:38:24.456|2021-05-01 11:31:00.679
HomeScreen   |1111   |2021-05-01 11:38:00.679|2021-05-01 11:38:00.679
ProfileScreen|1111   |2021-05-01 11:39:22.273|2021-05-01 11:38:00.679
WalletScreen |1111   |2021-05-01 11:40:00.329|2021-05-01 11:38:00.679
WalletScreen |1111   |2021-05-01 11:41:24.456|2021-05-01 11:38:00.679 <--- Delete
ProfileScreen|2222   |2021-05-03 11:31:00.679|2021-05-03 11:31:00.679
WalletScreen |2222   |2021-05-03 11:35:22.273|2021-05-03 11:31:00.679
HomeScreen   |2222   |2021-05-03 11:37:00.329|2021-05-03 11:31:00.679
ProfileScreen|2222   |2021-05-03 11:37:30.456|2021-05-03 11:31:00.679
ProfileScreen|2222   |2021-05-03 11:38:00.679|2021-05-03 11:38:00.679
ProfileScreen|2222   |2021-05-03 11:39:22.273|2021-05-03 11:38:00.679 <--- Delete
ProfileScreen|2222   |2021-05-03 11:39:42.543|2021-05-03 11:38:00.679 <--- Delete
WalletScreen |2222   |2021-05-03 11:40:00.329|2021-05-03 11:38:00.679
ProfileScreen|2222   |2021-05-03 11:41:24.456|2021-05-03 11:38:00.679

Sorted by time event ascending, I'd like to delete any back-to-back (w/r/t time_event) repeat occurrences where the screen, user_id, and time_install are the same.

asked May 9, 2021 at 3:45

To keep the earliest time_event, you can first sort the df by time_event and then pass keep='first' to drop_duplicates().

To sort, you can use .sort_values(...)

And to drop and keep the earliest, you can use

.drop_duplicates(subset=['event_name', 'user_id', 'time_install'], inplace=True, keep='first')
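One caveat worth checking: drop_duplicates() removes every later repeat of a key combination, not only back-to-back ones. The sketch below is not part of the original answer. It uses the first four rows of the question's data and compares each row with the previous one via shift(), which is one possible way to drop only consecutive repeats. The variable names are assumptions for illustration.

```python
import pandas as pd

# First four rows of the question's sample data.
df = pd.DataFrame({
    "event_name": ["ProfileScreen", "ProfileScreen", "WalletScreen", "ProfileScreen"],
    "user_id": [1111, 1111, 1111, 1111],
    "time_event": pd.to_datetime([
        "2021-05-01 11:31:00.679", "2021-05-01 11:35:22.273",
        "2021-05-01 11:37:00.329", "2021-05-01 11:38:24.456",
    ]),
    "time_install": pd.to_datetime(["2021-05-01 11:31:00.679"] * 4),
})

keys = ["event_name", "user_id", "time_install"]
df = df.sort_values("time_event").reset_index(drop=True)

# A row is a back-to-back repeat when all key columns equal the previous row's.
is_repeat = (df[keys] == df[keys].shift()).all(axis=1)
consecutive_deduped = df[~is_repeat]

# For comparison: drop_duplicates also removes the later, non-adjacent repeat.
all_deduped = df.drop_duplicates(subset=keys, keep="first")

print(consecutive_deduped["event_name"].tolist())
print(all_deduped["event_name"].tolist())
```

Here the shift-based filter keeps the final ProfileScreen row because a WalletScreen event separates it from the earlier one, while drop_duplicates() discards it.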

answered May 9, 2021 at 4:04

Shubham Periwal