duplicate detection
array algorithm
programming
coding
data structures

Algorithm to find duplicate in an array

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Finding duplicates in an array is a common interview problem and a common production problem, but the right algorithm depends on what “find duplicate” actually means. You might only need to know whether any duplicate exists, you might need the repeated values themselves, or you might be under special constraints such as constant extra space. The best solution changes with those requirements.

In ordinary application code, a hash set is usually the cleanest answer. Under stricter interview-style constraints, sorting or Floyd's cycle-detection trick can also be appropriate.

Use a Set for the Clearest General Solution

If you want to detect duplicates in an arbitrary array, a set gives you a fast and easy approach. Walk through the array, remember what you have seen, and report anything that appears twice.

python
1def find_duplicates(values):
2    seen = set()
3    duplicates = set()
4
5    for value in values:
6        if value in seen:
7            duplicates.add(value)
8        else:
9            seen.add(value)
10
11    return sorted(duplicates)
12
13print(find_duplicates([4, 2, 7, 2, 9, 4, 4]))

This solution is usually O(n) time with O(n) extra space. That is a very good tradeoff for most real programs because the code is short, readable, and correct.

If you only need a yes-or-no answer, you can return early instead of collecting all duplicates.

python
1def has_duplicate(values):
2    seen = set()
3    for value in values:
4        if value in seen:
5            return True
6        seen.add(value)
7    return False

Sort First When Extra Memory Must Stay Small

If you want to reduce extra memory, sorting the array lets you detect duplicates by comparing neighbors.

python
1def find_duplicates_sorted(values):
2    items = sorted(values)
3    duplicates = []
4
5    for i in range(1, len(items)):
6        if items[i] == items[i - 1] and (not duplicates or duplicates[-1] != items[i]):
7            duplicates.append(items[i])
8
9    return duplicates
10
11print(find_duplicates_sorted([4, 2, 7, 2, 9, 4, 4]))

This usually runs in O(n log n) time because sorting dominates the cost. The advantage is that you do not need a hash-based memory structure for every value.

Be careful with in-place sorting. If the original order matters, sort a copy instead of mutating the caller's array.

Use Floyd's Algorithm Only Under Specific Constraints

Some versions of the problem say the array has length n + 1, every value is in the range 1..n, and there is guaranteed to be at least one duplicate. Under those exact constraints, you can model the array as a linked structure and use Floyd's cycle detection.

python
1def find_duplicate_number(nums):
2    slow = nums[0]
3    fast = nums[0]
4
5    while True:
6        slow = nums[slow]
7        fast = nums[nums[fast]]
8        if slow == fast:
9            break
10
11    slow = nums[0]
12    while slow != fast:
13        slow = nums[slow]
14        fast = nums[fast]
15
16    return slow
17
18print(find_duplicate_number([1, 3, 4, 2, 2]))

This is clever and uses O(1) extra space, but it only works because the input follows strict rules. It is not a general duplicate-finding method for arbitrary arrays.

Choose the Algorithm by the Actual Question

Before you code, answer these questions:

  • do you need one duplicate or all duplicates
  • may the input be modified
  • is extra memory allowed
  • are the values constrained to a special range

Most confusion disappears once those requirements are explicit. Developers often over-engineer the problem because they reach for a clever algorithm before understanding the input contract.

Common Pitfalls

The most common mistake is using Floyd's algorithm on a general array where the required value-range guarantees do not hold. In that case, the reasoning breaks.

Another issue is sorting the input in place when other code still depends on the original order.

A third problem is collecting duplicates in a list without deduplicating the duplicates themselves. An item that repeats three times should usually still be reported once.

Summary

  • A hash set is the clearest general-purpose solution for duplicate detection.
  • Sorting is a good alternative when you want lower auxiliary memory and can accept O(n log n) time.
  • Floyd's cycle detection is powerful but only for special constrained inputs.
  • Decide whether you need existence, one duplicate, or all duplicates before choosing the algorithm.
  • Be explicit about whether the input array may be modified.

Course illustration
Course illustration

All Rights Reserved.