Finding out the duplicate element in an array

duplicate element

array

data structures

programming

algorithm

Finding out the duplicate element in an array

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Duplicate detection in arrays appears in interviews, data cleaning pipelines, fraud checks, and import validation jobs. There is no single best algorithm because constraints differ across use cases. The right approach depends on whether you need one duplicate or all duplicates, whether input can be modified, and how much memory you can spend.

Clarify the Problem Contract First

Before coding, define expected behavior:

Should output include each duplicate once or include repeated occurrences.
Can the original array be mutated.
Are values bounded, such as range 1 through n.
Is data streamed or fully in memory.

Those decisions determine viable algorithms and prevent incorrect optimization.

Set-Based Linear Scan

For most practical code, a set-based scan is simplest and fast.

python

1def find_duplicates(nums):
2    seen = set()
3    duplicates = set()
4
5    for value in nums:
6        if value in seen:
7            duplicates.add(value)
8        else:
9            seen.add(value)
10
11    return sorted(duplicates)
12
13print(find_duplicates([4, 1, 2, 1, 3, 2, 2, 5]))

Complexity:

Time: O(n) average.
Extra memory: O(n).

This is usually the best default when memory is acceptable.

Frequency Map When Counts Matter

If you need duplicate counts, use a dictionary or Counter.

python

1from collections import Counter
2
3
4def duplicate_counts(nums):
5    counts = Counter(nums)
6    return dict((k, v) for k, v in counts.items() if v > 1)
7
8print(duplicate_counts([1, 1, 2, 2, 2, 3, 4, 4]))

This gives both which values are duplicated and how many times each appears.

Sorting-Based Method for Lower Extra Memory

If mutating input is allowed and extra memory is limited, sort then compare neighbors.

python

1def find_duplicates_sorted(nums):
2    nums.sort()  # modifies input
3    out = []
4
5    for i in range(1, len(nums)):
6        if nums[i] == nums[i - 1]:
7            if not out or out[-1] != nums[i]:
8                out.append(nums[i])
9
10    return out
11
12arr = [7, 3, 1, 7, 2, 2, 9, 3]
13print(find_duplicates_sorted(arr))
14print(arr)

Complexity:

Time: O(n log n).
Extra memory: low, depending on sorting implementation.

Special Case: Exactly One Duplicate in Range

Some interview problems guarantee values from 1 through n with one duplicated value. In that specific case, Floyd cycle detection gives constant extra space.

python

1def find_one_duplicate_floyd(nums):
2    slow = nums[0]
3    fast = nums[0]
4
5    while True:
6        slow = nums[slow]
7        fast = nums[nums[fast]]
8        if slow == fast:
9            break
10
11    slow = nums[0]
12    while slow != fast:
13        slow = nums[slow]
14        fast = nums[fast]
15
16    return slow
17
18print(find_one_duplicate_floyd([1, 3, 4, 2, 2]))

Do not use this method for arbitrary arrays. It depends on strict constraints.

Large-Scale and Streaming Data

For very large datasets, full in-memory sets may be too expensive.

Practical options:

Chunk processing with on-disk key-value storage.
Database grouping with indexed columns.
Probabilistic prefiltering with Bloom filters.

Bloom filters help detect likely duplicates early, but they can produce false positives, so you still need exact confirmation logic.

Choosing an Approach in Real Projects

A simple decision guide:

Need exact duplicates quickly and memory is acceptable: set-based scan.
Need counts for reporting: frequency map.
Need lower auxiliary memory and can mutate: sort then scan.
Strict one-duplicate constrained problem: Floyd method.
Data too large for memory: chunking or external storage.

Pick the approach that matches your data contract, not the one that looks most clever.

Common Pitfalls

Using a specialized algorithm without verifying input constraints. Fix: validate assumptions before method selection.
Returning the same duplicate multiple times. Fix: track duplicates in a dedicated set or check last appended value.
Mutating arrays unexpectedly with in-place sort. Fix: document mutation or sort a copied array.
Ignoring non-integer or null values in mixed datasets. Fix: add input validation and explicit type handling.
Optimizing for asymptotic complexity while ignoring memory limits. Fix: evaluate both time and space for production-size inputs.

Summary

Duplicate detection has multiple correct solutions based on constraints.
Set-based scans are the best default for exact linear detection.
Frequency maps are ideal when counts are required.
Sorting trades extra memory for higher time complexity.
Specialized constant-space methods only work under strict problem contracts.