C++11 async is using only one core

Introduction

When std::async appears to use only one core, the root cause is usually the launch policy, task granularity, or the measurement method. C++ permits deferred execution unless you explicitly request an asynchronous launch. To get parallel CPU usage, you need both the correct policy and enough independent work.

Launch Policy Matters

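With the default policy (std::launch::async | std::launch::deferred), the implementation may choose deferred execution, where the work runs only when get or wait is called on the future, and it runs on the calling thread. That can make the code effectively serial. Pass std::launch::async to require execution on a separate thread.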

cpp

#include <future>
#include <iostream>

// CPU-bound busy work: sums i % 97 for i in [0, n).
long long work(int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += (i % 97);
    }
    return sum;
}

int main() {
    // std::launch::async forces each task onto its own thread.
    auto f1 = std::async(std::launch::async, work, 200000000);
    auto f2 = std::async(std::launch::async, work, 200000000);

    // Both tasks are already running; get() only waits for the results.
    long long total = f1.get() + f2.get();
    std::cout << total << "\n";
}

This guarantees asynchronous launch, but not unlimited scaling. Actual parallelism still depends on CPU cores, scheduler behavior, and task cost.

Task Size and Parallel Efficiency

Very small tasks can spend more time in scheduling overhead than useful computation. For CPU bound work, use fewer larger tasks sized to hardware concurrency.

cpp

#include <algorithm>  // std::max
#include <future>
#include <iostream>
#include <thread>
#include <vector>

// CPU-bound busy work over the half-open range [start, end).
long long chunkWork(int start, int end) {
    long long s = 0;
    for (int i = start; i < end; ++i) {
        s += (i % 101);
    }
    return s;
}

int main() {
    const int maxN = 800000000;
    // hardware_concurrency() may return 0 when the value is unknown.
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    int chunk = maxN / static_cast<int>(workers);

    std::vector<std::future<long long>> futures;
    futures.reserve(workers);

    // Launch one task per worker; the last one absorbs any remainder.
    for (unsigned w = 0; w < workers; ++w) {
        int begin = static_cast<int>(w) * chunk;
        int end = (w == workers - 1) ? maxN : begin + chunk;
        futures.push_back(std::async(std::launch::async, chunkWork, begin, end));
    }

    // Collect results only after every task has been started.
    long long total = 0;
    for (auto& f : futures) total += f.get();
    std::cout << total << "\n";
}

This pattern usually scales better than launching many tiny futures.

Measurement and Environment Checks

If CPU usage still looks single core, confirm that the build enables optimization, such as -O2 or -O3. Unoptimized builds distort timing, which makes speedup comparisons unreliable.

Also verify runtime environment. Container limits, CPU affinity, and power saving settings can restrict usable cores even with correct code.

Use a profiler or system monitor that shows per core utilization while workload runs long enough to measure. Very short tasks may finish before tools update usage graphs.

You can detect deferred futures explicitly by calling wait_for with zero timeout and checking status. If many futures report deferred, launch policy or implementation defaults are preventing parallel execution.

When to Use Thread Pools Instead

std::async is convenient but can be less predictable across standard library implementations. For high throughput services with many tasks, a dedicated thread pool often gives better control over queueing and worker count.

If your application repeatedly schedules thousands of jobs, benchmark a pool based design against raw std::async calls to reduce scheduling overhead and improve latency consistency.

Common Pitfalls

A common pitfall is relying on default launch policy and assuming guaranteed parallel execution. Always request std::launch::async when you need concurrency.

Another issue is calling get immediately after launching each future in a loop. That serializes work. Launch all futures first, then collect results.

Developers also benchmark debug builds and conclude concurrency is broken. Measure optimized builds in realistic environments.

Finally, oversubscribing CPU with too many worker tasks can reduce performance due to context switching. Match task count to core availability.

Summary

Use std::launch::async to force asynchronous execution.
Keep tasks large enough to amortize scheduling overhead.
Launch futures first, then join results for real parallelism.
Validate environment limits and profiling method.
Consider thread pools for heavy repetitive task scheduling in production.