BeautifulSoup getting href

Introduction

When scraping HTML with BeautifulSoup, extracting links usually means finding anchor tags and reading their href attributes. The two key points are to select only tags that actually have href, and to decide whether you want raw attribute values, absolute URLs, or only certain kinds of links.

The Basic Pattern

The most direct BeautifulSoup pattern is:

python
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <a href="/about">About</a>
    <a href="https://example.com/contact">Contact</a>
    <a>No link here</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that carry an href attribute
for tag in soup.find_all("a", href=True):
    print(tag["href"])

Using href=True filters out anchor tags that do not have the attribute.
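
To make the effect of the filter concrete, the following sketch compares an unfiltered search with the filtered one, and with the CSS selector a[href], which should select the same tags. The sample markup and variable names are made up for illustration. Note that an empty href="" still counts as having the attribute.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one normal link, one empty href, one anchor without href
html = """
<a href="/about">About</a>
<a href="">Empty</a>
<a name="anchor">No link here</a>
"""

soup = BeautifulSoup(html, "html.parser")

all_anchors = soup.find_all("a")             # every <a>, with or without href
with_href = soup.find_all("a", href=True)    # only <a> tags that have href
via_css = soup.select("a[href]")             # CSS attribute selector

hrefs_filtered = [tag["href"] for tag in with_href]
hrefs_css = [tag["href"] for tag in via_css]

print(len(all_anchors), hrefs_filtered, hrefs_css)
```

If empty values are not wanted, they can be dropped afterwards with a simple truthiness check on each value.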

tag.get("href") Versus tag["href"]

Both patterns are common, but they behave slightly differently.

python
href = tag.get("href")

This returns None if the attribute is missing.

python
href = tag["href"]

This raises KeyError if the attribute is missing.

That is why many scrapers prefer find_all("a", href=True) together with tag["href"], or use tag.get("href") when the HTML is unreliable.
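
As a small illustration of the difference, the sketch below applies both access styles to an anchor that has no href. The one-tag document is made up for the example; tag.get also accepts a default value as its second argument.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>No link here</a>", "html.parser")
tag = soup.find("a")

# .get() returns None (or a supplied default) when the attribute is absent
missing = tag.get("href")
fallback = tag.get("href", "")

# Indexing raises KeyError for the same tag
try:
    tag["href"]
    raised = False
except KeyError:
    raised = True

print(missing, repr(fallback), raised)
```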

Here is a simple reusable helper.

python
from bs4 import BeautifulSoup

def extract_hrefs(html: str) -> list[str]:
    """Return the href value of every anchor tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get("href") for tag in soup.find_all("a", href=True)]

html = """
<a href="/a">A</a>
<a href="/b">B</a>
"""

print(extract_hrefs(html))

This keeps the scraping logic short and easy to test.

Converting Relative URLs to Absolute URLs

Many pages contain relative links. If you want usable full URLs, combine BeautifulSoup with urllib.parse.urljoin.

python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://example.com"
html = '<a href="/docs/start">Docs</a>'

soup = BeautifulSoup(html, "html.parser")
# urljoin resolves each href against the base URL
links = [urljoin(base_url, tag["href"]) for tag in soup.find_all("a", href=True)]
print(links)

Without this step, a relative link such as /docs/start cannot be requested on its own, because it has no scheme or host.
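
How a link resolves depends on its form, so it can help to see several cases side by side. The sketch below uses only the standard library; page_url and the sample hrefs are hypothetical. It assumes the base passed to urljoin is the URL of the page the links came from rather than just the site root, since document-relative links such as intro.html resolve against the page's path.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that was scraped
page_url = "https://example.com/docs/guide/index.html"

examples = [
    "/docs/start",                # root-relative
    "intro.html",                 # relative to the current directory
    "../api/",                    # parent directory
    "https://other.example/x",    # already absolute, left unchanged
    "//cdn.example.com/lib.js",   # scheme-relative, inherits https
]

resolved = {href: urljoin(page_url, href) for href in examples}

for href, url in resolved.items():
    print(f"{href!r:30} -> {url}")
```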

Filtering Unwanted Links

Real pages often contain mailto:, fragment links, JavaScript pseudo-links, and tracking URLs. Filtering them early makes the result cleaner.

python
from bs4 import BeautifulSoup

html = """
<a href="#top">Top</a>
<a href="mailto:user@example.com">Mail</a>
<a href="/products">Products</a>
"""

soup = BeautifulSoup(html, "html.parser")
links = []
for tag in soup.find_all("a", href=True):
    href = tag["href"]
    # Skip in-page fragments and mail links
    if href.startswith("#") or href.startswith("mailto:"):
        continue
    links.append(href)

print(links)

This is especially useful when you are building a crawler rather than just collecting every attribute value mechanically.
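
Prefix checks grow quickly as more link types appear. One possible alternative, sketched here under the assumption that a crawler only wants http, https, and scheme-less relative links, is to inspect the scheme with urllib.parse.urlparse and keep an allowlist. The helper name is_crawlable, the ALLOWED_SCHEMES set, and the sample hrefs are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: only web links and scheme-less (relative) links are of interest
ALLOWED_SCHEMES = {"", "http", "https"}


def is_crawlable(href: str) -> bool:
    """Return True for links a crawler could follow over HTTP(S)."""
    href = href.strip()
    # Empty values and pure in-page fragments point nowhere new
    if not href or href.startswith("#"):
        return False
    return urlparse(href).scheme.lower() in ALLOWED_SCHEMES


hrefs = [
    "#top",
    "mailto:team@example.com",
    "javascript:void(0)",
    "/products",
    "https://example.com/a",
    "",
]
kept = [href for href in hrefs if is_crawlable(href)]
print(kept)
```

An allowlist fails closed: an unfamiliar scheme is skipped by default instead of slipping through because nobody added a prefix check for it.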

Handling Duplicates

Pages often repeat the same link several times. If you only need unique results and order does not matter, a set is enough. To keep first-seen order, dict.fromkeys works because dictionaries preserve insertion order.

python
from bs4 import BeautifulSoup

html = """
<a href="/about">About</a>
<a href="/about">About again</a>
<a href="/contact">Contact</a>
"""

soup = BeautifulSoup(html, "html.parser")
links = [tag["href"] for tag in soup.find_all("a", href=True)]
# dict.fromkeys drops repeats while keeping the first occurrence of each link
unique_links = list(dict.fromkeys(links))
print(unique_links)

This preserves the first-seen order while removing duplicates.
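
Exact string comparison treats /about and /about#team as different links even though they point to the same page. Whether that matters depends on the scraper. The sketch below is one way to write an order-preserving helper with an optional fragment-stripping step using urllib.parse.urldefrag; the function name and the strip_fragments parameter are hypothetical.

```python
from urllib.parse import urldefrag


def unique_in_order(hrefs, strip_fragments=False):
    """Remove duplicates while keeping first-seen order.

    With strip_fragments=True, links that differ only by a #fragment
    are treated as the same link and returned without the fragment.
    """
    seen = set()
    result = []
    for href in hrefs:
        key = urldefrag(href).url if strip_fragments else href
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


hrefs = ["/about", "/about#team", "/contact", "/about"]
exact = unique_in_order(hrefs)
merged = unique_in_order(hrefs, strip_fragments=True)
print(exact)
print(merged)
```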

Requests and Parsing Together

In real scraping code, BeautifulSoup is often paired with requests.

python
import requests
from bs4 import BeautifulSoup

# Fetch the page and fail loudly on HTTP errors
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
links = [tag["href"] for tag in soup.find_all("a", href=True)]
print(links[:5])

This is the standard workflow for many lightweight scrapers.
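
The pieces discussed so far can be combined into one small function. The following is a sketch rather than a definitive implementation: collect_links and fetch_links are hypothetical names, and the parsing step is kept separate from the download so it can be tested on a plain string. It assumes response.url (the final URL after any redirects) is the right base for resolving relative links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_links(html: str, page_url: str) -> list[str]:
    """Return unique absolute URLs found in html, in first-seen order."""
    soup = BeautifulSoup(html, "html.parser")
    absolute = (
        urljoin(page_url, tag["href"])
        for tag in soup.find_all("a", href=True)
    )
    return list(dict.fromkeys(absolute))


def fetch_links(url: str, timeout: float = 10) -> list[str]:
    """Download a page and extract its links."""
    # Imported here so collect_links can be used without requests installed
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    # Resolve relative links against the URL that was actually served
    return collect_links(response.text, response.url)


# Offline check on a static snippet; no network access is needed
sample = '<a href="/a">A</a><a href="b.html">B</a><a href="/a">Again</a><a>No href</a>'
links = collect_links(sample, "https://example.com/dir/page.html")
print(links)
```

Calling fetch_links("https://example.com") would perform the download as well; it is not invoked here so that the example runs without network access.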

Common Pitfalls

The biggest pitfall is calling find_all("a") and then indexing tag["href"] without checking whether the attribute exists. Some anchor tags do not actually contain links.

Another issue is forgetting about relative URLs. A scraper may appear to work while quietly collecting unusable partial paths.

It is also easy to overlook duplicate links, which can make the result list much larger than the number of distinct pages it actually points to.

Finally, do not assume static HTML always contains the links you see in the browser. Some pages render links with JavaScript, which BeautifulSoup alone will not execute.

Summary

  • Use soup.find_all("a", href=True) to extract anchor tags that actually have href.
  • Choose tag.get("href") or tag["href"] based on how defensive you want to be.
  • Convert relative links with urljoin when you need full URLs.
  • Filter fragments, mail links, and duplicates when your scraper needs cleaner output.
  • BeautifulSoup parses HTML well, but it does not execute JavaScript-rendered pages by itself.
