BeautifulSoup getting href

Introduction

When scraping HTML with BeautifulSoup, extracting links usually means finding anchor tags and reading their href attributes. The two key points are to select only tags that actually have href, and to decide whether you want raw attribute values, absolute URLs, or only certain kinds of links.

The Basic Pattern

The most direct BeautifulSoup pattern is:

python

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <a href="/about">About</a>
    <a href="https://example.com/contact">Contact</a>
    <a>No link here</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that carry an href attribute
for tag in soup.find_all("a", href=True):
    print(tag["href"])

Using href=True filters out anchor tags that do not have the attribute.
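
To make the effect of the filter concrete, here is a minimal comparison on a shortened version of the snippet above. The CSS selector form `a[href]` is included as an equivalent alternative; it is an assumption of this sketch, not something the pattern above requires.

```python
from bs4 import BeautifulSoup

html = """
<a href="/about">About</a>
<a href="https://example.com/contact">Contact</a>
<a>No link here</a>
"""

soup = BeautifulSoup(html, "html.parser")

all_anchors = soup.find_all("a")            # every <a>, with or without href
with_href = soup.find_all("a", href=True)   # only <a> tags that have an href
via_selector = soup.select("a[href]")       # CSS attribute selector, same idea

print(len(all_anchors), len(with_href), len(via_selector))
print([tag["href"] for tag in with_href])
```

The third anchor is counted by the unfiltered call only, which is exactly the tag that would break a later `tag["href"]` lookup.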

`tag.get("href")` Versus `tag["href"]`

Both patterns are common, but they behave slightly differently.

python

href = tag.get("href")

This returns None if the attribute is missing.

python

href = tag["href"]

This raises KeyError if the attribute is missing.

That is why many scrapers prefer find_all("a", href=True) together with tag["href"], or use tag.get("href") when the HTML is unreliable.
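
The difference is easiest to see on an anchor that has no href at all. The following is a small sketch; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>No link here</a>", "html.parser")
tag = soup.find("a")

# .get() returns None, or a default you supply, when the attribute is absent.
missing = tag.get("href")
fallback = tag.get("href", "")

# Indexing the same tag raises KeyError instead.
try:
    tag["href"]
    raised = False
except KeyError:
    raised = True

print(missing, repr(fallback), raised)
```

Passing a default to `get` can be convenient when later code expects a string rather than None.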

Collecting All Links Into a List

Here is a simple reusable helper.

python

from bs4 import BeautifulSoup

def extract_hrefs(html: str) -> list[str]:
    """Return the href value of every anchor tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True guarantees the attribute exists, so indexing is safe here
    return [tag["href"] for tag in soup.find_all("a", href=True)]

html = """
<a href="/a">A</a>
<a href="/b">B</a>
"""

print(extract_hrefs(html))

This keeps the scraping logic short and easy to test.

Converting Relative URLs to Absolute URLs

Many pages contain relative links. If you want usable full URLs, combine BeautifulSoup with urllib.parse.urljoin.

python

from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://example.com"
html = '<a href="/docs/start">Docs</a>'

soup = BeautifulSoup(html, "html.parser")
# urljoin resolves each href against the base URL
links = [urljoin(base_url, tag["href"]) for tag in soup.find_all("a", href=True)]
print(links)

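Without this step, a relative link such as /docs/start has no scheme or host, so it cannot be requested on its own. In practice, use the URL of the page you fetched as the base so that page-relative links resolve correctly.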
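
How urljoin resolves a link depends on the form of the href. The sketch below uses a hypothetical page URL and a handful of made-up hrefs to show the common cases: root-relative, page-relative, parent-relative, already absolute, and protocol-relative.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were scraped from.
page_url = "https://example.com/docs/guide/index.html"

examples = [
    "/docs/start",               # root-relative
    "intro.html",                # relative to the current directory
    "../api/",                   # relative to the parent directory
    "https://other.example/x",   # already absolute, returned unchanged
    "//cdn.example.com/lib.js",  # protocol-relative, inherits the scheme
]

resolved = {href: urljoin(page_url, href) for href in examples}
for href, url in resolved.items():
    print(f"{href!r} -> {url}")
```

Note that the page-relative cases only come out right when the base is the full page URL rather than just the site root.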

Filtering Unwanted Links

Real pages often contain mailto:, fragment links, JavaScript pseudo-links, and tracking URLs. Filtering them early makes the result cleaner.

python

from bs4 import BeautifulSoup

html = """
<a href="#top">Top</a>
<a href="mailto:hello@example.com">Mail</a>
<a href="/products">Products</a>
"""

soup = BeautifulSoup(html, "html.parser")
links = []
for tag in soup.find_all("a", href=True):
    href = tag["href"]
    # Skip in-page fragments and mail links
    if href.startswith(("#", "mailto:")):
        continue
    links.append(href)

print(links)

This is especially useful when you are building a crawler rather than just collecting every attribute value mechanically.
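
Prefix checks get long once javascript: and tel: links are added to the list. One alternative, sketched here under the assumption that only relative, http, and https links are wanted, is to allow schemes rather than block them. The `is_crawlable` helper and the `ALLOWED_SCHEMES` set are hypothetical names for this example, and tracking URLs are not handled.

```python
from urllib.parse import urlparse

# An empty scheme means the href is relative. Assumption: only web links are wanted.
ALLOWED_SCHEMES = {"", "http", "https"}

def is_crawlable(href: str) -> bool:
    """Return True for relative, http, and https links that are not bare fragments."""
    href = href.strip()
    if not href or href.startswith("#"):
        return False
    return urlparse(href).scheme.lower() in ALLOWED_SCHEMES

hrefs = [
    "#top",
    "mailto:someone@example.com",
    "javascript:void(0)",
    "tel:+15550100",
    "/products",
    "https://example.com/a",
]

kept = [h for h in hrefs if is_crawlable(h)]
print(kept)
```

An allow-list tends to fail safe: an unfamiliar scheme is dropped instead of slipping through.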

Handling Duplicates

Pages often repeat the same link several times. If you only want unique results, use a set when order does not matter, or dict.fromkeys to keep the first-seen order.

python

from bs4 import BeautifulSoup

html = """
<a href="/about">About</a>
<a href="/about">About again</a>
<a href="/contact">Contact</a>
"""

soup = BeautifulSoup(html, "html.parser")
links = [tag["href"] for tag in soup.find_all("a", href=True)]
# dict keys are unique and keep insertion order
unique_links = list(dict.fromkeys(links))
print(unique_links)

This preserves the first-seen order while removing duplicates.
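
Exact string comparison misses links that point to the same page but are written differently. A possible refinement, shown here as a sketch with a made-up base URL and href list, is to make each link absolute and drop its fragment before deduplicating.

```python
from urllib.parse import urldefrag, urljoin

base_url = "https://example.com/"  # hypothetical page URL
hrefs = ["/about", "/about#team", "https://example.com/about", "/contact"]

# Resolve against the base, then strip any #fragment so variants compare equal.
absolute = [urldefrag(urljoin(base_url, h)).url for h in hrefs]

# dict.fromkeys keeps the first occurrence of each URL in order.
unique = list(dict.fromkeys(absolute))
print(unique)
```

Whether fragments should be stripped depends on the site; the step can be left out when fragments select different content.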

Requests and Parsing Together

In real scraping code, BeautifulSoup is often paired with requests.

python

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

soup = BeautifulSoup(response.text, "html.parser")
links = [tag["href"] for tag in soup.find_all("a", href=True)]
print(links[:5])  # show the first five hrefs

This is a common workflow for lightweight scrapers.
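
The pieces from the earlier sections can be combined into one small pipeline. The sketch below is one way to arrange it, not a definitive implementation: `extract_links` and `fetch_links` are hypothetical helper names, and the parsing step is kept separate from the network call so it can be tested on a plain string.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def extract_links(html: str, base_url: str) -> list[str]:
    """Return unique absolute URLs from anchor tags, in first-seen order."""
    soup = BeautifulSoup(html, "html.parser")
    links = (urljoin(base_url, tag["href"]) for tag in soup.find_all("a", href=True))
    return list(dict.fromkeys(links))

def fetch_links(url: str) -> list[str]:
    """Download a page and extract its links (needs network access)."""
    import requests  # imported here so extract_links works without requests installed

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # response.url is the final URL after redirects, which is the right base
    return extract_links(response.text, response.url)

sample = '<a href="/a">A</a><a href="b.html">B</a><a href="/a">A again</a>'
result = extract_links(sample, "https://example.com/docs/")
print(result)
```

Calling `fetch_links("https://example.com")` would run the same extraction against a live page.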

Common Pitfalls

The most common pitfall is calling find_all("a") and then indexing tag["href"] without checking whether the attribute exists. Some anchor tags have no href, and indexing them raises KeyError.

Another issue is forgetting about relative URLs. A scraper may appear to work while quietly collecting unusable partial paths.

Duplicate links are also easy to overlook. Pages often repeat the same link, so the result list can be much larger than expected unless you deduplicate it.

Finally, do not assume static HTML always contains the links you see in the browser. Some pages render links with JavaScript, which BeautifulSoup alone will not execute.

Summary

Use soup.find_all("a", href=True) to extract anchor tags that actually have href.
Choose tag.get("href") or tag["href"] based on how defensive you want to be.
Convert relative links with urljoin when you need full URLs.
Filter fragments, mail links, and duplicates when your scraper needs cleaner output.
BeautifulSoup parses HTML well, but it does not execute JavaScript-rendered pages by itself.