ToolGrid — Product & Engineering
Leads product strategy, technical architecture, and implementation of the core platform that powers ToolGrid calculators.
Extract all URLs and web links from text, documents, HTML, or code with support for HTTP/HTTPS links, relative URLs, email links, and various URL formats. Perfect for link extraction, web scraping preparation, and analyzing text content for embedded links.
Note: AI-generated output can contain mistakes, so please double-check the results.
Common questions about this tool
Paste your text, HTML, or document content into the tool. It automatically scans the content, identifies all URLs (including http://, https://, and relative URLs), and lists them in a clean, organized format for easy copying or analysis.
The tool recognizes standard URLs (http://, https://), relative URLs (/path/to/page), email links (mailto:), FTP links (ftp://), and various URL formats. It handles URLs embedded in HTML, plain text, markdown, and code.
Yes, the tool can extract URLs from HTML source code, including links in anchor tags (<a href>), image sources (<img src>), script sources, stylesheet links, and other HTML attributes containing URLs.
The tool can filter URLs by domain, protocol, or validate them for proper format. You can also export the list and use additional tools to check URL validity, accessibility, or categorize them by domain.
Yes, you can paste content from multiple sources or upload files. The tool processes all content and extracts URLs from all sources, providing a consolidated list of unique URLs found across all inputs.
Verified content & sources
This tool's content and its supporting explanations have been created and reviewed by subject-matter experts. Calculations and logic are based on established research sources.
Scope: interactive tool, explanatory content, and related articles.
ToolGrid — Research & Content
Conducts research, designs calculation methodologies, and produces explanatory content to ensure accurate, practical, and trustworthy tool outputs.
Based on 2 research sources.
Learn what this tool does, when to use it, and how it fits into your workflow.
This tool scans text, documents, or code and extracts all HTTP and HTTPS URLs it finds. You paste or load content, run extraction, and get a list of links together with domains, validity flags, and optional categories. The tool can also perform smart cleaning, such as removing common tracking parameters, and it can export unique URLs as plain text, CSV, or JSON for further use.
The problem it solves is collecting and cleaning web links buried inside unstructured content. When you receive logs, HTML snippets, markdown, reports, or exported data, URLs may appear in many places and formats. Manually copying them out is slow and error-prone, and you may accidentally keep duplicates or copy links with tracking query parameters still attached. This tool automates discovery, normalization, and deduplication of links so you can focus on analysis.
The extract URLs tool is aimed at SEO specialists, analysts, developers, security engineers, and anyone who needs to audit or reuse links. A beginner can paste text and click "Extract URLs" for a quick list. Technical users can enable smart cleaning to remove tracker parameters, copy or download results, and use the optional AI categorization to group URLs by type.
A URL (Uniform Resource Locator) is a text string that identifies the location of a resource on the web, usually starting with protocols like "http://" or "https://". In real-world content, URLs appear in plain text, HTML attributes, logs, comments, and code. They often include query parameters for tracking campaigns, user sessions, or analytics, which are not always desirable when you want a clean list of destinations.
URL extraction means scanning a block of text, recognizing substrings that match a URL pattern, and returning those substrings in a structured form. Smart extraction adds extra steps, such as validating each URL, extracting the domain, removing known tracking parameters, and identifying duplicates. This is especially helpful for link audits, security reviews, or analytics pipelines where you want one clean instance of each destination. A related operation involves extracting email addresses as part of a similar workflow.
People struggle with extraction when using manual search or naive copy-paste, because they may miss links, include trailing punctuation, or repeatedly capture the same URL in slightly different forms. A regex optimized for HTTP and HTTPS URLs, combined with careful normalization and deduplication, gives more reliable results. This tool builds on that approach and adds a basic statistics panel plus AI-based categorization for deeper inspection.
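The regex-plus-cleanup approach described above can be sketched in a few lines of TypeScript. The pattern and function name here are illustrative, not the tool's actual source:

```typescript
// Match http/https URLs up to whitespace or common delimiters.
const URL_PATTERN = /https?:\/\/[^\s<>"')\]]+/g;

function extractUrls(text: string): string[] {
  const matches = text.match(URL_PATTERN) ?? [];
  // Trim trailing punctuation that commonly clings to URLs in prose.
  return matches.map((m) => m.replace(/[.,;:!?]+$/, ""));
}
```

Note how the trailing-punctuation pass catches cases like a URL at the end of a sentence, which a naive regex alone would capture with the period attached.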
Link audits from web pages. You can copy HTML source or rendered text from a web page and paste it into the tool. After running extraction with smart cleaning, you get a list of clean URLs without tracking parameters, along with domains and counts. This helps check which domains a page links out to and how many unique destinations are present.
Processing logs or email content. System logs and emails often contain embedded links. Instead of searching manually, you paste the entire log section or message into the tool. The extractor finds all HTTP/HTTPS URLs, marks invalid ones, and gives you a deduplicated list you can inspect or test further.
Building URL lists for scraping or checks. Before running automated checks, crawlers, or uptime tests, you may want a clean list of URLs from documentation or CSV exports. The extract URLs tool lets you combine content from multiple sources into the input area, extract and clean the links, and export them as a plain text or JSON file ready for your scripts. For adjacent tasks, generating URL slugs is a complementary step.
Removing tracking clutter. If you receive URLs with long tracking query strings, enabling smart cleaning lets you see what the canonical destination is after removing tracking parameters. The tracker removal counts show how many such parameters were present, which can be useful when reporting on link hygiene.
The extraction function begins by checking for a non-empty string and truncating the text if it exceeds a hard limit to protect performance. It then applies a regular expression designed to match HTTP and HTTPS URLs, capturing sequences that start with "http://" or "https://" and continue until the next whitespace or obvious delimiter. All matches are collected in an array.
To avoid overloading the tool on input with thousands of URLs, the function limits the number of matches processed to a maximum count. For each matched string, it trims whitespace and initializes counters for removed parameters, a validity flag, and a domain string.
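Those guard steps, truncating oversized input and capping the number of matches, might look roughly like this. The constants reuse the values from the limits table later in this page; `collectMatches` itself is a hypothetical name:

```typescript
const MAX_TEXT = 100_000; // characters scanned per extraction run
const MAX_URLS = 1_000;   // matches processed per run

function collectMatches(text: string): string[] {
  if (!text) return [];
  // Hard-truncate pathological inputs before running the regex.
  const clipped = text.length > MAX_TEXT ? text.slice(0, MAX_TEXT) : text;
  const matches = clipped.match(/https?:\/\/[^\s<>"]+/g) ?? [];
  // Cap the number of URLs handed to the per-URL processing stage.
  return matches.slice(0, MAX_URLS);
}
```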
When smart cleaning is enabled, the function attempts to construct a URL object from the string. If successful, it reads the hostname into the domain field and iterates over a predefined list of tracking query parameter names. For each parameter found, it deletes it from the search parameters and increments the counter. After processing, it rebuilds the URL from the modified URL object, which removes those parameters from the cleaned version. When working with related formats, encoding URL components can be a useful part of the process.
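A hedged sketch of that cleaning step follows. The tracking-parameter list shown is illustrative (the tool's actual list is not published here), but the mechanics, constructing a `URL`, deleting matching search parameters, and re-serializing, are as described above:

```typescript
// Illustrative subset; the tool's real parameter list may differ.
const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "gclid", "fbclid",
];

function cleanUrl(raw: string): { cleaned: string; domain: string; paramsRemoved: number } {
  const url = new URL(raw); // throws on malformed input
  let paramsRemoved = 0;
  for (const name of TRACKING_PARAMS) {
    if (url.searchParams.has(name)) {
      url.searchParams.delete(name);
      paramsRemoved++;
    }
  }
  // Re-serializing the URL object drops the deleted parameters.
  return { cleaned: url.toString(), domain: url.hostname, paramsRemoved };
}
```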
If URL construction fails due to malformed input, the function marks the URL as invalid but still keeps it. It then applies a simpler pattern to try to extract a hostname from the string, so you have at least a best-effort domain value for categorization.
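That fallback could look something like the following; the regex is an assumption, not the tool's actual pattern:

```typescript
// Best-effort hostname recovery for strings that `new URL()` rejects:
// grab everything between "scheme://" and the first /, ?, #, :, or space.
function bestEffortDomain(raw: string): string {
  const m = raw.match(/^[a-z]+:\/\/([^/\s?#:]+)/i);
  return m ? m[1] : "";
}
```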
For deduplication, the function normalizes each cleaned URL by converting it to lowercase and removing any trailing slash. It uses a set of normalized strings to track which URLs have already been seen. The first time a normalized URL is encountered, it is considered unique and added to the set; subsequent instances are marked as duplicates.
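A compact sketch of that normalize-and-track approach (names are illustrative):

```typescript
function dedupe(urls: string[]): { url: string; isDuplicate: boolean }[] {
  const seen = new Set<string>();
  return urls.map((url) => {
    // Normalize: lowercase, drop a trailing slash.
    const key = url.toLowerCase().replace(/\/$/, "");
    const isDuplicate = seen.has(key);
    seen.add(key);
    return { url, isDuplicate };
  });
}
```

With this scheme, `https://A.com/` and `https://a.com` normalize to the same key, so only the first occurrence counts as unique.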
Each URL is then stored in an `ExtractedURL` object with fields such as `id` (a unique identifier), `raw` (the original matched string), `cleaned` (the normalized or cleaned URL), `domain`, `isValid`, `isDuplicate`, and `paramsRemoved` when applicable. The result of extraction is an array of these objects, which the UI then renders.
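Based on the fields listed above, the record might be shaped roughly like this; the comments and optional markers are assumptions, not the tool's actual type definition:

```typescript
interface ExtractedURL {
  id: string;             // unique identifier for rendering
  raw: string;            // original matched substring
  cleaned: string;        // URL after smart cleaning (or raw, if cleaning is off)
  domain: string;         // hostname, best-effort for invalid URLs
  isValid: boolean;
  isDuplicate: boolean;
  paramsRemoved?: number; // tracking parameters stripped, when cleaning ran
  category?: string;      // filled in later by AI categorization
}
```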
For export, the `formatExport` function takes the full list of extracted URLs and filters out duplicates, leaving only unique items. In plain text mode, it joins the cleaned URLs with newline characters. In CSV mode, it builds a header row and then one row per URL, escaping double quotes in fields. In JSON mode, it constructs an array of small objects containing the URL, domain, category (or null), and validity flag, and serializes this to a formatted JSON string. In some workflows, URL encoding is a relevant follow-up operation.
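A simplified sketch of such an export function; the row shape, header names, and escaping helper are assumptions based on the description above:

```typescript
type ExportFormat = "text" | "csv" | "json";

interface UrlRow {
  cleaned: string;
  domain: string;
  category: string | null;
  isValid: boolean;
  isDuplicate: boolean;
}

function formatExport(urls: UrlRow[], format: ExportFormat): string {
  const unique = urls.filter((u) => !u.isDuplicate);
  if (format === "text") {
    return unique.map((u) => u.cleaned).join("\n");
  }
  if (format === "csv") {
    // Double any embedded quotes, then wrap each field in quotes.
    const esc = (s: string) => `"${s.replace(/"/g, '""')}"`;
    const header = "url,domain,category,valid";
    const rows = unique.map((u) =>
      [esc(u.cleaned), esc(u.domain), esc(u.category ?? ""), String(u.isValid)].join(",")
    );
    return [header, ...rows].join("\n");
  }
  return JSON.stringify(
    unique.map((u) => ({
      url: u.cleaned,
      domain: u.domain,
      category: u.category,
      valid: u.isValid,
    })),
    null,
    2
  );
}
```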
The AI categorization logic selects only unique, non-duplicate cleaned URLs and sends them to a backend service. The backend returns an array of objects including `url` and `category`. For each extracted URL, the tool searches this AI result set for a match on the cleaned URL and, if found, writes the category into the URL’s data. Subsequent rendering shows this category as a label next to the URL.
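The client-side merge of AI results could be sketched as follows; the object shapes are assumed from the description, and the backend call itself is omitted:

```typescript
interface CategoryResult {
  url: string;
  category: string;
}

// Match backend categories back onto extracted URLs by cleaned-URL equality.
function applyCategories(
  urls: { cleaned: string; isDuplicate: boolean; category?: string }[],
  aiResults: CategoryResult[],
): void {
  const byUrl = new Map(aiResults.map((r) => [r.url, r.category]));
  for (const u of urls) {
    const category = byUrl.get(u.cleaned);
    if (category !== undefined) u.category = category;
  }
}
```

URLs the backend did not label keep an undefined category, which fits the advice below to treat categories as hints rather than definitive classifications.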
| Limit or setting | Value |
|---|---|
| Maximum input length | 50,000 characters |
| Maximum file size | 5 MB per file |
| Maximum text processed internally | 100,000 characters for extraction |
| Maximum URLs processed per run | 1,000 matches |
These limits keep the extraction responsive, even on larger documents or logs.
For the most reliable results, feed the tool text that already separates code or markup from binary content. While it can handle HTML, markdown, and logs, it is not intended for binary formats without conversion.
Keep smart cleaning enabled when you want canonical URLs without campaign tracking parameters. If you need the full original query strings for debugging, turn smart cleaning off so that parameters are preserved. For related processing needs, wrapping text at fixed widths is a complementary task.
Use the statistics and tracker removal counts to monitor link hygiene. A high number of removed tracker parameters may point to content that over-uses tracking, which you might want to simplify.
Remember that only HTTP and HTTPS URLs are matched. Email-style addresses, relative paths, or other kinds of references are not extracted by the core regex in this tool. For those, consider using specialized tools.
AI categorization depends on a backend service and may sometimes fail or provide incomplete labels. Use categories as hints, not definitive classifications, and always review important URLs manually when accuracy is critical.
We’ll add articles and guides here soon. Check back for tips and best practices.