
May 12, 2023

How to extract text from webpages

Get text in markdown format from a page at the same moment as taking a screenshot

Jonathan Markwell

If you’ve ever tried extracting useful text from webpages, you know it can be a nightmare.

It means removing all the code that makes text difficult to read, retaining just enough markup and spacing to keep the text readable, and ignoring the supporting parts of the page, such as navigation, when they get in the way.

All of this is even more important when working with LLMs such as OpenAI’s GPT. You need to keep within token limits while controlling the cost and performance.

You know Urlbox is and always will be primarily focused on providing website screenshots you can depend on:

  • The most accurate renders, resilient to the worst crimes against HTML & CSS.
  • The most powerful features, reliably generating the images you want at scale.
  • The fastest possible response times, without compromising your infrastructure’s security or your customers’ privacy.

It turns out all those things make a huge difference when extracting text.

Urlbox customers have been using our HTML output feature to grab content with their screenshots for years.

Last year we added metadata extraction to remove a processing step. Soon after we added custom metadata to ease data extraction with JavaScript. Today you can try out our latest feature.

Introducing markdown output.

Just as with our HTML output feature, there are two ways to get markdown output from Urlbox:

  1. Set the format to md:

     https://api.urlbox.io/v1/[API_KEY]/md?url=example.com

  2. Set the save_markdown option:

     curl -X POST \
       https://api.urlbox.io/v1/render/sync \
       -H 'Authorization: Bearer YOUR_URLBOX_API_SECRET' \
       -H 'Content-Type: application/json' \
       -d '{"url":"https://example.com", "save_markdown": true}'

You'll then get a response like this:

{
  "renderUrl": "http://storage.googleapis.com/...fad77fa39.png",
  "markdownUrl": "http://storage.googleapis.com/...fad77fa39.md",
  "size": 30499
}
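
For anyone calling the API from code rather than the command line, the two approaches can be sketched in Python using only the standard library. This is a minimal sketch, not official client code: the helper names (markdown_url, build_sync_request, extract_markdown_url) are hypothetical, the key and secret are placeholders, and the response is assumed to have the same shape as the sample above. The second helper only builds the request; passing it to urllib.request.urlopen would actually send it.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.urlbox.io/v1"  # base URL taken from the examples above


def markdown_url(api_key: str, target: str) -> str:
    """Option 1: build a GET link that uses md as the format."""
    query = urllib.parse.urlencode({"url": target})
    return f"{API_BASE}/{api_key}/md?{query}"


def build_sync_request(api_secret: str, target: str) -> urllib.request.Request:
    """Option 2: build (but do not send) the POST with save_markdown enabled."""
    body = json.dumps({"url": target, "save_markdown": True}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/render/sync",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_secret}",
            "Content-Type": "application/json",
        },
    )


def extract_markdown_url(response_text: str) -> str:
    """Pull markdownUrl out of a JSON response shaped like the sample above."""
    return json.loads(response_text)["markdownUrl"]


if __name__ == "__main__":
    # Placeholders only: substitute your own key and secret before sending anything.
    print(markdown_url("YOUR_API_KEY", "example.com"))
    request = build_sync_request("YOUR_URLBOX_API_SECRET", "https://example.com")
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect or log exactly what will be posted before any render is triggered.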

You can save markdown along with a screenshot or PDF, HTML and metadata, all in one request, and have everything go straight into your own S3 bucket if you prefer. It works great with our webhooks feature too.

We've also created a free tool at url2text.com where you can try it out in the browser. It's perfect for copying & pasting page content into ChatGPT.

[Image: a screenshot of URL2Text.com]

We'll soon be adding some additional options to help reduce the number of characters/tokens in the returned markdown. Please let us know what you'd like to see first.
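
Until those options arrive, a rough client-side budget check is one way to keep returned markdown within a model's limits. The sketch below is an illustration only, not an Urlbox feature: the function names are hypothetical, and the figure of about four characters per token is a common rule of thumb for English text rather than an exact count, so a real tokenizer is the better choice when precision matters.

```python
CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb for English text


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def trim_to_budget(markdown: str, max_tokens: int) -> str:
    """Keep whole paragraphs from the top until the estimated budget is used up.

    The blank lines between paragraphs are not counted, so treat the result
    as an approximation and leave some headroom in max_tokens.
    """
    kept = []
    used = 0
    for paragraph in markdown.split("\n\n"):
        cost = estimate_tokens(paragraph)
        if used + cost > max_tokens:
            break  # stop at the first paragraph that would exceed the budget
        kept.append(paragraph)
        used += cost
    return "\n\n".join(kept)
```

Cutting at paragraph boundaries keeps headings and sentences intact, which tends to read better in a prompt than a hard cut at a fixed character offset.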

We can't wait to hear what you build with it.

This is the first of the features we're working on to support your work integrating AI into your software. We already have a ChatGPT Plugin available. And there's more to come.
