Building Browser-Using AI Agents in Python
Introduction
Most AI agent tutorials start with an API. They show you how to call OpenWeather, hit the Stripe endpoint, pull data from GitHub. That is a fine starting point until you try to build something real and realize that the task you actually need done does not have an API.
Think about what humans do with browsers every day: filing government forms, reading competitor pricing, extracting research from sites that guard their data behind JavaScript rendering, logging into portals that have never heard of OAuth. There are roughly 1.1 billion websites on the internet. A vanishingly small fraction of them have public APIs. The rest only speak browser.
An agent that is limited to API calls handles maybe 5% of the tasks a human worker does daily. Give that agent a browser, and the coverage approaches everything. That is the gap this article closes.
The global AI agents market stands at $10.91 billion in 2026 and is projected to reach $50.31 billion by 2030, with browser-capable agents at the center of that growth. 27.7% of enterprises are already running agentic browsers in production, up from virtually none two years prior. The tooling has matured fast, and the patterns are settled enough to teach properly.
By the end of this article, you will have a working browser agent that navigates real websites, fills forms, extracts structured data, and connects to an LLM that decides what to do next, all in Python.
Why Playwright, Not Selenium
If you built browser automation five years ago, you built it with Selenium. Selenium is still widely deployed, still works, and is not going anywhere. But for any new project in 2026, Playwright is the default. The reasons are practical, not theoretical.
Selenium communicates with the browser by sending individual HTTP requests to a WebDriver. Every action - click, type, scroll - is a separate request. Playwright uses a persistent WebSocket connection for the entire session. Commands flow through that channel with no per-action round-trip cost. Independent benchmarks consistently show Playwright running 30-50% faster than Selenium at the test-suite level and averaging ~290ms per action versus Selenium's ~536ms. For a browser agent that might execute hundreds of actions, that gap compounds.
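To see how that per-action gap compounds, here is a back-of-the-envelope calculation using the two averages quoted above. The action counts are made-up illustrations, and real timings will vary by site, network, and hardware.

```python
# Rough arithmetic only: the per-action averages are the figures quoted above,
# and the action counts below are hypothetical.
PLAYWRIGHT_MS = 290
SELENIUM_MS = 536


def total_seconds(actions: int, ms_per_action: int) -> float:
    """Total wall-clock time for a run of sequential actions, in seconds."""
    return actions * ms_per_action / 1000


def time_saved(actions: int) -> float:
    """Seconds saved over a whole run by the faster per-action average."""
    return total_seconds(actions, SELENIUM_MS) - total_seconds(actions, PLAYWRIGHT_MS)


if __name__ == "__main__":
    for actions in (10, 100, 500):
        print(f"{actions:>4} actions: ~{time_saved(actions):.1f}s saved")
```

Under these assumptions, a 500-action session differs by roughly two minutes, which is noticeable when an agent runs many sessions a day.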
Playwright also bundles its own browser binaries. When you install it, you get pre-configured versions of Chromium, Firefox, and WebKit that are guaranteed to work with your Playwright version. No driver version mismatches, no broken CI pipelines because someone updated Chrome. It also has built-in auto-waiting: before it clicks an element, it verifies that the element is visible, enabled, and not animating. You do not have to write time.sleep(2) and hope for the best.
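Conceptually, auto-waiting replaces a fixed sleep with a polling loop that has a deadline. The sketch below is a plain-Python illustration of that idea, not Playwright's actual implementation; wait_until and its parameters are hypothetical names chosen for this example.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.05) -> float:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns the seconds waited; raises TimeoutError if the deadline passes.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if condition():
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for "the element becomes clickable": ready 0.2s from now.
    ready_at = time.monotonic() + 0.2
    waited = wait_until(lambda: time.monotonic() >= ready_at)
    print(f"Proceeded after {waited:.2f}s instead of a fixed 2s sleep")
```

The point of the pattern is that the code continues as soon as the condition holds and fails explicitly when it never does, whereas a fixed sleep is either too long or too short.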
For AI agents specifically, Playwright fires real mouse and keyboard events that mirror how humans interact with browsers. Sites designed to detect automation look for synthetic DOM clicks. Playwright's interaction model is harder to distinguish from genuine human input.
There is also the browser-use library, which sits one level higher. Browser-use is a Python library that gives an LLM a working browser. Under the hood, it uses Playwright to drive the browser, but the LLM reads the page state and decides what to click, type, and extract - no CSS selectors required. You give it a task in plain English, and it figures out the rest.
We will cover both raw Playwright and browser-use in this article, because they serve different needs: Playwright when you want precise, predictable control; browser-use when you want the agent to handle navigation decisions autonomously.
Setting Up the Environment
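You need Python 3.10 or higher (recent browser-use releases have required 3.11, so check its current requirements before you start), an OpenAI API key, and about five minutes.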
Step 1: Create a virtual environment
python -m venv browser_agent_env
# macOS / Linux
source browser_agent_env/bin/activate
# Windows
browser_agent_env\Scripts\activate
Step 2: Install dependencies
pip install playwright \
browser-use \
langchain \
langchain-openai \
langgraph \
langchain-community \
python-dotenv
Step 3: Install the browser binaries
This is the step most people miss. The Python package does not include the browsers themselves; Playwright downloads its browser binaries in a separate step. Run this once after installing:
playwright install chromium
To install all three browser engines (Chromium, Firefox, and WebKit), run playwright install with no arguments. Chromium alone is sufficient for most agent work and is a smaller download.
Step 4: Store your API key
Create a .env file in your project directory:
OPENAI_API_KEY=your_openai_api_key_here
Add .env to your .gitignore immediately. Do not commit API keys.
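A small guard at startup makes a missing key fail loudly rather than surfacing later as an opaque authentication error. This sketch assumes python-dotenv from Step 2 is installed; get_api_key is a hypothetical helper name, not part of any library.

```python
import os

try:
    # python-dotenv was installed in Step 2. load_dotenv() looks for a .env
    # file and copies its entries into os.environ; by default it does not
    # override variables that are already set.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    # Fall back to whatever is already in the process environment.
    pass


def get_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the named key, or raise with a clear message if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Add it to your .env file.")
    return value
```

Calling get_api_key() once at the top of a script is enough; the rest of the code can then assume the key exists.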
Step 5: Verify everything works
Here is a first script that navigates to a URL, reads the heading, and saves a screenshot. It uses example.com, a domain reserved by IANA for documentation and testing, so it will not block you.
Save as first_run.py and run python first_run.py:
# first_run.py
# Navigate to a URL, take a screenshot, and extract the page title.
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python first_run.py
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        # Launch Chromium in headless mode (no visible browser window).
        # Set headless=False if you want to watch it run during development.
        browser = await p.chromium.launch(headless=True)

        # A browser context is like a fresh browser profile.
        # It isolates cookies, storage, and cache from other contexts.
        context = await browser.new_context(
            viewport={"width": 1280, "height": 720},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()

        # Navigate to the URL and wait until the network is idle.
        # "networkidle" means no network connections for at least 500 ms.
        # For faster pages, "domcontentloaded" is sufficient.
        await page.goto("https://example.com", wait_until="networkidle")

        # Extract the page title
        title = await page.title()
        print(f"Page title: {title}")

        # Extract the text content of the h1 heading
        h1 = await page.text_content("h1")
        print(f"H1 heading: {h1}")

        # Take a full-page screenshot and save it to disk
        await page.screenshot(path="screenshot.png", full_page=True)
        print("Screenshot saved to screenshot.png")

        await browser.close()


asyncio.run(main())
What this does:
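- async_playwright() is the entry point for the entire Playwright session.
- The browser context (created with browser.new_context()) is equivalent to opening a fresh incognito window; cookies, local storage, and cache are isolated from everything else.
- wait_until="networkidle" tells Playwright to wait until there have been no network connections for at least 500 ms before your code continues. It is a conservative choice for a first script, but Playwright's documentation discourages relying on it, and pages that poll in the background may never go idle. Waiting for a specific element is usually more reliable.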
If this runs and saves a screenshot, your environment is working correctly.
Web Navigation and Scraping
The reason you need Playwright instead of requests + BeautifulSoup is JavaScript rendering. Modern websites deliver a skeleton of HTML and then build the actual content dynamically after the page loads: React, Vue, Angular, Next.js. A plain HTTP request fetches the skeleton. Playwright runs a real browser, so it sees exactly what a human sees after all JavaScript has executed.
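The difference is easy to see with two hand-written HTML strings: one standing in for the skeleton a plain HTTP client receives, the other for the DOM after JavaScript has run. Both strings and the product-card class name are invented for illustration; the parser is Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical markup: what a plain HTTP GET might return for a client-rendered app.
SKELETON_HTML = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

# Hypothetical markup: the same page after the script has populated the DOM.
RENDERED_HTML = (
    '<html><body><div id="root">'
    '<div class="product-card">Widget A</div>'
    '<div class="product-card">Widget B</div>'
    '</div></body></html>'
)


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements(html: str, class_name: str) -> int:
    """Return how many elements in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


if __name__ == "__main__":
    print("skeleton:", count_elements(SKELETON_HTML, "product-card"))
    print("rendered:", count_elements(RENDERED_HTML, "product-card"))
```

A scraper that only ever sees the skeleton finds nothing to extract, no matter how good its selectors are; the content simply is not in the response yet.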
The target below is books.toscrape.com, a legal scraping sandbox built for practice. It paginates results, uses dynamic class names for ratings, and closely mirrors the structure of real e-commerce product pages.
Save as scrape_books.py and run python scrape_books.py:
# scrape_books.py
# Scrape book titles, prices, and ratings from books.toscrape.com
# This is a legal scraping sandbox site built for practice.
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python scrape_books.py
import asyncio
import json

from playwright.async_api import async_playwright


async def scrape_books(max_pages: int = 3) -> list[dict]:
    """
    Scrape book listings from books.toscrape.com across multiple pages.
    Returns a list of dicts with title, price, rating, and page number.
    """
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(viewport={"width": 1280, "height": 720})
        page = await context.new_page()

        for page_num in range(1, max_pages + 1):
            url = f"https://books.toscrape.com/catalogue/page-{page_num}.html"
            print(f"Scraping page {page_num}: {url}")
            await page.goto(url, wait_until="domcontentloaded")

            # Wait for the product cards to be visible before extracting.
            # This is critical on JavaScript-heavy pages where content loads after the HTML.
            # timeout=10000 means wait up to 10 seconds before raising an error.
            await page.wait_for_selector("article.product_pod", timeout=10000)

            # Get all book cards on the current page
            books = await page.query_selector_all("article.product_pod")
            for book in books:
                # Extract title from the <a> tag's title attribute
                title_el = await book.query_selector("h3 a")
                title = await title_el.get_attribute("title") if title_el else "N/A"

                # Extract price text
                price_el = await book.query_selector(".price_color")
                price = await price_el.inner_text() if price_el else "N/A"

                # Extract star rating from the CSS class name.
                # e.g. <p class="star-rating Three"> → "Three"
                rating_el = await book.query_selector("p.star-rating")
                rating_class = await rating_el.get_attribute("class") if rating_el else ""
                rating = rating_class.replace("star-rating", "").strip()

                results.append({
                    "title": title,
                    "price": price,
                    "rating": rating,
                    "page": page_num,
                })
            print(f"  Extracted {len(books)} books from page {page_num}")

        await browser.close()
    return results


async def main():
    books = await scrape_books(max_pages=2)
    print(f"\nTotal books scraped: {len(books)}")
    print(json.dumps(books[:3], indent=2))


asyncio.run(main())
What this does:
- wait_for_selector() is the key call here. Instead of sleeping for a fixed time and hoping the content has loaded, it watches the DOM and proceeds the moment the target element appears, or raises a TimeoutError if it does not appear within the timeout window. That is the right behavior: fail fast and explicitly rather than silently extracting from an empty page.
- The rating extraction deserves attention. The star rating is encoded as a CSS class (star-rating Three), not a number. The code strips "star-rating" from the class string to get the text value. This is the kind of thing you only know by inspecting the actual HTML. When you hand this task to a raw LLM with no browser, it has no way to know what the class structure looks like. With Playwright, you can inspect it directly and extract it exactly.
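The scraped values are still strings ("Three", "£51.77"), which is awkward for sorting or filtering. A possible post-processing step is sketched below; parse_rating and parse_price are hypothetical helpers, and the example price format is an assumption about what the site returns.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# Word-to-number mapping for rating classes such as "star-rating Three".
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr: str) -> Optional[int]:
    """Turn a class string like 'star-rating Three' into 3, or None if unknown."""
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


def parse_price(text: str) -> Optional[Decimal]:
    """Turn a price string like '£51.77' into Decimal('51.77'), or None."""
    # Keep only digits and the decimal point; drop currency symbols and spaces.
    digits = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    try:
        return Decimal(digits) if digits else None
    except InvalidOperation:
        return None


if __name__ == "__main__":
    print(parse_rating("star-rating Three"), parse_price("£51.77"))
```

Because parse_rating scans the whole class string, it works on either the raw class attribute or the already-stripped word. Decimal is used instead of float so that prices do not pick up binary rounding errors.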
Form Completion and Multi-Step Flows
Filling forms is where browser agents earn their keep and where most automation scripts fail. The reason is that web forms are not just inputs and buttons. They fire focus, input, change, and blur events in sequence. JavaScript validation listens for those events. If you inject a value into an input field by directly setting value in the DOM (as older automation tools often do), the validation listeners never fire and the form breaks.
Playwright's fill() and click() methods fire real browser events in the right order, which is why they work on form validation that would block lower-level approaches.
The target below is the-internet.herokuapp.com/login, a public test site maintained specifically for automation practice. It accepts tomsmith / SuperSecretPassword! as valid credentials and returns clear success/failure messages.
Save as form_submit.py and run python form_submit.py:
# form_submit.py
# Complete and submit a multi-field login form on a public demo site.
# Target: https://the-internet.herokuapp.com/login (public test site)
# Prerequisites: pip install playwright && playwright install chromium
# How to run: python form_submit.py
import asyncio

from playwright.async_api import async_playwright


async def login_and_verify(username: str, password: str) -> dict:
    """
    Attempt to log in to a demo site and return whether it succeeded.
    Handles: input filling, button clicking, and result verification.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://the-internet.herokuapp.com/login")

        # Wait for the form to be visible before interacting.
        # state="visible" is the default but makes the intent explicit.
        await page.wait_for_selector("#username", state="visible")

        # fill() clears the field first, then types the value.
        # It fires the focus, input, and change events in order.
        await page.fill("#username", username)
        await page.fill("#password", password)

        # click() fires real mouse events -- mousedown, mouseup, click.
        # This triggers JavaScript listeners that a plain DOM click misses.
        await page.click("button[type='submit']")

        # Wait for the page to settle after submission
        await page.wait_for_load_state("networkidle")

        # Check for success or failure messages
        flash_message = await page.text_content(".flash")
        success = "You logged into a secure area!" in (flash_message or "")

        await browser.close()
        return {"success": success, "message": flash_message}


async def main():
    result = await login_and_verify("tomsmith", "SuperSecretPassword!")
    print(f"Login result: {result}")

    # Test with invalid credentials
    result = await login_and_verify("wrong", "credentials")
    print(f"Login result (invalid): {result}")


asyncio.run(main())
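The value returned by text_content() is raw: it typically carries the surrounding whitespace and, on this demo page, the "×" from the banner's close button (an assumption about the page's markup, so check it against what you actually receive). A small, hypothetical clean-up helper such as summarize_flash makes the result easier to log or pass to an LLM:

```python
from typing import Optional

SUCCESS_TEXT = "You logged into a secure area!"


def summarize_flash(raw: Optional[str]) -> dict:
    """Normalize a raw flash-banner string into a success flag and clean message."""
    # Drop the close-button glyph (assumed to be "×"), then collapse all
    # runs of whitespace, including newlines, into single spaces.
    text = " ".join((raw or "").replace("×", " ").split())
    return {"success": SUCCESS_TEXT in text, "message": text}


if __name__ == "__main__":
    # Hand-written sample input, not captured from the live site.
    print(summarize_flash("\n   You logged into a secure area!\n   ×\n"))
```

In login_and_verify above, the final return could pass flash_message through this helper instead of returning the raw string.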