How to scrape otto.de?

I have a bit of a lazy question, but I was getting blocked a lot when trying to load the site:

I specifically want to scrape this category:

and extract all the product titles and prices.

Can someone help point me in the right direction for this?

To scrape Otto.de, use Selenium with ChromeDriver, since the site relies on dynamic, JavaScript-rendered content.

Steps include:

  1. Load Page: Open Otto.de and handle cookie prompts with Selenium.
  2. Search/Navigation: Use CSS selectors or XPath to locate search bars and results.
  3. Data Extraction: Parse product names, prices, and other details.
  4. Pagination: Automate pagination or scrolling to retrieve all results.
  5. Save Data: Export to CSV or JSON for easy analysis.

To avoid being blocked while scraping websites like Otto.de, you need to employ several techniques that mimic human browsing behavior and help you evade bot detection systems.

Here are some strategies that can minimize your chances of being blocked:

  1. Rate Limiting: Avoid sending too many requests in a short period. Implement delays (randomized intervals) between requests.
  2. IP Rotation and Proxies: Use rotating proxies to spread your requests across multiple IP addresses.
  3. User-Agent Rotation: Regularly rotate the User-Agent header to mimic requests from different browsers and devices.
  4. Headless Browsers and Browser Automation: Use tools like Selenium with headless browsers (e.g., Chromium or Firefox in headless mode) to simulate actual user interaction with the website.
  5. Mimic Human Interaction: Emulate real user behavior, such as clicking, scrolling, and waiting for page loads.
  6. Handling CAPTCHAs: If the site uses CAPTCHAs to block scraping, you can either solve them with a service like 2Captcha or reduce how often they are triggered by combining the techniques above (see the sketch below).
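Here is a minimal sketch of points 1-3 using Selenium in Python. The fake_useragent package and the proxy endpoint (proxy.example.com:8080) are assumptions for illustration; substitute whatever proxy provider and target URLs you actually use.

import random
import time

from selenium import webdriver
from fake_useragent import UserAgent  # assumed installed: pip install fake-useragent

options = webdriver.ChromeOptions()
# User-Agent rotation: pick a fresh user-agent string for this session
options.add_argument(f"user-agent={UserAgent().random}")
# Proxy (hypothetical endpoint): route requests through a rotating proxy
options.add_argument("--proxy-server=http://proxy.example.com:8080")

driver = webdriver.Chrome(options=options)

urls = [
    "https://www.otto.de/",
    # ...add the category pages you want to visit
]
for url in urls:
    driver.get(url)
    # Rate limiting: randomized pause between requests to look less bot-like
    time.sleep(random.uniform(2.0, 6.0))

driver.quit()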

Thanks @Rohitash, can I ask you to provide some example code? Either Node.js or Python, with setup instructions?

Here is the Python code snippet below.

Let me know if you have any further questions.



From the points mentioned earlier:


  - Headless Mode: The browser runs without a GUI for better performance. You can remove --headless to see it in action.
  - User-Agent Rotation: The script uses fake_useragent to rotate user-agents and avoid detection.
  - Random Delays: Mimics human behavior to prevent being flagged as a bot.
  - CSV Export: Saves product data to products.csv.


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
from fake_useragent import UserAgent
import csv

# Configure Chrome options for stealth and user-agent rotation
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Optional: run in headless mode
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument(f"user-agent={UserAgent().random}")

# Path to your ChromeDriver
driver_path = '/path/to/chromedriver'
service = Service(driver_path)

# Create the WebDriver instance
driver = webdriver.Chrome(service=service, options=options)

# Open Otto.de
driver.get("https://www.otto.de")

# Handle cookie consent if present
try:
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
    ).click()
    print("Cookie consent handled.")
except Exception as e:
    print("No cookie consent found or error:", e)

# Search for products
search_term = "laptops"
try:
    search_bar = driver.find_element(By.NAME, "search")
    search_bar.send_keys(search_term)
    search_bar.submit()
    print(f"Searching for '{search_term}'")
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "product-wrapper"))
    )
except Exception as e:
    print("Search error:", e)

# Extract product details and handle pagination
with open("products.csv", "w", newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Price"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    while True:
        try:
            products = driver.find_elements(By.CLASS_NAME, "product-wrapper")
            for product in products:
                try:
                    name = product.find_element(By.CLASS_NAME, "product-title").text
                    price = product.find_element(By.CLASS_NAME, "product-price").text
                    writer.writerow({"Name": name, "Price": price})
                    print(f"Name: {name}, Price: {price}")
                except Exception as e:
                    print("Error extracting product details:", e)

            # Check for and click the next page button
            next_button = driver.find_element(By.XPATH, "//button[contains(@class, 'pagination-next')]")
            next_button.click()
            time.sleep(random.uniform(3, 6))  # Random delay for human-like interaction
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "product-wrapper"))
            )
        except Exception as e:
            print("Pagination finished or error navigating:", e)
            break

# Close the browser
driver.quit()
print("Scraping complete. Data saved to 'products.csv'.")

@Rohitash I find this code a little bit hard to follow and implement on my end, as it's broken up with explanations in between. Would you mind creating a gist with the full code that I would be able to run on my end, so that I can test it? Thanks!

@proxyrackevan I will respond to your query soon!

@Rohitash While following the instructions, I encountered an issue with the ChromeDriver path. The original code included a line:

driver_path = '/path/to/chromedriver'

This placeholder path led to an error (ValueError: The path is not a valid file) when attempting to run the script after manually specifying the exact location of ChromeDriver on my system.
It was not immediately clear where to get ChromeDriver or how to find the correct path.
As suggested by @proxyrackevan, creating a gist with the full code would be easier to follow.

Here is the working script. It will generate a products.csv file with the data.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
import csv
from bs4 import BeautifulSoup

# Open Otto.de and accept the cookie banner
driver = webdriver.Chrome()
driver.get('https://www.otto.de/')
sleep(2)

cookies = driver.find_element(By.ID, "onetrust-accept-btn-handler")
cookies.click()
sleep(4)

# Enter a search term and submit it
search_bar = driver.find_element(By.CSS_SELECTOR, ".squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz")
search_bar.click()
search_bar.send_keys("Gefunden auf Otto.de" + Keys.RETURN)
sleep(5)

# Scroll down the results page in small steps so lazy-loaded tiles are rendered
initial_height = driver.execute_script("return document.body.scrollHeight")
scroll_position = 0
total_scrolls = 20

for _ in range(total_scrolls):
    driver.execute_script(f"window.scrollTo(0, {scroll_position + initial_height / total_scrolls});")
    scroll_position += initial_height / total_scrolls
    sleep(5 / total_scrolls)
sleep(8)

# Parse the fully rendered page with BeautifulSoup
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

product_elements = soup.find_all('article', attrs={'data-product-listing-type': 'SearchResultPage'})
print(f'Found {len(product_elements)} product elements.')

# Write the extracted titles and prices to products.csv
with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Product Number', 'Product Title', 'Product Price'])

    for idx, product in enumerate(product_elements, 1):
        title_element = product.find('p', class_='find_tile__name pl_copy100')
        title = title_element.get_text(strip=True) if title_element else 'No Title Found'

        price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue')  # Standard price tile
        if not price_element:
            price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue find_tile__priceValue--red')  # Variant with the --red modifier
        price = price_element.get_text(strip=True) if price_element else 'No Price Found'

        writer.writerow([idx, title, price])
        print(f"{idx}. Title: {title}")
        print(f"{idx}. Price: {price}")

driver.quit()

Thank you for posting the full code. However, Python is a language where indentation matters, and it is all lost when copying and pasting this code from your response. Could you please send a gist with the code formatted as it should be?

I am not able to attach the file here.
Here is the link; you can download the file from there.

@Rohitash I think the easiest way to share the code here would be in a gist, as this would allow anyone to easily view it without needing to download anything, and would also allow anyone to easily copy and paste it into their own code with the correct formatting. Would you mind making a gist of this and posting it here? Thank you.

OK, let me share the gist.

Thanks @Rohitash, I was banging my head trying to extract data from the product URL, but it kept failing after extracting some items; this code resolved my problem. I have updated the code above to extract the next pages as well, and it is also working fine.
@terminal1, here is the code if you need it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
import csv
from bs4 import BeautifulSoup


driver = webdriver.Chrome()
driver.get('https://www.otto.de/')
sleep(5)

cookies = driver.find_element(By.ID, "onetrust-accept-btn-handler")
cookies.click()
sleep(8)

search_bar = driver.find_element(By.CSS_SELECTOR, ".squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz")
search_bar.click()
search_bar.send_keys("Gefunden auf Otto.de" + Keys.RETURN)
sleep(8)

initial_height = driver.execute_script("return document.body.scrollHeight")
scroll_position = 0
total_scrolls = 20
with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Product Number', 'Product Title', 'Product Price'])
    while True:
        for _ in range(total_scrolls):
            driver.execute_script(f"window.scrollTo(0, {scroll_position + initial_height / total_scrolls});")
            scroll_position += initial_height / total_scrolls
            sleep(5 / total_scrolls)
        sleep(8)

        page_source = driver.page_source
        soup = BeautifulSoup(page_source, 'html.parser')

        product_elements = soup.find_all('article', attrs={'data-product-listing-type': 'SearchResultPage'})
        print(f'Found {len(product_elements)} product elements.')



        for idx, product in enumerate(product_elements, 1):
            title_element = product.find('p', class_='find_tile__name pl_copy100')
            title = title_element.get_text(strip=True) if title_element else 'No Title Found'

            price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue')  # Try with one class
            if not price_element:
                price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue find_tile__priceValue--red')  # Try with another class
            price = price_element.get_text(strip=True) if price_element else 'No Price Found'

            writer.writerow([idx, title, price])
            print(f"{idx}. Title: {title}")
            print(f"{idx}. Price: {price}")
        cat_rule = driver.find_element(By.CSS_SELECTOR,'div#avContent,div#reptile-tilelist-bracket').get_attribute('data-rule')

        nextpage = driver.find_element(By.CSS_SELECTOR,'li#reptile-paging-bottom-next > button').get_attribute('data-page')

        if nextpage:
            nextpage = json.loads(nextpage)
            url = driver.current_url.split("?")[0]
            url = f"{url}?l=gq&o={nextpage.get('o')}"
            driver.get(url)
        else:
            break
driver.quit()

Thank you

@darkrace let me check

Thanks for sharing the snippet. It worked, but only partially: I was able to get the product titles and prices, but it did not load the next page. The code also stops executing on slow networks, and I suspect the reason is that it uses sleep() frequently:

sleep(5)
sleep(8)
sleep(5 / total_scrolls)

This makes the script unreliable, as it either wastes time or fails when the page is slower than expected.

If the internet is slow, we need to use WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.XPATH, "<ELEMENT>"))) for each find_element call, so the script waits until the element has actually loaded.
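For example, reusing the selectors from the script above (and assuming the same driver object), the fixed sleeps around the cookie banner and the search bar could be replaced with explicit waits roughly like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 60)  # same 60-second timeout as mentioned above

# Instead of sleep(5)/sleep(8): continue as soon as the cookie button is clickable
wait.until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()

# Instead of a fixed sleep before searching: wait for the search field to be present
search_bar = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, ".squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz")))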

@darkrace But why do we have to wait for 60 seconds if the element might load faster than that? Is there any event-based trigger that does not rely on timer-based triggers?
Also, could you help with the pagination-related issue I mentioned in my last message? The code was not able to load the next page's products and threw some JSON error.

1) @proxyrackubair 60 seconds is the maximum wait. The code only waits until the element is present in the DOM (if the element loads in 2 seconds, it jumps to the next statement after 2 seconds). 60 seconds is just the upper bound: if the element has not loaded within 60 seconds, the code will break.
2) For the pagination issue, I see that I missed import json in the code above; add it at the top and you will not see the error.

So now I don't get the JSON error, but it's still not able to get the next-page element. I'm getting an error at nextpage = driver.find_element(By.CSS_SELECTOR,'li#reptile-paging-bottom-next > button').get_attribute('data-page')
with the error: selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"li#reptile-paging-bottom-next > button"}, even though the element's id is correctly specified in the code.