How to scrape otto.de?

I have a bit of a lazy question, but I was getting blocked a lot when trying to load the site:

I specifically want to scrape this category:

and extract all the product titles and prices.

Can someone help point me in the right direction for this?

To scrape Otto.de, use Selenium with ChromeDriver, since the site relies on dynamic, JavaScript-rendered content.

Steps include:

  1. Load Page: Open Otto.de and handle cookie prompts with Selenium.
  2. Search/Navigation: Use CSS selectors or XPath to locate search bars and results.
  3. Data Extraction: Parse product names, prices, and other details.
  4. Pagination: Automate pagination or scrolling to retrieve all results.
  5. Save Data: Export to CSV or JSON for easy analysis.

To avoid being blocked while scraping websites like Otto.de, you need to employ several techniques that mimic human browsing behavior and help you evade bot detection systems.

Here are some strategies that can minimize your chances of being blocked:

  1. Rate Limiting: Avoid sending too many requests in a short period. Implement delays (randomized intervals) between requests.
  2. IP Rotation and Proxies: Use rotating proxies to spread your requests across multiple IP addresses.
  3. User-Agent Rotation: Regularly rotate the User-Agent header to mimic requests from different browsers and devices.
  4. Headless Browsers and Browser Automation: Use tools like Selenium with headless browsers (e.g., Chromium or Firefox in headless mode) to simulate actual user interaction with the website.
  5. Mimic Human Interaction: Emulate real user behavior, such as clicking, scrolling, and waiting for page loads.
  6. Handling CAPTCHAs: If the site uses CAPTCHAs to block scraping, you can either solve them with a service like 2Captcha or reduce how often they are triggered by combining the techniques above (see the sketch below).
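Here is a minimal sketch of points 1-3 using Selenium in Python. The fake_useragent package and the proxy endpoint (proxy.example.com:8080) are assumptions for illustration; substitute whatever proxy provider and target URLs you actually use.

import random
import time

from selenium import webdriver
from fake_useragent import UserAgent  # assumed installed: pip install fake-useragent

options = webdriver.ChromeOptions()
# User-Agent rotation: pick a fresh user-agent string for this session
options.add_argument(f"user-agent={UserAgent().random}")
# Proxy (hypothetical endpoint): route requests through a rotating proxy
options.add_argument("--proxy-server=http://proxy.example.com:8080")

driver = webdriver.Chrome(options=options)

urls = [
    "https://www.otto.de/",
    # ...add the category pages you want to visit
]
for url in urls:
    driver.get(url)
    # Rate limiting: randomized pause between requests to look less bot-like
    time.sleep(random.uniform(2.0, 6.0))

driver.quit()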

Thanks @Rohitash, can I ask you to provide some example code? Either Node.js or Python, with setup instructions?

Here is the Python code snippet below.

Let me know if you have any further questions.



From the points mentioned earlier:


  - Headless Mode: The browser runs without a GUI for better performance. You can remove --headless to see it in action.
  - User-Agent Rotation: The script uses fake_useragent to rotate user-agents and avoid detection.
  - Random Delays: Mimics human behavior to prevent being flagged as a bot.
  - CSV Export: Saves product data to products.csv.


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
from fake_useragent import UserAgent
import csv

# Configure Chrome options for stealth and user-agent rotation
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Optional: run in headless mode
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument(f"user-agent={UserAgent().random}")

# Path to your ChromeDriver
driver_path = '/path/to/chromedriver'
service = Service(driver_path)

# Create the WebDriver instance
driver = webdriver.Chrome(service=service, options=options)

# Open Otto.de
driver.get("https://www.otto.de")

# Handle cookie consent if present
try:
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
    ).click()
    print("Cookie consent handled.")
except Exception as e:
    print("No cookie consent found or error:", e)

# Search for products
search_term = "laptops"
try:
    search_bar = driver.find_element(By.NAME, "search")
    search_bar.send_keys(search_term)
    search_bar.submit()
    print(f"Searching for '{search_term}'")
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "product-wrapper"))
    )
except Exception as e:
    print("Search error:", e)

# Extract product details and handle pagination
with open("products.csv", "w", newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Price"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    while True:
        try:
            products = driver.find_elements(By.CLASS_NAME, "product-wrapper")
            for product in products:
                try:
                    name = product.find_element(By.CLASS_NAME, "product-title").text
                    price = product.find_element(By.CLASS_NAME, "product-price").text
                    writer.writerow({"Name": name, "Price": price})
                    print(f"Name: {name}, Price: {price}")
                except Exception as e:
                    print("Error extracting product details:", e)

            # Check for and click the next page button
            next_button = driver.find_element(By.XPATH, "//button[contains(@class, 'pagination-next')]")
            next_button.click()
            time.sleep(random.uniform(3, 6))  # Random delay for human-like interaction
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "product-wrapper"))
            )
        except Exception as e:
            print("Pagination finished or error navigating:", e)
            break

# Close the browser
driver.quit()
print("Scraping complete. Data saved to 'products.csv'.")

@Rohitash I find this code a little bit hard to follow and implement on my end, as it's broken up with explanations in between. Would you mind creating a gist with the full code that I would be able to run on my end, so that I can test it? Thanks!

@proxyrackevan I will respond to your query soon!

@Rohitash While following the instructions, I encountered an issue with the ChromeDriver path. The original code included a line:

driver_path = '/path/to/chromedriver'

This placeholder path led to an error (ValueError: The path is not a valid file) when attempting to run the script after manually specifying the exact location of ChromeDriver on my system.
It was not immediately clear where to get ChromeDriver or how to find the correct path.
As suggested by @proxyrackevan, creating a gist with the full code would be easier to follow.

Here is the working script. It will generate a products.csv file with the data.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
import csv
from bs4 import BeautifulSoup

# Open Otto.de and accept the cookie banner
driver = webdriver.Chrome()
driver.get('https://www.otto.de/')
sleep(2)

cookies = driver.find_element(By.ID, "onetrust-accept-btn-handler")
cookies.click()
sleep(4)

# Enter a search term and submit it
search_bar = driver.find_element(By.CSS_SELECTOR, ".squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz")
search_bar.click()
search_bar.send_keys("Gefunden auf Otto.de" + Keys.RETURN)
sleep(5)

# Scroll down the results page in small steps so lazy-loaded tiles are rendered
initial_height = driver.execute_script("return document.body.scrollHeight")
scroll_position = 0
total_scrolls = 20

for _ in range(total_scrolls):
    driver.execute_script(f"window.scrollTo(0, {scroll_position + initial_height / total_scrolls});")
    scroll_position += initial_height / total_scrolls
    sleep(5 / total_scrolls)
sleep(8)

# Parse the fully rendered page with BeautifulSoup
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

product_elements = soup.find_all('article', attrs={'data-product-listing-type': 'SearchResultPage'})
print(f'Found {len(product_elements)} product elements.')

# Write the extracted titles and prices to products.csv
with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Product Number', 'Product Title', 'Product Price'])

    for idx, product in enumerate(product_elements, 1):
        title_element = product.find('p', class_='find_tile__name pl_copy100')
        title = title_element.get_text(strip=True) if title_element else 'No Title Found'

        price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue')  # Standard price tile
        if not price_element:
            price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue find_tile__priceValue--red')  # Variant with the --red modifier
        price = price_element.get_text(strip=True) if price_element else 'No Price Found'

        writer.writerow([idx, title, price])
        print(f"{idx}. Title: {title}")
        print(f"{idx}. Price: {price}")

driver.quit()

Thank you for posting the full code. However, Python is a language where indentation matters, and it is all lost when copying and pasting this code from your response. Could you please send a gist with the code formatted as it should be?

I am not able to attach the file here.
Here is the link; you can download the file from there.

@Rohitash I think the easiest way to share the code here would be in a gist, as this would allow anyone to easily view it without needing to download anything, and would also allow anyone to easily copy and paste it into their own code with the correct formatting. Would you mind making a gist of this and posting it here? Thank you.

OK, let me share the gist.

Thanks @Rohitash, I was banging my head trying to extract data from the product URL, but it kept failing after extracting some items; this code resolved my problem. I have updated the code above to extract the next pages as well, and it is also working fine.
@terminal1, here is the code if you need it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
import csv
from bs4 import BeautifulSoup


driver = webdriver.Chrome()
driver.get('https://www.otto.de/')
sleep(5)

cookies = driver.find_element(By.ID, "onetrust-accept-btn-handler")
cookies.click()
sleep(8)

search_bar = driver.find_element(By.CSS_SELECTOR, ".squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz")
search_bar.click()
search_bar.send_keys("Gefunden auf Otto.de" + Keys.RETURN)
sleep(8)

initial_height = driver.execute_script("return document.body.scrollHeight")
scroll_position = 0
total_scrolls = 20
with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Product Number', 'Product Title', 'Product Price'])
    while True:
        for _ in range(total_scrolls):
            driver.execute_script(f"window.scrollTo(0, {scroll_position + initial_height / total_scrolls});")
            scroll_position += initial_height / total_scrolls
            sleep(5 / total_scrolls)
        sleep(8)

        page_source = driver.page_source
        soup = BeautifulSoup(page_source, 'html.parser')

        product_elements = soup.find_all('article', attrs={'data-product-listing-type': 'SearchResultPage'})
        print(f'Found {len(product_elements)} product elements.')



        for idx, product in enumerate(product_elements, 1):
            title_element = product.find('p', class_='find_tile__name pl_copy100')
            title = title_element.get_text(strip=True) if title_element else 'No Title Found'

            price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue')  # Try with one class
            if not price_element:
                price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue find_tile__priceValue--red')  # Try with another class
            price = price_element.get_text(strip=True) if price_element else 'No Price Found'

            writer.writerow([idx, title, price])
            print(f"{idx}. Title: {title}")
            print(f"{idx}. Price: {price}")
        cat_rule = driver.find_element(By.CSS_SELECTOR,'div#avContent,div#reptile-tilelist-bracket').get_attribute('data-rule')

        nextpage = driver.find_element(By.CSS_SELECTOR,'li#reptile-paging-bottom-next > button').get_attribute('data-page')

        if nextpage:
            nextpage = json.loads(nextpage)
            url = driver.current_url.split("?")[0]
            url = f"{url}?l=gq&o={nextpage.get('o')}"
            driver.get(url)
        else:
            break
driver.quit()

Thank you

@darkrace let me check

Thanks for sharing the snippet. It worked, but only partially: I was able to get the product titles and prices, but it did not load the next page. The code also stops executing on slow networks, and I suspect the reason is that it uses sleep() frequently:

sleep(5)
sleep(8)
sleep(5 / total_scrolls)

This makes the script unreliable, as it either wastes time or fails when the page is slower than expected.

If the internet is slow, we need to use WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.XPATH, "<ELEMENT>"))) for each find_element call, so the script waits until the element has actually loaded.
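For example, reusing the selectors from the script above (and assuming the same driver object), the fixed sleeps around the cookie banner and the search bar could be replaced with explicit waits roughly like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 60)  # same 60-second timeout as mentioned above

# Instead of sleep(5)/sleep(8): continue as soon as the cookie button is clickable
wait.until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()

# Instead of a fixed sleep before searching: wait for the search field to be present
search_bar = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, ".squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz")))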

@darkrace But why do we have to wait for 60 seconds if the element might load faster than that? Is there any event-based trigger that does not rely on timer-based triggers?
Also, could you help with the pagination-related issue I mentioned in my last message? The code was not able to load the next page's products and threw some JSON error.

1) @proxyrackubair 60 seconds is the maximum wait. The code only waits until the element is present in the DOM (if the element loads in 2 seconds, it jumps to the next statement after 2 seconds). 60 seconds is just the upper bound: if the element has not loaded within 60 seconds, the code will break.
2) For the pagination issue, I see that I missed import json in the code above; add it at the top and you will not see the error.

So now I don't get the JSON error, but it's still not able to get the next-page element. I'm getting an error at nextpage = driver.find_element(By.CSS_SELECTOR,'li#reptile-paging-bottom-next > button').get_attribute('data-page')
with the error: selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"li#reptile-paging-bottom-next > button"}, even though the element's id is correctly specified in the code.