I have a very lazy question but I was receiving a lot of blocks when trying to load:
I specifically want to scrape this category:
and extract all the product titles and prices.
Can someone help point me in the right direction for this?
To scrape Otto.de (http://Otto.de), use Selenium with ChromeDriver, because the site's content is dynamic and rendered with JavaScript.
Steps include:
To avoid being blocked while scraping websites like Otto.de, you need to employ several techniques that mimic human browsing behavior and help you evade bot detection systems.
Here are some strategies that can minimize your chances of being blocked:
Thanks @Rohitash, could you provide some example code? Either Node.js or Python, with setup instructions?
Here is the Python code snippet below…
Let me know if you have any further questions…
From the points mentioned earlier…
Headless Mode: The browser runs without a GUI for better performance. You can remove --headless to see it in action.
User-Agent Rotation: The script uses fake_useragent to rotate user-agents and avoid detection.
Random Delays: Mimics human behavior to prevent being flagged as a bot.
CSV Export: Saves product data to products.csv.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
from fake_useragent import UserAgent
import csv

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Optional: run in headless mode
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument(f"user-agent={UserAgent().random}")

driver_path = "/path/to/chromedriver"
service = Service(driver_path)
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://www.otto.de")

try:
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
    ).click()
    print("Cookie consent handled.")
except Exception as e:
    print("No cookie consent found or error:", e)

search_term = "laptops"
try:
    search_bar = driver.find_element(By.NAME, "search")
    search_bar.send_keys(search_term)
    search_bar.submit()
    print(f"Searching for '{search_term}'...")
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "product-wrapper"))
    )
except Exception as e:
    print("Search error:", e)

with open("products.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = ["Name", "Price"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    while True:
        try:
            products = driver.find_elements(By.CLASS_NAME, "product-wrapper")
            for product in products:
                try:
                    name = product.find_element(By.CLASS_NAME, "product-title").text
                    price = product.find_element(By.CLASS_NAME, "product-price").text
                    writer.writerow({"Name": name, "Price": price})
                    print(f"Name: {name}, Price: {price}")
                except Exception as e:
                    print("Error extracting product details:", e)

            # Check for and click the next page button
            next_button = driver.find_element(By.XPATH, "//button[contains(@class, 'pagination-next')]")
            next_button.click()
            time.sleep(random.uniform(3, 6))  # Random delay for human-like interaction
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "product-wrapper"))
            )
        except Exception as e:
            print("Pagination finished or error navigating:", e)
            break

driver.quit()
print("Scraping complete. Data saved to 'products.csv'.")
@Rohitash I find this code a little hard to follow and implement on my end, as it's broken up with explanations in between. Would you mind creating a gist with the full code so that I can run and test it on my side? Thanks!
@proxyrackevan I will respond to your query soon!
@Rohitash While following the instructions, I encountered an issue with the ChromeDriver path. The original code included a line:
driver_path = '/path/to/chromedriver'
This placeholder path led to an error (ValueError: The path is not a valid file) when attempting to run the script after manually specifying the exact location of ChromeDriver on my system.
It was not immediately clear where to get ChromeDriver or how to find the correct path.
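In case it helps anyone else: ChromeDriver builds can be downloaded from the official ChromeDriver / Chrome for Testing download pages, and driver_path just needs to point at the extracted binary. Also, with recent Selenium releases (4.6 and later, if I understand Selenium Manager correctly) the explicit driver path should not be needed at all, because Selenium fetches a matching driver automatically. A minimal sketch, assuming a recent selenium package is installed:

```python
from selenium import webdriver

# Selenium 4.6+ ships Selenium Manager, which downloads a matching ChromeDriver
# automatically, so no driver_path / Service(...) is required here
driver = webdriver.Chrome()
driver.get("https://www.otto.de")
print(driver.title)
driver.quit()
```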
As suggested by @proxyrackevan, creating a gist with the full code would be easier to follow.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
import csv
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.otto.de/')
sleep(2)

cookies = driver.find_element(By.ID, 'onetrust-accept-btn-handler')
cookies.click()
sleep(4)

search_bar = driver.find_element(By.CSS_SELECTOR, '.squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz')
search_bar.click()
search_bar.send_keys('Gefunden auf Otto.de' + Keys.RETURN)
sleep(5)

initial_height = driver.execute_script('return document.body.scrollHeight')
scroll_position = 0
total_scrolls = 20
for _ in range(total_scrolls):
    driver.execute_script(f"window.scrollTo(0, {scroll_position + initial_height / total_scrolls});")
    scroll_position += initial_height / total_scrolls
    sleep(5 / total_scrolls)
sleep(8)

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
product_elements = soup.find_all('article', attrs={'data-product-listing-type': 'SearchResultPage'})
print(f'Found {len(product_elements)} product elements.')

with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Product Number', 'Product Title', 'Product Price'])
    for idx, product in enumerate(product_elements, 1):
        title_element = product.find('p', class_='find_tile__name pl_copy100')
        title = title_element.get_text(strip=True) if title_element else 'No Title Found'
        price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue')  # Try with one class
        if not price_element:
            price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue find_tile__priceValue--red')  # Try with another class
        price = price_element.get_text(strip=True) if price_element else 'No Price Found'
        writer.writerow([idx, title, price])
        print(f"{idx}. Title: {title}")
        print(f"{idx}. Price: {price}")

driver.quit()
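A note on the price lookup: the second class string adds find_tile__priceValue--red, which seems to be used for discounted (red) prices. If the exact class list ever changes, a more tolerant option is to match on a single class via BeautifulSoup's CSS selectors, roughly like this sketch (selector names taken from the code above):

```python
# Match any span carrying the find_tile__priceValue class, regardless of the
# other classes on the element
price_element = product.select_one('span.find_tile__priceValue')
price = price_element.get_text(strip=True) if price_element else 'No Price Found'
```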
Thank you for posting the full code. However, Python is a language where indentation matters, and the indentation is lost when copying and pasting this code from your response. Could you please share a gist with the code formatted as it should be?
I am not able to attach the file here.
Here is the link; you can download the file from there.
@Rohitash I think the easiest way to share the code here would be in a gist, as that would allow anyone to view it without downloading anything and to copy and paste it into their own code with the correct formatting. Would you mind making a gist of this and posting it here? Thank you.
ok, let me share the gist
Thanks @Rohitash, I was banging my head trying to extract data from the product URL, but it kept failing after extracting some items. This code resolved my problem. I have updated the code above to extract the next pages as well, and it is also working fine.
@terminal1, if you need it, here is the code:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
import csv
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.otto.de/')
sleep(5)

cookies = driver.find_element(By.ID, "onetrust-accept-btn-handler")
cookies.click()
sleep(8)

search_bar = driver.find_element(By.CSS_SELECTOR, ".squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz")
search_bar.click()
search_bar.send_keys("Gefunden auf Otto.de" + Keys.RETURN)
sleep(8)

initial_height = driver.execute_script("return document.body.scrollHeight")
scroll_position = 0
total_scrolls = 20

with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Product Number', 'Product Title', 'Product Price'])
    while True:
        for _ in range(total_scrolls):
            driver.execute_script(f"window.scrollTo(0, {scroll_position + initial_height / total_scrolls});")
            scroll_position += initial_height / total_scrolls
            sleep(5 / total_scrolls)
        sleep(8)

        page_source = driver.page_source
        soup = BeautifulSoup(page_source, 'html.parser')
        product_elements = soup.find_all('article', attrs={'data-product-listing-type': 'SearchResultPage'})
        print(f'Found {len(product_elements)} product elements.')

        for idx, product in enumerate(product_elements, 1):
            title_element = product.find('p', class_='find_tile__name pl_copy100')
            title = title_element.get_text(strip=True) if title_element else 'No Title Found'
            price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue')  # Try with one class
            if not price_element:
                price_element = product.find('span', class_='find_tile__retailPrice pl_headline50 find_tile__priceValue find_tile__priceValue--red')  # Try with another class
            price = price_element.get_text(strip=True) if price_element else 'No Price Found'
            writer.writerow([idx, title, price])
            print(f"{idx}. Title: {title}")
            print(f"{idx}. Price: {price}")

        cat_rule = driver.find_element(By.CSS_SELECTOR, 'div#avContent,div#reptile-tilelist-bracket').get_attribute('data-rule')
        nextpage = driver.find_element(By.CSS_SELECTOR, 'li#reptile-paging-bottom-next > button').get_attribute('data-page')
        if nextpage:
            nextpage = json.loads(nextpage)
            url = driver.current_url.split("?")[0]
            url = f"{url}?l=gq&o={nextpage.get('o')}"
            driver.get(url)
        else:
            break

driver.quit()
```
Thank you
@darkrace let me check
Thanks for sharing the snippet. It worked, but only partially: I was able to get the product and price, but it did not load the next page. Also, the code stops executing on slow networks, and I suspect this is the reason:
The code uses sleep() frequently:
sleep(5)
sleep(8)
sleep(5 / total_scrolls)
This makes the script unreliable, as it either wastes time or fails when the page is slower than expected.
If the internet is slow, we need to use WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.XPATH, "<ELEMENT>"))) for each find_element call, so the script waits until the element has actually loaded.
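For example, a rough sketch of swapping the fixed sleeps for explicit waits, reusing the cookie-button and search-field selectors from the code above:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.otto.de/')

# 60 s is only the upper bound; the wait returns as soon as the condition is met
wait = WebDriverWait(driver, 60)

# Replaces sleep(5): wait for the cookie banner to become clickable, then accept it
wait.until(EC.element_to_be_clickable((By.ID, 'onetrust-accept-btn-handler'))).click()

# Replaces sleep(8): wait for the search field to appear before typing into it
search_bar = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '.squirrel_searchfield.js_squirrel_searchbar__input.svelte-11jrfxz')
))
```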
@darkrace, but why do we have to wait for 60 seconds if an element might load faster than that? Is there an event-based trigger that would not rely on timer-based triggers?
Also, could you help with the pagination issue I mentioned in my last message? The code was not able to load the next page's products and threw a JSON error.
1) @proxyrackubair, 60 seconds is the maximum wait. The code only waits until the element appears in the DOM (if the element loads in 2 seconds, it jumps to the next statement after 2 seconds). 60 seconds is just the upper bound: if the element has not loaded within 60 seconds, the code will break.
2) For the pagination issue: I see that in the code above I missed import json. Add it at the top and you will not see the error.
So now I don't get the JSON error, but it's not able to get the next-page element. I'm getting an error at nextpage = driver.find_element(By.CSS_SELECTOR,'li#reptile-paging-bottom-next > button').get_attribute('data-page')
with error: selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"li#reptile-paging-bottom-next > button"}
even though the element id is correctly mentioned in the code.
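For now, a guard like this (just a sketch on my side) at least avoids the crash by using find_elements, which returns an empty list instead of raising, but it does not explain why the button is missing:

```python
# find_elements returns [] when nothing matches, so no NoSuchElementException is raised
buttons = driver.find_elements(By.CSS_SELECTOR, 'li#reptile-paging-bottom-next > button')
if buttons:
    nextpage = buttons[0].get_attribute('data-page')
else:
    nextpage = None  # treat a missing button as the last page
```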