tgoop.com/DataScienceN/864
• Find the first occurrence of a tag.
first_link = soup.find('a')
• Find all occurrences of a tag.
all_links = soup.find_all('a')
• Find tags by their CSS class.
articles = soup.find_all('div', class_='article-content')
• Find a tag by its ID.
main_content = soup.find(id='main-container')
• Find tags by other attributes.
images = soup.find_all('img', attrs={'data-src': True})
• Find using a list of multiple tags.
headings = soup.find_all(['h1', 'h2', 'h3'])
• Find using a regular expression.
import re
links_with_blog = soup.find_all('a', href=re.compile(r'blog'))
• Find using a custom function.
# Finds tags with a 'class' but no 'id'
tags = soup.find_all(lambda tag: tag.has_attr('class') and not tag.has_attr('id'))
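The finding patterns above can be exercised together on a small inline document. A minimal sketch — the HTML snippet and variable names here are invented for illustration:

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="main-container">
  <a class="nav" href="/blog/first">First post</a>
  <a href="/about">About</a>
  <div class="article-content"><h2>Hello</h2></div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

first_link = soup.find('a')                                # first <a> only
all_links = soup.find_all('a')                             # every <a>
articles = soup.find_all('div', class_='article-content')  # by CSS class
main = soup.find(id='main-container')                      # by ID
classed_no_id = soup.find_all(
    lambda tag: tag.has_attr('class') and not tag.has_attr('id')
)                                                          # custom function

print(first_link['href'])  # /blog/first
print(len(all_links))      # 2
```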
• Limit the number of results.
first_five_links = soup.find_all('a', limit=5)
• Use CSS Selectors to find one element.
footer = soup.select_one('#footer > p')
• Use CSS Selectors to find all matching elements.
article_links = soup.select('div.article a')
• Select direct children using a CSS selector.
nav_items = soup.select('ul.nav > li')

IV. Extracting Data with BeautifulSoup

• Get the text content from a tag.
title_text = soup.title.get_text()
• Get stripped text content.
link_text = soup.find('a').get_text(strip=True)
• Get all text from the entire document.
all_text = soup.get_text()
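The extraction calls in this section can be seen side by side on a small invented snippet (a quick sketch, not output from a real page):

```python
from bs4 import BeautifulSoup

snippet = '<h1>News</h1><p>Read <a href="/story"> more </a> here.</p>'
soup = BeautifulSoup(snippet, 'html.parser')

link = soup.find('a')
print(repr(link.get_text()))      # ' more ' (surrounding whitespace kept)
print(link.get_text(strip=True))  # 'more'
print(link['href'])               # '/story'
print(link.name)                  # 'a'
print(link.attrs)                 # {'href': '/story'}
print(soup.get_text())            # all text in the document, concatenated
```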
• Get an attribute's value (like a URL).
link_url = soup.find('a')['href']
• Get the tag's name.
tag_name = soup.find('h1').name
• Get all attributes of a tag as a dictionary.
attrs_dict = soup.find('img').attrs

V. Parsing with lxml and XPath

• Import the library.
from lxml import html
• Parse HTML content with lxml.
tree = html.fromstring(response.content)
• Select elements using an XPath expression.
# Selects all <a> tags inside <div> tags with class 'nav'
links = tree.xpath('//div[@class="nav"]/a')
• Select text content directly with XPath.
# Gets the text of all <h1> tags
h1_texts = tree.xpath('//h1/text()')
• Select an attribute value with XPath.
# Gets all href attributes from <a> tags
hrefs = tree.xpath('//a/@href')
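The three XPath patterns above can be combined in one self-contained sketch. The HTML fragment is invented for illustration; a real script would parse response.content instead:

```python
from lxml import html

page = """
<html><body>
  <div class="nav">
    <a href="/home">Home</a>
    <a href="/docs">Docs</a>
  </div>
  <h1>Welcome</h1>
</body></html>
"""
tree = html.fromstring(page)

nav_links = tree.xpath('//div[@class="nav"]/a')  # element nodes
h1_texts = tree.xpath('//h1/text()')             # text nodes
hrefs = tree.xpath('//a/@href')                  # attribute values

print([a.text for a in nav_links])  # ['Home', 'Docs']
print(h1_texts)                     # ['Welcome']
print(hrefs)                        # ['/home', '/docs']
```

Note that XPath returns plain strings for text() and @attr queries, but element objects for node queries, so the two kinds need different handling.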
VI. Handling Dynamic Content (Selenium)

• Import the webdriver.
from selenium import webdriver
• Initialize a browser driver.
driver = webdriver.Chrome() # Requires chromedriver
• Navigate to a webpage.
driver.get('http://example.com')
• Find an element by its ID.
element = driver.find_element('id', 'my-element-id')
• Find elements by CSS Selector.
elements = driver.find_elements('css selector', 'div.item')
• Find an element by XPath.
button = driver.find_element('xpath', '//button[@type="submit"]')
• Click a button.
button.click()
• Enter text into an input field.
search_box = driver.find_element('name', 'q')
search_box.send_keys('Python Selenium')
• Wait for an element to become visible.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Waits up to 10 seconds; 'results' is an example ID
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, 'results'))
)
By Data Science Jupyter Notebooks