Working with Data Sources 2

Web Scraping:

1. We can use requests.get to download the HTML of a webpage.
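A minimal sketch of the download step, assuming the third-party requests package is installed. The helper name fetch_html and the timeout value are illustrative choices, not part of the original notes.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (placeholder URL, needs network access):
# content = fetch_html("https://example.com")
```

The returned string can then be passed to BeautifulSoup as the content to parse.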

2. To extract content from the downloaded page, we can use the BeautifulSoup library.

  from bs4 import BeautifulSoup

  parser = BeautifulSoup(content, 'html.parser') # Initialize the parser with the page content

  body = parser.body # Get the <body> tag from the parser

  p = body.p # Get the first <p> tag inside <body>

  head = parser.head

  title_text = head.title.text #Get the content from <title></title>
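The attribute-style navigation above can be tried without a network connection. The sketch below uses a small made-up HTML string in place of a downloaded page; the page content is an assumption for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up page used only for illustration.
content = """
<html>
  <head><title>A simple example page</title></head>
  <body>
    <p>Here is some simple content for this page.</p>
    <p>A second paragraph.</p>
  </body>
</html>
"""

parser = BeautifulSoup(content, "html.parser")

# Attribute access returns only the FIRST matching tag.
first_p = parser.body.p
title_text = parser.head.title.text

# A tag that does not exist comes back as None rather than raising.
missing = parser.body.table

print(first_p.text)
print(title_text)
```

Because attribute access stops at the first match, the second paragraph is not reachable this way.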

3. We can use the find_all function to find every matching element in the webpage. find_all can only be called on bs4 objects (the parser itself or a tag), not on the list it returns.

  head = parser.find_all("head") # Find all elements with the tag head and save them as a list in the variable head.

  title = head[0].find_all("title")

  title_text = title[0].text
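A short sketch of how find_all behaves when there are several matches or none, using a made-up HTML string. It also shows find, which returns the first match (or None) instead of a list.

```python
from bs4 import BeautifulSoup

# Made-up page used only for illustration.
content = (
    "<html><head><title>Demo</title></head>"
    "<body><p>one</p><p>two</p></body></html>"
)
parser = BeautifulSoup(content, "html.parser")

# find_all returns a list-like result with every matching tag.
paragraphs = parser.find_all("p")
texts = [p.text for p in paragraphs]

# No match gives an empty list, so indexing it with [0] would raise IndexError.
tables = parser.find_all("table")

# find returns the first match directly, or None when nothing matches.
first_title = parser.find("title")

print(texts)
print(len(tables))
print(first_title.text)
```

Checking the length of the result before indexing avoids an IndexError on pages that lack the expected tag.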

4. find_all can also find an element by its id. find_all always returns a list, even when only one element matches.

  second_paragraph_text = parser.find_all("p", id="second")[0].text

5. find_all can also find elements by class. The keyword is class_ (with a trailing underscore) because class is a reserved word in Python.

  second_inner_paragraph_text = parser.find_all("p", class_="inner-text")[1].text # "p" is the tag name; class_ is the class to match.
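The id and class lookups above can be checked against a small made-up page. The ids and class names mirror the ones used in the notes, but the HTML itself is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page used only for illustration.
content = """
<body>
  <p id="first" class="outer-text">First outer paragraph.</p>
  <p id="second" class="outer-text">Second outer paragraph.</p>
  <div>
    <p class="inner-text">First inner paragraph.</p>
    <p class="inner-text">Second inner paragraph.</p>
  </div>
</body>
"""
parser = BeautifulSoup(content, "html.parser")

# An id should be unique on a page, but find_all still returns a list.
second_paragraph_text = parser.find_all("p", id="second")[0].text

# Several tags can share a class; index 1 selects the second match.
inner = parser.find_all("p", class_="inner-text")
second_inner_paragraph_text = inner[1].text

print(second_paragraph_text)
print(second_inner_paragraph_text)
```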

6. We can also use CSS selectors to find specific content. Like find_all, the select method works on bs4 objects and returns a list.

  first_outer_text = parser.select(".outer-text")[0].text 

  second_text = parser.select("#second")[0].text
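A sketch of the selector syntax on a made-up page: a leading dot selects by class, a leading # selects by id, and a space selects descendants. The HTML and the class name box are assumptions for illustration; select_one is the single-result counterpart of select.

```python
from bs4 import BeautifulSoup

# Made-up page used only for illustration.
content = """
<div>
  <p id="first" class="outer-text">First outer paragraph.</p>
  <p id="second" class="outer-text">Second outer paragraph.</p>
  <div class="box"><p class="inner-text">Inner paragraph.</p></div>
</div>
"""
parser = BeautifulSoup(content, "html.parser")

# ".name" matches a class, "#name" matches an id.
first_outer_text = parser.select(".outer-text")[0].text
second_text = parser.select("#second")[0].text

# Selectors can be combined: every <p> nested inside <div class="box">.
boxed = [p.text for p in parser.select("div.box p")]

# select_one returns the first match or None instead of a list.
missing = parser.select_one("#missing")

print(first_outer_text)
print(second_text)
print(boxed)
```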

Time: 2025-01-04 02:27:03
