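While scraping the station names from 12306, I found that BeautifulSoup could not locate the node containing station_version.
The cause is that the script tag sits after the closing </html> tag: the 'lxml' parser ignores that part of the document, whereas html5lib does not.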
...
<!-- 购物车 -->
<div style="display: none;" class="buy-cart"><div class="cart-hd"><span class="num">0</span>
</div>
<div class="cart-bd" style="display: none;"><div class="cart-bd-top"><h3><span id="hbTrainDate">候补购票需求列表</span>
<a id="hbClear" href="javascript:void(0)" shape="rect">[清空]</a>
</h3>
<a href="javascript:void(0)" class="close" shape="rect">×</a>
</div>
<div class="cart-bd-con"><ul class="cart-tlist"></ul>
</div>
<div class="cart-bd-ft"><p class="cart-ft-tips">1、候补订单需求中可包含2个相邻乘车日期,每个乘车日期可包含2个不同“车次+席别”的组合需求。</p>
<p class="cart-ft-tips">2、排位是指您的订单在待兑现订单中的位置。当前排位仅供参考,实际排位以支付成功后为准。</p>
<a id="hbSubmit" href="javascript:void(0)" class="btn72 fr" shape="rect">添加乘客</a>
</div>
</div>
</div>
</body>
</html> # the soup produced by 'lxml' ends here
<script type="text/javascript" src="/otn/resources/js/framework/station_name.js?station_version=1.9115" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/js/framework/favorite_name.js" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/merged/queryLeftTicket_end_js.js?scriptVersion=1.9158" xml:space="preserve"></script>
...
>>> import re
>>> import requests
>>> from bs4 import BeautifulSoup as bs
>>> url = "https://kyfw.12306.cn/otn/leftTicket/init?linktypeid=dc&fs=%E4%B8%87%E5%B7%9E,WYW&ts=%E8%A5%BF%E5%AE%89,XAY&date=2019-11-05&flag=N,N,Y"
>>> response = requests.get(url, timeout=10)
>>> response.encoding = 'utf-8'
>>> soup_lxml = bs(response.text, 'lxml')
>>> soup_html5lib = bs(response.text, 'html5lib')
>>> response.close()
>>> soup_lxml.find_all(src=re.compile(".*station_version.*"))
[]
>>> soup_html5lib.find_all(src=re.compile(".*station_version.*"))
[<script src="/otn/resources/js/framework/station_name.js?station_version=1.9115" type="text/javascript" xml:space="preserve"></script>]
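If installing html5lib is not an option, the same value can probably be recovered with the standard library alone, since html.parser.HTMLParser keeps reporting tags for as long as there is input, including anything after </html>. The following is only a minimal sketch under that assumption: the names ScriptSrcCollector and find_station_version are made up for illustration, and the sample string is a trimmed-down stand-in for the page tail shown above, not the real response.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class ScriptSrcCollector(HTMLParser):
    """Hypothetical helper: collect the src of every <script> tag in the input."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_station_version(html_text):
    """Return the station_version query value, or None if no script carries it."""
    collector = ScriptSrcCollector()
    collector.feed(html_text)
    collector.close()
    for src in collector.sources:
        values = parse_qs(urlparse(src).query).get("station_version")
        if values:
            return values[0]
    return None


# Assumed sample: a shortened copy of the page tail, with the scripts after </html>.
sample = (
    '<html><body><div class="buy-cart"></div></body></html>\n'
    '<script type="text/javascript" '
    'src="/otn/resources/js/framework/station_name.js?station_version=1.9115" '
    'xml:space="preserve"></script>\n'
    '<script type="text/javascript" '
    'src="/otn/resources/js/framework/favorite_name.js" xml:space="preserve"></script>\n'
)

print(find_station_version(sample))
```

With the real page, response.text would take the place of sample. This sketch only reads the src attributes and does not build a tree, so it is a fallback for this one lookup and not a replacement for the html5lib soup.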
Original post: https://www.cnblogs.com/wawawawa-briefnote/p/11801636.html
Time: 2024-10-10 09:47:24