用Python爬取六大平台的弹幕、评论,看这一篇就够了( 五 )


https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=&page_size=10https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=7963601726142521&page_size=20https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=4933019153543021&page_size=20区别在于参数last_id和page_size 。page_size在第一条url中的值为10,从第二条url开始固定为20 。last_id在首条url中值为空,从第二条开始会不断发生变化,经过我的研究,last_id的值就是从前一条url中的最后一条评论内容的用户id(应该是用户id);网页数据格式为json格式 。
 
实战代码import requestsimport pandas as pdimport timeimport randomheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}df = pd.DataFrame()try:a = 0while True:if a == 0:url = 'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&page_size=10'else:# 从id_list中得到上一条页内容中的最后一个id值url = f'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id={id_list[-1]}&page_size=20'print(url)res = requests.get(url, headers=headers).json()id_list = []# 建立一个列表保存id值for i in res['data']['comments']:ids = i['id']id_list.append(ids)uname = i['userInfo']['uname']addTime = i['addTime']content = i.get('content', '不存在')# 用get提取是为了防止键值不存在而发生报错,第一个参数为匹配的key值,第二个为缺少时输出text = pd.DataFrame({'ids': [ids], 'uname': [uname], 'addTime': [addTime], 'content': [content]})df = pd.concat([df, text])a += 1time.sleep(random.uniform(2, 3))except Exception as e:print(e)df.to_csv('哥斯拉大战金刚_评论.csv', mode='a+', encoding='utf-8', index=False)结果展示:

用Python爬取六大平台的弹幕、评论,看这一篇就够了

文章插图
 
知乎本文以爬取知乎热点话题《如何看待网传腾讯实习生向腾讯高层提出建议颁布拒绝陪酒相关条令?》为例,讲解如爬取知乎回答!
网页地址:
https://www.zhihu.com/question/478781972 
分析网页经过查看网页源代码等方式,确定该网页回答内容为动态加载的,需要进入浏览器的开发者工具进行抓包 。进入Noetwork→XHR,用鼠标在网页向下拉取,得到我们需要的数据包:
用Python爬取六大平台的弹幕、评论,看这一篇就够了

文章插图
 
得到的真实url:
https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=defaulthttps://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=5&platform=desktop&sort_by=defaulturl有很多不必要的参数,大家可以在浏览器中自行删减 。两条url的区别在于后面的offset参数,首条url的offset参数为0,第二条为5,offset是以公差为5递增;网页数据格式为json格式 。


推荐阅读