Scraping Article Comments from Toutiao (今日头条)

September 17, 2021 · Source: Ta来自江湖

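For a business requirement, I needed to scrape the comments on Toutiao articles. Toutiao exposes several comment endpoints, split mainly between the PC site and the mobile app.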

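The app endpoints turned out to be easier to scrape than the PC ones, mainly because large-scale crawling through them is less likely to get your IP banned. The analysis below uses http://is-hl.snssdk.com/article/v4/tab_comments/ as the example; the other endpoints differ only slightly, so the same approach carries over.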

Top-level comments URL: http://is-hl.snssdk.com/article/v4/tab_comments/?group_id=6635154779754463757&item_id=6635154779754463757&aggr_type=1&count=20&offset=20&tab_index=0&fold=1&iid=53137311418&device_id=57714824519&ac=wifi&channel=samsungapps&aid=13&app_name=news_article&version_code=701&version_name=7.0.1&device_platform=android&ab_version=611287%2C650250%2C486953%2C647938%2C648204%2C642200%2C452159%2C571131%2C641920%2C639003%2C239098%2C612192%2C641906%2C170988%2C643890%2C642339%2C594604%2C374118%2C641855%2C642664%2C644565%2C648685%2C633720%2C613177%2C550042%2C435213%2C603543%2C586998%2C609623%2C642975%2C627128%2C649426%2C614097%2C522766%2C648762%2C416055%2C621360%2C646597%2C639580%2C643097%2C630238%2C558139%2C555254%2C640008%2C635503%2C603442%2C596392%2C550818%2C630577%2C598626%2C644845%2C634911%2C646253%2C603386%2C603399%2C603404%2C603405%2C642681%2C649811%2C646564%2C648850%2C629152%2C607361%2C471797%2C609338%2C326532%2C631168%2C641414%2C646381%2C637865%2C644620%2C638168%2C648057%2C631389%2C644945%2C622716%2C644036%2C622132%2C622993%2C649184%2C640997%2C641075%2C643790%2C631607%2C633139%2C643839%2C637419%2C554836%2C549647%2C644131%2C621574%2C572465%2C649269%2C644057%2C615292%2C606547%2C442255%2C642353%2C648265%2C630218%2C546701%2C649327%2C281292%2C633176%2C632885%2C610675%2C622045%2C325614%2C620936%2C649526%2C642450%2C634871%2C646070%2C625066%2C614990%2C649284%2C498375%2C613887%2C638335%2C467515%2C644238%2C631638%2C650051%2C648895%2C648270%2C595556%2C647930%2C640690%2C638195%2C589102%2C633487%2C457481%2C649401&ab_client=a1%2Cc4%2Ce1%2Cf1%2Cg2%2Cf7&ab_group=94567%2C102753%2C181428&ab_feature=94567%2C102753&abflag=3&ssmix=a&device_type=SM-A8000&device_brand=samsung&language=zh&os_api=23&os_version=6.0.1&openudid=1869be23a123ab41&manifest_version_code=701&resolution=1080*1920&dpi=480&update_version_code=70108&_rticket=1544875730759&fp=crT_crTZPrGSFlDqFSU1F2KIFzKe&tma_jssdk_version=1.5.3.2&rom_version=23&plugin=26958&ts=1544875730&as=a2054e91026d3cdec44355&mas=0037f78d55165d05d8ec7f161068fbb831cca448e606686ef1

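The full set of parameters is the query string of the URL above.

Analysis shows that only a few of them are actually required.

Among these, offset is the pagination offset; count is the number of comments returned per request (at most 50); item_id and group_id are the article ID; and ts is the timestamp of the request.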
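As a rough illustration of the parameters just described, the sketch below builds a top-level comments URL from offset, count, item_id, group_id and ts only. It is a minimal sketch, not the original author's code: the helper names (build_comments_url, page_offsets, fetch_json) are hypothetical, the default item_id = group_id is an assumption taken from the sample URL, and whether the server accepts exactly this reduced parameter set is not confirmed here.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

COMMENTS_URL = "http://is-hl.snssdk.com/article/v4/tab_comments/"
MAX_COUNT = 50  # the text states at most 50 comments per request


def build_comments_url(group_id, item_id=None, offset=0, count=20, ts=None):
    """Build a top-level comments URL from the five parameters named in the text."""
    if not 1 <= count <= MAX_COUNT:
        raise ValueError(f"count must be between 1 and {MAX_COUNT}")
    params = {
        "group_id": group_id,
        # Assumption: item_id equals group_id, as in the sample URL.
        "item_id": group_id if item_id is None else item_id,
        "offset": offset,
        "count": count,
        # ts is the request timestamp in seconds.
        "ts": int(time.time()) if ts is None else ts,
    }
    return COMMENTS_URL + "?" + urlencode(params)


def page_offsets(total, count=MAX_COUNT):
    """Offsets needed to walk `total` comments in pages of `count`."""
    return list(range(0, total, count))


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the body as JSON (performs a network call)."""
    # Assumption: a generic User-Agent header; the real app sends its own.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To page through an article, one would call build_comments_url once per value in page_offsets(total) and pass each URL to fetch_json, pausing between requests to keep the load modest.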

Second-level comments (replies) endpoint:

http://lf-hl.snssdk.com/2/comment/v3/reply_list/?

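Of the required parameters, offset is the pagination offset and count is the number of replies returned per request (at most 50). id is the ID of the comment whose replies are being fetched, taken from the top-level comment data via id = comment['comment']['id'].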
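The reply lookup can be sketched the same way. The only access path confirmed by the text is comment['comment']['id']; the assumption that the top-level response keeps its comments in a list under the key "data" is mine, and both helper names (comment_ids, build_reply_url) are hypothetical.

```python
from urllib.parse import urlencode

REPLY_URL = "http://lf-hl.snssdk.com/2/comment/v3/reply_list/"
MAX_COUNT = 50  # the text states at most 50 replies per request


def comment_ids(response):
    """Collect comment IDs from a parsed top-level comments response.

    Assumption: the comments are a list under the "data" key.
    """
    return [comment["comment"]["id"] for comment in response.get("data", [])]


def build_reply_url(comment_id, offset=0, count=20):
    """Build a reply-list URL from the three parameters named in the text."""
    if not 1 <= count <= MAX_COUNT:
        raise ValueError(f"count must be between 1 and {MAX_COUNT}")
    params = {"id": comment_id, "offset": offset, "count": count}
    return REPLY_URL + "?" + urlencode(params)
```

Each ID returned by comment_ids can be fed to build_reply_url, increasing offset by count until a page comes back empty.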

These endpoints return each commenter's nickname and avatar, along with the comment text, comment time, like count, and reply count.
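A possible way to flatten one comment into those six fields is sketched below. The JSON key names (user_name, user_profile_image_url, text, create_time, digg_count, reply_count) are assumptions that the text does not confirm, and so is treating create_time as a Unix timestamp in seconds. Check them against a real response before relying on this.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CommentRecord:
    user_name: str
    avatar_url: str
    text: str
    created_at: datetime
    like_count: int
    reply_count: int


def parse_comment(item):
    """Turn one entry of the comments list into a CommentRecord.

    Only item["comment"] is confirmed by the text; every key read from it
    below is an assumed name.
    """
    c = item["comment"]
    return CommentRecord(
        user_name=c.get("user_name", ""),
        avatar_url=c.get("user_profile_image_url", ""),
        text=c.get("text", ""),
        # Assumption: create_time is a Unix timestamp in seconds.
        created_at=datetime.fromtimestamp(c.get("create_time", 0), tz=timezone.utc),
        like_count=c.get("digg_count", 0),
        reply_count=c.get("reply_count", 0),
    )
```

Missing keys fall back to empty or zero values, so a change in the response shape shows up as blank fields rather than a crash.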

For the complete code, see GitHub: my GitHub page

    Original author: Ta来自江湖
    Original URL: https://blog.csdn.net/codingforhaifeng/article/details/88801546