Create jianshu.py by LuoZijun · Pull Request #1 · malone6/Jianshu

LuoZijun · 2017-06-22T11:52:19Z

This version performs faster.

malone6 · 2017-06-23T06:38:05Z

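Your code is excellent, I'm really impressed! I still need to study it carefully.
Windows has poor support for signal, so it won't run there. I tried it on Linux, and with that many threads running concurrently it is very fast. My own skills are limited, and I have always relied on frameworks for multithreading. Thanks for showing this approach; I'll read up on the relevant topics and optimize the jianshu crawler later.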

LuoZijun · 2017-06-23T07:38:09Z

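@malone6 This version did not cap the number of tasks, since my own machine has plenty of memory; I have now added a limit. I also wrote one for Zhihu: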

https://gist.github.com/LuoZijun/7c0a163d4fa8017f30b1780d0dd1fd25
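
As an illustration of capping the number of tasks, here is a minimal sketch using only the standard library. It is not the code from the pull request or the gist; `run_bounded`, `handler`, `workers`, and `max_pending` are hypothetical names chosen for this example. The bounded queue makes the producer block once `max_pending` tasks are waiting, which keeps memory use roughly constant.

```python
import queue
import threading


def run_bounded(tasks, handler, workers=4, max_pending=100):
    """Run handler(task) on worker threads, holding at most max_pending queued tasks."""
    pending = queue.Queue(maxsize=max_pending)  # put() blocks while the queue is full
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = pending.get()
            if task is None:  # sentinel: no more work for this thread
                return
            try:
                value = handler(task)
            except Exception:
                continue  # a real crawler would log or retry here
            with lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for task in tasks:
        pending.put(task)  # blocks when max_pending tasks are already waiting
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Results arrive in completion order, so callers that need the input order should carry an index in each task.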

LuoZijun · 2017-06-23T07:41:21Z

@malone6 Also, I see you have written quite a few crawlers. Since you already have the data, could you do some analysis on it? 😃

malone6 · 2017-06-23T08:09:46Z

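@LuoZijun I wrote my Zhihu crawler with scrapy too, but I didn't think it through and my underpowered laptop ran out of memory.
I wrote a few crawlers in my spare time, but gave up on analysis when the data didn't seem good enough (I only analyzed one dataset, from Guazi used cars). My own analysis skills are also lacking, so for now I keep learning collection, analysis, and visualization. I will also try to find time to clean up the code of my earlier small crawlers and analyze their data.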

LuoZijun · 2017-06-23T08:28:59Z

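@malone6 Zhihu has rate limits, so it can only be crawled slowly ==

Jianshu only limits total daily traffic (roughly 1 GB), so crawling is faster there.

A laptop is probably not a good fit for this, since it has to stay on the whole time.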
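
One way to stay under a daily traffic cap like the one described above is to tally response sizes and stop before the cap is reached. The sketch below is only an assumption about how that could look: `ByteBudget` is a hypothetical helper, and the roughly 1 GB figure is the observation from this thread, not a documented limit.

```python
class ByteBudget:
    """Track downloaded bytes against a self-imposed daily limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used = 0

    def record(self, payload):
        """Add one response body to the tally; return True while under the limit."""
        self.used += len(payload)
        return self.used < self.limit_bytes


# Leave some margin below the observed ~1 GB cap (assumed value).
budget = ByteBudget(900 * 1024 * 1024)
```

A crawl loop would call `budget.record(body)` after each response and pause until the next day once it returns `False`.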

malone6 · 2017-06-23T08:49:00Z

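Jianshu limits by traffic? How odd, haha. You've really looked into this thoroughly. Would you mind sharing a social account (Zhihu, Jianshu, or similar) so I can keep in touch and follow what you share?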

LuoZijun · 2017-06-23T08:52:47Z

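@malone6 Yes. I crawled Jianshu for about half an hour and got around 770 MB of JSON data, and then the whole IP was blocked. No problem though, I just continue the next day.

Zhihu's anti-crawler limits are fairly strict.

GitHub is my social account (I have never used Weibo, and I only read Zhihu and Jianshu occasionally and haven't written anything there ==).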

malone6 · 2017-06-23T09:18:03Z

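@LuoZijun Zhihu has probably been crawled by too many people, so its limits are strict.
For now, Jianshu looks like it can be crawled at high speed. A 1 GB cap is acceptable and still yields plenty of data. You can crawl gradually, or, if resources allow, set up a distributed crawler with IP proxies.

No worries. I've followed you on GitHub and hope to learn a lot from you.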

LuoZijun · 2017-06-23T11:00:18Z

@malone6 Let's learn together :)

LuoZijun · 2017-06-26T10:43:55Z

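@malone6 I checked today: I have crawled about 1.082952 GB of data, covering roughly 265318 users.
If you need it, I can send you an archive.

Most of the users being crawled now have only a few followers, so the data is not very meaningful.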

malone6 · 2017-06-26T11:12:24Z

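@LuoZijun Sure. You could upload the data to a cloud drive or similar and give me a download link; my email is malone5@163.com. For now I think using recommended authors as the entry point is not a great strategy, but I haven't come up with a better one. I'll see how much your data differs from mine.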

LuoZijun · 2017-06-27T08:34:01Z

@malone6

jianshu.tar.bz2

Also, I added automatic Zhihu login today. The login method you used is too restricted and is almost unusable.

The update is here:

https://gist.github.com/LuoZijun/7c0a163d4fa8017f30b1780d0dd1fd25/revisions

malone6 · 2017-06-28T04:49:11Z

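I imported the data into MongoDB and ran a few quick queries, and there is a fairly big problem. Of the 260k records, 170k are empty (they contain only a slug), and I don't know where it went wrong. For users with many followers, only a few hundred relations were captured. I don't think relations are needed for now, since storing them is too expensive.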

LuoZijun · 2017-06-28T06:13:32Z

@malone6

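These records have data only in the slug field. They come from the early crawl, when the data format was not checked at write time. I only ran it briefly back then, assumed there wasn't much data, and didn't pay attention. I didn't expect it to run so fast that this much data piled up ==
A bit later I'll write a script to re-sync these records (the current crawl code no longer has this problem).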
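
A check along the lines described above could look like the following sketch. It is not the actual repair script; `has_only_slug` and `split_records` are hypothetical helpers, and the only field name taken from this thread is `slug`.

```python
def has_only_slug(record):
    """Return True when slug is the only field that holds a non-empty value."""
    filled = {key for key, value in record.items() if value not in (None, "", [], {})}
    return filled == {"slug"}


def split_records(records):
    """Separate usable records from slug-only stubs that need to be re-fetched."""
    complete, stubs = [], []
    for record in records:
        (stubs if has_only_slug(record) else complete).append(record)
    return complete, stubs
```

Running the same check before each write would keep new stubs out of the dataset, and the `stubs` list gives the slugs to re-sync.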

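As for only capturing a few hundred for users with many followers, that is probably also from the early crawl. I'm considering adding a mechanism that updates only the following and follower data.
Right now the logic is simply to crawl as much as it can.

As for storing relations, that is still needed: the crawler walks onward from this relation list, which avoids starting from scratch when crawling again.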
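
To illustrate why stored relations help a restart, here is a small sketch that rebuilds the list of users still to be crawled from saved data. The data layout is an assumption made for this example: `relations` maps a slug to the slugs it is connected to, and `crawled` is the set of slugs already fetched.

```python
def pending_slugs(relations, crawled):
    """List slugs that appear in stored relations but have not been crawled yet."""
    seen = set(crawled)
    pending = []
    for neighbours in relations.values():
        for other in neighbours:
            if other not in seen:
                seen.add(other)  # avoid queueing the same slug twice
                pending.append(other)
    return pending
```

On restart the crawler can seed its queue with this list instead of walking again from the entry point.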

LuoZijun · 2017-06-28T10:21:22Z

@malone6

The script has been updated. The data is being repaired now.

Updated code: https://gist.github.com/LuoZijun/7c0a163d4fa8017f30b1780d0dd1fd25/revisions

LuoZijun · 2017-06-29T02:43:48Z

@malone6 I looked at the data this morning, and a lot of it has indeed been recovered :))

malone6 · 2017-06-29T02:49:24Z

Haha, OK. I've been busy these two days and haven't tried it. I'll look into it tomorrow 😊😊


LuoZijun · 2017-06-30T03:35:34Z

@malone6 I'll send it to you once it's ready.

LuoZijun · 2017-07-05T11:07:17Z

@malone6 New version: https://drive.google.com/open?id=0By6cKQoy12SxZC1HTnJpd1l3bWs

About 493449 user profiles (~= 82 MB).

malone6 · 2017-07-06T16:38:57Z

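Impressive! This sample is great! How long did it take you to crawl? Decompressing it and loading it into the database took me almost a day... and I ended up with over 10k fewer records than you. Could some have been lost during decompression?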

LuoZijun · 2017-07-07T02:41:10Z

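@malone6 Most likely that many more were crawled while the archive was being compressed. The count I gave was taken after compression finished.

There are actually over 500k now, but the later ones have basically never written anything.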

LuoZijun · 2017-07-07T03:25:43Z

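@malone6 It didn't take long. Jianshu has no anti-crawler measures, so it goes fairly fast.

If decompression is slow for you, it's probably time for a new laptop.

Finally, when you have time, could the statistics cover a few more dimensions? Word count, for example.

Or the number of users whose follower count exceeds some threshold but who are not signed authors (there seem to be fairly few signed authors).

If you have time, a detailed write-up would be great :))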
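
The follower-threshold question above could be answered with a filter like the one sketched here. The field names `followers`, `is_signed`, and `words` are assumptions for illustration only; the real keys in the crawled JSON may differ.

```python
def popular_unsigned(users, min_followers):
    """Users with more than min_followers followers who are not signed authors."""
    return [
        user for user in users
        if user.get("followers", 0) > min_followers and not user.get("is_signed", False)
    ]


def total_words(users):
    """Sum the word counts over all users, treating a missing count as zero."""
    return sum(user.get("words", 0) for user in users)
```

Calling `len(popular_unsigned(users, 1000))` gives the count for a threshold of 1000 followers.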

malone6 · 2017-07-07T03:31:26Z

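Will do, definitely. I'm in my hometown and busy at the moment. I'll take a rough look tonight and put together a detailed version later. Thanks for your data ☺ I'll let you know when the detailed version is out.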


LuoZijun · 2017-07-07T05:32:58Z

@malone6 You're welcome :)

LuoZijun · 2017-07-13T06:22:01Z

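@malone6 A more efficient and resource-friendly (memory) crawler model: https://gist.github.com/LuoZijun/a41fc0c94099cc61b410112088696621

It is based on the EventLoop model, which is only available from Python 3.5 onward.

When I have time I'll rewrite https://gist.github.com/LuoZijun/7c0a163d4fa8017f30b1780d0dd1fd25 on top of it,
which should suit environments with less memory.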
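
For readers unfamiliar with the event-loop approach, the following is a minimal sketch of the general idea, not the code in the linked gist. It uses `async`/`await` (Python 3.5+) and `asyncio.run` (Python 3.7+); `fetch` is a hypothetical stand-in that performs no network request, and `max_concurrent` is an assumed parameter name.

```python
import asyncio


async def fetch(slug):
    """Stand-in for a real HTTP request; yields to the loop and returns a fake record."""
    await asyncio.sleep(0)
    return {"slug": slug}


async def crawl(slugs, max_concurrent=10):
    """Fetch every slug on a single thread, with at most max_concurrent requests in flight."""
    limit = asyncio.Semaphore(max_concurrent)

    async def bounded(slug):
        async with limit:
            return await fetch(slug)

    # gather() returns results in the same order as the input slugs
    return await asyncio.gather(*(bounded(slug) for slug in slugs))


def crawl_sync(slugs, max_concurrent=10):
    """Convenience wrapper for callers that are not already inside an event loop."""
    return asyncio.run(crawl(slugs, max_concurrent))
```

Because all coroutines share one thread, memory use grows with the number of pending requests rather than with the number of threads, which is where the savings would come from on small machines.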

Create jianshu.py

9256444
