
Python Crawler in Practice: Crawling Completed Novels from Qidian with Scrapy

1. Overview

This article uses Scrapy to crawl the completed novels on the Qidian novel site. The environment is Ubuntu. Installing Scrapy is not covered here; look it up yourself.

2. Creating the project

scrapy startproject name

In a terminal, change to the directory where the project should live and run the command above to create it. name is the project name.

3. Writing the item

The item I define here uses title to store the book name and desc to store the book's content.

import scrapy


class TutorialItem(scrapy.Item):
    # title holds the book name, desc holds the book's content
    title = scrapy.Field()
    desc = scrapy.Field()
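To make the roles of the two fields concrete, here is a minimal sketch of how they can be filled while a book is read chapter by chapter: title is set once, desc grows with every chapter. It uses a plain dict as a stand-in for TutorialItem (a scrapy.Item supports the same get and item[...] access), and add_chapter is a hypothetical helper name, not something provided by Scrapy or by this project.

```python
def add_chapter(item, book_name, chapter_title, paragraphs):
    """Record one chapter in item: set title once, append text to desc."""
    if item.get('title') is None:
        item['title'] = book_name
    # Chapter name on its own line, followed by the chapter body
    chapter_text = chapter_title + "\n" + "".join(paragraphs)
    item['desc'] = (item.get('desc') or '') + chapter_text
    return item


# A plain dict stands in for TutorialItem here (assumption for illustration)
item = {}
add_chapter(item, "Book A", "Chapter 1", ["first ", "second"])
add_chapter(item, "Book A", "Chapter 2", ["third"])
```

After both calls the item still carries a single title, while desc contains the text of both chapters in reading order.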

4. Writing the pipeline

The pipeline decides how the scraped data is stored. Here I write each book to its own txt file.

import codecs


# Store each book as a txt file
class TutorialPipeline(object):

    def process_item(self, item, spider):
        # Name the file after the book; item.get('title') returns the book name.
        # The with block closes the file as soon as the book has been written,
        # so no separate close hook is needed.
        with codecs.open(item.get('title') + '.txt', 'w', encoding='utf-8') as f:
            f.write(item.get('desc') + "\n")
        return item
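The pipeline takes it on trust that the book name is usable as a file name. A minimal sketch of guarding against that, assuming a hypothetical helper safe_filename (not part of Scrapy) and a fallback name chosen only for illustration:

```python
import re


def safe_filename(title, fallback="untitled"):
    """Turn a book title into a .txt file name safe on common filesystems."""
    if not title:
        # The title field was never filled in
        return fallback + ".txt"
    # Replace path separators and characters that Windows rejects
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or fallback) + ".txt"
```

In process_item this would take the place of item.get('title') + '.txt' when opening the file.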

5. Writing the settings

Just replace tutorial in the code below with the name of your own project.

BOT_NAME = 'tutorial'

# Send a browser-like User-Agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Enable the pipeline that writes each book to a txt file
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}
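Because the spider requests every chapter of every book, it may be worth slowing it down. The fragment below is a sketch of optional additions to the same settings file; the setting names are standard Scrapy settings, but the values are assumptions picked for illustration and are not taken from the original project.

```python
# Optional throttling settings (values are illustrative assumptions)

# Wait about one second between requests to the same site
DOWNLOAD_DELAY = 1.0

# Keep the number of parallel requests per domain small
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adapt the delay to how quickly the server responds
AUTOTHROTTLE_ENABLED = True
```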

6. Writing the spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from tutorial.items import TutorialItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["qidian.com"]
    # Download the completed novels in Qidian's sports category. The list
    # pages differ only in the page parameter, so build one URL per page.
    # The page count is the total number of books divided by the books per
    # page (292 and 20 when this was written); + 2 keeps the last, partial page.
    start_urls = [
        "http://fin.qidian.com/?size=-1&sign=-1&tag=-1&chanId=8&subCateId=-1&orderId=&update=-1&page=" + str(page) + "&month=-1&style=1&vip=-1"
        for page in range(1, 292 // 20 + 2)
    ]

    def parse(self, response):
        # Collect the URL of every book on the list page
        book = response.xpath('//div[@class="book-mid-info"]/h4/a//@href').extract()

        for bookurl in book:
            # Follow each book URL to that book's own page
            yield Request("http:" + bookurl, self.parseBook, dont_filter=True)

    def parseBook(self, response):
        # Get the URL behind the "free reading" button
        charterurl = response.xpath('//div[@class="book-info "]//a[@class="red-btn J-getJumpUrl "]/@href').extract()
        # Create one item per book
        item = TutorialItem()
        for url in charterurl:
            # Enter the book's first chapter through the free-reading URL
            yield Request("http:" + url, meta={'item': item}, callback=self.parseCharter, dont_filter=True)

    def parseCharter(self, response):
        # Get the book name
        names = response.xpath('//div[@class="info fl"]/a[1]/text()').extract()
        # Get the item passed along from the previous request
        item = response.meta['item']
        for name in names:
            # Store the book name in the item's title field, once
            if item.get('title') is None:
                item['title'] = name
        # Get the chapter name
        biaoti = response.xpath('//h3[@class="j_chapterName"]/text()').extract()
        content = ''
        for biaot in biaoti:
            content = content + biaot + "\n"
        # Get the chapter body
        s = response.xpath('//div[@class="read-content j_readContent"]//p/text()').extract()
        for srt in s:
            # Append the body after the chapter name
            content = content + srt
        # Append this chapter to the text accumulated in the item's desc field
        desc = item.get('desc')
        if desc is None:
            item['desc'] = content
        else:
            item['desc'] = desc + content
        # Emit the item once a page comes back with no text
        if content == '':
            yield item

        # Follow the link to the next chapter
        chapters = response.xpath('//div[@class="chapter-control dib-wrap"]/a[@id="j_chapterNext"]//@href').extract()
        for chapter in chapters:
            yield Request("http:" + chapter, meta={'item': item}, callback=self.parseCharter, dont_filter=True)
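The URL handling in the spider can be tried out on its own with the standard library. The sketch below shows one way to derive the page count so that a final partial page is not dropped, and one way to resolve the protocol-relative links (starting with //) that the spider prefixes with "http:". The names build_start_urls and absolute_url are hypothetical; 292 books and 20 per page are the figures from the spider and would need updating as the site changes.

```python
import math
from urllib.parse import urljoin

# Same list-page URL as in the spider, with the page number as a placeholder
LIST_URL = ("http://fin.qidian.com/?size=-1&sign=-1&tag=-1&chanId=8"
            "&subCateId=-1&orderId=&update=-1&page={page}"
            "&month=-1&style=1&vip=-1")


def build_start_urls(total_books, per_page):
    """Return one list-page URL per page, including a final partial page."""
    pages = math.ceil(total_books / per_page)
    return [LIST_URL.format(page=page) for page in range(1, pages + 1)]


def absolute_url(page_url, href):
    """Resolve an href such as //host/path against the page it came from."""
    return urljoin(page_url, href)


urls = build_start_urls(292, 20)
```

Resolving links against the page they came from also keeps working if the site starts serving relative paths instead of protocol-relative ones.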

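7. Summary

The code above can walk through the content of every book, but Qidian has a VIP restriction: you have to be logged in with a Qidian VIP account to read a completed novel in full. That is a bit of a pity, since I don't have a Qidian membership.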
