Introduction
In this article, we are going to see how to scrape information from a website, in particular, from all pages with a common URL pattern. We will see how to do that with Scrapy, a very powerful, and yet simple, scraping and web-crawling framework.
For example, you might be interested in scraping information about each article of a blog and storing that information in a database. To achieve this, we will implement a simple spider using Scrapy, which will crawl the blog and store the extracted data in a MongoDB database.
We will assume that you have a working MongoDB server, and that you have installed the pymongo and scrapy Python packages, both installable with pip.
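Both dependencies can be installed in one go (assuming pip is available on your PATH):

```shell
pip install pymongo scrapy
```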
If you have never toyed around with Scrapy, you should first read this short tutorial.
First step, identify the URL pattern(s)
In this example, we’ll see how to extract the following information from each isbullsh.it blogpost:
- title
- author
- tag
- release date
- url
We’re lucky: all posts have the same URL pattern, http://isbullsh.it/YYYY/MM/title. These links can be found on the different pages of the site homepage.
What we need is a spider which will follow all links following this pattern, scrape the required information from the target webpage, validate the data integrity, and populate a MongoDB collection.
Building the spider
We create a Scrapy project, following the instructions from their tutorial. We obtain the following project structure:
isbullshit_scraping/
├── isbullshit
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── isbullshit_spiders.py
└── scrapy.cfg
We begin by defining, in items.py, the item structure which will contain the extracted information:
from scrapy.item import Item, Field

class IsBullshitItem(Item):
    title = Field()
    author = Field()
    tag = Field()
    date = Field()
    link = Field()
Now, let’s implement our spider, in isbullshit_spiders.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem

class IsBullshitSpider(CrawlSpider):
    name = 'isbullshit'
    start_urls = ['http://isbullsh.it']  # urls from which the spider will start crawling
    rules = [
        # r'page/\d+': regular expression for http://isbullsh.it/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+': regular expression for http://isbullsh.it/YYYY/MM/title URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_blogpost')]

    def parse_blogpost(self, response):
        ...
Our spider inherits from CrawlSpider, which “provides a convenient mechanism for following links by defining a set of rules”. More info here.
We then define two simple rules:
- Follow links pointing to http://isbullsh.it/page/X
- Extract information from pages defined by a URL of pattern http://isbullsh.it/YYYY/MM/title, using the callback method parse_blogpost.
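Before wiring these regexes into rules, we can sanity-check them with Python’s re module (the URLs below are made-up examples of the two patterns):

```python
import re

# The two URL patterns used by the spider's rules
page_re = re.compile(r'page/\d+')         # homepage pagination links
post_re = re.compile(r'\d{4}/\d{2}/\w+')  # blogpost permalinks

# Made-up example URLs following the patterns described above
assert page_re.search('http://isbullsh.it/page/2')
assert post_re.search('http://isbullsh.it/2012/04/web-scraping-with-scrapy')
assert not post_re.search('http://isbullsh.it/about')
```

Any link matching the first pattern is only followed, while links matching the second trigger the parse_blogpost callback.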
Extracting the data
To extract the title, author, etc., from the HTML code, we’ll use the scrapy.selector.HtmlXPathSelector object, which uses the libxml2 HTML parser. If you’re not familiar with this object, you should read the XPathSelector documentation.
We’ll now define the extraction logic in the parse_blogpost method (I’ll only define it for the title and tag(s); it’s pretty much always the same logic):
def parse_blogpost(self, response):
    hxs = HtmlXPathSelector(response)
    item = IsBullshitItem()
    # Extract title
    item['title'] = hxs.select('//header/h1/text()').extract()  # XPath selector for title
    # Extract tag(s)
    item['tag'] = hxs.select("//header/div[@class='post-data']/p/a/text()").extract()  # XPath selector for tag(s)
    ...
    return item
Note: to be sure of the XPath selectors you define, I’d advise you to use Firebug, the Firefox Inspector, or equivalent, to inspect the HTML code of a page, and then test the selectors in a Scrapy shell. That only works if the data position is consistent throughout all the pages you crawl.
Store the results in MongoDB
Each time the parse_blogpost method returns an item, we want it to be sent to a pipeline which will validate the data, and store everything in our Mongo collection.
First, we need to add a couple of things to settings.py
:
ITEM_PIPELINES = ['isbullshit.pipelines.MongoDBPipeline',]
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "isbullshit"
MONGODB_COLLECTION = "blogposts"
Now that we’ve declared our pipeline and our MongoDB database and collection, we’re left with the pipeline implementation. We just want to make sure that we do not have any missing data (e.g. a blogpost without a title, author, etc).
Here is our pipelines.py file:
import pymongo
from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log

class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # here we only check that no field is empty,
        # but we could do any crazy validation we want
        for field in item:
            if not item[field]:
                raise DropItem("Missing %s of blogpost from %s" % (field, item['link']))
        self.collection.insert(dict(item))
        log.msg("Item written to MongoDB database %s/%s" %
                (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                level=log.DEBUG, spider=spider)
        return item
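To see the validation rule in isolation, here is a standalone sketch: a plain dict stands in for the Scrapy item, ValueError stands in for DropItem, and the field values are illustrative.

```python
# Standalone sketch of the pipeline's validation rule; a plain dict stands
# in for the Scrapy item, and ValueError stands in for DropItem
def validate(item):
    for field, value in item.items():
        if not value:
            raise ValueError("Missing %s of blogpost from %s" % (field, item.get('link')))
    return item

# An illustrative, fully-populated item passes through unchanged;
# an item with an empty field would raise instead
post = {'title': 'Some post', 'author': 'Someone', 'tag': ['scrapy'],
        'date': '2012-04-12', 'link': 'http://isbullsh.it/2012/04/some-post'}
validate(post)
```

Because the exception aborts process_item, a dropped item never reaches the insert call, which is why no extra “valid” flag is needed in the pipeline above.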
Release the spider!
Now, all we have to do is change directory to the root of our project and execute
$ scrapy crawl isbullshit
The spider will then follow all links pointing to a blogpost, retrieve the post title, author name, date, etc., validate the extracted data, and store it all in a MongoDB collection if validation goes well.
Pretty neat, hm?
Conclusion
This case is pretty simplistic: all URLs have a similar pattern, and all links are hard-coded in the HTML: there is no JS involved. In the case where the links you want to reach are generated by JS, you’d probably want to check out Selenium. You could make the spider more complex by adding new rules or more complicated regular expressions, but I just wanted to demo how Scrapy works, not get into crazy regex explanations.
Also, be aware that sometimes there’s a thin line between playing with web-scraping and getting into trouble.
Finally, when toying with web-crawling, keep in mind that you might flood the server with requests, which can sometimes get you IP-blocked :)
Please, don’t be a d*ick.