赵宏田,Python社区专栏作者
博客:https://zhuanlan.zhihu.com/pythoncoder
今天分享一下前段时间抓取链家上北京二手房数据的项目。本次分享分为两部分,第一部分介绍如何使用scrapy抓取二手房数据,第二部分我将抓下来的数据进行了一些简单的分析和可视化。最后我会贴上数据,感兴趣的朋友可以深入分析
Github地址:https://github.com/HunterChao/Crawler
1、使用scrapy抓取二手房数据
文章目录结构
D:. │ run.py │ scrapy.cfg │ └─LianJia │ items.py │ pipelines.py │ settings.py │ __init__.py │ ├─spiders │ lianjia.py │ __init__.py
lianjia.py是程序的主要运行文件,run.py为程序启动文件。在pycharm下执行run.py即可启动程序。
项目分析:
链接的构造:我们通过抓取首页可以获得北京市各城区的名称(如:东城、西城、朝阳)及对应的拼音,进一步通过遍历每个城区对应的页码数(Pn)即可构造出各城区的二手房链接。
信息的抓取:在进入各个城区的二手房页面时,可匹配出每个房源的详细信息。这里需要注意的是,由于我想将各房源的经纬度信息获取以便可视化到地图上,需要找到每个房源的详情页链接,进入该链接,匹配出经纬度相关的字段。(resblockPosition)
数据字段:item.py
# -*- coding: utf-8 -*-
import scrapy
class LianjiaItem(scrapy.Item): # 标签 小区 户型 面积 关注人数 观看人数 发布时间 价格 均价 详情链接 经纬度 城区 title = scrapy.Field() community = scrapy.Field() model = scrapy.Field() area = scrapy.Field() focus_num = scrapy.Field() watch_num = scrapy.Field() time = scrapy.Field() price = scrapy.Field() average_price = scrapy.Field() link = scrapy.Field() Latitude = scrapy.Field() city = scrapy.Field()
主要运行函数:lianjia.py
# -*- coding: utf-8 -*- import scrapy import requests import re import time from lxml import etree from ..items import LianjiaItem from scrapy_redis.spiders import RedisSpider
class LianjiaSpider(RedisSpider): name = 'lianjiaspider' redis_key = 'lianjiaspider:urls' start_urls = 'http://bj.lianjia.com/ershoufang/'
def start_requests(self): user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 \ Safari/537.36 SE 2.X MetaSr 1.0' headers = {'User-Agent': user_agent} yield scrapy.Request(url=self.start_urls, headers=headers, method='GET', callback=self.parse)
def parse(self, response): user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 \ Safari/537.36 SE 2.X MetaSr 1.0' headers = {'User-Agent': user_agent} lists = response.body.decode('utf-8') selector = etree.HTML(lists) area_list = selector.xpath('/html/body/div[3]/div[2]/dl[2]/dd/div[1]/div/a') for area in area_list: try: area_han = area.xpath('text()').pop() # 地点
area_pin = area.xpath('@href').pop().split('/')[2] # 拼音 area_url = 'http://bj.lianjia.com/ershoufang/{}/'.format(area_pin) print(area_url) yield scrapy.Request(url=area_url, headers=headers, callback=self.detail_url, meta={"id1":area_han,"id2":area_pin} ) except Exception: pass
def get_latitude(self,url): # 进入每个房源链接抓经纬度 p = requests.get(url) contents = etree.HTML(p.content.decode('utf-8')) latitude = contents.xpath('/ html / body / script[19]/text()').pop() time.sleep(3) regex = '''resblockPosition(. )''' items = re.search(regex, latitude) content = items.group()[:-1] # 经纬度 longitude_latitude = content.split(':')[1] return longitude_latitude[1:-1]
def detail_url(self,response): 'http://bj.lianjia.com/ershoufang/dongcheng/pg2/' for i in range(1,101): url = 'http://bj.lianjia.com/ershoufang/{}/pg{}/'.format(response.meta["id2"],str(1)) time.sleep(2) try: contents = requests.get(url) contents = etree.HTML(contents.content.decode('utf-8')) houselist = contents.xpath('/html/body/div[4]/div[1]/ul/li') for house in houselist: try: item = LianjiaItem() item['title'] = house.xpath('div[1]/div[1]/a/text()').pop() item['community'] = house.xpath('div[1]/div[2]/div/a/text()').pop() item['model'] = house.xpath('div[1]/div[2]/div/text()').pop().split('|')[1] item['area'] = house.xpath('div[1]/div[2]/div/text()').pop().split('|')[2] item['focus_num'] = house.xpath('div[1]/div[4]/text()').pop().split('/')[0] item['watch_num'] = house.xpath('div[1]/div[4]/text()').pop().split('/')[1] item['time'] = house.xpath('div[1]/div[4]/text()').pop().split('/')[2] item['price'] = house.xpath('div[1]/div[6]/div[1]/span/text()').pop() item['average_price'] = house.xpath('div[1]/div[6]/div[2]/span/text()').pop() item['link'] = house.xpath('div[1]/div[1]/a/@href').pop() item['city'] = response.meta["id1"] self.url_detail = house.xpath('div[1]/div[1]/a/@href').pop() item['Latitude'] = self.get_latitude(self.url_detail) except Exception: pass yield item except Exception: pass
抓取效果:
2、北京二手房数据的简单分析
北京二手房数据:https://pan.baidu.com/share/init?shareid=237386059 |