requests in Practice: Scraping the Maoyan Movie Rankings

Published: 2019-05-06 00:25:44 | Editor: Run | Views: 4830


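    Target URL: https://maoyan.com/board/4?offset=0

    The goal is to extract the title, starring cast, release date, score, and poster image URL of every film in the Maoyan TOP100, and to save the results to a text file.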


    Environment: install the requests library, plus lxml for XPath parsing.

    pip3 install requests

    pip3 install lxml


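    Scraping analysis:

    offset is the pagination offset. There are 10 pages with 10 films each. Adding 10 to offset gives the URL of the next page, so offset=0 is the first page and offset=90 is the last.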
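
    The pagination rule above can be sketched as a small helper. The names BASE_URL and page_urls are illustrative and not part of the original script; the page count and page size are the values stated above.

```python
# Sketch of the pagination rule: 10 pages, offset grows by 10 per page.
# BASE_URL and page_urls are hypothetical names used only for illustration.
BASE_URL = "https://maoyan.com/board/4?offset={}"


def page_urls(pages=10, page_size=10):
    """Return the URL of every ranking page, in order."""
    return [BASE_URL.format(page * page_size) for page in range(pages)]


urls = page_urls()
print(urls[0])    # first page, offset=0
print(urls[-1])   # last page, offset=90
```

    Building the list up front makes it easy to check the first and last offsets before sending any request.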



    Extracting content with XPath:

    All film titles on a page:

    //p[@class='name']/a/text()
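
    As a rough illustration, the expression can be run against a small hand-written fragment. The markup below is a hypothetical simplification modeled on the XPath expressions in this article, not a copy of the real page. It also shows the per-entry variant with a leading ".", which is the form the complete code uses inside each <dd>.

```python
from lxml import etree

# Hypothetical, simplified markup modeled on the XPath expressions in this article.
SAMPLE_HTML = """
<dl class="board-wrapper">
  <dd><p class="name"><a href="/films/1">Movie A</a></p></dd>
  <dd><p class="name"><a href="/films/2">Movie B</a></p></dd>
</dl>
"""

tree = etree.HTML(SAMPLE_HTML)

# Page-level query: one list holding every title on the page.
all_names = tree.xpath("//p[@class='name']/a/text()")

# Per-entry query: the leading "." keeps the search inside the current <dd>.
per_entry = [
    dd.xpath("string(.//p[@class='name']/a)")
    for dd in tree.xpath("//dl[@class='board-wrapper']/dd")
]

print(all_names)
print(per_entry)
```

    Querying per entry keeps each film's fields together, whereas separate page-level lists have to be lined up by index afterwards.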



    All starring-cast strings on a page:

    //p[@class='star']/text()
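
    Raw text() nodes usually carry the page's own indentation and newlines, which is why the complete code wraps its expressions in normalize-space(). A minimal sketch of the difference, using a hypothetical fragment with placeholder text:

```python
from lxml import etree

# Hypothetical markup: the text is padded with newlines and spaces,
# as is common in real pages. The cast string is placeholder data.
SAMPLE_HTML = """
<div>
  <p class="star">
        Starring: Actor One, Actor Two
  </p>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# text() returns a list of raw text nodes, whitespace included.
raw = tree.xpath("//p[@class='star']/text()")

# normalize-space() returns one string with the surrounding whitespace removed
# and internal runs of whitespace collapsed to single spaces.
clean = tree.xpath("normalize-space(//p[@class='star']/text())")

print(repr(raw[0]))
print(repr(clean))
```

    Note that normalize-space() applied to a node set only uses the first node, so it suits fields that hold a single text node.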



    All release dates on a page:

    //p[@class='releasetime']/text()



    All scores on a page:

    //p[@class='score']/i/text()
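
    The complete code assembles each score from the text of the <i> elements under p.score, which suggests the value is split across more than one element. The sketch below assumes that layout. The class names on the <i> elements and the helper join_score are hypothetical.

```python
from lxml import etree

# Hypothetical markup: the score is assumed to be split across two <i> elements.
# The second <p> simulates an entry with no score at all.
SAMPLE_HTML = """
<div>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
  <p class="score"></p>
</div>
"""


def join_score(p):
    """Concatenate the text of every <i> child; return None if there are none."""
    parts = p.xpath("./i/text()")
    return "".join(parts) if parts else None


tree = etree.HTML(SAMPLE_HTML)
scores = [join_score(p) for p in tree.xpath("//p[@class='score']")]
print(scores)
```

    Joining whatever parts exist avoids an unpacking error when an entry has fewer or more than two parts.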



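    All poster image URLs on a page:

    //img[@class='board-img']/@src

    Note: the complete code below reads @data-src rather than @src. The page appears to lazy-load its images, so in the HTML that requests receives the URL may be carried in data-src, while src is what the browser's inspector shows after the images have loaded.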
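
    One way to cope with either attribute is to prefer data-src and fall back to src. This is a sketch under that assumption. The markup, the example.com URLs, and the helper image_url are all hypothetical.

```python
from lxml import etree

# Hypothetical markup: one image uses data-src (lazy loading), the other plain src.
SAMPLE_HTML = """
<div>
  <a><img class="board-img" data-src="https://example.com/a.jpg"></a>
  <a><img class="board-img" src="https://example.com/b.jpg"></a>
</div>
"""


def image_url(img):
    """Prefer data-src, fall back to src; return None if neither is set."""
    return img.get("data-src") or img.get("src")


tree = etree.HTML(SAMPLE_HTML)
urls = [image_url(img) for img in tree.xpath("//img[@class='board-img']")]
print(urls)
```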




    Complete code:

    #!/usr/bin/env python
    # coding: utf-8
    
    import requests
    from lxml import etree
    import time
    import json
    
    
    class Item:
        movie_name = None   # film title
        to_star = None  # starring cast
        release_time = None   # release date
        score = None   # score
        picture_address = None   # poster image URL
    
    
    class GetMaoYan:
        def get_html(self, url):
            try:
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
                }
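                # A timeout keeps a stalled connection from hanging the script
                response = requests.get(url=url, headers=headers, timeout=10)
                if response.status_code == 200:
                    return response.text
                return None
            except requests.RequestException:
                # Network error or timeout: signal failure to the caller
                return None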
    
        def get_content(self, html):
            items = []
            # normalize-space() trims surrounding whitespace and collapses newlines
            content = etree.HTML(html)
            all_list = content.xpath("//dl[@class='board-wrapper']/dd")
            for i in all_list:
                item = Item()
                item.movie_name = i.xpath("normalize-space(.//p[@class='name']/a/text())")
                item.to_star = i.xpath("normalize-space(.//p[@class='star']/text())")
                item.release_time = i.xpath("normalize-space(.//p[@class='releasetime']/text())")
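                # Join the score parts; unlike unpacking into x, y this does not
                # raise if an entry has fewer or more than two parts
                item.score = ''.join(i.xpath(".//p[@class='score']/i/text()"))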
                item.picture_address = i.xpath("normalize-space(./a/img[@class='board-img']/@data-src)")
                items.append(item)
            return items
    
        def write_to_txt(self, items):
            content_dict = {
                'movie_name': None,
                'to_star': None,
                'release_time': None,
                'score': None,
                'picture_address': None
            }
            with open('result.txt', 'a', encoding='utf-8') as f:
                for item in items:
                    content_dict['movie_name'] = item.movie_name
                    content_dict['to_star'] = item.to_star
                    content_dict['release_time'] = item.release_time
                    content_dict['score'] = item.score
                    content_dict['picture_address'] = item.picture_address
                    print(content_dict)
                    f.write(json.dumps(content_dict, ensure_ascii=False) + '\n')
    
        def main(self, offset):
            url = 'https://maoyan.com/board/4?offset=' + str(offset)
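            html = self.get_html(url)
            if html is None:
                # get_html returns None on failure; etree.HTML(None) would raise
                print('Failed to fetch ' + url)
                return
            items = self.get_content(html)
            self.write_to_txt(items)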
    
    
    if __name__ == '__main__':
        st = GetMaoYan()
        for i in range(10):
            st.main(offset=i*10)
            time.sleep(1)
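
    The script writes one JSON object per line to result.txt, so the file can be read back line by line. The sketch below shows that round trip on placeholder records in a temporary directory. The helpers write_records and read_records are illustrative and not part of the original script.

```python
import json
import os
import tempfile


def write_records(path, records):
    """Append one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Parse every non-empty line back into a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder records, not real scraped data.
sample = [
    {"movie_name": "Movie A", "score": "9.5"},
    {"movie_name": "Movie B", "score": "9.0"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.txt")
    write_records(path, sample)
    loaded = read_records(path)

print(loaded == sample)
```

    Because the file is opened in append mode, running the scraper twice leaves duplicate lines in result.txt; deleting the file before a fresh run avoids that.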

    Console output: (screenshot not preserved)


    Contents of result.txt: (screenshot not preserved)
