How to fix icrawler when it fails with JSONDecodeError

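Addendum

As of 2021/02/01, icrawler raises a TypeError and no images are downloaded. I have written a separate article on working around that problem, so please have a look at it.

The current icrawler fails in its JSON parser because Google changed the format of its results page.

It can be fixed with the steps below.

Error message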

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)


Cause

The library has not been updated to follow the change on Google's side.
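
To see where this message comes from, the sketch below runs the regular expression from the original parser against two made-up script bodies. The sample strings and the helper name `extract_payload` are assumptions for illustration, not real Google responses, and the actual change on Google's side may differ in detail. When the pattern does not match, `re.sub` returns the text unchanged, so `json.loads` receives a string that still begins with `AF_initDataCallback` and fails at the very first character.

```python
import json
import re

# Pattern copied from the original GoogleParser.parse
PATTERN = r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:function\(\){return (.+)}}\);?$"

def extract_payload(txt):
    """Strip the JavaScript wrapper and parse what remains as JSON."""
    stripped = re.sub(PATTERN, r"\2", txt, flags=re.DOTALL)
    return json.loads(stripped)

# Hypothetical sample in the shape the pattern expects
old_style = "AF_initDataCallback({key: 'ds:1', data:function(){return [1, 2, 3]}});"
# Hypothetical sample in a shape the pattern does not expect
new_style = "AF_initDataCallback({key: 'ds:1', data:[1, 2, 3]});"

print(extract_payload(old_style))
try:
    extract_payload(new_style)
except json.JSONDecodeError as err:
    # The wrapper was never stripped, so parsing fails at character 0
    print(err)
```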

Fix

Edit google.py.

The file is located at one of the paths below. If you cannot find it there, search your system for google.py.
On macOS:

/Users/USERNAME/Library/Python/3.7/lib/python/site-packages/icrawler/builtin/google.py

or somewhere like:
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/icrawler/builtin/google.py

On Windows:

C:\Users\USERNAME\AppData\Local\Programs\Python\Python36\Lib\site-packages\icrawler\builtin\google.py
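
If none of these paths matches your setup, Python itself can report where the module lives. This is a small sketch using the standard library's `importlib.util.find_spec`. The helper name `find_module_file` is made up for illustration, and the sketch assumes you run it with the same interpreter that icrawler is installed in.

```python
import importlib.util

def find_module_file(module_name):
    """Return the file path of an importable module, or None if it is not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package (e.g. icrawler) cannot be imported
        return None
    return spec.origin if spec is not None else None

# Prints the full path to google.py, or None if icrawler is not installed here
print(find_module_file("icrawler.builtin.google"))
```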

Search the source for the line class GoogleParser(Parser): to find the class definition.

Replace that entire class.

Indentation is significant in Python, so be careful not to break it when pasting.

Before

class GoogleParser(Parser):
    def parse(self, response):
        soup = BeautifulSoup(
            response.content.decode('utf-8', 'ignore'), 'lxml')
        image_divs = soup.find_all('script')
        for div in image_divs:
            txt = div.string
            if txt is None or not txt.startswith('AF_initDataCallback'):
                continue
            if 'ds:1' not in txt:
                continue
            txt = re.sub(r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:function\(\){return (.+)}}\);?$",
                         "\\2", txt, 0, re.DOTALL)

            meta = json.loads(txt)
            data = meta[31][0][12][2]

            uris = [img[1][3][0] for img in data if img[0] == 1]
            return [{'file_url': uri} for uri in uris]

After

class GoogleParser(Parser):
    def parse(self, response):
        soup = BeautifulSoup(
            response.content.decode('utf-8', 'ignore'), 'lxml')
        #image_divs = soup.find_all('script')
        image_divs = soup.find_all(name='script')
        for div in image_divs:
            #txt = div.text
            txt = str(div)
            #if not txt.startswith('AF_initDataCallback'):
            if 'AF_initDataCallback' not in txt:
                continue
            if 'ds:0' in txt or 'ds:1' not in txt:
                continue
            #txt = re.sub(r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:function\(\){return (.+)}}\);?$",
            #             "\\2", txt, 0, re.DOTALL)
            #meta = json.loads(txt)
            #data = meta[31][0][12][2]
            #uris = [img[1][3][0] for img in data if img[0] == 1]
            
            uris = re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)
            return [{'file_url': uri} for uri in uris]
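
The replacement no longer parses JSON at all. It simply pulls anything that looks like an image URL out of the script text. The sketch below applies the same regular expression to a made-up string (the sample data and the example.com URLs are assumptions) to show what the parser would hand to the downloader.

```python
import re

# Hypothetical script text; real pages are far longer
txt = ("<script>AF_initDataCallback({key: 'ds:1', data:"
       "[['https://example.com/a.jpg', 100], ['https://example.com/b.png', 50]]});</script>")

# Same pattern as in the modified parse():
# lazily match from "http" up to the first .jpg, .png or .bmp
uris = re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)
tasks = [{'file_url': uri} for uri in uris]
print(tasks)
```

Note that the match is lazy but not anchored to a URL boundary, so a non-image URL that appears on the same line before an image URL can be swallowed into one long match, and extensions such as .jpeg or .gif are not picked up.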

Reference

Google Crawler is down #65

https://github.com/hellock/icrawler/issues/65
