Coroutines

Scrapy supports the coroutine syntax (i.e. async def).

Supported callables

The following callables may be defined as coroutines using async def, and hence use coroutine syntax (e.g. await, async for, async with):

Using Deferred-based APIs

In addition to native coroutine APIs, Scrapy has some APIs that return a Deferred object or take a user-supplied function that returns a Deferred object. These APIs are also asynchronous but don’t yet support native async def syntax. In the future we plan to add async def support to these APIs, or to replace them with other APIs where the existing ones cannot be changed.

These APIs have a coroutine-based implementation and a Deferred-based one:

The following user-supplied methods can return Deferred objects (the methods that can also return coroutines are listed in Supported callables):

In most cases you can use these APIs in code that otherwise uses coroutines by wrapping a Deferred object in a Future object, or vice versa. See Integrating Deferred code and asyncio code for more information.

For example, a custom scheduler needs to define an open() method that can return a Deferred object. You can write a method that works with Deferreds and returns one directly, or you can write a coroutine and convert it into a function that returns a Deferred with deferred_f_from_coro_f().

General usage

There are several use cases for coroutines in Scrapy.

Code that would return Deferreds when written for previous Scrapy versions, such as downloader middlewares and signal handlers, can be rewritten to be shorter and cleaner:

from itemadapter import ItemAdapter


class DbPipeline:
    def _update_item(self, data, item):
        adapter = ItemAdapter(item)
        adapter["field"] = data
        return item

    def process_item(self, item):
        adapter = ItemAdapter(item)
        # db is a placeholder for a client whose get_some_data() returns a Deferred
        dfd = db.get_some_data(adapter["id"])
        dfd.addCallback(self._update_item, item)
        return dfd

becomes:

from itemadapter import ItemAdapter


class DbPipeline:
    async def process_item(self, item):
        adapter = ItemAdapter(item)
        adapter["field"] = await db.get_some_data(adapter["id"])
        return item

Coroutines may be used to call asynchronous code. This includes other coroutines, functions that return Deferreds and functions that return awaitable objects such as Future. This means you can use many useful Python libraries providing such code:

import aiohttp
import treq
from scrapy import Spider


class MySpiderDeferred(Spider):
    # ...
    async def parse(self, response):
        additional_response = await treq.get("https://additional.url")
        additional_data = await treq.content(additional_response)
        # ... use response and additional_data to yield items and requests


class MySpiderAsyncio(Spider):
    # ...
    async def parse(self, response):
        async with aiohttp.ClientSession() as session:
            async with session.get("https://additional.url") as additional_response:
                additional_data = await additional_response.text()
        # ... use response and additional_data to yield items and requests

Note

Many libraries that use coroutines, such as aio-libs, require the asyncio loop. To use them, you need to enable asyncio support in Scrapy.

Note

If you want to await on Deferreds while using the asyncio reactor, you need to wrap them.

Common use cases for asynchronous code include:

  • requesting data from websites, databases and other services (in start(), callbacks, pipelines and middlewares);

  • storing data in databases (in pipelines and middlewares);

  • delaying the spider initialization until some external event (in the spider_opened handler);

  • calling asynchronous Scrapy methods like ExecutionEngine.download() (see the screenshot pipeline example).

Inline requests

The spider below shows how to send a request and await its response all from within a spider callback:

from scrapy import Spider, Request


class SingleRequestSpider(Spider):
    name = "single"
    start_urls = ["https://example.org/product"]

    async def parse(self, response, **kwargs):
        additional_request = Request("https://example.org/price")
        additional_response = await self.crawler.engine.download_async(
            additional_request
        )
        yield {
            "h1": response.css("h1").get(),
            "price": additional_response.css("#price").get(),
        }

You can also send multiple requests in parallel:

import asyncio

from scrapy import Spider, Request


class MultipleRequestsSpider(Spider):
    name = "multiple"
    start_urls = ["https://example.com/product"]

    async def parse(self, response, **kwargs):
        additional_requests = [
            Request("https://example.com/price"),
            Request("https://example.com/color"),
        ]
        tasks = []
        for r in additional_requests:
            task = self.crawler.engine.download_async(r)
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        yield {
            "h1": response.css("h1::text").get(),
            "price": responses[0].css(".price::text").get(),
            "color": responses[1].css(".color::text").get(),
        }
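asyncio.gather() returns a list with one entry per awaitable, in the order the awaitables were passed, regardless of which one finishes first. The standard-library sketch below shows that shape without running a crawl. fake_download() is a hypothetical stand-in for a download call, and return_exceptions=True is one possible way to stop a single failed request from discarding the other results.

```python
import asyncio


async def fake_download(url, delay):
    # Hypothetical stand-in for an awaitable download call: it sleeps,
    # then either fails or echoes the URL back.
    await asyncio.sleep(delay)
    if "broken" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"response for {url}"


async def main():
    coros = [
        fake_download("https://example.com/price", 0.02),
        fake_download("https://example.com/broken", 0.01),
        fake_download("https://example.com/color", 0.0),
    ]
    # With return_exceptions=True, raised exceptions are placed in the
    # results list instead of propagating out of gather().
    return await asyncio.gather(*coros, return_exceptions=True)


results = asyncio.run(main())
```

Each entry in results is either the value returned for that URL or the exception it raised, so a callback can check each one with isinstance() before using it.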