Lily Operation Guide

When the lily daemon is started, it acts as a long-running process which coordinates the requested Jobs and directs extracted data to specific storage targets. To interact with the daemon, the CLI accepts commands which are communicated to the running process. This guide covers the management of jobs and tasks which extract data from the Filecoin chain that the Lily node is following.


API endpoint

The daemon coordinates access to its API using the files within the repo path. The api file records where the JSON RPC API is located, and the token file holds the authentication token used for writeable interactions.

.lily/           <-- repo path root
├── api          <-- for information about the JSON RPC API location
├── token        <-- for write interactions
└── config.toml

For more details, see the Lotus API documentation.

The IP/port to which the daemon binds can be managed via the --api flag. By default, the daemon binds to localhost and will be unavailable externally unless bound to a publicly-accessible IP.
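
For example, you can inspect the repo files to find the API address and token, and (as a sketch) bind the API to a non-default address on startup. The IP:port value format shown for --api is an assumption based on the description above; confirm it with lily help daemon. The default repo path ~/.lily is also assumed:

# inspect where the JSON RPC API is exposed and the token used for write access
$ cat ~/.lily/api
$ cat ~/.lily/token

# sketch: bind the API to all interfaces on port 1234 (value format assumed)
$ lily daemon --api "0.0.0.0:1234"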


Jobs

Once the Lily daemon process is running and fully synced, you may manage its Jobs through the CLI. (Note: behavior is undefined if Lily is not fully synced, so you may gate Job creation on sync completion with lily sync wait && lily ....)
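
For example, to gate a watch job on sync completion (a minimal sketch of the pattern above):

# block until chain sync completes, then start a watch job
$ lily sync wait && lily job run watch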

Jobs are executed on a lily node via the job run command, which accepts the following flags (a combined example follows the list):

  • --window duration after which job execution will be canceled while processing a tipset. (A duration string is a possibly signed sequence of decimal numbers, each with an optional fraction and a unit suffix, such as “300ms”, “-1.5h” or “2h45m”. Valid time units are “ns”, “us” (or “µs”), “ms”, “s”, “m”, “h”.)
  • --tasks specify the list of tasks to run. Some tasks are heavier than others. A tipset is only considered processed once all the specified tasks have finished; only then will Lily move on to the next epoch. The default value of this flag is all tasks lily is capable of performing.
  • --storage specifies which of the configured storage outputs the job will write to.
  • --name specifies the name of the job for easy identification later. The provided value will appear as reporter in the visor_processing_reports table.
  • --restart-delay duration to wait before restarting job after it ends execution. (A duration string is a possibly signed sequence of decimal numbers, each with an optional fraction and a unit suffix, such as “300ms”, “-1.5h” or “2h45m”. Valid time units are “ns”, “us” (or “µs”), “ms”, “s”, “m”, “h”.)
  • --restart-on-failure specifies if a job should be restarted if it fails.
  • --restart-on-completion specifies if a job should be restarted once it completes.
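
The sketch below combines several of these flags. The task list, storage name (Database1), and other values are illustrative and assume a matching storage configuration exists:

# sketch: a named watch writing block headers and messages to Database1,
# cancelling tipset processing after 30s and restarting the job on failure
$ lily job run --tasks=block_header,message --storage=Database1 --window=30s --name=mainnet-watch --restart-on-failure watch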

Currently, Lily is capable of executing the following types of jobs:

Watch

The Watch command subscribes to incoming tipsets from the filecoin blockchain and indexes them as they arrive.

Since tipsets may arrive at a rate greater than lily’s rate of indexing, the watch job maintains a buffer of tipsets to index. Consumption of this queue can be configured via the --workers flag. Increasing the value provided to the --workers flag allows the watch job to index multiple tipsets concurrently. (Note: this will use a significant amount of system resources.)
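
For example, a sketch of a watch that indexes up to four tipsets from the buffer concurrently (the value is illustrative; size it to your hardware):

$ lily job run watch --workers=4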

While watching a network with distributed consensus, participants may occasionally have intermittent connectivity problems, which cause some nodes to have different views of the blockchain HEAD. Eventually, the connectivity issues resolve, the nodes connect, and they reconcile their differences. The node which is found to be on the wrong branch will reorganize its local state to match the new network consensus by “unwinding” the incorrect chain of tipsets up to the point of disagreement (the “fork”) and then applying the correct chain of tipsets. This can be referred to as a “reorg.” The number of tipsets that are “unwound” from the incorrect chain is called “reorg depth.”

Lily uses a “confidence” FIFO cache, which gives the operator confidence that the tipsets being processed and persisted are unlikely to be reorganized. A confidence of 100 establishes a cache holding up to 100 tipsets. Once the 101st tipset is unshifted onto the top of the cache, the 1st tipset is popped off the bottom and has the Tasks processed over it. In the event of a reorg, the most recent tipsets are shifted off the top, and the correct tipsets are unshifted in their place.

Example:

$ lily job run watch --confidence=10

A visualization of the confidence cache during normal operation. Data is only extracted from tipsets marked with (process):

             *unshift*        *unshift*      *unshift*       *unshift*
                │  │            │  │            │  │            │  │
             ┌──▼──▼──┐      ┌──▼──▼──┐      ┌──▼──▼──┐      ┌──▼──▼──┐
             │        │      │  ts10  │      │  ts11  │      │  ts12  │
   ...  ---> ├────────┤ ---> ├────────┤ ---> ├────────┤ ---> ├────────┤ --->  ...
             │  ts09  │      │  ts09  │      │  ts10  │      │  ts11  │
             ├────────┤      ├────────┤      ├────────┤      ├────────┤
             │  ts08  │      │  ts08  │      │  ts09  │      │  ts10  │
             ├────────┤      ├────────┤      ├────────┤      ├────────┤
             │  ...   │      │  ...   │      │  ...   │      │  ...   │
             ├────────┤      ├────────┤      ├────────┤      ├────────┤
             │  ts02  │      │  ts02  │      │  ts03  │      │  ts04  │
             ├────────┤      ├────────┤      ├────────┤      ├────────┤
             │  ts01  │      │  ts01  │      │  ts02  │      │  ts03  │
             ├────────┤      ├────────┤      ├────────┤      ├────────┤
             │  ts00  │      │  ts00  │      │  ts01  │      │  ts02  │
             └────────┘      └────────┘      └──│──│──┘      └──│──│──┘
                                                ▼  ▼  *pop*     ▼  ▼  *pop*
                                             ┌────────┐      ┌────────┐
              (confidence=10 :: length=10)   │  ts00  │      │  ts01  │
                                             └────────┘      └────────┘
                                              (process)       (process)


A visualization of the confidence cache during a reorg of depth=2:

  *unshift*    *shift*    *shift*  *unshift*  *unshift*  *unshift*
     │  │       ▲  ▲       ▲  ▲       │  │       │  │       │  │
   ┌─▼──▼─┐   ┌─│──│─┐   ┌─│──│─┐   ┌─│──│─┐   ┌─▼──▼─┐   ┌─▼──▼─┐
   │ ts10 │   │      │   │ │  │ │   │ │  │ │   │ ts10'│   │ ts11'│
   ├──────┤   ├──────┤   ├─│──│─┤   ├─▼──▼─┤   ├──────┤   ├──────┤
   │ ts09 │   │ ts09 │   │      │   │ ts09'│   │ ts09'│   │ ts10'│
   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤
   │ ts08 │   │ ts08 │   │ ts08 │   │ ts08 │   │ ts08 │   │ ts09'│
   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤
   │ ...  │ > │ ...  │ > │ ...  │ > │ ...  │ > │ ...  │ > │ ...  │
   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤
   │ ts02 │   │ ts02 │   │ ts02 │   │ ts02 │   │ ts02 │   │ ts03 │
   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤
   │ ts01 │   │ ts01 │   │ ts01 │   │ ts01 │   │ ts01 │   │ ts02 │
   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤   ├──────┤
   │ ts00 │   │ ts00 │   │ ts00 │   │ ts00 │   │ ts00 │   │ ts01 │
   └──────┘   └──────┘   └──────┘   └──────┘   └──────┘   └─│──│─┘
                                                            ▼  ▼  *pop*
               reorg                            reorg     ┌──────┐
               occurs                          resolves   │ ts00 │
                here                             here     └──────┘
                                                          (process)

Low confidence values increase the risk that Lily extracts data from tipsets that do not end up on the final chain; whether that is acceptable depends on what you are trying to index.

Constraints

  • Lily must complete syncing the filecoin blockchain before a watch job is started. Run the lily sync wait command to block until the node has completed syncing the chain; the command exits when sync is complete.

Walk

The Walk command will index the state based on the list of tasks (--tasks) provided over the specified range (--from, --to). Each epoch will be indexed serially, starting from the heaviest tipset at the upper height (--to) to the lower height (--from).
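
For example, a sketch of a walk over an illustrative range (epochs and task are placeholders):

# index block headers for epochs 1000 through 2000
$ lily job run --tasks=block_header walk --from=1000 --to=2000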

Constraints

  • Walk jobs may only be executed over epoch ranges for which lily has state. Walking ranges lily does not have state for will result in the error blockstore: block not found being written to the errors_detected column of the visor_processing_reports table.

Find

The Find job searches for gaps in a database storage system by executing the SQL gap_find() function over the visor_processing_reports table. Find will query the database for gaps based on the list of tasks (--tasks) provided over the specified range (--from, --to). An epoch is considered to have a gap if either of the following is true (a sketch follows the list):

  1. Tasks specified by the --tasks flag are not present at that epoch within the specified range.
  2. Tasks specified by the --tasks flag do not have the status 'OK' at that epoch within the specified range.

The find job results are written to the visor_gap_reports table with the status 'GAP'.
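
A sketch of a find over an illustrative range; the task, epochs, and storage name (Database1) are placeholders and assume the storage holds the processing reports:

# record gaps in block_header coverage between epochs 100 and 200
$ lily job run --tasks=block_header --storage=Database1 find --from=100 --to=200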

Constraints

  • The Find job must be executed BEFORE a Fill job. These jobs must NOT be performed simultaneously.
  • Executing the Find job over epoch ranges which have no processing reports will create GAP results for all epoch/task combinations.
  • Do not execute the Find job over epoch ranges imported from the lily data archive as there are no processing reports for imported data.

Fill

The Fill job queries the visor_gap_reports table for gaps to fill and indexes the data reported to have gaps. A gap in the visor_gap_reports table is any row with the status 'GAP'. Fill will index gaps based on the list of tasks (--tasks) provided over the specified range (--from, --to). Each epoch and its corresponding list of tasks found in the visor_gap_reports table will be indexed independently. When a gap is successfully filled, its corresponding entry in the visor_gap_reports table is updated with the status 'FILLED'.
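
A sketch of a fill over the same illustrative range used above (values are placeholders; a find job must already have written gap reports for this range):

# re-index epochs reported as 'GAP' for the block_header task between epochs 100 and 200
$ lily job run --tasks=block_header --storage=Database1 fill --from=100 --to=200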

Constraints

  • The Fill job must be executed AFTER a Find job. These jobs must NOT be performed simultaneously.
  • Fill jobs may only be executed over epoch ranges for which lily has state. Filling epoch ranges which lily does not have state for will result in the error blockstore: block not found being written to the errors_detected column of the visor_processing_reports table.
  • Do not execute the Fill job over epoch ranges imported from the lily data archive as there are no processing reports for imported data.

Survey

The Survey job collects information on the filecoin peer agents lily is connected to. The frequency at which lily collects this information may be configured via the --interval flag. Currently, this job has only one task type: peeragents. Note that this task name is distinct from the task names accepted by the watch, walk, find, and fill jobs.
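
A sketch of a survey collecting peer agent information every 10 minutes (the interval value is illustrative, and selecting the task via --tasks follows the same pattern as the other jobs):

$ lily job run --tasks=peeragents survey --interval=10m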

Distributed TipSet Workers

Lily may be configured to use Redis as a distributed task queue to distribute work across multiple lily instances. When configured this way, the system of lily nodes can consist of multiple Workers and a single Notifier:

  • The Notifier puts tasks into a queue for execution by a Worker.
  • The Workers remove tasks from the queue for execution.

The configuration described here should be used when multiple machines are available for running lily nodes as it allows the work of processing tasks to be distributed across a pool of machines. Internally, lily uses asynq for distributed task queue orchestration.

Notifier

A lily node may be configured to operate as a notifier using the notify subcommand on the watch, walk, and fill commands. For example:

$ lily job run --tasks=block_header watch notify --queue="Notifier1"

will cause the watch job to insert tasks into the queue based on the configuration with the name Notifier1. In this mode of operation, lily distributes watch tasks to a Redis queue for consumption by Workers instead of performing the indexing locally. The values specified by the --tasks flag are passed along to the Workers; however, the --storage flag is ignored since lily isn’t processing the watch tasks locally.

Similar to the above watch command, the walk command below:

$ lily job run walk --from=100 --to=200 notify --queue="Notifier1"

will cause the walk job to insert tasks into the queue based on the configuration with the name Notifier1 as the epochs are walked by the notifier.

Worker

A lily node may be configured to operate as a worker using the tipset-worker command. A tipset-worker consumes tasks (produced by a notifier) from a Redis queue and processes them. For example, the below command starts a tipset-worker that consumes tasks based on the configuration with the name Worker1:

$ lily job run --storage="Database1" tipset-worker --queue="Worker1"

In this mode of operation, lily will pull tasks from a Redis instance for processing, writing the results to the storage backend at configuration Database1. Below is an example of a Worker configuration:

[Queue.Workers.Worker1]
  [Queue.Workers.Worker1.RedisConfig]
    Network = "tcp"
    Addr = "localhost:6379"
    Username = "default"
    PasswordEnv = "LILY_REDIS_PASSWORD"
    DB = 0
    PoolSize = 0
  [Queue.Workers.Worker1.WorkerConfig]
    # Defines how many tasks a worker may process in parallel
    Concurrency = 1
    # Logging level of internal task system
    LoggerLevel = "debug"
    # Priorities of corresponding queues:
    #   processes Watch tasks 50% of the time.
    #   processes Fill tasks 30% of the time.
    #   processes Index tasks 10% of the time.
    #   processes Walk tasks 10% of the time
    WatchQueuePriority = 5
    FillQueuePriority = 3
    IndexQueuePriority = 1
    WalkQueuePriority = 1
    # If true, the queues with higher priority are always processed first.
    # Else queues with lower priority are processed only if all the other queues with higher priorities are empty.
    StrictPriority = false
    # Duration to wait for workers to finish their tasks before shutting down,
    # expressed in nanoseconds (30000000000ns = 30s)
    ShutdownTimeout = 30000000000

Tasks

Lily provides several tasks to capture different aspects of the blockchain state; tasks are used as input when executing Jobs. Jobs accept the tasks to run as a comma-separated list. The data extracted by a task is stored in its related Model. Links to Tasks and their corresponding Models can be found below:

Job Management

When Jobs are launched on a running daemon, a new Job (with ID) is created. These jobs may be managed using the lily job CLI:

# launch a job; if no tasks are provided, all tasks run by default
$ lily job run [watch|walk|find|fill|survey|tipset-worker]

# to launch specific tasks with a watch job
$ lily job run --tasks=actor_state,actor,message,receipt watch

# shows all Jobs and their status
$ lily job list

# stop a running Job
$ lily job stop

# allows a stopped Job to be resumed (new Jobs are not created this way)
$ lily job start

Example job list output:

$ lily job list
[
	{
		"ID": 1,
		"Name": "custom-job-name",
		"Type": "watch",
		"Error": "",
		"Tasks": [
            "actor_state",
            "actor",
            "message",
            "receipt",
		],
		"Running": true,
		"RestartOnFailure": false,
		"RestartOnCompletion": false,
		"RestartDelay": 0,
		"Params": {
			"buffer": "5",
			"confidence": "5",
		},
		"StartedAt": "2022-05-27T20:58:43.054786488Z",
		"EndedAt": "0001-01-01T00:00:00Z"
	}
]

Job Performance

Lily captures details about each completed Task in a table called visor_processing_reports within the configured storage. This table includes the height, state_root, reporter (via the --name flag), task, started/completed timestamps, status, and errors (if any). This provides task-level insight into how Lily is progressing with the provided Jobs and any internal errors.
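
For example, recent progress can be checked directly in a PostgreSQL storage backend. This is a sketch: $LILY_DB_URL is a placeholder for your connection string, and the exact column names should be confirmed against your schema:

# show the ten most recently processed heights and their task status (column names assumed)
$ psql "$LILY_DB_URL" -c "SELECT height, task, reporter, status, errors_detected FROM visor_processing_reports ORDER BY height DESC LIMIT 10;"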


Metrics & Debugging

Prometheus Metrics

Lily automatically exposes an HTTP endpoint which serves internal performance metrics. The endpoint is intended to be consumed by a Prometheus server.

Prometheus metrics are exposed by default on http://0.0.0.0:9991/metrics and may be bound to a custom IP/port by passing --prometheus-port="0.0.0.0:9991" on daemon startup with your custom values. lily help monitoring provides more information.

A description of the metrics is included inline in the reply. A sample may be captured using curl:

$ curl 0.0.0.0:9991/metrics -o lily_prom_sample.txt

Logging

Lily emits logs about each module contained within the runtime. Their level of verbosity can be managed on a per-module basis. A full list of registered modules can be retrieved on the console with $ lily log list. All modules have defaults set to prevent verbose output. Logging levels are one of DEBUG, INFO, WARN, ERROR, DPANIC, PANIC, FATAL. Logging can also be generated in colorized text (default), plain text, and JSON. See lily help monitoring for more information.

Examples:

# set level for all modules via envvar
$ export GOLOG_LOG_LEVEL="debug"
# set levels for multiple modules via envvar (can be different levels per module)
$ export GOLOG_LOG_LEVEL_NAMED="chain:debug,chainxchg:info"
# set levels for multiple modules via arg on startup
$ lily daemon --log-level-named="chain:debug,chainxchg:info"
# set levels for multiple modules via CLI (requires daemon to be running, one level per command)
$ lily log set-level --system chain --system chainxchg debug

Tracing

Lily is capable of exposing traces during normal runtime. This behavior is disabled by default because there is a performance impact for these traces to be captured. These traces are produced using Jaeger and are compatible with OpenCensus.

Jaeger tracing can be enabled by passing the --jaeger-tracing flag on daemon startup. There are other configuration values which have “reasonable” default values, but should be reviewed for your use case before enabling tracing. lily help monitoring provides more information about these aspects.

For example: by default, Lily uses probabilistic sampling with a rate of 0.0001. During testing, it can be easier to disable sampling so that every trace is captured, by setting the following environment variables:

export JAEGER_SAMPLER_TYPE=const
export JAEGER_SAMPLER_PARAM=1

or by specifying --jaeger-sampler-type=const --jaeger-sampler-param=1.
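
Putting these together, a sketch of a daemon started with tracing enabled and sampling effectively disabled (flags as described above):

$ lily daemon --jaeger-tracing --jaeger-sampler-type=const --jaeger-sampler-param=1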

Default tracing values are preconfigured to work with OpenTelemetry’s default agent ports and assume the agent is bound to localhost. Configuration of JAEGER_* envvars or --jaeger-* args may be required if your setup is custom.


Go debug profiles

Lily exposes Go runtime profiling endpoints during normal operation. These endpoints are always available, but no data is captured until the exposed HTTP endpoint is queried.

By default, the profiling endpoint is exposed at http://0.0.0.0:1234/debug/pprof. It serves HTML that can be viewed in a browser, or it can be queried with the go tool pprof command using the appropriate path for the type of profile to be captured. (See interacting with the pprof HTTP endpoint for more information.)

Example: Capture local heap profile and load into pprof for analysis:

$ curl 0.0.0.0:1234/debug/pprof/heap -o heap.pprof.out
$ go tool pprof ./path/to/binary ./heap.pprof.out

Fetch a heap profile from the endpoint on port 1234 and inspect it interactively via a web interface hosted at http://localhost:8000 (which opens automatically once the profile is captured):

$ go tool pprof -http :8000 :1234/debug/pprof/heap