Lily Operation Guide
The lily daemon is a long-running process that coordinates requested Jobs and directs the extracted data to specific storage targets. The CLI sends commands to the running daemon to manage it. This guide covers the management of the Jobs and Tasks that extract data from the Filecoin chain the Lily node is following.
API endpoint
The daemon coordinates access to its API using files within the repo path: the api file records where the JSON RPC API is located, and the token file holds the authentication token used for writeable interactions.
.lily/ <-- repo path root
├── api <-- for information about the JSON RPC API location
├── token <-- for write interactions
└── config.toml
For more details, see the Lotus API documentation.
The IP/port to which the daemon binds can be managed via the --api flag. By default, the daemon binds to localhost and will be unavailable externally unless bound to a publicly-accessible IP.
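For example, a quick way to confirm where the API is listening and to retrieve the write token is to read these files directly. The repo path shown (~/.lily) and the multiaddr are illustrative; adjust them to your deployment:
$ cat ~/.lily/api     # API location, typically a multiaddr such as /ip4/127.0.0.1/tcp/1234/http
$ cat ~/.lily/token   # authentication token used for write interactions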
Jobs
Once the Lily daemon process is running and fully synced, you may manage its Jobs through the CLI. (Note: behavior is undefined if Lily is not fully synced, so you may want to gate job creation on sync completion with lily sync wait && lily ....)
Jobs may be executed on a lily node via the job run command, which accepts the following flags (a combined example follows the list):
- --window: duration after which job execution will be canceled while processing a tipset. (A duration string is a possibly signed sequence of decimal numbers, each with an optional fraction and a unit suffix, such as “300ms”, “-1.5h” or “2h45m”. Valid time units are “ns”, “us” (or “µs”), “ms”, “s”, “m”, “h”.)
- --tasks: the list of tasks to run. Some tasks are heavier than others. A tipset is only considered processed once all the specified tasks have finished; only then will Lily move on to the next epoch. The default value of this flag is all tasks lily is capable of performing.
- --storage: which of the configured storage outputs the job will write to.
- --name: a name for the job to ease identification later. The provided value appears as reporter in the visor_processing_reports table.
- --restart-delay: duration to wait before restarting the job after it ends execution. (Duration strings use the same format as for --window.)
- --restart-on-failure: whether the job should be restarted if it fails.
- --restart-on-completion: whether the job should be restarted once it completes.
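For example, the invocation below starts a watch job that runs two tasks, writes to a configured storage target, names the job, and restarts it on failure. The storage name Database1, the job name, and the task selection are illustrative; substitute values from your own configuration:
$ lily job run --tasks=block_header,message --storage="Database1" --name="mainnet-watch" --restart-on-failure watch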
Currently, Lily is capable of executing the following types of jobs:
Watch
The Watch command subscribes to incoming tipsets from the Filecoin blockchain and indexes them as they arrive.

Since tipsets may arrive at a rate greater than lily’s rate of indexing, the watch job maintains a buffer of tipsets to index. Consumption of this queue can be configured via the --workers flag: increasing its value allows the watch job to index multiple tipsets concurrently. (Note: this will use a significant amount of system resources.)
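For example, the command below (the worker count of 4 is illustrative) allows up to four tipsets to be indexed at once:
$ lily job run watch --workers=4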
While watching a network with distributed consensus, participants may occasionally have intermittent connectivity problems, which cause some nodes to have different views of the blockchain HEAD. Eventually, the connectivity issues resolve, the nodes connect, and they reconcile their differences. The node which is found to be on the wrong branch will reorganize its local state to match the new network consensus by “unwinding” the incorrect chain of tipsets up to the point of disagreement (the “fork”) and then applying the correct chain of tipsets. This can be referred to as a “reorg.” The number of tipsets that are “unwound” from the incorrect chain is called “reorg depth.”
Lily uses a “confidence” FIFO cache, which gives the operator confidence that the tipsets being processed and persisted are unlikely to be reorganized. A confidence of 100 establishes a cache holding 100 tipsets. Once the 101st tipset is unshifted onto the top of the cache stack, the 1st tipset is popped off the bottom and the Tasks are processed over it. In the event of a reorg, the most recent tipsets are shifted off the top, and the correct tipsets are unshifted in their place.
Example:
$ lily job run watch --confidence=10
A visualization of the confidence cache during normal operation. Data is only extracted from tipsets marked with (process):
*unshift* *unshift* *unshift* *unshift*
│ │ │ │ │ │ │ │
┌──▼──▼──┐ ┌──▼──▼──┐ ┌──▼──▼──┐ ┌──▼──▼──┐
│ │ │ ts10 │ │ ts11 │ │ ts12 │
... ---> ├────────┤ ---> ├────────┤ ---> ├────────┤ ---> ├────────┤ ---> ...
│ ts09 │ │ ts09 │ │ ts10 │ │ ts11 │
├────────┤ ├────────┤ ├────────┤ ├────────┤
│ ts08 │ │ ts08 │ │ ts09 │ │ ts10 │
├────────┤ ├────────┤ ├────────┤ ├────────┤
│ ... │ │ ... │ │ ... │ │ ... │
├────────┤ ├────────┤ ├────────┤ ├────────┤
│ ts02 │ │ ts02 │ │ ts03 │ │ ts04 │
├────────┤ ├────────┤ ├────────┤ ├────────┤
│ ts01 │ │ ts01 │ │ ts02 │ │ ts03 │
├────────┤ ├────────┤ ├────────┤ ├────────┤
│ ts00 │ │ ts00 │ │ ts01 │ │ ts02 │
└────────┘ └────────┘ └──│──│──┘ └──│──│──┘
▼ ▼ *pop* ▼ ▼ *pop*
┌────────┐ ┌────────┐
(confidence=10 :: length=10) │ ts00 │ │ ts01 │
└────────┘ └────────┘
(process) (process)
A visualization of the confidence cache during a reorg of depth=2:
*unshift* *shift* *shift* *unshift* *unshift* *unshift*
│ │ ▲ ▲ ▲ ▲ │ │ │ │ │ │
┌─▼──▼─┐ ┌─│──│─┐ ┌─│──│─┐ ┌─│──│─┐ ┌─▼──▼─┐ ┌─▼──▼─┐
│ ts10 │ │ │ │ │ │ │ │ │ │ │ │ ts10'│ │ ts11'│
├──────┤ ├──────┤ ├─│──│─┤ ├─▼──▼─┤ ├──────┤ ├──────┤
│ ts09 │ │ ts09 │ │ │ │ ts09'│ │ ts09'│ │ ts10'│
├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤
│ ts08 │ │ ts08 │ │ ts08 │ │ ts08 │ │ ts08 │ │ ts09'│
├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤
│ ... │ > │ ... │ > │ ... │ > │ ... │ > │ ... │ > │ ... │
├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤
│ ts02 │ │ ts02 │ │ ts02 │ │ ts02 │ │ ts02 │ │ ts03 │
├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤
│ ts01 │ │ ts01 │ │ ts01 │ │ ts01 │ │ ts01 │ │ ts02 │
├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤ ├──────┤
│ ts00 │ │ ts00 │ │ ts00 │ │ ts00 │ │ ts00 │ │ ts01 │
└──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └─│──│─┘
▼ ▼ *pop*
reorg reorg ┌──────┐
occurs resolves │ ts00 │
here here └──────┘
(process)
Low confidence values increase the risk that Lily extracts data from tipsets that do not end up on the final chain; whether that is acceptable depends on what you are trying to index.
Constraints
- Lily’s sync of the filecoin blockchain must be completed before starting a watch job. You may run the lily sync wait command to check whether the node has completed syncing the chain; the command exits when sync is complete.
Walk
The Walk command will index the state based on the list of tasks (--tasks) provided over the specified range (--from, --to). Each epoch will be indexed serially, starting from the heaviest tipset at the upper height (--to) down to the lower height (--from).
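For example, the command below (the epoch range and task selection are illustrative) walks epochs 100 through 200 and extracts block headers:
$ lily job run --tasks=block_header walk --from=100 --to=200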
Constraints
- Walk jobs may only be executed over epoch ranges for which lily has state. Walking ranges lily does not have state for will result in the error blockstore: block not found being written to the errors_detected column of the visor_processing_reports table.
Find
The Find job searches for gaps in a database storage system by executing the SQL gap_find() function over the visor_processing_reports table. Find will query the database for gaps based on the list of tasks (--tasks) provided over the specified range (--from, --to). An epoch is considered to have gaps if and only if:
- Tasks specified by the --tasks flag are not present at each epoch within the specified range.
- Tasks specified by the --tasks flag do not have the status 'OK' at each epoch within the specified range.
The find job results are written to the visor_gap_reports table with the status 'GAP'.
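For example, the command below (the epoch range, task, and storage name are illustrative) looks for gaps in the message task between epochs 100 and 200:
$ lily job run --tasks=message --storage="Database1" find --from=100 --to=200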
Constraints
- The Find job must be executed BEFORE a Fill job. These jobs must NOT be performed simultaneously.
- Executing the Find job over epoch ranges which have no processing reports will create GAP results for all epoch/task combinations.
- Do not execute the Find job over epoch ranges imported from the lily data archive as there are no processing reports for imported data.
Fill
The Fill job queries the visor_gap_reports table for gaps to fill and indexes the data reported to have gaps. A gap in the visor_gap_reports table is any row with the status 'GAP'. Fill will index gaps based on the list of tasks (--tasks) provided over the specified range (--from, --to). Each epoch and its corresponding list of tasks found in the visor_gap_reports table will be indexed independently. When a gap is successfully filled, its corresponding entry in the visor_gap_reports table will be updated with the status 'FILLED'.
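For example, the command below (the epoch range, task, and storage name are illustrative) fills any message-task gaps reported between epochs 100 and 200:
$ lily job run --tasks=message --storage="Database1" fill --from=100 --to=200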
Constraints
- The Fill job must be executed AFTER a Find job. These jobs must NOT be performed simultaneously.
- Fill jobs may only be executed over epoch ranges for which lily has state. Filling epoch ranges lily does not have state for will result in the error blockstore: block not found being written to the errors_detected column of the visor_processing_reports table.
- Do not execute the Fill job over epoch ranges imported from the lily data archive as there are no processing reports for imported data.
Survey
The Survey job collects information on the filecoin peer agents lily is connected to. The frequency at which lily collects this information may be configured via the --interval flag. Currently, this job only has one task type: peeragents. Note that this task name is distinct from the task names the watch, walk, find, and fill jobs accept.
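For example, the command below (the interval value is illustrative, assuming the flag accepts a Go-style duration) collects peer agent information every 10 minutes:
$ lily job run --tasks=peeragents survey --interval=10m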
Distributed TipSet Workers
Lily may be configured to use Redis as a distributed task queue to distribute work across multiple lily instances. When configured this way, the system of lily nodes consists of multiple Workers and a single Notifier:
- The Notifier puts tasks into a queue for execution by a Worker.
- The Workers remove tasks from the queue for execution.
The configuration described here should be used when multiple machines are available for running lily nodes as it allows the work of processing tasks to be distributed across a pool of machines. Internally, lily uses asynq for distributed task queue orchestration.
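As a sketch of the topology (the queue names Notifier1 and Worker1 and the storage name Database1 are whatever you defined in your configuration, and both queue configurations must point at the same Redis instance):
# On the machine acting as the notifier:
$ lily job run --tasks=block_header watch notify --queue="Notifier1"
# On each machine acting as a worker:
$ lily job run --storage="Database1" tipset-worker --queue="Worker1"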
Notifier
A lily node may be configured to operate as a notifier using the notify subcommand of the watch, walk, and fill commands. For example:
$ lily job run --tasks=block_header watch notify --queue="Notifier1"
will cause the watch job to insert tasks into the queue based on the configuration with the name Notifier1. In this mode of operation, lily will distribute watch tasks to a Redis queue for consumption by Workers instead of performing the indexing locally. The values specified by the --tasks flag will be passed along to the Workers; however, the --storage flag will be ignored since lily isn’t processing the watch tasks locally.
Similar to the above watch command, the walk command below:
$ lily job run walk --from=100 --to=200 notify --queue="Notifier1"
will cause the walk job to insert tasks into the queue configuration with the name Notifier1 as the epochs are walked by the notifier.
Worker
A lily node may be configured to operate as a worker using the tipset-worker command. A tipset-worker consumes tasks (produced by a notifier) from a Redis queue and processes them. For example, the below command starts a tipset-worker that consumes tasks based on the configuration with the name Worker1:
$ lily job run --storage="Database1" tipset-worker --queue="Worker1"
In this mode of operation, lily will pull tasks from a Redis instance for processing, writing the results to the storage backend at configuration Database1. Below is an example of a Worker configuration:
[Queue.Workers.Worker1]
[Queue.Workers.Worker1.RedisConfig]
Network = "tcp"
Addr = "localhost:6379"
Username = "default"
PasswordEnv = "LILY_REDIS_PASSWORD"
DB = 0
PoolSize = 0
[Queue.Workers.Worker1.WorkerConfig]
# Defines how many tasks a worker may process in parallel
Concurrency = 1
# Logging level of internal task system
LoggerLevel = "debug"
# Priorities of corresponding queues:
# processes Watch tasks 50% of the time.
# processes Fill tasks 30% of the time.
# processes Index tasks 10% of the time.
# processes Walk tasks 10% of the time
WatchQueuePriority = 5
FillQueuePriority = 3
IndexQueuePriority = 1
WalkQueuePriority = 1
# If true, queues with higher priority are always processed first, and queues with
# lower priority are processed only when all higher-priority queues are empty.
# If false, the priorities above act as weights, as in the percentages shown.
StrictPriority = false
# Specifies the duration to wait to let workers finish their tasks before shutting down
# (expressed in nanoseconds: 30000000000ns = 30s)
ShutdownTimeout = 30000000000
Tasks
Lily provides several tasks to capture different aspects of the blockchain state; tasks are used as input when executing Jobs, which accept the tasks to run as a comma-separated list. The data extracted by a task is stored in its related Model. Links to Tasks and their corresponding Models can be found below:
Job Management
When Jobs are launched on a running daemon, a new Job (with an ID) is created. These Jobs may be managed using the lily job CLI:
# launch a job; if no tasks are provided, all tasks will run by default.
$ lily job run [watch|walk|find|fill|survey|tipset-worker]
# to launch specific tasks with a watch job
$ lily job run --tasks=actor_state,actor,message,receipt watch
# shows all Jobs and their status
$ lily job list
# stop a running Job
$ lily job stop
# allows a stopped Job to be resumed (new Jobs are not created this way)
$ lily job start
Example job list output:
$ lily job list
[
{
"ID": 1,
"Name": "custom-job-name",
"Type": "watch",
"Error": "",
"Tasks": [
"actor_state",
"actor",
"message",
"receipt",
],
"Running": true,
"RestartOnFailure": false,
"RestartOnCompletion": false,
"RestartDelay": 0,
"Params": {
"buffer": "5",
"confidence": "5",
},
"StartedAt": "2022-05-27T20:58:43.054786488Z",
"EndedAt": "0001-01-01T00:00:00Z"
}
]
Job Performance
Lily captures details about each Task completed within the configured storage in a table called visor_processing_reports. This table includes the height, state_root, reporter (via the --name flag), task, started/completed timestamps, status, and errors (if any). This provides task-level insight into how Lily is progressing with the provided Jobs and any internal errors.
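For example, assuming a PostgreSQL storage target and a job started with --name="custom-job-name", a quick way to check recent task outcomes is to query the table directly (the LILY_DB_URL variable is hypothetical; use your own connection string):
$ psql "$LILY_DB_URL" -c "SELECT height, task, status, errors_detected FROM visor_processing_reports WHERE reporter = 'custom-job-name' ORDER BY height DESC LIMIT 20;"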
Metrics & Debugging
Prometheus Metrics
Lily automatically serves internal performance metrics on an HTTP endpoint intended to be consumed by a Prometheus server.

Prometheus metrics are exposed by default on http://0.0.0.0:9991/metrics and may be bound to a custom IP/port by passing --prometheus-port="0.0.0.0:9991" (with your custom values) on daemon startup. lily help monitoring provides more information.

A description of each metric is included inline in the response. A sample may be captured using curl:
$ curl 0.0.0.0:9991/metrics -o lily_prom_sample.txt
Logging
Lily emits logs about each module contained within the runtime. Their level of verbosity can be managed on a per-module basis. A full list of registered modules can be retrieved on the console with lily log list. All modules have defaults set to prevent verbose output. Logging levels are one of DEBUG, INFO, WARN, ERROR, DPANIC, PANIC, FATAL. Logging can also be generated in colorized text (default), plain text, and JSON. See lily help monitoring for more information.
Examples:
# set level for all modules via envvar
$ export GOLOG_LOG_LEVEL="debug"
# set levels for multiple modules via envvar (can be different levels per module)
$ export GOLOG_LOG_LEVEL_NAMED="chain:debug,chainxchg:info"
# set levels for multiple modules via arg on startup
$ lily daemon --log-level-named="chain:debug,chainxchg:info"
# set levels for multiple modules via CLI (requires daemon to be running, one level per command)
$ lily log set-level --system chain --system chainxchg debug
Tracing
Lily is capable of exposing traces during normal runtime. This behavior is disabled by default because capturing these traces has a performance impact. The traces are produced using Jaeger and are compatible with OpenCensus.

Jaeger tracing can be enabled by passing the --jaeger-tracing flag on daemon startup. There are other configuration values which have “reasonable” defaults, but they should be reviewed for your use case before enabling tracing. lily help monitoring provides more information about these aspects.
For example, by default Lily uses probabilistic sampling with a rate of 0.0001. During testing it can be easier to sample every trace by setting the following environment variables:
export JAEGER_SAMPLER_TYPE=const
export JAEGER_SAMPLER_PARAM=1
or by specifying --jaeger-sampler-type=const --jaeger-sampler-param=1.
Default tracing values are preconfigured to work with OpenTelemetry’s default agent ports and assume the agent is bound to localhost. Configuration of JAEGER_* envvars or --jaeger-* args may be required if your setup is custom.
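Putting this together, the daemon can be started with tracing enabled and constant sampling using the flag and environment variables shown above:
$ export JAEGER_SAMPLER_TYPE=const
$ export JAEGER_SAMPLER_PARAM=1
$ lily daemon --jaeger-tracing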
Go debug profiles
Lily exposes runtime profiling endpoints during normal runtime. These endpoints are always available, but profiles are only captured when requested through the exposed HTTP endpoint.

By default, the profiling endpoint is exposed at http://0.0.0.0:1234/debug/pprof. It serves HTML that can be viewed in a browser, or it can be queried with go tool pprof using the appropriate endpoint for the type of profile to be captured. (See interacting with the pprof HTTP endpoint for more information.)
Example: Capture local heap profile and load into pprof for analysis:
$ curl 0.0.0.0:1234/debug/pprof/heap -o heap.pprof.out
$ go tool pprof ./path/to/binary ./heap.pprof.out
Capture a heap profile from http://localhost:1234/debug/pprof and inspect it interactively via a web interface hosted at http://localhost:8000 (which opens automatically once the profile is captured):
$ go tool pprof -http :8000 :1234/debug/pprof/heap
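Similarly, a 30-second CPU profile can be captured from the same endpoint; the /debug/pprof/profile path and seconds parameter are the standard Go pprof handlers, assumed to be reachable here:
$ curl "0.0.0.0:1234/debug/pprof/profile?seconds=30" -o cpu.pprof.out
$ go tool pprof ./path/to/binary ./cpu.pprof.out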