Creating a Pipeline
This guide walks through creating a custom pipeline for Project Eidolon, letting you define how modules connect and interact with each other.
Basic Structure
A pipeline is defined in a .yaml file in the src/pipelines/ directory. The file should have the following structure:
pipeline:
  name: custom_pipeline_name
  description: "A description of what this pipeline does"
  execution:
    timeout: 300s
    retries: 2
    error_policy: halt
  modules:
    - id: module1
      module: module1_name
    - id: module2
      module: module2_name
      depends_on: [module1]
      input:
        module2_input: module1.module1_output
      config:
        option1: value1
      run_mode: reactive
Step-by-Step Guide
Step 1: Create a new YAML file
Create a new file in the src/pipelines/ directory with a descriptive name, such as osint_pipeline.yaml.
Step 2: Define the pipeline properties
Start by defining the pipeline properties at the top of the file:
pipeline:
  name: osint_pipeline
  description: "OSINT data collection and analysis pipeline"
  execution:
    timeout: 300s
    retries: 3
    error_policy: halt
The name will be used when running the pipeline via the command line, and the description helps document the pipeline's purpose.
Step 3: List the modules
Add all the modules that should be part of the pipeline under the modules key:
pipeline:
  # ...pipeline properties...
  modules:
    - id: crawler
      module: data_crawler
      run_mode: loop
    - id: analyzer
      module: data_analyzer
      run_mode: reactive
The id field defines how the module is referenced within the pipeline, while the module field specifies the actual module name (directory) to load.
Step 4: Define dependencies
For each module that depends on data from another module, add a depends_on list:
pipeline:
  # ...pipeline properties...
  modules:
    - id: crawler
      module: data_crawler
    - id: analyzer
      module: data_analyzer
      depends_on: [crawler]
Step 5: Map inputs and outputs
Connect specific module inputs to other modules' outputs using the input field:
pipeline:
  # ...pipeline properties...
  modules:
    - id: crawler
      module: data_crawler
      outputs:
        - collected_data
    - id: analyzer
      module: data_analyzer
      depends_on: [crawler]
      input:
        raw_data: crawler.collected_data
The input dictionary connects:
- Keys: Input names in the current module
- Values: Qualified references to outputs from dependency modules (module_id.output_name)
Step 6: Configure module behavior
Add module-specific configuration via the config field and specify execution behavior with run_mode:
pipeline:
  # ...pipeline properties...
  modules:
    - id: analyzer
      module: data_analyzer
      depends_on: [crawler]
      input:
        raw_data: crawler.collected_data
      config:
        batch_size: 100
        cleanup: true
      run_mode: reactive
Available run modes include:
- loop: Module runs continuously in a loop (default)
- once: Module runs once and completes
- reactive: Module runs whenever new data is received
- on_trigger: Module runs only when explicitly triggered
Input Mapping Rules
When setting up input mappings, follow these rules:
- The input name (key) must match an input defined in the module's module.yaml
- The output reference (value) must use the format module_id.output_name
- The types of the connected inputs and outputs must be compatible (a matching pair is sketched below)
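To see how these rules line up in practice, here is a sketch pairing a module definition with a pipeline mapping. The module.yaml layout shown is an assumption for illustration; the Module Configuration guide defines the real schema.

# Hypothetical excerpt from src/modules/data_analyzer/module.yaml
# (field layout is an assumption, not the documented schema)
inputs:
  - name: raw_data
    type: document_list
outputs:
  - name: analysis_results
    type: document_list

# Matching mapping in the pipeline file:
- id: analyzer
  module: data_analyzer
  depends_on: [crawler]
  input:
    raw_data: crawler.collected_data  # key matches the module's input name;
                                      # value is module_id.output_name, and
                                      # crawler.collected_data must have a
                                      # compatible type (here, document_list)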
Advanced Pipeline Features
Module Configuration
Each module can have custom configuration options that override the defaults:
- id: analyzer
  module: data_analyzer
  config:
    batch_size: 100
    cleanup: true
    stopwords: ["the", "and", "is"]
Run Mode Selection
The run_mode field controls how a module executes:
- id: monitor
  module: keyword_monitor
  run_mode: loop      # Runs continuously in a loop

- id: analyzer
  module: data_analyzer
  run_mode: reactive  # Runs when new data is received

- id: reporter
  module: data_reporter
  run_mode: once      # Runs once and completes
Error Handling
The execution section controls pipeline-wide error handling:
execution:
  timeout: 300s       # Maximum execution time before shutdown
  retries: 2          # Number of retries if critical errors occur
  error_policy: halt  # How to handle errors (halt, continue, isolate, log_only)
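Which policy fits depends on the pipeline: halt is a safe default when later stages are useless without earlier ones, while a best-effort collection pipeline may prefer to keep running. A minimal sketch, assuming continue lets the remaining modules carry on when one fails (the exact semantics of each policy are defined by the runtime):

execution:
  timeout: 600s
  retries: 0
  error_policy: continue  # assumed semantics: other modules keep running on failure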
Common Patterns
Linear Pipeline
Modules run in sequence with each depending on the previous:
pipeline:
  name: linear_pipeline
  modules:
    - id: collector
      module: data_collector
    - id: processor
      module: data_processor
      depends_on: [collector]
      input:
        raw_data: collector.collected_data
    - id: analyzer
      module: data_analyzer
      depends_on: [processor]
      input:
        processed_data: processor.normalized_data
Star Pattern
One central module feeds data to multiple independent modules:
pipeline:
  name: star_pipeline
  modules:
    - id: publisher
      module: central_publisher
    - id: consumer_a
      module: consumer_a
      depends_on: [publisher]
      input:
        input_data: publisher.published_data
    - id: consumer_b
      module: consumer_b
      depends_on: [publisher]
      input:
        input_data: publisher.published_data
    - id: consumer_c
      module: consumer_c
      depends_on: [publisher]
      input:
        input_data: publisher.published_data
Aggregation Pattern
Multiple modules feed into a single aggregator:
pipeline:
  name: aggregation_pipeline
  modules:
    - id: source_a
      module: data_source_a
    - id: source_b
      module: data_source_b
    - id: source_c
      module: data_source_c
    - id: aggregator
      module: aggregator
      depends_on: [source_a, source_b, source_c]
      input:
        source_a_data: source_a.data_a
        source_b_data: source_b.data_b
        source_c_data: source_c.data_c
Testing a Pipeline
To test your pipeline, run it with the eidolon run command, passing the name field defined in Step 2 (the exact flags your build accepts may differ):
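eidolon run osint_pipeline  # use the pipeline's name field, not the file name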
You should see log messages showing:
- The pipeline being loaded
- Modules being discovered and loaded
- Input and output connections being established
- Modules running in the correct dependency order
Troubleshooting
Common Issues
- Module not found: Ensure the module name in the pipeline matches the directory name in src/modules/
- Input/output mismatch: Check that the input and output names match those defined in the module configuration files
- Type mismatch: Ensure the types of connected inputs and outputs are compatible
- Circular dependencies: Avoid creating circular dependencies between modules (an invalid cycle is sketched below)
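For illustration, the following configuration contains a cycle that no execution order can satisfy; the module names are made up:

modules:
  - id: module_a
    module: module_a
    depends_on: [module_b]  # waits on module_b...
  - id: module_b
    module: module_b
    depends_on: [module_a]  # ...which waits on module_a: unresolvable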
Debugging Tips
- Run with an increased log level to see more details (hypothetical command sketches for these tasks follow below)
- Check module configurations to ensure inputs and outputs are correctly defined
- List all available modules to verify the spelling of module names
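This guide only documents the eidolon run command itself, so the subcommands and flags below are assumptions for illustration; consult your CLI's help output for the real syntax.

# Hypothetical invocations -- flag and subcommand names are assumptions
eidolon run osint_pipeline --log-level debug    # more detailed logging
cat src/modules/data_analyzer/module.yaml       # inspect a module's declared inputs/outputs
eidolon modules list                            # list available modules by name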
Best Practices
- Descriptive naming: Use clear, descriptive names for pipelines and maintain a consistent naming convention
- Minimal dependencies: Only include necessary dependencies between modules
- Logical grouping: Group related modules in the same pipeline
- Documentation: Add comments in your pipeline YAML file to document non-obvious connections (see the example after this list)
- Incremental testing: Build your pipeline incrementally, testing as you add each module
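As an example of the documentation point, a one-line comment on a non-obvious mapping spares the next reader a trip into the module definitions. The module names and the claim in the comment are illustrative:

- id: analyzer
  module: data_analyzer
  depends_on: [crawler]
  input:
    # crawler.collected_data is already de-duplicated upstream,
    # so no separate clean-up module is needed before analysis
    raw_data: crawler.collected_data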
Example: Full OSINT Pipeline
Here's a complete example of an OSINT pipeline that:
- Manages a list of target URLs for crawling
- Cleans and processes the URLs for effective crawling
- Performs web crawling to collect data
- Analyzes and extracts entities from the collected data
- Generates visualization data for intelligence reports
- Outputs results to both a dashboard and an archival system
pipeline:
  name: comprehensive_osint_pipeline
  description: "Complete OSINT pipeline for data collection, analysis, and reporting"
  execution:
    timeout: 600s
    retries: 3
    error_policy: halt
  modules:
    # Web crawling layer
    - id: url_list
      module: aethon_urllist
      run_mode: loop
      config:
        initial_targets:
          - "https://example.com/target1"
          - "https://example.org/target2"
        max_urls: 1000

    - id: url_cleaner
      module: aethon_urlclean
      depends_on: [url_list]
      input:
        urls_to_clean: url_list.discovered_urls
      run_mode: reactive
      config:
        strip_parameters: true
        normalize_domains: true

    - id: web_crawler
      module: aethon_crawler
      depends_on: [url_cleaner]
      input:
        target_urls: url_cleaner.clean_urls
      run_mode: reactive
      config:
        depth: 2
        respect_robots_txt: true

    # Analysis layer
    - id: entity_extractor
      module: entity_extractor
      depends_on: [web_crawler]
      input:
        source_content: web_crawler.page_content
      run_mode: reactive
      config:
        entity_types: ["PERSON", "ORG", "GPE", "EVENT"]

    - id: relationship_analyzer
      module: relationship_analyzer
      depends_on: [entity_extractor]
      input:
        entities: entity_extractor.extracted_entities
      run_mode: reactive

    # Visualization layer
    - id: intel_visualizer
      module: scryer
      depends_on: [entity_extractor, relationship_analyzer]
      input:
        entity_data: entity_extractor.extracted_entities
        relationship_data: relationship_analyzer.entity_relationships
      run_mode: reactive

    # Output layer
    - id: dashboard_interface
      module: hermes
      depends_on: [intel_visualizer]
      input:
        visualization_data: intel_visualizer.intelligence_visualizations
      run_mode: reactive

    - id: archive_manager
      module: osiris
      depends_on: [entity_extractor, relationship_analyzer]
      input:
        entities_to_archive: entity_extractor.extracted_entities
        relationships_to_archive: relationship_analyzer.entity_relationships
      run_mode: reactive
See Also
- Pipeline Overview - Overview of the pipeline system
- Module Methods - How modules communicate through the message bus
- Module Configuration - How to define module inputs and outputs