Command Line Interface (CLI)
============================

This page documents the main CLI entry points under ``mmml/cli`` with example
invocations, minimal input files, and Slurm submission scripts.

Prerequisites
-------------

- Python virtual environment with mmml installed and model/data dependencies available
- Access to GPU/CPU as required by the chosen command
- For Slurm examples, an HPC partition with CUDA modules or suitable CPU nodes


make_res.py
-----------

Purpose
  Create a residue (or small molecule) template and minimal inputs for subsequent steps.

Usage
  .. code-block:: bash

     python -m mmml.cli.make_res \
       --resname WAT \
       --pdb water.pdb \
       --out water_res

Inputs
  - ``--resname``: residue name (e.g., WAT, ETH, ACE)
  - ``--pdb``: input PDB containing the residue

Outputs
  - Directory ``water_res`` with processed residue files


make_box.py
-----------

Purpose
  Build boxes from residues and write PDB/PSF (or equivalent) for simulation setup.

Usage
  .. code-block:: bash

     python -m mmml.cli.make_box \
       --residue water_res \
       --count 1000 \
       --box 30 \
       --out water_box

Inputs
  - ``--residue``: residue directory from ``make_res``
  - ``--count``: number of molecules
  - ``--box``: box edge length (Å)

Outputs
  - Directory ``water_box`` with PDB (and auxiliary) files
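
Example (sketch): checking the implied density
  Before building a box it can help to check what mass density a given
  ``--count`` and ``--box`` imply. The snippet below is a minimal sketch and not
  part of ``mmml``; the helper name ``mass_density`` and the molar mass of water
  are assumptions introduced for illustration, and the box is assumed to be cubic.

  .. code-block:: python

     AVOGADRO = 6.02214076e23      # molecules per mole
     ANGSTROM3_TO_CM3 = 1.0e-24    # 1 Å^3 expressed in cm^3


     def mass_density(count, box_edge, molar_mass):
         """Density in g/cm^3 of `count` molecules in a cubic box.

         box_edge is given in Å and molar_mass in g/mol.
         """
         mass_g = count * molar_mass / AVOGADRO
         volume_cm3 = box_edge ** 3 * ANGSTROM3_TO_CM3
         return mass_g / volume_cm3


     # 1000 water molecules (18.015 g/mol) in a 30 Å cubic box
     density = mass_density(count=1000, box_edge=30.0, molar_mass=18.015)
     print(f"{density:.3f} g/cm^3")

  For the values in the usage example this gives roughly 1.11 g/cm^3, somewhat
  above liquid water at ambient conditions (about 1.0 g/cm^3), so a slightly
  larger box or fewer molecules may be worth considering.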


make_training.py
----------------

Purpose
  Prepare and/or run training for a PhysNetJAX (or compatible) model.

Common flags
  - ``--data``: path to dataset (npz)
  - ``--tag``: run name tag
  - ``--model``: model definition (JSON/INP); if omitted, a default EF model is created
  - ``--n_train`` / ``--n_valid``: split sizes
  - ``--num_epochs``: number of epochs
  - ``--batch_size``: batch size
  - ``--learning_rate``: optimizer learning rate
  - ``--num_atoms``: number of atoms per structure (auto-detected from data if not specified)
  - ``--ckpt_dir``: checkpoints directory

Usage (basic - num_atoms auto-detected)
  .. code-block:: bash

     python -m mmml.cli.make_training \
       --data data/dimers.npz \
       --tag physnet_run1 \
       --num_epochs 5 \
       --batch_size 4 \
       --learning_rate 1e-3 \
       --ckpt_dir checkpoints/physnet_run1

Usage (explicit num_atoms)
  .. code-block:: bash

     python -m mmml.cli.make_training \
       --data data/dimers.npz \
       --tag physnet_run1 \
       --num_atoms 60 \
       --num_epochs 5 \
       --batch_size 4 \
       --learning_rate 1e-3 \
       --ckpt_dir checkpoints/physnet_run1

Outputs
  - Checkpoints in ``checkpoints/physnet_run1``
  - Parameter snapshots ``paramsYYYY-mm-dd_HH-MM-SS.json``

Notes
  - ``--num_atoms`` is auto-detected as max(N) over the dataset when it is not given
  - **Padding is automatically removed** if detected (e.g., 60 padded → 10 actual atoms)
  - Training uses only the actual number of atoms for efficiency
  - Unpadded files are saved for reuse (e.g., ``data_train_unpadded.npz``)
  - You can still specify ``--num_atoms`` explicitly if needed
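
Example (sketch): how padding can be detected and removed
  The notes above can be illustrated with plain NumPy. This is a minimal sketch,
  not the code used by ``make_training``: the helper ``strip_padding`` is
  hypothetical, and it assumes per-atom arrays are shaped
  ``(n_structures, n_padded_atoms, ...)`` with padding at the end of the atom axis.

  .. code-block:: python

     import numpy as np


     def strip_padding(data, per_atom_keys=("R", "F", "Z")):
         """Trim per-atom arrays to max(N) atoms (hypothetical helper)."""
         n_max = int(np.max(data["N"]))        # largest real atom count
         trimmed = dict(data)
         for key in per_atom_keys:
             if key in trimmed:
                 trimmed[key] = trimmed[key][:, :n_max]
         return trimmed, n_max


     # Toy dataset: 2 structures padded to 60 atoms, 10 real atoms each
     data = {
         "R": np.zeros((2, 60, 3)),
         "F": np.zeros((2, 60, 3)),
         "Z": np.zeros((2, 60), dtype=int),
         "E": np.zeros(2),
         "N": np.array([10, 10]),
     }
     trimmed, n_max = strip_padding(data)
     print(n_max, trimmed["R"].shape, trimmed["Z"].shape)

  Per-structure arrays such as ``E`` and ``N`` are left untouched; only the atom
  axis is shortened.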


run_sim.py
----------

Purpose
  Run a short ASE+MM/ML hybrid simulation (or energy/force evaluation) using a trained model.

Common flags
  - ``--pdbfile``: input PDB to load
  - ``--checkpoint``: path to trained model checkpoint directory
  - ``--n-monomers`` / ``--n-atoms-monomer``: topology assumptions for ML partitions
  - ``--temperature``: target temperature (K) for MD
  - ``--num-steps`` / ``--timestep``: MD length and integration step (fs)
  - ``--output-prefix``: prefix for trajectory/outputs

Usage
  .. code-block:: bash

     python -m mmml.cli.run_sim \
       --pdbfile water_box/water.pdb \
       --checkpoint checkpoints/physnet_run1 \
       --n-monomers 1000 \
       --n-atoms-monomer 3 \
       --temperature 100 \
       --timestep 0.1 \
       --num-steps 10000 \
       --output-prefix md_simulation

Outputs
  - Trajectory ``md_simulation_trajectory_100K_10000steps.traj``
  - Console logs of energy/temperature
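
Example (sketch): checking the topology flags
  ``--n-monomers`` and ``--n-atoms-monomer`` together must account for every atom
  in the PDB. The sketch below is illustrative only; the helper
  ``monomer_partitions`` is hypothetical and assumes that the atoms of each
  monomer are stored contiguously in file order.

  .. code-block:: python

     import numpy as np


     def monomer_partitions(n_atoms_total, n_monomers, n_atoms_monomer):
         """Return an (n_monomers, n_atoms_monomer) array of atom indices."""
         expected = n_monomers * n_atoms_monomer
         if n_atoms_total != expected:
             raise ValueError(
                 f"PDB has {n_atoms_total} atoms, flags imply {expected}"
             )
         return np.arange(n_atoms_total).reshape(n_monomers, n_atoms_monomer)


     # 1000 three-atom monomers, as in the usage example
     partitions = monomer_partitions(3000, n_monomers=1000, n_atoms_monomer=3)
     print(partitions[:2])      # atom indices of the first two monomers

     # Simulated time = number of steps x timestep (fs), converted to ps
     total_time_ps = 10000 * 0.1 / 1000.0
     print(total_time_ps)

  With the values from the usage example, 10000 steps of 0.1 fs correspond to
  1 ps of simulated time.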


calculator.py
-------------

Purpose
  Provides a generic ASE calculator for trained MMML models. It can be used as a Python module or from the command line.

Common flags
  - ``--checkpoint``: path to checkpoint file or directory
  - ``--cutoff``: neighbor list cutoff distance (Å)
  - ``--use-dcmnet-dipole``: use DCMNet dipole if available
  - ``--test-molecule``: test with predefined molecule (CO2, H2O, CH4, NH3)

Usage as module
  .. code-block:: python

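     from mmml.cli.calculator import MMMLCalculator
     from ase import Atoms

     # Load the trained model from a checkpoint directory
     calc = MMMLCalculator.from_checkpoint('checkpoints/my_model')

     # Linear CO2: carbon at the origin, oxygens at +/-1.16 Å along x
     atoms = Atoms('CO2', positions=[[0, 0, 0], [1.16, 0, 0], [-1.16, 0, 0]])
     atoms.calc = calc

     # Properties are evaluated lazily by the attached calculator
     energy = atoms.get_potential_energy()
     forces = atoms.get_forces()
     dipole = atoms.get_dipole_moment()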

Usage from command line
  .. code-block:: bash

     python -m mmml.cli.calculator \
       --checkpoint checkpoints/my_model \
       --test-molecule CO2

Outputs
  - Energy, forces, dipole moment, and atomic charges for test molecule


clean_data.py
-------------

Purpose
  Clean and validate NPZ datasets by removing structures with quality issues and keeping only essential training fields.

Common flags
  - ``input``: input NPZ file to clean
  - ``-o, --output``: output NPZ file (cleaned)
  - ``--max-force``: maximum allowed force magnitude (eV/Å), default: 10.0
  - ``--min-distance``: minimum allowed interatomic distance (Å), default: 0.4
  - ``--no-check-distances``: skip distance checks (faster, recommended)
  - ``--quiet``: suppress output

Essential fields kept
  - E, F, R, Z, N: Required for energy/force training
  - D, Dxyz: Optional dipole data
  - All other fields (cube_*, orbital_*, metadata) are removed

Usage (recommended - fast, keeps 99%+ data)
  .. code-block:: bash

     python -m mmml.cli.clean_data input.npz -o cleaned.npz --no-check-distances

Usage (stricter - removes overlapping atoms)
  .. code-block:: bash

     python -m mmml.cli.clean_data input.npz -o cleaned.npz \
       --max-force 10.0 --min-distance 0.4

Custom thresholds
  .. code-block:: bash

     python -m mmml.cli.clean_data input.npz -o cleaned.npz \
       --max-force 5.0 --min-distance 0.3

Outputs
  - Cleaned NPZ file with only essential fields (E, F, R, Z, N, D, Dxyz)
  - Invalid structures removed
  - Statistics about removed structures and failure reasons

Notes
  - Use ``--no-check-distances`` for faster cleaning and higher data retention (recommended)
  - Only removes clear SCF failures, keeping good training data
  - Automatically strips unnecessary QM fields (orbital energies, cube data, etc.)
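
Example (sketch): the kind of filtering the thresholds describe
  The following NumPy sketch shows one way the ``--max-force`` and
  ``--min-distance`` thresholds could be applied. It is not the implementation
  in ``clean_data``: the helpers ``clean`` and ``min_pair_distance`` are
  hypothetical, "force magnitude" is assumed to mean the per-atom force norm,
  and every array is assumed to have the structure index as its first axis.

  .. code-block:: python

     import numpy as np

     ESSENTIAL_FIELDS = ("E", "F", "R", "Z", "N", "D", "Dxyz")


     def min_pair_distance(positions):
         """Smallest interatomic distance (Å) within one structure."""
         if len(positions) < 2:
             return np.inf
         diff = positions[:, None, :] - positions[None, :, :]
         dist = np.linalg.norm(diff, axis=-1)
         upper = np.triu_indices(len(positions), k=1)   # each pair once
         return dist[upper].min()


     def clean(data, max_force=10.0, min_distance=0.4, check_distances=True):
         """Return (cleaned_dict, keep_mask) for a dict of per-structure arrays."""
         n_structures = len(data["E"])
         keep = np.ones(n_structures, dtype=bool)
         for i in range(n_structures):
             n = int(data["N"][i])                      # ignore padded atoms
             force_norm = np.linalg.norm(data["F"][i, :n], axis=-1)
             if force_norm.max() > max_force:
                 keep[i] = False
             elif check_distances and min_pair_distance(data["R"][i, :n]) < min_distance:
                 keep[i] = False
         # Keep only the essential fields that are present in the input
         cleaned = {k: np.asarray(data[k])[keep] for k in ESSENTIAL_FIELDS if k in data}
         return cleaned, keep


     # Toy data: three 2-atom structures. The second has a very large force,
     # the third has two nearly overlapping atoms.
     R = np.array([[[0, 0, 0], [1.0, 0, 0]],
                   [[0, 0, 0], [1.0, 0, 0]],
                   [[0, 0, 0], [0.1, 0, 0]]], dtype=float)
     F = np.zeros((3, 2, 3))
     F[1, 0, 0] = 50.0
     data = {"E": np.zeros(3), "F": F, "R": R,
             "Z": np.ones((3, 2), dtype=int), "N": np.array([2, 2, 2]),
             "cube_potential": np.zeros((3, 8))}

     cleaned, keep = clean(data)
     cleaned_fast, keep_fast = clean(data, check_distances=False)
     print(keep, keep_fast, sorted(cleaned))

  Skipping the distance check avoids the pairwise-distance computation, which
  scales quadratically with the number of atoms per structure.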


inspect_checkpoint.py
---------------------

Purpose
  Inspect model checkpoints and infer configuration from parameter structure.

Common flags
  - ``--checkpoint``: path to checkpoint file or directory
  - ``--save-config``: save inferred configuration to JSON file
  - ``--quiet``: suppress detailed output

Usage
  .. code-block:: bash

     python -m mmml.cli.inspect_checkpoint --checkpoint model/best_params.pkl

Save configuration
  .. code-block:: bash

     python -m mmml.cli.inspect_checkpoint --checkpoint model/ \
       --save-config inferred_config.json

Outputs
  - Total parameter count
  - Parameter structure breakdown by component
  - Inferred model configuration (features, iterations, etc.)
  - Optionally saves configuration to JSON


convert_npz_traj.py
-------------------

Purpose
  Convert NPZ datasets to ASE trajectory format for visualization.

Common flags
  - ``input``: input NPZ file
  - ``-o, --output``: output trajectory file (.traj, .xyz, .pdb, etc.)
  - ``--max-structures``: maximum number of structures to convert
  - ``--stride``: use every Nth structure
  - ``--quiet``: suppress output

Usage
  .. code-block:: bash

     python -m mmml.cli.convert_npz_traj data.npz -o trajectory.traj

Convert subset
  .. code-block:: bash

     python -m mmml.cli.convert_npz_traj data.npz -o traj.traj \
       --max-structures 100 --stride 10

To XYZ format
  .. code-block:: bash

     python -m mmml.cli.convert_npz_traj data.npz -o structures.xyz

Outputs
  - ASE trajectory file (can be viewed with ``ase gui``)
  - Removes padding automatically
  - Includes energies and forces if available
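
Example (sketch): frame selection and padding removal
  The effect of ``--stride`` and ``--max-structures`` can be pictured as index
  selection on the structure axis. This is a sketch under assumptions: the
  helpers ``select_frames`` and ``unpad`` are hypothetical, and the stride is
  assumed to be applied before the structure limit, which may differ from the
  actual tool.

  .. code-block:: python

     import numpy as np


     def select_frames(n_structures, stride=1, max_structures=None):
         """Indices of the structures to convert (stride first, then limit)."""
         indices = np.arange(n_structures)[::stride]
         if max_structures is not None:
             indices = indices[:max_structures]
         return indices


     def unpad(R, Z, N, i):
         """Atomic numbers and positions of structure i without padding."""
         n = int(N[i])
         return Z[i, :n], R[i, :n]


     # Toy dataset: 50 structures padded to 5 atoms, 3 real atoms each
     R = np.zeros((50, 5, 3))
     Z = np.zeros((50, 5), dtype=int)
     Z[:, :3] = [8, 1, 1]
     N = np.full(50, 3)

     frames = select_frames(len(N), stride=10, max_structures=100)
     numbers, positions = unpad(R, Z, N, frames[0])
     print(frames, numbers, positions.shape)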


split_dataset.py
----------------

Purpose
  Split datasets into train/valid/test sets with optional unit conversion.

Common flags
  - ``input``: input NPZ file (single file mode)
  - ``--efd``: energy/force/dipole file (multi-file mode)
  - ``--grid``: ESP grid file (multi-file mode)
  - ``-o, --output-dir``: output directory
  - ``--train, --valid, --test``: split ratios (default: 0.8/0.1/0.1)
  - ``--convert-units``: convert Hartree→eV and Hartree/Bohr→eV/Å
  - ``--seed``: random seed for reproducibility

Usage (single file)
  .. code-block:: bash

     python -m mmml.cli.split_dataset data.npz -o splits/

Usage (with unit conversion)
  .. code-block:: bash

     python -m mmml.cli.split_dataset data.npz -o splits/ --convert-units

Usage (multiple files - EFD + Grid)
  .. code-block:: bash

     python -m mmml.cli.split_dataset \
       --efd energies_forces_dipoles.npz \
       --grid grids_esp.npz \
       -o training_data --convert-units

Custom split ratios
  .. code-block:: bash

     python -m mmml.cli.split_dataset data.npz -o splits/ \
       --train 0.7 --valid 0.15 --test 0.15

Outputs
  - ``data_train.npz``, ``data_valid.npz``, ``data_test.npz``
  - ``split_indices.npz`` (reproducible split indices)
  - Optionally converts units to ASE standard (eV, eV/Å)
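
Example (sketch): seeded split and unit conversion
  A seeded random split and the Hartree-to-eV conversion can be sketched in a
  few lines of NumPy. The helpers ``split_indices`` and ``convert_units`` are
  hypothetical and only mirror the behaviour described above; in particular,
  assigning the rounding remainder to the test set is an assumption.

  .. code-block:: python

     import numpy as np

     HARTREE_TO_EV = 27.211386245988      # CODATA 2018
     BOHR_TO_ANGSTROM = 0.529177210903    # CODATA 2018


     def split_indices(n, train=0.8, valid=0.1, seed=0):
         """Shuffle range(n) reproducibly and cut it into three index arrays."""
         perm = np.random.default_rng(seed).permutation(n)
         n_train = int(round(train * n))
         n_valid = int(round(valid * n))
         # Whatever is left after train and valid goes to the test set
         return (perm[:n_train],
                 perm[n_train:n_train + n_valid],
                 perm[n_train + n_valid:])


     def convert_units(data):
         """Convert E from Hartree to eV and F from Hartree/Bohr to eV/Å."""
         out = dict(data)
         out["E"] = data["E"] * HARTREE_TO_EV
         out["F"] = data["F"] * HARTREE_TO_EV / BOHR_TO_ANGSTROM
         return out


     train_idx, valid_idx, test_idx = split_indices(100, seed=42)
     print(len(train_idx), len(valid_idx), len(test_idx))

     data = {"E": np.array([-1.0, -2.0]), "F": np.ones((2, 3, 3))}
     converted = convert_units(data)
     print(converted["E"][0], converted["F"][0, 0, 0])

  Re-running ``split_indices`` with the same seed yields the same indices, which
  is what makes a stored index file reusable across runs.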


explore_data.py
---------------

Purpose
  Explore and visualize NPZ datasets with statistical summaries.

Common flags
  - ``input``: input NPZ file
  - ``--detailed``: detailed analysis including geometry
  - ``--plots``: generate distribution plots
  - ``--output-dir``: output directory for plots
  - ``--quiet``: suppress output

Usage
  .. code-block:: bash

     python -m mmml.cli.explore_data data.npz

With plots
  .. code-block:: bash

     python -m mmml.cli.explore_data data.npz --plots --output-dir exploration

Detailed analysis
  .. code-block:: bash

     python -m mmml.cli.explore_data data.npz --detailed --plots --output-dir analysis

Outputs
  - Statistical summaries (energy, forces, dipoles)
  - Bond length analysis (if ``--detailed``)
  - Distribution plots (if ``--plots``)
  - Data quality checks
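
Example (sketch): a basic statistical summary
  The kind of summary reported for energies and forces can be reproduced by
  hand with NumPy. The helper ``summarize`` and the random toy data are
  hypothetical; the actual output of ``explore_data`` may contain different or
  additional quantities.

  .. code-block:: python

     import numpy as np


     def summarize(values):
         """Count, mean, standard deviation, minimum and maximum of an array."""
         values = np.asarray(values, dtype=float).ravel()
         return {
             "count": int(values.size),
             "mean": float(values.mean()),
             "std": float(values.std()),
             "min": float(values.min()),
             "max": float(values.max()),
         }


     # Toy data: 100 structures with 3 atoms each
     rng = np.random.default_rng(0)
     E = rng.normal(-10.0, 0.5, size=100)
     F = rng.normal(0.0, 1.0, size=(100, 3, 3))

     stats = {
         "E": summarize(E),
         "|F|": summarize(np.linalg.norm(F, axis=-1)),   # per-atom force norms
     }
     for name, summary in stats.items():
         print(name, summary)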


evaluate_model.py
-----------------

Purpose
  Evaluate trained models on datasets with detailed metrics (under development).

Common flags
  - ``--checkpoint``: model checkpoint directory or file
  - ``--data``: single dataset to evaluate
  - ``--train, --valid, --test``: evaluate on multiple splits
  - ``--detailed``: compute per-structure breakdown
  - ``--plots``: generate correlation and error distribution plots
  - ``--output-dir``: output directory for results

Usage
  .. code-block:: bash

     python -m mmml.cli.evaluate_model --checkpoint model/ --data test.npz

Multiple splits
  .. code-block:: bash

     python -m mmml.cli.evaluate_model --checkpoint model/ \
       --train train.npz --valid valid.npz --test test.npz \
       --output-dir evaluation

Outputs
  - Error metrics (MAE, RMSE, R²) for energy, forces, dipoles
  - Correlation plots (if ``--plots`` specified)
  - Per-structure analysis (if ``--detailed`` specified)
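
Example (sketch): computing MAE, RMSE and R²
  The error metrics listed above follow their standard definitions. The sketch
  below computes them with NumPy for a pair of reference and predicted arrays;
  the helper ``regression_metrics`` and the toy numbers are hypothetical and
  are not taken from ``evaluate_model``.

  .. code-block:: python

     import numpy as np


     def regression_metrics(y_true, y_pred):
         """Return MAE, RMSE and the coefficient of determination R^2."""
         y_true = np.asarray(y_true, dtype=float).ravel()
         y_pred = np.asarray(y_pred, dtype=float).ravel()
         err = y_pred - y_true
         ss_res = np.sum(err ** 2)                          # residual sum of squares
         ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
         return {
             "mae": float(np.mean(np.abs(err))),
             "rmse": float(np.sqrt(np.mean(err ** 2))),
             "r2": float(1.0 - ss_res / ss_tot),
         }


     # Toy reference and predicted energies
     metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
     print(metrics)

  Force arrays of shape ``(n_structures, n_atoms, 3)`` can be passed directly,
  since both inputs are flattened before the errors are computed.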


dynamics.py
-----------

Purpose
  Molecular dynamics and vibrational analysis with multiple framework support (ASE, JAX MD).

Common flags
  - ``--checkpoint``: model checkpoint directory or file
  - ``--molecule``: predefined molecule (CO2, H2O, CH4, NH3)
  - ``--structure``: load structure from file (XYZ, PDB, etc.)
  - ``--optimize``: optimize geometry
  - ``--frequencies``: calculate vibrational frequencies
  - ``--ir-spectra``: calculate IR spectrum (requires ``--frequencies``)
  - ``--md``: run molecular dynamics
  - ``--framework``: MD framework (ase or jaxmd)
  - ``--ensemble``: MD ensemble (nve, nvt, npt)
  - ``--temperature``: temperature (K)
  - ``--timestep``: MD timestep (fs)
  - ``--nsteps``: number of MD steps
  - ``--output-dir``: output directory

Usage - Optimization
  .. code-block:: bash

     python -m mmml.cli.dynamics --checkpoint model/ --molecule CO2 \
       --optimize --output-dir co2_opt

Usage - Vibrational analysis
  .. code-block:: bash

     python -m mmml.cli.dynamics --checkpoint model/ --molecule CO2 \
       --frequencies --ir-spectra --output-dir co2_vib

Usage - Molecular dynamics (ASE)
  .. code-block:: bash

     python -m mmml.cli.dynamics --checkpoint model/ --molecule CO2 \
       --md --framework ase --ensemble nvt --temperature 300 --nsteps 10000 \
       --output-dir co2_md

Usage - Full workflow
  .. code-block:: bash

     python -m mmml.cli.dynamics --checkpoint model/ --structure molecule.xyz \
       --optimize --frequencies --ir-spectra --md --nsteps 5000 \
       --output-dir full_analysis

Outputs
  - Optimized geometries (XYZ format)
  - Vibrational frequencies and normal modes
  - IR spectra (plots and data)
  - MD trajectories (ASE trajectory format)
  - Analysis results and statistics


plot_training.py
----------------

Purpose
  Visualize training history and analyze model parameters from saved checkpoints.

Common flags
  - ``history_files``: one or more training history JSON files
  - ``--compare``: compare two training runs (requires 2 history files)
  - ``--names``: display names for the compared runs (used with ``--compare``)
  - ``--params``: parameter pickle file(s) for analysis
  - ``--analyze-params``: analyze and plot parameter structure
  - ``--output-dir``: output directory for plots
  - ``--dpi``: DPI for output images (default: 150)
  - ``--format``: output format (png, pdf, svg, jpg)
  - ``--smoothing``: exponential smoothing factor (0-1, 0=none)
  - ``--summary-only``: only print text summary, no plots

Usage single model
  .. code-block:: bash

     python -m mmml.cli.plot_training \
       checkpoints/my_model/history.json \
       --output-dir plots --dpi 300

Usage comparison
  .. code-block:: bash

     python -m mmml.cli.plot_training \
       model1/history.json model2/history.json \
       --compare --names "Model A" "Model B" \
       --smoothing 0.9

With parameter analysis
  .. code-block:: bash

     python -m mmml.cli.plot_training history.json \
       --params best_params.pkl \
       --analyze-params

Outputs
  - Training history plots showing loss curves and metrics
  - Parameter analysis plots (if requested)
  - Text summary of training performance
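
Example (sketch): exponential smoothing of a loss curve
  The ``--smoothing`` factor can be understood as the weight given to the
  previous smoothed value. The sketch below uses the common exponential moving
  average form; the helper ``smooth`` is hypothetical, and whether
  ``plot_training`` uses exactly this formula is an assumption.

  .. code-block:: python

     def smooth(values, factor):
         """Exponential moving average; factor=0 returns the raw values."""
         if not 0.0 <= factor <= 1.0:
             raise ValueError("factor must lie in [0, 1]")
         smoothed = []
         last = None
         for value in values:
             # The first point is kept as is; later points blend with history
             last = value if last is None else factor * last + (1.0 - factor) * value
             smoothed.append(last)
         return smoothed


     losses = [1.0, 0.5, 0.7, 0.3, 0.4]
     print(smooth(losses, 0.0))   # unchanged
     print(smooth(losses, 0.9))   # heavily smoothed

  Larger factors suppress noise more strongly but also make the curve lag behind
  real changes in the loss.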


Minimal example files
---------------------

Model args (EF) JSON (if constructing a default model)::

  {
    "features": 64,
    "max_degree": 0,
    "num_basis_functions": 32,
    "num_iterations": 2,
    "n_res": 2,
    "cutoff": 8.0,
    "max_atomic_number": 28,
    "zbl": false,
    "efa": false
  }
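
Example (sketch): writing the model JSON from Python
  The JSON above can also be generated programmatically, which helps when
  scanning over settings. This is a minimal sketch using only the standard
  library; the file name ``model_ef.json`` and the temporary directory are
  assumptions chosen for illustration.

  .. code-block:: python

     import json
     import os
     import tempfile

     # Same keys and values as the JSON example above
     model_args = {
         "features": 64,
         "max_degree": 0,
         "num_basis_functions": 32,
         "num_iterations": 2,
         "n_res": 2,
         "cutoff": 8.0,
         "max_atomic_number": 28,
         "zbl": False,
         "efa": False,
     }

     path = os.path.join(tempfile.mkdtemp(), "model_ef.json")
     with open(path, "w") as handle:
         json.dump(model_args, handle, indent=2)

     # Read the file back to confirm that it round-trips
     with open(path) as handle:
         loaded = json.load(handle)
     print(loaded == model_args)

  The resulting path can then be passed to ``make_training`` through ``--model``.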

Dataset layout
  - Single ``npz`` file containing at least the arrays ``R`` (positions), ``Z`` (atomic numbers), ``E`` (energies), ``F`` (forces)
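
Example (sketch): building a toy dataset file
  A file with this layout can be written with ``numpy.savez``. The sketch below
  creates a small random dataset purely to show the arrays involved; the array
  shapes, the extra ``N`` array, and the file name ``toy_water.npz`` are
  assumptions, and the values carry no physical meaning.

  .. code-block:: python

     import os
     import tempfile

     import numpy as np

     n_structures, n_atoms = 4, 3
     rng = np.random.default_rng(0)

     dataset = {
         "R": rng.normal(size=(n_structures, n_atoms, 3)),       # positions
         "Z": np.tile(np.array([8, 1, 1]), (n_structures, 1)),   # atomic numbers
         "E": rng.normal(size=n_structures),                     # energies
         "F": rng.normal(size=(n_structures, n_atoms, 3)),       # forces
         "N": np.full(n_structures, n_atoms),                    # atoms per structure
     }

     path = os.path.join(tempfile.mkdtemp(), "toy_water.npz")
     np.savez(path, **dataset)

     # Reload and list the stored arrays with their shapes
     with np.load(path) as loaded:
         shapes = {key: loaded[key].shape for key in loaded.files}
     print(shapes)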

Minimal Slurm scripts
---------------------

Training (1 GPU)
  .. code-block:: bash

     #!/bin/bash
     #SBATCH -J mmml-train
     #SBATCH -A your_account
     #SBATCH -p gpu
     #SBATCH -N 1
     #SBATCH -c 8
     #SBATCH --gres=gpu:1
     #SBATCH -t 12:00:00
     #SBATCH -o slurm-%j.out

     module load cuda/12.1  # if needed
     source /path/to/venv/bin/activate

     srun python -m mmml.cli.make_training \
       --data /path/to/data.npz \
       --tag physnet_run1 \
       --num_epochs 20 \
       --batch_size 8 \
       --learning_rate 1e-3 \
       --ckpt_dir /scratch/$USER/mmml_checkpoints/physnet_run1

MD run (CPU or GPU)
  .. code-block:: bash

     #!/bin/bash
     #SBATCH -J mmml-md
     #SBATCH -A your_account
     #SBATCH -p gpu
     #SBATCH -N 1
     #SBATCH -c 8
     #SBATCH --gres=gpu:1
     #SBATCH -t 02:00:00
     #SBATCH -o slurm-%j.out

     module load cuda/12.1  # if needed
     source /path/to/venv/bin/activate

     srun python -m mmml.cli.run_sim \
       --pdbfile /path/to/box.pdb \
       --checkpoint /scratch/$USER/mmml_checkpoints/physnet_run1 \
       --n-monomers 1000 \
       --n-atoms-monomer 3 \
       --temperature 100 \
       --timestep 0.1 \
       --num-steps 10000 \
       --output-prefix md_simulation

Debug (short) job
  .. code-block:: bash

     #!/bin/bash
     #SBATCH -J mmml-debug
     #SBATCH -A your_account
     #SBATCH -p debug
     #SBATCH -N 1
     #SBATCH -c 4
     #SBATCH -t 00:10:00
     #SBATCH -o slurm-%j.out

     source /path/to/venv/bin/activate
     srun python -m mmml.cli.make_res --resname WAT --pdb water.pdb --out water_res


Notes
-----

- For reproducible results, set seeds where provided by flags.
- Ensure the box size in ``run_sim.py`` is physically reasonable for your system.
- If running on CPU-only nodes, remove CUDA module loads.
- The ``calculator.py`` module provides a generic interface that automatically detects model types.
- Use ``plot_training.py`` to visualize and compare training runs from JSON history files.
- All CLI tools support ``--help`` for detailed usage information.