● 02 · research affiliate · 2026 — present
lbnl microbial trait pipelines
reproducible python data pipelines for heterogeneous microbial trait and signature data: standardizing identifiers, trait names, metadata columns, and study outputs into validation-ready long and wide tables.
- python
- pandas
- duckdb
- kg-microbe
- gtdb
- ncbi
at lawrence berkeley national laboratory, i work on data engineering problems inside computational biology: getting inconsistent microbial trait sources into tables that can actually be searched, compared, and analyzed.
pipeline work
- transformed heterogeneous microbial trait data into reproducible long-format and wide-format tsv outputs.
- standardized trait names, organism identifiers, metadata columns, and study outputs across kg-microbe, gtdb, ncbi, and related source data.
- used duckdb for fast local querying over biological datasets without forcing everything into a heavyweight database.
- added validation and testing workflows so generated outputs are reproducible and analysis-ready.
- worked around large-file and repository constraints for multi-gb biological data, including git lfs and github’s 100 mb per-file limit.
- integrated bugsigdb-style study and signature data into microbial trait workflows.
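as a rough illustration of the standardize, validate, and reshape steps listed above, here is a minimal pandas sketch. it is not the actual pipeline code: the column names, the `to_snake` / `standardize_long` / `validate_long` / `to_wide` helpers, and the sample records are all hypothetical, chosen only to show the long-to-wide pattern.

```python
import pandas as pd

# hypothetical raw records: inconsistent column names and trait spellings.
# the values are illustrative only, not taken from any real source table.
raw = pd.DataFrame(
    {
        "Organism ID": ["NCBITaxon:562", "NCBITaxon:562", "NCBITaxon:1423"],
        "Trait": ["Gram Stain", "oxygen requirement", "gram_stain"],
        "Value": ["negative", "facultative anaerobe", "positive"],
    }
)


def to_snake(name: str) -> str:
    """Lowercase a label and join its whitespace-separated words with '_'."""
    return "_".join(name.strip().lower().replace("-", " ").split())


def standardize_long(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names and trait names; return a sorted long table."""
    out = df.rename(columns=to_snake)
    out["trait"] = out["trait"].map(to_snake)
    # a deterministic row order keeps regenerated outputs byte-for-byte stable
    return (
        out.drop_duplicates()
        .sort_values(["organism_id", "trait"])
        .reset_index(drop=True)
    )


def validate_long(df: pd.DataFrame) -> None:
    """Fail loudly if the long table is not safe to pivot."""
    required = ["organism_id", "trait", "value"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df[required].isna().any().any():
        raise ValueError("null values in required columns")
    if df.duplicated(["organism_id", "trait"]).any():
        raise ValueError("conflicting values for the same organism and trait")


def to_wide(df: pd.DataFrame) -> pd.DataFrame:
    """Pivot to one row per organism and one column per trait."""
    wide = df.pivot(index="organism_id", columns="trait", values="value")
    wide.columns.name = None
    return wide.reset_index()


long_df = standardize_long(raw)
validate_long(long_df)
wide_df = to_wide(long_df)

# tab-separated text; pass a file path to to_csv to write it to disk instead
tsv_text = long_df.to_csv(sep="\t", index=False)
```

validating before the pivot is a deliberate choice in this sketch: `DataFrame.pivot` raises on duplicate index/column pairs anyway, but an explicit check gives a clearer error message about which assumption the source data broke.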
the work is less about “using pandas” and more about making messy scientific data durable enough for downstream analysis.