We are seeking a Data Engineer / Python Developer to lead the data acquisition and processing efforts for a high-stakes, agentic AI chatbot in the healthcare domain. This is not a traditional BI or ETL role; you will not be building dashboards or moving data for analytics. Instead, you will architect a robust, modular engine capable of crawling, parsing, and normalizing vast amounts of structured and unstructured healthcare data from diverse sources, ranging from dynamic JavaScript websites and PDFs to proprietary vendor formats.
Location: Toronto, ON (1 day/week onsite)
Key Responsibilities
Data Collection & Web Crawling
Advanced Web Scraping: Build and maintain scalable scrapers for static HTML and dynamic, JavaScript-heavy websites using Scrapy and BeautifulSoup.
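To give candidates a concrete sense of the work, here is a minimal sketch of such a spider; the site, selectors, and field names are hypothetical placeholders, not one of our actual sources:

```python
# Illustrative sketch only: a minimal Scrapy spider of the kind this role
# involves. The URL, CSS selectors, and field names are placeholders.
import scrapy


class ProviderDirectorySpider(scrapy.Spider):
    name = "provider_directory"
    start_urls = ["https://example.com/providers"]  # hypothetical source

    def parse(self, response):
        # Yield one record per listing card; selectors are assumptions.
        for card in response.css("div.provider-card"):
            yield {
                "name": card.css("h2::text").get(),
                "specialty": card.css(".specialty::text").get(),
                "source_url": response.url,
            }
        # Follow pagination until the site runs out of pages.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```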
Multi-Format Ingestion: Develop custom parsers to ingest and normalize data from XML, RSS feeds, JSON, PDFs, database dumps, and non-standard vendor formats.
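In practice, "custom parsers" means a registry of per-format readers that all emit the same intermediate record shape. A minimal sketch, assuming RSS-like XML and JSON-array inputs (the field names are illustrative assumptions):

```python
# Sketch of format dispatch: each source format is normalized to one
# common record shape. Schemas and field names are assumptions.
import json
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator


def parse_json(path: Path) -> Iterator[dict]:
    # Assumes the vendor file is a JSON array of objects.
    for obj in json.loads(path.read_text()):
        yield {"title": obj.get("title"), "body": obj.get("body")}


def parse_xml(path: Path) -> Iterator[dict]:
    # Assumes RSS-like <item><title/><description/></item> elements.
    for item in ET.parse(path).getroot().iter("item"):
        yield {"title": item.findtext("title"),
               "body": item.findtext("description")}


PARSERS = {".json": parse_json, ".xml": parse_xml}


def ingest(path: Path) -> Iterator[dict]:
    # Choose a parser by file suffix; unknown formats fail loudly.
    try:
        parser = PARSERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"no parser registered for {path.suffix}")
    yield from parser(path)
```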
Source Management: Manage a large catalog of external public and proprietary data sources, ensuring raw data is persisted reliably.
Pipeline Architecture & Normalization
Modular Engineering: Design and implement modular, reusable Python components to transform raw, heterogeneous data into standardized intermediate formats (e.g., JSON Lines, Parquet).
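A minimal sketch of that normalization target, assuming pandas with pyarrow installed; file names and sample records are placeholders:

```python
# Sketch: land normalized records in JSON Lines (easy to inspect and
# stream) and Parquet (columnar and typed for bulk re-processing).
import json

import pandas as pd

records = [
    {"source": "example_feed", "title": "Clinic hours", "body": "..."},
    {"source": "example_pdf", "title": "Coverage policy", "body": "..."},
]

# JSON Lines: one self-contained JSON object per line.
with open("normalized.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Parquet via pandas (requires pyarrow or fastparquet).
pd.DataFrame(records).to_parquet("normalized.parquet", index=False)
```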
Orchestration: Build and manage automated pipelines using Apache Airflow that can re-run processes, detect changes at the source, and perform incremental updates.
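A skeletal DAG in Airflow's TaskFlow style illustrates the change-detection and incremental-update pattern; the DAG name, schedule, and task bodies are hypothetical stubs, not our production pipeline:

```python
# Sketch of a re-runnable, incremental Airflow pipeline. All names and
# task bodies are hypothetical stubs.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def source_refresh():
    @task
    def detect_changes() -> list[str]:
        # Compare checksums/ETags against the previous run and return
        # only the sources that actually changed (stubbed here).
        return ["example_feed"]

    @task
    def reprocess(changed: list[str]) -> None:
        # Incremental update: re-crawl and re-normalize only the delta.
        for source in changed:
            print(f"re-ingesting {source}")

    reprocess(detect_changes())


source_refresh()
```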
AI Integration Support: Collaborate with AI engineers to implement data chunking, vectorization logic, and ingestion into vector databases.
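As a sketch of the chunking step that precedes vectorization, the function below splits a normalized document into overlapping word windows; the window and overlap sizes are arbitrary assumptions, not a prescribed configuration:

```python
# Sketch: overlapping word-window chunking. Each chunk would then be
# embedded and upserted into the vector store. Sizes are assumptions.
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i : i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```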
Security & Compliance: Implement rigorous data handling protocols to manage PII (Personally Identifiable Information) and PHI (Protected Health Information) within a secure healthcare environment.
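Purely for illustration, a naive masking pass of the kind that might sit at the edge of a secure zone is sketched below; the two regex patterns are assumptions, and real PHI handling requires vetted tooling and policy review, not this sketch:

```python
# Illustrative only: naive regex masking of two common PII patterns.
# Not a substitute for compliance-grade de-identification.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def mask_pii(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(mask_pii("Contact jane@example.com, SSN 123-45-6789"))
# -> Contact [EMAIL], SSN [SSN]
```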
Core Skills
Expert Python: Deep experience in backend Python development with a focus on data processing.
Web Scraping Stack: Mastery of Scrapy, BeautifulSoup, and tools for handling dynamic content (e.g., Selenium, Playwright, or headless browsers).
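Where a site renders its content client-side, a headless browser fetches the rendered DOM before parsing. A minimal Playwright sketch, with a placeholder URL and selector:

```python
# Sketch: fetch JavaScript-rendered HTML with a headless browser, then
# hand it to a normal parser. URL and selector are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/providers")
    page.wait_for_selector("div.provider-card")  # wait for client render
    html = page.content()  # fully rendered DOM, ready for BeautifulSoup
    browser.close()
```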
Orchestration: Professional experience building and monitoring pipelines in Apache Airflow.
Data Formatting: Proficiency in handling diverse serialization formats (JSONL, Parquet, XML) and unstructured data (PDF parsing).
Experience & Qualifications
Healthcare Domain: Prior experience working with sensitive data, including PII/PHI, and adhering to security and compliance requirements (e.g., HIPAA).
Cloud Platforms: Strong preference for candidates with hands-on experience in GCP (BigQuery, Cloud Functions, GCS).
Collaborative Mindset: Proven ability to work in a team-oriented environment, collaborating closely with AI and Backend engineers.
Best Practices: Strong grasp of software engineering principles (DRY, SOLID) and data engineering patterns.