
Python Developer

Hays • 🌐 In Person

Posted 3 days, 16 hours ago

Job Description

We are seeking a Data Engineer / Python Developer to lead the data acquisition and processing efforts for a high-stakes, agentic AI chatbot in the healthcare domain. This is not a traditional BI or ETL role; you will not be building dashboards or moving data for analytics. Instead, you will architect a robust, modular engine capable of crawling, parsing, and normalizing vast amounts of unstructured and structured healthcare data from diverse sources, ranging from dynamic JavaScript websites and PDFs to proprietary vendor formats.

Location: Toronto, ON (1 day/week onsite)

Key Responsibilities

Data Collection & Web Crawling

Advanced Web Scraping: Build and maintain scalable scrapers for HTML and dynamic, JavaScript-heavy websites using Scrapy and BeautifulSoup.
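To illustrate the parsing side of this work, here is a minimal sketch using BeautifulSoup. The markup, the `provider-card` class, and the field names are all hypothetical; a real scraper would fetch pages via Scrapy and handle pagination, retries, and robots rules.

```python
# Minimal sketch: extract structured records from static HTML with BeautifulSoup.
# The markup and CSS class names below are invented for illustration.
from bs4 import BeautifulSoup

html = """
<div class="provider-card"><h2>Clinic A</h2><span class="phone">555-0100</span></div>
<div class="provider-card"><h2>Clinic B</h2><span class="phone">555-0101</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
records = [
    {
        "name": card.h2.get_text(strip=True),
        "phone": card.find("span", class_="phone").get_text(strip=True),
    }
    for card in soup.find_all("div", class_="provider-card")
]
```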

Multi-Format Ingestion: Develop custom parsers to ingest and normalize data from XML, RSS feeds, JSON, PDFs, database dumps, and non-standard vendor formats.
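The core idea behind multi-format ingestion is that every source-specific parser converges on one intermediate shape. A stdlib-only sketch, with illustrative field names:

```python
# Sketch: the same logical record arrives as XML and as JSON; per-format
# parsers normalize both into one intermediate dict shape.
import json
import xml.etree.ElementTree as ET

def from_xml(payload: str) -> dict:
    root = ET.fromstring(payload)
    return {"id": root.findtext("id"), "title": root.findtext("title")}

def from_json(payload: str) -> dict:
    obj = json.loads(payload)
    return {"id": str(obj["id"]), "title": obj["title"]}

xml_rec = from_xml("<item><id>42</id><title>Flu clinic hours</title></item>")
json_rec = from_json('{"id": 42, "title": "Flu clinic hours"}')
assert xml_rec == json_rec  # both sources converge on one schema
```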

Source Management: Manage a large catalog of external public and proprietary data sources, ensuring raw data is persisted reliably.

Pipeline Architecture & Normalization

Modular Engineering: Design and implement modular, reusable Python components to transform raw, heterogeneous data into standardized intermediate formats (e.g., JSON Lines, Parquet).
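The JSON Lines half of that intermediate layer is simple enough to sketch with the stdlib; Parquet output would typically go through a library such as pyarrow instead.

```python
# Sketch: persist normalized records as JSON Lines (one JSON object per line),
# a common intermediate format between parsing and downstream ingestion.
import io
import json

records = [{"id": "1", "title": "a"}, {"id": "2", "title": "b"}]

buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Reading the format back is symmetric: one json.loads per line.
lines = buf.getvalue().splitlines()
round_tripped = [json.loads(line) for line in lines]
```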

Orchestration: Build and manage automated pipelines using Apache Airflow that can re-run processes, detect changes at the source, and perform incremental updates.
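The change-detection logic such a pipeline runs can be sketched without Airflow itself: hash each source payload and re-process only when the hash differs from the previous run. The in-memory dict below stands in for whatever persistent state store an Airflow task would actually use.

```python
# Sketch of incremental-update logic an orchestrated task might run:
# skip a source when its content hash matches the last successful run.
import hashlib

state: dict[str, str] = {}  # source_id -> content hash from the last run

def needs_update(source_id: str, payload: bytes) -> bool:
    digest = hashlib.sha256(payload).hexdigest()
    if state.get(source_id) == digest:
        return False  # unchanged since last run; skip re-processing
    state[source_id] = digest
    return True

assert needs_update("feed-a", b"v1") is True   # first sighting
assert needs_update("feed-a", b"v1") is False  # unchanged
assert needs_update("feed-a", b"v2") is True   # source changed
```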

AI Integration Support: Collaborate with AI engineers to implement data chunking, vectorization logic, and ingestion into vector databases.
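Chunking is the step that precedes embedding and vector-store ingestion. A minimal fixed-size sketch with overlap (the sizes are illustrative; production chunkers usually split on sentence or token boundaries):

```python
# Sketch: fixed-size character chunking with overlap between adjacent chunks,
# so context spanning a chunk boundary is not lost before embedding.
def chunk(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(50))
chunks = chunk(text)
assert len(chunks) == 3
assert chunks[1][:5] == chunks[0][-5:]  # adjacent chunks share 5 characters
```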

Security & Compliance: Implement rigorous data handling protocols to manage PII (Personally Identifiable Information) and PHI (Protected Health Information) within a secure healthcare environment.
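As a sketch only: one building block of PII handling is a redaction pass that masks obvious patterns before data leaves a trusted boundary. Real PHI compliance (e.g., HIPAA) requires far more than regex matching; the patterns below are illustrative.

```python
# Sketch: regex-based masking of two obvious PII patterns. Illustrative only;
# real compliance work involves classification, access control, and auditing.
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

assert redact("Contact jane@example.com, SSN 123-45-6789") == "Contact [EMAIL], SSN [SSN]"
```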

Core Skills

Expert Python: Deep experience in backend Python development with a focus on data processing.

Web Scraping Stack: Mastery of Scrapy, BeautifulSoup, and tools for handling dynamic content (e.g., Selenium, Playwright, or headless browsers).

Orchestration: Professional experience building and monitoring pipelines in Apache Airflow.

Data Formatting: Proficiency in handling diverse serialization formats (JSONL, Parquet, XML) and unstructured data (PDF parsing).

Experience & Qualifications

Healthcare Domain: Prior experience working with sensitive data, including PII/PHI, and adhering to security and compliance standards (e.g., HIPAA).

Cloud Platforms: Strong preference for candidates with hands-on experience in GCP (BigQuery, Cloud Functions, GCS).

Collaborative Mindset: Proven ability to work in a team-oriented environment, collaborating closely with AI and backend engineers.

Best Practices: Strong grasp of software engineering principles (DRY, SOLID) and data engineering patterns.
