Quality filtering and deduplication pipeline

Summary
A pipeline that quality-filters and deduplicates data in preparation for LLM training.
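
To make the two stages concrete, here is a minimal sketch of what such a pipeline could look like. It is not this project's implementation: the function names, the word-count and alphabetic-ratio thresholds, and the choice of exact-match deduplication over normalized text are all assumptions made for illustration. Production pipelines often use model-based quality scoring and near-duplicate detection instead.

```python
import hashlib
from typing import Iterable, Iterator


def passes_quality_filter(text: str, min_words: int = 5,
                          min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: keep documents that have enough words
    and consist mostly of alphabetic characters."""
    if len(text.split()) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) >= min_alpha_ratio


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def filter_and_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Yield documents that pass the quality filter, skipping any whose
    normalized text has already been seen (exact-match deduplication)."""
    seen = set()
    for doc in docs:
        if not passes_quality_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc
```

Filtering before hashing avoids spending work on documents that would be discarded anyway. Storing digests rather than full texts keeps the memory footprint of the seen-set small.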