Data Engineer / Machine Learning Engineer at Threadloom
Palo Alto, California, United States
🇺🇸 (Posted Jul 1, 2019)
About the company
We curate community content and tell contextual stories. Our curation services are used by 1,000+ forums reaching millions of people. We started in 2015 and have two locations: Palo Alto and Bellevue. Our team comes from a mix of startup and big tech backgrounds, but we all share a desire to build a better Internet. We are Stanford StartX alumni (2016).
Threadloom is looking for an experienced data engineer with strong machine learning experience.
This is a foundational role. You will be Threadloom's first engineer solely responsible for building and extending our processing pipelines. Working closely with Product, Ops, and Eng, you will design and develop the data warehouses used by all of our services and products. This includes owning the processing of the billions of documents that power Threadloom Search, Newsletter, and upcoming consumer products.
The ideal candidate is passionate about building large-scale, high-volume pipelines that manage and store mission-critical data. This person is conversant with current cloud platforms for parallel processing and storage, and can readily translate product and user requirements into data storage and management requirements. They should also have experience building machine learning models that classify and rank content and predict user preferences.
The ideal candidate also cares about our end users and is a careful steward of their data: they are comfortable with modern user privacy standards and have experience applying them in real-world situations.
Skills & requirements
3+ years of relevant work experience
Launching consumer products that people love, at scale
Designing and implementing data pipelines and warehouses
Optimizing servers and pipelines to manage operational costs at scale
Building systems to handle user authentication and PII (e.g. Firebase, OAuth, GDPR)
Deploying production cloud services (e.g., Google Cloud, AWS, Azure)
Languages and tools
Python required, Scala/Java desired
Fluency with the latest tools, libraries, and infrastructure for building and maintaining production-level data pipelines and storage, including:
distributed data processing frameworks (e.g., Hadoop, Spark, Flink, Apache Beam)
SQL and NoSQL databases (e.g., MySQL, Postgres, Cassandra, Redis)
stream processing frameworks (e.g., Kafka, Storm, Spark)
search engines (e.g., Elasticsearch, Solr)
Building and launching ML models in a production environment
Scaling experimental models from proof-of-concept to live products that handle large-scale data
Comfortable building scalable backends, RESTful web services, and APIs