Meet Sean Crimmin, an Analyst at Cruz Street and the lead analyst behind the HubSpot to AWS QuickSight Data Integration project. In this interview, Sean walks us through the key aspects of the project, sharing his insights on the challenges, automation strategies, and data transformation processes that made this integration successful.
Project Overview
Can you give us an overview of this project?
Sean explained that the primary goal of the project was to seamlessly integrate HubSpot data with AWS, enabling real-time data analysis and visualization in Amazon QuickSight. “We designed a scalable, automated analytics pipeline by extracting data from HubSpot, storing it in AWS S3 and Glue, and leveraging SPICE in QuickSight,” he shared. The result was a streamlined approach that allows real-time visualization of CRM data directly in QuickSight.
[Connection] How did you get access to HubSpot data?
Accessing HubSpot data required setting up a private app within HubSpot, which granted API access. Sean expressed that “using the API key, I referenced HubSpot’s API documentation to determine the key endpoints needed for data extraction,” he explained. His approach focused on retrieving data from the Contacts, Companies, and Deals endpoints through GET requests, ensuring all necessary information was available for analysis.
[Storage] Where did you store the data?
“Once the data was retrieved, I saved it in Parquet format, which is highly efficient for large-scale analytics,” Sean noted. He elaborated that these Parquet files were then written to AWS S3 and registered in the AWS Glue Data Catalog, making them easily queryable within AWS.
Data Refresh Strategy
How did you ensure the data remained up to date?
The team implemented an incremental update strategy to avoid unnecessary full refreshes. “Instead of pulling all data every time, we only retrieve new or updated records,” he said. This approach significantly reduced processing time and prevented data duplication.
Additionally, he emphasized the importance of handling date fields properly. “Date fields are a common issue in API data extraction, so I carefully inspected the data types in the extracted DataFrames before saving them to Parquet to ensure proper formatting,” he added.
[Automation] How did you automate the data pipeline?
To eliminate manual data extraction, Sean leveraged AWS Lambda for automation. “We set up a Lambda function that runs on a scheduled basis using Amazon EventBridge,” Sean explained. This function automates API requests to HubSpot, retrieves the latest data, and saves it as Parquet files in S3.
The integration with AWS Glue ensured that data remained queryable in QuickSight. “By using serverless execution, the automation is cost-efficient and requires no manual intervention while maintaining fresh data availability,” he shared.
[API] What tools and programming languages did you use?
Python was the primary language used for this project due to its flexibility and extensive support for data manipulation. “We used several key libraries to streamline the process,” he said, listing:
- requests – for making API calls
- pandas – for structuring and cleaning data
- pyarrow – for converting data to Parquet format
- boto3 – for uploading data to S3
- awswrangler – for interacting with AWS Glue
“These libraries allowed us to handle incremental updates efficiently and ensure smooth data ingestion,” he added.
[Transformations] What HubSpot base tables did you work with?
The primary HubSpot tables accessed included:
- Contacts – containing customer and lead details
- Companies – storing organizational information
- Deals – tracking sales opportunities
Careful data type handling was essential. “We converted date fields to datetime64
in Pandas to avoid format mismatches and ensured that ID and numeric fields were stored as int64
or float64
for accurate aggregations,” Sean explained.
Data Transformations & Joins
Did you perform any data transformations or joins?
“Absolutely,” Sean said. He detailed how he created a joined table combining Contacts and Companies data to enhance visualization and simplify analysis. “This table includes key details like contact names, email, company name, industry, and lifecycle stage, making filtering and segmentation in QuickSight much easier,” he explained.
The join was performed using the company ID as the key, ensuring a structured one-to-many relationship. “By merging these datasets, we unlocked valuable insights, such as tracking engagement levels by company and understanding customer interactions on an organizational level,” Sean added.
He also emphasized that all transformations, including field mappings, data cleaning, and joins, were handled at the DataFrame level before storage. “This approach gives us greater flexibility while optimizing performance,” he noted.
[Visualization] Why did you choose Amazon QuickSight?
QuickSight was the natural choice due to its seamless integration with AWS services. “It offers several key benefits,” Sean highlighted:
- SPICE storage boosts performance and allows for rapid analysis.
- Native support for Parquet files makes it easy to connect data from AWS Glue and S3.
- Dynamic filtering and drill-down capabilities enhance analytics flexibility.
“Once we structured the Parquet files correctly, I connected them to QuickSight, built a new analysis, and designed interactive dashboards to visualize the data effectively,” Sean shared.
Final Outcome
What was the final result of this project?
The end product was a fully automated, scalable, and efficient data pipeline. “This integration enables us to track key business metrics, analyze trends, and make data-driven decisions without any manual data handling,” he said.
The combination of AWS Lambda automation, AWS Glue cataloging, and QuickSight visualization has transformed how the team interacts with their HubSpot data. “With real-time data availability, we can now respond quickly to business needs and uncover insights that drive strategic decisions,” Sean concluded.