What's the most challenging data set you've ever worked with, and how did you overcome the obstacles it presented?


4.9 (18)
  • BI analyst
  • Data engineer
  • Data scientist


The Most Challenging Aspects of Working with Data

The challenges of working with data can, in my opinion, be viewed from two distinct perspectives: technical and business.

Technical

From a technical standpoint, text data presents unique difficulties, especially as input to NLP methods. Raw text is often unstructured and noisy, and it requires extensive preprocessing before it is usable. This involves tasks like tokenization, removing stop words, handling misspellings, and normalizing text, all of which are crucial to the model’s effectiveness.
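As a rough illustration of the preprocessing steps listed above, here is a minimal Python sketch using only the standard library. The stop-word list, the misspelling map, and the function name `preprocess` are illustrative assumptions, not taken from any particular project; a real pipeline would likely use fuller, domain-reviewed resources.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (assumption); real projects use a fuller one.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

# Hypothetical misspelling map; in practice domain experts would curate this.
SPELLING_FIXES = {"recieve": "receive", "adress": "address"}


def preprocess(text: str) -> list[str]:
    """Normalize, tokenize, fix known misspellings, and drop stop words."""
    # Normalize Unicode forms and case so equivalent strings compare equal.
    text = unicodedata.normalize("NFKC", text).lower()
    # Tokenize on runs of ASCII letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text)
    # Replace known misspellings, leaving all other tokens untouched.
    tokens = [SPELLING_FIXES.get(token, token) for token in tokens]
    # Remove stop words last, after spelling has been corrected.
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The customer did not RECIEVE the invoice!"))
```

The order of the steps matters: normalizing before tokenizing keeps the token pattern simple, and fixing spelling before stop-word removal means a misspelled stop word could still be caught if it were added to the map.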

Another major technical hurdle is that many NLP methods are unsupervised, meaning they don’t rely on labeled datasets. Without labels there is no ground truth to score against, so the burden of evaluating and fine-tuning models shifts to human judgment, which can introduce subjectivity. Deciding how best to model the data and assess its quality requires deep expertise and careful evaluation, adding to the complexity.
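One way to make that subjectivity visible is to have two reviewers rate the same model outputs and measure how far their agreement exceeds chance. The sketch below computes Cohen's kappa for that purpose. It is my own suggestion rather than something the workflow above prescribes, and the labels "good" and "bad" are made-up example ratings.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for chance agreement."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(ratings_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four model outputs (for example, topic quality).
reviewer_1 = ["good", "good", "bad", "bad"]
reviewer_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(reviewer_1, reviewer_2))
```

A low value suggests the evaluation criteria are too loosely defined and should be tightened before the ratings are used to compare models.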

Business

From a business perspective, stakeholders often request access to "all available" historical data when embarking on analytics or data-driven projects. This demand, while seemingly reasonable, often leads to significant challenges in real-world settings. Over time, businesses undergo changes in processes, systems, and data storage practices. These changes create discrepancies in the data, leading to issues such as inconsistent business logic, inconsistent formats, missing fields, or outdated structures.

This accumulation of inconsistencies can make data integration a daunting task, requiring extensive data mapping and reconciliation efforts. In some cases, data from different time periods or systems may be completely incompatible, rendering historical analysis difficult or even impossible.
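To give a feel for what mapping and reconciliation involve, here is a small Python sketch that brings records from two eras of a system into one schema and reports which fields cannot be filled. Every name in it (the `legacy` and `current` sources, the field names, the date formats) is a hypothetical example and not taken from a real system.

```python
from datetime import date

# Hypothetical field mappings for two eras of a customer-contact table.
FIELD_MAPS = {
    "legacy": {"cust_no": "customer_id", "dt": "contact_date"},
    "current": {
        "customer_id": "customer_id",
        "contact_date": "contact_date",
        "channel": "channel",
    },
}
UNIFIED_FIELDS = ("customer_id", "contact_date", "channel")


def parse_date(value, source):
    """Assumption: legacy stores DD.MM.YYYY, the current system stores ISO dates."""
    if source == "legacy":
        day, month, year = value.split(".")
        return date(int(year), int(month), int(day))
    return date.fromisoformat(value)


def to_unified(record, source):
    """Map one source record to the unified schema and list the missing fields."""
    mapping = FIELD_MAPS[source]
    unified = {field: None for field in UNIFIED_FIELDS}
    for old_name, value in record.items():
        new_name = mapping.get(old_name)
        if new_name is not None:  # fields without a mapping are dropped
            unified[new_name] = value
    if unified["contact_date"] is not None:
        unified["contact_date"] = parse_date(unified["contact_date"], source)
    missing = [field for field, value in unified.items() if value is None]
    return unified, missing


row, missing = to_unified({"cust_no": "17", "dt": "03.02.2015"}, "legacy")
print(row, missing)
```

Returning the list of missing fields alongside each record makes it easy to count, per source and period, how much of the history can actually support a given analysis.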

Overcoming the obstacles

Technical) When I estimate tasks on NLP projects, I always account for the preprocessing effort and the iterative nature of the work. I also make sure domain experts who know the data are involved and can judge how useful the model results are.

Business) I usually go through each data domain thoroughly and let the business prioritize where to focus our efforts. If data is incompatible, business stakeholders can usually understand the shortcomings when they are explained in business terms. Do not say something like "there are empty fields in the data". It's usually much better to explain that "legacy business processes did not capture a specific piece of information", such as when the client called.
