Key Data Feed Management Challenges: Metadata Discovery Issues

Published 11 December 2015 | Updated 11 June 2020 | Lexy Mayko

Running a complex business with numerous branches or points of sale scattered across different geographical locations requires well-organized data exchange. Equally vital is the timely and accurate centralized collection of data from all sources, followed by its analysis.

This is a challenging task because data delivery mechanisms vary with the source and type of data. Businesses therefore need systems that can manage all types of data streams, or feeds, as they are typically called. E-commerce is a sphere with great demand for such software. Contrary to popular belief, gathering information from all sources in a network of e-stores and marketplaces takes far more than a few clicks, because different shopping carts expose different data retrieval mechanisms.

However, it’s next to impossible to design a system that would satisfy the demands of various organizations equally well. Data feed management in each company has its peculiarities and requires a personalized approach to creating and configuring the data feed management system.

These conditions put data feed management solutions in high demand on the market. If you are planning to start developing systems for managing data feeds, or are already running this type of business, you should know that it is far from being a piece of cake. There are numerous hidden aspects that can become stumbling blocks if you are unprepared. That is why we have prepared an account of the most common data feed management challenges to help you get ready for them in your work. This post is the first in our series on data feed management challenges, and the rest will follow shortly.

Feed metadata discovery challenges

So, what makes this “data about data” so important, and what pitfalls does it present for the functioning of the whole data feed management system (DFMS)?

Metadata is crucial for the system to correctly interpret the data files. It includes the following information:

feed file naming code, which can convey the timestamp, the object that generated the report, its attributes, the file format, etc.
file arrival patterns, which describe the feed generation frequency, intervals, the maximum delay between an event and the generation of a report about it, etc.
file format specification, which the system needs in order to parse the records
data semantics, which give the interpretation of each data field, such as data types, encodings, domains, etc.
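To make the four kinds of metadata concrete, here is a minimal sketch of how a DFMS might hold them for one feed. The class names, field names, and the "orders" feed with its file naming pattern are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class FieldSpec:
    """Data semantics for one field of a record."""
    name: str
    dtype: str               # e.g. "int", "str", "decimal"
    encoding: str = "utf-8"
    domain: str = ""         # free-text description of allowed values


@dataclass
class FeedMetadata:
    name_pattern: str               # file naming code, written as a regular expression
    file_format: str                # format specification needed to parse records
    generation_interval: timedelta  # arrival pattern: how often a file is expected
    max_delay: timedelta            # arrival pattern: tolerated lag between event and report
    fields: List[FieldSpec] = field(default_factory=list)


# Hypothetical feed: one CSV file per store every hour.
orders = FeedMetadata(
    name_pattern=r"orders_(?P<store>\w+)_(?P<ts>\d{8}T\d{4})\.csv",
    file_format="csv",
    generation_interval=timedelta(hours=1),
    max_delay=timedelta(minutes=15),
    fields=[
        FieldSpec("order_id", "int"),
        FieldSpec("currency", "str", domain="ISO 4217 code"),
    ],
)

# The naming code alone already yields the generating object and the timestamp.
match = re.fullmatch(orders.name_pattern, "orders_store12_20200611T0930.csv")
store, timestamp = match.group("store"), match.group("ts")
```

When any of these attributes is unknown, the system has to guess it, which is where the challenges described next come from.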

Typically, issues occur because the metadata is unavailable, which makes it harder for the system to interpret feed data files correctly. Below, we go over the key challenges caused by missing metadata.

Metadata incomplete or missing

Aggregated feeds

Often, a source data feed consists of a number of uniform subfeeds with complementary information. Over time, the number of subfeed generators varies: new ones are added, and some become temporarily or permanently unavailable.

What is more, the data feed described above can, in turn, be part of a larger group of feeds with related information. Likewise, the number of feeds in the group tends to change over time, which makes it challenging to discover the metadata for individual feeds.

Insufficient communication with feed sources

This issue is common when source feeds are managed by different organizations, or even by different units within one large company. They are not obliged to document the metadata of their source feeds, so they seldom spend time on it. The problem deepens if there is little or no communication between company units.

Discovering metadata via feed history browsing

When aggregated data feeds are used heavily and source feed managers provide too little information, individual subscribers are left to deal with missing metadata on their own. Traditionally, they try to discover the necessary information by browsing the data feed history.

However, even in the relatively simple case of an aggregated feed composed of uniform subfeeds, it is easy to make wrong assumptions, especially when the feed history is scarce. For a group of feeds that are only loosely related to each other, the risk of wrong guesses about the structure of individual data feeds is much higher. Unfortunately, wrong assumptions are frequently discovered only when users of the apps that receive feed data start complaining about incorrect information in the generated reports.
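As a rough illustration of how scarce history leads to a wrong guess, the sketch below infers a feed's generation interval as the median gap between file arrivals. The function name and the timestamps are made up for the example, and the median is just one plausible heuristic, not a method the article prescribes.

```python
from datetime import datetime, timedelta
from statistics import median


def infer_interval(arrivals):
    """Guess the generation interval as the median gap between arrivals."""
    ordered = sorted(arrivals)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two arrivals to infer an interval")
    return median(gaps)


day = datetime(2020, 6, 11)

# A longer history of an hourly feed: seven consecutive files.
full_history = [day + timedelta(hours=h) for h in range(7)]

# A scarce history of the same feed: the 01:00 file never arrived.
scarce_history = [day, day + timedelta(hours=2)]

full_guess = infer_interval(full_history)      # one hour
scarce_guess = infer_interval(scarce_history)  # two hours, which is wrong
```

With only two files to look at, the subscriber would conclude that the feed is generated every two hours and would then stop expecting half of its files.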

Feed changes

During their lifetime, data feeds undergo significant changes. This presents a challenge for subscribers who discover metadata by browsing feed history. First of all, they need to create feed definitions that are specific enough to exclude irrelevant files yet still work as the feed evolves.

Feed changes are virtually inevitable and affect file content, structure, arrival patterns, etc. If the definition is not robust enough to withstand them, either new files are missed or unwanted files come in. The longer the issue goes unnoticed, the more incorrect data is received and processed, which compounds the undesirable consequences.

Reasons for missing or unwanted files in the feed

If the data feed definition is too specific or too generic, files can be misclassified: a definition that is too specific produces false negatives (missed files), and one that is too generic produces false positives (unwanted files). This happens under the conditions listed below.

Changes in file naming convention

If the definition was generated from a sample with a certain file naming convention, and that convention later changes because the software generating the data feeds was updated, files with the new names will no longer match the definition.
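The following sketch shows such a false negative, assuming the feed definition is a regular expression over file names. Both file names and the "sales" feed are hypothetical.

```python
import re

# Definition derived from sample files such as "sales_store12_20200611.csv".
DEFINITION = re.compile(r"sales_store\d+_\d{8}\.csv")


def matches(filename):
    """Return True if the file name belongs to the feed according to the definition."""
    return DEFINITION.fullmatch(filename) is not None


old_name = "sales_store12_20200611.csv"
# The same feed after a hypothetical generator update that switched separators.
new_name = "sales-store12-2020-06-11.csv"

old_accepted = matches(old_name)  # True
new_accepted = matches(new_name)  # False: a valid file is silently dropped
```

Nothing fails loudly here. The new files simply stop being picked up, which is why the problem can go unnoticed for a long time.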

Introduction of new data feed sources

If the list of data sources is hardcoded in the definition, it remains valid only until new sources that are not on the list start contributing to the feed.
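A small sketch of this situation, again assuming regex-based definitions: a definition with a hardcoded source list misses files from a new source, while a looser pattern can at least report which unknown sources have appeared. The source names, the "inventory" feed, and the helper function are illustrative assumptions.

```python
import re

KNOWN_SOURCES = ("berlin", "paris")

# Definition with the source list hardcoded into the pattern.
hardcoded = re.compile(r"inventory_(%s)_\d{8}\.csv" % "|".join(KNOWN_SOURCES))

# Looser pattern used only to spot sources that are not on the list.
open_ended = re.compile(r"inventory_(?P<source>[a-z]+)_\d{8}\.csv")


def unknown_sources(filenames):
    """Return sources seen in the feed that are missing from KNOWN_SOURCES."""
    found = set()
    for name in filenames:
        m = open_ended.fullmatch(name)
        if m and m.group("source") not in KNOWN_SOURCES:
            found.add(m.group("source"))
    return sorted(found)


arrivals = ["inventory_berlin_20200611.csv", "inventory_madrid_20200611.csv"]

# Files the hardcoded definition fails to recognize.
missed = [name for name in arrivals if not hardcoded.fullmatch(name)]
new_sources = unknown_sources(arrivals)
```

Reporting unknown sources instead of silently dropping their files gives the subscriber a chance to update the definition before reports go wrong.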

Definition based on an invalid data sample

This most commonly occurs with complex data feeds consisting of a number of loosely connected subfeeds. For the definition to be correct, all subfeeds must be present in the data sample. Otherwise, the subfeeds that were not included in the sample can later be marked as not matching the definition.
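To illustrate, the sketch below derives a very simple definition (the set of subfeed prefixes) from a sample and then classifies later files against it. The prefix-based definition and the three subfeed names are assumptions made for the example.

```python
def derive_prefixes(sample):
    """Collect the subfeed prefix (the text before the first underscore) of each sample file."""
    return {name.split("_", 1)[0] for name in sample}


def belongs_to_feed(filename, prefixes):
    """Classify a file by checking its prefix against the derived definition."""
    return filename.split("_", 1)[0] in prefixes


# The hypothetical "purchases" subfeed produced no file on the day the sample was taken.
sample = ["clicks_20200610.log", "views_20200610.log"]
prefixes = derive_prefixes(sample)

clicks_ok = belongs_to_feed("clicks_20200611.log", prefixes)        # True
purchases_ok = belongs_to_feed("purchases_20200611.log", prefixes)  # False: subfeed was not in the sample
```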

Using wildcards in the feed definition

A wildcard (*) can be used to replace an object name when composing a feed definition. It is mostly used to avoid false negatives in file classification when the feed changes. However, it also brings the risk of including many irrelevant files in the feed.

Wildcards are also used to simplify the definition by replacing the list of values in some fields of the filename. This, too, makes the definition prone to matching files that do not belong to the feed.
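The trade-off can be seen with shell-style patterns from Python's standard `fnmatch` module. The file names and both patterns below are hypothetical, and the tighter pattern is only one possible compromise.

```python
from fnmatch import fnmatchcase

incoming = [
    "orders_store1_20200611.csv",
    "orders_store2_20200611.csv",
    "orders_backup_old.csv",   # irrelevant file dropped into the same directory
    "orders_test_dump.csv",    # irrelevant file dropped into the same directory
]

# Broad definition: the wildcard replaces everything after the feed name.
broad = "orders_*.csv"
accepted_broad = [name for name in incoming if fnmatchcase(name, broad)]

# Tighter definition: the wildcard covers only the store id,
# and the date field must be exactly eight characters long.
tight = "orders_store*_????????.csv"
accepted_tight = [name for name in incoming if fnmatchcase(name, tight)]
```

The broad pattern accepts all four files, including the two irrelevant ones, while the tighter pattern keeps only the two store files. The tighter pattern would in turn miss files if the store prefix or the date format ever changed, which is the false negative risk the wildcard was meant to avoid.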

Conclusion

Obviously, neither feed subscribers nor providers can solve the issues mentioned above on their own, since a solution requires communication between both sides. Moreover, even when a problem is detected, it is difficult to fix because no corresponding mechanism exists. Therefore, it is up to DFMS designers to devise mechanisms for detecting feed metadata issues as well as scenarios for resolving them. This will add great value to their products and facilitate their business growth.

If you're a DFM provider wondering how to streamline integration with shopping carts, consider API2Cart. We offer a unified API to connect to 40+ platforms like Shopify, Bigcommerce, Magento, PrestaShop, OpenCart, and many others. If you have any questions or would like to discuss how the power of the API could fit your business, schedule a call or leave us a message.