HYLA TechTalk: We are continuing our series of blog posts on the technology we develop and the data we use across our analytics, trade-in, insurance, and processing solutions, which serve carriers, OEMs, retailers, and teams throughout HYLA.
HYLA manages millions of mobile devices through its reverse logistics lifecycle from collection, to grading, to maximum asset value recovery. For us to manage the process effectively and ensure that our customers get the maximum recovery value on their assets, we utilize data and information collected from our customers, vendors, various industry analysts, and mobile device buyers from over 50 different countries. HYLA collects close to 1M records of data related to mobile device pricing on a daily basis.
Mobile devices do not follow a universal model catalog and naming convention. Descriptions of models provided by different sources vary greatly, and extracting model name, model number, storage capacity, carrier, GSM/CDMA, and supported LTE bands can become extremely difficult. For us to effectively utilize this information, we need to be able to map these 1M records on a daily basis to HYLA’s product catalog.
While many machine learning algorithms have been around for a long time and are widely used at HYLA, we continually explore and adopt state-of-the-art algorithms that improve process efficiency and analytic results. In the product catalog mapping initiative, newer machine learning algorithms help us reach a higher level of process efficiency and mapping accuracy.
Product catalog mapping translates external product description strings from various data sources into the standard HYLA product catalog. It is the first step in data processing, so the process and algorithms behind it must be both efficient and accurate. With millions of data records pouring in daily, a bottleneck in mapping blocks every later processing step: automated routine business reports are not produced and predictions are not updated on schedule. And if the mapping algorithm performs poorly, the end results are unreliable because products are not mapped correctly.
We tested various classification platforms and libraries and chose scikit-learn for this project. Scikit-learn is a free machine learning library for the Python programming language. It includes a range of classification algorithms suited to different data and business questions.
The key features for mapping a product such as a smartphone are manufacturer, model name, model number, storage capacity, service provider, etc. Because product descriptions are text strings that range from a single word to a few lines of full description, they require text processing before any classification algorithm is applied.
Scikit-learn provides utilities for text processing, including feature extraction with tokenization and vectorization. Feature extraction converts raw data such as text or images into numeric features in a format that machine learning algorithms support. In the product mapping case, it finds the word or words in the description string that identify a specific product. HYLA’s business expertise and domain knowledge of consumer electronics such as smartphones, tablets, wearables, and accessories play an important role in feature identification and extraction. For example, a bare description of “G930F” (a Samsung model number) is mapped to the product “Galaxy S7”.
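As a rough illustration of the vectorization step, the sketch below turns a few description strings into a sparse feature matrix with scikit-learn's `TfidfVectorizer`. The sample descriptions and the choice of character n-grams are assumptions made for this example, not HYLA's actual data or configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product description strings, invented for illustration.
descriptions = [
    "Samsung Galaxy S7 32GB Black SM-G930F Unlocked",
    "galaxy s7 g930f 32 gb",
    "Apple iPhone 7 128GB Verizon",
    "iphone7 128gb vzw",
]

# Character n-grams inside word boundaries tolerate differences in casing
# and partial tokens such as "SM-G930F" vs "g930f".
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True)
X = vectorizer.fit_transform(descriptions)

# Rows are L2-normalized, so the dot product of two rows is their cosine similarity.
sim_same_product = (X[0] @ X[1].T).toarray()[0, 0]
sim_other_product = (X[0] @ X[2].T).toarray()[0, 0]

print(X.shape)  # (number of descriptions, number of distinct n-grams)
print(sim_same_product, sim_other_product)
```

Two differently worded descriptions of the same device end up closer to each other in feature space than to a description of another device, which is what a downstream classifier relies on.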
Scikit-learn is under active development and offers many classification algorithms, such as support vector machines and random forests. Given the nature of the product mapping data and process, we use the Passive-Aggressive (PA) algorithm.
The PA algorithms are a family of algorithms for large-scale learning. In short, “passive” means that if the classification prediction is correct, the model does nothing; “aggressive” means that if the prediction is incorrect, the model updates its weights by the minimum amount needed to classify the example correctly, as in Figure 1. The PA family is relatively recent and still evolving, and it is used in the tech sector by companies such as Google.
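For readers who prefer a formula, the basic binary PA update is commonly written as follows. Here the label is $y_t \in \{-1, +1\}$, $\mathbf{x}_t$ is the feature vector, and $\mathbf{w}_t$ is the current weight vector; this is the textbook form, and the post does not say which PA variant HYLA uses.

```latex
\begin{aligned}
\ell_t &= \max\bigl(0,\; 1 - y_t\,(\mathbf{w}_t \cdot \mathbf{x}_t)\bigr) \\
\tau_t &= \frac{\ell_t}{\lVert \mathbf{x}_t \rVert^{2}} \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \tau_t\, y_t\, \mathbf{x}_t
\end{aligned}
```

When the loss $\ell_t$ is zero, $\tau_t = 0$ and the weights stay unchanged (the passive case). Otherwise the step is just large enough that the updated weights give a margin of exactly 1 on the current example (the aggressive case). In this form, "correct" means correct with a margin of at least 1, which is slightly stricter than a correct sign alone.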
The PA algorithms use a hinge loss function. They resemble the Perceptron algorithm (1950s) but are more efficient, as in Figure 2*.
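To make the comparison with the Perceptron concrete, here is a minimal NumPy sketch of one update step for each algorithm on the same misclassified example. The function names and the toy numbers are invented for illustration, and the PA step shown is the basic variant without a regularization cap.

```python
import numpy as np

def perceptron_step(w, x, y):
    """Perceptron: add y * x only when the example is misclassified."""
    if y * np.dot(w, x) <= 0:
        return w + y * x
    return w

def pa_step(w, x, y):
    """Basic PA: scale the step by the hinge loss so the margin becomes 1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss
    if loss == 0.0:
        return w  # passive: margin already at least 1
    tau = loss / np.dot(x, x)  # smallest step that removes the loss
    return w + tau * y * x     # aggressive: minimal corrective update

# Toy example: label +1, but the current weights score it negatively.
w = np.array([0.5, -1.0])
x = np.array([2.0, 3.0])
y = 1

w_perc = perceptron_step(w, x, y)
w_pa = pa_step(w, x, y)

print("perceptron margin:", y * np.dot(w_perc, x))
print("PA margin:", y * np.dot(w_pa, x))
print("perceptron step size:", np.linalg.norm(w_perc - w))
print("PA step size:", np.linalg.norm(w_pa - w))
```

Both steps fix the example, but the PA step moves the weights only as far as the hinge loss requires, while the Perceptron step has a fixed size regardless of how wrong the prediction was.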
The classification model with the PA algorithm is implemented with Dask, a Python parallel computing library, which makes online training on large-scale data possible. Together, the PA algorithm and this implementation increase the efficiency and accuracy of the product catalog mapping process.
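The post does not describe how Dask is wired in, so the sketch below shows only the online-learning part, using scikit-learn's `PassiveAggressiveClassifier.partial_fit` in a plain Python loop over batches. The labels, batches, and hyperparameters are hypothetical. In a Dask-based pipeline, data partitions would be fed to the same kind of incremental update, for example through a wrapper such as `dask_ml.wrappers.Incremental`, which is an assumption here rather than something the source confirms.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Hypothetical catalog labels and description batches, invented for illustration.
classes = np.array(["Galaxy S7", "iPhone 7"])
batches = [
    (["Samsung Galaxy S7 32GB SM-G930F", "Apple iPhone 7 128GB"],
     ["Galaxy S7", "iPhone 7"]),
    (["galaxy s7 g930f black", "iphone 7 32gb verizon"],
     ["Galaxy S7", "iPhone 7"]),
]

# HashingVectorizer is stateless, so each batch can be transformed on its own
# without first fitting a vocabulary on the full dataset.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4), n_features=2**18)
model = PassiveAggressiveClassifier(C=1.0, random_state=0)

for texts, labels in batches:
    X_batch = vectorizer.transform(texts)
    # partial_fit updates the existing weights instead of retraining from scratch;
    # the full list of classes must be supplied on the first call.
    model.partial_fit(X_batch, labels, classes=classes)

prediction = model.predict(vectorizer.transform(["Samsung G930F 32GB"]))
print(prediction)
```

Because each batch only nudges the current weights, new daily records can be folded into the model without reloading historical data.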