WCO News 102 — Issue 3 / 2023
Selecting cargo for inspections and examinations using basic methods has limitations, so, to enhance their enforcement capability, Customs administrations need a paradigm shift in the way they deal with targeting. They need to leverage data science, an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy data, both structured and unstructured. To apply data science tools, you need to have a sound data architecture in place, and in this article we delve into the sophisticated architecture designed by Indian Customs.
Data analytics architecture refers to the design of the different data systems within an organization and the rules that govern how the data are collected and stored. It is foundational to data-processing operations and artificial intelligence (AI) applications. Indian Customs has developed an architecture for automated data analytics that incorporates five interdependent layers, each contributing to a holistic and dynamic approach to Customs targeting and detection. This article highlights the architecture’s big data analytics capabilities for precise targeting.
LAYER 1: DECLARED AND UNDECLARED DATA
The first layer involves the collection, collation (assembly of written information into a standard order or sequence) and cross-validation of available data at the time of importation of goods. This encompasses both declared and undeclared data.
The declared data, which provide the initial data points, comprise, among other things, data submitted in the import declaration and in the import general manifest, and data contained in the invoice, bill of lading and licences. If the documents submitted by importers and carriers are in paper or PDF format, the data are not readily available for automated risk analysis. In such cases, much reliance is placed on optical character recognition (OCR) tools, which convert the available data into machine-readable formats for system-based risk analysis.
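By way of illustration, the sketch below shows how such a conversion might look in Python, using the pdf2image and pytesseract libraries; these tools, and the file name, are assumptions for the example, not the solution actually deployed by Indian Customs.

```python
# Minimal OCR sketch: render a scanned PDF document page by page and extract
# machine-readable text. pdf2image requires poppler; pytesseract requires the
# Tesseract engine. Both library choices are illustrative assumptions.
from pdf2image import convert_from_path
import pytesseract

def extract_text(pdf_path: str) -> str:
    """OCR every page of a scanned PDF into one text string."""
    pages = convert_from_path(pdf_path, dpi=300)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

invoice_text = extract_text("invoice.pdf")  # hypothetical input document
print(invoice_text[:200])
```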
The “undeclared” data are those obtained from Open-Source Intelligence (OSINT) and proprietary data repositories; they form additional sources for fine-tuning the analysis. They also include scanned image data generated by non-intrusive inspection equipment. All data can be traced back to the source systems with an audit trail.
The purpose of this layer is to collate unprocessed data (the primary data collected from the source) related to a specified import declaration. The range of data extends beyond just the individual import to encompass information from numerous comparable imports. This collective data set becomes the input for algorithms or models hosted in subsequent layers, including image analytics, automated classification and score-based generative models. Through input from this layer, these algorithms and models discern unique patterns and anomalies that are relevant to the specified import.
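A hedged sketch of this collation step, with invented file and column names, might look as follows:

```python
# Illustrative Layer 1 collation: cross-validate a declaration against the
# manifest, then gather comparable imports of the same commodity as context.
# File names, column names and the declaration id are assumptions.
import pandas as pd

declarations = pd.read_csv("import_declarations.csv")
manifests = pd.read_csv("import_general_manifest.csv")

# Join declared data to manifest data on a shared document reference
combined = declarations.merge(
    manifests, on="bill_of_lading_no", how="left", suffixes=("_decl", "_igm")
)

# For one specified declaration, collect numerous comparable imports
target = combined[combined["declaration_id"] == "D12345"].iloc[0]
comparables = combined[combined["hs_code"] == target["hs_code"]]
```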
LAYER 2: FEATURE ENGINEERING
Raw data are not suitable for training machine-learning algorithms. Instead, data scientists devote much time to feature engineering: the process of transforming raw data into features (also known as variables or attributes) that are suitable for machine-learning models. Understanding the training data and the targeted problem is an indispensable part of feature engineering, which is not an exclusively manual process: it also relies on unsupervised machine learning (ML), in which algorithms analyse and cluster unlabelled data sets without human intervention, as well as on feature augmentation techniques and econometric models.
Layer 2 is made up of the numerous “themes” that exist in each import transaction: for example, importer, supplier, country of origin and country of shipment. From the exploration and comprehensive understanding of the combined data set in Layer 1, multiple “features” have been created for each of these themes. Thus, Layer 2 has multiple themes and associated features containing important insights derived from raw data.
The primary objective of this layer is to incorporate a wide array of trade aspects captured directly or indirectly within the data. For example, the theme “importer” could include features such as the frequency of imports, the commodity imported and the type of importer. Attributes such as the importer’s geodemographics are also taken into consideration to develop possible risk insights into the importer’s import patterns.
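As a simple illustration, importer-theme features such as import frequency could be derived from the combined Layer 1 data set along the following lines (field names are invented for the example):

```python
# Sketch of feature engineering for the "importer" theme: aggregate the
# combined data set into per-importer features. All column names are assumed.
import pandas as pd

imports = pd.read_csv("combined_layer1_data.csv")  # hypothetical Layer 1 output

importer_features = imports.groupby("importer_id").agg(
    import_frequency=("declaration_id", "count"),   # how often they import
    distinct_commodities=("hs_code", "nunique"),    # spread of commodities
    distinct_suppliers=("supplier_id", "nunique"),  # spread of suppliers
    avg_declared_value=("assessable_value", "mean"),
)
```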
Features can also be created using unsupervised machine learning: the codification of supplier details and item descriptions, enabling efficient subsequent targeting, is a critical derived feature generated by such tools.
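One plausible way to codify free-text item descriptions without labels, offered here purely as a sketch (the article does not name the method used), is to cluster them:

```python
# Unsupervised codification sketch: vectorize item descriptions with TF-IDF
# and cluster them with k-means; the cluster id becomes a derived "item code".
# The cluster count and column names are assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

imports = pd.read_csv("combined_layer1_data.csv")  # hypothetical Layer 1 output
descriptions = imports["item_description"].fillna("")

vectors = TfidfVectorizer(min_df=2).fit_transform(descriptions)
imports["item_code"] = KMeans(n_clusters=50, n_init="auto").fit_predict(vectors)
```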
LAYER 3: TARGET DEVELOPMENT
Smuggling activities are often camouflaged within established trade patterns, yet subtle deviations exist within one or more data points. A robust Layer 3 design effectively flags such suspicious import declarations through “targets”, meticulously crafted to identify these deviations or anomalies. Targets are created by both machines and humans through the analysis of smuggling trends, previous seizures and emerging threats, and are encoded into the system as algorithms operating on the Layer 2 features.
For example, targets generated by the machine flag instances of misclassification and undervaluation. Pattern matching against the offence database in the system enables automatic targeting. Thus, Layer 3 represents a pivotal step, in which various targets are designed to address and automatically flag different types of risk. A comprehensive array of targets ensures that the model is robust.
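An undervaluation target, for instance, might be encoded along these lines (the threshold and all names are purely illustrative):

```python
# Illustrative target: flag declarations whose unit price falls well below
# the historical median for the same HS code. The 50% threshold is assumed.
import pandas as pd

imports = pd.read_csv("combined_layer1_data.csv")  # hypothetical Layer 1 output
imports["unit_price"] = imports["assessable_value"] / imports["quantity"]

median_price = imports.groupby("hs_code")["unit_price"].transform("median")
imports["undervaluation_target"] = imports["unit_price"] < 0.5 * median_price
```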
LAYER 4: DECISION LAYER – INSIGHTS GENERATION
Layer 4 combines the machine-generated risk insights and targets to identify risky import consignments precisely, and presents the results in a “dashboard”. Taking the targets from the previous layer as inputs, the dashboard is driven by a machine-learning decision model whose output highlights risky shipments.
The risk insights form the outcome of the analysis of the Layer 1 data set in combination with data processing in Layers 2 and 3.
The insights are generated automatically by the machine against every targeted import declaration through network associations. They highlight potential risk factors, such as patterns showing the importer “hopping” between ports, commodities, suppliers or Customs brokers, importer-supplier correlations and supplier risk analysis. Based on the analysis in this layer, a decision is taken on whether the cargo is to be interdicted. The results are then fed into Layer 5.
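A minimal sketch of such a decision model, assuming the target flags and features have been assembled into one numeric table labelled with past offence outcomes, might read:

```python
# Layer 4 sketch: train a classifier on target flags and features, labelled
# with confirmed offence outcomes, and score declarations. All file and column
# names, the model choice and the 0.8 cut-off are assumptions for the example.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

data = pd.read_csv("targets_and_features.csv")         # numeric features + flags
labels = pd.read_csv("offence_labels.csv")["offence"]  # 1 = confirmed offence

model = GradientBoostingClassifier().fit(data, labels)

# In practice, newly arrived declarations would be scored, not the training set
data["risk_score"] = model.predict_proba(data)[:, 1]
risky = data[data["risk_score"] > 0.8]  # candidates for interdiction
```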
LAYER 5: ALERT GENERATION
The final layer distils the analysis, insights and findings into easily comprehensible, human-readable reports. Customs officers gain a comprehensive overview of the anomalies and patterns detected, enabling swift decisions and action. The readability of the reports generated by the system is crucial: a well-designed and easily comprehensible report serves as a bridge between the analytical insights derived through the architecture and the Customs officers in the ports, who rely on these insights to take decisive action.
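A toy example of such a report, with entirely invented fields and layout, could be as simple as:

```python
# Alert-formatting sketch: turn one flagged declaration into a short,
# human-readable report. Every field shown here is invented for illustration.
def alert_report(hit: dict) -> str:
    return (
        f"ALERT: declaration {hit['declaration_id']} "
        f"(risk score {hit['risk_score']:.2f})\n"
        f"  Importer : {hit['importer_id']}\n"
        f"  HS code  : {hit['hs_code']}\n"
        f"  Triggered: {', '.join(hit['targets'])}"
    )

print(alert_report({"declaration_id": "D12345", "risk_score": 0.91,
                    "importer_id": "IMP-001", "hs_code": "0902",
                    "targets": ["undervaluation", "port hopping"]}))
```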
THE DYNAMIC FEEDBACK LOOP: REFINING TARGETING STRATEGIES
A distinctive aspect of this architecture is its ability to respond dynamically through the incorporation of an iterative feedback loop. This iterative process ensures that the analytical model evolves continuously, adapting to new smuggling trends. Data related to import declarations that are labelled as “offences” and stored in the offence database are fed back into Layers 2, 3, 4 and 5. All offences are included in the database, whether they were identified through the targeting model or by other means.
For instance, at Layer 2, successful detections of smuggling, as well as false positives, provide valuable insights into the features that are most indicative of potential smuggling. The feedback serves to re-evaluate the range of features incorporated into the analysis and potentially introduces new features that have proven more effective in accurately identifying illegitimate trade. This, in turn, helps in the reconstruction of the expert targets in Layer 3, which subsequently refines decision-making in Layer 4 and improves the precision of the human-readable reports of Layer 5.
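In code, the retraining half of this loop might be sketched as follows (file names, the label column and the model are assumptions carried over from the earlier sketches):

```python
# Feedback-loop sketch: append newly confirmed outcomes (offences and false
# positives alike) to the training set and refit the decision model.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

history = pd.read_csv("training_set.csv")   # past features + "offence" label
feedback = pd.read_csv("new_outcomes.csv")  # latest confirmed results

history = pd.concat([history, feedback], ignore_index=True)
model = GradientBoostingClassifier().fit(
    history.drop(columns="offence"), history["offence"]
)
```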
PERFORMANCE
This data science architecture, integrating advanced data analytics with human intelligence, has proved effective and efficient in detecting the smuggling of contraband and of prohibited and restricted items across the border. The model enabled the detection of about 3,000 kg of heroin smuggled from Afghanistan in a maritime import consignment at Mundra port, Gujarat. Other successful detections include 7.2 million sticks of foreign-brand cigarettes found concealed in an import consignment at Nhava Sheva Port, and poppy seeds concealed in a consignment that had arrived at Chennai port, to name but a few.
The model has demonstrated its ability to significantly enhance Customs administrations’ capacity to combat smuggling. A multi-layered architecture and a continuous feedback loop ensure that the model remains dynamic and relevant. As the threat landscape evolves, this model holds the potential to redefine the fight against illicit trade, thereby safeguarding economies and protecting societies from the menace of smuggling.
More information
sruti.vijayakumar@gov.in
Mramesh.irs@gov.in