Performance analysis of the NER model centered on privacy

Authors:
(1) Anthi Papadopoulou, Language Technology Group, Oslo University, Gaustadalleen 23B, 0373 Oslo, Norway and the corresponding author ([email protected]);
(2) Pierre Leson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
4)
(5) Eldiko Pilan, Language Technology Group, Oslo University, Gaustadalleen 23B, 0373 Oslo, Norway.
Links table
Abstract and 1 introduction
2 background
2.1 Definitions
2.2 NLP approach
2.3 Publish privacy retaining data
2.4 Conversion Privacy
3 data sets and 3.1 standards of non -identification of text (tab)
3.2 CVs Wikipedia
4 The entity’s identifier for privacy
4.1 The characteristics of Wikidata
4.2 Silver Corpus and Model Tuning
4.3 Evaluation
4.4 Unlike the poster
4.5 Identity of the semantic type
5 indicators of privacy risk
5.1 LLM Possibilities
5.2 Extension classification
5.3 Disorders
5.4 Signs Sequence and 5.5 Search on the Internet
6 Analysis of the indicators of privacy and 6.1 evaluation measures
6.2 Experimental results and 6.3 discussions
6.4 sets of risk indicators
7 conclusions and future work
advertisements
Reference
Pursuit
A. Human characteristics from Wikidata
for. Entity recognition training parameters
C. The name agreement
Llm: basic models
E. Training and performance volume
Wow thresholds of turmoil
4.3 Evaluation
The evaluation results are displayed for the provider of identification of the entity directed to privacy in Table 3. The total degrees of accuracy and summons and F1 are provided in two versions: one where we take the types of stickers in mind and one does not do it. In the latter case, only look if the distinctive symbol was distinguished as PII or not[6]. The evaluation is conducted on the TAB test sets and the Wikipedia Group for CV. Since the documents of these test sets were explained by many broadcasters, the results are calculated using the exact average of all the meals.
We note that there is a clear difference between the performance with and without the matching of the stickers. This indicates the opponent of the poster between the model and the golden clarification comments, especially for categories such as Org and LOC. This does not affect, however, the discovery of the text extends. We also note that the model works better for the Wikipedia biography than Tab Corpus, indicating that Pii in Wikipedia is easier to discover on the tab.
Destroying results based on the semantic type, we note that the symbol, person, functional materials and dateteime are constantly presenting strong results on the two data collections. This is likely that these groups are easier to identify and identify surface signals. In contrast, it is difficult to discover Dem and MISC, as well as the quantity, as it includes those that include a larger range of possible distances, many unnamed entities.
In the elaborate group of Wikipedia’s CVs, the DeM -extending length is 9 symbols, with an average of 1.3 symbols, while the maximum number of symbols is 42, with an average of 2.3 symbols. For the tab, the maximum number of symbols is much higher. For Dem, he is 24 years old, while it is 785 (long version), with an average of 1.6 and 4, respectively. In the silver collection, stickers outputs are of a DEM and MISC model with a maximum of 11 and 26 symbols, respectively, while they are 1.1 symbols, as they are closer to Wikipedia standards than TAB standards.
Examples
Below we offer an example of text extensions from twoblack Line) with the disclosure of the NER General Model (red Line) and identify the entity directed to privacy (Dark green line).
In the above example, discovered the identification identifier of privacy, the demographic feature structural engineerCreating with human illustrative comments, while it was ignored by the NER standard.
We note in the above example that the identification of the entity -oriented entity is also called battery and theft As the reasons for conviction (MISC) as well as prison (MISC).
4.4 Unlike the poster
As mentioned in section 4.2, Wikidata properties have been used to suspend the types of entities in the text to train the entity’s identification form. Wikidata pages are created either when creating a corresponding Wikipedia essay or is manually prepared by the human editors. Both make Wikidata pages vulnerable to potential accuracy, which is something we faced while making an error analysis on the form of the form when applied to data collections.
Two cases can be distinguished when there is a dispute between experts and the outputs of the form: either (1) both the system and the golden signs can be considered correct or (2) committed the system wrong, either as a positive or wrong negative. Below we discuss some of these differences.
The spouses and Org are often confused with each other, which is a common mistake in many NER systems as well. In example 5 below, the word underwrings was classified as LOC by the broadcaster but the expected poster was Org.
In example 6, the touch of the line was classified as a quantity by the contestant, and the period may be explained as a period of time. On the other hand, the identifier is described as the same range as MISC due to properties such as penalty or The cause of death That was used to form the official gazette. Likewise, in examples 7 and 8, human broadcasters decided to put a sign of these periods as Dem, and displaying the addresses of function and diagnosis are demographic features. The identifier has discovered the entity of both periods and described it as MICC, since the relevant medical properties or The cause of death And the characteristics that contain words related to the function and which the names have not been used in the Official Gazette.
Figure 4 in the Apiter C shows a detailed collapse of all pairs of sticker disputes, common between the two data collections. Typical miscarriage is often found in many cases of sticker disagreements. In the next section, make an effort to analyze this semantic type with more details.
4.5 Identity of the semantic type
The MISC category was defined in Section 3 as a kind of personal information that cannot be set in any of the other categories, that is, the person, Org, LOC, symbol, dateteime, quantity or Dem. We conducted a qualitative analysis of this category in both Tab Corpus and Wikipedia CVs to understand this type PII type better.
We have found that most MISC periods can be set to the next six sub -categories:
Events: Four men, my husband is mixed, World War II
She quoted: “The need for more exploration to see you in, and the responsibility of the crime of the index and the clear lack of sympathy towards the victim.”
Health: Packing of lead to the face, multiple sclerosis, blood poisoning
Artworks: ((Only in WikipediaStar Trek: The original series, the book and the Brotherhood
Crimes: (Only on the tabDeciding tax assets, seriously, the charge of attempting to kill
Laws: (Only on the tab) Sections 1 and 15 of the theft Law of 1968, Article 125 of the Criminal Code, 19 § 4
Some of these groups are completely general and can be found in a large number of texts, such as events or health. Others are for a field of text, such as crimes, which are likely to be observed in court cases.
Finally, there were also examples of MISC that could not be assembled in the 6 categories above. These include, for example, different types of objects or words related to the occupation, or anything else that is difficult to appoint for a specific group:
Item: Golden metal, weekly newspapers, car, motorcycle, police vehicle
Related profession: science policy, basketball skills, kiss, and coach
Another: empowering the backward layers, responsibility for children’s poverty, child care, and deliberate killing [assassinat] And in the alternative with killing [meurtre]
The above analysis is restricted to the two data collections used in this paper. Of course, it is permissible, of course, to enrich the entity directed to privacy with other sub -categories. Pillaan and others also indicated. (2022), MISC was also the PII category that the most difficult human broadcasters found. Despite this difficulty, MISC periods are still an important part of the purification of the text, as many of these pipes provide detailed information about the individual concerned.
[6] This differs from a previous evaluation in Babadubulo and others. (2022) where only a sub -set of stickers such as Org, LOC or MIC and the quantity are considered to be seen.