Social Media Information extraction using NLP

Shivani Metangale
7 min readJan 15, 2022

Introduction:

Social media is becoming more popular day by day. Facebook alone is used by nearly half of the world's population. The rapid advancement of information technology over the last 20 to 30 years has greatly increased the quantity of information accessible on the internet and introduced new ways of transferring and sharing information between people on different social media platforms.

Social media gives people a way to develop, share, and exchange information and ideas in virtual communities. It often carries extra knowledge beyond other resources on the internet, such as online news. To utilize this huge sum of data, it is important to extract structured information from this huge mass of unsorted content. Information Extraction (IE) is the field of technology that allows us to utilize such a huge amount of unstructured information in a structured way. Extraction techniques analyze human-language text in such a way that information about various types of events and entities can be pulled out. A knowledge base (KB) can store this structured data, holding the facts and relations extracted from the free-form text present on social media. A knowledge base is an information store that provides a way to collect, organize, share, search, and utilize information; it can be encoded in machine-readable form or be readable by humans.


Challenges:

1) There are many social platforms, and new ones are emerging every day. Each platform places certain restrictions on the length of a post's text, so users try to shorten their text in order to form meaningful sentences in fewer words. This adds to the complexity of extracting information properly.

2) The different platforms offer nearly the same type of posting features: the post text, location tagging, images, view counts, trending hashtags, etc. But users often prefer to write in short forms instead of typing out full words.

Example:

BFF: “Best Friends Forever”

BRB: “Be Right Back”

BTW: “By the Way”

2nite: “Tonight”

BC: “Because”

DM: “Direct message”

FTW: “For the win”

IDK: “I don’t know”
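Before any other NLP step, such short forms can be normalized back to their full forms with a simple lookup. A minimal sketch in Python, using an illustrative (not exhaustive) dictionary:

```python
# Toy normalizer that expands common social-media abbreviations before
# further NLP processing. The dictionary is a small illustrative sample.
ABBREVIATIONS = {
    "bff": "best friends forever",
    "brb": "be right back",
    "btw": "by the way",
    "2nite": "tonight",
    "bc": "because",
    "dm": "direct message",
    "ftw": "for the win",
    "idk": "i don't know",
}

def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation token with its full form."""
    tokens = text.lower().split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

print(expand_abbreviations("idk btw brb"))
# -> i don't know by the way be right back
```

A real system would also need to handle ambiguity (e.g. "dm" as a verb vs. a noun), but a dictionary pass like this already recovers a lot of lost signal.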

3) There is a lot of data available on the internet, and people are often unsure whether a piece of information is correct; some blindly follow it as true. The article below from ‘The Hindu’ newspaper best depicts the question of the authenticity of information on social media.

Source: https://www.thehindu.com/opinion/op-ed/are-social-media-platforms-the-arbiters-of-truth/article31750653.ece

4) Because of poor educational infrastructure, a lot of people are illiterate (more than 9 states in India have a literacy rate below 75%), so it is difficult to expect posts written in formal language.

What are the best ways to overcome these obstacles?

Information extraction from social media presents a number of difficulties. These obstacles, however, are surmountable, and there are a few options for dealing with them. Before going into detail about the proposed framework, it is necessary to go over the solution’s most critical components.

1. Noisy text filtering

Social media is a massive platform with a growing number of users every day. Twitter is one such site, with over 140 million tweets sent every day by millions of users around the world, and the numbers continue to rise day after day. We must filter out the non-informative posts in order to obtain quality content. This can be accomplished by removing posts based on language, domain, relevancy, and a variety of other criteria, so that the feed contains only posts that are relevant and informative. For example, if we want to obtain correct cricket match results from tweets, we must first filter away those with irrelevant material. This way, a subset of appropriate posts can be separated from the non-relevant ones.
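As a sketch of this filtering step, the cricket example above could start with a keyword-and-length filter like the one below. The keyword list and thresholds are illustrative assumptions, not a production rule set:

```python
# A minimal relevance filter: keep only posts that mention at least one
# domain keyword, and drop very short or link-only posts.
DOMAIN_KEYWORDS = {"cricket", "match", "wicket", "wickets", "innings", "score"}

def is_informative(post: str, min_words: int = 3) -> bool:
    words = [w.lower().strip(".,!#") for w in post.split()]
    if len(words) < min_words:
        return False                      # too short to carry information
    if all(w.startswith("http") for w in words):
        return False                      # link-only post
    return any(w in DOMAIN_KEYWORDS for w in words)

posts = [
    "India won the match by 5 wickets!",
    "lol",
    "Check out my new blog http://example.com",
]
relevant = [p for p in posts if is_informative(p)]
print(relevant)
# -> ['India won the match by 5 wickets!']
```

In practice this stage would also include language detection and spam classifiers, but even simple rules cut the noise dramatically.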

2. Named Entity Extraction:

People nowadays have developed the habit of using short forms; to be more specific, people are quite fond of WhatsApp and Instagram language. As a result, we require new named entity extraction methods that do not rely on POS tags or syntactic cues like capitalization. Existing approaches to named entity recognition suffer from data sparsity issues, making them unsuitable for this task.
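One capitalization-agnostic option is a gazetteer (dictionary) lookup: match lowercased tokens against lists of known entities instead of trusting casing or POS tags. The gazetteers below are tiny illustrative samples:

```python
# Gazetteer-based entity spotting: longest-first, case-insensitive
# dictionary matching, so "virat kohli" is found even without capitals.
GAZETTEERS = {
    "PERSON": {"virat kohli", "amitabh bachchan"},
    "LOCATION": {"mumbai", "california"},
}

def spot_entities(text: str):
    """Return (span, type) pairs found by longest-first dictionary match."""
    words = text.lower().split()
    found = []
    i = 0
    while i < len(words):
        for n in (2, 1):  # try bigrams before unigrams
            span = " ".join(words[i:i + n])
            etype = next((t for t, g in GAZETTEERS.items() if span in g), None)
            if etype:
                found.append((span, etype))
                i += n
                break
        else:
            i += 1
    return found

print(spot_entities("virat kohli scored a century in mumbai"))
# -> [('virat kohli', 'PERSON'), ('mumbai', 'LOCATION')]
```

Real systems combine such lookups with learned models, but the dictionary match sidesteps the missing-capitalization problem entirely.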

3. Named entity disambiguation:

The task of establishing the identity of entities mentioned in text is known as named entity disambiguation (NED) or entity linking in natural language processing. For example, linking the word “California” to the Wikipedia entry http://en.wikipedia.org/wiki/California. Named entity extraction is different from named entity disambiguation: extraction finds the mentions in text, while disambiguation decides which real-world entity each mention refers to.
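A toy version of this linking step scores each candidate knowledge-base entry by how many words its description shares with the mention's context. The candidate entries and descriptions below are made up for illustration:

```python
# Toy named-entity disambiguation: pick the candidate URL whose
# description overlaps most with the context around the mention.
CANDIDATES = {
    "california": [
        ("http://en.wikipedia.org/wiki/California",
         "us state pacific coast sacramento"),
        ("http://en.wikipedia.org/wiki/California_(band)",
         "punk rock band music album"),
    ],
}

def disambiguate(mention: str, context: str) -> str:
    """Choose the candidate sharing the most words with the context."""
    ctx = set(context.lower().split())
    best_url, _ = max(
        CANDIDATES[mention.lower()],
        key=lambda cand: len(ctx & set(cand[1].split())),
    )
    return best_url

print(disambiguate("California",
                   "moving to the state of california on the pacific coast"))
# -> http://en.wikipedia.org/wiki/California
```

Production entity linkers use entity popularity and embedding similarity rather than raw word overlap, but the structure of the decision is the same.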

4. Fact Extraction

The fact extraction (FE) module in open IE is responsible for detecting and characterising semantic relationships between entities in text, as well as relationships between entities and values. In closed-domain IE, the purpose is to use the extracted named entities to fill in a predetermined template.
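The simplest form of fact extraction is pattern matching over "subject, relation, object" templates. The two patterns below are illustrative; real open-IE systems learn patterns or use dependency parses:

```python
import re

# Minimal pattern-based fact extractor producing (subject, relation,
# object) triples from fixed surface patterns.
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) was born in (\w[\w ]*)"), "born_in"),
    (re.compile(r"(\w[\w ]*?) plays for (\w[\w ]*)"), "plays_for"),
]

def extract_facts(sentence: str):
    """Match each relation pattern and collect the resulting triples."""
    triples = []
    for pattern, relation in PATTERNS:
        for subj, obj in pattern.findall(sentence):
            triples.append((subj.strip(), relation, obj.strip()))
    return triples

print(extract_facts("Virat Kohli plays for India"))
# -> [('Virat Kohli', 'plays_for', 'India')]
```

Triples in this shape are exactly what gets written into the knowledge base described in the introduction.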

5. Feedback Extraction

The NED (named entity disambiguation) and FE (fact extraction) modules form a feedback loop: facts extracted downstream help correct earlier disambiguation errors.

Natural language processing models:

1) Cluster and labelling model: When a user talks about a favourite personality on social media, such as Amitabh Bachchan or Virat Kohli, people will also appreciate exploring similar personalities like Deepika Padukone or MS Dhoni. In that case we need to do clustering and labelling:

So, the proposed framework will be like:

● Need to define what to extract from the large volume of messages.

● A model/framework to classify the words and find out their semantic meaning.

● Forming the clusters and assigning them a label.
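The three steps above can be sketched end to end. Here each personality is represented by a small hand-made feature set (standing in for learned semantics), clusters are formed greedily by Jaccard similarity, and each cluster is labelled with the features all its members share. All data and thresholds are illustrative:

```python
# Toy cluster-and-label step: group entities whose feature sets are
# similar, then label each cluster by the features shared by all members.
FEATURES = {
    "amitabh bachchan": {"actor", "film", "bollywood"},
    "deepika padukone": {"actor", "film", "bollywood"},
    "virat kohli": {"cricket", "sport", "captain"},
    "ms dhoni": {"cricket", "sport", "captain"},
}

def jaccard(a: set, b: set) -> float:
    """Similarity = size of intersection over size of union."""
    return len(a & b) / len(a | b)

def cluster_and_label(features: dict, threshold: float = 0.5) -> dict:
    clusters = []  # each: {"members": [names], "sets": [feature sets]}
    for name, feats in features.items():
        for c in clusters:
            if jaccard(feats, set.union(*c["sets"])) >= threshold:
                c["members"].append(name)
                c["sets"].append(feats)
                break
        else:  # no similar cluster found: start a new one
            clusters.append({"members": [name], "sets": [feats]})
    return {
        ",".join(sorted(set.intersection(*c["sets"]))): c["members"]
        for c in clusters
    }

print(cluster_and_label(FEATURES))
```

With real data the feature sets would come from word embeddings or co-occurrence statistics, and a proper algorithm (k-means, agglomerative clustering) would replace the greedy pass.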

2) Chunking: This is a process in which parts of speech are grouped into phrases. In school we studied the 8 parts of speech, i.e. noun, pronoun, adjective, verb, adverb, preposition, interjection, and conjunction. Chunking operates on tokens and does not consider white space. It works in conjunction with POS tagging: it accepts POS tags as input and outputs chunks.

For example:

Source: https://towardsdatascience.com/chunking-in-nlp-decoded-b4a71b2b4e24

The above figure shows how the sentences are divided into chunks or noun phrases.

Source: https://www.analyticsvidhya.com/blog/2021/10/what-is-chunking-in-natural-language-processing/
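The "POS tags in, chunks out" pipeline can be shown with a minimal noun-phrase chunker over already-tagged tokens. The grammar is the classic optional-determiner, adjectives, then noun pattern; the tagged sentence is made up for illustration:

```python
# Minimal noun-phrase chunker: accumulate runs of determiner/adjective/
# noun tags and emit each run containing at least one noun as a chunk.
NP_TAGS = {"DT", "JJ", "NN", "NNS"}

def np_chunk(tagged):
    """Take (word, POS) pairs; return noun-phrase chunks as strings."""
    chunks, current = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            if any(t in ("NN", "NNS") for _, t in current):
                chunks.append(" ".join(w for w, _ in current))
            current = []
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(np_chunk(sentence))
# -> ['the little dog', 'the cat']
```

Libraries such as NLTK provide the same idea via a regex grammar over POS tags (`RegexpParser`), with the tags supplied by an upstream tagger.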

3) Text summarization: Text summarization is another application of NLP, so we can use it for information extraction. Extraction and abstraction are the two most used strategies for automated summarization. Extraction refers to selecting a subset of existing words, phrases, or sentences from the original text to form the summary. Abstraction, on the other hand, builds an internal semantic representation and then uses natural language generation techniques to produce the summary. In the extractive approach a matrix is used to assign probabilities to frequently used words, and based on this the model builds up words, then phrases, then sentences.
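The frequency-driven extractive strategy can be sketched in a few lines: score each sentence by how frequent its words are across the whole text and keep the top scorer. The stopword list is a small illustrative sample:

```python
from collections import Counter

# Frequency-based extractive summarizer: a sentence scores the sum of
# its words' corpus frequencies; the highest-scoring sentences survive.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def summarize(text: str, n_sentences: int = 1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = [w.lower() for s in sentences for w in s.split()
             if w.lower() not in STOPWORDS]
    freq = Counter(words)  # missing words score 0 automatically
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w.lower()] for w in s.split()),
        reverse=True,
    )
    return scored[:n_sentences]

text = ("Cricket is popular in India. "
        "India loves cricket matches. "
        "The weather was cloudy.")
print(summarize(text))
# -> ['India loves cricket matches']
```

The sentence about the weather scores lowest because its words appear nowhere else, which is exactly the intuition behind frequency-based extraction.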

4) Sentiment analysis: This is the most popular application among developers. In this marketing era, everyone focuses on getting the attention of customers quickly, so it is necessary to have a tool which not only extracts information but also classifies it. Here a bag of words is selected for each sentiment; then, using NLP analysis, the probability of each word occurring in the sentence is used to predict the sentiment, and on this basis the sentiment analysis is carried out.
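In its simplest form, the bag-of-words approach is a lexicon count: tally positive and negative words and classify by the sign of the difference. The lexicons below are tiny illustrative samples:

```python
# Bag-of-words sentiment scorer: count lexicon hits in the post and
# classify by the sign of (positive hits - negative hits).
POSITIVE = {"good", "great", "love", "excellent", "happy", "win"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad", "lose"}

def sentiment(text: str) -> str:
    words = [w.lower().strip(".,!?") for w in text.split()]
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product, it is excellent!"))
# -> positive
```

A statistical classifier (e.g. naive Bayes over word probabilities, as the paragraph above suggests) replaces the fixed lexicons with probabilities learned from labelled posts, but the bag-of-words representation stays the same.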

Conclusion:

From this blog we saw that as IoT evolves and network connectivity increases day by day, more and more users are drawn to share their thoughts, emotions, etc. on social media platforms. This makes it important to extract information from social media, so that we can interpret it and understand the changing thoughts of a changing demographic. We discussed how natural language processing can help with this: NLP lets us extract such information, but it also has ideal conditions under which it works well, and outside of those it gets trapped in the ambiguity of words. To overcome these challenges, we discussed a framework that can be used for extracting data using NLP.

It was said that, “an ostrich, when faced with danger, would bend over and push its head down into the sand, so it couldn’t see the danger.”

But we cannot do similar things in this technologically advanced and dynamic world.

Sameeran Pandey

Madhusudan Shinde

Ashish Pawar

Prithviraj Chauhan
