Possible implementation of an automated, personalized short news fetcher
User Options
- Select a country of interest.
- Select among major categories such as politics, sports, and tech. This acts as a filter to remove unwanted results.
Implementation
News fetch module
- Build a list of RSS feeds of news agencies by country and topic (such lists already exist; they need to be fetched and stored in a database).
- RSS feeds contain the title and a short description of each article, along with a link to it (some contain the complete article too, but those are less suitable for this task).
- Monitor the feeds relevant to the user and keep the news from the last 24 or 48 hours in a buffer.
- Apply Porter stemming and stop-word removal using NLTK.
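The feed-reading and buffering steps above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a definitive implementation: the sample feed, the function names `parse_feed` and `recent_items`, and the assumption that every item carries an RFC 822 `pubDate` are all hypothetical. Stemming and stop-word removal are not shown.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 payload standing in for a real agency feed.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example Agency</title>
  <item>
    <title>Budget passed</title>
    <description>Parliament approved the budget.</description>
    <link>https://example.com/budget</link>
    <pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate>
  </item>
  <item>
    <title>Old story</title>
    <description>Something from last week.</description>
    <link>https://example.com/old</link>
    <pubDate>Mon, 25 Dec 2023 10:00:00 +0000</pubDate>
  </item>
</channel></rss>"""


def parse_feed(xml_text):
    """Extract title, description, link and publication time of each item."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # Assumption: pubDate is present and in RFC 822 format.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


def recent_items(items, now, window_hours=24):
    """Keep only items published within the last `window_hours` hours."""
    cutoff = now - timedelta(hours=window_hours)
    return [it for it in items if it["published"] >= cutoff]


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
buffer = recent_items(parse_feed(SAMPLE_RSS), now, window_hours=24)
print([it["title"] for it in buffer])  # ['Budget passed']
```

Passing `window_hours=48` would switch to the longer of the two windows mentioned above.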
Fetching Twitter trends
- The Twitter trends/location API provides the top 50 trending topics in an area over the last 24 hours; store this list in a buffer.
- Fetch the list again every 30 or 60 minutes and compare it with the buffered copy to track changes.
- Fetch tweets for all trending topics in the user-specified country and store them in a database.
- Break hashtags into lists of words (for hashtags that are camel-cased or hyphen-separated).
- Perform a basic spelling check on the tweets using NLTK and build a frequency table of all misspelled words.
- If a particular misspelled word appears with high frequency for a given topic, it is probably an unconventional word rather than a typo. Replace all misspelled words except these unconventional ones with the NLTK suggestions. It may be worth dropping tweets with too many spelling mistakes.
- Apply Porter stemming and stop-word removal using NLTK.
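Two of the steps above, comparing successive trend lists and splitting hashtags into words, can be illustrated with a short standard-library sketch. The function names `diff_trends` and `split_hashtag` and the sample topics are hypothetical, and the regular expression is only one plausible way to split camel case; it does not handle every hashtag style.

```python
import re


def diff_trends(previous, current):
    """Return (new, dropped) topics between two successive trend fetches."""
    prev_set, cur_set = set(previous), set(current)
    new = [t for t in current if t not in prev_set]
    dropped = [t for t in previous if t not in cur_set]
    return new, dropped


# Matches, in order of preference: an acronym run (capitals not followed by
# a lowercase letter), a capitalised or lowercase word, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_hashtag(tag):
    """Split a camel-cased or hyphen-separated hashtag into lowercase words."""
    body = tag.lstrip("#")
    words = []
    for part in body.split("-"):
        words.extend(m.lower() for m in _WORD.findall(part))
    return words


new, dropped = diff_trends(["#A", "#B", "#C"], ["#B", "#C", "#D"])
print(new, dropped)                      # ['#D'] ['#A']
print(split_hashtag("#WorldCupFinal"))   # ['world', 'cup', 'final']
print(split_hashtag("#USElection2024"))  # ['us', 'election', '2024']
print(split_hashtag("#climate-change"))  # ['climate', 'change']
```

All-lowercase hashtags without separators (for example `#worldcup`) come back as a single word; splitting those would need a dictionary-based approach, which is outside this sketch.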
Finding news that is trending
- Compute doc2vec embeddings for the news articles using gensim (with word2vec trained on Google News data), then cluster the articles on these embeddings. K-means clustering from scikit-learn can be used for this.
- Use doc2vec to compute vectors for the tweets as well.
- Compute the average class similarity between each news cluster and each trending topic.
- If the similarity is above a certain threshold, the tweets and the news cluster are probably about the same topic: add the cluster's news articles to a new buffer of news to be posted and remove them from the watch buffer. Keep the trending topic in the watch buffer so that upcoming news articles relating to it can still be matched.
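The matching step above could look roughly like the sketch below. It reads "average class similarity" as the mean pairwise cosine similarity between a cluster's article vectors and a topic's tweet vectors, which is only one possible interpretation. The toy 2-dimensional vectors stand in for doc2vec output, and the names `average_similarity`, `match_clusters` and the threshold of 0.7 are placeholders, not tuned values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(news_vectors, tweet_vectors):
    """Mean cosine similarity over all (article, tweet) vector pairs."""
    pairs = [(n, t) for n in news_vectors for t in tweet_vectors]
    if not pairs:
        return 0.0
    return sum(cosine(n, t) for n, t in pairs) / len(pairs)


def match_clusters(clusters, topics, threshold=0.7):
    """Return (cluster_id, topic, score) for every pair above the threshold."""
    matches = []
    for cluster_id, news_vectors in clusters.items():
        for topic, tweet_vectors in topics.items():
            score = average_similarity(news_vectors, tweet_vectors)
            if score >= threshold:
                matches.append((cluster_id, topic, score))
    return matches


# Toy vectors; real ones would come from the doc2vec step.
clusters = {0: [[1.0, 0.0], [0.9, 0.1]], 1: [[0.0, 1.0]]}
topics = {"#Budget": [[1.0, 0.1]], "#Cricket": [[0.1, 1.0]]}

matched = match_clusters(clusters, topics)
print([(c, t) for c, t, _ in matched])  # [(0, '#Budget'), (1, '#Cricket')]
```

Articles in the matched clusters would then move to the posting buffer, while the topics stay in the watch buffer for later fetches.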
Topic summarization
- Concatenate all the short descriptions of news articles corresponding to a single cluster to form a larger document.
- Many of the descriptions will be very similar, but their information content may differ; hence, perform text summarization with a TensorFlow model on the document formed for each cluster.
- Post the resulting summary to the corresponding user.
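The concatenation step could be sketched as below, with the summarizer left as a pluggable callable. The names `build_cluster_documents` and `summarize_clusters` are hypothetical, dropping exact duplicate descriptions is an added assumption, and `first_sentence` is only a trivial stand-in used to make the example runnable; it is not the TensorFlow summarization model mentioned above.

```python
from collections import defaultdict


def build_cluster_documents(descriptions, labels):
    """Concatenate the descriptions of each cluster into one document.

    Assumption: exact duplicate descriptions within a cluster are dropped.
    """
    grouped = defaultdict(list)
    for text, label in zip(descriptions, labels):
        text = text.strip()
        if text and text not in grouped[label]:
            grouped[label].append(text)
    return {label: " ".join(texts) for label, texts in grouped.items()}


def summarize_clusters(documents, summarizer):
    """Apply a summarizer callable (str -> str) to every cluster document."""
    return {label: summarizer(doc) for label, doc in documents.items()}


def first_sentence(doc):
    """Placeholder summarizer: returns the first sentence only."""
    return doc.split(". ")[0].rstrip(".") + "."


descriptions = [
    "Parliament approved the budget.",
    "The budget raises health spending.",
    "Parliament approved the budget.",
    "India won the final.",
]
labels = [0, 0, 0, 1]  # cluster id per description, e.g. from k-means

documents = build_cluster_documents(descriptions, labels)
summaries = summarize_clusters(documents, first_sentence)
print(documents[0])  # Parliament approved the budget. The budget raises health spending.
print(summaries[1])  # India won the final.
```

Any callable that maps a document string to a summary string can be passed in place of `first_sentence`.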